The idea here is to build a machine learning algorithm that predicts the winner of a football match and, from there, a Monte Carlo simulation that infers the odds of winning each knockout game and, subsequently, the probability of each team becoming world champion.
This article will present some graphs and code, but feel free to skip them if you’d like; I’ll try to make it as intuitive as possible.
Most game simulators tend to use a single overall number to represent a team’s performance. Here we are trying a different approach, working not only with the overall rating but also with three other values (Attack, Defense, Midfield) together in a more sophisticated manner, to avoid merely collapsing all features into a single factor that dictates the team’s strength.
The model will be built on top of the scikit-learn library, using Pandas data frames to manipulate data in tables, and Plotly to visualize some interesting features.
Getting The Data
So, the first step in obtaining the data is to make a little crawler and grab information from FIFA Index, a great source of international team stats going back to 2004. Here’s how the tables are laid out on the website:
With tables displayed that way, it’s quite easy to scrape a site like this. For that, I used the Beautiful Soup library to access the HTML code, and the Pandas read_html function to transform it into a simple data frame.
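To make the idea concrete, here is a minimal sketch of that two-step approach. It is not the original crawler: the inline HTML, the column names (Name, ATT, MID, DEF, OVR), and the helper table_to_frame are assumptions made for illustration, and the real page markup may well differ. It also assumes a parser backend for read_html (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<table class="table">
  <thead>
    <tr><th>Name</th><th>ATT</th><th>MID</th><th>DEF</th><th>OVR</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>85</td><td>84</td><td>83</td><td>84</td></tr>
    <tr><td>Team B</td><td>80</td><td>79</td><td>81</td><td>80</td></tr>
  </tbody>
</table>
</body></html>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Locate the first <table> with Beautiful Soup and parse it with pandas."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the page")
    # read_html returns a list of data frames, one per table it finds.
    return pd.read_html(StringIO(str(table)))[0]


ratings = table_to_frame(SAMPLE_HTML)
print(ratings)
```

In a real crawler the HTML string would come from an HTTP request for each page, and the resulting frames would be concatenated into one table.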
I must admit this crawler ended up a bit lazy, and it might let some duplicates into our dataset. No concern, though, since we can drop those dupes with Pandas later on (I’ll also provide some links to raw datasets during this article).
I won’t get into the details of how this scraper was built, but the code will be left below if you’d like to check it out. If you’re interested in more, don’t hesitate to reach out to me.
After tidying up the raw datasets, I saved a cleaner version (using Pickle), and we’re going to use this dataset as a starting point.
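The clean-up and save steps could look roughly like the sketch below. The toy data, the column names, and the file name ratings_clean.pkl are assumptions for illustration; the actual cleaning involved more than dropping duplicates.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy raw data with one exact duplicate row, as a lazy crawler might produce.
raw = pd.DataFrame(
    {
        "team": ["Team A", "Team A", "Team B"],
        "date": ["2018-06-01", "2018-06-01", "2018-06-01"],
        "overall": [84, 84, 80],
    }
)

# Drop rows that are identical across all columns and renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Save the cleaned frame with pickle and load it back.
# A temporary directory is used here so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ratings_clean.pkl"  # hypothetical file name
    clean.to_pickle(path)
    restored = pd.read_pickle(path)

print(f"{len(raw)} raw rows, {len(clean)} after dropping duplicates")
```

Pickle keeps column types intact between sessions, which is convenient here, but pickle files should only be loaded from sources you trust.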
Exploring And Visualizing
This table contains FIFA ratings for many international teams from 2004 to 2018. The original data include the month and day as well, but to keep it simple, I’ve averaged the performance of each team by year, so that we have fewer data points to handle. Let’s check out the teams’ overall scores by year in a scatter chart using Plotly:
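The yearly averaging can be sketched with a Pandas groupby. The column names (team, date, overall) and the values are made up for illustration; the real table also carries the attack, midfield, and defense ratings, which would be averaged the same way.

```python
import pandas as pd

# Toy ratings with several snapshots per team and year.
ratings = pd.DataFrame(
    {
        "team": ["Team A", "Team A", "Team A", "Team B", "Team B"],
        "date": pd.to_datetime(
            ["2017-03-01", "2017-09-01", "2018-03-01", "2017-03-01", "2018-03-01"]
        ),
        "overall": [82, 84, 85, 78, 80],
    }
)

# Extract the year from each snapshot date, then average per team and year.
yearly = (
    ratings.assign(year=ratings["date"].dt.year)
    .groupby(["team", "year"], as_index=False)["overall"]
    .mean()
)

print(yearly)
```

The result has one row per team and year, which is the shape a scatter chart of overall score by year needs.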
Plotly has amazing interactive charts that can display information when you hover over them. This one was not very informative, though. Let’s check the best-performing team for each year in a bar chart and see how that changes over the years. The plot below shows the best-rated side by year and the mean performance of all teams (the white line).
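The two series behind such a chart (the top-rated team per year and the yearly mean across all teams) can be computed as in the sketch below. The data and column names are placeholders, not the real ratings, and the plotting step itself is left out.

```python
import pandas as pd

# Toy yearly ratings, one row per team and year.
yearly = pd.DataFrame(
    {
        "team": ["Team A", "Team B", "Team A", "Team B"],
        "year": [2017, 2017, 2018, 2018],
        "overall": [83.0, 85.0, 86.0, 84.0],
    }
)

# For each year, idxmax gives the row label of the highest overall rating;
# selecting those rows yields the best-rated team per year (the bars).
best = yearly.loc[yearly.groupby("year")["overall"].idxmax()].reset_index(drop=True)

# Mean overall rating of all teams per year (the line drawn over the bars).
mean_by_year = yearly.groupby("year")["overall"].mean()

print(best)
print(mean_by_year)
```

Note that idxmax keeps only the first team when two teams tie for the top rating in a year.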
That’s a helpful chart! Spain has been on top for an extended period. Let’s now open the CSV file containing match results for international teams:
The results dataset was pulled from GitHub and consists of football match results going back to 1872, with teams, scores, and some other information. Let’s clean it up and keep just the features that are useful for our purpose.
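As a rough sketch of that clean-up, the snippet below keeps only matches from the years covered by the ratings, drops the columns we don’t need, and derives the match outcome from the score. The column names, the placeholder rows, the 2004 cutoff as a parameter, and the helper clean_results are all assumptions for illustration; the real file may be organized differently.

```python
import numpy as np
import pandas as pd

# Placeholder rows standing in for the CSV; values are not real results.
results = pd.DataFrame(
    {
        "date": ["1990-06-08", "2010-07-11", "2014-07-08", "2018-06-14"],
        "home_team": ["Team A", "Team A", "Team C", "Team B"],
        "away_team": ["Team B", "Team B", "Team A", "Team C"],
        "home_score": [1, 0, 2, 5],
        "away_score": [0, 1, 2, 0],
        "tournament": ["Cup", "Cup", "Cup", "Cup"],
        "city": ["X", "Y", "Z", "W"],
    }
)

KEEP_COLUMNS = ["date", "home_team", "away_team", "home_score", "away_score"]


def clean_results(frame: pd.DataFrame, first_year: int = 2004) -> pd.DataFrame:
    """Keep recent matches and useful columns, and label each outcome."""
    df = frame.assign(date=pd.to_datetime(frame["date"]))
    # Ratings only start in 2004, so earlier matches cannot be paired with them.
    df = df.loc[df["date"].dt.year >= first_year, KEEP_COLUMNS].copy()
    # Sign of the goal difference: 1 = home win, 0 = draw, -1 = away win.
    goal_diff = df["home_score"] - df["away_score"]
    df["outcome"] = np.sign(goal_diff).map({1: "home", 0: "draw", -1: "away"})
    return df.reset_index(drop=True)


cleaned = clean_results(results)
print(cleaned)
```

Keeping the outcome as a three-way label (home, draw, away) is one option among several; a model for knockout games might treat draws differently.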