Wednesday, October 20, 2010

Predicting the English Premier League - Part 1

My current project is to come up with a predictor for the English Premier League. Football + coding in Lisp + modeling/statistics -- throw in a fat man and it's Christmas. Not to mention, if the predictor is any good, a chance to make some money as well.

The bulk of the coding is complete -- the data loading, extracting the various bits from the match statistics, and so on. It's the prediction part that's tricky, as it should be.

The idea is to identify the predictors first. This is the first cut:
  1. The home record for the home team

  2. The away record for the away team

  3. The overall records for both teams (points scored out of the maximum possible so far)

  4. The recent records for both teams (recent == last three matches)

  5. Goal difference for the teams

These variables are not fully independent (goal difference and overall record are correlated, for example), but I'm going to go with them for now. The prediction takes into account only the current season's matches, though ideally it would include all the match stats available. That would let facts like 'Team A has not beaten Team B since 1973' come through and influence the prediction.
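
A rough sketch of how predictors like these could be computed, in Python rather than the Lisp used for the project. The `Match` record, its field names, and the helper functions are all hypothetical, and measuring each "record" as points won out of the maximum possible is an assumption carried over from item 3 to the other items.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions for illustration.
Match = namedtuple("Match", "home away home_goals away_goals")

def points(goals_for, goals_against):
    """League points for one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0

def team_results(team, matches):
    """(was_home, goals_for, goals_against) for each match the team played, in order."""
    results = []
    for m in matches:
        if m.home == team:
            results.append((True, m.home_goals, m.away_goals))
        elif m.away == team:
            results.append((False, m.away_goals, m.home_goals))
    return results

def point_fraction(results):
    """Points won as a fraction of the maximum possible; 0.0 if there are no matches."""
    if not results:
        return 0.0
    return sum(points(gf, ga) for _, gf, ga in results) / (3 * len(results))

def features(home, away, matches, recent=3):
    """Feature vector for an upcoming home-vs-away fixture, from matches played so far."""
    h = team_results(home, matches)
    a = team_results(away, matches)
    return [
        point_fraction([r for r in h if r[0]]),      # 1. home record of the home team
        point_fraction([r for r in a if not r[0]]),  # 2. away record of the away team
        point_fraction(h),                           # 3. overall record, home team
        point_fraction(a),                           #    overall record, away team
        point_fraction(h[-recent:]),                 # 4. recent record, home team
        point_fraction(a[-recent:]),                 #    recent record, away team
        sum(gf - ga for _, gf, ga in h),             # 5. goal difference, home team
        sum(gf - ga for _, gf, ga in a),             #    goal difference, away team
    ]
```

Calling `features("A", "B", matches)` with the list of matches played so far gives one row of training or query data for that fixture.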

In machine learning terms, the problem is a multiclass classification problem, with the classes being 'home team win', 'away team win' and 'draw'; I am not looking at predicting scores, only results -- I assume one can bet on (and make money with) just results.

I turned to the venerable Elements of Statistical Learning to bone up on the theory, but soon beat a hasty retreat -- if you're just looking for overviews of the different techniques/algorithms and how to implement them, this book is not for you. I next turned to Wikipedia and found it somewhat better, though the sections on statistics and machine learning are still too heavy for my liking.

Anyway, I started with a) a linear regression model, with the three classes represented as equal intervals in [0,1] (I hope this is kosher) and b) the k-nearest-neighbours (KNN) algorithm. I wanted to implement both of these methods myself, but the linear regression seemed too much of a diversion, and I settled on using R's glm() function for this instead. KNN proved pretty easy to implement, on the other hand.
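
To make the two ideas concrete, here is a minimal sketch in Python (not the Lisp or R actually used). The first part shows one way to map the three classes onto equal intervals in [0,1] and to turn a regression output back into a class; the ordering away/draw/home and the use of interval midpoints as targets are assumptions. The second part is a plain KNN classifier with Euclidean distance and a majority vote; the distance measure, the default `k`, and all function names are likewise assumptions.

```python
import math
from collections import Counter

# Assumed ordering of the classes along [0, 1].
CLASSES = ["away", "draw", "home"]

def encode(label):
    """Regression target for a class: the midpoint of its interval (1/6, 1/2, 5/6)."""
    return (CLASSES.index(label) + 0.5) / len(CLASSES)

def decode(y):
    """Class whose interval contains y; values outside [0, 1] are clamped to the ends."""
    idx = int(y * len(CLASSES))
    return CLASSES[min(max(idx, 0), len(CLASSES) - 1)]

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train, query, k=3):
    """Majority label among the k training examples nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Since KNN works on raw distances, a feature on a larger scale (goal difference, say, next to fractions in [0,1]) will dominate unless the features are rescaled first.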

The season is only eight match-days old, so there's not a lot of training data. Both methods' predictions are pretty abysmal (accuracy of around 30-40%), and my own predictor, which just looks at the total record and recent form, matches their performance (not to mention the method based on random number generation, which behaves like the fricken Rain Man).
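
For a sense of why 30-40% is abysmal: with three classes, guessing uniformly at random is right about a third of the time on average. A small Python sketch of the comparison, with hypothetical function names and an arbitrary seed and trial count:

```python
import random

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual results."""
    if not actual:
        raise ValueError("no results to score against")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def random_baseline(actual, classes=("home", "draw", "away"), trials=1000, seed=0):
    """Average accuracy of uniform random guessing over a number of trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        guesses = [rng.choice(classes) for _ in actual]
        total += accuracy(guesses, actual)
    return total / trials
```

Any predictor worth betting on has to clear this baseline by a comfortable margin, and ideally also beat the trivial strategy of always picking the most common result.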

The next step is to try out other methods and read up on model selection.

Speaking of using Lisp, I have sort of come to a decision with respect to programming languages: it's going to be Lisp from now on (more particularly, Common Lisp). Yeah, I know about Stroustrup's quote about being fanatic about a single language, and the warm, fuzzy feeling I get when I think about Smalltalk, but Lisp is, to paraphrase Robert Pirsig, the high country of the programming world, what with its purity and elegance, "code is data", and the way you're able to accomplish so much with so little, with code that simply flows. I had considered this question earlier, and had settled in favour of Smalltalk, but I guess you gotta do what you gotta do. Anyway, Smalltalk and Haskell are still going to be in the toolbox. Another thing in favour of Lisp is that I made the effort to get to know the Allegro CL IDE better, and have taken a liking to it -- while the shortcomings I have mentioned earlier are still there, my productivity has definitely improved because of the increased familiarity with the shortcuts and IDE features. Oh, and the fact that I took the time to fully grok the chapter on CLOS in On Lisp and had my mind blown helped, too.

Update: Part 2 is here