Data Science - Sneezing Trees

Here I present some of the self-study mini-projects I’ve set myself in an effort to grasp some data science and machine learning concepts. I’ve written a few papers on them, and there’s even some code to look at if you like. But if you’ve never been here before, you might want to check out a few health warnings for the older projects.

Projects

Air quality prediction application

An end-to-end project to develop and deploy a tool for predicting air quality in Montpellier, France, using machine learning techniques:

Data Sourcing – Identifying and downloading from free sources of pollutant and weather data
Data Processing – Structuring, cleaning and filling gaps in the data
Model Training and Selection – Training a portfolio of different models on the same data to identify the best performer
Update Process – Automatic daily download, cleaning and update of data
Deployment – Web app using Flask, Gunicorn and Docker, deployed on a cloud server
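
The model selection step above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the project’s actual code: the two baseline models, the function names and the readings are all made up for the example, and mean absolute error is an assumed choice of metric.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def persistence_model(train, horizon):
    """Hypothetical baseline: every future value equals the last observed one."""
    return [train[-1]] * horizon


def mean_model(train, horizon):
    """Hypothetical baseline: every future value equals the training mean."""
    return [mean(train)] * horizon


def select_best(models, train, holdout):
    """Score every candidate on the same holdout data; return (best name, all scores)."""
    scores = {
        name: mean_absolute_error(holdout, fit_predict(train, len(holdout)))
        for name, fit_predict in models.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic readings invented purely for illustration (not real pollutant data).
train = [10.0, 12.0, 14.0, 16.0]
holdout = [17.0, 15.0]

candidates = {"persistence": persistence_model, "mean": mean_model}
best, scores = select_best(candidates, train, holdout)
print(best, scores)
```

The point of the design is that every candidate sees exactly the same training and holdout data, so the scores are directly comparable.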

View the application.

Testing methods for analysing big datasets

Since the data science world is all about “Big data”, I decided to look at different ways to manipulate datasets that are too large to be loaded into the memory of my laptop:

The chunksize parameter of pandas read_csv.
The Dask library.
The PySpark API for Apache Spark.
Talend Open Studio.
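
As a rough illustration of the first option in the list, the sketch below sums a column chunk by chunk, so that only a few rows are held in memory at a time. The in-memory CSV, the column names and the `chunked_sum` helper are assumptions made for the example; a real run would pass a file path instead.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a file too large to load in one go.
csv_text = "city,value\nA,1\nB,2\nA,3\nB,4\nA,5\n"


def chunked_sum(source, column, chunksize):
    """Sum one column while reading at most `chunksize` rows at a time."""
    total = 0
    # With chunksize set, read_csv yields DataFrames instead of returning one.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += chunk[column].sum()
    return total


total = chunked_sum(io.StringIO(csv_text), "value", chunksize=2)
print(total)
```

This pattern only works directly for aggregations that can be combined across chunks (sums, counts, minima and so on); something like a median needs more care.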

View the code on GitHub.

Correlation between economic strength and educational success

This project examined the question “Is there a correlation between a country’s economic strength and the success of its education?”.

My answer came from comparing reading proficiency and mathematical proficiency (according to UN definitions) against GDP per capita. Tools used were Google BigQuery, Pandas and Matplotlib.
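
A comparison of this kind is often summarised with a Pearson correlation coefficient. The sketch below computes one by hand; the figures are invented for illustration and are not the project’s results, and whether the project used this particular statistic is an assumption. In Pandas, `Series.corr` computes the same coefficient by default.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Sum of co-deviations divided by the square root of the product of the
    summed squared deviations (the 1/n factors cancel out).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Invented, perfectly linear figures: NOT real GDP or proficiency data.
gdp_per_capita = [1000.0, 2000.0, 3000.0, 4000.0]
reading_proficiency = [40.0, 50.0, 60.0, 70.0]

r = pearson_r(gdp_per_capita, reading_proficiency)
print(r)
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one; none of them says anything about causation.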

View the code on GitHub.

Training a multi-layer perceptron on the MNIST dataset

The MNIST dataset contains 60,000 labelled training images (plus 10,000 test images) of the handwritten digits 0–9. In this project I trained a simple multi-layer perceptron (using PyTorch) to classify the MNIST digits.
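
For orientation, a multi-layer perceptron for 28×28 MNIST images might look something like the sketch below. The layer sizes, the `MLP` class name, the optimiser settings and the random stand-in batch are assumptions for illustration; the project’s actual architecture and training loop are in the linked code.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Hypothetical two-layer perceptron: 784 inputs -> hidden layer -> 10 digit classes."""

    def __init__(self, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                   # (batch, 1, 28, 28) -> (batch, 784)
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10),     # one logit per digit
        )

    def forward(self, x):
        return self.net(x)


model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Random tensors standing in for a batch of MNIST images and labels.
images = torch.randn(4, 1, 28, 28)
labels = torch.randint(0, 10, (4,))

# One training step: forward pass, loss, backpropagation, weight update.
logits = model(images)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a real run the random batch would be replaced by a data loader over the MNIST training set, and the step repeated over many batches.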

View the code on GitHub.

Applying an n-step Sarsa to the cart-pole balance problem

A classic reinforcement learning problem in which the learner tries to keep a pole balanced upright on a moving cart.

The animation above is the actual result from one experiment. Have a read of the PDF paper to get a feel for what’s going on, and if you’re feeling really adventurous you can download the zipped Excel file “Cart pole learner”, check out the code in the macros, and even run the learner if you like.

A couple of things to be aware of:

Everything I learnt about the underlying reinforcement learning principles applied here, and about the n-step Sarsa algorithm, came from the excellent Sutton & Barto book Reinforcement Learning: An Introduction, shown here with my files. If nothing else, the first chapter gives a great intro to what reinforcement learning is all about.
I don’t really give you many clues as to what’s going on in the Excel file, so don’t expect to have much of an idea without some serious commitment. That’s my fault (lack of time). But maybe you’ll be able to get the gist of how the core algorithm is being implemented…
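
For readers who would rather not dig through the macros, the heart of n-step Sarsa as described by Sutton & Barto is the update sketched below. This is a Python illustration of the textbook rule, not a translation of the Excel code; the function names, the dictionary-based Q-table and the example numbers are assumptions.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """n-step return: discounted sum of the next n rewards, plus the discounted
    value estimate Q(S_{t+n}, A_{t+n}) of the state-action pair reached after n steps.

    G = R_1 + gamma*R_2 + ... + gamma^(n-1)*R_n + gamma^n * bootstrap_value
    """
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    g += (gamma ** len(rewards)) * bootstrap_value
    return g


def sarsa_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way towards the target."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


# Made-up numbers: three rewards of 1, discount 0.5, and a value estimate
# of 4.0 for the state-action pair reached after the third step.
q_table = {}
target = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
new_value = sarsa_update(q_table, "s0", "left", target, alpha=0.5)
print(target, new_value)
```

If the episode ends before n steps have elapsed, the bootstrap term is simply left at zero and the return is the discounted sum of the rewards that were actually received.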

Cart-pole balance problem paper

Cart pole learner

Training an automated Tic tac toe player using reinforcement learning

Feel free to download the source code for this project in the zip file. It’s written in Java and exported as a NetBeans project, so you can see all the .java files and run it easily from NetBeans if you like. There’s a Readme in the zip file that should (might) help with making sense of it all.

Tic Tac Toe paper

Tic Tac Toe learner code

Optimum path problem using evolutionary algorithm and CTRNN

Investigating the multi-layer perceptron

Control algorithm for simple robot arm using evolutionary approach

Health warnings

When I first started setting myself these mini-projects, it was never with the goal of presenting them to anyone. As a result, in the older projects (the PDF-only ones) I haven’t always made it easy for you to follow what I was up to, and unfortunately I don’t have the time to fix that. Sometimes my documentation of the approach and results was only:
- A mechanism to help me get things straight in my own head; and
- Intended as an aide memoire for me (albeit a clear and relatively extensive one).
Similarly, in the older projects I never intended anyone else to use the code I wrote, so some of the code here may well be difficult to follow (though some of it should be all good). I simply haven’t had time to document it to the point where others can immediately grasp what’s going on.
Even if I had documented the code in such a way, in some cases it would still be difficult to follow! That is unless you have a reasonably firm grasp of the form of machine learning algorithm that’s being implemented.
There be maths in here. Quite a lot in some cases; reasonably advanced.
