Data Cleanup, the heart of Machine Learning.

It's going to be a short blog post this week. I intended to begin analyzing the portmanteau dataset from last week, but my degree has been demanding a lot of my energy and I haven't gotten around to it yet. Don't worry though, I'm not going to leave you empty-handed.

Ask any data mining or machine learning professional and they will tell you: most of the work in good machine learning is not building an algorithm. In fact, most people I know don't bother building their own algorithms; they use pre-built implementations for any one of a myriad of good reasons. No, the real work of machine learning is in preparing the data to be fed into an algorithm.

There are two core steps needed to prepare data for a machine learning algorithm: data cleanup and feature extraction. Broadly, these are the processes that take messy, unformatted data and turn it into clean, parameterized data points for an algorithm to learn from. Data cleanup is the first step, in which whatever data you have is made cleaner. This can mean different things for different projects, but it generally involves removing unneeded information, separating data points that a format may have combined, and sanitizing what remains. Raw data can contain nulls, NaNs, Nones, and other such "error codes"; it can be formatted differently than your algorithm requires, demanding re-encoding; and it can come from users who may intentionally or accidentally have input bad data. Once the data is "clean" and at least somewhat trustworthy, you move on to the second step, feature extraction.
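To make that concrete, here is a minimal sketch of what this kind of cleanup might look like in Python with pandas. The file names, column names, and "error code" strings are hypothetical stand-ins for whatever your raw data actually contains.

```python
import pandas as pd

# Hypothetical raw file; the names and error-code strings below are
# placeholders, not a real dataset.
raw = pd.read_csv("raw_data.csv")

# Treat the various error codes (null, None, empty strings) uniformly
# as missing values.
raw = raw.replace({"null": pd.NA, "None": pd.NA, "": pd.NA})

# Drop rows missing the fields we actually need downstream.
clean = raw.dropna(subset=["word", "source_words"]).copy()

# Sanitize user-entered text: strip whitespace and normalize case.
clean["word"] = clean["word"].str.strip().str.lower()

clean.to_csv("clean_data.csv", index=False)
```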

It is rare that data is provided to you with the exact features you need to perform high-quality machine learning. Much more commonly, you will need to use your cleaned data to compute the feature values that go into your model. This is especially true when you aren't working with numeric information. Few algorithms operate directly on non-numeric data; far more often you need ways of turning your non-numeric data into numbers. Even when your data is fundamentally numeric (transactions containing IDs and amounts, for example), it will often need processing before its meaning makes sense to a learning algorithm.
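As a rough illustration, here is one way to turn words (non-numeric data) into a handful of numeric features in Python. The particular features here are made up for the example; they aren't necessarily the ones I'll end up using for the portmanteau project.

```python
# Illustrative only: turn a single word into a dict of numeric features.
def extract_features(word):
    return {
        "length": len(word),
        "vowel_count": sum(ch in "aeiou" for ch in word),
        "ends_in_vowel": int(bool(word) and word[-1] in "aeiou"),
    }

features = [extract_features(w) for w in ["brunch", "smog", "motel"]]
```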

These two steps, data cleanup and feature extraction, are frequently hard to separate. When coding, all you really want is a step that goes from raw data to algorithm inputs, and the cleanup you do is frequently defined by the needs of the feature extraction. For example, if you are processing blog posts and counting word frequencies, you might need to remove punctuation ("cleanup" and "cleanup." are essentially the same word). However, if all you care about is a total word count, punctuation doesn't really matter and can be skipped, as the sketch below shows.
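A tiny sketch of that trade-off: the punctuation handling only matters once the feature depends on identifying individual words. The sample text and regular expression are just for illustration.

```python
import re
from collections import Counter

text = "First do cleanup, then do cleanup again."

# For a plain word-count feature, punctuation can be ignored entirely.
word_count = len(text.split())  # 7

# For per-word frequency features, "cleanup," and "cleanup" need to collapse
# to the same token, so punctuation is stripped as part of cleanup.
tokens = re.findall(r"[a-z]+", text.lower())
frequencies = Counter(tokens)   # {"do": 2, "cleanup": 2, ...}
```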

You might be asking yourself, "what does this have to do with portmanteaus?" Well, in the last blog post I put up a CSV file containing portmanteaus. While I did do some basic data cleanup, it turns out that I didn't do enough. I was using a regular expression to handle most of my cleanup, and it turns out that that, plus manual review, missed a few things. I've updated the dataset, still available here. I've cleaned up a small number of words that got incomplete or incorrect cleanup last pass; many of these still had explanation words listed, or ended with a comma or period. I also removed all of the words with more than two source words. One word, PHYTOC (Peking, Harvard, Yale, Tsinghua, Oxford, Cambridge), I removed because it is, in my opinion, an initialism, not a portmanteau. The remaining fourteen I removed because I decided that simply isn't enough to train a multi-word algorithm on, at least not before tackling the simpler two-word case. I still have a list of those words that can be used at a later date.
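For the curious, the cleanup pass looked roughly like the following sketch. The file names, column layout, and pattern here are illustrative, not the actual script or schema I used.

```python
import csv
import re

# Hypothetical layout: first column is the portmanteau, the rest are its
# source words.
kept_rows = []
with open("portmanteaus.csv", newline="") as f:
    for row in csv.reader(f):
        word, *sources = row
        # Strip a trailing comma or period left over from the original text.
        word = re.sub(r"[.,]+$", "", word.strip())
        # Keep only portmanteaus built from exactly two source words; the
        # multi-source words go into a separate list for later.
        if len(sources) == 2:
            kept_rows.append([word] + sources)

with open("portmanteaus_clean.csv", "w", newline="") as f:
    csv.writer(f).writerows(kept_rows)
```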

As I said, these cleanups have been made and the dataset file has been updated. I'm hopeful that I won't find any more issues with the cleanup down the line, and that I can start looking into feature extraction.