As you may have guessed, it was a typo. Originally our title was meant to be The Curse of Dimensionality but, between Levannia’s inaccurate typing and our love of puns, we found ourselves with The Cruise of Dimensionality instead. Sometimes bugs are features.
So we don’t lose this learning moment, let’s start with a discussion of the Curse of Dimensionality. It describes a situation that has become more and more common as machine learning, and particularly unsupervised learning, has gained popularity. (The name is dramatic, I know, but we are data scientists; we need to find our laughs somewhere.)
The problem is simple: the more dimensions, or predictors, in your data, the less useful standard computational and statistical techniques become.
To explain this, let’s take a simple case of data displayed in two dimensions, x and y. This is an image that we are all familiar with: the classic scatterplot. Why don’t we pretend you are a point on this scatterplot? You’re simply sitting at your x and y coordinates, which represent the position of your home.
Now, what happens when you add a third dimension, a z axis? This can be the height that you’re located at. So now you’re sitting at home on, let’s say, the second floor.
What just happened? Yes, we increased the dimension of your location, but you may have noticed that the space we were modelling also grew. We went from you just sitting on a flat scatterplot at your home coordinates to you sitting somewhere inside a volume, on the second floor of your home.
So let’s imagine what happens when you have hundreds or thousands of dimensions. The volume of that modelling space you’re in is increasing quite rapidly. You’re not only sitting at home on the second floor; now we’ve specified the time, the day and many, many other characteristics. It must be getting quite lonely for you. In order for someone to find you, or even to determine the floor that you’re on, they’ll require a lot more data. That’s going to be difficult for them, especially if they only have information about you and someone who lives several blocks away. I’m sorry to say this, but you’ve been cursed. Don’t worry, you’re not alone. There are many classical data sets that are also cursed. You may have heard of the classic Golub gene expression dataset: 38 observations, 3051 measurements/dimensions. It’s definitely cursed.
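If you’d like to see the loneliness for yourself, here is a minimal sketch. It assumes points scattered uniformly at random in a unit hypercube (a made-up setup, not real data), and the function name `mean_nearest_neighbor_distance` is our own invention for illustration. We keep the number of points fixed at 38, the same as the number of observations in the Golub data, and only increase the number of dimensions.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each point to its closest neighbour.

    Assumption: points are drawn uniformly from the unit hypercube
    [0, 1]^n_dims. Real data is rarely this tidy.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]

    total = 0.0
    for i, p in enumerate(points):
        # Euclidean distance to the closest *other* point.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += nearest
    return total / n_points


if __name__ == "__main__":
    # Same number of points each time; only the dimension changes.
    for n_dims in (2, 3, 10, 100, 1000):
        distance = mean_nearest_neighbor_distance(38, n_dims)
        print(f"{n_dims:>5} dimensions: mean nearest-neighbour distance = {distance:.3f}")
```

Run it and the distance to your nearest neighbour should keep climbing as the dimensions pile up, even though nobody has moved out of the neighbourhood.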
Fear not! There are ways to find you and combat the curse. We’ll have to reduce the dimensions that you’re in, which means we’ll have to figure out which are the most important. “Importance” of course might be something that we argue about. But here at the “Cruise of Dimensionality” we’ll discuss some of those techniques and more. We hope you’ll keep reading.
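To make “importance” a little more concrete, here is one deliberately simple (and very arguable) definition, sketched in code: a dimension is important if its values vary a lot. This is a plain variance filter, not the method we are recommending, and everything in it is an assumption for illustration: the helper names `make_toy_data`, `top_variance_dimensions` and `keep_dimensions`, and the toy data, in which we plant two high-variance dimensions among 50.

```python
import random
import statistics


def make_toy_data(n_rows=38, n_dims=50, informative=(3, 7), seed=0):
    """Hypothetical toy data: most dimensions are low-variance noise,
    while the `informative` ones are spread out ten times wider."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0, 10 if j in informative else 1) for j in range(n_dims)]
        for _ in range(n_rows)
    ]


def top_variance_dimensions(rows, k):
    """Return the indices of the k columns with the largest sample variance."""
    columns = list(zip(*rows))
    variances = [statistics.variance(column) for column in columns]
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def keep_dimensions(rows, indices):
    """Drop every column except the ones listed in `indices`."""
    return [[row[j] for j in indices] for row in rows]


if __name__ == "__main__":
    data = make_toy_data()
    chosen = top_variance_dimensions(data, k=2)
    reduced = keep_dimensions(data, chosen)
    print("kept dimensions:", chosen)
    print("shape before:", (len(data), len(data[0])))
    print("shape after: ", (len(reduced), len(reduced[0])))
```

Of course, high variance is only one opinion about what matters. A noisy but useless measurement would sail through this filter, which is exactly the kind of argument we mean.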