Happy Easter! Dark chocolate is by far one of my favourite foods. If there were a food pyramid for our mental health, I would lobby for it to be number one. So for Easter I thought I would share some interesting analysis of my number one comfort food. Who wants the generic brands when you can have the good stuff?
Thankfully, I’m not the only one with a love of chocolate. It’s one of the most popular sweets in the world. According to Euromonitor, Switzerland is by far the largest consumer of chocolate. Every year, the average Swiss person consumes just under 20 lbs. Can you really blame them? They do have year-round access to Toblerone bars. They aren’t just Christmastime stocking stuffers. The Germans and the Irish also have a sweet tooth, consuming 17.4 lbs and 16.3 lbs respectively. Americans come in ninth overall, consuming on average 9.5 lbs each year.
The data comes from the Flavors of Cacao database, a subset of which can also be found on Kaggle, along with some very interesting analysis. The data contains expert ratings of over 1,900 individual chocolate bars, as well as information regarding their regional origin, percentage of cocoa, the variety of cacao bean used and where the beans were grown. For the purposes of our analysis, only chocolate with a single cacao bean origin was considered.
The rating system is described below.
- 5 = Elite (transcending beyond the ordinary limits)
- 4 = Premium (superior flavor development, character and style)
- 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
- 2 = Disappointing (passable but contains at least one significant flaw)
- 1 = Unpleasant (mostly unpalatable)
Looking at this data, and having recently completed several courses in supervised and unsupervised learning, I had hopes of providing you with some robust, analytically sound models. The plan was simple: use clustering techniques to see if the natural clusters in the chocolate ratings would tie back to specific countries. Having eaten my fair share of chocolate, I was hoping to see some of my favourites come up.
Unfortunately, things did not go as planned. After cleaning the data I quickly realized that this wasn’t a simple problem. The data contains both numerical and categorical variables on different scales. Using Gower’s distance, which can handle this mix of variable types, I tried several forms of hierarchical clustering. Complete linkage, shown below, seems to suggest four groups in the data. However, closer inspection reveals that these groups are simply clusters based on cacao percentage. No surprise there.
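For anyone who wants to try the idea at home, here is a minimal Python sketch of Gower's distance followed by complete-linkage clustering. It is not the code behind the analysis above: the four toy rows, the column layout and the helper names `gower_matrix` and `complete_linkage` are all made up for illustration.

```python
from itertools import combinations

# Hypothetical toy rows: (cocoa_percent, rating, bean_origin). Not real data.
bars = [
    (70.0, 3.5, "Ecuador"),
    (72.0, 3.25, "Ecuador"),
    (85.0, 2.75, "Peru"),
    (88.0, 3.0, "Peru"),
]
NUMERIC = (0, 1)      # column indices treated as numeric
CATEGORICAL = (2,)    # column indices treated as categorical


def gower_matrix(rows):
    """Pairwise Gower distance: range-scaled absolute difference for numeric
    columns, 0/1 mismatch for categorical ones, averaged over all columns."""
    ranges = {}
    for c in NUMERIC:
        values = [r[c] for r in rows]
        ranges[c] = (max(values) - min(values)) or 1.0  # guard constant columns
    n_cols = len(NUMERIC) + len(CATEGORICAL)
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        total = sum(abs(rows[i][c] - rows[j][c]) / ranges[c] for c in NUMERIC)
        total += sum(rows[i][c] != rows[j][c] for c in CATEGORICAL)
        dist[i][j] = dist[j][i] = total / n_cols
    return dist


def complete_linkage(dist, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(
                dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


dist = gower_matrix(bars)
groups = complete_linkage(dist, n_clusters=2)
print(groups)
```

Even in this toy example the two groups split along cocoa percentage (and origin, which moves with it here), which is the same trap the real clusters fell into.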
I was still determined to find something in the data. My second thought was that perhaps LASSO (least absolute shrinkage and selection operator) could be used to pull out the key predictors. Perhaps we could see what really makes a chocolate bar. However, as I said before, our data contains categorical variables. Although group LASSO could have been an option (one that I may consider later), when Liza suggested we race them I was all for it. Let’s have some fun!
I created an animation using Flourish to show the average chocolate rating by country as the years change. The countries are coloured by continent. We begin in 2006 with Mexico and Bolivia producing supreme chocolate. In 2009, things get interesting when we see the rise of Asian chocolate in Vietnam and the Philippines. However, they can’t hold on to their rankings, and by 2016 Mexico has reclaimed its place in the top five and the Congo and Australia are new contenders. The world of chocolate is certainly fast moving.
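The numbers behind a race like this are just average ratings per country per year. As a rough sketch (the records, the field order and the helper `top_n` are invented for illustration, and this is not the preparation actually used for the animation), the aggregation could look like this in Python:

```python
from collections import defaultdict

# Hypothetical records: (review_year, country, rating). Not real data.
reviews = [
    (2006, "Mexico", 3.75),
    (2006, "Mexico", 3.25),
    (2006, "Bolivia", 3.25),
    (2009, "Vietnam", 3.5),
    (2009, "Mexico", 3.0),
]

# Collect every rating under its (year, country) key, then average.
ratings = defaultdict(list)
for year, country, rating in reviews:
    ratings[(year, country)].append(rating)
means = {key: sum(vals) / len(vals) for key, vals in ratings.items()}

years = sorted({year for year, _ in means})
countries = sorted({country for _, country in means})

# Wide table: one row per country, one value per year (None = no reviews).
wide = {c: [means.get((y, c)) for y in years] for c in countries}


def top_n(year, n=5):
    """Countries with the highest average rating in the given year."""
    ranked = sorted(
        ((m, c) for (y, c), m in means.items() if y == year), reverse=True
    )
    return [c for _, c in ranked[:n]]


print(years)
for country, row in wide.items():
    print(country, row)
print(top_n(2006))
```

A wide country-by-year table like `wide` is, as far as I know, the general shape that bar chart race tools work from, but check the template you are using before exporting.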
What do you see when you watch? How do your favourites perform, and how much chocolate do we need to eat before our ratings can also be considered “expert”?