… in which we explore continuous distributions with Spotify data, find out about the central limit theorem and related statistical tests, and become N-dimensional whale sharks.
Watch today’s video here:
Hover over the slides and press f for full screen. Press ? for a list of keyboard shortcuts.
knitr::include_url("slides6.html")
Disclaimer: As I wrote this lecture, the scope gradually increased because there was always one more thing I wanted to include, and it was kind of hard to stop. The exercises today will be very unrestrictive, so that you can explore at your own pace and so that we have more time for questions on Friday.
Today, we will explore the process of modeling and look at different types of models. In part, we will do so using the tidymodels framework. The tidymodels framework extends the tidyverse with specialized tools for all kinds of modeling tasks that fit neatly in with all the tools we already know. Go ahead and install them with:
install.packages("tidymodels")
renv
At this point, we have installed quite a lot of packages. On one hand, this is great fun because they extend what we can do and make tedious tasks fun. On the other hand, every package that we add introduces what is called a dependency. If a user doesn’t have the package installed, our analysis will not run. If we are feeling experimental and use functions from packages that are under active development and might change in the future, we will run into trouble when we update the package. But never updating anything ever again is no fun! I will show you how to get the best of both worlds: all the packages and functions your heart desires while maintaining complete reproducibility. This makes sure that you can come back to your old projects two years from now and they will still run just as they did at the time.
This solution is a package called renv
. The idea is as follows: Instead of installing all your packages into one place, where you can only have one version of a package at a time, renv
installs packages locally in your project folder. It also meticulously writes down the version numbers of all the packages you installed and keeps a cache, so it will not copy the same version twice.
It is an R package like any other, so first, we install it with:
install.packages("renv")
Then, in our RStudio project in the R console, we initialize the project to use renv
with:
renv::init()
This does a couple of things. It creates a file named .Rprofile
, in which it writes source("renv/activate.R")
. The R-profile file is run automatically every time you start a R session in this folder, so it makes sure renv
is active every time you open the project. It also creates a folder called renv
. This is where it will install the packages you want to use in the project. The most important file is the renv.lock
file. You can have a look at it, it is just a text file with all the packages and their exact versions.
You will notice that after initializing renv, we have no packages available, so for example we can’t load the tidyverse as usual. We will have to install it again! However, in this case it should be fairly fast, because renv knows that it was already installed globally and simply copies the files. After having installed a new package, we call:
renv::snapshot()
renv tells us what we changed in our environment, and after we confirm, it notes down the changes.
Now it is also really easy to collaborate with other people, because after we send them our project folder, all they have to do is run:
renv::restore()
to install all packages noted down in the lockfile. We can also use this ourselves if we installed a few too many packages or did an update we regret and want to go back to what is written in the lockfile.
Finally, renv also provides functions to update or install new packages. They work like install.packages
, but a bit more versatile. For example, let me show you the different locations from which we can install packages.
renv
can install packages from GitHub as well, for example let us say, we want to test out the latest version of the purrr
package to give feedback to the developers.
At https://github.com/tidyverse/purrr it says:
# ...
# Or the the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/purrr")
Well, we don’t need devtools for this, because renv
can do this with the regular install function:
renv::install("tidyverse/purrr")
Giving it just a package name installs a package from CRAN, a pattern of "username/packagename"
installs from GitHub. Now, back to the actual topic of today!
After having initialized renv
we need to install the packages that we need for the project even if we already have them in our global package cache, just so that renv
knows about them.
So goes a famous quote by statistician George Box:
»All models are wrong, but some are useful« — George Box
What this means is that any model is but a simplification of reality and must always omit details. No model can depict the complete underlying reality. However, models are useful, and to understand what they are useful for, we must first look at the different types of models out there.
The tidymodels book names three types of models, where any particular model can fall into multiple categories at once:
Descriptive Models are designed to describe or illustrate the data. For example, when we add a trend line with geom_smooth, the default method is a so-called LOESS curve, which stands for Locally Estimated Scatterplot Smoothing (a quick sketch follows below). It does produce insights by revealing patterns to us, but by itself it can not be used e.g. to make predictions. It is just a pretty-looking smooth line.
Inferential Models are designed to test hypotheses or make decisions. They rely heavily on our assumptions about the data (e.g. what probability distribution the population follows) and are the ones you will most likely encounter to answer research questions. They are the models that typically produce a p-value, which you compare to a threshold like we did last week.
Predictive Models are designed to process the data we have and make predictions about some response variable upon receiving new data. When done correctly, we also hold out some of the data that our model never gets to see until it is time to evaluate how it performs on unseen data. Depending on how much we know (or want to know) about the underlying processes, we differentiate between mechanistic models, like fitting a physically meaningful function to data, and empirically driven models, which are mainly concerned with creating good predictions, no matter the underlying mechanism.
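To make the first category concrete, here is a minimal sketch of a purely descriptive model. It uses the built-in mtcars data, since our dataset for today only appears further down; the default smoother in geom_smooth() fits a LOESS curve for small datasets and simply reveals the trend.

# A sketch of a descriptive model: the default LOESS smoother in ggplot2
library(ggplot2)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth()  # method = "loess" is the default for small datasets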
We will now explore different examples. First, let me introduce our dataset for today:
I created a playlist on Spotify, which is quite diverse so that we can look at a range of features. You can even listen to it here while you do the exercises if you want; I am doing so as I write this. The cool thing about Spotify is that they have an API, an Application Programming Interface. APIs are ways for computer programs to talk to each other. So while we use the Spotify app to look up songs, computers use the API to talk to the Spotify server. And because R has a rich ecosystem of packages, someone already wrote a package that allows R to talk to this API: spotifyr
.
If you check out the R folder in this lecture, you can see how I downloaded and processed the data about the playlist. Note that the script will not work for you right away, because you first need to register with Spotify as a developer and then get a so-called token (like a username and password in one long string) to be allowed to send bots their way. You probably just want to download the data from my GitHub repository the usual way.
Let’s have a look, shall we?
songs <- read_csv("data/spotify_playlist.csv")
We can get a quick overview of all columns with:
glimpse(songs, width = 60)
Rows: 347
Columns: 17
$ track_name <chr> "Africa", "Take on Me", "Wake Me…
$ track_artists <chr> "TOTO", "a-ha", "Wham!", "Elton …
$ danceability <dbl> 0.671, 0.573, 0.620, 0.504, 0.37…
$ energy <dbl> 0.373, 0.902, 0.573, 0.904, 0.89…
$ key <dbl> 9, 6, 0, 6, 8, 10, 2, 4, 6, 4, 1…
$ loudness <dbl> -18.064, -7.638, -11.893, -6.863…
$ mode <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,…
$ speechiness <dbl> 0.0323, 0.0540, 0.0423, 0.1790, …
$ acousticness <dbl> 0.257000, 0.018000, 0.271000, 0.…
$ instrumentalness <dbl> 8.01e-05, 1.25e-03, 0.00e+00, 1.…
$ liveness <dbl> 0.0481, 0.0928, 0.0607, 0.1400, …
$ valence <dbl> 0.732, 0.876, 0.897, 0.772, 0.66…
$ tempo <dbl> 92.718, 84.412, 81.548, 176.808,…
$ time_signature <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ track_duration_ms <dbl> 295893, 225280, 231333, 183440, …
$ track_popularity <dbl> 83, 83, 79, 79, 59, 61, 71, 83, …
$ track_year <dbl> 1982, 1985, 1984, 1983, 2020, 20…
Finally some decent numbers! Not just these measly discrete values we had last week. For each song in the playlist, we get the artist, the year it was released, and a number of features like how danceable, how loud or how fast the song is. You can easily imagine Spotify using these numbers to suggest new songs based on the features of those that you have listened to. And in fact, we are going to lay the foundations for such an algorithm today.
When dealing with a continuous distribution, like we have for a lot of our features in the spotify songs dataset, there are always multiple ways to represent the same data. First, we just look at the numbers. We will use the valence
values for our songs:
head(songs$valence)
[1] 0.732 0.876 0.897 0.772 0.668 0.324
Notice anything interesting in the numbers? I don’t either. Our brain is way better suited for looking at graphical representations, so: To the ggplot cave!
songs %>%
ggplot(aes(x = factor(0), valence)) +
geom_point()
This is kind of hard to see, because points overlap. We can get a better picture of the distribution by using transparency or a bit of jitter:
songs %>%
ggplot(aes(x = factor(0), valence)) +
geom_jitter(width = 0.05)
Using a histogram, we can put the points into bins and get a plot similar to what we got for discrete values. Note that the plot is flipped on its side now.
songs %>%
ggplot(aes(valence)) +
geom_histogram()
And we might want to play around with the bin size to get a better feel for the distribution. Another way is to apply a smoothing function and estimate the density of points along a continuous range, even in places where we originally had no points:
songs %>%
ggplot(aes(valence)) +
geom_density(fill = "darkblue", alpha = 0.3)
Both of these plots can be misleading if the original number of points is quite small, and in most cases we are better off showing the actual individual points as well. This is the reason why the first plots I made were vertical: there is a cool way of showing both the points and the distribution while still having space to show multiple distributions next to each other. Imagine taking the density plot, turning it 90 degrees and then mirroring it through the middle. What we get is a so-called violin plot. To overlay the points on top, we will use something a little more predictable than jitter
this time: From the ggbeeswarm
package I present: geom_quasirandom
.
songs %>%
ggplot(aes(factor(0), valence)) +
geom_violin(fill = "darkblue", alpha = 0.3) +
ggbeeswarm::geom_quasirandom(alpha = 0.6)
This is cool, because now we can easily compare two different distributions next to each other and still see all the individual points. For example, we might ask:
“Do songs in a major key have a higher valence than songs in a minor key in our dataset?”
songs %>%
ggplot(aes(factor(mode), valence)) +
geom_violin(fill = "darkblue", alpha = 0.3) +
ggbeeswarm::geom_quasirandom(alpha = 0.6) +
labs(x = "Mode (Minor/Major)")
Note: this jittering only works because the feature on the x-axis is discrete. If it were continuous, we would be changing the data by jittering on the x-axis.
We might also want to add summaries like the mean for each group to the plot with an additional marker. This leads us to the general concept of summary statistics. There are a number of them, and they can be quite useful to, well, summarise a complex distribution. But they can also be very misleading, as any simplification can be.
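For example, a mean marker per group can be layered on top of the violin plot with stat_summary() (a small sketch, not code from the lecture):

songs %>%
  ggplot(aes(factor(mode), valence)) +
  geom_violin(fill = "darkblue", alpha = 0.3) +
  ggbeeswarm::geom_quasirandom(alpha = 0.6) +
  # add one point per group at the group mean
  stat_summary(fun = mean, geom = "point", color = "darkred", size = 3)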
Let us start by considering different things we can say about our distribution in one number. First, we might look at the range of our numbers, the maximum and minimum. We will do this per mode, so we can compare the values.
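A per-mode summary like this can be computed with group_by() and summarise(); a sketch of the kind of call that produces the table below:

songs %>%
  group_by(mode) %>%
  summarise(max = max(valence), min = min(valence))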
# A tibble: 2 x 3
mode max min
<dbl> <dbl> <dbl>
1 0 0.877 0.0271
2 1 0.965 0.0176
It appears the valence can assume values between 0 and 1. A shortcut for this is the range
function:
range(songs$valence)
[1] 0.0176 0.9650
Next, we want to know the centers of the points. There are different notions of being at the center of a distribution. The mean or average is the sum of all values divided by the number of values. The median is what we call a quantile, a point that divides a distribution into equally sized parts, specifically such that 50% of the values are below and 50% are above the median.
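Again with group_by() and summarise() (a sketch matching the table below; leaving the expressions unnamed produces the backtick column names you see):

songs %>%
  group_by(mode) %>%
  summarise(mean(valence), median(valence))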
# A tibble: 2 x 3
mode `mean(valence)` `median(valence)`
<dbl> <dbl> <dbl>
1 0 0.399 0.395
2 1 0.299 0.258
The median is just one of the many percentiles we can think of. If we display the 50th as well as the 25th and 75th percentile on one plot, we get what is called a boxplot:
songs %>%
add_row(mode = 1, valence = 1.2) %>%
ggplot(aes(factor(mode), valence)) +
geom_boxplot(fill = "darkblue",
alpha = 0.3,
outlier.alpha = 1) +
annotate(geom = "curve", x = 1.8, y = 1.25, xend = 1.95, yend = 1.19,
curvature = .3, arrow = arrow(length = unit(2, "mm"))) +
annotate("text", x = 1.75, y = 1.25,
label = "outlier\n (that I added because there where none)",
vjust = 0.5, hjust = 1) +
coord_cartesian(clip = "off")
The “whiskers” of the box extend up to 1.5 times the box height (the interquartile range) beyond the box, or to the most extreme data point within that range, whichever makes shorter whiskers. Points that are more extreme than the whiskers are labeled outliers by the boxplot and usually displayed as their own points. Like with the violin plot, we also have the option to plot the original un-summarized points on top. In this case, we need to make sure to change the outlier color of the boxplot to NA
, because otherwise we are plotting them twice:
songs %>%
add_row(mode = 1, valence = 1.2) %>%
ggplot(aes(factor(mode), valence)) +
geom_boxplot(fill = "darkblue", alpha = 0.3,
outlier.color = NA) +
ggbeeswarm::geom_quasirandom(alpha = 0.6)
This hints at one downside of boxplots: The box is a very prominent focus point of the plot, but by definition, it only contains 50% of all datapoints. The rest is delegated to thin whiskers.
Finally, we want to know, how far the values scatter around their means and the potential population mean. This is encompassed in two closely related measures: the variance and the standard deviation.
For illustrative purposes, we can plot all data points for e.g. the valence in the order in which they appear in the data and add a line for the mean.
songs %>%
mutate(index = 1:n()) %>%
ggplot(aes(index, valence)) +
geom_segment(aes(y = mean(songs$valence),
yend = mean(songs$valence),
x = 0,
xend = length(songs$valence))) +
geom_segment(aes(xend = index, yend = mean(songs$valence)),
color = "darkred", alpha = 0.6) +
annotate("text", x = length(songs$valence) + 13, y = mean(songs$valence),
label = "Mean") +
geom_point()
The variance is the expected value of the squared deviation of a random variable from its mean.
In other words: take the squared distances of all points to the mean, sum them up (square all the red lines in the plot above and add them together) and then divide by \(n-1\).
\[var(X) = \frac{\sum_{i=1}^{n}{(x_i-\bar x)^2}}{n-1}\]
“Hang on!” I hear you saying: “Why \(n-1\)?” And it is an excellent question. The first statement talked about an expected value. (One example of an expected value is the mean, which is the expected value of… well, the values.) And indeed, an expected value often involves the term \(1/n\). But the statement was talking about the expected value (of the squared deviation) for the whole population. We can only use the uncorrected version when we have the whole population (e.g. all songs that ever existed) and want to talk about that population. Usually, all we have is a sample, from which we want to draw conclusions about the population. And when we use the sample to estimate the variance of the population, that estimate will be biased. We can correct for this bias by using \(n-1\) instead of \(n\). This is known as Bessel’s correction. I have yet to come across a really intuitive explanation, but here is one idea: the thing we are dividing by is not necessarily the sample size whenever we calculate the expected value of an estimator; it just happens to be the sample size in a bunch of cases. What the term really represents here is the degrees of freedom (DF) of the deviations. DFs can be thought of as the number of independent pieces of information. The degrees of freedom are \(n\) reduced by \(1\), because if we know the mean of a sample (we use it in our calculation), then once we know all but one of the individual values, the last value is automatically determined and thus doesn’t count towards the degrees of freedom.
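A small simulation sketch (not from the lecture) makes the bias visible: for samples drawn from a distribution with true variance 1, dividing by \(n\) underestimates the variance on average, while dividing by \(n-1\) does not.

set.seed(1)
# two versions of the variance: divided by n vs. divided by n - 1
biased_var   <- function(x) sum((x - mean(x))^2) / length(x)
unbiased_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

# 10000 small samples from a normal distribution with variance 1
samples <- replicate(10000, rnorm(5), simplify = FALSE)
mean(sapply(samples, biased_var))    # around 0.8, clearly below 1
mean(sapply(samples, unbiased_var))  # close to 1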
Next up: the Standard Deviation (SD) is the square root of the variance. It is more commonly used for error bars, because the square root inverts the squaring that was done to get the variance, so we are back in the units of the data.
\[\sigma_X=\sqrt{var(X)}\]
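A quick check on our data (just an illustration of the formula):

sd(songs$valence)
sqrt(var(songs$valence))  # identical to sd(songs$valence)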
Finally, we have the Standard Error of the Mean, sometimes only called Standard Error (SEM, SE). It is also used very commonly in error bars. The reason a lot of people favor it over the SD might just be that it is smaller, but the two have distinct use cases.
\[SEM=\sigma / \sqrt{n}\]
We take the standard deviation and divide it by the square root of \(n\). Imagine this: we actually have the whole population available, like for example all penguins on earth, and then we repeatedly take samples of size \(n\). The means of these individual samples will vary, so their distribution has its own mean, standard deviation and variance. The standard error is the standard deviation of these means. It is a measure of how far the means of repeated samples scatter around the true population mean. However, we don’t usually have the whole population! Measuring some property of all penguins in the world takes a long time, and running an experiment in the lab on all cells that exist and will ever exist takes an infinite amount of time. This is probably more than our research grant money can finance. So instead, the Standard Error of the Mean uses the standard deviation of our sample in the formula above, which is our best estimate for the standard deviation of the whole population. So, when you are trying to make inferences about the mean of the whole population based on your sample, it makes sense to also give the SEM as a way of quantifying the uncertainty.
While R has functions for sd
, mean
and var
, there is no built-in function for the sem
, but we can easily write one ourselves:
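One way to write it, translating the formula above directly (a minimal sketch that assumes the input contains no missing values):

# standard error of the mean: sd divided by the square root of n
sem <- function(x) {
  sd(x) / sqrt(length(x))
}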
However, be very wary of simple bar graphs with error bars; there is a lot that can be misleading about them.
songs %>%
group_by(mode) %>%
summarise(across(speechiness, list(m = mean, sd = sd, sem = sem))) %>%
ggplot(aes(factor(mode), speechiness_m, fill = factor(mode))) +
geom_errorbar(aes(ymin = speechiness_m - speechiness_sem,
ymax = speechiness_m + speechiness_sem,
color = factor(mode)),
size = 1.3, width = 0.3, show.legend = FALSE) +
geom_col(size = 1.3, show.legend = FALSE) +
coord_cartesian(ylim = c(0.06, 0.08)) +
scale_fill_manual(values = c("#1f6293", "#323232")) +
scale_color_manual(values = c("#1f6293", "#323232")) +
labs(title = "Don't Do This at Home!",
y = "Speechiness",
x = "Mode (Minor / Major)") +
theme(
plot.title = element_text(size = 44, family = "Daubmark",
color = "darkred")
)
When people say “the y-axis has to include 0,” this is the reason for it. It is not always true, for instance when there is another sensible baseline that is not 0, but especially for barplots, not having the y-axis start at 0 is about the most misleading thing you can do. The main reason for this is that humans perceive the height of the bars via their area, and this is no longer proportional to the values when the bars don’t start at 0. This plot also gives no indication of the type of error bars used or the sample size in each group. It uses the speechiness
feature, but it hides the actual distribution behind just 2 numbers (mean and SEM) per group:
So the next time you see a barplot, ask the question: what might it be hiding?
I hope you can take some inspiration from this chapter and now have the vocabulary to know where to look when it comes to your own data.
I had a lot of fun making the graphs for today’s session. Naturally, there will be a couple of questions as to how they were done. There are two pointers I want to give you.
Firstly, for all things ggplot
, the third edition of the ggplot2 book is currently being worked on by three absolute legends of their craft (“Welcome | Ggplot2,” n.d.; Wickham 2016). Hadley Wickham is the author of the original ggplot and ggplot2 packages, Danielle Navarro makes amazing artwork with ggplot and teaches it, and Thomas Lin Pedersen is the current maintainer of ggplot2 and constantly adds cool features to it. The in-development book is already available online for free: https://ggplot2-book.org/.
Secondly, we need to briefly talk about a concept we have only brushed by: graphics devices are to R what your printer is to your computer. When we create a plot in R, it starts out as mere numbers, but something has to turn these numbers into pixels (in the case of raster images) or vectors (in the case of vector images; you might know svg or pdf files. Sorry, but these are not the vectors in R but rather descriptions of lines). This is the job of the graphics device. When we use the ggsave
function for example, it figures out what to use based on the file extension, but we can also specify it manually. I am mentioning this here, because in the plot I just showed you, I used a different font than the default. This is something that can be incredibly tricky for graphics devices, because fonts are handled differently on every operating system. Luckily, it is about to get way easier, because Thomas Lin Pedersen is working on another package, a graphics device, that is both really fast and works well with fonts. You can check the current development version here: https://ragg.r-lib.org/
There are many different distributions out there. Luckily, one of them is quite special and can be used in a multitude of settings. It is the harmlessly named Normal Distribution. R has the usual set of functions for it (dnorm for the density, pnorm for the cumulative probability, qnorm for quantiles and rnorm for random numbers).
tibble(x = seq(-3, 3, 0.01)) %>%
ggplot(aes(x)) +
geom_function(fun = dnorm) +
stat_function(geom = "area", fun = dnorm,
fill = "darkblue", alpha = 0.3) +
labs(y = "density", title = "Normal Distribution Density")
tibble(x = seq(-3, 3, 0.01)) %>%
ggplot(aes(x)) +
geom_function(fun = pnorm) +
labs(y = "probability", title = "Cummulative Probability")
Now, why is this distribution so special?
The Central Limit Theorem (CLT) states that the sample mean of a sufficiently large number of independent random variables is approximately normally distributed. The larger the sample, the better the approximation.
For a great visualization of the central limit theorem, check out this interactive tutorial by Seeing Theory.
Because a lot of values we measure are actually the sum of many random processes, distributions of things we measure can often be approximated with a normal distribution.
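A small simulation sketch (not from the lecture): even when the individual values come from a very skewed distribution, the means of repeated samples pile up in a bell shape.

set.seed(1)
# 5000 means of samples of size 30 drawn from a skewed (exponential) distribution
sample_means <- replicate(5000, mean(rexp(n = 30, rate = 1)))

tibble(mean = sample_means) %>%
  ggplot(aes(mean)) +
  geom_histogram(bins = 40)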
We can visually test whether some values follow the normal distribution by using a quantile-quantile plot, which plots the quantiles of our sample against where the quantiles should be on a normal distribution. If the points fall on a straight line, the sample is (close to) perfectly normal.
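For example, a quantile-quantile plot of the valence values can be drawn with ggplot2 like this (a minimal sketch):

songs %>%
  ggplot(aes(sample = valence)) +
  geom_qq() +
  geom_qq_line(color = "darkred")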
The values close to the mean are pretty normal, but the tails of the distribution stray further from the normal distribution. There are way more very small and very large values than would be expected from a normal distribution.
There is one thing that comes up a lot in biological data: because a lot of processes in biology are reliant on signal cascades, they tend to be the result of many multiplicative effects, rather than additive effects, as would be required for the Central Limit Theorem. As a result, they are not distributed normally, but rather log-normally, because taking the logarithm of all values transforms multiplicative effects into additive effects!
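A small sketch of this idea (not from the lecture): values built from many multiplicative effects look log-normal, and their logarithm looks approximately normal.

set.seed(42)
# each value is the product of 20 random multiplicative effects
multiplicative <- replicate(2000, prod(runif(20, min = 0.5, max = 1.5)))

tibble(value = multiplicative) %>%
  ggplot(aes(log(value))) +
  geom_histogram(bins = 40)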
The CLT is only valid for large sample sizes. For smaller sample sizes, the distribution of means has fatter tails than a normal distribution. This is why for most statistical tests, we use the t-distribution instead of the normal distribution. As the degrees of freedom get higher, the t-distribution approaches the normal distribution.
base <- ggplot() + xlim(-5, 5)
base +
geom_function(aes(colour = "normal"), fun = dnorm, size = 1.2) +
geom_function(aes(colour = "t, df = 1"), fun = dt, args = list(df = 1), size = 1.2) +
geom_function(aes(colour = "t, df = 3"), fun = dt, args = list(df = 3), size = 1.2) +
geom_function(aes(colour = "t, df = 30"), fun = dt, args = list(df = 30), size = 1.2) +
guides(color = guide_legend(title = ""))
Remember the valence plot by mode?
songs %>%
ggplot(aes(factor(mode), valence)) +
geom_violin(fill = "darkblue", alpha = 0.3) +
ggbeeswarm::geom_quasirandom(alpha = 0.6)
For demonstrative purposes I am going to cheat a little and pretend that both distributions are approximately normal so that we can look at some hypothesis tests:
The first test is called Student’s t-test. “Student” was the pseudonym of its inventor, and the “t” stands for the t-distribution. We can use it to test the null hypothesis that two samples come from the same (approximately normal) distribution:
t.test(valence ~ mode, data = songs)
Welch Two Sample t-test
data: valence by mode
t = 3.9287, df = 328.06, p-value = 0.0001042
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.04999479 0.15027858
sample estimates:
mean in group 0 mean in group 1
0.3988897 0.2987530
We receive a p-value: the probability of getting a difference in means at least as extreme as the one observed in our samples if the null hypothesis were true. Here, the p-value is small, not because the difference is very large, but because we have a lot of values.
Tests that rely on the assumption of normality are called parametric tests, but when this assumption cannot be met, we need non-parametric tests.
The Wilcoxon rank-sum test, or Mann–Whitney U test, is one of these. It gets around the assumption of normality by transforming the data into ranks first, i.e. all points (independent of group) are ordered and their values are replaced by their position in the ordering (their rank). If we think of the t-test as testing for a difference in means, we can roughly think of the Wilcoxon rank-sum test as testing for a difference in medians.
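This ranking is what base R’s rank() function does (a tiny illustration):

rank(c(0.2, 0.9, 0.5, 0.7))
# [1] 1 4 2 3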
For example, let us test for a difference in speechiness
for the two modes (Minor/Major):
wilcox.test(speechiness ~ mode, data = songs)
Wilcoxon rank sum test with continuity correction
data: speechiness by mode
W = 17918, p-value = 0.0003848
alternative hypothesis: true location shift is not equal to 0
Both tests have the argument alternative
, which can be any of c("two.sided", "less", "greater")
. This is the direction of our alternative hypothesis. Are we testing for x being greater or less than y? Or are we testing for a difference in either direction (the default)? Having a hypothesis about the direction beforehand will result in smaller p-values (half of the two-sided ones), but you need to have this hypothesis before looking at the data, and especially not after running e.g. the two-sided test and then deciding that you want a smaller p-value! This is not how p-values work.
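For illustration only (a sketch, not a test from the lecture): if we had decided before looking at the data that we expect the minor-mode songs to have higher speechiness, a one-sided test could look like this.

# With the formula interface, the first group (mode 0, minor) plays the role of x
wilcox.test(speechiness ~ mode, data = songs, alternative = "greater")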
If you are unsure about how to tell the functions, which of two groups is supposed to be greater or lesser, you can also supply the data as x
and y
instead of using the formula interface as I did above:
speechiness_minor <- songs %>% filter(mode == 0) %>% pull(speechiness)
speechiness_major <- songs %>% filter(mode == 1) %>% pull(speechiness)
wilcox.test(speechiness_minor, speechiness_major)
Wilcoxon rank sum test with continuity correction
data: speechiness_minor and speechiness_major
W = 17918, p-value = 0.0003848
alternative hypothesis: true location shift is not equal to 0
If we save the result of the test, we can inspect the object further and extract information from it:
w_test <- wilcox.test(speechiness_minor, speechiness_major)
w_test$p.value
[1] 0.0003848449
The t.test on a single sample can also be used to create confidence intervals around a mean. In short, a 95% confidence interval is constructed such that, if we repeated the experiment an infinite number of times, 95% of the intervals computed this way would contain the true population mean. These confidence intervals are also sometimes used as error bars in plots.
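A sketch of such a one-sample test for the valence of the minor-mode songs (the object names here are assumptions that mirror the output below):

valence_minor <- songs %>% filter(mode == 0) %>% pull(valence)
test <- t.test(valence_minor)
test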
One Sample t-test
data: valence_minor
t = 21.508, df = 144, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.3622315 0.4355478
sample estimates:
mean of x
0.3988897
test$conf.int
[1] 0.3622315 0.4355478
attr(,"conf.level")
[1] 0.95
Lastly for today, we are going a bit out of scope. We are leaving the realm of looking at individual features and trying to condense all the information into as little space as possible.
The general notion of Dimensionality Reduction is to take all the features that we have and construct new features from them, so that we can represent our data with fewer features while losing little information.
For example, when two features are highly correlated, i.e. one changes when the other does, we might be better off replacing them with a single new feature that goes along the axis of maximum variance between the two. A position along this line accounts for most of the variance in these points, and the rest can be accounted for by a number describing the distance to that line (a perpendicular axis), which is less important than the first axis we found.
songs %>%
ggplot(aes(x = energy,
y = loudness,
label = track_name)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Imagine you are a whale shark
and want to orient your mouth in such a way that you can eat the greatest amount of krill in one sweep.
This is your first principal component. The second is perpendicular to the first. This is a throwback to “Math for Natural Scientists” and linear algebra: we are defining a new coordinate system here.
But whale sharks swim in 3 dimensions, not 2, and our data has even more dimensions, with each feature being represented as one dimension.
It can be quite hard for humans to imagine being an N-dimensional whale shark.
But R and tidymodels have us covered:
PCA is not a model in itself, but rather a data preprocessing step that generates new features (the principal components), which we can later use for other models. But today, we will do just the preprocessing by itself.
In tidymodels, preprocessing is done by defining a recipe:
songs_rec <- recipe( ~ ., data = songs) %>%
update_role(track_name, track_artists, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
We then take the recipe and prepare it.
songs_prep <- prep(songs_rec)
songs_prep
Data Recipe
Inputs:
role #variables
id 2
predictor 15
Training data contained 347 data points and no missing data.
Operations:
Centering and scaling for danceability, energy, key, ... [trained]
PCA extraction with danceability, energy, key, ... [trained]
From this prepared recipe, we extract a tidy form of the step we care about (usually the last one) to see what happened to our data.
songs_compontents <- tidy(songs_prep, 2)
songs_compontents
# A tibble: 225 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 danceability -0.300 PC1 pca_BKzit
2 energy -0.376 PC1 pca_BKzit
3 key -0.0498 PC1 pca_BKzit
4 loudness -0.373 PC1 pca_BKzit
5 mode 0.153 PC1 pca_BKzit
6 speechiness -0.146 PC1 pca_BKzit
7 acousticness 0.364 PC1 pca_BKzit
8 instrumentalness 0.359 PC1 pca_BKzit
9 liveness -0.123 PC1 pca_BKzit
10 valence -0.318 PC1 pca_BKzit
# … with 215 more rows
The original features were replaced by principal components that explain most of the variance. We can see which features ended up contributing to which components:
songs_compontents %>%
mutate(terms = tidytext::reorder_within(terms, by = value, within = component)) %>%
filter(component %in% paste0("PC", 1:3)) %>%
ggplot(aes(value, terms, fill = value > 0)) +
geom_col() +
facet_wrap(~ component, scales = "free") +
ggthemes::scale_fill_colorblind() +
guides(fill = "none") +
tidytext::scale_y_reordered()
We had to use 2 little helper functions from the tidytext
package to properly order the bars. In the positive direction, the first component is largely composed of high acousticness and instrumentalness and low energy, so we expect e.g. classical music to be very high on that axis. A high value on the second component means a high danceability combined with a low tempo.
We can now explore how the data looks in these new dimensions. We do so by baking the prepared recipe. We set new_data
to NULL
, because we want to use the data that was already used to prepare the recipe (i.e. calculate the principal components).
songs_baked <- bake(songs_prep, new_data = NULL)
songs_baked %>% rmarkdown::paged_table()
As always, this only becomes clear with a plot:
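For example, a minimal sketch plotting the songs along the first two principal components:

songs_baked %>%
  ggplot(aes(PC1, PC2)) +
  geom_point(alpha = 0.6)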
You can now imagine using this simpler representation of the songs in principal component space to, for example, propose new songs to users based on which songs are close to the ones they already listened to in this representation.
Lastly, I want to stress that the principal components are not created equal. The first component is always the most important. Here, we see that almost 40% of the variance can be explained just by the first component, so exploring more than 2 really makes little sense here.
sdev <- songs_prep$steps[[2]]$res$sdev
percent_variation <- sdev^2 / sum(sdev^2)
tibble(component = unique(songs_compontents$component),
percent_var = percent_variation) %>%
mutate(component = fct_inorder(component)) %>%
ggplot(aes(component, percent_var)) +
geom_col() +
scale_y_continuous(labels = scales::percent_format()) +
labs(x = NULL, y = "Percent variance explained by each PCA component")
The tidytuesday project also had a Spotify dataset. This one is even more interesting, because it spans different playlists of various genres and is annotated with said genres. And it has more data (over 30,000 songs)! Download it here:
https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
Sometimes we have to experience true horror to see the light in the darkness. Take the spotify data and make a plot that is truly horrible! I would appreciate a couple of sentences about your thought process and what makes your plot particularly bad. You can strike terror into the reader’s heart in multiple ways. Here are some ideas, mix and match what suits you:
The title of this exercise is stolen from this talk by Alison Hill.
Now use what you learned to make a great plot! Pick some features that you are interested in and visualize them as informatively and beautifully as possible, while still staying honest to the data. Maybe you are interested in changes over time, maybe you find your favorite artist and want to situate them in the context of other works. Maybe you want to explore how different features relate to each other, or even attempt to recreate the PCA to see if you can find clusters of genres. It is your call.
I am curious to see what you come up with!
There is no exercise on statistics here, but please think about the topics of today’s lecture and note down questions that come up, so we can talk about them on Friday.