The joy of spreadsheets
To get straight to the point: This post is about data manipulation and visualization in R.
Obviously, that is something completely different from my first post and with more content coming up, I have to think about giving the blog a clearer structure 🤔
But for now, it should be fine. As you will definitely see some code during your read, please do not be intimidated! You will not miss the final statement, even if you are not familiar with programming. As some kind of a reminiscence to the previous topic, this post again deals with movies and the development of their duration over the years.
Should I start learning how to code?
Learning how to code from scratch is awfully painful.
In the beginning, I realized, that you first have to know which language fits your needs. But without previous knowledge, it is nearly impossible to choose one. Luckily, there are many articles trying to help you to come to a decision.
For those of you working with Microsoft Excel on a daily basis, but not being fully satisfied and wanting to try something new, Python or R might be worth a look. I personally started to make full use of R half a year ago and I can highly recommend it to anybody dealing with data of whatever nature.
Why use R?
The benefits of using R are easy to identify and I will only list some of them.
R:
- is actively developed
- is available for every system
- allows you to perform a reproducible, high-quality analysis
- has powerful graphical capabilities
- is 100% free
4 easy steps to look at your data
This blog post will not explain how to install R or RStudio (use RStudio!).
What I want to show is a typical analysis pipeline, that can easily be adopted to any kind of spreadsheet data. I will go step-by-step from data import to drawing a final conclusion.
So let´s start by loading the packages we need. These packages (install them first) add new functions and therefore increase the power of R dramatically.
# 1. step: Load all packages we need ----------------------------------------------------------------------------
library(dplyr) # dplyr is a grammar of efficient data manipulation developed by Hadley Wickham
library(ggplot2) # ggplot2 is a powerful plotting system in R
library(plotly) # plotly is a graphing library to do interactive, publication-quality graphs
library(viridis) # viridis comes with different color scales to make pretty plots
library(ggjoy) # joyplots provides a way of visualizing changes in distributions over time or space
library(gridExtra) # gridExtra enables you to arrange multiple grid-based plots on a page
The next step is to import some data.
KAGGLE is a platform for predictive modeling and analytics competitions and offers a great source if you are searching for datasets to improve your coding skills. As I already mentioned in the beginning, I chose to take a look at movie data (5000+movie data scraped from IMDB website).
# 2. step: Load data (CSV file) from KAGGLE ---------------------------------------------------------------------
movie_file <- path_to_file # download first then specify absolute or relative filepath in "/.../.."
imdb_data <- read.csv(movie_file) # base function of R to read a CommaSeparatedFile
Now take a look at your imported data with glimpse from the dplyr package.
Notice that there are 5043 observations and 29 varibales, reflecting data from more than 5000 movies since 1916. For this analysis we are only interested in the columns title_year and duration.
# 3. step: Look at your data ------------------------------------------------------------------------------------
glimpse(imdb_data)
FALSE Observations: 5,043
FALSE Variables: 28
FALSE $ color <fctr> Color, Color, Color, Color, , Color...
FALSE $ director_name <fctr> James Cameron, Gore Verbinski, Sam ...
FALSE $ num_critic_for_reviews <int> 723, 302, 602, 813, NA, 462, 392, 32...
FALSE $ duration <int> 178, 169, 148, 164, NA, 132, 156, 10...
FALSE $ director_facebook_likes <int> 0, 563, 0, 22000, 131, 475, 0, 15, 0...
FALSE $ actor_3_facebook_likes <int> 855, 1000, 161, 23000, NA, 530, 4000...
FALSE $ actor_2_name <fctr> Joel David Moore, Orlando Bloom, Ro...
FALSE $ actor_1_facebook_likes <int> 1000, 40000, 11000, 27000, 131, 640,...
FALSE $ gross <int> 760505847, 309404152, 200074175, 448...
FALSE $ genres <fctr> Action|Adventure|Fantasy|Sci-Fi, Ac...
FALSE $ actor_1_name <fctr> CCH Pounder, Johnny Depp, Christoph...
FALSE $ movie_title <fctr> Avatar , Pirates of the Caribbean: ...
FALSE $ num_voted_users <int> 886204, 471220, 275868, 1144337, 8, ...
FALSE $ cast_total_facebook_likes <int> 4834, 48350, 11700, 106759, 143, 187...
FALSE $ actor_3_name <fctr> Wes Studi, Jack Davenport, Stephani...
FALSE $ facenumber_in_poster <int> 0, 0, 1, 0, 0, 1, 0, 1, 4, 3, 0, 0, ...
FALSE $ plot_keywords <fctr> avatar|future|marine|native|paraple...
FALSE $ movie_imdb_link <fctr> http://www.imdb.com/title/tt0499549...
FALSE $ num_user_for_reviews <int> 3054, 1238, 994, 2701, NA, 738, 1902...
FALSE $ language <fctr> English, English, English, English,...
FALSE $ country <fctr> USA, USA, UK, USA, , USA, USA, USA,...
FALSE $ content_rating <fctr> PG-13, PG-13, PG-13, PG-13, , PG-13...
FALSE $ budget <dbl> 237000000, 300000000, 245000000, 250...
FALSE $ title_year <int> 2009, 2007, 2015, 2012, NA, 2012, 20...
FALSE $ actor_2_facebook_likes <int> 936, 5000, 393, 23000, 12, 632, 1100...
FALSE $ imdb_score <dbl> 7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7...
FALSE $ aspect_ratio <dbl> 1.78, 2.35, 2.35, 2.35, NA, 2.35, 2....
FALSE $ movie_facebook_likes <int> 33000, 0, 85000, 164000, 0, 24000, 0...
After some filtering, I use the amazing ggplot2 package to plot the distributions, save them in a last step to 4 different variables and make use of the gridExtra package to plot them next to each other.
# 4. step: Different types of frequency plots to check whether films are getting shorter over time --------------
# joyplot: new package, very useful to visualize changes in distributions over time
joy_plot <- imdb_data %>% # %>% = pipe command from the magrittr package
filter(title_year > 1959, title_year < 2017) %>% # filter for years between 1959 & 2017
mutate(title_year = as.factor(title_year)) %>%
ggplot(aes(x = duration, y = title_year, fill = ..x..)) +
geom_joy_gradient(scale = 5, rel_min_height = 0.01, alpha = 0) +
scale_fill_viridis(name = "", option = "D") +
scale_x_continuous(limits = c(0,250)) +
theme_joy(font_size = 6, grid = T, line_size = 0.2, font_family = "Avenir") +
ggtitle("Joyplot") +
theme(plot.title = element_text(size = 12, hjust = 0.5,
margin = margin(t = 12, r = 2, b = 12, l = 2, unit = "pt"))) +
guides(fill = F)
# boxplot: quartile representation
bar_plot <- imdb_data %>%
filter(title_year > 1959, title_year < 2017) %>%
mutate(title_year = as.factor(title_year)) %>%
group_by(title_year) %>%
mutate(sum = (median(duration, na.rm = T))) %>%
ggplot(aes(y = duration, x = title_year, group = title_year, fill = sum, font.family = "Avenir")) +
geom_boxplot(outlier.alpha = 0) +
coord_flip() +
scale_fill_viridis(name = "", option = "D") +
scale_y_continuous(limits = c(0,250)) +
theme_joy(font_size = 6, grid = T, line_size = 0.2, font_family = "Avenir") +
ggtitle("Boxplot") +
theme(plot.title = element_text(size = 12, hjust = 0.5,
margin = margin(t = 12, r = 2, b = 12, l = 2, unit = "pt"))) +
guides(fill = F)
# violinplot: similar to boxplots with rotated kernel density on each side
violin_plot <- imdb_data %>%
filter(title_year > 1959, title_year < 2017) %>%
mutate(title_year = as.factor(title_year)) %>%
group_by(title_year) %>%
mutate(sum = (median(duration, na.rm = T))) %>%
ggplot(aes(y = duration, x = title_year, group = title_year, fill = sum, font.family = "Avenir")) +
geom_violin() +
coord_flip() +
scale_fill_viridis(name = "", option = "D") +
scale_y_continuous(limits = c(0,250)) +
theme_joy(font_size = 6, grid = T, line_size = 0.2, font_family = "Avenir") +
ggtitle("Violinplot") +
theme(plot.title = element_text(size = 12, hjust = 0.5,
margin = margin(t = 12, r = 2, b = 12, l = 2, unit = "pt"))) +
guides(fill = F)
# datapoints: every single datapoint is plotted
dots <- imdb_data %>%
filter(title_year > 1959, title_year < 2017) %>%
mutate(title_year = as.factor(title_year)) %>%
group_by(title_year) %>%
mutate(sum = (median(duration, na.rm = T))) %>%
ggplot(aes(y = duration, x = title_year, group = title_year, fill = duration, font.family = "Avenir")) +
geom_point(shape = 21, col = "white", alpha = 0.2) +
coord_flip() +
scale_fill_viridis(name = "", option = "D") +
scale_y_continuous(limits = c(0,250)) +
theme_joy(font_size = 6, grid = T, line_size = 0.2, font_family = "Avenir") +
ggtitle("Datapoints") +
theme(plot.title = element_text(size = 12, hjust = 0.5,
margin = margin(t = 12, r = 2, b = 12, l = 2, unit = "pt"))) +
guides(fill = F)
# arrange plots next to each other
grid.arrange(joy_plot, bar_plot, violin_plot, dots, ncol = 4)
Are movies getting shorter?
Take yourself some time to look at the four different ways of how to represent distribution data.
There is nothing wrong with any of these visualization techniques, but there are huge differences in what conclusion you can draw.
The Datapoints column of the plot is best suited to see the amount of movies in every year, but you can make no generalization from this one. I would probably also not use Violinplots or Boxplots in this case, because the changes in the median duration are too small to be recognized by eye.
I for myself would go with the brand new gg_joy package, which gives the impression that the duration of movies decreases over time.
If you currently feel limited by the possibilities Excel offers you and want to do data manipulation in a reproducible way, start learning R; it´s totally worth it.
Leave a Comment