While sitting in the office, my colleague Brian and I stumbled upon this dead-on honest trailer of The Room a.k.a. one of the worst movies of all time (written, produced, directed by and starring Tommy Wiseau).
And since the movie features such Shakespearian quotes as “I did not hit her, it’s not true! It’s bullshit! I did not hit her! I did naaht! Oh hi, Mark”, we joked about mining the screenplay for other hidden gems.
Well, no sooner said than done.
Conducting the analysis below requires the packages dplyr, ggplot2, magrittr, pdftools, reshape2, stringr, tidytext, and wordcloud.
For loading multiple packages at once I recommend `p_load()` from the `pacman` package, which is a wrapper function for `require()` and installs missing packages if necessary.
```r
# Install pacman if it is not already installed, then load it
if (!require("pacman")) install.packages("pacman")
library(pacman)

# Load packages
p_load(dplyr, ggplot2, magrittr, pdftools, reshape2, stringr, tidytext, wordcloud)
```
The full movie script can be found here. To download it and import it into R, simply run `download.file()` and extract the text via `pdf_text()` from the `pdftools` package:

```r
# Download pdf (mode = "wb" keeps the binary file intact on Windows)
download.file("https://theroomscriptblog.files.wordpress.com/2016/04/the-room-original-script-by-tommy-wiseau.pdf",
              "the-room-original-script-by-tommy-wiseau.pdf",
              mode = "wb")

# Extract text from pdf file (one character string per page)
room <- pdf_text("the-room-original-script-by-tommy-wiseau.pdf")
```
After extracting the raw text from The Room’s screenplay, we need to clean it before it can be analyzed.
Concretely, this means separating the lines of the raw text (with `\n` indicating line breaks), removing redundant text parts such as the cover page, headers and footers, blank lines, and directing instructions, as well as punctuation (except for apostrophes), non-alphabetic characters, and stopwords. (Note that I deliberately didn’t stem words.)
For most of these steps, `lapply()` can be used to apply the respective functions to each element of the list.
While performing these steps, the cleaned text, which consists of a sequence of strings, is split into single words – a process called tokenization.
```r
# Separate lines with \n indicating line breaks
room_tidy <- strsplit(room, "\n")

# Remove cover page
room_tidy <- room_tidy[-1]

# Remove page numbers and headers (first two lines of each page)
room_tidy <- lapply(room_tidy, function(x) x[-(1:2)])

# Remove footers (last two lines of each page)
room_tidy <- lapply(room_tidy, function(x) x[1:(length(x) - 2)])

# Remove information on act and scene
room_tidy <- lapply(room_tidy, function(x) gsub("END SCENE", "", x))
room_tidy <- lapply(room_tidy, function(x) gsub("ACT.*", "", x))
room_tidy <- lapply(room_tidy, function(x) gsub("SCENE.*", "", x))

# Remove punctuation (except for apostrophes) and numbers
room_tidy <- lapply(room_tidy, function(x) gsub("[^[:alpha:][:blank:]']", "", x))

# Remove directing instructions
# (lines consisting only of capital letters, blanks, and apostrophes)
room_tidy <- lapply(room_tidy, function(x) x[!grepl("^[A-Z ']+$", x)])

# Convert to lowercase
room_tidy <- lapply(room_tidy, tolower)

# Split strings into single words
room_tidy <- lapply(room_tidy, function(x) strsplit(x, " "))

# Turn list into a one-column data.frame (10,357 rows for this script);
# the column is named "word" to match the stop_words data set used below
room_df <- data.frame(word = unlist(room_tidy), stringsAsFactors = FALSE)

# Remove introductory part and last two lines
room_df <- tail(room_df, -102)
room_df <- head(room_df, -2)

# Remove blank lines from text
room_df %<>% filter(word != "")

# Remove stopwords
room_df %<>% anti_join(stop_words, by = "word")
```
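To see what the cleaning and tokenization steps do on a small scale, here is a minimal sketch in base R. The input lines in `toy_lines` are made up for illustration and are not taken from the actual screenplay; the regular expressions are the same ones used above.

```r
# Toy input: a few made-up script lines (hypothetical, not from the real script)
toy_lines <- c("JOHNNY", "Oh hi, Mark!", "", "MARK", "It's not true.")

# Drop lines consisting only of capital letters, blanks, and apostrophes
# (speaker names and directing instructions)
toy_clean <- toy_lines[!grepl("^[A-Z ']+$", toy_lines)]

# Remove everything except letters, blanks, and apostrophes
toy_clean <- gsub("[^[:alpha:][:blank:]']", "", toy_clean)

# Convert to lowercase and split into single words (tokenization)
toy_tokens <- unlist(strsplit(tolower(toy_clean), " "))

# Drop any empty strings that remain
toy_tokens <- toy_tokens[toy_tokens != ""]

print(toy_tokens)
```

Each surviving line ends up as a sequence of lowercase words, which is the shape the one-column data frame `room_df` has after cleaning.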
After processing the raw text and turning it into a tidy format, we can extract the most common words from the movie script by counting their appearances via `count()` from the `dplyr` package and plot them by running the following code
```r
# Find most common words
room_df_wordfreq <- room_df %>%
  count(word, sort = TRUE)

# Plot words that appear more than 20 times
room_df_wordfreq %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("") +
  ylab("") +
  ggtitle("Most common words in The Room", subtitle = "Written by Tommy Wiseau") +
  ylim(0, 100) +
  coord_flip()
```
which creates this graph:
It comes as little surprise that the most common words in the movie script are of a similar quality to the quote at the beginning of this post: the most frequently used words are “Johnny”, “Lisa”, and “Mark” (the main characters’ names), followed by more elaborate expressions such as “yeah” or “ha” (sadly, “naaht” didn’t make the list).
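For readers who want to see the counting step in isolation, the following is a rough base R equivalent on made-up tokens. It swaps `count()` for `table()` and `sort()`, so it is only an illustration of the idea, not the code used for the plot; `toy_words` is a hypothetical input.

```r
# Toy tokens (made up for illustration)
toy_words <- c("hi", "mark", "hi", "lisa", "hi", "mark")

# Count appearances and sort in decreasing order,
# similar in spirit to count(word, sort = TRUE)
toy_freq <- sort(table(toy_words), decreasing = TRUE)

# Keep only words appearing more than once,
# analogous to the filter(n > 20) threshold used for the plot
toy_top <- toy_freq[toy_freq > 1]

print(toy_top)
```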
In a similar manner, we can conduct a basic sentiment analysis and visualize the results. For this purpose we will use the `get_sentiments()` function from the `tidytext` package and use both the NRC Emotion Lexicon from Saif Mohammad and Peter Turney (all sentiments) and the sentiment lexicon from Bing Liu and collaborators (positive/negative sentiments).
```r
# Plot total sentiment scores (nrc)
room_df %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment) %>%
  ggplot(aes(sentiment, n)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(text = element_text(size = 30),
        axis.text.x = element_text(angle = 65, vjust = 0.5)) +
  xlab("") +
  ylab("") +
  ggtitle("Total sentiment scores in The Room", subtitle = "Written by Tommy Wiseau") +
  ylim(0, 500) +
  theme(legend.position = "none")
```
We can see in the NRC sentiments plot that most words in The Room’s screenplay are positively scored, followed by negatively scored words.
Turning to the Bing sentiment scores next, we can explore the previous finding further.
```r
# Calculate contributions to positive and negative sentiments (bing) by word
bing_counts <- room_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Calculate top word contributors (top 10 by count within each sentiment)
bing_counts_plot <- bing_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n))

# Plot most common positive and negative words
ggplot(bing_counts_plot, aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  xlab("") +
  ylab("") +
  ggtitle("Most common positive and negative words in The Room",
          subtitle = "Written by Tommy Wiseau") +
  coord_flip()
```
As the second sentiment graph shows, the most common positive words in The Room are “love”, “happy”, and “fine”, while the most common negative words are “worry”, “crazy”, and “wrong”.
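The lexicon lookup behind these plots is just an inner join between the tokens and a word-to-sentiment table. The sketch below illustrates that idea in base R with `merge()`; `toy_lexicon` is a made-up three-word lexicon, not the real Bing lexicon, and the tokens are hypothetical as well.

```r
# Toy tokens and a made-up lexicon (NOT the real Bing lexicon)
toy_df <- data.frame(word = c("love", "worry", "love", "room", "happy"),
                     stringsAsFactors = FALSE)
toy_lexicon <- data.frame(word = c("love", "happy", "worry"),
                          sentiment = c("positive", "positive", "negative"),
                          stringsAsFactors = FALSE)

# Inner join: tokens missing from the lexicon ("room") are dropped
toy_joined <- merge(toy_df, toy_lexicon, by = "word")

# Count tokens per sentiment
toy_counts <- table(toy_joined$sentiment)

print(toy_counts)
```

Words that do not appear in the lexicon contribute nothing to either score, which is worth keeping in mind when reading the totals.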
To finish it up, we now finally plot one of the infamous word clouds (albeit in a slightly more advanced version) by contrasting the most common positive words with the most common negative ones,
```r
# Plot comparison cloud in ggplot2 colors
## Run unique(g$data[[1]]["fill"]) after g <- ggplot_build(<plot>) to extract colors
room_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 60)
```
yielding this beautifully crafted word cloud:
Wrapping it up, the findings of this analysis can be summarized as follows: