Meghan Markle and the Media - Creating the data set


Outline

Since the first rumours of her romance with Prince Harry, Meghan Markle has been a fixture in the pages of the tabloid press. The coverage has been rocky, with controversy around the first photos of baby Archie, a lavish baby shower, private jet usage and family drama.

Some claim the media attention is justified, while others believe Meghan has been unfairly targeted for refusing to conform to predefined Royal standards. Her uneasy relationship with the tabloids has culminated in a lawsuit against four major tabloid newspapers and a refusal to work with them.

I have wanted to play with some sentiment analysis packages in R for a while, and this seemed a great opportunity. Below, I create a data set by web scraping three tabloid news websites and analysing the sentiment of the headlines.

This section focuses on how I created the data set and selected the sentiment analysis tools I would be using.

library(rvest)
library(dplyr)
library(stringr)
library(qdap)
library(DataCombine)
library(readr)
library(tidytext)
library(tidyr)
library(ggplot2)
library(sentimentr)
library(lubridate)

Creating my data set

The Sun

In order to identify the relevant content on The Sun’s website, I used the search bar on the homepage. Searching for the term “Meghan Markle” generated URLs in the formats below.

First page - https://www.thesun.co.uk/who/meghan-markle/

Second page - https://www.thesun.co.uk/who/meghan-markle/page/2/

Last page - https://www.thesun.co.uk/who/meghan-markle/page/240/

The advantage of using the search bar is that the content had already been tagged as being about Meghan Markle by The Sun, meaning I didn’t miss out on headlines that were about the Duchess but did not contain her full name.

My first task was to create a for loop that would create a character vector of URLs to scrape.

sun_urls <- c("https://www.thesun.co.uk/who/meghan-markle/")

# For loop that creates numbers to append
for (n in seq(2,240)) {
  sun_urls <- c(sun_urls, paste0("https://www.thesun.co.uk/who/meghan-markle/page/",n,"/"))
}

Once I had my vector of URLs, I needed to identify the HTML elements I wanted to extract. Once these were identified, another for loop was created to extract the headlines and put them into a data frame.

sun_headlines <- c()

# For every page in the URLs created
for (p in sun_urls) {
  # Read the HTML
  p <- read_html(p)
  
  # Extract the headline, turn it into text, clean
  hl <- html_nodes(p, css = "a.text-anchor-wrap") %>% 
    html_text() %>% 
    str_replace_all("\n", " ") %>% 
    str_replace_all("\t", "") %>% 
    trimws()
  
  sun_headlines <- c(sun_headlines, hl)
}

sun_headlines_df <- data.frame(sun_headlines)

The Sun’s headlines carry capitalised prefixes with no punctuation, which are usually puns. An example of this is “BABY JOY Meghan’s ex-husband announces he and his wife are expecting their first child”.

I suspect these prefixes will interfere with sentiment analysis later on, so once I have the finalised data frames they will be removed. In all, 3,839 headlines were scraped from The Sun.

I now needed the corresponding dates in order to do some time series analysis. While the previous for loop extracted the headlines from the pillar pages, I would have to visit every individual story to get the date the headline was published. This was done with another for loop.

sun_article_urls <- c()

# For every URL for the pillar pages about Meghan Markle
for (u in sun_urls) {
  # Read the HTML
  u <- read_html(u)
  # Extract the URLs for the stories
  sun_article_url <- html_nodes(u, css = "a.text-anchor-wrap") %>% 
    html_attr("href")
  
  sun_article_urls <- c(sun_article_urls, sun_article_url)
}

This gives me the individual URLs for every story. Once I checked the length to make sure the number of headlines matched the number of URLs, I extracted the date from each one.
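
The check itself isn’t shown above, but it amounts to a one-liner along these lines (a minimal sketch using the objects created earlier):

# Sense check: one article URL per headline scraped
length(sun_article_urls) == nrow(sun_headlines_df)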

# Create a function that takes one URL and extracts the date
sun_get_dates <- function(a) {
  a <- read_html(a)
  # Extracts and cleans the date
  date <- a %>% 
    html_node(css = "span.article__datestamp") %>% 
    html_text() %>% 
    str_replace(",", "") %>% 
    trimws()
  
  data.frame(date)
}

# Apply the above function to every URL of the individual stories
sun_all_dates <- lapply(sun_article_urls, sun_get_dates)
# Create a data frame of all of those dates
sun_headlines_dates <- plyr::ldply(sun_all_dates, data.frame)

Once I had checked that the data frame of dates had the same number of rows as the data frame of headlines, I needed to combine the two. As I didn’t create a key to match them, I joined them with a column bind, trusting that the order of scraping had not been disrupted. A quick sense check against The Sun website confirmed this worked.
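
Again, the length check isn’t shown in the post, but it comes down to comparing row counts (illustrative only):

# The dates should line up one-to-one with the headlines
nrow(sun_headlines_dates) == nrow(sun_headlines_df)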

sun_headlines_df <- cbind("The Sun", sun_headlines_df, sun_headlines_dates)

This results in a data frame of headlines about Meghan Markle from The Sun newspaper, along with the corresponding dates. Now is the time to remove the capitalised puns from the beginning of each headline. I found the best way to do this was to split the string on the gap between the pun and the headline (a pattern of three spaces) and extract the second element.

sun_headlines_df$sun_headlines <- str_split_fixed(sun_headlines_df$sun_headlines, "   ", 2)[,2]

Daily Mail

The search bar on the Daily Mail website was also used to identify the Meghan Markle-related content. The URLs for the pillar pages were less clear in this instance:

First page - https://www.dailymail.co.uk/home/search.html?offset=0&size=50&sel=site&searchPhrase=%22meghan+markle%22&sort=relevant&type=article&type=video&type=permabox&days=all

Another page - https://www.dailymail.co.uk/home/search.html?offset=10000&size=50&sel=site&searchPhrase=%22meghan+markle%22&sort=relevant&type=article&type=video&type=permabox&days=all

There is a value within the URL that increases in increments of 50. I did not identify the maximum value this could take, so I sorted the headlines by relevance using the Daily Mail’s own sorting function and scraped the 10,000 most relevant headlines.

I first had to get a list of the URLs for the Meghan Markle pillar pages:

dm_urls <- c()

# For every value between 0 and 10,000 in steps of 50
for (n in seq(0,10000, 50)) {
  # Paste the URL template with the value
  dm_url <- paste0("https://www.dailymail.co.uk/home/search.html?offset=",n,"&size=50&sel=site&searchPhrase=%22meghan+markle%22&sort=relevant&type=article&type=video&type=permabox&days=all")
  dm_urls <- c(dm_urls, dm_url)
}

The next step was creating a function that extracted both the headline and the date from one URL, then applying it to all of my URLs.

dm_get_one_page <- function(url) {
  
  # Convert URL to HTML
  url <- read_html(url)
  
  # Extract every headline from each page
  dm_hl <- html_nodes(url, css = ".sch-res-title a") %>% 
    html_text()
  
  # Extract the corresponding dates for the headlines
  dm_hl_date <- html_nodes(url, css = ".sch-res-info ") %>% 
    html_text() %>% 
    str_replace_all("\n","") 
  
  # Clean the dates
  dm_hl_date_clean <- gsub("\\,.*","",trimws(gsub(".*[-]([^.]+)[,].*", "\\1", dm_hl_date)))
  
  # Put headline and date into a data frame
  data.frame(dm_hl, dm_hl_date_clean)
}

# Apply the function to every pillar page URL 
dm_all <- lapply(dm_urls, dm_get_one_page)
# Create a data frame of the headlines and dates
dm_headlines_df <- plyr::ldply(dm_all, data.frame)

# Add "Daily Mail" as an identifying column to data frame
dm_headlines_df["Title"] <- "Daily Mail"

The Express

The same process was used to extract headlines from The Express.

exp_urls <- c()

# For loop that creates numbers to append to the URLs
for (n in seq(10, 3000, 10)) {
  exp_urls <- c(exp_urls, paste0("https://www.express.co.uk/search?s=meghan&order=relevant&o=",n))
}

# Build a function to get one page
exp_get_one_page <- function(url) {
  
  # Read the HTML in the URL
  url <- read_html(url)
  
  # Extract headline text
  exp_hl <- html_nodes(url, css = "h4.post-title") %>% 
    html_text()
  
  # Extract corresponding dates
  exp_date <- html_nodes(url, css = "time") %>% 
    html_text() %>% 
    str_replace("Published: ", "")
  
  # Put both elements into a data frame
  data.frame("The Express", exp_hl, exp_date)
}

# Apply the function to every URL
exp_all <- lapply(exp_urls, exp_get_one_page)
# Put the contents of the previous lapply into one data frame
exp_headlines_df <- plyr::ldply(exp_all, data.frame)

Clean and collate all data

Now that I have a separate data frame for each publication, I need to combine them into one master data frame in order to analyse sentiment. Each data frame needs a consistently named column identifying the publication the headline originally came from.

# The Express
exp_headlines_df <- exp_headlines_df %>% 
  select(publication = X.The.Express.,
         headline = exp_hl,
         date = exp_date)

# The Daily Mail
dm_headlines_df <- dm_headlines_df %>% 
  select(publication = Title,
         headline = dm_hl,
         date = dm_hl_date_clean)

# The Sun
sun_headlines_df <- sun_headlines_df %>% 
  select(publication = "\"The Sun\"",
         headline = sun_headlines,
         date)

# Combine all into one master data frame
master_df <- rbind(rbind(exp_headlines_df, dm_headlines_df), sun_headlines_df)

The date formats were different for every publication, and these needed to be aligned so I could create a Date data type out of them.

First, I extracted all the years and created a year column:

master_df$year <- str_sub(master_df$date, -4, -1)

Then I extracted all of the days:

master_df$day <- parse_number(str_sub(master_df$date, 1, -6))

Finally, I needed to change the months from text to numbers. I did this by identifying the pattern of the text of each month and replacing it with the number:

# Change the date column from factor to character
master_df$date <- as.character(master_df$date)

# Match the month abbreviation in each date string and set the corresponding month number
master_df$month[grepl("Jan", master_df$date, ignore.case = TRUE)] <- 1
master_df$month[grepl("Feb", master_df$date, ignore.case = TRUE)] <- 2
master_df$month[grepl("Mar", master_df$date, ignore.case = TRUE)] <- 3
master_df$month[grepl("Apr", master_df$date, ignore.case = TRUE)] <- 4
master_df$month[grepl("May", master_df$date, ignore.case = TRUE)] <- 5
master_df$month[grepl("Jun", master_df$date, ignore.case = TRUE)] <- 6
master_df$month[grepl("Jul", master_df$date, ignore.case = TRUE)] <- 7
master_df$month[grepl("Aug", master_df$date, ignore.case = TRUE)] <- 8
master_df$month[grepl("Sep", master_df$date, ignore.case = TRUE)] <- 9
master_df$month[grepl("Oct", master_df$date, ignore.case = TRUE)] <- 10
master_df$month[grepl("Nov", master_df$date, ignore.case = TRUE)] <- 11
master_df$month[grepl("Dec", master_df$date, ignore.case = TRUE)] <- 12

# Check that the only values present are between 1 and 12
unique(master_df$month)

A new date column is then created from the extracted date elements:

master_df <- mutate(master_df, clean_date = paste0(day, "/", month, "/", year))

master_df$clean_date <- as.Date(master_df$clean_date, "%d/%m/%Y")

Some additional cleaning was performed before analysis could begin:

# Find how many rows in the data frame have null dates
nrow(filter(master_df, is.na(master_df$clean_date)))

# Only 24 rows with null dates, all from The Sun. That's small enough for me to drop them
master_df <- filter(master_df, !is.na(master_df$clean_date))

# Change the headline data type from factor to character
master_df$headline <- as.character(master_df$headline)

# Create an ID for each headline
master_df$id <- seq.int(nrow(master_df))

Identifying the sentiment analysis packages to use

Before I began analysing the sentiment, some final tidying up was required.

First, I looked at any headlines that did not contain the word “Meghan” to see how relevant they were. There were just over 4,000 such headlines, and they were only loosely related, covering topics such as the wider Royal family. Since they didn’t seem to be explicitly about the Duchess herself, I removed them and was left with 12,096 headlines.
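
A quick way to gauge this (an illustrative sketch, not necessarily the exact code used):

# Count headlines that never mention "Meghan" by name
master_df %>% 
  filter(!grepl("Meghan", headline)) %>% 
  nrow()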

Some headlines were duplicated despite being released on different dates. These were filtered out, leaving me 10,980 headlines.

Finally, I filtered out any headlines before 1st July 2016. This was the month that Harry and Meghan supposedly met, so it seemed a sensible enough point to begin my analysis.

# Clean all data and save to a new data frame
master_df_tidy <- master_df %>% 
  select(publication, headline, date = clean_date) %>% 
  filter(grepl("Meghan", headline),
         !duplicated(headline),
         date > "2016-07-01")

# Reset the id since some headlines have been removed - important for sentimentr later
master_df_tidy$id <- seq.int(nrow(master_df_tidy))

My first resource for sentiment analysis was Text Mining with R (Silge and Robinson 2016). This publication discusses three lexicons; the two I decided would work for my purpose were Bing, which labels words as positive or negative, and AFINN, which scores words on a scale from -5 to 5.
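
Both lexicons are available through tidytext’s get_sentiments() function; a quick illustrative look at each (note the AFINN score column is called value, which is used further down):

# Preview the two lexicons
get_sentiments("bing")   # word plus a positive/negative label
get_sentiments("afinn")  # word plus an integer value between -5 and 5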

Split each headline into words

Since these lexicons work on individual words, I had to split each headline into individual words. I removed stop words while doing this.

# Create a new data frame to analyse word sentiment, stop words removed
master_df_text <- master_df_tidy %>% 
  unnest_tokens(word, headline) %>% 
  anti_join(stop_words)

# Get rid of extraneous columns
master_df_text <- master_df_text %>% 
  select(id, publication, date, word)

Get positive and negative words using Bing

Once I had separated the headlines into words, I performed an inner join to the Bing lexicon to get the associated sentiment for each word. While useful to see, this is not particularly accurate for gauging the sentiment of a sentence, as it doesn’t take into account how a group of words works together.

pos_neg_df <- master_df_text %>% 
  inner_join(get_sentiments("bing"))

Measure the sum of sentiment using AFINN

While the Bing lexicon gave me a binary classification of word sentiment, the AFINN lexicon also gives the magnitude of that sentiment. I decided to sum the word-level scores by headline to get an aggregated view.

affin_sent <- master_df_text %>% 
  # Join the to AFINN lexicon to get measure of sentiment
  inner_join(get_sentiments("afinn")) %>% 
  # Group data frame by headline
  group_by(id, publication, date) %>% 
  # Sum the sentiment - the higher the value, the more positive the sentiment
  summarise(sentiment = sum(value)) %>% 
  # Join back to the original data frame to get the corresponding headline 
  inner_join(master_df_tidy)

The magnitude of sentiment is more suitable for my purpose; however, each word is still considered in isolation. Ideally, I want to be able to analyse the sentiment of an entire sentence.

Get sentence-level sentiment with sentimentr

The package sentimentr (Rinker 2019) measures sentence-level sentiment by taking into account valence shifters such as negation, amplifiers and deamplifiers.
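
To see what valence shifters do in practice, here is a toy example of my own (not taken from the original analysis); the negated sentence should flip to a negative score, and the amplified one should score more strongly positive than the plain one.

# Toy example: negation flips the score, an amplifier strengthens it
sentiment(get_sentences(c(
  "Meghan is happy",
  "Meghan is not happy",
  "Meghan is very happy"
)))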

# Create a data frame of headlines split into sentences, get the sentiment of each sentence
sent_by_sent <- master_df_tidy %>% 
  get_sentences() %>% 
  sentiment_by()

# Bind the sentiment data frame with the original data frame to match the sentiment measure with the headline
headline_sentiment <- cbind(master_df_tidy, sent_by_sent) %>% 
  select(id, publication, headline, date, ave_sentiment) %>% 
  arrange(date)

A sense check of the headline sentiments holds up well. I believe this is a good methodology to use in my final analysis.
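
One way to run such a sense check is to eyeball the extremes (an illustrative sketch):

# Most negative and most positive headlines
headline_sentiment %>% arrange(ave_sentiment) %>% head(10)
headline_sentiment %>% arrange(desc(ave_sentiment)) %>% head(10)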

Get emotions for headlines with sentimentr

The package sentimentr also has a function that measures specific emotions in sentences. As well as how positive or negative a sentence is, it would be useful to see the emotions these headlines generate.

# Create a new data frame for emotions
emo_by_sent <- master_df_tidy %>% 
  # Split headlines into sentences
  get_sentences() %>% 
  # Get emotions for each sentence
  emotion_by() %>% 
  # Filter out any emotions that don't appear, only keep the ones that do
  filter(emotion_count > 0) %>% 
  # Join back to the original data frame to match the emotions with the headline
  mutate(id = element_id) %>% 
  left_join(master_df_tidy) %>% 
  select(id, publication, headline, date, emotion_type, emotion_count, ave_emotion) %>% 
  # Filter out any null headlines
  filter(!is.na(headline))

Next steps

In my next post, I’ll be performing analysis on the text itself, finally displaying my results in a Tableau Public dashboard.