Blog Post 3 - Web Scraping
In this post, I will use web scraping to extract and process data from the IMDb website to find out which movies or TV shows share actors with my favorite movie or show, and use that to recommend my next film to watch.
The link to the GitHub repository for this scraper project.
I will use the movie Star Wars: Episode VI - Return of the Jedi as an example of my favorite movie. The idea behind this recommender system is simple: if I enjoyed Star Wars, and some other movie or TV show has many of the same actors, then I might also like that movie or show.
§1. Setup
Before doing any web scraping, we need to make sure the Python package scrapy is installed. In my case, I installed it into my PIC16B environment using Anaconda. Then we can set up our project by running the command scrapy startproject your_project_name, which creates a new scrapy project directory.
I open my terminal, navigate to the location where I would like to keep my scrapy project, and run the following commands:
conda activate PIC16B
scrapy startproject IMDB_scraper
The above commands create an IMDB_scraper directory for my project, with a directory structure like this:
└── IMDB_scraper
    ├── scrapy.cfg
    └── IMDB_scraper
        ├── spiders
        │   └── __init__.py
        ├── __init__.py
        ├── middlewares.py
        ├── settings.py
        ├── items.py
        └── pipelines.py
Next, we run cd IMDB_scraper to move into the top-level IMDB_scraper directory (the one that contains scrapy.cfg).
Next, we can add the line CLOSESPIDER_PAGECOUNT = 20 to our settings.py file to prevent our scraper from downloading too much data while we’re still testing things out. We’ll remove this line later once we finish testing our project.
§2. Spider
In this section, we will write our scraper.
Now, let’s create a file called imdb_spider.py inside the spiders directory and type the following code into it.
import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    # The URL is for 'Star Wars: Episode VI - Return of the Jedi' on IMDb
    start_urls = ['https://www.imdb.com/title/tt0086190/']
The above code creates a subclass called ImdbSpider, which inherits from the Spider class in the scrapy package.
The name variable is used to call different spiders from the terminal; each spider should have a unique name.
start_urls is the list of URLs that tells our spider where to start crawling. In my case, the start URL points to the IMDb page for Star Wars: Episode VI - Return of the Jedi. You can replace the entry of start_urls with the URL of your own favorite movie or TV show.
Next, I will implement three parsing methods for the ImdbSpider class.
First parse method: parse( )
The first method is parse(self, response). It is the most important method in our spider: Scrapy calls it by default for each URL in start_urls, so it specifies what should happen when the spider receives a response. A response is an object corresponding to a webpage; it contains the raw HTML and many other useful attributes and methods for extracting information from the page.
    def parse(self, response):
        """
        Start at the page 'https://www.imdb.com/title/tt0086190/'
        for the movie 'Star Wars: Episode VI - Return of the Jedi' on IMDb,
        then navigate to the Cast & Crew page, whose URL has the form
        '<start_url>fullcredits/'.
        """
        # get the link for the Cast & Crew page (fullcredits/)
        # Note: there is no slash at the beginning of the path "fullcredits/"
        next_link = response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']
        # join the link of the Cast & Crew page to our current URL path
        # here it is: https://www.imdb.com/title/tt0086190/fullcredits/
        cast_link = response.urljoin(next_link)
        # yield a request to follow the Cast & Crew page,
        # and call the method parse_full_credits() once we get there
        # (scrapy.Request is used because only `import scrapy` is in scope)
        yield scrapy.Request(cast_link, callback=self.parse_full_credits)
In order to get the URL for the Cast & Crew page, we need to use the Developer Tools in our browser to inspect individual HTML elements. For example, on my Mac I right-click the Cast & Crew link and then click Inspect Element. You will then see something like this:
We can see that the link for the Cast & Crew page is fullcredits/, and that it is stored in the href attribute of an anchor tag <a>...</a>.
That <a> element sits under the ul element with class name ipc-inline-list, and that ul element sits under the div element with class name SubNav__SubNavContentBlock-sc-11106ua-2. So we can use response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href'] to get the value of the href attribute of the first matching a element.
We can use scrapy shell to check this.
First, run:
scrapy shell 'https://www.imdb.com/title/tt0086190/'
Then run:
response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']
After that, you will get something like this:
In [1]: response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']
Out[1]: 'fullcredits/?ref_=tt_ql_cl'
Great, we can see that we have successfully extracted the fullcredits/ link.
Second parse method: parse_full_credits( )
The method parse_full_credits(self, response) starts at our movie’s Cast & Crew page, finds all the names under the Series Cast section on that page, and then navigates to each actor’s or actress’s page.
    def parse_full_credits(self, response):
        """
        Parse the Full Cast & Crew page of the movie at
        '<start_url>fullcredits/', then navigate to each actor's page
        (crew members are not included).
        """
        # list of links for navigating to each actor's page (ex: '/name/nm0000434/')
        # Note: there is a slash at the beginning of the path "/name/nm0000434/"
        links = [a.attrib["href"] for a in response.css("td.primary_photo a")]
        # for every actor on the current page,
        # yield a request to follow that actor's link,
        # and call the method parse_actor_page() when the actor's page is reached
        for link in links:
            # add the base URL (https://www.imdb.com) to each actor's link (ex: https://www.imdb.com/name/nm0000434/)
            url = response.urljoin(link)
            yield scrapy.Request(url, callback=self.parse_actor_page)
Let’s use the developer tools to inspect the first actor element on the Cast & Crew page to get an idea of how to extract all the actor/actress links. In my case, I inspected the actor “Mark Hamill” and got something similar to this:
After inspection, we notice that each link is inside a table cell (<td> tag) with the class name primary_photo; inside that cell, there is an <a> tag whose href attribute contains the link we need. Therefore, we can use the list comprehension [a.attrib["href"] for a in response.css("td.primary_photo a")] to get all the actor/actress links.
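The two code comments about the leading slash matter because of how relative links are resolved. As far as I know, response.urljoin follows the same rules as the standard library's urllib.parse.urljoin, so a minimal sketch with the two links from this post (and no Scrapy at all) looks like this:

```python
from urllib.parse import urljoin

# Page the spider is on when parse() runs.
movie_page = "https://www.imdb.com/title/tt0086190/"
# No leading slash: the link is resolved relative to the current path.
credits_url = urljoin(movie_page, "fullcredits/?ref_=tt_ql_cl")

# Page the spider is on when parse_full_credits() runs.
credits_page = "https://www.imdb.com/title/tt0086190/fullcredits/"
# Leading slash: the link is resolved relative to the site root.
actor_url = urljoin(credits_page, "/name/nm0000434/")

print(credits_url)
print(actor_url)
```

In both cases the result is an absolute URL that can be handed to a request, which is why the spider never has to build URLs by string concatenation.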
Third parse method: parse_actor_page( )
The method parse_actor_page(self, response) starts at each actor/actress page, gets the actor’s or actress’s name, then extracts the names of all movies and TV shows listed under Actor or Actress in the Filmography section.
    def parse_actor_page(self, response):
        """
        Parse an actor's page and yield one dictionary per credit,
        containing the actor's name and the name of the movie or TV show.
        """
        # get the actor's name
        actor_name = response.css('span.itemprop ::text').get()
        # list of all movie or TV show names for the current page's actor;
        # acting credits have an id starting with "actor" (male) or "actress" (female)
        # (the loop variable actor_name below is local to the comprehension,
        # so it does not overwrite the actor's name extracted above)
        filmography_actor = [actor_name.css("a::text").get() for actor_name in response.css('div.filmo-row') if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]")]
        # for each movie or TV show name on this actor's page,
        # yield one dictionary containing the actor's name and that movie or TV show
        for movie_or_TV_name in filmography_actor:
            yield {
                "actor": actor_name,
                "movie_or_TV_name": movie_or_TV_name
            }
In my case, I used the developer tools to inspect the name of the actor “Mark Hamill” to figure out how to extract any actor’s or actress’s name.
The name is located in a <span> tag with class name itemprop, so we can use response.css('span.itemprop ::text').get() to extract the name.
We added ::text to the CSS query so that we extract only the text inside the <span> element with the class name itemprop. “Using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn’t find any element matching the selection.”
I used a list comprehension to get all movie or TV show names from the current page:
[actor_name.css("a::text").get() for actor_name in response.css('div.filmo-row') if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]")]
I noticed that males are under Actor, and females are under Actress in their corresponding Filmography section.
Then we should use the developer tool to inspect the movie or TV show name:
We can see that the name of the movie is inside an <a> tag, under a <div> tag with class name filmo-row. However, more than one kind of credit uses the same class name, so we need to check the id to get the correct data. Therefore, we add an if condition to check that the id starts with actor or actress; this is why I added if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]") inside the list comprehension.
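To see why both conditions are needed, it helps to remember that [id^=actor] is a "starts with" test, and the word actress does not start with actor. Here is a small pure-Python sketch of the same test; the id values are hypothetical, made up to mirror the pattern I saw in the inspector:

```python
# Hypothetical id values for four filmography rows (not scraped data).
row_ids = [
    "actor-tt0086190",
    "producer-tt0000001",
    "actress-tt0000002",
    "self-tt0000003",
]

def is_acting_credit(row_id: str) -> bool:
    # Same idea as the CSS selectors [id^=actor] and [id^=actress]:
    # keep a row only if its id starts with one of the two prefixes.
    return row_id.startswith(("actor", "actress"))

acting_rows = [row_id for row_id in row_ids if is_acting_credit(row_id)]
# Checking only the "actor" prefix would miss the actress row.
actor_only = [row_id for row_id in row_ids if row_id.startswith("actor")]

print(acting_rows)
print(actor_only)
```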
Good, now we can use our imdb_spider to crawl the data by running the command:
scrapy crawl imdb_spider -o results.csv
The above command creates a CSV file called results.csv, with a column for actors and a column for movies or TV shows, inside my IMDB_scraper directory.
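To get a feel for how the yielded dictionaries turn into rows of that file, here is a rough sketch using only Python's csv module. This is not Scrapy's own exporter, just an imitation of the result, and the two items are the first two rows that show up in my results later in this post:

```python
import csv
import io

# Two items of the shape that parse_actor_page() yields.
items = [
    {"actor": "Stacie Nichols",
     "movie_or_TV_name": "Star Wars: Episode VI - Return of the Jedi"},
    {"actor": "Stacie Nichols",
     "movie_or_TV_name": "Under the Rainbow"},
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["actor", "movie_or_TV_name"])
writer.writeheader()      # dictionary keys become the column names
writer.writerows(items)   # each dictionary becomes one row

csv_text = buffer.getvalue()
print(csv_text)
```

Each dictionary key becomes a column, which is why the keys chosen in parse_actor_page() are the column names we will see in pandas.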
§3. Make Your Recommendations
In this section, I will use the data from results.csv to make my next movie recommendations.
Don’t forget to comment out the line CLOSESPIDER_PAGECOUNT = 20 in the file settings.py. Then, after deleting the old results.csv (the -o flag appends to an existing file), we should run the command scrapy crawl imdb_spider -o results.csv again to get the complete data.
I will import the required Python modules at the beginning for convenience.
import pandas as pd
import numpy as np
from plotly import express as px
The next cell imports results.csv as a pandas DataFrame called results.
results = pd.read_csv("results.csv")
Once the results dataset is read into a pandas DataFrame, we can take a look at its first five rows using results.head().
results.head()
| | actor | movie_or_TV_name |
|---|---|---|
| 0 | Stacie Nichols | Star Wars: Episode VI - Return of the Jedi |
| 1 | Stacie Nichols | Under the Rainbow |
| 2 | Carole Morris | Star Wars: Episode VI - Return of the Jedi |
| 3 | Carole Morris | Under the Rainbow |
| 4 | Barbara O'Laughlin | Star Wars: Episode VI - Return of the Jedi |
In the next cell, I will count how many times each movie or TV show name occurs by using groupby() with an aggregation function.
df = results.groupby(["movie_or_TV_name"]).aggregate(len)
df
| movie_or_TV_name | actor |
|---|---|
| 'Allo 'Allo! | 1 |
| ...And Mother Makes Five | 1 |
| ...And the Band Played On | 1 |
| .COM | 1 |
| 10 Rillington Place | 1 |
| ... | ... |
| Zevo-3 | 1 |
| Zomb-G: Get Bit or Get Ate | 1 |
| Zombie Gang Bangers | 1 |
| Zorro | 4 |
| iMurders | 1 |

4179 rows × 1 columns
Now, we can compute a sorted list of the top movies and TV shows that share actors with my favorite movie or TV show, then use reset_index() to turn the movie_or_TV_name index back into a regular column. Finally, we rename our columns by using the df.rename() method.
df = df.sort_values(by="actor", ascending=False).reset_index()
df = df.rename(columns = {"movie_or_TV_name" : "movie", "actor" : "number of shared actors"})
df
| | movie | number of shared actors |
|---|---|---|
| 0 | Star Wars: Episode VI - Return of the Jedi | 217 |
| 1 | Star Wars: Episode V - The Empire Strikes Back | 39 |
| 2 | The Bill | 32 |
| 3 | Under the Rainbow | 31 |
| 4 | Doctor Who | 26 |
| ... | ... | ... |
| 4174 | Jimmy Kimmel Live! | 1 |
| 4175 | Joe and Max | 1 |
| 4176 | John Adams | 1 |
| 4177 | John and Julie | 1 |
| 4178 | iMurders | 1 |

4179 rows × 2 columns
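If the groupby-then-sort step feels a bit magical, the same counting can be sketched in plain Python with collections.Counter. This is only an illustration, not what pandas does internally, and it uses just the five rows shown by results.head() earlier:

```python
from collections import Counter

# (actor, movie_or_TV_name) pairs: the first five rows of results.
rows = [
    ("Stacie Nichols", "Star Wars: Episode VI - Return of the Jedi"),
    ("Stacie Nichols", "Under the Rainbow"),
    ("Carole Morris", "Star Wars: Episode VI - Return of the Jedi"),
    ("Carole Morris", "Under the Rainbow"),
    ("Barbara O'Laughlin", "Star Wars: Episode VI - Return of the Jedi"),
]

# Count how many rows (i.e. how many actors) each title appears in.
shared_actors = Counter(title for _actor, title in rows)

# most_common() returns (title, count) pairs sorted from most to fewest.
ranking = shared_actors.most_common()
for title, count in ranking:
    print(f"{title}: {count}")
```

Each row of results is one (actor, title) pair, so counting rows per title is the same as counting shared actors, as long as an actor is not listed twice for the same title.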
Next, we can find out the top ten movies and TV shows that share actors with my favorite movie or TV show.
df[:10]
| | movie | number of shared actors |
|---|---|---|
| 0 | Star Wars: Episode VI - Return of the Jedi | 217 |
| 1 | Star Wars: Episode V - The Empire Strikes Back | 39 |
| 2 | The Bill | 32 |
| 3 | Under the Rainbow | 31 |
| 4 | Doctor Who | 26 |
| 5 | Willow | 21 |
| 6 | Casualty | 19 |
| 7 | Star Wars: Episode IV - A New Hope | 18 |
| 8 | The Dark Crystal | 16 |
| 9 | Labyrinth | 16 |
From the list above, I notice that the number one entry is Star Wars: Episode VI - Return of the Jedi itself, since every movie “shares” the most actors with itself; so the next thing I should check out is Star Wars: Episode V - The Empire Strikes Back or The Bill.
We can use plotly to draw a histogram of the movies and shows that share at least 20 actors, excluding Star Wars: Episode VI - Return of the Jedi itself.
In the next cell, we prepare the data for the plot.
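df_2 = results.copy()
# for every row, store how many rows (actors) share that row's title
df_2['count'] = results.groupby(["movie_or_TV_name"]).transform(len)
# remove the movie Star Wars: Episode VI - Return of the Jedi
# NOTE: the titles in results.csv include the "Star Wars: " prefix, so this
# comparison matches no rows and the film stays in the output below; compare
# against the full title 'Star Wars: Episode VI - Return of the Jedi' to drop it
df_2 = df_2[df_2['movie_or_TV_name'] != 'Episode VI - Return of the Jedi']
# get movies that share at least 20 actors
df_2 = df_2[df_2['count'] >= 20].sort_values(by="count", ascending=False).rename(columns = {"movie_or_TV_name" : "movie"})
df_2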
| | actor | movie | count |
|---|---|---|---|
| 0 | Stacie Nichols | Star Wars: Episode VI - Return of the Jedi | 217 |
| 3288 | Brian Wheeler | Star Wars: Episode VI - Return of the Jedi | 217 |
| 3150 | Franki Anderson | Star Wars: Episode VI - Return of the Jedi | 217 |
| 3162 | Paul Brooke | Star Wars: Episode VI - Return of the Jedi | 217 |
| 3187 | Dickey Beer | Star Wars: Episode VI - Return of the Jedi | 217 |
| ... | ... | ... | ... |
| 1187 | Paul Markham | Willow | 21 |
| 2396 | Kenneth Coombs | Willow | 21 |
| 3662 | Willie Coppen | Willow | 21 |
| 3670 | Sadie Corre | Willow | 21 |
| 3693 | Peter Burroughs | Willow | 21 |

366 rows × 3 columns
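The filtering logic can also be sketched without pandas, using the top-ten counts from the table earlier in this post. One detail worth checking in a sketch like this is that excluding a title only works on an exact match, so the full title (including the "Star Wars: " prefix) is what has to be compared:

```python
# Shared-actor counts copied from the top-ten table above.
shared = {
    "Star Wars: Episode VI - Return of the Jedi": 217,
    "Star Wars: Episode V - The Empire Strikes Back": 39,
    "The Bill": 32,
    "Under the Rainbow": 31,
    "Doctor Who": 26,
    "Willow": 21,
    "Casualty": 19,
    "Star Wars: Episode IV - A New Hope": 18,
    "The Dark Crystal": 16,
    "Labyrinth": 16,
}

favorite = "Star Wars: Episode VI - Return of the Jedi"

# A shortened title is a different string, so it matches nothing.
short_title_found = "Episode VI - Return of the Jedi" in shared

# Keep titles with at least 20 shared actors, except the favorite itself.
recommended = {
    title: count
    for title, count in shared.items()
    if title != favorite and count >= 20
}

print(short_title_found)
for title, count in recommended.items():
    print(f"{title}: {count}")
```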
Make the plot.
fig = px.histogram(df_2,
                   y = 'movie',
                   color = 'actor',
                   title = "Histogram of Movies Recommendation")
fig.update_layout(yaxis_title = "Movie", xaxis_title = "Shared Actors")
fig.show()