Blog Post 3 - Web Scraping

In this post, I will use webscraping to extract and process data from the IMBD website to find out what movie or TV shows share actors with my favorite movie or show to recommend my next film to watch.

The link to the GitHub repository for this scraper project.

I will use the movie Star Wars: Episode VI - Return of the Jedi as an example of my favorite movie. So the idea of this movie recommender systems works is based on if I enjoyed the movie Star Wars, and some other movies or TV shows have many of the same actors as the movie I like, then I might also like that movie.

§1. Setup

Before doing webscraping, we need to make sure we have installed the Python package scrapy. In my case, I installed to my PIC16B environment by using Anaconda. Then we can start to set up our project by running the command line scrapy startproject your_project_name to set up a new scrapy project directory.

Once I open my terminal and navigate to the location where I would like to do my scrapy project. Then I run the following command line:

conda activate PIC16B
scrapy startproject IMDB_scraper

The above command line will create an IMDB_scraper directory as the name for my project; then we have our directory structure like this:

└── IMDB_scraper
    ├── scrapy.cfg
    └── IMDB_scraper
        ├── spiders
        │   └── __init__.py
        ├── __init__.py
        ├── middlewares.py
        ├── settings.py
        ├── items.py
        └── pipelines.py

We should start our project with ` cd IMDB_scraper to get in the correct IMDB_scraper` directory.

Next, we can add the line CLOSESPIDER_PAGECOUNT = 20 to our settings.py file to prevent our scraper from downloading too much data while we’re still testing things out. We’ll remove this line later once we finish testing our project.

§2. Spider

In this section we will write our scraper

Now, let’s create a file called imdb_spider.py inside the spiders director and type in the following code in imdb_spider.py.

import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    
    # The URL is for 'Star Wars: Episode VI - Return of the Jedi' from IMDB page
    start_urls = ['https://www.imdb.com/title/tt0086190/']

The above code creates a subclass called ImdbSpider which inherits from Spider class from the package scrapy.

  • name variable is used to call different spiders from the terminal; we should give the unique name for each different spider.
  • start_urls is the list of URLs to tell where our spider starts to crawl from. In my case, my start URL will navigate to the IMBD movie page for Star Wars: Episode VI - Return of the Jedi. You can replace the entry of start_urls with the URL corresponding to any of your favorite movie or TV show.

Next, I will implement three parsing method for the ImdbSpider class.

First parse method: parse( )

The first method is parse(self, response); this method usually parses the response argument. The parse method is the most important method in our Spider project; this method specifies what should happen when the Spider encounters a response. A response is an object corresponding to a webpage; it contains the raw HTML and many other useful attributes and methods for extracting information from the page.

def parse(self, response):
       """
       This function starting at the website 'https://www.imdb.com/title/tt0086190/' 
       for the movie 'Star Wars: Episode VI - Return of the Jedi' from IMDB;
       then navigate to the Cast & Crew page with URL of the form
       'https://start_urls/fullcredits/'
       """

       # get the link for the Cast & Crew page (fullcredits/)
       # Notes: No slash at the begining of the path "fullcredits/"
       next_link = response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']
       # join the link of the Cast & Crew page to our current URL path
       # hers is : https://www.imdb.com/title/tt0086190/fullcredits/
       cast_link = response.urljoin(next_link)

       # yield a request to follow the Cast & Crew page,
       # and call the method parse_full_credits() once get there
       yield Request(cast_link, callback = self.parse_full_credits)

In order to get the URL for navigation to the Cast & Crew page, we need to use the Developer Tools on our browser to inspect individual HTML elements. For example, I right-click on the Cast & Crew link on the Mac computer, then click Inspect Element. Then you will see some similar page like this:

blog_post3_image

Then we can find that the link for the Cast & Crew page is fullcredits/, and is inside the anchor tag <a>...</a> with href attribute. Then we can notice that the fullcredits/ is under the ul element with class name ipc-inline-list; and that ul elements is under the div element with class name SubNav__SubNavContentBlock-sc-11106ua-2. So we can use response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href'] to get the first value from the attribute href inside the attribute a.

We can use scrapy shell to check:

First, run

scrapy shell 'https://www.imdb.com/title/tt0086190/'

Then run:

response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']

After that, you will get something like this:

In [1]: response.css('div.SubNav__SubNavContentBlock-sc-11106ua-2 ul.ipc-inline-list a').attrib['href']
Out[1]: 'fullcredits/?ref_=tt_ql_cl'

Great, we can see that we have successfully extracted the fullcredits/.


Second parse parsing method: parse_full_credits( )

The method parse_full_credits(self, response) starts at our movie’s Cast & Crew page; find all names under the Series Cast section on that page, then navigate to each actor/actress page.

def parse_full_credits(self, response):
       """
       This function pareses the Full Cast & Crew page
       of the movie from URL the 'https://start_urls/fullcredits/',
       then navigate to each actors page (Crew members are not included).
       """
       
       # list of links for negative to each actor page (ex: '/name/nm0000434/')
       # Notes: There is a slash at the begining of the path "/name/nm0000434/"
       links = [a.attrib["href"] for a in response.css("td.primary_photo a")]
       
       # for all actors on the current page,
       # yield a request to follow those actor's link,
       # and call the method parse_actor_page () when each actor’s page is reached
       for link in links:
          # add the base URL (https://www.imdb.com) to each actor's link (ex: https://www.imdb.com/name/nm0000434/)
          url= response.urljoin(link)
          yield Request(url, callback= self.parse_actor_page)

Let’s use the developer tool to inspect the first actor element inside the Cast & Crew page to get some idea of how to extract all actor/actress links. In my case, I inspect the actor called “Mark Hamill”; then you will get something similar to this:

blog_post3_image

After inspection, we can notice that all links are inside the table row with <td> tag with the class name primary_photo; inside that row, we have an <a> tag with the href attribute, which contains the link we need. Therefor, we can use list comprehension [a.attrib["href"] for a in response.css("td.primary_photo a")] to get all the actor/actress link.


Third parse method: parse_actor_page( )

The method parse_actor_page(self, response) starts at each actor/actress page, gets the actor/actress’s name, then extracts all movie or TV show names inside the Actor/Actress under Filmography section.

    def parse_actor_page(self, response):
       """
       This function get each actor's page,
       It should yield a dictionary with the name of the actor and the name of each movie or TV show   
       """
       # get the actor's name
       actor_name = response.css('span.itemprop ::text').get()

       # a list contains all movies or TV shows name for the current page's actor
       # male is under actor
       # female is under actress
       filmography_actor = [actor_name.css("a::text").get() for actor_name in response.css('div.filmo-row') if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]")]
       # for all movie or TV show name on this actor's page, 
       # yield one dictionary contains the actor's name and each movie or TV show
       for movie_or_TV_name in filmography_actor:
        yield{
           "actor" : actor_name,
           "movie_or_TV_name" : movie_or_TV_name
        }

In my case, I use the developer tool to inspect the actor’s “Mark Hamil” name to find the idea of how to extract any actor/actress’s name.

blog_post3_image

The name is located at <span> tag with class name itemprop, so we can use response.css('span.itemprop ::text').get() method to extract the name. we added ::text to the CSS query, to allow us extract only the text elements directly inside <span> element with the name itemprop . “Using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn’t find any element matching the selection.”

I used the list comprehension get all movie or TV show names from the current page:

[actor_name.css("a::text").get() for actor_name in response.css('div.filmo-row') if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]")]

I noticed that males are under Actor, and females are under Actress in their corresponding Filmography section.

Then we should use the developer tool to inspect the movie or TV show name:

blog_post3_image

We can see the name of the movie is inside the <a> tag, under the <div> tag with class name filmo-row; However, more than one class has the same name; So, we need to check the id to get the correct data. Therefore, we should add if condition to check the id, which starts with actor or actress; thus, I added if actor_name.css("[id^=actor]") or actor_name.css("[id^=actress]") inside the list comprehension.


Good, now we can use our imdb_spider to crawl the data by running the command:

scrapy crawl imdb_spider -o results.csv

The above command will create a CSV file called results with a column for actors and a column for movies or TV shows inside my IMDB_scraper director.

§3. Make Your Recommendations

In this section, I will use the data from results.csv, then make my next movie recommendations.

Don’t gorget to comment out the line CLOSESPIDER_PAGECOUNT = 20 in the file settings.py. Then we should run the command scrapy crawl imdb_spider -o results.csv again to get the complete data.

I will import the required Python modules at begging for convenience.

import pandas as pd
import numpy as np
from plotly import express as px

Next cell will import the results.csv as a Pandas DataFrame called results.

results = pd.read_csv("results.csv")

Once I have read the results dataset into a pandas dataframe, we can take a look at the first five rows of the dataset using results.head().

results.head()
actor movie_or_TV_name
0 Stacie Nichols Star Wars: Episode VI - Return of the Jedi
1 Stacie Nichols Under the Rainbow
2 Carole Morris Star Wars: Episode VI - Return of the Jedi
3 Carole Morris Under the Rainbow
4 Barbara O'Laughlin Star Wars: Episode VI - Return of the Jedi

In the next cell, I will count the occurrence of the movie or TV name by using the groupby() and aggregation function.

df = results.groupby(["movie_or_TV_name"]).aggregate(len)
df
actor
movie_or_TV_name
'Allo 'Allo! 1
...And Mother Makes Five 1
...And the Band Played On 1
.COM 1
10 Rillington Place 1
... ...
Zevo-3 1
Zomb-G: Get Bit or Get Ate 1
Zombie Gang Bangers 1
Zorro 4
iMurders 1

4179 rows × 1 columns

Now, we can compute a sorted list with the top movies and TV shows that share actors with my favorite movie or TV show, then use reset_index() to recover the actor column. Finally, we should rename our column names by using df.rename() method.

df = df.sort_values(by="actor", ascending=False).reset_index()
df = df.rename(columns = {"movie_or_TV_name" : "movie", "actor" : "number of shared actors"})
df
movie number of shared actors
0 Star Wars: Episode VI - Return of the Jedi 217
1 Star Wars: Episode V - The Empire Strikes Back 39
2 The Bill 32
3 Under the Rainbow 31
4 Doctor Who 26
... ... ...
4174 Jimmy Kimmel Live! 1
4175 Joe and Max 1
4176 John Adams 1
4177 John and Julie 1
4178 iMurders 1

4179 rows × 2 columns

Next, we can find out the top ten movies and TV shows that share actors with my favorite movie or TV show.

df[:10]
movie number of shared actors
0 Star Wars: Episode VI - Return of the Jedi 217
1 Star Wars: Episode V - The Empire Strikes Back 39
2 The Bill 32
3 Under the Rainbow 31
4 Doctor Who 26
5 Willow 21
6 Casualty 19
7 Star Wars: Episode IV - A New Hope 18
8 The Dark Crystal 16
9 Labyrinth 16

From the list above, I notice the number one movie is Star Wars: Episode VI - Return of the Jedi since most shows will “share” the most actors with themselves; so the next movie I should check out is Star Wars: Episode V - The Empire Strikes Back or The Bill.


We can use plotly to draw a histogram of movies that share at least 20 actors and exclude the film Episode VI - Return of the Jedi.

In the next cell, we should clean our data.

df_2 = results.copy()
df_2['count'] = results.groupby(["movie_or_TV_name"]).transform(len)
# remove the movie Star Wars: Episode I - The Phantom Menace
df_2 = df_2[df_2['movie_or_TV_name'] != 'Episode VI - Return of the Jedi']
# get movies that share at least 20 actors
df_2 = df_2[df_2['count'] >= 20].sort_values(by="count", ascending=False).rename(columns = {"movie_or_TV_name" : "movie"})
df_2
actor movie count
0 Stacie Nichols Star Wars: Episode VI - Return of the Jedi 217
3288 Brian Wheeler Star Wars: Episode VI - Return of the Jedi 217
3150 Franki Anderson Star Wars: Episode VI - Return of the Jedi 217
3162 Paul Brooke Star Wars: Episode VI - Return of the Jedi 217
3187 Dickey Beer Star Wars: Episode VI - Return of the Jedi 217
... ... ... ...
1187 Paul Markham Willow 21
2396 Kenneth Coombs Willow 21
3662 Willie Coppen Willow 21
3670 Sadie Corre Willow 21
3693 Peter Burroughs Willow 21

366 rows × 3 columns

Make the plot.

fig = px.histogram(df_2,
                   y = 'movie',
                   color = 'actor',
                   title = "Histogram of Movies Recommendation")

fig.update_layout(yaxis_title = "Movie", xaxis_title = "Shared Actors")
                   
fig.show()

Written on February 2, 2022