IMDb Scraper

Before you start scraping, take a look at IMDb’s robots.txt.

With IMDbScraper you can easily scrape the top ranked movies or the most popular movies on IMDb. You can also scrape actor names by birthday or scrape movies by genre.

Creating an IMDbScraper Object

When creating an instance of IMDbScraper you can pass in an optional headers argument or an optional base_url argument. If you don’t have any specific preferences, the defaults should do the job, and they are what we use here:

from scrapista.imdb import IMDbScraper

# the class name must match the imported name exactly (Python is case-sensitive)
ims = IMDbScraper()

Now let’s take a look at the methods and properties available on this object:

print(dir(ims))
# ['__class__', '__delattr__', ... 'async_scrape_movies', 'base_url',
# 'popular_movies', 'scrape_actors_by_bdate', 'scrape_movie',
# 'scrape_movies_by_genre', 'top_ranked_movies']
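
The dir() output also includes Python's double-underscore attributes. As a small sketch, the public names alone can be listed like this. Note that public_names is a hypothetical helper, not part of scrapista, and Demo only stands in for a scraper instance so that the snippet runs without network access:

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Demo:
    """Stand-in object (an assumption); with the library you would pass `ims`."""
    base_url = "https://www.imdb.com"

    def scrape_movie(self, url):
        return {}


demo_names = public_names(Demo())
print(demo_names)
```

With a real scraper instance, the call would be public_names(ims).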

top_ranked_movies

This property returns a list of 250 dictionaries, each containing the data for one of IMDb’s top 250 ranked movies. Let’s see some of that data:

movies_data = ims.top_ranked_movies

print(movies_data)
# [{'name': 'The Shawshank Redemption', 'imdb_rating': 9.2, 'url': 'https://www.imdb.com/title/tt0111161/',
# 'img_src': 'https://www.imdb.com/title/tt0111161/'}, {'name': 'The Godfather',
# 'imdb_rating': 9.1, 'url': 'https://www.imdb.com/title/tt0068646/',
# 'img_src': 'https://www.imdb.com/title/tt0068646/'}, ... ]
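
Since the result is a plain list of dictionaries, ordinary list operations apply to it. The sketch below filters by rating, assuming the 'name' and 'imdb_rating' keys shown in the output above. The helper movies_rated_at_least is hypothetical, and the sample list is copied from that output so the snippet runs offline:

```python
# Two entries copied from the sample output; a real call returns 250 of them.
sample_movies = [
    {"name": "The Shawshank Redemption", "imdb_rating": 9.2,
     "url": "https://www.imdb.com/title/tt0111161/"},
    {"name": "The Godfather", "imdb_rating": 9.1,
     "url": "https://www.imdb.com/title/tt0068646/"},
]


def movies_rated_at_least(movies, threshold):
    """Return the names of the movies whose 'imdb_rating' is >= threshold."""
    return [m["name"] for m in movies if m["imdb_rating"] >= threshold]


print(movies_rated_at_least(sample_movies, 9.2))
```

With the scraper itself, you would pass ims.top_ranked_movies instead of sample_movies.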

scrape_movies_by_genre

This method expects a genre as its argument. It returns a list of movie data for that genre:

action_movies = ims.scrape_movies_by_genre("action")

print(action_movies)
# [{'name': 'Black Widow', 'url': 'https://www.imdb.com/title/tt3480822/'}, {'name':
# 'Loki', 'url': 'https://www.imdb.com/title/tt9140554/'}, {'name': 'Gunpowder
# Milkshake', 'url': 'https://www.imdb.com/title/tt8368408/'}, ... ]
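
Each entry carries a URL that can be passed on to the other methods. If you only need the title identifier, a regular expression is enough. This is a sketch that assumes URLs shaped like the ones in the output above; title_id is a hypothetical helper, not a scrapista function:

```python
import re

# Matches the "tt" identifier that follows "/title/" in an IMDb URL.
TITLE_ID_PATTERN = re.compile(r"/title/(tt\d+)")


def title_id(url):
    """Return the 'tt...' identifier found in url, or None if there is none."""
    match = TITLE_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Entries copied from the sample output so the snippet runs offline.
sample = [
    {"name": "Black Widow", "url": "https://www.imdb.com/title/tt3480822/"},
    {"name": "Loki", "url": "https://www.imdb.com/title/tt9140554/"},
]
ids_by_name = {m["name"]: title_id(m["url"]) for m in sample}
print(ids_by_name)
```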

scrape_movie

This method expects the IMDb URL of the movie you’d like to get data about. It returns a dict containing all the data items:

movie_data = ims.scrape_movie("https://www.imdb.com/title/tt8332922/")

print(movie_data)
# {'name': 'A Quiet Place Part II', 'rating_count': 110000, 'rating_value': 7.4,
# 'running_time(min)': 97, 'genres': ['Drama', 'Horror', 'Sci-Fi'], 'release':
# '2020', 'restriction': '16+', 'trailer': 'https://www.imdb.com/title/tt8332922/mediaviewer/rm417579265/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt8332922/mediaviewer/rm417579265/?ref_=tt_ov_i',
# 'metascore': 71.0, 'review_count': 1000, 'critic_count': 313,
# 'cast': ['Emily Blunt', 'Millicent Simmonds', 'Cillian Murphy']}
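
Note that some keys, such as 'running_time(min)', contain characters that make them usable only with bracket access. As an illustration, the sketch below turns such a dict into a one-line summary. The summarize helper is hypothetical, and the sample dict is a subset of the output above:

```python
def summarize(movie):
    """Build a short one-line description from a scraped movie dict."""
    # Split the running time in minutes into whole hours and leftover minutes.
    hours, minutes = divmod(movie["running_time(min)"], 60)
    genres = ", ".join(movie["genres"])
    return (f"{movie['name']} ({movie['release']}): "
            f"{movie['rating_value']}/10, {hours}h {minutes}m, {genres}")


# Subset of the sample output, so the snippet runs without the library.
sample_movie = {
    "name": "A Quiet Place Part II",
    "rating_value": 7.4,
    "running_time(min)": 97,
    "genres": ["Drama", "Horror", "Sci-Fi"],
    "release": "2020",
}
summary = summarize(sample_movie)
print(summary)
```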

async_scrape_movies

This method is very similar to the one above. The difference is that it expects a list of movie URLs and sends the requests to all of them asynchronously, which makes scraping many movies much quicker:

movies = ["https://www.imdb.com/title/tt0111161/", "https://www.imdb.com/title/tt0050083/"]

movies_data = ims.async_scrape_movies(movies)

print(movies_data)
# [{'name': 'The Shawshank Redemption', 'rating_count': 2000000, 'rating_value': 9.3,
# 'running_time(min)': 142, 'genres': ['Drama'], 'release': '1994', 'restriction': '13+',
# 'trailer': 'https://www.imdb.com/title/tt0111161/mediaviewer/rm10105600/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt0111161/mediaviewer/rm10105600/?ref_=tt_ov_i',
# 'metascore': 80.0, 'review_count': 9000, 'critic_count': 183, 'cast':
# ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']}, {'name': '12 Angry Men',
# 'rating_count': 716000, 'rating_value': 9.0, 'running_time(min)': 96, 'genres':
# ['Crime', 'Drama'], 'release': '1957', 'restriction': 'R?', 'trailer':
# 'https://www.imdb.com/title/tt0050083/mediaviewer/rm2927108352/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt0050083/mediaviewer/rm2927108352/?ref_=tt_ov_i',
# 'metascore': 96.0, 'review_count': 1000, 'critic_count': 159,
# 'cast': ['Henry Fonda', 'Lee J. Cobb', 'Martin Balsam']}]

This is probably the fastest way to scrape multiple movie pages. The method also takes an optional argument called checkpoints, which prints a progress log to your terminal each time that number of movies has been scraped. So, if you pass 5 as checkpoints, a log is printed to the console after every 5 movies scraped.
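
To illustrate the general idea of concurrent requests with checkpoint logging, here is a minimal asyncio sketch. It is not scrapista's implementation: fake_fetch, scrape_all, the log parameter, and the example.invalid URLs are all assumptions made so the snippet runs without network access.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a network request (hypothetical); yields control once."""
    await asyncio.sleep(0)
    return {"url": url}


async def scrape_all(urls, checkpoints=None, log=print):
    """Fetch all urls concurrently, logging after every `checkpoints` results."""
    done = 0

    async def fetch_one(url):
        nonlocal done
        result = await fake_fetch(url)
        done += 1
        # Emit a progress message each time a multiple of `checkpoints` is hit.
        if checkpoints and done % checkpoints == 0:
            log(f"{done} movies scraped")
        return result

    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


# Placeholder URLs; they are never requested.
urls = [f"https://example.invalid/title/{i}/" for i in range(1, 6)]
messages = []
results = asyncio.run(scrape_all(urls, checkpoints=2, log=messages.append))
print(messages)
```

With five URLs and checkpoints=2, the sketch logs after the second and fourth completed fetch, and the results come back in the same order as the input list.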

scrape_actors_by_bdate

This method takes an optional date argument as a string. If you don’t pass one, the default is today. The date should be in “month-day” format, for example “05-24”:

# passing in a custom date
actors = ims.scrape_actors_by_bdate("05-24")

print(actors)
# ['Doug Jones', 'Daisy Edgar-Jones', 'Alfred Molina', 'Brianne Howey', 'John C.
# Reilly', ... 'Amelia Cooke', 'Benjamin Sutherland']
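
If the date comes from a datetime.date object rather than a hand-written string, the “month-day” format can be produced with strftime. This is a small sketch; bdate_string is a hypothetical helper, not part of scrapista:

```python
from datetime import date


def bdate_string(day):
    """Format a date as a zero-padded 'month-day' string."""
    return day.strftime("%m-%d")


formatted = bdate_string(date(2021, 5, 24))
print(formatted)
```

The resulting string could then be passed as the date argument, for instance bdate_string(date.today()).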

Now, let’s call the method with its default, without passing any arguments:

# if no date is passed, today's date is used
actors = ims.scrape_actors_by_bdate()

print(actors)
# ['Jonathan Rhys Meyers', 'Nikolaj Coster-Waldau', 'Alyvia Alyn Lind', 'Maya
# Rudolph', 'Taylor Schilling', ... 'Jeté Laurence', 'Bill Engvall']