Imdb Scraper

For taking a look at IMDb’s robots.txt visit here.

With IMDbScrape you can scrape top ranked movies or the most popular movies available on IMDb with ease. Also, you can scrape actor names by their birthdays or scrape movies based on their genres.

Creating an ImdbScraper Object

While creating an instance of IMDbScraper you can pass in optional headers argument or an optional base_url argument. If you don’t have any specific preferences, default ones should do your job as we will use now:

from scrapista.imdb import IMDbScraper

ims = ImdbScraper()

Now let’s take a look at the available methods and properties from this object:

print(dir(ims))
# ['__class__', '__delattr__', ... 'async_scrape_movies', 'base_url',
# 'popular_movies', 'scrape_actors_by_bdate', 'scrape_movie',
# 'scrape_movies_by_genre', 'top_ranked_movies']

top_ranked_movies

This property will return a list of length 250. All of them being dictionaries containing data of the respective top 250 ranked movie in IMDb. Let’s see some of those movies’ data:

movies_data = ims.top_ranked_movies

print(movies_data)
# [{'name': 'The Shawshank Redemption', 'imdb_rating': 9.2, 'url': 'https://www.imdb.com/title/tt0111161/',
# 'img_src': 'https://www.imdb.com/title/tt0111161/'}, {'name': 'The Godfather',
# 'imdb_rating': 9.1, 'url': 'https://www.imdb.com/title/tt0068646/',
# 'img_src': 'https://www.imdb.com/title/tt0068646/'}, ... ]

popular_movies

This other property returns the most popular movies in IMDb. Unlike the top ranked ones, most popular movies change often and can be vary depending on the day you scrape them:

popular_movies = ims.popular_movies

print(popular_movies)
# [{'name': 'Space Jam: A New Legacy', 'url': 'https://www.imdb.com/title/tt3554046/'},
# {'name': 'Black Widow', 'url': 'https://www.imdb.com/title/tt3480822/'}, {'name':
# 'Old', 'url': 'https://www.imdb.com/title/tt10954652/'}, ... ]

scrape_movies_by_genre

With this method you are expected to pass in genre as an argument. It will return you the list of movies data on relative to that genre:

action_movies = ims.scrape_movies_by_genre("action")

print(action_movies)
# [{'name': 'Black Widow', 'url': 'https://www.imdb.com/title/tt3480822/'}, {'name':
# 'Loki', 'url': 'https://www.imdb.com/title/tt9140554/'}, {'name': 'Gunpowder
# Milkshake', 'url': 'https://www.imdb.com/title/tt8368408/'}, ... ]

scrape_movie

This method expects an ımdb URL of the movie you’d like to get data about. The returned object will be a dict object containing all the data items:

movie_data = ims.scrape_movie("https://www.imdb.com/title/tt8332922/")

print(movie_data)
# {'name': 'A Quiet Place Part II', 'rating_count': 110000, 'rating_value': 7.4,
# 'running_time(min)': 97, 'genres': ['Drama', 'Horror', 'Sci-Fi'], 'release':
# '2020', 'restriction': '16+', 'trailer': 'https://www.imdb.com/title/tt8332922/mediaviewer/rm417579265/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt8332922/mediaviewer/rm417579265/?ref_=tt_ov_i',
# 'metascore': 71.0, 'review_count': 1000, 'critic_count': 313,
# 'cast': ['Emily Blunt', 'Millicent Simmonds', 'Cillian Murphy']}

async_scrape_movies

This method is very similar to the one above. As a difference, this expects a list of movie URLs and it makes asynchronous requests to all of them. This makes the scraping process so much quicker:

movies = ["https://www.imdb.com/title/tt0111161/", "https://www.imdb.com/title/tt0050083/"]

movies_data = ims.async_scrape_movies(movies)

print(movies_data)
# [{'name': 'The Shawshank Redemption', 'rating_count': 2000000, 'rating_value': 9.3,
# 'running_time(min)': 142, 'genres': ['Drama'], 'release': '1994', 'restriction': '13+',
# 'trailer': 'https://www.imdb.com/title/tt0111161/mediaviewer/rm10105600/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt0111161/mediaviewer/rm10105600/?ref_=tt_ov_i',
# 'metascore': 80.0, 'review_count': 9000, 'critic_count': 183, 'cast':
# ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']}, {'name': '12 Angry Men',
# 'rating_count': 716000, 'rating_value': 9.0, 'running_time(min)': 96, 'genres':
# ['Crime', 'Drama'], 'release': '1957', 'restriction': 'R?', 'trailer':
# 'https://www.imdb.com/title/tt0050083/mediaviewer/rm2927108352/?ref_=tt_ov_i',
# 'image_source': 'https://www.imdb.com/title/tt0050083/mediaviewer/rm2927108352/?ref_=tt_ov_i',
# 'metascore': 96.0, 'review_count': 1000, 'critic_count': 159,
# 'cast': ['Henry Fonda', 'Lee J. Cobb', 'Martin Balsam']}]

This method of scraping multiple pages of movies is probably the fastest way. There is an optional argument called checkpoints in this method so that you can get outputs on your terminal after the the same amount of movies have been scraped as your checkpoints. So, if you passed 5 as your checkpoint there would be a log printed on the console after every 5 movies that’s been scraped.

scrape_actors_by_bdate

This method expects an optional date argument in string. If you don’t pass in one, the default is today. The date format passed into the method should be “month-day” for example, “05-24”:

# passing in a custom date
actors = ims.scrape_actors_by_bdate("05-24")

print(actors)
# ['Doug Jones', 'Daisy Edgar-Jones', 'Alfred Molina', 'Brianne Howey', 'John C.
# Reilly', ... 'Amelia Cooke', 'Benjamin Sutherland']

Now, let’s call the default method, without passing any arguments:

# not passing in a date will be assumed the passed date is today
actors = ims.scrape_actors_by_bdate()

print(actors)
# ['Jonathan Rhys Meyers', 'Nikolaj Coster-Waldau', 'Alyvia Alyn Lind', 'Maya
# Rudolph', 'Taylor Schilling', ... 'Jeté Laurence', 'Bill Engvall']