IMDB Movie Scraper

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites.

USE CASES
  • Find how many films have a specified rating
  • Check if a particular actor was included in the cast of a movie
  • Get list of movies shot at a particular place
  • Count movies released in a particular year

The Components of a Web Page

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. In this project we perform the GET request at http://www.imdb.com/chart/top. The files fall into a few main types:

  • HTML -- contains the main content of the page.
  • CSS -- adds styling to make the page look nicer.
  • JS -- JavaScript files add interactivity to web pages.
  • Images -- image formats such as JPG and PNG allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There's a lot that happens behind the scenes to render a page nicely, but we don't need to worry about most of it when we're web scraping. When we perform web scraping, we're interested in the main content of the web page, so we look at the HTML.
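
As a minimal sketch of the GET request described above, the snippet below builds the request with the Requests library and, separately, sends it. The function names build_get_request and fetch_html, the User-Agent value, and the 10-second timeout are assumptions made for illustration; they are not taken from the project's code.

```python
import requests

CHART_URL = "http://www.imdb.com/chart/top"  # URL given in the text above


def build_get_request(url: str) -> requests.PreparedRequest:
    """Build (but do not send) a GET request for `url`."""
    # Assumption: a browser-like User-Agent is sent, because some sites
    # reject requests that carry the default library identifier.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    return requests.Request("GET", url, headers=headers).prepare()


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the HTML body as text."""
    with requests.Session() as session:
        response = session.send(build_get_request(url), timeout=timeout)
    # Raise on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Inspect the request without touching the network.
request = build_get_request(CHART_URL)
print(request.method, request.url)
```

Calling fetch_html(CHART_URL) would perform the actual download. Only the HTML comes back; the CSS, JS, and image files are separate requests that a scraper usually skips.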

Inspect option

Most of the time you will find yourself inspecting the HTML of the website. We can easily do this with the "Inspect" option of our browser.

BeautifulSoup with Requests

BeautifulSoup is a library that lets us parse HTML source code and navigate it conveniently. Along with it we need the Requests library, which fetches the content of the URL.
It's very straightforward to start scraping a website. Most of the time we will find ourselves inspecting the HTML of the website to find the classes and IDs we need. The basic code then imports the libraries, makes the request, parses the HTML, and finds the required class.
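
The four steps can be sketched roughly as follows. The class name titleColumn is an assumption about the page's markup (it may not match what the live site serves today, so confirm it with "Inspect"), and the function names and sample HTML are hypothetical stand-ins rather than the repository's code.

```python
import requests
from bs4 import BeautifulSoup


def find_titles(html: str, css_class: str = "titleColumn") -> list[str]:
    """Parse `html` and return the link text inside every <td> with `css_class`."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.find_all("td", class_=css_class):
        link = cell.find("a")
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


def scrape_titles(url: str) -> list[str]:
    """Make the request, then hand the HTML to the parser."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return find_titles(response.text)


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td class="titleColumn"><a href="/title/x1/">Example Movie A</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/x2/">Example Movie B</a></td></tr>
  <tr><td class="other"><a href="/title/x3/">Not a title cell</a></td></tr>
</table>
"""

print(find_titles(SAMPLE_HTML))  # ['Example Movie A', 'Example Movie B']
```

Keeping the parsing in its own function means it can be tried on a saved or hand-written HTML snippet before any request is sent to the real site.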

Parsing Movies released in and after 2019
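
One way this filter might look, as a sketch only: it assumes each scraped movie has already been turned into a dictionary whose year is still the raw page text, such as "(2019)". The records, field names, and function names below are hypothetical placeholders, not data from IMDb or code from the repository.

```python
import re

# Hypothetical records; a real run would build these from the scraped page.
movies = [
    {"title": "Example Movie A", "year": "(2018)"},
    {"title": "Example Movie B", "year": "(2019)"},
    {"title": "Example Movie C", "year": "(2021)"},
]


def parse_year(text: str) -> int:
    """Pull the first four-digit number out of text such as "(2019)"."""
    match = re.search(r"\d{4}", text)
    if match is None:
        raise ValueError(f"no year found in {text!r}")
    return int(match.group())


def released_in_or_after(records: list[dict], year: int) -> list[dict]:
    """Keep the records whose release year is `year` or later."""
    return [m for m in records if parse_year(m["year"]) >= year]


recent = released_in_or_after(movies, 2019)
print([m["title"] for m in recent])  # ['Example Movie B', 'Example Movie C']
```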


Parsing Movies with rating more than 8.8
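
A comparable sketch for the rating filter, again under assumptions: ratings are taken to arrive as strings scraped from the page, and the records, field names, and function name are hypothetical. The result is serialized with the standard json module, since the project stores its parsed output as JSON.

```python
import json

# Hypothetical records; a real run would build these from the scraped page.
movies = [
    {"title": "Example Movie A", "rating": "9.2"},
    {"title": "Example Movie B", "rating": "8.8"},
    {"title": "Example Movie C", "rating": "8.9"},
]


def rated_above(records: list[dict], threshold: float) -> list[dict]:
    """Keep the records whose rating is strictly greater than `threshold`."""
    return [m for m in records if float(m["rating"]) > threshold]


top_rated = rated_above(movies, 8.8)

# Serialize the filtered records; writing this string to a file is left out here.
print(json.dumps(top_rated, indent=2))
```

The comparison is strict, so a movie rated exactly 8.8 is excluded, matching "more than 8.8".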

The code and the parsed JSON file are provided in my GitHub repository.

GitHub Repo Link: https://github.com/SkyrimCode/IMDB-Movie-Scraper


