MP
2 min read · May 17, 2020

Building a Scraper for Fetching Daily Covid-19 Numbers in Under 20 Lines

Steps to build a Covid-19 scraper

This is part of the series “Towards Building Covid 19 Tracker”.

We would like to fetch the Covid-19 numbers for India. These numbers are published at https://www.mohfw.gov.in/.

As the webpage shows, the numbers are reported under four categories:

  • Active Cases (53946)
  • Cured/Discharged (34108)
  • Deaths (2872)
  • Migrated (1)
Screen capture of Covid-19 India numbers

To fetch the data programmatically, we will use the Python library BeautifulSoup.

On Python 3, it can be installed as follows, together with the requests library that we will use later to download the page:

pip install beautifulsoup4 requests

Now that our environment is ready, let’s investigate the page structure. For this we use Chrome’s inspection tool, Chrome DevTools, which can be opened with the shortcut Ctrl+Shift+I on Windows.

The relevant DOM tree is as follows:

DOM tree structure of Covid cases

If you look at the DOM tree above, you will notice that the counts for all categories sit under a div with class “site-stats-count”. On closer inspection, each count lives in its own li element, distinguished by a specific class (bg-blue, bg-green, bg-red, bg-orange). We will use a CSS selector to reach that element. The complete CSS selector for the active-cases count looks like this:

.site-stats-count li.bg-blue strong

With BeautifulSoup, our code looks as follows:

active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
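To try the selector without touching the live site, here is a minimal sketch that runs it against a small hand-written fragment. The HTML below is an assumption that only imitates the structure described above (the real page may differ or change), and the span labels are illustrative. It also shows select_one(), which returns the first match directly, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure described above;
# the real page may differ or change over time.
SAMPLE_HTML = """
<div class="site-stats-count">
  <ul>
    <li class="bg-blue"><strong>53946</strong><span>Active Cases</span></li>
    <li class="bg-green"><strong>34108</strong><span>Cured / Discharged</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match, so [0] picks the first one.
first_match = soup.select(".site-stats-count li.bg-blue strong")[0].text

# select_one() returns the first match directly, or None if nothing matches.
same_match = soup.select_one(".site-stats-count li.bg-blue strong").text
missing = soup.select_one(".site-stats-count li.bg-red strong")

print(first_match)
print(same_match)
print(missing)
```

Because the sample fragment has no bg-red item, the last lookup yields None. Indexing with [0] on an empty select() result would raise an IndexError instead, which is worth keeping in mind if the page layout ever changes.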

Although BeautifulSoup makes it easy to parse data out of a page’s source, it does not download that source itself. We first need to download the page with a library such as requests. Here is how we use it:

requests.get('https://www.mohfw.gov.in/').text
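A slightly more defensive variant of that call could look like the sketch below. The function name fetch_page and the 10-second timeout are assumptions made for illustration; timeout and raise_for_status() are standard parts of the requests API.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its HTML as text.

    fetch_page and the default timeout are illustrative choices,
    not part of the original scraper.
    """
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling fetch_page('https://www.mohfw.gov.in/') inside a try/except for requests.RequestException would let a daily job log a failed download and move on rather than crash.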

That’s it. The output of the requests library is then fed to BeautifulSoup, and we have a complete scraper for India’s Covid-19 numbers ready for use.

Here is the complete code:

import requests
from bs4 import BeautifulSoup

# initialisation
india_source = 'https://www.mohfw.gov.in/'

# Scraping
# query the website and return the html to the variable 'page'
page = requests.get(india_source)
pgtxt = page.text

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(pgtxt, 'html.parser')

# Select the <strong> element inside each category's <li> and read its text
active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
print(f"Active Cases::{active_count}")
cure_count = soup.select('.site-stats-count li.bg-green strong')[0].text
print(f"Cured/Discharged Cases::{cure_count}")
death_count = soup.select('.site-stats-count li.bg-red strong')[0].text
print(f"Deaths Cases::{death_count}")
migrated_count = soup.select('.site-stats-count li.bg-orange strong')[0].text
print(f"Migrated Cases::{migrated_count}")
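The four lookups in the complete code differ only in the CSS class, so they can be folded into a loop. The sketch below is one possible refactoring, not the original code: parse_counts, CATEGORY_CLASSES, and the sample fragment are hypothetical names and data introduced here. It also converts the text to integers and returns None when an element is missing or is not a plain number.

```python
from bs4 import BeautifulSoup

# Category name -> CSS class, taken from the selectors used above.
CATEGORY_CLASSES = {
    "Active": "bg-blue",
    "Cured/Discharged": "bg-green",
    "Deaths": "bg-red",
    "Migrated": "bg-orange",
}


def parse_counts(html):
    """Return {category: count}; count is None if it cannot be read."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {}
    for category, css_class in CATEGORY_CLASSES.items():
        element = soup.select_one(f".site-stats-count li.{css_class} strong")
        if element is None:
            counts[category] = None
            continue
        # Drop surrounding whitespace and thousands separators, if any.
        digits = element.get_text(strip=True).replace(",", "")
        counts[category] = int(digits) if digits.isdigit() else None
    return counts


# Hypothetical fragment imitating the page structure, for offline testing.
SAMPLE_HTML = """
<div class="site-stats-count"><ul>
  <li class="bg-blue"><strong>53946</strong></li>
  <li class="bg-green"><strong>34108</strong></li>
  <li class="bg-red"><strong>2872</strong></li>
  <li class="bg-orange"><strong>1</strong></li>
</ul></div>
"""

counts = parse_counts(SAMPLE_HTML)
print(counts)
```

Keeping the parsing in a function that takes HTML text makes it possible to test against a saved copy of the page, independent of the download step.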

Once you have the data, it can be stored in any format you like. I am using CSV, as it is very handy for the later parts of this series.

Next, we will look at different statistical plots and how to draw them.
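As a rough illustration of the CSV step, the sketch below appends one row per run using only the standard library. The column names, the append_counts helper, and the file layout are assumptions made here; the article does not specify how its CSV is organised.

```python
import csv
from pathlib import Path

# Hypothetical column layout; adjust to whatever the later steps expect.
FIELDS = ["date", "active", "cured", "deaths", "migrated"]


def append_counts(csv_path, row):
    """Append one row to csv_path, writing the header if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings itself.
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A daily run could then call append_counts with a dictionary holding the four scraped values and a date such as datetime.date.today().isoformat(), so the file grows by one line per day.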

Written by MP

Startup guy. Loves Programming
