Building a Scraper for fetching daily Covid-19 numbers in under 20 lines

2 min read · May 17, 2020

Steps to build a Covid-19 scraper

This is part of the series “Towards Building Covid 19 Tracker”.

We would like to fetch the Covid-19 numbers for India. These numbers are published at https://www.mohfw.gov.in/.

As the webpage shows, the numbers are reported under 4 categories (counts at the time of writing):

Active Cases (53946)
Cured/Discharged (34108)
Deaths (2872)
Migrated (1)

Screen capture of Covid-19 India numbers

To fetch the data programmatically, we will use Python's BeautifulSoup library.

With Python 3, we can install it, together with the requests library used later in this post, as follows:

pip install beautifulsoup4 requests

Now that our environment is ready, let's investigate the page structure. For this we use Chrome's “Inspect” tool. To open the Chrome developer tools, use the shortcut Ctrl+Shift+I on Windows.

The relevant DOM tree is as follows:

If you look at the DOM tree, you will notice that the counts for all categories sit under a div with class="site-stats-count". On closer inspection, each count lives in its own li element, and the li elements are distinguished by their classes (bg-blue, bg-green, bg-red, bg-orange). Inside each li, the number itself is wrapped in a strong element. We will use a CSS selector to reach that element. For the active cases, the complete CSS selector looks like this:

.site-stats-count li.bg-blue strong

With BeautifulSoup, our code looks as follows:

active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
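To try the selector without touching the live site, here is a minimal sketch that runs it against a small hand-written HTML snippet. The snippet is an assumption: it only imitates the structure described above, and the real page's markup may well contain more elements and attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# it is not copied from the real page.
SAMPLE_HTML = """
<div class="site-stats-count">
  <ul>
    <li class="bg-blue"><strong>53946</strong><span>Active Cases</span></li>
    <li class="bg-green"><strong>34108</strong><span>Cured/Discharged</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every matching element; [0] takes the first one.
matches = soup.select(".site-stats-count li.bg-blue strong")
active_count = matches[0].text
print(active_count)
```

Note that .text gives back a string, so the value needs an int() conversion before doing any arithmetic on it.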

Although BeautifulSoup takes care of parsing the page source, it does not download the page itself. We first need to download the source with an HTTP library such as requests. Here is how we use it:

requests.get('https://www.mohfw.gov.in/').text
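The one-liner above works, but it waits indefinitely if the server hangs and happily returns the body of an error page. A slightly more defensive sketch is shown here; the function name fetch_html, the optional session parameter, and the 10-second timeout are all assumptions made for illustration, not part of the original scraper.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML as text.

    `session` is optional (hypothetical parameter): any object with a
    requests-style get() method works, e.g. a requests.Session.
    """
    client = session or requests
    response = client.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the download step becomes pgtxt = fetch_html('https://www.mohfw.gov.in/').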

That's it. The output of the requests library is then fed to BeautifulSoup, and we have a complete scraper for the Covid-19 India numbers, ready for use.

Here is the complete code:

import requests
from bs4 import BeautifulSoup

# initialisation
india_source = 'https://www.mohfw.gov.in/'

# Scraping
# query the website and return the html to the variable 'page'
page = requests.get(india_source)
pgtxt = page.text

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(pgtxt, 'html.parser')

# Select the <strong> inside each category's <li> and read its text
active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
print(f"Active Cases::{active_count}")
cure_count = soup.select('.site-stats-count li.bg-green strong')[0].text
print(f"Cured/Discharged Cases::{cure_count}")
death_count = soup.select('.site-stats-count li.bg-red strong')[0].text
print(f"Deaths Cases::{death_count}")
migrated_count = soup.select('.site-stats-count li.bg-orange strong')[0].text
print(f"Migrated Cases::{migrated_count}")
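The four lookups differ only in the li class, and indexing with [0] raises an IndexError if the site ever changes its markup. One possible variation, sketched here under the assumption that the counts are plain digit strings, loops over a mapping of categories and returns None for anything it cannot find. The names CATEGORY_CLASSES and extract_counts are hypothetical.

```python
from bs4 import BeautifulSoup

# Category label -> CSS class of its <li>, as used in the selectors above.
CATEGORY_CLASSES = {
    "Active": "bg-blue",
    "Cured/Discharged": "bg-green",
    "Deaths": "bg-red",
    "Migrated": "bg-orange",
}


def extract_counts(html):
    """Return {category: int or None} for the four categories."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {}
    for label, css_class in CATEGORY_CLASSES.items():
        # select_one() returns None instead of raising when nothing matches.
        node = soup.select_one(f".site-stats-count li.{css_class} strong")
        text = node.get_text(strip=True) if node else ""
        # Assumption: counts are plain digits; anything else becomes None.
        counts[label] = int(text) if text.isdigit() else None
    return counts
```

Calling extract_counts(pgtxt) on the downloaded page source would then give a dictionary with integer counts, which is easier to store or plot than four separate strings.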

Once you have the data, you can store it in whatever format you like. I am using CSV, as it is very handy for the later parts of this series.

Next, we will look at different statistical plots and how to draw them.
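As a rough sketch of the CSV step, the function below appends one row per run and writes the header only when the file is new. The column names and the function name append_counts are assumptions chosen for illustration; the original post does not show its storage code.

```python
import csv
import os

# Hypothetical column layout for the daily numbers.
FIELDS = ["date", "active", "cured", "deaths", "migrated"]


def append_counts(path, row):
    """Append one row to a CSV file, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

For the numbers shown at the top of this post, a call could look like append_counts("covid_india.csv", {"date": "2020-05-17", "active": 53946, "cured": 34108, "deaths": 2872, "migrated": 1}), where the file name is again only an example. Running the scraper once a day then builds up a time series in a single file.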

Written by MP
