Building a Scraper for Fetching Daily Covid-19 Numbers in Under 20 Lines
Steps to build a Covid-19 scraper
This is part of the series “Towards Building Covid 19 Tracker”
We would like to fetch the Covid-19 numbers for India. These numbers are published on the Ministry of Health and Family Welfare website (https://www.mohfw.gov.in/).
As the webpage shows, the numbers are reported under 4 categories:
- Active Cases (53946)
- Cured/Discharged (34108)
- Deaths (2872)
- Migrated (1)
To fetch the data programmatically, we will use Python’s BeautifulSoup library.
On Python 3, we can install it as follows:
pip install BeautifulSoup4
Now that our environment is ready, let’s investigate the page structure. For this we use Chrome’s “Inspect” tool. To open the Chrome developer tools, use the shortcut Ctrl+Shift+I (on Windows).
The relevant DOM tree is as follows:
If you look at the DOM tree, you will notice that the counts for all categories sit under a div with class="site-stats-count". On closer inspection, each count lives in its own li element, and the li elements are distinguished by specific classes (bg-blue, bg-green, bg-red, bg-orange). We will use a CSS selector to reach the element we want. The complete CSS selector for the active count looks like this:
.site-stats-count li.bg-blue strong
With BeautifulSoup, our code looks as follows:
active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
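To try the selector without touching the live site, here is a minimal sketch that runs it against a small HTML fragment. The fragment is a hypothetical stand-in modeled on the structure described above, not a copy of the real page, and SAMPLE_HTML is a name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described in the text;
# the real page's markup may differ in detail.
SAMPLE_HTML = """
<div class="site-stats-count">
  <ul>
    <li class="bg-blue"><strong>53946</strong><span>Active Cases</span></li>
    <li class="bg-green"><strong>34108</strong><span>Cured / Discharged</span></li>
    <li class="bg-red"><strong>2872</strong><span>Deaths</span></li>
    <li class="bg-orange"><strong>1</strong><span>Migrated</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match; [0] takes the first one.
active_count = soup.select(".site-stats-count li.bg-blue strong")[0].text
print(f"Active Cases::{active_count}")
```

Swapping bg-blue for bg-green, bg-red or bg-orange in the selector picks out the other three counts.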
As can be seen, although BeautifulSoup takes care of parsing data out of the page source, it does not download the page itself. We first need to download the page source with a library such as requests (installable with pip install requests). Here is how we will use it:
requests.get('https://www.mohfw.gov.in/').text
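A bare requests.get call waits indefinitely on a stalled connection and happily returns the body of an error page. The sketch below is one way to guard against both, assuming a 10 second timeout is acceptable. The function name fetch_html and its get parameter are hypothetical helpers introduced here, not part of the original code.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10.0):
    """Download a page and return its HTML as text.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://www.mohfw.gov.in/') then gives the same text as the one-liner, but fails loudly when the download goes wrong.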
That’s it. The output of the requests library is then fed to BeautifulSoup, and we have a complete Covid-19 India numbers scraper ready for use.
Here is the complete code:
# initialisation
import requests
from bs4 import BeautifulSoup

india_source = 'https://www.mohfw.gov.in/'

# Scraping
# query the website and store the response in the variable `page`
page = requests.get(india_source)
pgtxt = page.text

# parse the html using BeautifulSoup and store it in the variable `soup`
soup = BeautifulSoup(pgtxt, 'html.parser')

# select the <strong> inside each category's <li> and read its text
active_count = soup.select('.site-stats-count li.bg-blue strong')[0].text
print(f"Active Cases::{active_count}")
cure_count = soup.select('.site-stats-count li.bg-green strong')[0].text
print(f"Cured/Discharged Cases::{cure_count}")
death_count = soup.select('.site-stats-count li.bg-red strong')[0].text
print(f"Deaths::{death_count}")
migrated_count = soup.select('.site-stats-count li.bg-orange strong')[0].text
print(f"Migrated Cases::{migrated_count}")
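The four near-identical lookups in the complete code can also be driven from a small mapping, which makes it easier to cope with a missing element or a count written with thousands separators. This is only a sketch of that idea: parse_counts, to_int and CATEGORY_CLASSES are hypothetical names, and the class names are the ones used in the selectors of the complete code.

```python
from bs4 import BeautifulSoup

# Category label -> CSS class of its <li>, as used in the complete code.
CATEGORY_CLASSES = {
    "Active Cases": "bg-blue",
    "Cured/Discharged": "bg-green",
    "Deaths": "bg-red",
    "Migrated": "bg-orange",
}


def to_int(text):
    """Convert text such as ' 53,946 ' to an int, or None if it is not a number."""
    try:
        return int(text.strip().replace(",", ""))
    except ValueError:
        return None


def parse_counts(html):
    """Return {category label: count}; a count is None when it cannot be read."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {}
    for label, css_class in CATEGORY_CLASSES.items():
        # select_one() returns the first match or None, so a layout change
        # does not raise IndexError the way select(...)[0] would.
        node = soup.select_one(f".site-stats-count li.{css_class} strong")
        counts[label] = to_int(node.get_text()) if node is not None else None
    return counts
```

Passing pgtxt from the complete code to parse_counts yields a dictionary of integers, which is more convenient than four separate strings when the numbers are stored or plotted later.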
Once you have the data, it can be stored in any format you like. I am using CSV, as it is very handy for the later parts.
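The exact CSV layout is not shown here, so the following is just one possible sketch: one row per day, with a header written the first time the file is created. The column names, the append_counts function and the example file name are all assumptions made for illustration.

```python
import csv
from datetime import date
from pathlib import Path

# Assumed column layout: one row per day.
FIELDS = ["date", "active", "cured", "deaths", "migrated"]


def append_counts(csv_path, counts, day=None):
    """Append one row of counts to csv_path, writing the header for a new file."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    row = {"date": (day or date.today()).isoformat(), **counts}
    # newline="" lets the csv module control line endings on every platform.
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

For example, append_counts("covid_india.csv", {"active": 53946, "cured": 34108, "deaths": 2872, "migrated": 1}) adds today's row, and running it once a day builds up the time series needed for plotting.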
Next we will look at different statistical plots and how to draw them.