Web Scraping: Hands-on Tutorial

Extracting data from epidemic-stats

Manthan Shettigar
Jul 31, 2021

Introduction:

Scraping is the process of extracting information from web pages. Web scraping lets you collect large amounts of data from different sites automatically.

Consider the following scenario: you're comparing electronic devices on different sites and need the prices, brand names, and customer reviews to figure out which one is the best. Gathering these details by visiting each site by hand would take most of the day. Web scraping is useful in this situation because it lets you get the results you want with only a few lines of code.

Project Outline

  • We're going to scrape the COVID-19 statistics from https://epidemic-stats.com/
  • We will get lists of the infected, deaths, and recovered counts for each country.
  • We will create a CSV file in the following format:
country,infected,deaths,recovered,death_percent,recovered_percent
USA,35688506,629064,29652038,1.8,83.1
India,31613993,423842,30781263,1.3,97.4

Importing the requests library

Let's fetch the response from the website.

Let's check the status code:

A status code of 200 means the request was successful.

For more information: Status-codes

response.text returns the content of the response as a Unicode string. Let's store it in page_contents.

Let's display the first 500 characters of the HTML document. We won't print the whole content because it is too large to display.
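The steps above can be sketched roughly as follows. This is a minimal sketch, assuming the requests package is installed. The helper names fetch_page and preview and the 10-second timeout are illustrative choices, not part of the original walkthrough.

```python
import requests

URL = "https://epidemic-stats.com/"


def fetch_page(url, timeout=10):
    """Download a page and return its HTML as a Unicode string."""
    response = requests.get(url, timeout=timeout)
    # 200 (requests.codes.ok) means the request succeeded
    if response.status_code != requests.codes.ok:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    # response.text is the response body decoded to a str
    return response.text


def preview(text, length=500):
    """Return only the first `length` characters of a long document."""
    return text[:length]


# Example usage (needs network access):
# page_contents = fetch_page(URL)
# print(preview(page_contents))
```

Wrapping the request in a function keeps the status check next to the download, so a failed request is caught before any parsing starts.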

Importing the Beautiful Soup library

Pass the page contents to the BeautifulSoup constructor to parse them.
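A minimal sketch of this step, assuming the beautifulsoup4 package is installed. The variable name doc and the tiny stand-in HTML are assumptions for illustration; in the real project page_contents holds the downloaded page.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded HTML (page_contents in the text above)
page_contents = """
<html><head><title>Sample page</title></head>
<body><a class="text-primary" href="/coronavirus/usa">USA</a></body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed
doc = BeautifulSoup(page_contents, "html.parser")

print(type(doc).__name__)  # BeautifulSoup
print(doc.title.text)      # Sample page
```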

Extracting Country names

We want to scrape the Country column from the site, so let's inspect it. Right-click on the element and select Inspect.

  • We find that the USA element is an anchor tag (<a>) with the class “text-primary”.
  • So we need to find all the anchor tags in the HTML that have the same class ('text-primary'); each one holds a country name.
  • We do this with the find_all method, passing in the tag name "a" and the class attribute, which returns every matching tag in the HTML document.
  • There are 210 matches in the document, i.e. there are 210 countries.

Let's display the first 5 matching tags:
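Here is a small sketch of the find_all call on stand-in HTML, so it runs without network access. The USA tag is the one shown on the site; the other rows are invented for illustration, and the variable name doc is an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in for the real page: only the USA tag mirrors the site's markup
sample_html = """
<table>
  <tr><td><a class="text-primary" href="/coronavirus/usa"> <img src="https://www.countryflags.io/US/flat/16.png"/>USA</a></td></tr>
  <tr><td><a class="text-primary" href="/coronavirus/india"> India</a></td></tr>
  <tr><td><a class="other-link" href="/about">About</a></td></tr>
</table>
"""
doc = BeautifulSoup(sample_html, "html.parser")

# class_ (with an underscore) is used because "class" is a reserved word in Python
country_a_tag = doc.find_all("a", class_="text-primary")

print(len(country_a_tag))  # 2 in this sample; the real page gave 210
print(country_a_tag[:5])   # slicing is safe even when fewer than 5 tags exist
```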

Extracting Infected values

Let's do the same for the infected values: fetch all the span tags that have the class 'infected-badges'.

Unsurprisingly, there are 210 infected values, since there are 210 countries.

Let's display the first five infected tags:
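The same pattern, sketched for the infected column. Only the tag name (span) and the class (infected-badges) come from the text; the surrounding HTML and the variable names are assumptions made so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Invented rows that reuse the tag and class described above
sample_html = """
<div><span class="infected-badges"> 35688506 </span></div>
<div><span class="infected-badges"> 31613993 </span></div>
"""
doc = BeautifulSoup(sample_html, "html.parser")

infected_span_tag = doc.find_all("span", class_="infected-badges")

print(len(infected_span_tag))  # 2 in this sample; the real page gave 210
print(infected_span_tag[:5])   # first five tags (only two exist here)
```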

Now that we have fetched the country names and infected values, let's explore:

  • We don't want the whole tag, just the text inside it. For example, from <a class="text-primary" href="/coronavirus/usa"> <img src="https://www.countryflags.io/US/flat/16.png"/>USA</a> we only want the text 'USA' inside the anchor tag; the rest is unnecessary.
  • This can be done with the ‘text’ attribute, which returns the text inside the tag.
  • Let's apply it to the first tag in the country_a_tag list.

To remove the surrounding whitespace we can use the strip method.
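A short sketch of both steps on the USA tag quoted above, assuming beautifulsoup4 is installed. The names raw_text and country_name are illustrative.

```python
from bs4 import BeautifulSoup

tag_html = (
    '<a class="text-primary" href="/coronavirus/usa"> '
    '<img src="https://www.countryflags.io/US/flat/16.png"/>USA</a>'
)
country_a_tag = BeautifulSoup(tag_html, "html.parser").find_all("a", class_="text-primary")

# .text joins the text nodes inside the tag; the <img> contributes nothing
raw_text = country_a_tag[0].text

# strip() drops any leading and trailing whitespace left over from the markup
country_name = raw_text.strip()

print(repr(raw_text))
print(country_name)  # USA
```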

Extracting no. of deaths data:

Extracting no. of recovered data:
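The deaths and recovered columns follow the same find_all pattern. The text does not give their class names, so the ones below are placeholders (assumptions) that need to be replaced with whatever Inspect shows on the real page. The sketch also assumes the numbers appear as plain digits.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: placeholder class names, not taken from the site.
# Inspect the deaths and recovered cells to find the real tag and class.
DEATHS_CLASS = "deaths-placeholder"
RECOVERED_CLASS = "recovered-placeholder"

sample_html = """
<div>
  <span class="deaths-placeholder"> 629064 </span>
  <span class="recovered-placeholder"> 29652038 </span>
</div>
"""
doc = BeautifulSoup(sample_html, "html.parser")

deaths_tags = doc.find_all("span", class_=DEATHS_CLASS)
recovered_tags = doc.find_all("span", class_=RECOVERED_CLASS)

# Take the text of each tag, strip whitespace, and convert to int
deaths = [int(tag.text.strip()) for tag in deaths_tags]
recovered = [int(tag.text.strip()) for tag in recovered_tags]

print(deaths, recovered)  # [629064] [29652038]
```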

Extracting Death percent data:

  • Death percent = (Deaths/Infected) * 100
  • We don't need the ‘%’ character, so let's strip it.
  • Also convert the string to a float.
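The strip-and-convert step can be sketched like this, assuming the scraped text looks like "1.8%". The function name parse_percent is illustrative.

```python
def parse_percent(text):
    """Turn a scraped string such as ' 1.8% ' into the float 1.8."""
    # First drop surrounding whitespace, then the trailing '%' character
    cleaned = text.strip().strip("%")
    return float(cleaned)


print(parse_percent("1.8%"))     # 1.8
print(parse_percent(" 83.1% "))  # 83.1
```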

Extracting Recovered percent data:

  • Recovered percent = (Recovered/Infected) * 100
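The tutorial reads the percentages from the page, but the formula can be checked against the sample rows in the project outline. A minimal sketch; rounding to one decimal place is an assumption that matches the sample CSV, and percent_of is an illustrative name.

```python
def percent_of(part, total):
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(part / total * 100, 1)


# USA row from the project outline
infected, deaths, recovered = 35688506, 629064, 29652038

print(percent_of(deaths, infected))     # 1.8
print(percent_of(recovered, infected))  # 83.1
```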

Finally, we have extracted all the values from the site.

  • Now, let's create an empty list.
  • Iterate through each list of tags.
  • Extract the text from each tag and, if needed, strip it and convert it to an integer.
  • Append the value to the empty list, which will later be used to create the data frame.

Putting it all together:
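One possible way to combine the steps is sketched below. Only the "text-primary" and "infected-badges" selectors come from the text; every "-placeholder" class, and all the function names, are assumptions to be replaced after inspecting the page. The sketch also assumes the counts are plain digits without thousands separators.

```python
from bs4 import BeautifulSoup


def to_percent(text):
    """Convert a string such as '1.8%' to the float 1.8."""
    return float(text.strip("%"))


# column name -> (tag, class, converter)
# ASSUMPTION: every "-placeholder" class is invented for this sketch.
COLUMNS = {
    "country": ("a", "text-primary", str),
    "infected": ("span", "infected-badges", int),
    "deaths": ("span", "deaths-placeholder", int),
    "recovered": ("span", "recovered-placeholder", int),
    "death_percent": ("span", "death-percent-placeholder", to_percent),
    "recovered_percent": ("span", "recovered-percent-placeholder", to_percent),
}


def extract_column(tags, convert):
    values = []                       # start with an empty list
    for tag in tags:                  # iterate through the list of tags
        text = tag.text.strip()       # extract the text and strip whitespace
        values.append(convert(text))  # convert if needed, then append
    return values


def scrape_stats(html):
    """Return a dict mapping each column name to its list of values."""
    doc = BeautifulSoup(html, "html.parser")
    stats = {}
    for name, (tag, css_class, convert) in COLUMNS.items():
        stats[name] = extract_column(doc.find_all(tag, class_=css_class), convert)
    return stats


# One invented row, using the USA values from the project outline
sample_html = """
<table><tr>
  <td><a class="text-primary" href="/coronavirus/usa"> USA</a></td>
  <td><span class="infected-badges">35688506</span></td>
  <td><span class="deaths-placeholder">629064</span></td>
  <td><span class="recovered-placeholder">29652038</span></td>
  <td><span class="death-percent-placeholder">1.8%</span></td>
  <td><span class="recovered-percent-placeholder">83.1%</span></td>
</tr></table>
"""
stats = scrape_stats(sample_html)
print(stats)
```

Keeping the selectors in one dictionary means a change in the site's markup only needs an update in a single place.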

Creating a Data-frame:

  • Import the pandas library.
  • Use a dictionary of lists to create the data frame.
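A minimal sketch of this step, assuming pandas is installed. The two rows are the sample values from the project outline, and the names stats and covid_df are illustrative.

```python
import pandas as pd

# Lists as they would look after scraping (sample rows from the outline)
stats = {
    "country": ["USA", "India"],
    "infected": [35688506, 31613993],
    "deaths": [629064, 423842],
    "recovered": [29652038, 30781263],
    "death_percent": [1.8, 1.3],
    "recovered_percent": [83.1, 97.4],
}

# Each dictionary key becomes a column; each list supplies that column's rows
covid_df = pd.DataFrame(stats)

print(covid_df.head())
```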

Final Step: Generating a CSV from the data frame
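Writing the file can be sketched as follows, assuming pandas is installed. The file name covid_stats.csv is an assumption, and the two rows are the sample values from the project outline.

```python
import pandas as pd

covid_df = pd.DataFrame({
    "country": ["USA", "India"],
    "infected": [35688506, 31613993],
    "deaths": [629064, 423842],
    "recovered": [29652038, 30781263],
    "death_percent": [1.8, 1.3],
    "recovered_percent": [83.1, 97.4],
})

# index=False keeps pandas' row numbers out of the file
covid_df.to_csv("covid_stats.csv", index=False)

# Read the file back to confirm what was written
with open("covid_stats.csv") as f:
    csv_text = f.read()
print(csv_text)
```

With index=False the header row contains only the six column names, matching the format shown in the project outline.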

I hope this mini scraping project helped you understand the basics of web scraping with Python. You can find me on LinkedIn.
