What I Learnt From Scraping

Introduction

Web scraping is a method of extracting data from websites that do not expose an API. It is often used to collect publicly displayed data, such as weather readings, for analysis. While it is usually harmless, some websites take measures to prevent their data from being scraped, or make it harder to scrape by the usual means.

In a project I recently worked on, I needed to collect data to perform analysis on. However, none of the websites I needed data from provided an API, so I had to resort to scraping their content. Along the way, I learnt a lot about web scraping and automation!

What I Scraped

The data I had to scrape came from JCG Open, a popular tournament website for the game Shadowverse. Specifically, I had to scrape tournament data in order to produce reports and collect sample decks to perform clustering on. However, since the website relied heavily on JavaScript and used infinite scrolling for the tournament participants data, conventional web scraping methods were of no use. As a workaround, I had to learn tools other than the conventional Beautiful Soup or Cheerio. I ended up relying on Selenium, which was not straightforward for me at first, but changed my life in the end.
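To illustrate why a static parser comes up empty on this kind of page, here is a minimal sketch using only Python's standard library. The markup in STATIC_HTML is hypothetical (it is not JCG Open's actual HTML), and ListItemCounter and count_list_items are names made up for this example. The point is that the raw HTML a plain HTTP request returns may contain only an empty container, with the rows added later by JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a plain HTTP fetch might return for a page
# whose participant list is filled in later by JavaScript.
STATIC_HTML = """
<html><body>
  <ul id="participants"></ul>
  <script src="/load-participants.js"></script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts the <li> elements present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1


def count_list_items(html):
    """Return how many <li> tags a static parse of `html` can see."""
    parser = ListItemCounter()
    parser.feed(html)
    return parser.items
```

Calling count_list_items(STATIC_HTML) finds no list items, because the script that would add them never runs in a static parser. A browser-driven tool executes that script, which is why it sees the data.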

The Method

For starters, since Selenium drives a real browser (which can also run headless), it can deal with JavaScript-driven features like infinite scrolling. It is essentially as if you were browsing the website yourself. This enabled me to tackle the data that was loaded with each infinite scroll, by making the scraper scroll until no more data would load. Only after everything had finished loading would the scraper start scraping the data and storing it in a structured way for future use as samples.
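The scroll-until-nothing-loads idea can be sketched roughly as follows. This is not the exact code from the project, just a minimal illustration under a few assumptions: the page grows in height when new rows load, a fixed pause is enough for each batch to arrive, and the function names, the row_selector parameter and the use of Chrome are placeholders chosen for this example.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that caused new content to load.
    `max_rounds` is a safety cap so the loop cannot run forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
        loaded += 1
    return loaded


def scrape_rows(url, row_selector):
    """Open `url`, load everything by scrolling, then collect row texts.

    `row_selector` is a CSS selector for one row; it is an assumption here
    and has to be found by inspecting the real page.
    """
    # Imported lazily so scroll_until_stable can be tried without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run the browser without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        scroll_until_stable(driver)
        # Scrape only after scrolling has finished loading all rows.
        rows = driver.find_elements(By.CSS_SELECTOR, row_selector)
        return [row.text for row in rows]
    finally:
        driver.quit()
```

Comparing page heights is only one way to detect the end of the list. Counting the loaded rows, or waiting for a "loading" indicator to disappear, may be more reliable on pages where the height does not change.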