Web Scraping a Javascript Heavy Website: Keeping Things Simple

Last update on May 19, 2014.

This is a guest post from blogger and scraping expert Hartley Brody — enjoy!

One of the most common difficulties with web scraping is pulling information from sites that do a lot of rendering on the client side. When faced with scraping a site like this, many programmers reach for very heavy-handed solutions like headless browsers or frameworks like Selenium. Fortunately, there's usually a much simpler way to get the information you need.

But before we dive into that, let's first take a step back and talk about how browsers work so we know where we're headed. When you navigate to a site that does a lot of rendering in the browser -- like Twitter or Forecast.io -- what really happens?

First, your browser makes a single request for an HTML document. That document contains enough information to bootstrap the loading of the rest of the page. It loads some basic markup, potentially some inline CSS and Javascript, and probably a few <script> and <link> elements that point to other resources that the browser must then download in order to finish rendering the page.
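To make that bootstrapping step concrete, here is a minimal sketch that lists the external resources a document points to. It uses only Python's standard library; the sample markup and the `list_resources` helper are made up for illustration and are not taken from any real site.

```python
from html.parser import HTMLParser


class ResourceLister(HTMLParser):
    """Collect the URLs of external scripts and linked resources."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inline <script> blocks have no src attribute, so they are skipped.
        if tag == "script" and attrs.get("src"):
            self.resources.append(("script", attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.append(("link", attrs["href"]))


def list_resources(html):
    """Return (tag, url) pairs for every external <script> and <link>."""
    parser = ResourceLister()
    parser.feed(html)
    parser.close()
    return parser.resources


# Hypothetical bootstrap document, standing in for a real page.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
<script>var inline = true;</script>
</head><body><div id="app"></div></body></html>"""

print(list_resources(SAMPLE_HTML))
```

Every URL in that list is one more download the browser has to make before the page is finished.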

Before the days of heavy Javascript usage, the original HTML document contained all the content on the page. Any external calls to load CSS or Javascript were merely to enhance the presentation or behavior of the page, not change the actual content.

But on sites that rely on the client to do most of the page rendering, the original HTML document is essentially a blank slate, waiting to be filled in asynchronously. In the words of Jeremy Edberg -- first paid employee at Reddit and currently a Reliability Architect at Netflix -- when the page first loads, you often "get a rectangle with a lot of divs, and API calls are made to fill out all the divs."

To see exactly what this "rectangle with a lot of divs" looks like, try navigating to sites like Twitter or Forecast.io with Javascript turned off in your browser. This will prevent any client-side rendering from happening and allow you to see what the original page looks like before content is added asynchronously.
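You can run roughly the same experiment from a script, since a plain HTTP client never executes Javascript. The sketch below is one way to measure how much readable text the raw document carries. The `visible_text` helper and the skeleton markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-readable text found in an HTML string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical skeleton page: a title, an empty container, and a script.
SKELETON = """<html><head><title>Forecast</title>
<style>body { margin: 0 }</style></head>
<body><div id="app"></div>
<script>loadForecast();</script></body></html>"""

print(repr(visible_text(SKELETON)))
```

To try it on a real page, fetch the HTML with `urllib.request.urlopen(url).read()` and pass the decoded string to `visible_text`. If very little text comes back, the content is most likely being filled in on the client.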

Once you've seen the content that comes with the original HTML document, you'll start to realize how much of the content is actually being pulled in asynchronously. But rather than wait for the page to load... and then for some Javascript to load... and then for some data to come back from the asynchronous Javascript requests, why not just skip to the final step?

If you examine the network traffic in your browser as the page is loading, you should be able to see what endpoints the page is hitting to load the data. Flip over to the XHR filter inside the "Network" tab in the Chrome web inspector. These are essentially undocumented API endpoints that the web page is using to pull data. You can use them too!
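If clicking through requests by hand gets tedious, Chrome can also export the Network tab as a HAR file, which is plain JSON. Here is a rough sketch that picks out the requests whose responses were JSON. It assumes the standard HAR layout (`log.entries`, each with a `request` and a `response`), and the sample data is invented.

```python
import json


def json_endpoints(har):
    """Return unique (method, url) pairs whose response MIME type mentions JSON."""
    found = []
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "")
        if "json" in mime.lower():
            pair = (entry["request"]["method"], entry["request"]["url"])
            if pair not in found:
                found.append(pair)
    return found


# Tiny made-up HAR export: one HTML page load and one JSON call.
SAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"request": {"method": "GET", "url": "https://example.com/"},
   "response": {"content": {"mimeType": "text/html"}}},
  {"request": {"method": "GET", "url": "https://example.com/api/items?page=1"},
   "response": {"content": {"mimeType": "application/json; charset=utf-8"}}}
]}}
""")

print(json_endpoints(SAMPLE_HAR))
```

With a real export you would load the file using `json.load` and pass the result to `json_endpoints`. Filtering on MIME type is only a heuristic, so endpoints that return HTML fragments or mislabeled responses will be missed.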

The endpoints are probably returning JSON-encoded information so that the client-side rendering code can parse it and add it to the DOM. This means it's usually straightforward to call those endpoints directly from your application and parse the response. Now you have the data you need without having to execute Javascript or wait for the page to render or any of that nonsense. Just go right to the source of the data!
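Calling one of these endpoints from your own code can be as small as the sketch below, which sticks to Python's standard library. The helper names are made up, and the `X-Requested-With` header is an assumption: some sites check for it on their XHR endpoints, many don't.

```python
import json
import urllib.request


def decode_json(body, charset="utf-8"):
    """Decode raw response bytes and parse them as JSON."""
    return json.loads(body.decode(charset))


def fetch_json(url, timeout=10):
    """GET a URL and return the parsed JSON body."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Assumption: mimic the header many XHR libraries send.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_json(response.read(), charset)


# Offline demonstration of the parsing step with made-up bytes.
print(decode_json(b'{"items": [{"id": 1}]}'))
```

Pass `fetch_json` whichever URL you found in the Network tab. Endpoints that require a login will usually also need the session cookies your browser sent along with the request.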

Let's take a look at how we might do this on Twitter's homepage. When a logged-in user navigates to twitter.com, Tweets are added to a user's timeline with calls to this endpoint. Pull that up in your browser and you'll see a JSON object that contains a big blob of HTML that's injected into the page. Make a call to this endpoint and then parse your info from the response, rather than waiting for the entire page to load.
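When the JSON wraps a blob of HTML like this, parsing becomes a two-step job: pull the HTML string out of the JSON, then extract the bits you want from the markup. A minimal sketch follows. The `items_html` key and the `tweet-text` class are placeholder assumptions for this example, so check the real response for the names actually in use.

```python
import json
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._depth = 0  # nesting depth inside the current match
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self._depth:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def texts_from_payload(payload, html_key="items_html", css_class="tweet-text"):
    """Extract the text of matching elements from the HTML inside a JSON payload."""
    parser = ClassTextExtractor(css_class)
    parser.feed(payload[html_key])
    parser.close()
    return parser.results


# Made-up payload shaped like "a JSON object with a big blob of HTML".
SAMPLE = json.loads(r"""
{"items_html": "<li><p class=\"tweet-text\">Hello <a href=\"#\">world</a></p></li><li><p class=\"tweet-text\">Second one</p></li>"}
""")

print(texts_from_payload(SAMPLE))
```

For anything beyond simple markup, a dedicated HTML parsing library will be more forgiving than this hand-rolled extractor.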

It's a similar situation when we look at Forecast.io. The HTML document that's returned from the server provides the skeleton for the page, but all of the forecast information is loaded asynchronously. If you pull up your web inspector, refresh the page and then look for the XHR requests in the "Network" tab, you'll see a call to this endpoint that pulls in all the forecast data for your location.
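Since endpoints like this are undocumented, the first job is usually figuring out where the data you want lives inside the response. One way to do that is to print every key path in the JSON, as in the sketch below. The sample structure and its field names are invented for illustration and are not the site's actual response format.

```python
def key_paths(value, prefix=""):
    """Yield (path, type name) for every leaf in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from key_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        if value:
            # Only the first element is inspected; items usually share a shape.
            yield from key_paths(value[0], f"{prefix}[]")
        else:
            yield prefix, "empty list"
    else:
        yield prefix, type(value).__name__


# Hypothetical forecast-style response.
SAMPLE = {
    "currently": {"temperature": 71.3, "summary": "Clear"},
    "hourly": {"data": [{"time": 1377700000, "temperature": 70.1}]},
}

for path, kind in key_paths(SAMPLE):
    print(path, kind)
```

Once you know the path to the values you care about, you can index straight into the parsed response and ignore the rest.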

Now you don't need to load the entire page and wait for the DOM to be ready in order to scrape the information you're looking for. You can go directly to the source to make your application much faster and save yourself a bunch of hassle.

Wanna learn more? I've written a book on web scraping that tons of people have already downloaded. Check it out!

PS: Forecast.io actually has a great API that I'd suggest you check out if you want to use weather data in your application.

About the author

Hartley is a 20-something, full-stack web developer. Author of Marketing for Hackers and The Ultimate Guide to Web Scraping.
