This is a guest post from blogger and scraping expert Hartley Brody — enjoy!
One of the most common difficulties with web scraping is pulling information from sites that do a lot of rendering on the client side. When faced with scraping a site like this, many programmers reach for very heavy-handed solutions like headless browsers or frameworks like Selenium. Fortunately, there's usually a much simpler way to get the information you need.
But before we dive into that, let's first take a step back and talk about how browsers work so we know where we're headed. When you navigate to a site that does a lot of rendering in the browser -- like Twitter or Forecast.io -- what really happens? The browser downloads an initial HTML document from the server. That document usually contains <script> and <link> elements that point to other resources that the browser must then download in order to finish rendering the page.
But on sites that rely on the client to do most of the page rendering, the original HTML document is essentially a blank slate, waiting to be filled in asynchronously. In the words of Jeremy Edberg -- Reddit's first paid employee and currently a Reliability Architect at Netflix -- when the page first loads, you often "get a rectangle with a lot of divs, and API calls are made to fill out all the divs."
If you examine the network traffic in your browser as the page is loading, you should be able to see which endpoints the page is hitting to load its data. Flip over to the XHR filter inside the "Network" tab in the Chrome web inspector. These are essentially undocumented API endpoints that the web page is using to pull data. You can use them too!
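Once you've spotted one of those XHR requests, you can usually replay it with a plain HTTP client. Here's a minimal sketch using Python's `requests` library -- the endpoint URL, parameters, and JSON shape below are hypothetical stand-ins for whatever shows up in your own inspector, and the canned payload lets the sketch run without a network connection:

```python
import json

# Replaying the XHR request directly (uncomment to hit a real endpoint).
# Copying the browser's User-Agent and X-Requested-With headers often helps
# the server treat your script like the page's own JavaScript:
#
# import requests
# resp = requests.get(
#     "https://example.com/api/items",          # hypothetical endpoint
#     params={"page": 1},
#     headers={
#         "User-Agent": "Mozilla/5.0",
#         "X-Requested-With": "XMLHttpRequest",
#     },
# )
# payload = resp.json()

def extract_names(payload):
    """Pull just the fields we care about out of the JSON payload."""
    return [item["name"] for item in payload.get("items", [])]

# A canned payload with the same (assumed) shape, so the sketch runs offline:
payload = json.loads('{"items": [{"name": "first"}, {"name": "second"}]}')
print(extract_names(payload))  # ['first', 'second']
```

The nice part is that the response is already structured data -- no DOM traversal required.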
Let's take a look at how we might do this on Twitter's homepage. When a logged-in user navigates to twitter.com, Tweets are added to the user's timeline with calls to this endpoint. Pull that up in your browser and you'll see a JSON object that contains a big blob of HTML that's injected into the page. You can make a call to this endpoint yourself and parse the information you need from the response, rather than waiting for the entire page to load.
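That "blob of HTML inside JSON" pattern is common, and it's easy to handle: decode the JSON, then feed the HTML string to an HTML parser. The key name `items_html` and the `tweet` class below are assumptions for illustration; this sketch uses the standard library's `HTMLParser`, though in practice you'd likely reach for BeautifulSoup:

```python
import json
from html.parser import HTMLParser

# Assumed response shape: a JSON object whose rendered markup lives in an
# HTML string under a key like "items_html" (canned here so it runs offline):
sample_response = json.loads(
    '{"items_html": "<div class=\\"tweet\\">hello</div><div class=\\"tweet\\">world</div>"}'
)

class TweetCounter(HTMLParser):
    """Count <div class="tweet"> elements in the injected HTML blob."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "div" and ("class", "tweet") in attrs:
            self.count += 1

counter = TweetCounter()
counter.feed(sample_response["items_html"])
print(counter.count)  # 2
```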
It's a similar situation when we look at Forecast.io. The HTML document that's returned from the server provides the skeleton for the page, but all of the forecast information is loaded asynchronously. If you pull up your web inspector, refresh the page and then look for the XHR requests in the "Network" tab, you'll see a call to this endpoint that pulls in all the forecast data for your location.
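Hitting that endpoint yourself looks something like the sketch below. The URL shape and JSON keys are assumptions based on what a forecast response typically contains, and the canned response stands in for a live call:

```python
import json

# Live version (uncomment to call the same endpoint the page uses;
# the URL here is a guess at its shape, not the real thing):
#
# import requests
# data = requests.get("https://forecast.io/forecast/37.7749,-122.4194").json()

# Canned response with a similar (assumed) shape, so this runs offline:
data = json.loads('{"currently": {"summary": "Clear", "temperature": 58.3}}')

current = data["currently"]
print(current["summary"], current["temperature"])  # Clear 58.3
```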
Now you don't need to load the entire page and wait for the DOM to be ready in order to scrape the information you're looking for. You can go directly to the source to make your application much faster and save yourself a bunch of hassle.
Wanna learn more? I've written a book on web scraping that tons of people have already downloaded. Check it out!
PS: Forecast.io actually has a great API that I'd suggest you check out if you want to use weather data in your application.