Web scraping

Introduction

This article describes a technique for web scraping and the class that implements it, ipl.webscraper.WebScraper.

This class works from a sample web page that has the same structure as the pages we will subsequently scrape. This sample page is called the template. It is an ordinary HTML document, but it contains special directives that mark the elements we wish to locate in those pages.

Scrape IDs

There are two types of special directive. The first is the “scrape ID”; the second, the “anchor”, is described later. We insert scrape IDs into the template to indicate which corresponding elements we wish to retrieve from subsequent web pages.

A scrape ID can either refer to a tag, in which case it takes the form of an HTML attribute :-

<table scrapeid="spellingtable">

or it can be placed at the start of text content :-

<a href="/news/special_reports" class="navigation-wide-list__link">
    <span>$scrapeid=title$Special Reports</span>
</a>

In the second case, the lookup process returns the corresponding string from the subject web page being searched, rather than a matching element.

A template can contain several scrape IDs.
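To make the text form of the directive concrete, here is a minimal sketch of how a leading "$scrapeid=...$" marker could be separated from the text that follows it. It is written in Python as a language-neutral illustration and is not the WebScraper implementation; the function name split_text_directive and the exact matching rules (leading whitespace allowed, no "$" inside the ID) are assumptions.

```python
import re

# Assumed syntax: optional whitespace, then $scrapeid=ID$, then the real text.
TEXT_SCRAPE_ID = re.compile(r"^\s*\$scrapeid=([^$]+)\$(.*)$", re.DOTALL)

def split_text_directive(text):
    """Return (scrape_id, remaining_text), or (None, text) if no directive."""
    m = TEXT_SCRAPE_ID.match(text)
    if m is None:
        return None, text
    return m.group(1), m.group(2)

result = split_text_directive("$scrapeid=title$Special Reports")
plain = split_text_directive("Special Reports")
print(result, plain)
```

The first call separates the ID "title" from the text "Special Reports"; the second shows that ordinary text passes through unchanged.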

showtemplate

There is a useful utility program, showtemplate, in the examples directory of the Object Icon distribution. This shows how scrape IDs are used by the WebScraper class to define the path to a particular element. showtemplate can be used as follows :-

$ showtemplate template.html
object ipl.webscraper.WebScraper#1(
   paths=table#4784{
         "deflist"->
            list#1216[
               record ipl.webscraper.RootCmd#1(),
               record ipl.webscraper.TagCmd#1(n=1;name="BODY"),
               record ipl.webscraper.TagCmd#2(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#3(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#4(n=4;name="DIV"),
               record ipl.webscraper.TagCmd#5(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#6(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#7(n=1;name="MAIN"),
               record ipl.webscraper.TagCmd#8(n=1;name="ARTICLE"),
               record ipl.webscraper.TagCmd#9(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#10(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#11(n=4;name="DIV"),
               record ipl.webscraper.TagCmd#12(n=1;name="UL")
            ]
      }
   debug_flag=&yes
)

Here we have defined one scrape ID, “deflist”, in a template file “template.html” :-

...
<ul scrapeid="deflist">
...

The list of “commands” gives the steps necessary to locate the desired element. The RootCmd step means start at the root element of the document. Each subsequent TagCmd step means descend one level, taking the nth child element whose tag name matches name. So,

record ipl.webscraper.TagCmd#4(n=4;name="DIV"),

means the 4th <DIV> child tag under the element reached thus far. If, when searching a subject document, we find that the present element doesn’t have such a child, then the lookup process fails. On the other hand, when we get to the end of the list of steps, we have found the desired element. Note that the last TagCmd searches for a <UL> tag, the same tag type as the one containing the scrape ID.
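The walk described above can be modelled in a few lines. The following is a minimal sketch in Python rather than Object Icon, and it is not the WebScraper implementation: the function follow_path, the (n, name) tuple format for steps, and the sample document are all invented for illustration.

```python
import xml.etree.ElementTree as ET

def follow_path(root, steps):
    """Walk from root through (n, name) steps; return the element or None.

    Each step selects the nth child (1-based) whose tag matches name,
    mirroring the TagCmd records described in the text. Illustrative only.
    """
    current = root  # plays the role of RootCmd
    for n, name in steps:
        matches = [c for c in current if c.tag.upper() == name]
        if len(matches) < n:
            return None  # no such child: the lookup fails
        current = matches[n - 1]
    return current

# A tiny invented document (well-formed, so ElementTree can parse it).
SAMPLE = """
<html>
  <head><title>t</title></head>
  <body>
    <div>intro</div>
    <div><p>skip</p><ul><li>first</li><li>second</li></ul></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())
target = follow_path(root, [(1, "BODY"), (2, "DIV"), (1, "UL")])
items = [li.text for li in target]
missing = follow_path(root, [(1, "BODY"), (3, "DIV")])
print(items, missing)
```

The second call asks for a third <DIV> under <BODY>, which does not exist, so the walk stops early and reports failure instead of returning an element.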

Anchors

In practice, navigating from the root element of an HTML document is fragile: even minor changes in the structure of the subject pages can break the path.

To mitigate this problem, “anchor” directives can be declared in the template file. The idea is that some unique, non-changing text content is chosen, near to a scrape ID location. This is then marked as an “anchor”, and the scrape ID is made to refer to the anchor. The scrape ID element will then be located by first searching for the anchor text, and then by following a relative path to the target location, rather than the absolute path from the root element. This relative path will hopefully be more robust against changes.

Extending the example above, we could use an anchor and a scrape ID together like this :-

...
<ul scrapeid="a1:deflist">
...
<div class="simple-def-source">$anchorid=a1$Source: Merriam-Webster's Learner's Dictionary</div>
...

Here the anchor id is “a1”, and the anchor text is “Source: Merriam-Webster’s Learner’s Dictionary”. The scrape ID is still “deflist”, and the link to the anchor is indicated by prefixing this with “a1:”. Now, so long as this relative structure doesn’t change, and future documents still contain the anchor text, the scrape will successfully retrieve the desired element.

Running showtemplate with this template might now give output like the following :-

$ showtemplate template.html
object ipl.webscraper.WebScraper#1(
   paths=table#4784{
         "deflist"->
            list#1216[
               record ipl.webscraper.AnchorStringCmd#1(str="Source: Merriam-Webster's Learner's Dictionary"),
               record ipl.webscraper.UpCmd#1(n=1),
               record ipl.webscraper.TagCmd#1(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#2(n=4;name="DIV"),
               record ipl.webscraper.TagCmd#3(n=1;name="UL")
            ]
      }
   debug_flag=&yes
)

Note that here there is no RootCmd to begin; rather there are two different steps. The first is a search for the anchor string. Then comes UpCmd, meaning go up one level in the document structure. This is necessary since the anchor is in a different branch of the document to the scrape ID. Then we proceed with the usual navigation down the document structure; the last three steps are the same as before.
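The anchor-relative lookup can be sketched as follows. This is an illustrative Python model, not the WebScraper code: the names find_anchor and follow_from_anchor, the argument layout, and the sample page are assumptions. The sample page has an extra element at the top of <body>, a change that would shift an absolute path counted from the root but leaves the relative path intact.

```python
import xml.etree.ElementTree as ET

def find_anchor(root, text):
    """Return the first element whose own text equals the anchor text."""
    for el in root.iter():
        if (el.text or "").strip() == text:
            return el
    return None

def follow_from_anchor(root, anchor_text, up, steps):
    """Find the anchor, go up `up` levels, then descend through (n, name) steps."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    current = find_anchor(root, anchor_text)  # like AnchorStringCmd
    if current is None:
        return None
    for _ in range(up):  # like UpCmd(n=up)
        current = parents.get(current)
        if current is None:
            return None
    for n, name in steps:  # like the TagCmd steps
        matches = [c for c in current if c.tag.upper() == name]
        if len(matches) < n:
            return None
        current = matches[n - 1]
    return current

PAGE = """
<html><body>
  <div class="banner">new wrapper</div>
  <div class="box">
    <div class="inner"><ul><li>confident and hopeful</li></ul></div>
    <div class="src">Source: Example Dictionary</div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
ul = follow_from_anchor(root, "Source: Example Dictionary", 1,
                        [(1, "DIV"), (1, "UL")])
definitions = [li.text for li in ul]
absent = follow_from_anchor(root, "No such text", 1, [(1, "DIV")])
print(definitions, absent)
```

If the anchor text is missing from a page, the search fails at the first step, which matches the requirement that future documents must still contain the anchor text.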

An anchor can also refer to an element, rather than text, although this tends to be less useful :-

<div anchorid="a2" class="definition-block def-text">

Now a reference to this anchor will cause a search for the first element with the same tag name and attributes in the subject document.
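As a rough illustration of an element anchor, the sketch below returns the first element in document order whose tag name and attributes match. It is a Python model rather than the library's code; the function name find_element_anchor is hypothetical, and treating "same attributes" as exact equality of the whole attribute set is an assumption about the matching rule.

```python
import xml.etree.ElementTree as ET

def find_element_anchor(root, tag, attrs):
    """Return the first element (document order) matching tag and attrs."""
    for el in root.iter():
        # Assumption: attributes must match exactly, not merely overlap.
        if el.tag.upper() == tag.upper() and dict(el.attrib) == attrs:
            return el
    return None

PAGE = """
<body>
  <div class="header">Menu</div>
  <div class="definition-block def-text"><ul><li>first</li></ul></div>
  <div class="definition-block def-text"><ul><li>second</li></ul></div>
</body>
"""

root = ET.fromstring(PAGE.strip())
anchor = find_element_anchor(root, "div",
                             {"class": "definition-block def-text"})
first_item = anchor.find("ul/li").text
print(first_item)
```

Because only the first match is used, an element anchor is ambiguous when several elements share the same tag and attributes, which is one reason text anchors tend to be more useful.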

Creating a WebScraper

A WebScraper instance is created by passing the text data containing the template document. This can either be loaded from a file, or included using a $load (or $uload) directive, if it is not too unwieldy. In the former case, there are a couple of static methods in WebScraper to make this easier.

Once the WebScraper has been created, subject web pages can be searched using its lookup method. This takes a parsed HTML document and a scrape ID, and returns the element in the document that corresponds to the scrape ID in the template. If the scrape ID points to a string rather than an element, the corresponding string is returned. If the element cannot be matched, lookup fails, setting &why appropriately.

The string type (plain string or ucs) of the template must match the type of the source data for the parsed HTML document given to lookup; a runtime error will result if they differ.

An example

The above template example with the single scrape ID, “deflist”, in fact comes from a sample web page from Merriam-Webster’s online dictionary (as you may have guessed from the anchor text).

The template comes from the page for an arbitrary word, “example”. As described above, the template is edited to introduce a single scrape ID (and associated anchor), at the <UL> element under which the word definitions can be found. The resulting template file is template.html.

Testing the template

We can check the effectiveness of the template by running it against sample web pages. First, we download a test web page :-

$ geturl http://www.merriam-webster.com/dictionary/sanguine -o sanguine.html

and then run it against the template using showtemplate :-

$ showtemplate template.html sanguine.html -d
object ipl.webscraper.WebScraper#1(
   paths=table#4784{
         "deflist"->
            list#1216[
               record ipl.webscraper.AnchorStringCmd#1(str="Source: Merriam-Webster's Learner's Dictionary"),
               record ipl.webscraper.UpCmd#1(n=1),
               record ipl.webscraper.TagCmd#1(n=1;name="DIV"),
               record ipl.webscraper.TagCmd#2(n=4;name="DIV"),
               record ipl.webscraper.TagCmd#3(n=1;name="UL")
            ]
      }
   debug_flag=&yes
)

Testing path deflist
Command: record ipl.webscraper.AnchorStringCmd#1(str="Source: Merriam-Webster's Learner's Dictionary")
OK: Now at DIV table#5554{"CLASS"->"simple-def-source"}
Command: record ipl.webscraper.UpCmd#1(n=1)
OK: Now at DIV table#5432{"CLASS"->"tense-box quick-def-box simple-def-box card-box def-text "}
Command: record ipl.webscraper.TagCmd#1(n=1;name="DIV")
OK: Now at DIV table#5435{"CLASS"->"inner-box-wrapper"}
Command: record ipl.webscraper.TagCmd#2(n=4;name="DIV")
OK: Now at DIV table#5522{"CLASS"->"definition-block def-text"}
Command: record ipl.webscraper.TagCmd#3(n=1;name="UL")
OK: Now at UL table#5525{"CLASS"->"definition-list no-count"}
Success: object xml.HtmlElement#1149(4)

showtemplate first shows the WebScraper object derived from the template, as described previously.

The second part of the output shows the lookup process applied to the test page, sanguine.html. At each point the current element in the search is shown, and at the end we note a successful search.

A test program

It is quite easy to use the dictionary WebScraper in a simple program, as shown in the following example.

Download dictionary.icn

import net, http, io, xml, ipl.webscraper, ipl.options

$load TEMPLATE "template.html"

procedure main(a)
   local url, hreq, hc, hres, out, ws, doc, el, x, s, opts

   ws := WebScraper(TEMPLATE) | stop(&why)

   opts := options(a, [Opt("v",, "Output scraped element in HTML format")])

   # The URL we wish to scrape
   url := URL("http://www.merriam-webster.com/dictionary/" || a[1])  | stop("usage: dictionary word")

   # A stream for the resulting data
   out := RamStream()

   # Get the page.
   hreq := HttpRequest().
      set_url(url).
      set_output_stream(out)

   hc := HttpClient()

   hres := hc.retrieve(hreq) | stop(&why)

   # Parse into a HTML document
   doc := HtmlParser().parse(out.done())

   # Search the document for the desired element
   el := ws.lookup(doc, "deflist") | stop(&why)

   if \opts["v"] then {
      write("HTML formatted output :-")
      HtmlFormatter().format(el)   # assumption: format() is the formatter's output method
      write("\n")
   }

   # Output the results
   every x := el.search_tree() do {
      s := x.get_trimmed_string_content()
      if *s > 1 then
         write(s)
   }
end

This gives the following results :-

$ ./dictionary german
a person born, raised, or living in Germany : a person whose family is from Germany
the language of Germany that is also spoken in Austria, parts of Switzerland, and other places
$ ./dictionary sanguine
confident and hopeful
$ ./dictionary patient
able to remain calm and not become annoyed when waiting for a long time or when dealing with problems or difficult people
done in a careful way over a long period of time without hurrying
