Nested HTML Downloads


Overview

The goal of this example is to show how we can parse an HTML document and download the files to which it links, i.e. the targets of its <a href=...> elements. As we process the original document, we arrange to download the others. There are three possible "obvious" approaches.
  • One approach is to parse the original document in its entirety and extract its links (either by fetching the document and then parsing it, or "on the fly" by connecting the output of the curl request directly to the XML parser). We then download each of those linked documents. (A sketch of this sequential approach follows this list.)
  • Another approach is to start the parsing of the top-level document and when we encounter a link, we immediately download that and then continue on with the parsing of the original document. In other words, when we encounter a link, we hand control to the downloading of that link.
  • An intermediate approach is to parse the first document and, as we encounter a link, send a request for that document and arrange to have it be processed concurrently with the other documents. Essentially, we arrange for the processing of the links to be done asynchronously. Having encountered a link, we neither wait until it is completely downloaded nor wait to download all of the links until after we have processed the original document. Rather, we add a request to download the link as we encounter it and continue processing.
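For contrast, here is a minimal sketch of the first, fully sequential approach, assuming the XML package's tree-parsing helpers (htmlTreeParse(), xpathApply()). It is illustration only, not part of the example that follows.

library(RCurl)
library(XML)

 # Fetch and parse the whole document first, then download each
 # absolute http: link in turn. Entirely sequential.
uri = "http://www.omegahat.org/index.html"
doc = htmlTreeParse(getURL(uri), useInternalNodes = TRUE)
hrefs = unlist(xpathApply(doc, "//a[@href]", xmlGetAttr, "href"))
docs = lapply(grep("^http:", hrefs, value = TRUE), getURL)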

Asynchronous, concurrent link processing

The strategy in this approach is to start the parsing of the original document. We do this in almost exactly the same way as in the xmlParser example. That is, we create a multi curl handle and a function that will feed data from the HTTP response to the XML parser when it is required. We then put the download of the original/top-level file on the stack for the multi handle. [1]

uri = "http://www.omegahat.org/index.html"             # an HTML document, with <a href=...> links
uri = "http://www.omegahat.org/RCurl/philosophy.xml"   # a Docbook document, with <ulink url=...> links; used below

multiHandle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(multiHandle, save = TRUE)

curl = getCurlHandle(URL = uri, writefunction = streams$getHTTPResponse)
multiHandle = push(multiHandle, curl)


At this point, the initial HTTP request has not actually been performed and therefore there is no data. And this is good: we want to start the XML parser first. So we establish the handlers that will process the elements of interest in our document, e.g. a <ulink> for a Docbook document or an <a> for an HTML document. The function downloadLinks(), defined below, is used to do this. And now we are ready to start the XML parser via a call to xmlEventParse().
links = downloadLinks(multiHandle, "http://www.omegahat.org", "ulink", "url", verbose = TRUE)
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)

At this point, the XML parser asks for some input. It calls supplyXMLContent and this fetches data from the HTTP reply. In our case, this will cause the HTTP request to be sent to the server and we will wait until we get the first part of the document. The XML parser then takes this chunk and parses it. When it encounters an element of interest, i.e. a ulink, it calls the appropriate handler function given in links. This handler gets the URI of the link and then arranges to add to the multi handle an HTTP request to fetch that document. The next time the multi curl handle is asked to get input for the XML parser, it will send that new HTTP request and the response will become available. The write handler for the new HTTP request simply collects all the text for the document into a single string. We use basicTextGatherer() for this.
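As a standalone illustration of that last pairing (not part of the example's flow; the URL is arbitrary), a basicTextGatherer() accumulates the response text chunk by chunk and hands it back via value():

w = basicTextGatherer()
curlPerform(url = "http://www.omegahat.org/index.html", writefunction = w$update)
doc = w$value()     # the accumulated text of the document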

There is one last little detail before we can access the results. It is possible that the XML event parser will have digested all its input before the downloads for the other documents have finished. There will be nothing causing libcurl to return to process those HTTP responses. So they may be stuck in limbo, with input pending but nobody paying attention. To ensure that this doesn't happen, we can use the complete() function to complete all the pending transactions on the multi handle.
complete(multiHandle)

And now that we have guaranteed that all the processing is done (or an error has occurred), we can access the results. The result of calling downloadLinks() gives us a function to access the downloaded documents.
links$contents()

To also get the original document, we have to look inside the streams object and ask it for the contents that it downloaded. This is why we called HTTPReaderXMLParser() with TRUE for the save argument.
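Assuming the reader object exposes an accessor for the saved text (named contents() in the sketch of HTTPReaderXMLParser() given later), this is simply:

streams$contents()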

The definition of the XML event handlers is reasonably straightforward at this point. We need a handler function for the link element that adds an HTTP request for the linked document to the multi curl handle. And we need a way to get the resulting text back when the request is completed. We maintain a list of text gatherer objects in the variable docs. These are indexed by the names of the documents being downloaded.

The function that processes a link element in the XML document merely determines whether the document is already being downloaded (to avoid duplicating the work). If not, it pushes a new request for that document onto the multi curl handle and returns. This is the function op().

There are details about dealing with relative links. We have ignored them here and dealt only with links that have an explicit http: prefix; one possible way to resolve relative links is sketched after the function definition below.

downloadLinks =
function(curlm, base, elementName = "a", attr = "href", verbose = FALSE)
{
 docs = list()    # text gatherers for the downloads, indexed by URI

 contents = function() { 
    sapply(docs, function(x) x$value())
 }

 ans =  list(docs = function() docs,
             contents = contents)


  # SAX start-element handler (saxVersion = 2): queue a download for each new link.
 op = function(name, attrs, ns, namespaces) {

   if(attr %in% names(attrs)) {

      u = attrs[attr]
        # Only follow absolute links with an explicit http: prefix.
      if(length(grep("^http:", u)) == 0)
         return(FALSE)

      if(!(u %in% names(docs))) {
         if(verbose)
            cat("Adding", u, "to document list\n")
         write = basicTextGatherer()
         curl = getCurlHandle(URL = u, writefunction = write$update)
         curlm <<- push(curlm, curl)

         docs[[u]] <<- write
      }
   }

   TRUE
 }

 ans[[elementName]] = op
 
 ans
}
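The base argument is accepted above but never used. As a hedged illustration only (not part of the original example), relative links might be resolved against it along these lines; the helper resolveLink() is hypothetical:

resolveLink = function(u, base) {
    # Links that already carry a scheme pass through untouched.
  if(length(grep("^[a-zA-Z][a-zA-Z0-9+.-]*:", u)) > 0)
     return(u)
    # Otherwise, naively join base and the relative path. A real
    # implementation would also handle "..", "#" fragments and queries.
  paste(gsub("/+$", "", base), gsub("^/+", "", u), sep = "/")
}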


The code above assumes that both the RCurl and XML packages are loaded:

library(RCurl)
library(XML)     # provides xmlEventParse()



HTTPReaderXMLParser
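The original document refers here to the definition of HTTPReaderXMLParser() from the xmlParser example, which is not reproduced in this section. What follows is only a minimal sketch of such a reader, under several assumptions: that curlMultiPerform() reports the number of active transfers via a numHandlesRemaining element (as in RCurl's multi interface), that xmlEventParse() treats an empty string from its input function as end of input, and that the accessor for the saved text is named contents() (our choice, matching its use earlier).

HTTPReaderXMLParser =
function(curlm, save = FALSE)
{
  pending = character()    # chunks received but not yet handed to the parser
  saved = character()      # full text of the document, kept only if save = TRUE

    # writefunction for the top-level curl handle: stash each chunk as it arrives.
  getHTTPResponse = function(txt) {
    pending <<- c(pending, txt)
    if(save)
       saved <<- c(saved, txt)
    nchar(txt, "bytes")    # report the number of bytes consumed
  }

    # Input function for xmlEventParse(): drive the pending transfers until
    # some content is available for the parser, returning "" at end of input.
  supplyXMLContent = function(len) {
    while(length(pending) == 0) {
      status = curlMultiPerform(curlm)
      if(status[["numHandlesRemaining"]] == 0 && length(pending) == 0)
         return("")
    }
    ans = paste(pending, collapse = "")
    pending <<- character()
    ans
  }

  list(getHTTPResponse = getHTTPResponse,
       supplyXMLContent = supplyXMLContent,
       contents = function() paste(saved, collapse = ""))
}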




[1] The creation of the regular curl handle and pushing it onto the multiHandle stack is equivalent to

handle = getURLAsynchronous(uri, 
                           write = streams$getHTTPResponse,
                           multiHandle = multiHandle, perform = FALSE)