Asynchronous, concurrent link processing
The strategy in this approach is to start by parsing the original
document, in almost exactly the same way as in the xmlParser example.
That is, we create a multi curl handle and a function that feeds data
from the HTTP response to the XML parser as it is needed. We then put
the download of the original/top-level document onto the stack of
requests managed by the multi handle.
uri = "http://www.omegahat.org/RCurl/philosophy.xml"
  # uri = "http://www.omegahat.org/index.html"   # an HTML alternative
multiHandle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(multiHandle, save = TRUE)
curl = getCurlHandle(URL = uri, writefunction = streams$getHTTPResponse)
multiHandle = push(multiHandle, curl)
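The HTTPReaderXMLParser() function itself is defined in the xmlParser
example rather than repeated here. For orientation only, the following
is a rough sketch of what such a reader might look like; the field
names and the assumption that the XML parser calls the content supplier
with the number of characters it wants are illustrative, not a
definitive implementation.
HTTPReaderXMLParser =                       # illustrative sketch only
function(curlm, save = FALSE)
{
   buf = character()      # chunks received from libcurl but not yet parsed
   saved = character()    # full text of the document, kept when save = TRUE

     # used as the writefunction on the curl handle: stash each chunk.
   getHTTPResponse = function(txt) {
      buf <<- c(buf, txt)
      if(save)
         saved <<- c(saved, txt)
      nchar(txt, type = "bytes")
   }

     # content supplier for xmlEventParse(): if we have no pending text,
     # let libcurl process whatever is available on its connections.
   supplyXMLContent = function(len) {
      if(length(buf) == 0)
         curlMultiPerform(curlm, multiple = TRUE)
      txt = paste(buf, collapse = "")
      buf <<- character()
      txt       # an empty string signals that there is no more input
   }

   list(getHTTPResponse = getHTTPResponse,
        supplyXMLContent = supplyXMLContent,
        contents = function() paste(saved, collapse = ""))
}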
At this point, the initial HTTP request has not actually been performed
and therefore there is no data. That is fine: we want to start the XML
parser first. So we establish the handlers that will process the
elements of interest in our document, e.g. a <ulink> for a DocBook
document, or an <a> for an HTML document. The function
downloadLinks()
is used to do this.
Now we are ready to start the XML parser
via a call to
xmlEventParse()
.
links = downloadLinks(multiHandle, "http://www.omegahat.org", "ulink", "url", verbose = TRUE)
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)
At this point, the XML parser asks for some input. It calls
supplyXMLContent(), and this fetches data from the HTTP reply. In our
case, this causes the HTTP request to be sent to the server and we wait
until we get the first part of the document. The XML parser then takes
this chunk and parses it. When it encounters an element of interest,
i.e. a <ulink>, it calls the appropriate handler function given in
links. That handler gets the URI of the link and arranges to add to
the multi handle an HTTP request to fetch that document. The next time
the multi curl handle is asked to provide input for the XML parser, it
sends that new HTTP request as well, and the response becomes
available. The write handler for the new HTTP request simply collects
all the text for that document into a single string. We use
basicTextGatherer()
for this.
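For those who have not used it, basicTextGatherer() returns a closure
whose update function can serve as the writefunction for a curl handle
and whose value() function returns everything collected so far as a
single string, roughly as follows.
h = basicTextGatherer()
h$update("<html>")       # libcurl calls this with each chunk it receives
h$update("</html>")
h$value()                # the accumulated text: "<html></html>"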
There is one last detail before we can access the results. It is
possible that the XML event parser will have digested all of its input
before the downloads of the other documents have finished. In that
case, nothing causes libcurl to return to process those HTTP responses,
so they may be stuck in limbo, with input pending but nobody paying
attention. To ensure that this doesn't happen, we use the
complete()
function to complete all the pending
transactions on the multi handle.
complete(multiHandle)
And now that we have guaranteed that all the processing
is done (or an error has occurred), we can access the results.
The object returned by
downloadLinks()
gives us a function for accessing the downloaded documents.
links$contents()
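The docs() accessor returns the underlying text gatherer objects,
indexed by the URLs that were fetched, so we can see which documents
were downloaded with
names(links$docs())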
To get the original document as well, we have to look inside the
streams object and ask it for the contents that it
downloaded. This is why we called
HTTPReaderXMLParser()
with
TRUE
for the
save argument.
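The exact accessor depends on how HTTPReaderXMLParser() is defined;
with the illustrative sketch above it would be something like
streams$contents()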
The definition of the XML event handlers is reasonably straightforward
at this point. We need a handler function for the link element that
adds an HTTP request for the linked document to the multi curl handle,
and we need a way to get the resulting text back when the request has
completed. We maintain a list of text gatherer objects in the
variable
docs. These are indexed by the names of the
documents being downloaded.
The function that processes a link element in the XML document merely
determines whether the document is already being downloaded (to avoid
duplicating the work) or not. If not, it pushes a new request for
that document onto the multi curl handle and returns. This is the function
op()
.
There are details about dealing with relative links. We have ignored
them here and only dealt with links that have an explicit
http: prefix; a sketch of one way to resolve relative links against
the base URL is given after the listing below.
downloadLinks =
function(curlm, base, elementName = "a", attr = "href", verbose = FALSE)
{
     # text gatherers for the documents being downloaded, indexed by URL
   docs = list()

   contents = function() {
      sapply(docs, function(x) x$value())
   }

   ans = list(docs = function() docs,
              contents = contents)

     # handler for the link element: queue a request for each new link
   op = function(name, attrs, ns, namespaces) {
      if(attr %in% names(attrs)) {
         u = attrs[attr]
           # only handle absolute http: links; relative links are ignored here
         if(length(grep("^http:", u)) == 0)
            return(FALSE)
         if(!(u %in% names(docs))) {
            if(verbose)
               cat("Adding", u, "to document list\n")
            write = basicTextGatherer()
            curl = getCurlHandle(URL = u, writefunction = write$update)
            curlm <<- push(curlm, curl)
            docs[[u]] <<- write
         }
      }
      TRUE
   }

   ans[[elementName]] = op
   ans
}
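One possible way to deal with relative links is to resolve them against
the base argument that downloadLinks() already receives, for instance
with getRelativeURL() from the XML package. The helper below is a
sketch of that idea, not part of the example above.
resolveLink =
function(u, base)
{
     # leave absolute links alone; resolve everything else against the base URL
   if(length(grep("^(http|https|ftp):", u)) == 0)
      u = getRelativeURL(u, base)
   u
}

resolveLink("RCurl/index.html", "http://www.omegahat.org/")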