This example uses RCurl to download an XML document and then collect
the URL of each link within that document. The purpose of the
example is to illustrate how we can use the RCurl package to
download a document and feed it directly into the XML (or HTML)
parser without ever having the entire content of the document in memory.
We start the download and pass a function to the
xmlEventParse()
function for processing. As the
XML parser needs more input, it fetches more data from the HTTP
response stream. This is useful for handling very large documents
returned from Web queries.
To do this, we need to use the
multi interface for libcurl
in order to have asynchronous or non-blocking downloading of the document.
The idea is quite simple. We initiate the download and associate a
"writer" to slurp up the body of the HTTP response. This is registered
with libcurl and is invoked whenever libcurl is in control and is
processing the HTTP response. If there is information to be read on
the HTTP stream from the server, this function reads it and appends it
to a variable
pending.
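To see the writer in isolation, here is a minimal sketch (separate from
the XML machinery developed below) that registers a writer function with
a multi handle and drives the download by repeatedly handing control to
libcurl. The URL and the names chunks and writer are just for illustration.
library(RCurl)

chunks = character()
writer =
function(txt) {
    # libcurl hands us each piece of the response body as it arrives.
  chunks <<- c(chunks, txt)
  ""
}

multi = getCurlMultiHandle()
multi = getURLAsynchronous("http://www.omegahat.org/index.html",
                           write = writer, multiHandle = multi,
                           perform = FALSE)

  # Each call to curlMultiPerform() lets libcurl process whatever data
  # is currently available; the second element of its return value is
  # the number of transfers still in progress.
repeat {
  status = curlMultiPerform(multi, multiple = TRUE)
  if(status[2] == 0)
    break
}
At this point, chunks holds the pieces of the response body in the
order in which they arrived.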
The second part of this mechanism is a function, called by
xmlEventParse(),
that provides input to the XML parser.
Of course, it will use the content coming from the HTTP server that is
collected by the function getHTTPResponse(). So we create a sibling
function that shares state with getHTTPResponse() and so
can see the contents of the variable
pending. When the
XML parser demands some input, our function
supplyXMLContent()
checks whether pending has
non-trivial content (i.e., is not the empty string). If it has some
content, it returns that. Otherwise, it tells libcurl to read some
more from the HTTP stream. When it hands control to libcurl in this
way, libcurl will invoke our
getHTTPResponse()
function, populating the contents of
pending. So when
libcurl yields control, we have content to pass to the XML
parser.
The only additional issue we have to deal with in this setup is
that the XML event parser asks for input up to a certain size. We
cannot necessarily give it all of the content of
pending. If
pending has more characters
than the XML parser wants, we give it at most the first
maxLen - 1 characters and leave the remainder in
pending for the next request from the XML parser.
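To make the splitting concrete, here is the substring() arithmetic on
made-up values:
pending = "<section><title>Example</title></section>"
maxLen = 10
ans = substring(pending, 1, maxLen - 1)   # "<section>"
pending = substring(pending, maxLen)      # the rest waits for the next request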
The following generator function defines the two
functions that pull the text from libcurl
and push it to the XML parser.
HTTPReaderXMLParser =
function(curl, verbose = FALSE, save = FALSE)
{
  pending = ""
  text = character()

    # The "writer" registered with libcurl: called with each chunk
    # of the HTTP response body, which it appends to pending.
  getHTTPResponse =
  function(txt) {
    pending <<- paste(pending, txt, sep = "")
    if(save)
      text <<- c(text, txt)
    if(verbose) {
      cat("Getting more information from HTTP response\n")
      print(pending)
    }
    ""  # Give back something real.
  }

    # Called by xmlEventParse() when the parser wants up to
    # maxLen more characters of input.
  supplyXMLContent =
  function(maxLen) {
    if(verbose)
      cat("Getting data for XML parser\n")
    if(pending == "") {
      if(verbose)
        cat("Need to fetch more data for XML parser from HTTP response\n")
        # Hand control to libcurl until some content arrives or
        # no transfers remain in progress.
      while(pending == "") {
        status = curlMultiPerform(curl, multiple = TRUE)
        if(status[2] == 0)
          break
      }
    }
    if(pending == "") {
        # There is no more input available from this request.
      return(character())
    }
      # Now we have the text; return at most maxLen - 1 characters
      # and keep the remainder in pending for the next request.
    if(nchar(pending) >= maxLen) {
      ans = substring(pending, 1, maxLen - 1)
      pending <<- substring(pending, maxLen)
    } else {
      ans = pending
      pending <<- ""
    }
    if(verbose)
      cat("Sending '", ans, "' to XML\n", sep = "")
    ans
  }

  list(getHTTPResponse = getHTTPResponse,
       supplyXMLContent = supplyXMLContent,
       pending = function() pending,
       text = function() paste(text, collapse = ""))
}
The remaining part is to combine these pieces with the
RCurl and XML packages to do the parsing in this
asynchronous, interleaved manner.
The code below performs the basic steps:
uri = "http://www.omegahat.org/RCurl/philosophy.xml"

  # The multi handle gives us asynchronous, non-blocking transfers.
handle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(handle)

  # Register the request and the writer, but do not perform it yet.
handle = getURLAsynchronous(uri,
                            write = streams$getHTTPResponse,
                            multiHandle = handle,
                            perform = FALSE)

links = getDocbookLinks()
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)
links$links()
The steps in the code are as follows. We first create a
'multi handle'. This gives us the asynchronous behavior that returns
control back to us from libcurl, rather than sending the request and
slurping back all the data in one single atomic action. Next, we
create our functions to do the pulling and pushing of text from HTTP
to the XML parser. These are returned from the call to
HTTPReaderXMLParser().
We then set up the request
to fetch the content of the URI with the call to
getURLAsynchronous().
Note that we tell it not to
actually perform the request, i.e.
perform = FALSE.
We are just setting it up to be performed when the XML parser requests
input. This is important, as this call must return so that we can call
xmlEventParse().
The next step is to establish the XML event parser.
We provide a collection of handlers that process the XML content in
the way that we want (see below). And now we are off: the XML
parser requests input and our functions read from the HTTP
stream.
To process the links within the Docbook document, we look for
each ulink element and fetch its url attribute. So we can provide
a collection of handlers consisting of a function only for ulink.
It need only look at the attributes it is given and determine
whether there is a url entry.
If there is, it appends the value to its internal collection of links.
When the parsing is finished, we can ask for this collection
of links using the additional function links().
getDocbookLinks =
function()
{
  links = character()

    # Invoked for each ulink start tag; record its url attribute, if any.
  ulink = function(name, attrs, ns, namespaces) {
    if("url" %in% names(attrs))
      links[length(links) + 1] <<- attrs["url"]
  }

  list(ulink = ulink,
       links = function() links)
}
To run this code, we need to load both the RCurl and XML packages.
library(RCurl)
library(XML)
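As a quick sanity check of the handlers on their own, we can run them
over a small, made-up Docbook fragment held in memory, using
asText = TRUE so that xmlEventParse() treats the string as the document
itself rather than as a file name.
txt = '<article>
  <para>See <ulink url="http://www.omegahat.org/RCurl">RCurl</ulink>
  and <ulink url="http://www.omegahat.org/XML">XML</ulink>.</para>
</article>'

h = getDocbookLinks()
xmlEventParse(txt, handlers = h, saxVersion = 2, asText = TRUE)
h$links()
# [1] "http://www.omegahat.org/RCurl" "http://www.omegahat.org/XML"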