R CMD ldd .../libs/XML.so
and see where it thinks libxml2 is. With luck it will list the library, and you may find that it resolves to /usr/lib/libxml2.so while you are compiling against a version in /usr/local/lib/. If that is the case, either change your personal setting for LD_LIBRARY_PATH or add /usr/local/lib to the /etc/ld.so.conf file at the system level.
xmlTree()
function.
This uses the low-level node creation functions
(e.g. newXMLNode, newXMLComment, newXMLPINode, etc.)
but also allows us to manage a stack of "open" nodes
and a default namespace prefix.
New nodes are by default added to the most recent
"open" node, i.e. that node acts as the parent for new nodes.
xmlTree()
mean that we cannot store the
tree across R sessions since they are external pointers to C data
structures in memory.
tt = xmlTree("top")
tt$addTag("b", "Some text")
txt = saveXML(tt)   # serialize the tree to a string; save() needs a named object
save(txt, file = "tt.rda")
load("tt.rda")
tt = xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
We don't get back the XMLInternalDOM with its information about open nodes, etc., from which we could continue to add nodes. But we do get back the exact tree. We can also convert the nodes from internal nodes to regular R base nodes. And from that
<node mine:foo="abc" />
)
When I parse this document into S, the namespace prefix
on the attribute is dropped. Why and how can I fix it?
TRUE for the addAttributeNamespaces argument in the call to xmlTreeParse.
mine, in our example) is defined in the document.
In other words, there must be an
xmlns:mine="some url"
attribute on the node that is being processed or on one of its ancestors.
If no definition for the namespace is in the document,
the libxml parser drops the prefix on the attribute.
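As a minimal sketch of the two points above, the following parses a small document in which the prefix is defined and compares the attribute names with and without addAttributeNamespaces. The document text and the URI http://www.example.org/mine are made up for illustration; only xmlTreeParse, xmlRoot and xmlAttrs from the XML package are used.

```r
library(XML)

# Hypothetical document: the "mine" prefix is declared on the root node.
txt <- '<doc xmlns:mine="http://www.example.org/mine"><node mine:foo="abc"/></doc>'

# Ask the parser to keep the namespace prefix in the attribute names.
withPrefix <- xmlTreeParse(txt, asText = TRUE, addAttributeNamespaces = TRUE)
# Default behaviour: the prefix is omitted from the attribute names.
noPrefix <- xmlTreeParse(txt, asText = TRUE)

attrsWith <- names(xmlAttrs(xmlRoot(withPrefix)[["node"]]))
attrsWithout <- names(xmlAttrs(xmlRoot(noPrefix)[["node"]]))

print(attrsWith)
print(attrsWithout)
```

If the xmlns:mine declaration is removed from the document, the prefix cannot be kept no matter how the argument is set.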
useTagName
is TRUE,
and also that there really is a tag with this name in the
document.
Again, the case is important.
"RSDTD.c", line 110: warning: argument #2 is incompatible with prototype:
prototype: pointer to const uchar : "unknown", line 0
argument : pointer to const char
Daniel Veillard might add this.
expressions
option to a value larger than 256.
options(expressions=1000)
The main cause of this is that S and R are programming languages not specialized for handling trees. (They are functional languages and have no facilities for pointers or references as in C or Java.)
Parameters are allowed, but the libxml parsing library is fussy about white-space, etc. The following is ok
<!ELEMENT legend (%PlotPrimitives;)* >
but
<!ELEMENT legend (%PlotPrimitives; )* >
is not. The extra space preceding the
)
causes an error in the parser, something like
1: XML Parsing Error: ../Histogram.dtd:80: xmlParseElementChildrenContentDecl : ',' '|' or ')' expected
2: XML Parsing Error: ../Histogram.dtd:80: xmlParseElementChildrenContentDecl : ',' expected
3: XML Parsing Error: ../Histogram.dtd:80: xmlParseElementDecl: expected '>' at the end
4: XML Parsing Error: ../Histogram.dtd:80: Extra content at the end of the document
This can be fixed by adding a call to SKIP_BLANKS at the end of the loop
while(CUR != ')') { ... }
in the routine
xmlParseElementChildrenContentDecl()
in parser.c.
The problem lies in the transition between the different input
buffers introduced by the entity expansion.
and we want to use an XPath expression to find the title node. We might think that
"/doc/topic/title"
would do the trick. But in fact, we need
/ns:doc/ns:topic/ns:title
And then we need to map ns to the URI "http://www.omegahat.org". We do this in a call to getNodeSet() as
getNodeSet(doc, "/ns:doc/ns:topic/ns:title", c(ns = "http://www.omegahat.org"))
As a simplification, getNodeSet() will create the map between the prefix and the URI of the default namespace of the XML document if you specify a single string with no name as the value of the
namespaces
argument, e.g.
getNodeSet(doc, "/ns:doc/ns:topic/ns:title", "ns")
There are some additional comments here.
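The two calls can be tried on a small stand-in document. This is only a sketch: the document text below is an assumption, built to have the default namespace "http://www.omegahat.org" and the doc/topic/title structure that the XPath expression implies.

```r
library(XML)

# Assumed stand-in document with a default namespace on the root node.
txt <- '<doc xmlns="http://www.omegahat.org"><topic><title>My Title</title></topic></doc>'
doc <- xmlParse(txt, asText = TRUE)

xpath <- "/ns:doc/ns:topic/ns:title"

# Explicit map from the prefix "ns" to the namespace URI.
explicit <- getNodeSet(doc, xpath, c(ns = "http://www.omegahat.org"))

# Shorthand: a single unnamed string is used as the prefix for the
# document's default namespace.
shorthand <- getNodeSet(doc, xpath, "ns")

explicitTitles <- sapply(explicit, xmlValue)
shorthandTitles <- sapply(shorthand, xmlValue)
print(explicitTitles)
print(shorthandTitles)
```

The prefix itself is arbitrary; it only has to match between the XPath expression and the namespaces argument.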
xmlTreeParse(, useInternalNodes = TRUE)
This file contained 2,895,409 nodes
(length(getNodeSet(z, "//*"))).
This took 9.4 seconds on an Intel MacBook Pro with a 2.33GHz dual
processor and 3GB of RAM, and on a machine with a dual-core 64-bit AMD,
it took 20 seconds.
To find the nodes of interest took 8.9 seconds on the Mac, and
(apparently) 1.1 seconds on the AMD.
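Measurements of this kind can be taken with system.time(). The sketch below uses a small generated document rather than the large file discussed above, so the numbers it reports say nothing about that file; the element names root and item are made up for illustration.

```r
library(XML)

# Generate a small stand-in document: one root node with 1000 empty children.
items <- paste(rep("<item/>", 1000), collapse = "")
txt <- paste0("<root>", items, "</root>")

# Time the parse into internal (C-level) nodes.
parseTime <- system.time(
  z <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
)

# Time the XPath query that counts every element in the document.
queryTime <- system.time(
  nodeCount <- length(getNodeSet(z, "//*"))
)

print(nodeCount)
print(parseTime[["elapsed"]])
print(queryTime[["elapsed"]])
```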
xinclude = TRUE, which is the default, in calls to xmlTreeParse().
<xi:include xpointer="xpointer(//mynode)"/>
adapting that to what you want. Note that the attribute is named xpointer. There is no href, so the XInclude defaults to this document, and the expression for the xpointer attribute uses the function xpointer. It is not element.
/Users/duncan/BigXML.xml:242094: error: xmlSAX2Characters: huge text node: out of memory
and something about
Extra content at the end of the document
Error: 1: Extra content at the end of the document
What's the problem and what can I do?
How do we get around this? Well, we have to tell the parser that this is not actually "huge". We use the xmlParseDoc function. This is like xmlParse but allows us to specify options controlling the parser.
u = "http://www.omegahat.org/RSXML/BigXML.xml"
doc = xmlParseDoc(u, HUGE)
txt = xmlValue(getNodeSet(doc, "//data")[[1]])
nchar(txt)
And that solves the problem!
A different approach is to use SAX, i.e. event-driven parsing. Our text handler function is called with chunks of text, and not the entire content in a single call.
data = character()
txt = function(x) data <<- c(data, x)
xmlEventParse(u, list(text = txt))
length(data)
sum(nchar(data))
This never raises the error about the huge text node because it never builds the node.
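The same handler pattern can be tried without the large remote file. The following is a sketch on a tiny in-memory document; the document text and the variable names chunks and collect are made up for illustration. For a large text node the handler may be called several times, which is why the pieces are accumulated and pasted together at the end.

```r
library(XML)

# Tiny stand-in document; the real use case is a very large text node.
txt <- "<doc><data>abcdef</data></doc>"

chunks <- character()
# Text handler: called with each piece of character data the parser reports.
collect <- function(x, ...) chunks <<- c(chunks, x)

invisible(xmlEventParse(txt, handlers = list(text = collect), asText = TRUE))

# Reassemble the full text from the pieces.
fullText <- paste(chunks, collapse = "")
print(length(chunks))
print(nchar(fullText))
```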