Todo list for R-level XML Parser

  • Connect the memory management with R's.
  • Tidy up the interface for adding and removing nodes.
  • Add/append child to an XMLInternalNode
    addChildren() does this; it needs documenting.
  • Ability to remove a node from the children of a node.
  • Add support for the pull data source for xmlTreeParse and htmlTreeParse.
    Low priority, since efficiency shouldn't matter much here.
  • Switch to putting the namespace as the name of the element name at the S level.
  • Avoid the handler functions having names that could conflict with tag names.
    e.g. use .text, etc.
  • Fix the attributes in the event parsing.
    They should have a method for xmlGetAttr. There is a trailing "" element with no name, and they have no class.
  • Strategies and actions
    ???
    "xmlValidNode" class that "guarantees" children are valid XML nodes.
    xmlTreeParse uses this class if no handlers are given. Otherwise it uses a general, non-validated XML node with no constraint on children.
  • S-level Exceptions from XML.
    Errors, warnings, etc. collected and available after parsing for structured/programmatic access.
  • Add the base, etc. information to the Input buffer when using a connection or function in the event parsing.
  • Allow connections to be used when generating an XML tree via libxml.
    Works for SAX. Not likely to do it for DOM since one can effectively read the entire document in via a connection to a string and go from there. It is not the same thing, but that's the way it is.
  • When parsing a DTD as raw text (i.e. not from a file), we get warnings about subsets, etc.
    This happens in libxml2 2.4.13, etc., but not in 2.5.4.
  • Add DOCTYPE and DTD to xmlTree()
  • Add handlers for different namespaces to xmlTree()
    A user can do this with an S-level handler that maintains a list of lists of handler functions grouped by namespace.
  • Finalizers for libxml nodes/docs.
  • dtdElement() fails on the result of parseDTD():
    > dtdFile <- system.file("data/foo.dtd", package="XML")
    > foo.dtd <- parseDTD(dtdFile)
    > tmp <- dtdElement("variable", foo.dtd)
    Error in dtd$elements[[name]] : object is not subsettable
    > foo.dtd$elements
    "ExternalDTD"
    
  • Appears to be an oddity on Solaris with the event driven parsing.
          source("dataFrameHandler.R")
          z <- xmlEventParse("../DTDs/Examples/mtcars.xml", handler=handler())
    
    causes problems with an incorrect number of elements in the third record. It reads the 22.8 as 2 and then 2.8. Removing some of the spaces before the 22.8 at the beginning of the record makes this go away. Need to investigate further.
    This looks like the text simply being passed as multiple fragments in separate calls.
  • Develop DTDs for basic types.
  • Additional chapter/package to write XML
    Handle standard types such as data frame, time series, factors, graphics/plots, etc.

    One can cat() or paste() the output, but we can do more to ensure documents are well-formed relative to a DTD. Have a filter that knows which DTD, or collection of DTDs, to use and how to ensure that individual calls do the correct thing in context. So basically keep a cursor.
    This package can read DTDs, so the filter can be built from that. See Writing XML.
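
    A minimal sketch of the "keep a cursor" idea, in base R only. The
    function makeXMLWriter() and its openTag/text/closeTag/value members are
    hypothetical names, not part of the XML package, and the sketch only
    guards well-formedness (balanced tags, escaped text). It does not
    consult a DTD.

```r
# Hypothetical helper, not part of the XML package: a writer that keeps
# a cursor (a stack of open element names) so tags always close in order.
makeXMLWriter <- function() {
  open <- character()   # stack of currently open element names (the cursor)
  out <- character()    # pieces of output accumulated so far

  # Escape the characters that would break well-formedness in text content.
  escape <- function(x) {
    x <- gsub("&", "&amp;", x, fixed = TRUE)
    x <- gsub("<", "&lt;", x, fixed = TRUE)
    gsub(">", "&gt;", x, fixed = TRUE)
  }

  list(
    openTag = function(name) {
      out <<- c(out, paste0("<", name, ">"))
      open <<- c(open, name)
      invisible(NULL)
    },
    text = function(value) {
      if (length(open) == 0) stop("text outside of any element")
      out <<- c(out, escape(as.character(value)))
      invisible(NULL)
    },
    closeTag = function() {
      if (length(open) == 0) stop("no open element to close")
      name <- open[length(open)]          # the cursor knows what to close
      open <<- open[-length(open)]
      out <<- c(out, paste0("</", name, ">"))
      invisible(NULL)
    },
    value = function() {
      if (length(open) > 0)
        stop("unclosed element(s): ", paste(open, collapse = ", "))
      paste(out, collapse = "")
    }
  )
}

w <- makeXMLWriter()
w$openTag("record")
w$openTag("mpg")
w$text("22.8 < 30")
w$closeTag()
w$closeTag()
doc <- w$value()
cat(doc, "\n")
```

    A DTD-aware filter could plausibly hook into openTag() to check that
    the new element is allowed under the element at the top of the stack.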

  • Facility for dynamically modifying the user-level handler functions for a parser from the body of one of these handlers.
    For example, the document may contain its own functions for a particular language and we would see these in the preamble and switch to using them.
  • Add a facility for stopping the parse midway, via a call to stop() or similar, without that causing an error.
    Exceptions may work when Robert finishes these.
  • We can make this significantly more class-based, i.e. object oriented.
  • Process external entities.
    These are not currently being seen by the event mechanism. Probably a switch needs to be turned on.
    Fixed now!
    At present, internal references are substituted directly. See test.xml in the Docs directory:
          h <- .Call("R_XMLParse", "Docs/test.xml", xmlHandler(), F, F)
    See replaceEntities in xmlTreeParse().
  • We could kill off the children element in a node if there aren't any.
  • [ and [<- methods for the different types of nodes. And also functions such as those in the w3c spec for nodes, getElementsByTagName, etc.
  • Also add the [[ for accessing children, avoiding the need for $children[[]].
    Done.
  • Could kill off the attributes and/or children for certain node types such as comment, text node.
  • Handle the namespaces.
    Done, for libxml. Added a field to the XMLNode.
  • Support S, at least for the document/tree parser without the callbacks.
    The callbacks require the driver mechanism used in the CORBA and Java interfaces to provide mutable state.
    All done, except mutable state. See the interface drivers in S4.
  • Add the contextual information to the function calls.
    Depth, last node, node path, etc.
  • Done
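
    For the contextual-information item above (depth, last node, node
    path), a rough sketch of how handlers could track that context
    themselves in a closure. The names contextHandlers(), getDepth(),
    getLastNode() and getSeen() are hypothetical, and the events are driven
    by hand here rather than by the parser.

```r
# Hypothetical sketch: event handlers that maintain parsing context in a
# closure. Nothing here calls the XML package; the events are simulated.
contextHandlers <- function() {
  path <- character()          # names of the currently open elements
  lastNode <- NA_character_    # name of the most recently opened element
  seen <- character()          # "node/path@depth" for each text event

  list(
    startElement = function(name, attrs = NULL) {
      path <<- c(path, name)
      lastNode <<- name
      invisible(NULL)
    },
    endElement = function(name) {
      path <<- path[-length(path)]
      invisible(NULL)
    },
    text = function(value) {
      # Record where in the document this text was seen.
      seen <<- c(seen, paste0(paste(path, collapse = "/"), "@", length(path)))
      invisible(NULL)
    },
    getDepth = function() length(path),
    getLastNode = function() lastNode,
    getSeen = function() seen
  )
}

# Simulate the events for <dataset><record>22.8</record></dataset>.
h <- contextHandlers()
h$startElement("dataset")
h$startElement("record")
h$text("22.8")
h$endElement("record")
h$endElement("dataset")
print(h$getSeen())
```

    Having the parser pass this information in directly would save each
    handler set from maintaining its own stack.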

  • Facilities in the XML package to create internal nodes
    PI, comment, etc.
  • as(XMLInternalNode, "character") method
    saveXML(), but we don't have a document object! Can we put these into a document, save, and then undo this document reference?
    Done using xmlNodeDumpOutput()
  • Closing connections from a function or connection argument.
    Done in R.
  • Allow XML text to be specified rather than treating it as a file.
    Done for libxml parser. Done for Expat.
  • Call the user level functions in the document parser.
    Done.
    If a handler returns NULL, the node is removed from the tree (or rather, never added).
    Pass in additional information.
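
    The NULL convention above can be illustrated with a toy tree of plain
    R lists. This is only a sketch of the semantics: node() and
    applyHandlers() are hypothetical helpers, not the package's
    implementation, and a node here is just list(name, children).

```r
# Hypothetical sketch in plain R lists; it does not use the XML package.
# A node is list(name = <string>, children = <list of nodes>).
node <- function(name, ...) list(name = name, children = list(...))

applyHandlers <- function(n, handlers) {
  # Process children first, dropping any that a handler turned into NULL.
  kids <- lapply(n$children, applyHandlers, handlers = handlers)
  n$children <- Filter(Negate(is.null), kids)
  # Look up a handler by node name; nodes without one pass through as-is.
  h <- handlers[[n$name]]
  if (is.null(h)) n else h(n)
}

tree <- node("doc",
             node("comment"),
             node("para", node("comment")),
             node("para"))

# A handler that returns NULL means "do not add this node to the tree".
handlers <- list(comment = function(n) NULL)

res <- applyHandlers(tree, handlers)
childNames <- vapply(res$children, function(k) k$name, character(1))
print(childNames)
```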

  • Duncan Temple Lang <duncan@wald.ucdavis.edu>
    Last modified: Thu Apr 12 05:48:24 PDT 2007