The xml package (XML and HTML parsing)

Introduction

This page describes the XML parser in the library. The parser takes its source input as a string, and produces as output a hierarchical tree structure representing the document. There are also formatter classes that take the document structure and produce a string representation of the document as output.

Strings are used for input and output, rather than files, for two reasons. Firstly, it is more flexible - a file can be turned into a string easily, but not vice versa (without using a temporary file). Secondly, it allows the parser to use the built-in string scanning functions, which increase parse speed. The downside of using strings is that the whole document has to be read into memory prior to parsing, so the parser may not be suitable for very large documents.

The XML parser

The parser is contained in the class xml.XmlParser, which has a parse method for parsing a string. It returns an xml.XmlDocument object, or fails if the input was not well-formed. Here is an example program.

import
   io(stop, write),
   lang(to_string),
   xml(XmlParser)

procedure main()
   local p,s,d
   p := XmlParser()
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write(to_string(d, 3))
end

If you have a file you need to parse, just load it into a string first using the io.Files.file_to_string() method.
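
As a minimal sketch of this, the following program reads the file named by its first command-line argument and parses it. It only uses calls already mentioned on this page; taking the filename from the argument list, and the wording of the error messages, are assumptions made for illustration :-

```objecticon
import
   io(Files, stop, write),
   lang(to_string),
   xml(XmlParser)

procedure main(a)
   local p, s, d
   # Load the whole file into a string; stop if it can't be read
   # (or if no filename was given).
   s := Files.file_to_string(a[1]) | stop("Couldn't read file:", &why)
   p := XmlParser()
   # Parse the string exactly as in the example above.
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write(to_string(d, 3))
end
```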

The `XmlDocument` and `XmlElement` classes

As noted above, the parser returns an XmlDocument instance. XmlDocument has methods to inspect the result of the parsing. Most notably, the method get_root_element() will return the xml.XmlElement which is the root of the element structure :-

e := d.get_root_element()

The XmlElement class has methods which enable you to search the children of the element, and to inspect the element attributes. For example, say that the element e represented the following structure :-

<top a1="val1" a2="val2">
   <inner a3="val3">
       Some text
   </inner>
</top>

Then

e.get_name()  # "top"
e.get_attribute("a1") # "val1"
e.get_attribute("a2") # "val2"
e.get_attribute("absent") # fails
f := e.search_children("inner")  # sets f to another XmlElement, representing
                                 # inner.  If there were several inner elements,
                                 # it would suspend them in sequence.
e.search_children("absent") # fails
f.get_string_content() # "        Some text        "
f.get_trimmed_string_content() # "Some text"

Please see the API documentation for more details.

Errors and warnings

The parser fires three types of event during parsing :-

XmlParser.WARNING_EVENT - a warning message
XmlParser.VALIDITY_ERROR_EVENT - a message indicating a validity error
XmlParser.FATAL_ERROR_EVENT - a message indicating a fatal error, i.e. the string being parsed is not a well-formed XML document. Only one such message is ever fired.

A fatal error causes the parse method to fail. A validity error won’t cause the parse method to fail, because a well-formed XML document can still be constructed and used. However, the parser counts the validity errors, and this count can be accessed in the XmlDocument’s validity_errors field. Warnings can safely be ignored - they just indicate that the source document could be improved in certain respects.
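
The difference between the two kinds of error can be sketched as follows. This is only an illustration built from the calls described on this page; the sample document is made up, and no particular count is claimed for it :-

```objecticon
import
   io(stop, write),
   xml(XmlParser)

procedure main()
   local p, s, d
   p := XmlParser()
   # A made-up sample: well-formed, but with no DTD declaring its attributes.
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><top a1=\"val1\"></top>"
   # parse() fails only on a fatal error.
   d := p.parse(s) | stop("Couldn't parse:", &why)
   # Validity errors don't stop the parse; they are just counted.
   write("Validity errors:", d.validity_errors)
end
```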

Here is an example that listens for and prints out events from the parser. Note that the input is deliberately not well-formed: the final closing tag is missing its '>'.

import
   io(stop, write),
   lang(to_string),
   xml(XmlParser)

procedure eh(p, s, type)
   write(type,":",to_string(p, 3))
end

procedure main()
   local p,s,d
   p := XmlParser()
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?>_
         <top a1=\"val1\" a2=\"val2\">_
           <inner a3=\"val3\">_
              Some text    _
           </inner>_
         </top"
   p.connect(eh)
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write("Successfully parsed")
end

The output is (in part) :-

validity error:object xml.ProblemDetail#1(
   stack=
      list#12[
         object xml.Diversion#3(id="input";subject="<?xml version ... </top";pos=63)
      ]
   msg="top has attributes but none were declared"
)

... more validity errors ...

fatal error:object xml.ProblemDetail#4(
   stack=
      list#106[
         object xml.Diversion#7(id="input";subject="<?xml version ... </top";pos=107)
      ]
   msg="'>' expected"
)
Couldn't parse:'>' expected

Note how each message detail is encapsulated in an xml.ProblemDetail instance.

Validation

The parser’s validation process can be turned off if desired, using the set_validate() method. This will increase parser speed, but may affect the result in terms of whitespace (see below).
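
A minimal sketch of turning validation off is shown below. It assumes that set_validate() takes a flag in the same way as the set_do_namespaces(&no) call shown in the Namespaces section; please check the API documentation for the exact signature :-

```objecticon
import
   io(stop, write),
   lang(to_string),
   xml(XmlParser)

procedure main()
   local p, s, d
   p := XmlParser()
   # Assumption: &no switches validation off, by analogy with
   # set_do_namespaces(&no).
   p.set_validate(&no)
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
   # Well-formedness is still checked, so parse() can still fail.
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write(to_string(d, 3))
end
```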

External entity resolution

During parsing, the parser sometimes needs to resolve external entities. Typically this is when an external DTD needs to be loaded, as in the doctype declaration

<!DOCTYPE web-app
  PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN"
  "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">

In order to resolve this, and obtain the external data, the parser uses xml.Resolver. A custom resolver class can be used as follows :-

import xml

class MyResolver(Resolver)
   public override resolve(base, external_id)
      local s, t
      s := external_id.get_public_id() # eg -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
      t := external_id.get_system_id() # eg http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
      # Do something with t and s, to return a string representing the external entity
   end
end

...

p.set_resolver(MyResolver())

The parser uses a default resolver, xml.DefaultResolver, which should be sufficient for normal purposes. It resolves system ids beginning with file:// and http:// locally and over the network respectively. If the system id doesn’t begin with either of those strings, then it is treated as a local file path.

Formatted output

Output of document structures is done using a set of formatter classes. For XML documents, use the xml.XmlFormatter class :-

r := RamStream()          # A Stream to capture the result
f := XmlFormatter(r)      # Create a formatter outputting to the stream
f.format(doc)             # doc is the XmlDocument to format
s := r.done()             # Get the string from the stream
write(s)

The formatter can use any io.Stream as its output destination; an opened file would be a typical alternative to the RamStream used above. The default output destination, if no parameter is given to the formatter’s constructor, is FileStream.stdout.

Various options can be set on the formatter; for example

f.set_text_trim()
f.set_indent(3)

will output all the text content with whitespace trimmed, and all elements formatted with an indent of 3 spaces.
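
Putting the pieces of this section together, a complete program might look like the following sketch. It uses only the calls shown above; the sample document is made up for illustration :-

```objecticon
import
   io(RamStream, stop, write),
   xml(XmlParser, XmlFormatter)

procedure main()
   local p, d, r, f
   p := XmlParser()
   # A made-up sample document with some untrimmed text content.
   d := p.parse("<top a1=\"val1\"><inner>   Some text   </inner></top>") |
      stop("Couldn't parse:", &why)
   r := RamStream()          # A Stream to capture the result
   f := XmlFormatter(r)      # Create a formatter outputting to the stream
   f.set_text_trim()         # Trim whitespace from text content
   f.set_indent(3)           # Indent elements by 3 spaces
   f.format(d)
   write(r.done())           # Get the string from the stream and print it
end
```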

Namespaces

Namespaces are fully supported as a post-processing step to normal parsing. This is on by default, but can be turned off by using p.set_do_namespaces(&no). Assuming namespaces are being processed, the XmlElement class has extra methods which can be used to find child elements and attributes based on the global name, which is a pair of a URI and a local name. A global name is represented by an xml.GlobalName instance, which can be created with something like :-

gn := GlobalName("Local", "http://schemas.xmlsoap.org/soap/envelope/")

This global name can then be used to select the element “Prefix:Local” in the following example :-

<parent xmlns:Prefix="http://schemas.xmlsoap.org/soap/envelope/">
   <Prefix:Local attr="123"/>
</parent>

The selection methods for elements and attributes using global names can be found in the XmlElement class. Please see the API docs for full details.

Test suite

To test the parser, there is a script dotests.sh in the distribution directory which runs about 1700 test documents through the parser. The test documents come from various sources, and fall into one of three categories :-

Tests for non well-formed documents. These documents are not well formed and should cause the parse() method to fail. See testnotwf.
Tests for well-formed, invalid documents. These documents are well formed (ie they should parse), but have validity errors. See testvalid.
Tests for well-formed, valid documents. These documents are error-free in all respects, and should be parsed as such. See testvalid.

There are three or four instances (all from one test suite) where I can’t agree with their definition of what is and isn’t well-formed/invalid. The XML spec can be maddeningly vague in some respects, so I am probably just not interpreting it right.

There are also a very small number of cases where a well formed but invalid document is not reported as invalid. There are no cases where a valid document is reported as invalid, or a well-formed document will not parse.

All the tests that cause the parser problems are commented out, with explanatory comments, in the dotests.sh file.

Testxml

This is a small program which can be used to see what the parser does with a particular document. Given an input filename, testxml will parse the document and output various sections showing the formatted version of the document, a complete display of the document’s structure, and the document’s constraints read from the DTD.

Just run “testxml -?” for a list of options.

HTML parsing

The HTML parser, xml.HtmlParser, uses input and output document structures similar to those of the XML parser. However, it is much simpler than the XML parser.

The parser

Code to create and use the HTML parser follows a similar form to the example for XML shown above :-

import
   io(write),
   lang(to_string),
   xml(HtmlParser)

procedure main()
   local p, s, d
   p := HtmlParser()
   s := "<html lang=\"en\">_
           Some text    _
           <p>_
           Some more_
         </html>"
   d := p.parse(s)
   write(to_string(d, 3))
end

The parse method returns an xml.HtmlDocument object, which can then be inspected as needed.

In contrast to the XML parser, the parse method will never fail. In other words, even if something that isn’t remotely like HTML is given as input, it will still try to make sense of it. This is in recognition of the fact that much HTML out on the Web is malformed! A fussy parser would be of little use.

The `HtmlDocument` and `HtmlElement` classes

As noted above, the parser returns an HtmlDocument instance. This works in a very similar way to the XmlDocument class used for XML parsing; in fact the two classes share a common base class. The most important method is get_root_element(), which will return the [xml.HtmlElement](http://objecticon.sourceforge.net/libref/index.html?xml.HtmlElement.html) which is the root of the element structure :-

    e := d.get_root_element()

HtmlElement is also related to XmlElement by way of a common base class, and they have the same methods for inspecting attributes and child elements. So, for example, say that the element e represented the following structure :-

    <html lang="en">
       Some html text
       <p>
       Some more
    </html>

Then

    e.get_name()                   # "HTML"
    e.get_attribute("LANG")        # "en"
    f := e.search_children("P")    # sets f to another HtmlElement, representing the <p>.
                                   # This has one child, namely the text content
                                   # between the <p> and the </html>
    e.search_children("absent")    # fails
    f.get_string_content()         # returns "        Some more        "
    f.get_trimmed_string_content() # returns "Some more"

Note that all element names and attribute names are converted to upper case during parsing (e.g. “html” -> “HTML”).

The search_tree method of the Element class is very useful if you want to get at an Element deep within the document. Please see the API documentation for more details.

Formatted output

Output of an HTML document is done with the xml.HtmlFormatter class, which again shares a common base class with its XML equivalent, XmlFormatter. For example :-

r := RamStream()          # A Stream to capture the result
f := HtmlFormatter(r)
f.format(d)
s := r.done()             # Get the string from the stream
write(s)

Testhtml

This is a small program which can be used to see what the parser does with a particular document. Given an input filename, testhtml will parse the document and output the formatted equivalent, and a complete display of the document’s structure.
