This page describes the XML parser in the library. The parser takes its source input as a string, and produces as output a hierarchical tree structure representing the document. There are also formatter classes that take the document structure and produce a string representation of the document as output.
Strings are used for input and output, rather than files, for two reasons. Firstly, it is more flexible - a file can easily be turned into a string, but not vice versa (without using a temporary file). Secondly, it allows the parser to use the built-in string scanning functions, which increase parse speed. The downside of using strings is that the whole document has to be read into memory prior to parsing, so the parser may not be suitable for very large documents.
The parser is contained in the class xml.XmlParser, which has a parse method for parsing a string. It returns an xml.XmlDocument object, or fails if the input was not well-formed. Here is an example program.
    import io(stop, write), lang(to_string), xml(XmlParser)

    procedure main()
       local p, s, d
       p := XmlParser()
       s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
       d := p.parse(s) | stop("Couldn't parse:", &why)
       write(to_string(d, 3))
    end
If you have a file you need to parse, just load it into a string first using the
As noted above, the parser returns an XmlDocument instance. XmlDocument has methods to inspect the result of the parsing. Most notably, the method get_root_element() will return the xml.XmlElement which is the root of the element structure :-
e := d.get_root_element()
The XmlElement class has methods which enable you to search the children of the element, and to inspect the element attributes. For example, say that the element e represented the following structure :-
<top a1="val1" a2="val2"> <inner a3="val3"> Some text </inner> </top>
    e.get_name()                     # "top"
    e.get_attribute("a1")            # "val1"
    e.get_attribute("a2")            # "val2"
    e.get_attribute("absent")        # fails
    f := e.search_children("inner")  # sets f to another XmlElement, representing
                                     # inner. If there were several inner elements,
                                     # it would suspend them in sequence.
    e.search_children("absent")      # fails
    f.get_string_content()           # " Some text "
    f.get_trimmed_string_content()   # "Some text"
Please see the API documentation for more details.
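Because search_children() suspends matching children in sequence, it can be driven with `every`. The following is a minimal sketch that uses only the methods already shown on this page; the input string and variable names are made up for illustration.

```icon
import io(stop, write), xml(XmlParser)

procedure main()
   local p, d, e, f
   p := XmlParser()
   # Illustrative input with two "inner" children.
   d := p.parse("<top><inner>one</inner><inner>two</inner></top>") |
      stop("Couldn't parse:", &why)
   e := d.get_root_element()
   # search_children() is a generator, so every visits each match in turn.
   every f := e.search_children("inner") do
      write(f.get_trimmed_string_content())
end
```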
The parser fires three types of event during parsing :-
XmlParser.WARNING_EVENT - a warning message
XmlParser.VALIDITY_ERROR_EVENT - a message indicating a validity error
XmlParser.FATAL_ERROR_EVENT - a message indicating a fatal error, i.e. the string being parsed is not a well-formed XML document. Only one such message is ever fired.
A fatal error means that the parse method will fail. A validity error won't cause the parse method to fail, because a well-formed XML document can still be constructed and used. However, the parser counts the number of validity errors, and this count can be accessed in the validity_errors field. Warnings can safely be ignored - they just indicate that the source document could be improved in certain respects.
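As a rough sketch of how the distinction between fatal and validity errors might be used in practice, the program below parses a string and then checks the count. It assumes that validity_errors is a readable field of the parser object; the input string is made up for illustration.

```icon
import io(stop, write), xml(XmlParser)

procedure main()
   local p, d
   p := XmlParser()
   # parse() only fails on a fatal error; validity errors are merely counted.
   d := p.parse("<top a1=\"val1\"></top>") | stop("Couldn't parse:", &why)
   # Assumption: validity_errors is a readable field of the parser object.
   if p.validity_errors > 0 then
      write("Well-formed, but with ", p.validity_errors, " validity error(s)")
   else
      write("Well-formed, no validity errors counted")
end
```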
Here is an example that listens for and prints out events from the parser.
    import io(stop, write), lang(to_string), xml(XmlParser)

    procedure eh(p, s, type)
       write(type, ":", to_string(p, 3))
    end

    procedure main()
       local p, s, d
       p := XmlParser()
       s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?>_
             <top a1=\"val1\" a2=\"val2\">_
             <inner a3=\"val3\">_
             Some text _
             </inner>_
             </top"
       p.connect(eh)
       d := p.parse(s) | stop("Couldn't parse:", &why)
       write("Successfully parsed")
    end
The output is (in part) :-
    validity error:object xml.ProblemDetail#1(
       stack=
          list#12[
             object xml.Diversion#3(id="input";subject="<?xml version ... </top";pos=63)
          ]
       msg="top has attributes but none were declared"
    )
    ... more validity errors ...
    fatal error:object xml.ProblemDetail#4(
       stack=
          list#106[
             object xml.Diversion#7(id="input";subject="<?xml version ... </top";pos=107)
          ]
       msg="'>' expected"
    )
    Couldn't parse:'>' expected
Note how each message detail is encapsulated in an xml.ProblemDetail object.
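A handler does not have to print every event. The sketch below reports only the fatal error, reusing the handler signature from the example above. It assumes that the event type passed to the handler can be compared with the XmlParser.FATAL_ERROR_EVENT constant using `==`; the procedure name fatal_only is made up for illustration.

```icon
import io(stop, write), lang(to_string), xml(XmlParser)

# Same parameter list as the eh procedure in the example above.
procedure fatal_only(p, s, type)
   # Assumption: the event type is comparable with the constant using ==.
   if type == XmlParser.FATAL_ERROR_EVENT then
      write("fatal:", to_string(p, 3))
end

procedure main()
   local p, d
   p := XmlParser()
   p.connect(fatal_only)
   # The closing tag is deliberately left unterminated.
   d := p.parse("<top></top") | stop("Couldn't parse:", &why)
   write("Successfully parsed")
end
```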
The parser's validation process can be turned off if desired, using the set_validate() method. This will increase parser speed, but may affect the result in terms of whitespace (see below).
During parsing, the parser sometimes needs to resolve external entities. Typically this is when an external DTD needs to be loaded, as in the doctype declaration
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN" "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">
In order to resolve this, and obtain the external data, the parser uses xml.Resolver. A custom resolver class can be used as follows :-
    import xml

    class MyResolver(Resolver)
       public resolve(external_id)
          local s, t
          s := external_id.get_public_id()   # eg -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
          t := external_id.get_system_id()   # eg http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
          # Do something with t and s, to return a string representing the external entity
       end
    end

    ...

    p.set_resolver(MyResolver())
The parser uses a default resolver,
xml.DefaultResolver, which should be sufficient for normal purposes. It resolves system ids beginning with
http:// locally and over the network respectively. If the system id doesn't begin with either of those strings, then it is treated as a local file path.
Output of document structures is done using a set of formatter classes. For XML documents, use the xml.XmlFormatter class :-
    f := XmlFormatter()
    s := f.format_to_string(d)
    write(s)
Note how the format_to_string() method returns a string from a document, so it is really the reverse process of a parser. The formatter can also output directly to an io.Stream object - see the `` method.
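To show the parser and formatter working as inverse steps, here is a minimal round-trip sketch. It uses only XmlParser.parse() and XmlFormatter.format_to_string() as described on this page; the input string is made up for illustration.

```icon
import io(stop, write), xml(XmlParser, XmlFormatter)

procedure main()
   local p, d, f
   p := XmlParser()
   # String in: parse into a document structure.
   d := p.parse("<simple>text</simple>") | stop("Couldn't parse:", &why)
   # String out: format the document structure back into text.
   f := XmlFormatter()
   write(f.format_to_string(d))
end
```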
Various options can be set on the formatter; for example
will output all the text content with whitespace trimmed, and all elements formatted with an indent of 3 spaces.
Namespaces are fully supported as a post-processing step to normal parsing. This is on by default, but can be turned off by using p.set_do_namespaces("n"). Assuming namespaces are being processed, the XmlElement class has extra methods which can be used to find child elements and attributes based on the global name, which is a pair of a URI and a local name. A global name is represented by an xml.GlobalName instance, which can be created with something like :-
gn := GlobalName("Local", "http://schemas.xmlsoap.org/soap/envelope/")
This global name can then be used to select the element "Prefix:Local" in the following example :-
    <parent xmlns:Prefix="http://schemas.xmlsoap.org/soap/envelope/">
       <Prefix:Local attr="123"/>
    </parent>
The selection methods for elements and attributes using global names can be found in the XmlElement class. Please see the API docs for full details.
To test the parser, there is a script dotests.sh in the distribution directory which runs about 1700 test documents through the parser. The test documents come from various sources, and fall into one of three categories :-
parse()method to fail. See
There are three or four instances (all from one test suite) where I can't agree with their definition of what is and isn't well-formed/invalid. The XML spec can be maddeningly vague in some respects, so I am probably just not interpreting it right.
There are also a very small number of cases where a well formed but invalid document is not reported as invalid. There are no cases where a valid document is reported as invalid, or a well-formed document will not parse.
All these tests which cause the parser problems are commented out with an appropriate commentary in the
This is a small program which can be used to see what the parser does with a particular document. Given an input filename, testxml will parse the document and output various sections showing the formatted version of the document, a complete display of the document's structure, and the document's constraints read from the DTD.
Just run testxml with no arguments for a list of options.
The HTML parser, xml.HtmlParser, uses similar document structures for input and output as the XML parser. However, it is much simpler than the XML parser.
Code to create and use the HTML parser follows a similar form to the example for XML shown above :-
    import io(write), lang(to_string), xml(HtmlParser)

    procedure main()
       local p, s, d
       p := HtmlParser()
       s := "<html lang=\"en\">_
             Some text _
             <p>_
             Some more_
             </html>"
       d := p.parse(s)
       write(to_string(d, 3))
    end
The parse method returns an xml.HtmlDocument object, which can then be inspected as needed.
In contrast to the XML parser, the parse method will never fail. In other words, even if something that isn't remotely like HTML is given as input, it will still try to make sense of it. This is in recognition of the fact that much HTML out on the Web is malformed, so a fussy parser would be of little use.
As noted above, the parser returns an HtmlDocument instance. This works in a very similar way to the XmlDocument class used for XML parsing; in fact the two classes share a common base class. The most important method is get_root_element(), which will return the [xml.HtmlElement](http://objecticon.sourceforge.net/libref/index.html?xml.HtmlElement.html) which is the root of the element structure :-
e := d.get_root_element()
HtmlElement is also related to XmlElement by way of a common base class, and they have the same methods for inspecting attributes and child elements. So, for example, say that the element e represented the following structure :-
<html lang="en"> Some html text <p> Some more </html>
    e.get_name()                     # "HTML"
    e.get_attribute("LANG")          # "en"
    f := e.search_children("P")      # sets f to another HtmlElement, representing the <p>.
                                     # This has one child, namely the text content
                                     # between the <p> and the </html>
    e.search_children("absent")      # fails
    f.get_string_content()           # returns " Some more "
    f.get_trimmed_string_content()   # returns "Some more"
Note that all element names and attribute names are capitalized during parsing (eg "html"->"HTML").
The search_tree method of the Element class is very useful if you want to get at an Element deep within the document. Please see the API documentation for more details.
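Putting the HTML pieces together, the following is a minimal sketch of a complete program that parses the sample document and inspects it. It uses only the methods shown above, and relies on names being capitalized during parsing as described.

```icon
import io(write), xml(HtmlParser)

procedure main()
   local p, d, e, f
   p := HtmlParser()
   # The HTML parser never fails, so no alternative is needed after parse().
   d := p.parse("<html lang=\"en\"> Some html text <p> Some more </html>")
   e := d.get_root_element()
   # Element and attribute names are capitalized by the parser.
   write(e.get_name(), " lang=", e.get_attribute("LANG"))
   if f := e.search_children("P") then
      write("P content: ", f.get_trimmed_string_content())
   else
      write("No P element found")
end
```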
Output of an HTML document is done with the xml.HtmlFormatter class, which again shares a common base class with its XML equivalent, XmlFormatter. For example :-
    f := HtmlFormatter()
    s := f.format_to_string(d)
    write(s)
This is a small program which can be used to see what the parser does with a particular document. Given an input filename, testhtml will parse the document and output the formatted equivalent, and a complete display of the document's structure.