ncyoung.com




regular expression for closing image tags

You can use this regular expression to find image tags that are not self closing and make them self closing. It will change:

<img src="spacer.gif" height="1" width="1">

to:

<img src="spacer.gif" height="1" width="1"/>

If you have:

<img src="spacer.gif" height="1" width="1"></img>

It will STILL change it to:

<img src="spacer.gif" height="1" width="1"/></img>

So in that case you can't just replace all.

It should be easy to re-purpose this for other tags that need to be made self closing.

The find term is:

<img ([^>]*)([^/])>

The replace term is:

<img \1\2/>
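
Here is a minimal sketch of the same find and replace in Python's re module, in case you'd rather script it than use an editor. The function names and the extra pattern for explicit </img> end tags are my own assumptions, not part of the original terms.

```python
import re

# Find and replace terms copied from the post above.
FIND = re.compile(r'<img ([^>]*)([^/])>')
REPLACE = r'<img \1\2/>'

# Hypothetical extra pattern: an img start tag followed by an explicit </img>.
EXPLICIT_END = re.compile(r'(<img [^>]*>)\s*</img>')


def close_img_tags(markup):
    """Apply the find/replace pair as-is."""
    return FIND.sub(REPLACE, markup)


def close_img_tags_safely(markup):
    """Drop explicit </img> end tags first, then apply the find/replace pair."""
    return close_img_tags(EXPLICIT_END.sub(r'\1', markup))


if __name__ == "__main__":
    print(close_img_tags('<img src="spacer.gif" height="1" width="1">'))
    print(close_img_tags('<img src="spacer.gif" height="1" width="1"></img>'))
    print(close_img_tags_safely('<img src="spacer.gif" height="1" width="1"></img>'))
```

Tags that are already self closing are left alone, because the second group refuses to match a slash right before the closing bracket.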

XHTML Appendix

Michael Kay mentioned this with regard to XSLT 2.0's "XHTML" output mode, in response to the question "why have it at all?"

http://www.w3.org/TR/xhtml1/guidelines.html

My thought going in was that most of the demands for special dispensation were made by older browsers and are on the way to being irrelevant (like space before the slash in self closing br's) and for some things that's true.

Section C.4 raised some questions for me. Are there XML-based user agents or is this an as-yet hypothetical audience? Can parsers really throw out comments? I didn't think they could... certainly you can read comments in a DOM or XSL model...

C.8 was new to me and useful going forward...

C.11 is hilarious and disturbing at the same time. Add "the past" to "the end user" in the list of things it would be easier to ignore when building software.

Nathan's Markup Language (NML)

Introduction


The most optimistic of souls could characterize XML's satisfaction of the array of business needs for which it was conceived as lackluster at best. I myself feel comfortable characterizing it as a complete failure. This afternoon I sat down and created its replacement, and I'm glad I finally took the time to do so, because the whole technical community will benefit.

In the spirit of many recently named "vanity algorithms" I've named my new markup language after myself. I think this language will be enshrouded in history's chronicles and I as its creator deserve to have my name go down with it. This is in stark contrast to the obscure XSL hacks named after mailing list contributors, or worse yet the CSS hacks people named after themselves (and here I have to say WAKE UP!! do you realize you're basically naming a freaking BROWSER BUG after yourself???)

[ahem]

Realities addressed



XML has failed on the following counts, all of which NML addresses:
  • validation
  • namespaces
  • escaping
  • internationalization


version



Since I have so carefully thought through all the possible ramifications of the sections below, I'm starting the version numbering of this spec directly at 1.1.

Update: People have pointed out some minor problems with the specification below. Although they have all been easily addressed so far, I'm a little less confident that this spec is final. Therefore I'm resetting the current version of the spec to 0.85.

tag delimiters



Curly brackets ("{" and "}") are used to wrap NML element tags. Curly brackets are obviously more cool and complex than square brackets ("square" brackets... need more be said???) and infinitely better than "<" and ">". Why? Because "<" and ">" are not matched parentheticals at all!!! They are the mathematical symbols for FREAKING GREATER THAN AND LESS THAN!!! Markup community please GET A CLUE!!!!

If your input has curly brackets in it you may escape them by replacing them with round brackets ("(" and ")").

end tags



End tags are the same for all elements, namely "{/}". There is no rigorous reason to name end tags since NML (like XML) does not allow for tag overlapping. Any language "feature" that is included simply to ease debugging is clearly for wimps and anyone who suggests it should be scorned.

self closing tags


When you have an empty element like {person}{/} you can omit the first bracket of the end tag to save on typing, thus creating a self closing tag like this: {person}/}
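
The delimiter, end tag and self closing rules so far are enough to sketch a toy parser. This is only one reading of those rules, written in Python: the function name parse_nml and the (name, children) tuple representation are made up for illustration, attributes are left as part of the tag name, and none of the escaping rules are handled.

```python
import re

# One token per match: the universal end tag, a self closing tag,
# an opening tag, or a run of plain text.
TOKEN = re.compile(
    r'(?P<end>\{/\})'
    r'|\{(?P<empty>[^{}/]*)\}/\}'
    r'|\{(?P<open>[^{}/]*)\}'
    r'|(?P<text>[^{}]+)'
)


def parse_nml(source):
    """Return the top-level nodes as (name, children) tuples and text strings."""
    root = ("#document", [])
    stack = [root]
    for match in TOKEN.finditer(source):
        if match.group("end"):
            if len(stack) == 1:
                raise ValueError("end tag {/} with nothing left to close")
            stack.pop()
        elif match.group("empty") is not None:
            # Self closing tag such as {person}/}
            stack[-1][1].append((match.group("empty"), []))
        elif match.group("open") is not None:
            element = (match.group("open"), [])
            stack[-1][1].append(element)
            stack.append(element)
        else:
            stack[-1][1].append(match.group("text"))
    if len(stack) != 1:
        raise ValueError("document ended with unclosed elements")
    return root[1]


if __name__ == "__main__":
    print(parse_nml("{people}{person}/}{name}Nathan{/}{/}"))
```

Because every end tag is the same, the parser can only tell that something was closed, never what; a mismatched document fails by count alone.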

ASCII escaping


NML supports the full ASCII character set. ASCII characters can either be represented directly by the character or encoded using the ASCII numerical code point preceded by a dollar sign. So for example an exclamation point can either be typed as "!" or encoded as $33. If you need to have a dollar sign in your input document you should escape it by replacing it with %. When using ASCII characters above 127, use the ASCII dot notation to indicate the offset from the standard 127 ASCII ending point. For example, to represent the ASCII character at 138, you'd use "$127.11".

Internationalization: the scribble element


The special element {scribble}/} is used when questions about NML support for high bit character sets come up. {scribble}/} will be replaced by random scribbling in the formatted output. If you use this generously in some documents, "normal" users will be convinced your application can support fancy languages like Icelandic, Sanskrit, Japanese and French.

case sensitivity


Content in NML documents is not case sensitive and case may or may not be preserved in output. You can force upper case by preceding a character with a forward slash and lower case by preceding a character with a backslash. If you want to include a slash in your document without affecting the case of the following letter, precede the slash with a forward slash, since there is no such thing as an upper case slash.

Letters in element names are case sensitive, with the exceptions of p, q, h and m, which are not case sensitive.

other escaping requirements


~ is a reserved character in NML and should always be preceded by the word "home".

validation



Anyone who has used DTDs, schemas, RELAX NG or Schematron can tell you that validation for XML has utterly failed. In fact the whole idea of strong typing is questionable to start with. You should know what kind of data you have and you should communicate that directly to your users. Validation does not replace communication and in fact it is a crutch for weak business processes.

Experience with powerful and practical programming languages like Perl and Javascript further reinforces the fact that strong typing in general wastes programmer hours.

Update: People have been whining about the lack of support for validation in NML, so I'm adding the following validation support.

NML supports the most useful functionality of validation, while minimizing unnecessary complexity in the parsing layer and placing the burdens of validation where they belong, on the content author.

The mechanism for this is the optional special attribute "is-valid". This boolean attribute can hold the values "yes" and "no". If the element contents are valid, this attribute should be set to "yes". A "no" value is equivalent to leaving the attribute out. Blank values are ambiguous.

If every node in the document is valid, the "is-valid" attribute should be removed from all of them and replaced with a document level {is-valid}/} element.

namespaces


Document authors in NML are required to choose unique names for each element. This obviates the need for any namespacing mechanism, and I can't believe the creators of XML didn't think of it.

ordinals



Sometimes the document order may not reflect the true desired order of elements. The special attribute "ordinal" can be used to indicate true order of occurrence in these instances. For example, the element {people ordinal="3"}/} should be treated as the third item in the document.

output formatting


Formatting engines should support the special NML attribute "style-as". This allows document authors flexibility as to how their content will be formatted. For example, in the NML version of XHTML, {span style-as="div"} should be formatted as a div, while {b style-as="i"} should come out italic.

inclusion


NML's built in inclusion mechanisms are simple and powerful. The inclusion element is simply an empty self closing NML tag like so: {}/}. The first time this tag is encountered, the parser should build a list of files in the document's directory and all subdirectories. The file list should be alphabetized, and the contents of the first file substituted for the {}/} tag. The next time the tag is encountered, the second file in the list is used and so on. For performance reasons, all files in the list should be loaded and parsed at the first include tag encountered.

executable content


The % tag delimits executable content. Any text between % and % will be executed at parse time. Executable content can be written in any scripting language. The parser should run through all the interpreters installed on the client machine and try to execute the string with each of them, thus providing the greatest likelihood the commands will get executed at least once. Output from the command should be discarded or written to a numbered file in the user's temp directory.

An MS Outlook plug-in for this functionality is already available and runs immediately when the message is received.

security



Security problems are in general the problem of application developers and end users. Any user who can't secure their own machine should not be allowed to own a computer, and in fact should be taken out into the desert and forced to generate their own 256 bit MD5 keys using only an abacus with lifesavers for wheels while surrounded by ants. At night the ice weasels come.

performance



When coding parsers for NML, developers are encouraged to make them faster and simpler than the equivalent XML parsers. Of course there will be some developers who write poorly performing code but they will be sternly reprimanded by the creator of NML and (more importantly) censured by the vast NML user community.

killer apps



I'm working on translating my XML based csv replacement to use NML instead.

There are also rumours that the national polo league is going to use NML to represent information about national disasters.

using node-set() extension in XSL when using LibXSLT and perl

Simple!!

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ext="http://exslt.org/common">
<xsl:template match="/">
<xsl:variable name="v">my <p1/> is <p1/></xsl:variable>
number of p1s in var: <xsl:value-of select="count(ext:node-set($v)//p1)"/>
</xsl:template>
</xsl:stylesheet>

installing LibXSLT on windows using PPM

I run ActivePerl on Windows. ActivePerl's package installer (PPM) usually installs anything I need really easily, but when I went to install LibXSLT (and related libraries) it couldn't do it right off the bat. After a little digging it turned out to be quite easy to do.

First, the default repositories the PPM comes configured to use don't have the libraries in them. From the PPM command line add a new repository using the following command:
rep add xml-stuff http://theoryx5.uwinnipeg.ca/ppms/

Once you've done that the search and install commands you give to PPM will find LibXSLT and LibXML properly.

Depending on what you already have installed, LibXSLT may or may not have everything it needs to run. Sometimes PPM seemed to follow dependencies and other times it just told me what to install next and then quit(!?). In that case use the PPM search and install commands to continue getting the stuff you need.

Some of the libraries depend on dlls. For some reason on my system PPM was unable to put the dlls into the right place (it told me at the time) so I went and got them myself.

Manually putting these three files into my c:/perl/bin directory got me up and running.

http://theoryx5.uwinnipeg.ca/ppms/scripts/libexslt_win32.dll

http://theoryx5.uwinnipeg.ca/ppms/scripts/libxml2.dll

http://theoryx5.uwinnipeg.ca/ppms/scripts/libxslt_win32.dll

topic maps

Topic maps are used to model metadata. They support making statements about a resource (as identified by a URL) and modeling relationships between resources.

They also allow you to make statements about the statements and about the relationships.

To my eye the XML format for topic maps is a lot easier to understand and write than RDF statements. There are the beginnings of querying and browsing tools for that XML format, which makes choosing topic maps over home-grown formats attractive.

The TAO of Topic Maps is a good intro. The topic maps article at xml.com is not quite as good, but it is shorter and has XML samples in it.

Someone has made a topic map representing the diary of Samuel Pepys (view of the Samuel Pepys topic). Ontopia has a topic map browser with a demo map of opera (as in singing not browsing) related information.

I found documentation of a topic map query language, but no tool to test it out with unless I want to write Java classes.

alternative to CSV

Since it's best to use XML for everything, here's an alternate model for CSV files:

<csvFile>
value1<comma/>value2<comma/>value3<comma/>value4<newLine/>
value5<comma/>value6<comma/>value7<comma/>value8<newLine/>
value9<comma/>value10<comma/>value11<comma/>value12<newLine/>
</csvFile>
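
For anyone who needs to get their data back out, here is a rough sketch of a reader for this format using Python's xml.etree.ElementTree. The function name is invented, and it assumes the only child elements are comma and newLine, exactly as in the sample above.

```python
import xml.etree.ElementTree as ET


def csv_xml_to_rows(xml_text):
    """Turn a <csvFile> document into a list of rows (lists of strings)."""
    root = ET.fromstring(xml_text)
    rows, row = [], []
    field = root.text or ""          # text before the first separator
    for child in root:
        if child.tag == "comma":
            row.append(field.strip())
        elif child.tag == "newLine":
            row.append(field.strip())
            rows.append(row)
            row = []
        field = child.tail or ""     # text up to the next separator
    if field.strip() or row:         # last row without a trailing <newLine/>
        row.append(field.strip())
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        "<csvFile>\n"
        "value1<comma/>value2<newLine/>\n"
        "value3<comma/>value4<newLine/>\n"
        "</csvFile>"
    )
    print(csv_xml_to_rows(sample))
```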

And to kick matters up a notch:

Velocity XSL "alternative" DVSL

complexity vs simplicity

Complexity is a place one passes through while searching
a very crowded world of similar but different things.
It becomes simple when one can safely ignore the
differences and pick one. Complexity is a property
of the space of choices. Simplicity is a property of
the act of choosing.

-- Len Bullard on the xml-dev mailing list

XML serialization of java objects

SOAP provides Java to XML conversion for a transport layer. The XML used is pretty simple and specified as part of the SOAP protocol.

Different Java serialization schemes allow you to save the state of Java objects, again each in its own internal format.

Recently I ran into a requirement to convert arbitrary Java objects to Java DOM objects so that they could be transformed. I was told by the developers that this would involve writing lots of one-off, error-prone code, but I thought, given the above, that this problem could probably be solved in a pluggable way and in fact was very likely to have been addressed already.

A little digging turned up JOX, a nifty conversion layer between beans and XML. It will try to map XML elements to bean properties, and it will try to serialize beans as XML or convert them to DOM objects (just what the doctor ordered). If you have a DTD, you can use it to inform the way that the XML or DOM gets created from the bean. Nifty.

If your bean and XML do not already have some built in correspondence, you can't use JOX.

Quick is a considerably heavier-weight solution, but it allows you a lot more flexibility in the way you map XML to Java code.

Quick has its own binding language for mapping Java objects to XML, which lets you build custom mappings that convert arbitrary XML to arbitrary Java classes.

Update: Came upon Betwixt, a bean to XML mapping mechanism from the Apache project. It also uses a mapping specification to allow more flexibility than JOX, but has less complexity than Quick and no code generation step. Output only goes to text string or SAX versions though, and we wanted direct bean to DOM-object conversion.

wiki to xml - retroactive sgml

I was just reading an article on wiki to xml.

The thing that impressed me was that the author created an SGML document type based on a markup language (wiki) which didn't have one to begin with, one that I had never thought of as having been created with that in mind. I realized for the first time that SGML was flexible enough to encompass some fairly loosely defined markup methods that had never been intended to have a strict definition at all.

stupid xsl trick, XSL as validation language

Jan sent me to this gallery of stupid XSL tricks.

I like the idea of using XSL for validation, even just a couple sanity checks could really help.

XML jokes

There's an open challenge in the group I work with to come up with a joke phrased in mark-up. My two attempts:

<you're it>

(tag you're it) and

<td padding="5px">me</td>

(I'm in a padded cell)

Also see NML and XML format for csv files

numerical iteration in XSL

I keep running into this problem in XSL: how do you do something a certain number of times?

The first answer is that you recurse. Ok, there's a place for recursion, but not everywhere.

So the next way is to find a set of nodes to iterate over. The classic version is:

<xsl:for-each select="(//*)[position() &lt;= Value]">

And some hilarious variations are here.

When we get XSL 2.0 it looks like we'll have some much better options.

finding processing instructions in the DOM

I can't seem to find any really good documentation on Perl's DOM interface, or any generic DOM interface documentation that would serve instead.

So it took me some time to figure out how to get processing instructions from an XML DOM. (Processing instructions are those tags enclosed in <? ... ?>. XMLSpy uses one to point to an XSL stylesheet from an XML instance)

I found Java documentation for a getProcessingInstruction method, but Perl (XML::LibXML) didn't seem to have a counterpart. Getting a list of DOM nodes omitted processing instructions.

I eventually found an xpath function processing-instruction() that returns a list of processing instructions. Used with the DOM method findnodes() it worked great.

It seems like processing instructions should almost be returned from the parsing process? It seemed counter-intuitive that information in the XML instance that was expressly placed there for use by a processing application should be so hard to get to from the application itself!
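
For comparison, here is a small sketch of the same hunt in Python's standard xml.dom.minidom rather than Perl. It walks the tree and collects every node whose type is a processing instruction. The function name, the sample document and the stylesheet name books.xsl are all made up for the example.

```python
from xml.dom import Node, minidom

SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="books.xsl"?>\n'
    '<Books><?sort-hint by="Author"?><Book/></Books>'
)


def processing_instructions(xml_text):
    """Return (target, data) pairs for every processing instruction found."""
    document = minidom.parseString(xml_text)
    found = []

    def walk(node):
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
        for child in node.childNodes:
            walk(child)

    walk(document)
    return found


if __name__ == "__main__":
    for target, data in processing_instructions(SAMPLE):
        print(target, data)
```

The XML declaration on the first line is not a processing instruction, so it does not show up in the list.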

using a variable or parameter as sort order criteria in XSL

I had a lot of frustration trying to use a global parameter (passed into the stylesheet from Perl) to determine the sort order in an xsl:sort element.

Example: Applying templates to output books sorted by author might look like this:

<xsl:apply-templates select="Book">
<xsl:sort select="Author"/>
</xsl:apply-templates>

If you want to pass in a parameter (or use a variable for that matter) so that you could choose whether to sort by author, title, or date, you might think you could use the variable $orderBy in the sort like so:

<xsl:apply-templates>
<xsl:sort select="$orderBy"/>
</xsl:apply-templates>

You can't!!

You can, however, work around it like this:

<xsl:apply-templates>
<xsl:sort select="*[name()=$orderBy]"/>
</xsl:apply-templates>

I kind of understand why, but not well enough to explain. I'll try to update this post later on.
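
A likely explanation: the select expression is evaluated once per node to produce that node's sort key, and $orderBy on its own evaluates to the same string (say "Author") for every node, so every key is identical and nothing moves. The workaround uses the string to pick a child element, whose value differs per node. The sketch below shows the same distinction as a Python analogy, not XSLT; the sample library and function names are invented.

```python
import xml.etree.ElementTree as ET

LIBRARY = ET.fromstring("""
<Library>
  <Book><Author>Zola</Author><Title>Germinal</Title></Book>
  <Book><Author>Austen</Author><Title>Persuasion</Title></Book>
  <Book><Author>Melville</Author><Title>Bartleby</Title></Book>
</Library>
""")


def sort_by_constant(books, order_by):
    # Analogue of select="$orderBy": every book gets the same key,
    # so the (stable) sort leaves document order untouched.
    return sorted(books, key=lambda book: order_by)


def sort_by_named_child(books, order_by):
    # Analogue of select="*[name()=$orderBy]": the key is the text of
    # the child element whose name matches the parameter.
    return sorted(books, key=lambda book: book.findtext(order_by, default=""))


def titles(books):
    return [book.findtext("Title") for book in books]


if __name__ == "__main__":
    books = LIBRARY.findall("Book")
    print(titles(sort_by_constant(books, "Author")))
    print(titles(sort_by_named_child(books, "Author")))
    print(titles(sort_by_named_child(books, "Title")))
```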

XSL documentation

I'm putting together a script that gathers all kinds of information about the XSL templates I have and the way that they are used in context of the entire system. One possible source of information could be code comments in the templates, so I've been looking for commenting conventions (along the lines of self-documenting code or even literate programming, though I'm unlikely to be willing to assume that much overhead)

Anyway, I found some discussion here. I like xsldoc best of anything I've found so far.

partitioned normal form

Working to figure out what partitioned normal form for XML data means to transforms, especially to/from relational databases.

The book I have defines partitioned normal form as: some set of the atomic attributes of an element can be used as a unique key to that element and all of the non-atomic attributes are in partitioned normal form themselves.

As a partial reflection of the relational concept of normalization, it's easy to see how this would support translating to and from a relational schema that reflected the same hierarchy as the XML document schema.

But PNF is supposed to help transform the data to other formats without loss of data. I can't find much info about it but I'm intrigued.
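
To make the definition concrete, here is a rough Python check that takes it at face value. It assumes nested data is modeled as lists of dicts, where a list value is a non-atomic attribute and anything else is atomic; that representation and the function name is_pnf are my own choices, not the book's. If any subset of the atomic attributes is a unique key then the full set is unique too, so the sketch only tests the full set.

```python
def is_pnf(relation):
    """relation: list of dict rows; list values are nested relations."""
    seen = set()
    for row in relation:
        # The atomic attributes of this row, as a hashable key.
        atomic = tuple(sorted(
            (name, value) for name, value in row.items()
            if not isinstance(value, list)
        ))
        if atomic in seen:
            return False             # two rows share all atomic values
        seen.add(atomic)
        # Every non-atomic attribute must be in PNF itself.
        for value in row.values():
            if isinstance(value, list) and not is_pnf(value):
                return False
    return True


if __name__ == "__main__":
    authors = [
        {"name": "Austen", "books": [{"title": "Emma"}, {"title": "Persuasion"}]},
        {"name": "Melville", "books": [{"title": "Bartleby"}]},
    ]
    print(is_pnf(authors))
```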

schematron is so cool

Schematron is a way to express rules that can be checked against an XML document to produce a report: either the document adheres to the rules or it doesn't.

It supports types of rules and constraints that other schema specification languages (like DTDs and XML schema) do not.

The amazing thing is how schematron uses XML and XSL in a clever, recursive way. All of the processing and validation is done via XSLTs.

A schematron schema is defined in an XML document. A properly formed schematron document can have the schematron XSL template applied to it, and the transformation will result in another XSL template.

This second template can then be used to transform the XML document to be validated. This transformation will result in a report detailing any problems or non-compliance.

schematron basic

tutorial

practical article from XML.com