Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

Professional Services

Second Signal

Presentations

Andrew's Blog

Support

SNTT (seriously geeky) - Some helpful @Formulas - clean up your XML, and convert from hex value strings to decimal numbers

By Andrew Pollack on 11/01/2006 at 04:40 PM EDT

In working to improve this blogsite, I found some challenges in creating good XML when the source data can be variable. The problem is that in formula language lacks easy tools for doing this. There is no simple XML transformation process that works the way @URLENCODE works. It sure would have been easier.

First off, as you put together your XML string, you want to make sure to replace the "reserved" characters that are valid ascii values but still can't be in your content. By these I mean the greater than and less than brackets, ampersand, quote, and apostrophe. You want to do this to each of the fields you'll build your string out of, so that these characters are transformed before take the next step of scrubbing any high order characters from the data. I prefer to make the replacement lists in one place, as a constant at the top of the formula, as then may get used quite a bit.

xmlfrom := "<":">":"'":"&" : "\"" ;
xmlto := "<" : ">" : "'" : "&" : """;
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );

That was the easy part. What's harder is scrubbing your XML for any characters not allowed under the UTF-8 encoding schema. Basically, that means anything with unicode value above 127. The one that was messing up this blogsite was "é" from the name of a person in another story. That character comes out to a unicode value of 0xE9 (or 233 in decimal).

Formula language presents us with some challenges here. In LotusScript this would be handled fairly simply. Formula language however does not have easy functions for deal with conversion to hexadecimal values from strings, and strings are not stored as unicode values. There is also no simple function to get the unicode value of a character. Solving these problems leads me to these other useful formulae.

First, here's one to convert a string of hexadecimal digits into a numeric value. This could have been simpler for my needs, as I'm only dealing with two digit values, but I wanted something that could handle any reasonable length string of hex characters. The initial value is in the field "sourcetext" and the result will be in the numeric field "result".

result := 0;
sourcetext := "1A3";
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues := "10": "11" : "12" : "13" : "14" : "15";
@For(z := 1 ; z <= @Length(sourcetext) ; z := z + 1 ;
result := @If(@IsNumber(result) ; result ; 0) +
@Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
@Right(sourcetext; z);1) ; HexChars; DecimalValues)));

* Notice that I'm setting "result" to zero here. If you do this inside a loop of any kind, you'll find the value keeps increasing. Make sure to set it back to zero for each new iteration, as in the final formula shown below.

Finally we're going to put the whole thing together and produce an XML string that we can be sure is valid. For this example, lets assume we're using the "Subject" field of a document in a view column formula to produce this XML.

The key to this whole thing is the function "@URLEncode" which takes as one of its possible options "ISO-8859-1". The formula looks at each character in the string, and encodes it in that format. For any character above decimal value 127 along with any values that are not valid in that format (like spaces) the encoding will change it to a % sign followed by a two digit hexadecimal value. That two digit value represents the character code. Since any other character will be unchanged, we can tell which are 'special' characters according to that format by checking the length of the string returned. For each "special" character, we check to see if it is in fact above value 127. To do this, we convert the hexadecimal value to a number value first. If it turns out that we're dealing with a space character or some other value which is valid in UTF-8 encoding but not in ISO-8859-1, we just convert it back again. For those characters which are above 127, we turn them into the XML representation allowed by UTF-8 encoding, which is "&#x" followed by the two digit hex value, and then by a closing semicolon. Thus, the "é" character from earlier encodes as "é"

Notice also that we do the transformation for brackets and ampersands and so forth on the subject string itself before we apply the XML tags around it. We don't want to change the brackets on the XML tags of course. We do the high order character transformation on the entire XML string however, because those characters are not valid anywhere on the XML document.

SourceText := Subject;
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues := "10" : "11" : "12" : "13" : "14" : "15";
xmlfrom := "<":">":"'":"&" : "\"" ;
xmlto := "<" : ">" : "'" : "&" : """;
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );
SourceText := "<subject>" + SourceText + "</subject>";
@For( x := 0 ; x < @Length(SourceText); x := x + 1 ;
tcharenc := @URLEncode("ISO-8859-1" ;@Middle(SourceText; x ; 1));
@If( @Length(tcharenc) = 1; ResultText := ResultText + tcharenc ;
@Transform( @Explode(tcharenc;"%") ; "chval" ; @Do(
result := 0;
@For(z := 1 ; z <= @Length(chval) ; z := z + 1 ;
result := @If(@IsNumber(result) ; result ; 0) +
@Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
@Right(chval; z);1) ; HexChars; DecimalValues)));
ResultText := ResultText + @If( result > 127 ; "&#x" + chval + ";" ; @Char(result))))));

There are - loading - comments....

My own thoughts on this are...By Richard Schwartz on 11/01/2006 at 11:27 PM EDT

Not clear to me why you use 8859-1.

The UTF-8 value for a character can be 1, 2, or three bytes long (2, 4, or 6
hex digits). For example, each of these three requires two bytes: אבג They
also go right to left when displayed ;-)

And each of these requires three bytes: ①②③

Without following your code line-for-line, I can't be sure; but it seems to me
that you're in danger of losing data if you do a transformation to through
8859-1

Because I haven't found anything better.By Andrew Pollack on 11/02/2006 at 12:17 AM EDT

Using the options "Domino" or "UTF-8" produce a set of values that browsers do
not interpret correctly. In this case, it produces "%C3%A9" which are not a
valid representation of the character without also specifying another code page.

Interestingly enough, using the value "Platform" seems to work, although all
three fail when attempting the character data you entered.

Something is still missing.

My own thoughts on this are...By Kerr on 11/03/2006 at 09:20 AM EDT

Something is very much missing. Part of that is that URL encoding is a specific
task that has different requirements to discovering the Unicode code point of a
given character. The xml entity reference is just the Unicode code point, in
either a hex or decimal format and has nothing directly to do with any charset.

@URLEncode does the following:
For every char in the input string, if the character is not valid for a url
query as defined by the url spec (basically [a-zA-Z0-9] plus a few others) then
return the encoded character as defined by the specified charset, preceding
each byte with a % char.

Using this method to try and extract the Unicode code point is going to involve
knowledge of how that charset encoding relates to Unicode, which may be
non-trivial.

There is really no reason why you shouldn't be able to just put the "é"
directly on the document, as long as the charset is utf-8 and xml document
specifies that it is utf-8 (although that is the default). Unless you are
working around a bug that is causing a problem or you have characters that your
font's don't have a glyph for you should not need to use character entity
references at all.

A couple of points of clarification:
UTF-8 can encode any Unicode code point, using as many bytes as it needs using
surrogate pairs.
The only characters that are not valid in an xml document are control
characters or half of surrogate pairs. There are also unassigned code points
that shouldn’t be used.
As long as the charset that encodes (serialises) the xml document is the one
that is used to unencode it, that this is the one specified on the document,
and the charset is capable of encoding all characters on the document, then any
problems will be caused by a bug, not a limitation in Unicode, the charset or
XML.

I hear you.... but in practice it's not working as wellBy Andrew Pollack on 11/04/2006 at 12:01 AM EDT

The issue seems to be that the rss data isn't be generated by domino, so what
domino is encoding doesn't match the charset I've coded for the xml to be.

I have an idea or two to try that may help. By the way, I hate character set
issues.

*Tested -- initially anyway, I've found a VERY simple and happy way to resolve
this issue. Fascinatingly simple really. I'll try to post it tomorrow.

My own thoughts on this are...By Richard Schwartz on 11/02/2006 at 01:19 AM EDT

The problem is that XML character refs are expressed in UTF-16, not UTF-8. The
correct encoding for that character in UTF-16 is this

é

That can be shortened to é. In fact, for any character that falls within
the 8859-1 set, the UTF-16 value will have the two leading zeros, which can
simply be dropped. But if you want to support all Unicode characters, you have
to be prepared to deal with

There does not seem to be a way to use @UrlEncode to get UTF-16. One could
write a translator from UTF-8 to UTF-16 in formula language, though. Shouldn't
be too too hard. Maybe I'll take it on as a challenge.

Subject
Your Name
Homepage
*Your Email
	* Your email address is required, but not displayed.

Your thoughts....


	Remember Me

Andrew Pollack's Blog

SNTT (seriously geeky) - Some helpful @Formulas - clean up your XML, and convert from hex value strings to decimal numbers

Site Links

Useful Links

Other Recent Stories...