In working to improve this blogsite, I found some challenges in creating good XML when the source data can be variable. The problem is that formula language lacks easy tools for doing this. There is no simple XML transformation function that works the way @URLEncode works. It sure would have been easier if there were.
First off, as you put together your XML string, you want to make sure to replace the "reserved" characters that are valid ASCII values but still can't appear in your content. By these I mean the less-than and greater-than brackets, ampersand, quote, and apostrophe. You want to do this to each of the fields you'll build your string out of, so that these characters are transformed before you take the next step of scrubbing any high order characters from the data. I prefer to define the replacement lists in one place, as constants at the top of the formula, as they may get used quite a bit.
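REM {Ampersand comes first so the entities added for the other characters are not escaped again};
xmlfrom := "&" : "<" : ">" : "'" : "\"" ;
xmlto := "&amp;" : "&lt;" : "&gt;" : "&apos;" : "&quot;" ;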
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );
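To make the expected result concrete, here is a small Python sketch of the same replacement. Python is used only so the output can be checked outside Notes, and the name `escape_xml` is made up for illustration. The sketch applies the pairs one at a time, left to right, which is why it handles the ampersand first: otherwise the "&" introduced by an entity such as "&lt;" would itself get escaped.

```python
# Illustrative only: mirrors the idea of the xmlfrom/xmlto lists,
# applied one pair at a time with the ampersand handled first.
XML_FROM = ["&", "<", ">", "'", '"']
XML_TO = ["&amp;", "&lt;", "&gt;", "&apos;", "&quot;"]

def escape_xml(text):
    """Replace the five reserved XML characters with their entities."""
    for old, new in zip(XML_FROM, XML_TO):
        text = text.replace(old, new)
    return text

print(escape_xml('Tom & "Jerry" <live>'))
# prints: Tom &amp; &quot;Jerry&quot; &lt;live&gt;
```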
That was the easy part. What's harder is scrubbing your XML of any characters that can't be written as-is in a UTF-8 document. Basically, that means anything with a Unicode value above 127, since those characters no longer fit in a single byte under UTF-8. The one that was messing up this blogsite was "é" from the name of a person in another story. That character has a Unicode value of 0xE9 (233 in decimal).
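A quick check of those numbers, sketched in Python purely for illustration: the code point of "é" is 233 (0xE9). It fits in a single byte in ISO-8859-1, but UTF-8 needs two bytes for it, so a lone 0xE9 byte is not valid UTF-8.

```python
ch = "\u00e9"  # the character é

code_point = ord(ch)
latin1_bytes = ch.encode("iso-8859-1")
utf8_bytes = ch.encode("utf-8")

print(code_point, hex(code_point))  # 233 0xe9
print(latin1_bytes.hex())           # e9
print(utf8_bytes.hex())             # c3a9

# A lone 0xE9 byte cannot be decoded as UTF-8.
try:
    b"\xe9".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
print(lone_byte_ok)                 # False
```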
Formula language presents us with some challenges here. In LotusScript this would be handled fairly simply. Formula language, however, does not have easy functions for converting between strings and hexadecimal values, and strings are not stored as Unicode values. There is also no simple function to get the Unicode value of a character. Solving these problems led me to these other useful formulae.
First, here's one to convert a string of hexadecimal digits into a numeric value. This could have been simpler for my needs, as I'm only dealing with two-digit values, but I wanted something that could handle any reasonable length string of hex characters. The initial value is in the variable "sourcetext" and the result will be in the numeric variable "result".
result := 0;
sourcetext := "1A3";
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues := "10": "11" : "12" : "13" : "14" : "15";
@For(z := 1 ; z <= @Length(sourcetext) ; z := z + 1 ;
result := @If(@IsNumber(result) ; result ; 0) +
@Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
@Right(sourcetext; z);1) ; HexChars; DecimalValues)));
* Notice that I'm setting "result" to zero here. If you do this inside a loop of any kind, you'll find the value keeps increasing. Make sure to set it back to zero for each new iteration, as in the final formula shown below.
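For comparison, the same right-to-left positional sum can be sketched in Python (the name `hex_to_number` is hypothetical, not something from Notes). Each digit, counted from the right, is multiplied by 16 raised to its position, and the running total starts from zero on every call.

```python
HEX_DIGITS = "0123456789ABCDEF"

def hex_to_number(hex_text):
    """Sum digit * 16**position, walking from the rightmost digit."""
    result = 0  # reset on every call, as the note above warns
    for position, digit in enumerate(reversed(hex_text.upper())):
        result += HEX_DIGITS.index(digit) * 16 ** position
    return result

print(hex_to_number("1A3"))  # 419
print(hex_to_number("E9"))   # 233
```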
Finally, we're going to put the whole thing together and produce an XML string that we can be sure is valid. For this example, let's assume we're using the "Subject" field of a document in a view column formula to produce this XML.
The key to this whole thing is the function "@URLEncode", which takes "ISO-8859-1" as one of its possible options. The formula looks at each character in the string and encodes it in that format. Any character above decimal value 127, along with any character that URL encoding does not pass through as-is (like a space), is changed to a % sign followed by a two-digit hexadecimal value. That two-digit value represents the character code. Since any other character will be unchanged, we can tell which characters are "special" by checking the length of the string returned.
For each "special" character, we check whether it is in fact above value 127. To do this, we first convert the hexadecimal value to a number. If it turns out that we're dealing with a space or some other character that is perfectly fine in the XML but was escaped by the URL encoding, we just convert it back again. For those characters which are above 127, we turn them into the XML numeric character reference, which is "&#x" followed by the two-digit hex value and then a closing semicolon. Thus, the "é" character from earlier encodes as "&#xE9;".
Notice also that we do the transformation for brackets, ampersands, and so forth on the subject string itself before we apply the XML tags around it. We don't want to change the brackets on the XML tags, of course. We do the high order character transformation on the entire XML string, however, because those characters are not valid anywhere in the XML document.
SourceText := Subject;
HexChars := "A":"B":"C":"D":"E":"F" ;
DecimalValues := "10" : "11" : "12" : "13" : "14" : "15";
xmlfrom := "&" : "<" : ">" : "'" : "\"" ;
xmlto := "&amp;" : "&lt;" : "&gt;" : "&apos;" : "&quot;" ;
SourceText := @ReplaceSubstring(@Text(SourceText); xmlfrom ; xmlto );
SourceText := "<subject>" + SourceText + "</subject>";
@For( x := 0 ; x < @Length(SourceText); x := x + 1 ;
tcharenc := @URLEncode("ISO-8859-1" ;@Middle(SourceText; x ; 1));
@If( @Length(tcharenc) = 1; ResultText := ResultText + tcharenc ;
@Transform( @Explode(tcharenc;"%") ; "chval" ; @Do(
result := 0;
@For(z := 1 ; z <= @Length(chval) ; z := z + 1 ;
result := @If(@IsNumber(result) ; result ; 0) +
@Power(16 ; z-1) * @TextToNumber(@Replace(@Left(
@Right(chval; z);1) ; HexChars; DecimalValues)));
ResultText := ResultText + @If( result > 127 ; "&#x" + chval + ";" ; @Char(result))))));
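As a rough cross-check of what the formula is meant to produce, here is a Python sketch of the same pipeline. It is not a translation of @URLEncode; it simply assumes every character fits in ISO-8859-1, so that the character's code point equals the byte value the formula recovers from the "%XX" form. All function names are made up for illustration.

```python
def escape_reserved(text):
    """Escape the five reserved XML characters, ampersand first."""
    for old, new in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                     ("'", "&apos;"), ('"', "&quot;")):
        text = text.replace(old, new)
    return text

def encode_high_chars(text):
    """Turn every character above 127 into a &#xHH; reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            out.append("&#x%X;" % code)
        else:
            out.append(ch)
    return "".join(out)

def subject_to_xml(subject):
    # Escape the content first, then wrap it, then scrub high characters.
    xml = "<subject>" + escape_reserved(subject) + "</subject>"
    return encode_high_chars(xml)

print(subject_to_xml("Café & <bar>"))
# prints: <subject>Caf&#xE9; &amp; &lt;bar&gt;</subject>
```

Because Python works with code points directly, this sketch would also emit references for characters outside ISO-8859-1, which the formula's route through @URLEncode may not handle the same way.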
Comment Entry
The UTF-8 value for a character can be 1, 2, 3, or 4 bytes long (2, 4, 6, or 8
hex digits). For example, each of these three requires two bytes: אבג They
also go right to left when displayed ;-)
And each of these requires three bytes: ①②③
Without following your code line-for-line, I can't be sure, but it seems to me
that you're in danger of losing data if you do a transformation through
8859-1.