UTF8 and the extended string operators

As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful: if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process, in a manner similar to this:

/*
   This is the usual way of implementing fast string parsing in Basic+.
   We scan along the string looking for a delimiter, and remember
   where we found it via the Col2() function. For the next iteration
   we increment that position and scan from that point.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos = 1
   eof = len( src )

   loop
      token = src[pos," "]
      pos   = col2() + 1

      * // Do stuff...

   * // "<=" so that a final one-character token is not skipped
   while ( pos <= eof )
   repeat

The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a variable-width encoding, in which a character may occupy more than one byte, the byte offset of a given character position can only be found by scanning from the beginning of the string. As you can appreciate, parsing a large string over many iterations wastes a lot of time repeatedly looking for the character at the specified position. We could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
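To make that cost concrete, here is a minimal sketch in Python (not Basic+, and not a description of how OpenInsight is implemented internally) of what locating a character position in UTF8 data involves. The helper name byte_offset_of_char is made up for illustration; positions are 1-based to match the examples in this post.

```python
def byte_offset_of_char(data: bytes, char_pos: int) -> int:
    """Return the 1-based byte offset of a 1-based character position.

    Illustrative model only: it walks the UTF-8 bytes from the start,
    so the work grows with how far into the string the character is.
    """
    seen = 0
    for offset, byte in enumerate(data, start=1):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte begins a new character.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == char_pos:
                return offset
    raise IndexError("character position is past the end of the string")


# "naive cafe" with two accented (two-byte) characters.
text = "na\u00efve caf\u00e9".encode("utf-8")

# The 7th character ("c") starts at the 8th byte, because the accented
# "i" before it occupies two bytes.
print(byte_offset_of_char(text, 7))
```

Calling a lookup like this once per token, each time with a larger character position, is what turns a simple parsing loop into a repeated rescan of the string.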

Well, the good news is that with the upcoming release of OpenInsight 9.2, Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

BCol1() and BCol2()

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular delimiter was found.
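The difference between the two kinds of position only shows up once the string contains multi-byte characters. The following Python sketch (a model of the idea, not the Basic+ functions themselves; delimiter_positions is a hypothetical helper) reports both values for the first delimiter in a string, 1-based as in Basic+.

```python
def delimiter_positions(text: str, delim: str):
    """Return (character position, byte offset) of the first delimiter.

    Both values are 1-based. The byte offset assumes the text is
    stored as UTF-8.
    """
    char_index = text.find(delim)
    if char_index < 0:
        raise ValueError("delimiter not found")
    char_pos = char_index + 1
    # Bytes occupied by everything before the delimiter, plus one.
    byte_off = len(text[:char_index].encode("utf-8")) + 1
    return char_pos, byte_off


# Pure ASCII: the two values agree.
print(delimiter_positions("cafe au lait", " "))

# With an accented final "e" (two bytes in UTF-8) the byte offset is
# one larger than the character position.
print(delimiter_positions("caf\u00e9 au lait", " "))
```

In single-byte (ANSI) data the two numbers are always identical, which is why the distinction never mattered before UTF8 mode.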

The extended "[]" operators

Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects the character index as its first argument, not the byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:

/*
   This example shows a UTF8-friendly way of parsing a string using
   byte offsets with the extended "[]" operators.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos       = 1
   delim     = " "
   delimSize = GetByteSize( delim )
   eof       = GetByteSize( src )

   loop
      token = src[pos,delim,1]    ; * // Extended - note the last "1" argument
      pos   = BCol2() + delimSize ; * // Get the byte offset and increment by
                                  ; * // the delimiter _byte_ size

      * // Do stuff...

   * // "<=" so that a final one-byte token is not skipped
   while ( pos <= eof )
   repeat

(Note also that we check the byte size of the delimiter we are using. Although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)
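Why the delimiter's byte size matters is easiest to see with a delimiter that really is multi-byte. This Python sketch models the same byte-offset loop outside of Basic+; split_by_byte_offset is a hypothetical helper, offsets here are 0-based as is usual in Python, and the middle dot is just an example of a character that takes two bytes in UTF8.

```python
def split_by_byte_offset(src: bytes, delim: bytes) -> list:
    """Tokenise UTF-8 bytes, resuming each search at a byte offset."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    tokens = []
    pos = 0
    delim_size = len(delim)  # measured at runtime, not assumed to be 1
    while pos <= len(src):
        end = src.find(delim, pos)
        if end < 0:
            end = len(src)  # no more delimiters: take the rest
        tokens.append(src[pos:end].decode("utf-8"))
        # Skip the whole delimiter. Adding 1 here would land in the
        # middle of a two-byte delimiter and corrupt the next token.
        pos = end + delim_size
    return tokens


src = "\u03b1\u00b7\u03b2\u00b7\u03b3".encode("utf-8")  # alpha, beta, gamma
delim = "\u00b7".encode("utf-8")                         # middle dot, 2 bytes
print(split_by_byte_offset(src, delim))
```

Each search resumes exactly where the previous one stopped, so the string is only walked once in total, which is the gain the extended operators are after.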

Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, since they work purely in bytes and have no character position to report.

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

4 comments:

Matt Crozier, 17 March 2010 at 00:18
This is great!!
Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.

Cheers, M@
Captain C, 17 March 2010 at 10:00
Hi M@,

I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.

You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful!
Matt Crozier, 18 March 2010 at 07:10
Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn.
A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. Although adding a RemoveB would be consistent, I guess.

Cheers, M@
Captain C, 20 March 2010 at 18:01
Hi M@,

Take a look here:

http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html

:)

Sprezzatura :: Making Databases Happen


Friday, 12 March 2010
