Friday, 12 March 2010

UTF8 and the extended string operators

As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful: if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process, in a manner similar to this:


/*
   This is the usual way of implementing fast string parsing in Basic+.
   We scan along the string looking for a delimiter, and remember
   where we found it via the Col2() function. For the next iteration
   we increment that position and scan from that point.
*/   
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos = 1
   eof = len( src )
   
   loop
      token = src[pos," "]
      pos   = col2() + 1
      
      * // Do stuff...
      
   while ( pos <= eof )
   repeat


The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme, a character may be encoded in more than one byte, so the only way to find the byte offset of a given character is to count forward from the beginning of the string. As you can appreciate, parsing a large string over many iterations wastes a lot of time repeating that scan for every starting position - we could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
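To make the cost concrete, here is a small Python model (not Basic+, and not how OpenInsight is actually implemented) of the work needed to turn a character index into a byte offset in UTF-8 data. The function name `byte_offset_of_char` is hypothetical, and it uses Python's 0-based indexing rather than the 1-based positions used in Basic+.

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the 0-based byte offset of the character at 0-based char_index.

    The only way to find it is to walk the bytes from the start, counting
    every byte that begins a character (anything that is not a UTF-8
    continuation byte of the form 10xxxxxx).
    """
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # this byte starts a new character
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:               # index one past the last character
        return len(data)
    raise IndexError("character index out of range")


sample = "naïve café".encode("utf-8")    # 10 characters, 12 bytes
offset_of_c = byte_offset_of_char(sample, 6)   # the "c" of "café"
```

Each call scans from byte zero, so a parsing loop that asks for a new character position on every iteration ends up re-reading the front of the string again and again.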

Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

BCol1() and BCol2()

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular delimiter was found.
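The difference between the two kinds of position can be illustrated with a short Python sketch. This only models the idea of a character position versus a byte offset; `delimiter_positions` is a hypothetical helper, not a Revelation API, and it returns 1-based values simply to mirror the 1-based positions used in the listings in this post.

```python
def delimiter_positions(text: str, delim: str) -> tuple[int, int]:
    """Return the 1-based (character position, byte offset) of the first delim."""
    # Position counted in characters, as a character-based lookup would report it.
    char_index = text.index(delim)
    # Position counted in bytes of the UTF-8 encoding of the same string.
    byte_index = text.encode("utf-8").index(delim.encode("utf-8"))
    return char_index + 1, byte_index + 1


ascii_positions = delimiter_positions("cafe bar", " ")
utf8_positions = delimiter_positions("café bar", " ")
```

For pure ASCII text the two numbers agree. As soon as a multi-byte character appears before the delimiter, the byte offset runs ahead of the character position.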

The extended "[]" operators

Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects a character index as the first argument, not a byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:


/*
   This example shows a UTF8-friendly way of parsing a string using
   byte offsets with the extended "[]" operators. 
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )
   
   pos       = 1
   delim     = " "
   delimSize = GetByteSize( delim )
   eof       = GetByteSize( src )
   
   loop
      token = src[pos,delim,1]    ; * // Extended - note the last "1" argument
      pos   = BCol2() + delimSize ; * // Get the byte offset and increment by
                                  ; * // the delimiter _byte_ size
      
      * // Do stuff...
      
   while ( pos <= eof )
   repeat


(Note also that we check the byte size of the delimiter we are using - although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)
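The same pattern can be sketched in Python to show why the delimiter's byte size matters. This is an illustration of the technique only, under the assumption that the source is valid UTF-8; `split_by_byte_offset` is a hypothetical name, offsets are 0-based as usual in Python, and unlike the Basic+ loop it also returns a trailing empty token when the source ends with the delimiter.

```python
def split_by_byte_offset(src: bytes, delim: bytes) -> list[bytes]:
    """Tokenise UTF-8 bytes by remembering a byte offset between iterations."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    delim_size = len(delim)          # measured in bytes, not characters
    tokens = []
    pos = 0
    while pos <= len(src):
        end = src.find(delim, pos)   # resumes at pos, never rescans from byte 0
        if end == -1:
            end = len(src)           # no more delimiters: take the rest
        tokens.append(src[pos:end])
        pos = end + delim_size       # step over the whole delimiter
    return tokens


arrow = "→".encode("utf-8")          # one character, three bytes
parts = split_by_byte_offset("naïve→café→x".encode("utf-8"), arrow)
words = [p.decode("utf-8") for p in parts]
```

Had the loop advanced by one (the delimiter's length in characters) instead of `delim_size`, the next token would have started in the middle of the delimiter's byte sequence.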

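Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, presumably because working out the character positions would bring back the very scan that the byte-offset form is there to avoid.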

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

4 comments:

  1. This is great!!
    Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.

    Cheers, M@

  2. Hi M@,

    I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.

    You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful!

  3. Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn.
    A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. Although adding a RemoveB would be consistent, I guess.

    Cheers, M@

  4. Hi M@,

    Take a look here:

    http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html

    :)
