/*
   This is the usual way of implementing fast string parsing in Basic+.
   We scan along the string looking for a delimiter, and remember
   where we found it via the Col2() function. For the next iteration
   we increment that position and scan from that point.
*/
src = xlate( "SYSPROCS", "MSG", "", "X" )

pos = 1
eof = len( src )

loop
   token = src[pos," "]
   pos   = col2() + 1

   * // Do stuff...

   * // Use "<=" so that a one-character final token is not skipped
while ( pos <= eof )
repeat
The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme, a character may be encoded in more than one byte, so the only way to find the byte offset of a given character position is to start counting from the beginning of the string. As you can appreciate, parsing a large string over many iterations wastes a lot of time repeatedly looking for the character at the specified position. We could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
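To make the cost concrete, here is a minimal Python sketch (not Basic+, and not a description of how OpenInsight works internally) of what is involved in turning a 1-based character position into a byte offset in UTF8 data. The function name char_pos_to_byte_offset is made up for illustration.

```python
def char_pos_to_byte_offset(data: bytes, char_pos: int) -> int:
    """Return the 1-based byte offset of a 1-based character position.

    Walks the UTF-8 bytes from the start and counts only lead bytes
    (continuation bytes have the form 0b10xxxxxx), so the cost grows
    with char_pos.
    """
    if char_pos < 1:
        raise ValueError("char_pos is 1-based")
    seen = 0
    for offset, byte in enumerate(data, start=1):
        if byte & 0xC0 != 0x80:   # not a continuation byte: a new character starts here
            seen += 1
            if seen == char_pos:
                return offset
    raise IndexError("char_pos is past the end of the string")


text = "naïve café au lait".encode("utf-8")
print(char_pos_to_byte_offset(text, 4))   # prints 5: "ï" occupies bytes 3 and 4
```

If a walk like this is repeated for every token, always starting from byte 1, the total work grows roughly with the square of the string length. Starting each search from a known byte offset avoids the walk altogether.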
Well, the good news is that with the upcoming release of OpenInsight 9.2, Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().
BCol1() and BCol2()
The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular character was found.
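The difference between the two kinds of position only shows up once the string contains multi-byte characters. The following Python sketch models it; find_delim is a hypothetical helper, not an OpenInsight function, and it simply reports both numbers for the same delimiter.

```python
def find_delim(text: str, delim: str, start_char: int = 1):
    """Model of a delimiter search that reports both kinds of position.

    Returns (char_pos, byte_offset), both 1-based, for the first delim
    at or after start_char, or None if the delimiter is not found.
    """
    idx = text.find(delim, start_char - 1)   # 0-based character index
    if idx < 0:
        return None
    char_pos = idx + 1
    # Bytes needed to encode everything before the delimiter, plus one.
    byte_offset = len(text[:idx].encode("utf-8")) + 1
    return char_pos, byte_offset


print(find_delim("día uno", " "))   # prints (4, 5): "í" takes two bytes
print(find_delim("a b", " "))       # prints (2, 2): identical for plain ASCII
```

The first value is the kind of number a character-based function like Col2() deals in; the second is the kind a byte-based function like BCol2() deals in.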
The extended "[]" operators
Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects the character index as the first argument, not the byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:
/*
   This example shows a UTF8-friendly way of parsing a string using
   byte offsets with the extended "[]" operators.
*/
src = xlate( "SYSPROCS", "MSG", "", "X" )

pos       = 1
delim     = " "
delimSize = GetByteSize( delim )
eof       = GetByteSize( src )

loop
   token = src[pos,delim,1]      ; * // Extended - note the last "1" argument
   pos   = BCol2() + delimSize   ; * // Get the byte offset and increment by
                                 ; * // the delimiter _byte_ size

   * // Do stuff...

   * // Use "<=" so that a one-byte final token is not skipped
while ( pos <= eof )
repeat
(Note also that we check the byte size of the delimiter we are using. Although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)
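For readers who want to experiment with the idea outside OpenInsight, the same byte-offset loop can be modelled in Python. This is only a sketch of the technique: split_by_byte_offset is a made-up name, and Python's bytes.find() stands in for the delimiter search.

```python
def split_by_byte_offset(src: bytes, delim: bytes):
    """Yield tokens from UTF-8 bytes while tracking a 1-based byte offset."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    delim_size = len(delim)    # byte size of the delimiter, measured at runtime
    eof = len(src)             # byte size of the source
    pos = 1
    while pos <= eof:          # "<=" so a one-byte final token is not skipped
        hit = src.find(delim, pos - 1)    # 0-based index of next delimiter, or -1
        end = hit if hit >= 0 else eof    # no delimiter left: token runs to the end
        yield src[pos - 1:end].decode("utf-8")
        pos = end + 1 + delim_size        # 1-based offset just past the delimiter


data = "uno→dós→tres".encode("utf-8")
print(list(split_by_byte_offset(data, "→".encode("utf-8"))))
# prints ['uno', 'dós', 'tres']
```

The "→" delimiter here is three bytes long in UTF8, so advancing by one instead of by the measured delimiter size would land in the middle of a character.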
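Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, since working out the character positions would mean exactly the kind of scan from the start of the string that they are there to avoid.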
[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.
This is great!!
Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.
Cheers, M@
Hi M@,
I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.
You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful!
Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn.
A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. Although adding a RemoveB would be consistent, I guess.
Cheers, M@
Hi M@,
Take a look here:
http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html
:)