Recent changes to the "[]" operators in OpenInsight 9.2 have resulted in substantial performance improvements to UTF8 mode string handling. This post highlights another such enhancement introduced in 9.2 to help bring UTF8 mode applications up to the standard of their ANSI counterparts.
Consider the Loop/Remove construct below:

    /*
       Example showing standard loop/remove construct used
       to parse dynamic arrays at high speed
    */

    mark = 1
    pos  = 1 ; * // This is the CHARACTER position
    Loop
       Remove nextVal From dynArray At pos Setting mark

       // Process nextVal...

    While mark
    Repeat

This is a common way to efficiently parse dynamic arrays in Basic+, but just like the normal "[]" operators it suffers from a severe performance degradation in UTF8 mode due to the need to find the byte offset of a character when given the position.

To alleviate this Revelation have introduced the BRemove statement - this operates in exactly the same fashion as the normal Remove statement, but the index variable used in BRemove refers to a byte offset rather than a character position. Here is the same example rewritten to use BRemove:

    /*
       Example showing UTF8-friendly loop/remove construct used
       to parse dynamic arrays at high speed
    */

    mark = 1
    pos  = 1 ; * // This is the BYTE offset
    Loop
       BRemove nextVal From dynArray At pos Setting mark

       // Process nextVal...

    While mark
    Repeat

As you can see it's a simple change and one worth making - using BRemove in your UTF8 applications will ensure that your dynamic array parsing remains fast and efficient.

Labels: Basic+, Performance, Unicode
As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful - if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process in a manner similar to this:
    /*
       This is the usual way of implementing fast string parsing in Basic+.
       We scan along the string looking for a delimiter, and remember
       where we found it via the Col2() function. For the next iteration
       we increment that position and scan from that point.
    */
    src = xlate( "SYSPROCS", "MSG", "", "X" )

    pos = 1
    eof = len( src )

    loop
       token = src[pos," "]
       pos   = col2() + 1

       * // Do stuff...

    while ( pos < eof )
    repeat

The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme it is necessary to start looking from the beginning of the string to find the byte offset of the specified character, as it's possible for a character to be encoded in more than one byte.

As you can appreciate, parsing a large string over many iterations wastes a lot of time looking for the character at the specified position - we could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.

Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

BCol1() and BCol2()

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular delimiter was found.

The extended "[]" operators

Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects the character index as the first argument, not the byte offset.
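To make the cost described above concrete, here is a minimal sketch written in Python rather than Basic+. It illustrates the encoding issue only and makes no claim about how OpenInsight implements its operators internally. The helper name `byte_offset_of_char` is hypothetical. It uses 1-based positions and offsets to mirror the Basic+ convention, and it has to walk the bytes from the start of the string every time it is called, which is exactly the work a byte offset lets you skip.

```python
def byte_offset_of_char(data: bytes, char_pos: int) -> int:
    """Return the 1-based byte offset of the 1-based character position.

    `data` is assumed to be valid UTF-8. The scan always starts at the
    first byte, so each call costs time proportional to the offset found.
    """
    if char_pos < 1:
        raise ValueError("character position must be 1 or greater")

    chars_seen = 0
    for index, byte in enumerate(data):
        # UTF-8 continuation bytes look like 10xxxxxx; any other byte
        # starts a new character.
        if (byte & 0xC0) != 0x80:
            chars_seen += 1
            if chars_seen == char_pos:
                return index + 1

    raise IndexError("character position is past the end of the string")


if __name__ == "__main__":
    sample = "a\u00e9b".encode("utf-8")  # 'a', e-acute (2 bytes), 'b'
    for position in (1, 2, 3):
        print(position, byte_offset_of_char(sample, position))
```

Calling a helper like this once per token, as a character-indexed loop effectively must, turns a linear parse into a quadratic one on large strings. Remembering the byte offset between iterations avoids the rescan.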
The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:

    /*
       This example shows a UTF8-friendly way of parsing a string using
       byte offsets with the extended "[]" operators.
    */
    src = xlate( "SYSPROCS", "MSG", "", "X" )

    pos       = 1
    delim     = " "
    delimSize = GetByteSize( delim )
    eof       = GetByteSize( src )

    loop
       token = src[pos,delim,1]     ; * // Extended - note the last "1" argument
       pos   = BCol2() + delimSize  ; * // Get the byte offset and increment by
                                    ; * // the delimiter _byte_ size

       * // Do stuff...

    while ( pos < eof )
    repeat

(Note also that we check the byte size of the delimiter we are using - although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)

Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets for obvious reasons.

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

Labels: Basic+, Performance, Unicode
In our recent post on using memory pre-allocation when building large strings commenter M@ pointed out quite correctly that using the normal [] operators while in UTF8 mode results in a severe performance hit due to the necessity of calculating the character position of the insertion point during each iteration.
A workaround that was suggested was to temporarily switch to ANSI mode for the [] operation and then switch back afterwards. This is a valid solution and one we've used ourselves before, but it does create a possible failure point: if your system hits a fatal debug condition before you switch back you might unknowingly be stuck in ANSI mode, which could result in subsequent data corruption.

A safer alternative to this is to use the PutBinaryValue function that we documented here - this ignores any string-encoding and does a straightforward binary copy to the specified offset. Here's the Preallocation sample program from the previous post updated with the binary functions:

    Subroutine ZZ_SpeedTest( Void )

       Declare Function TimeGetTime

       startTime = TimeGetTime()

       stringLength = GetByteSize( @Upper.Case : @Fm )
       totalLength  = stringLength * 99999

       newArray = Space(totalLength)
       arrayPtr = 1

       For loopPtr = 1 To 99999
          PutBinaryValue( newArray, arrayPtr, CHAR, @Upper.Case : @Fm )
          arrayPtr += stringLength
       Next

       endTime   = TimeGetTime()
       totalTime = endTime - startTime

       Call Msg(@Window, "Total time was " : totalTime)

    Return

This option took 95 milliseconds in UTF8 mode in our testing, pretty much on a par with the [] operator in ANSI mode. (As an aside, the [] operator in UTF8 mode took... well, we don't know actually - we gave up after 10 minutes of waiting for it to finish!)

We also tested the concatenation (:=) option in UTF8 mode - this slowed down the program by half - better than the [] operators but still not great.

Labels: Basic+, Performance, Unicode
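For readers who want to experiment with the pre-allocate-and-copy idea from the post above outside OpenInsight, the following is a rough Python analogue and not a translation of the Basic+ program. The function name `build_preallocated` and the sample chunk are hypothetical, and the offsets are 0-based as is usual in Python (Basic+ offsets are 1-based). The point it shows is the same: the buffer is sized once in bytes, and every write goes to a known byte offset, so no character position ever has to be worked out.

```python
def build_preallocated(chunk: bytes, count: int) -> bytes:
    """Fill a buffer with `count` copies of `chunk` using byte offsets only."""
    chunk_size = len(chunk)                 # size in BYTES, not characters
    buffer = bytearray(chunk_size * count)  # allocate the whole buffer once
    offset = 0                              # 0-based byte offset into buffer

    for _ in range(count):
        # Same-length slice assignment overwrites in place; the buffer
        # never grows, so nothing is reallocated inside the loop.
        buffer[offset:offset + chunk_size] = chunk
        offset += chunk_size

    return bytes(buffer)


if __name__ == "__main__":
    # Hypothetical sample data: some text plus a multi-byte character.
    chunk = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00fe".encode("utf-8")
    result = build_preallocated(chunk, 1000)
    print(len(chunk), len(result))
```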
One of the most important points to bear in mind when using the Basic+ string handling functions is that all normal string operations are character-based - not byte-based. This has major implications if you wish to manipulate your data in a byte-oriented fashion when in UTF8 mode, because UTF8 is a multi-byte encoding scheme; i.e. it doesn't always follow that one byte represents one character as is the case in ANSI mode.
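The gap between characters and bytes is easy to see for yourself. The short sketch below is Python, not Basic+, and the function name `char_and_byte_lengths` is hypothetical. It returns the character count alongside the UTF-8 byte count for the same text, which is loosely analogous to comparing Len() with a byte-size function.

```python
def char_and_byte_lengths(text: str) -> tuple[int, int]:
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    # Plain ASCII, one accented letter, and three CJK characters.
    for sample in ("ABC", "caf\u00e9", "\u65e5\u672c\u8a9e"):
        chars, size = char_and_byte_lengths(sample)
        print(f"{sample!r}: {chars} characters, {size} bytes")
```

For pure ASCII text the two numbers match, which is why byte-oriented code that works on English data can break later on data containing accented or non-Latin characters.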
To overcome this issue Revelation introduced several new Basic+ functions way back in OpenInsight 7.0 that explicitly allow binary manipulation regardless of the string-handling mode you are currently in (note that these functions are intrinsic to the Basic+ language and do not need to be declared before use). The functions documented in this post are GetByteSize, GetBinaryValue, PutBinaryValue and CreateBinaryData.
The intention of this blog post is to document these functions and to make you aware of them so that you can develop your applications correctly should you wish to work in UTF8 mode. (Also note that most of these functions expect you to specify a variable type when using them. This type should be chosen from one of the standard "C" types understood by the Basic+ compiler and listed at the end of this post.)

GetByteSize

Returns the number of bytes occupied by the specified variable. This is in contrast to the Len() function which returns the number of characters.

    sizeInBytes = GetByteSize( varData )
E.g.

    rec     = Xlate( "SYSOBJ", "$WRITE_ROW", "", "X" )
    recSize = GetByteSize( rec )

GetBinaryValue

This function extracts a binary value from a variable at a specified offset. You must specify the type of data to extract, and if you are extracting a type with a variable length, such as a string of bytes, you must also pass the number of bytes you wish to copy.

    binVal = GetBinaryValue( varData, byteOffset, varType [, noOfBytes] )
E.g.

    rec = Xlate( "SYSOBJ", "$WRITE_ROW", "", "X" )

    // Get the first byte of the record as a number
    firstByte = GetBinaryValue( rec, 1, BYTE )

    // Get the next 10 bytes as a binary string
    someBytes = GetBinaryValue( rec, 2, BINARY, 10 )

PutBinaryValue

This subroutine modifies a variable by replacing binary data at a specified byte offset. You must specify the type of data you wish to insert as well as the data itself.

    PutBinaryValue( binData, byteOffset, varType, varData )
E.g.

    * // Example showing how to access and update
    * // a Windows API structure using
    * // the binary operators.
    * //
    * // A RECT structure consists of four LONG types
    * // (32-bit signed integer, each 4 bytes long)
    * //
    * // typedef struct tagRECT {
    * //    LONG left;
    * //    LONG top;
    * //    LONG right;
    * //    LONG bottom;
    * // } RECT;

    * // We're going to use the GetWindowRect API function
    * // to get some RECT coordinates

    hwnd = Get_Property( @window, "HANDLE" )
    rect = blank_Struct( "RECT" )
    rect = GetWindowRect( hwnd, rect )

    * // Increment the top member by 10
    * // (top occupies bytes 5 to 8, so its byte offset is 5)
    top = GetBinaryValue( rect, 5, LONG )
    top += 10

    PutBinaryValue( rect, 5, LONG, top )

CreateBinaryData

This function creates and returns a "blank" binary variable of the specified type.

    binVal = CreateBinaryData( varType, varData )
E.g.

    * // Create a binary integer with an initial value of
    * // 100
    a    = "100"
    intA = CreateBinaryData( INT, a )

Basic+ "C" types

The following is a list of variable types that may be used with the Basic+ binary manipulation functions described above.
[EDIT: 05 March 2010] Due to a recently discovered compiler bug (since fixed) the following "C" types will NOT work with the binary manipulation functions prior to OpenInsight 9.2.0:
Probably the biggest impact this will have is processing BINARY types, but you can work around this by using the CHAR type instead as they both perform exactly the same operation.

Labels: OpenInsight, Unicode, UTF8