Sprezzatura :: Making Databases Happen

By Captain C | Saturday, 20 March 2010 17:59 | 2 Comments

Recent changes to the "[]" operators in OpenInsight 9.2 have resulted in substantial performance improvements to UTF8 mode string handling. This post highlights another such enhancement introduced in 9.2 to help bring UTF8 mode applications up to the standard of their ANSI counterparts.

Consider the Loop/Remove construct below:

/*
   Example showing standard loop/remove construct used
   to parse dynamic arrays at high speed
*/

   mark = 1
   pos  = 1 ; * // This is the CHARACTER position
   Loop
      Remove nextVal From dynArray At pos Setting mark

      // Process nextVal...

   While mark
   Repeat

This is a common way to parse dynamic arrays efficiently in Basic+, but just like the normal "[]" operators it suffers severe performance degradation in UTF8 mode, because the system has to find the byte offset of a character every time it is given a character position.

To alleviate this, Revelation have introduced the BRemove statement - this operates in exactly the same fashion as the normal Remove statement, but the index variable used in BRemove refers to a byte offset rather than a character position.

Here is the same example rewritten to use BRemove:

/*
   Example showing UTF8-friendly loop/remove construct used
   to parse dynamic arrays at high speed
*/

   mark = 1
   pos  = 1 ; * // This is the BYTE offset
   Loop
      BRemove nextVal From dynArray At pos Setting mark

      // Process nextVal...

   While mark
   Repeat

As you can see it's a simple change and one worth making - using BRemove in your UTF8 applications will ensure that your dynamic array parsing remains fast and efficient.

Labels: Basic+, Performance, Unicode

UTF8 and the extended string operators

By Captain C | Friday, 12 March 2010 13:21 | 4 Comments

As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful - if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process in a manner similar to this:

/*
   This is the usual way of implementing fast string parsing in Basic+.
   We scan along the string looking for a delimiter, and remember
   where we found it via the Col2() function. For the next iteration
   we increment that position and scan from that point.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos = 1
   eof = len( src )

   loop
      token = src[pos," "]
      pos   = col2() + 1

      * // Do stuff...

   while ( pos < eof )
   repeat

The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme it is necessary to start looking from the beginning of the string to find the byte offset of the specified character, as it's possible for a character to be encoded in more than one byte. As you can appreciate, parsing a large string over many iterations wastes a lot of time looking for the character at the specified position - we could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
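
To make the cost concrete, here is a rough Python model (not Basic+, and not a description of how OpenEngine is actually implemented) of turning a 1-based character position into a 1-based byte offset in a UTF8 buffer. The function name byte_offset_of_char is hypothetical; the only point is that the scan has to begin at the first byte on every call.

```python
def byte_offset_of_char(buf: bytes, char_pos: int) -> int:
    """Return the 1-based byte offset of the 1-based character position
    char_pos in the UTF-8 encoded buffer buf, scanning from the start."""
    seen = 0
    for i, b in enumerate(buf):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new character.
        if (b & 0xC0) != 0x80:
            seen += 1
            if seen == char_pos:
                return i + 1
    raise IndexError("character position is past the end of the string")


text = "naïve café".encode("utf-8")

# "ï" occupies two bytes, so every later character sits one byte further
# along than its character position suggests.
offset_of_c = byte_offset_of_char(text, 7)   # the "c" in "café"
```

Call something like this once per token in a loop and the total work grows with the square of the string length, which is the slowdown described above.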

Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

BCol1() and BCol2()

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular delimiter was found.

The extended "[]" operators

Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects a character index as its first argument, not a byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:

/*
   This example shows a UTF8-friendly way of parsing a string using
   byte offsets with the extended "[]" operators.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos       = 1
   delim     = " "
   delimSize = GetByteSize( delim )
   eof       = GetByteSize( src )

   loop
      token = src[pos,delim,1]    ; * // Extended - note the last "1" argument
      pos   = BCol2() + delimSize ; * // Get the byte offset and increment by
                                  ; * // the delimiter _byte_ size

      * // Do stuff...

   while ( pos < eof )
   repeat

(Note also that we check the byte size of the delimiter we are using - although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead)
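
The same idea can be sketched outside Basic+. The Python below is only a model of the technique, with a hypothetical function name (split_by_byte_offset) and 0-based offsets rather than the 1-based offsets Basic+ uses. It walks an encoded buffer once, remembers the byte offset where the last delimiter ended, and steps over the delimiter by its byte size rather than by 1.

```python
def split_by_byte_offset(src: bytes, delim: bytes) -> list:
    """Walk src once, tracking a 0-based byte offset between tokens."""
    tokens = []
    pos = 0
    delim_size = len(delim)          # byte size, not character count
    while pos <= len(src):
        end = src.find(delim, pos)   # resume from the saved byte offset
        if end == -1:
            end = len(src)
        tokens.append(src[pos:end])
        pos = end + delim_size       # skip the whole delimiter
    return tokens


# A delimiter that is a single character but three bytes in UTF-8.
arrow = "→".encode("utf-8")
parts = split_by_byte_offset("α→β→γ".encode("utf-8"), arrow)
```

Searching the raw bytes is safe here because a validly encoded UTF8 character can never match in the middle of another one. Stepping by 1 instead of delim_size would leave the tail bytes of the arrow at the front of the next token, which is why the delimiter size is worth measuring at runtime.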

Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, since working out the character positions as well would bring back the very scanning cost they exist to avoid.

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

Labels: Basic+, Performance, Unicode

UTF8 - To Malloc or not to Malloc - that is another question

By Captain C | Wednesday, 10 March 2010 10:03 | 0 Comments

In our recent post on using memory pre-allocation when building large strings commenter M@ pointed out quite correctly that using the normal [] operators while in UTF8 mode results in a severe performance hit due to the necessity of calculating the character position of the insertion point during each iteration.

A workaround that was suggested was to temporarily switch to ANSI mode for the [] operation and then switch back afterwards. This is a valid solution and one we've used ourselves before, but it does create a possible failure point: If your system hits a fatal debug condition before you switch back you might unknowingly be stuck in ANSI mode which could result in subsequent data corruption.

A safer alternative to this is to use the PutBinaryValue function that we documented in our earlier post "UTF8 and Binary Manipulation" - this ignores any string encoding and does a straightforward binary copy to the specified offset.

Here's the Preallocation sample program from the previous post updated with the binary functions:

Subroutine ZZ_SpeedTest( Void )

   Declare Function TimeGetTime

   startTime    = TimeGetTime()
   stringLength = GetByteSize( @Upper.Case : @Fm )
   totalLength  = stringLength * 99999
   newArray     = Space(totalLength)
   arrayPtr     = 1

   For loopPtr = 1 To 99999
      PutBinaryValue( newArray, arrayPtr, CHAR, @Upper.Case : @Fm )
      arrayPtr += stringLength
   Next

   endTime   = TimeGetTime()
   totalTime = endTime - startTime

   Call Msg(@Window, "Total time was " : totalTime)

Return

This option took 95 milliseconds in UTF8 mode in our testing, pretty much on a par with the [] operator in ANSI mode. (As an aside, the [] operator in UTF8 mode took... well, we don't know actually - we gave up after 10 minutes of waiting for it to finish!)

We also tested the concatenation (:=) option in UTF8 mode - this slowed down the program by half - better than the [] operators but still not great.
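
For readers who want to see the pre-allocation pattern in isolation, here is a small Python model of it. This is not Basic+ and says nothing about OpenInsight timings; the function name build_preallocated and the sample chunk are made up for illustration. The buffer is sized once, and each chunk is then copied to a running byte offset, much as PutBinaryValue is used above.

```python
def build_preallocated(chunk: bytes, count: int) -> bytes:
    """Fill a pre-sized buffer by copying chunk to successive byte offsets."""
    size = len(chunk)
    buf = bytearray(b" " * (size * count))   # allocate once, like Space()
    pos = 0
    for _ in range(count):
        buf[pos:pos + size] = chunk          # raw byte copy, no character scan
        pos += size
    return bytes(buf)


sample = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ|"
result = build_preallocated(sample, 1000)
```

Because the write position is a byte offset that is simply incremented, each copy costs the same no matter how far into the buffer it lands.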

Labels: Basic+, Performance, Unicode

UTF8 and Binary Manipulation

By Captain C | Wednesday, 3 March 2010 09:45 | 0 Comments

One of the most important points to bear in mind when using the Basic+ string handling functions is that all normal string operations are character-based - not byte-based. This has major implications if you wish to manipulate your data in a byte-oriented fashion when in UTF8 mode, because UTF8 is a multi-byte encoding scheme; i.e. it doesn't always follow that one byte represents one character as is the case in ANSI mode.

To overcome this issue Revelation introduced several new Basic+ functions way back in OpenInsight 7.0 that explicitly allow binary manipulation regardless of the string-handling mode you are currently in. (Note that these functions are intrinsic to the Basic+ language and do not need to be declared before use.)

These functions are:

GetByteSize
GetBinaryValue
PutBinaryValue
CreateBinaryData

The intention of this blog post is to document these functions and to make you aware of them so that you can develop your applications correctly should you wish to work in UTF8 mode.

(Also note that most of these functions expect you to specify a variable type when using them. This type should be chosen from one of the standard "C" types understood by the Basic+ compiler and listed at the end of this post)

GetByteSize

Returns the number of bytes occupied by the specified variable. This is in contrast to the Len() function which returns the number of characters.

sizeInBytes = GetByteSize( varData )

Argument	Description
varData	Variable to query.

E.g.

rec = Xlate( "SYSOBJ", "$WRITE_ROW", "", "X" )
recSize = GetByteSize( rec )
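
The difference between a character count and a byte count is easy to demonstrate in any language with explicit encodings. The Python snippet below is an analogy only: it uses Windows-1252 as a stand-in for "ANSI mode", which is an assumption about the code page in use.

```python
text = "Zürich"

char_count = len(text)                     # characters, as Len() would count
utf8_bytes = len(text.encode("utf-8"))     # bytes when stored as UTF-8
ansi_bytes = len(text.encode("cp1252"))    # bytes in a single-byte code page
```

Here char_count and ansi_bytes are both 6, while utf8_bytes is 7 because "ü" needs two bytes in UTF8.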

GetBinaryValue

This function extracts a binary value from a variable at a specified offset. You must specify the type of data to extract, and if you are extracting a type with a variable length, such as a string of bytes, you must also pass the number of bytes you wish to copy.

binVal = GetBinaryValue( varData, byteOffset, varType [, noOfBytes] )

Argument	Description
varData	Variable to extract the binary value from.
byteOffset	1-based offset into varData to extract the binary value from.
varType	Type of data to extract. This must be one of the Basic+ "C" types as listed below.
noOfBytes	Number of bytes to extract. This argument is only required if varType is CHAR or BINARY.

E.g.

   rec = Xlate( "SYSOBJ", "$WRITE_ROW", "", "X" )

   // Get the first byte of the record as a number
   firstByte = GetBinaryValue( rec, 1, BYTE )

   // Get the next 10 bytes as a binary string
   someBytes = GetBinaryValue( rec, 2, BINARY, 10 )

PutBinaryValue

This subroutine modifies a variable by replacing binary data at a specified byte offset. You must specify the type of data you wish to insert as well as the data itself.

PutBinaryValue( binData, byteOffset, varType, varData )

Argument	Description
binData	Variable containing binary data to modify.
byteOffset	1-based starting position to begin the modification from.
varType	Type of data to copy into binData. This must be one of the Basic+ "C" types as listed below.
varData	Data to copy into binData. OpenEngine converts this to the binary format specified by the varType argument before copying.

E.g.

   * // Example showing how to access and update
   * // a Windows API structure using
   * // the binary operators.
   * //
   * // A RECT structure consists of four LONG types
   * // (32-bit signed integer, each 4 bytes long)
   * //
   * // typedef struct tagRECT {
   * //   LONG left;
   * //   LONG top;
   * //   LONG right;
   * //   LONG bottom;
   * // } RECT;

   * // We're going to use the GetWindowRect API function
   * // to get some RECT coordinates

   hwnd = Get_Property( @window, "HANDLE" )
   rect = blank_Struct( "RECT" )
   rect = GetWindowRect( hwnd, rect )

   * // Increment the top member by 10
   top  =  GetBinaryValue( rect, 5, LONG )
   top  += 10

   PutBinaryValue( rect, 5, LONG, top )
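
The offset arithmetic is worth spelling out: each LONG is 4 bytes, so left starts at 1-based offset 1, top at 5, right at 9 and bottom at 13. The Python sketch below mirrors the read-modify-write above using the standard struct module on a hand-built RECT. The coordinate values are invented, and no Windows API is called.

```python
import struct

# Four 32-bit signed little-endian integers: left, top, right, bottom.
rect = bytearray(struct.pack("<4l", 100, 200, 500, 600))

# "top" is the second member: 1-based byte offset 5, i.e. 0-based offset 4.
TOP_OFFSET = 5 - 1

(top,) = struct.unpack_from("<l", rect, TOP_OFFSET)   # like GetBinaryValue
struct.pack_into("<l", rect, TOP_OFFSET, top + 10)    # like PutBinaryValue

left, top, right, bottom = struct.unpack("<4l", rect)
```

Only the four bytes belonging to top change; the other three members are left exactly as they were.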

CreateBinaryData

This function creates and returns a binary variable of the specified type, initialised with the value you supply.

binVal = CreateBinaryData( varType, varData )

Argument	Description
varType	Type of variable to create. This must be one of the Basic+ "C" types as listed below.
varData	Initial value of the new variable.

E.g.

   * // Create a binary integer with an initial value of
   * // 100

   a    = "100"
   intA = CreateBinaryData( INT, a )

Basic+ "C" types

The following is a list of variable types that may be used with the Basic+ binary manipulation functions described above.

CHAR
BYTE
UBYTE
SHORT
USHORT
LONG
ULONG
FLOAT
LPVOID
LPCHAR
LPBYTE
LPUBYTE
LPSHORT
LPUSHORT
LPLONG
LPULONG
LPFLOAT
LPDOUBLE
DOUBLE
HANDLE
INT
UINT
LPINT
LPUINT
LPHANDLE
ACHAR
WCHAR
LPACHAR
LPWCHAR
LPSTR
LPASTR
LPWSTR
BINARY
LPBINARY

[EDIT: 05 March 2010]

Due to a recently discovered compiler bug (since fixed) the following "C" types will NOT work with the binary manipulation functions prior to OpenInsight 9.2.0:

ACHAR
WCHAR
LPACHAR
LPWCHAR
LPSTR
LPASTR
LPWSTR
BINARY
LPBINARY

Probably the biggest impact this will have is processing BINARY types, but you can work around this by using the CHAR type instead as they both perform exactly the same operation.

Labels: OpenInsight, Unicode, UTF8