Sprezzatura :: Making Databases Happen

By Sprezz | Monday, 11 June 2018 12:41 | 1 Comments

We seem to have had a flurry of UTF-8 queries lately at Sprezz Towers and, as there isn't a lot of information published about this, we thought it might be useful to comment a little about this subject.

For many years, OpenInsight (OI) didn't support UTF-8, and this caused difficulties for people who needed to store foreign-language characters that also just happened to be system delimiters for OI. So, for example:

Delimiter	ANSI	Character
@Rm	255	ÿ
@Fm	254	þ
@Vm	253	ý
@Svm	252	ü
@Tm	251	û
@Stm	250	ú

Quite the challenge if you're working with European languages.
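To see the clash concretely, here is a small sketch that decodes each delimiter's byte value as a printable character. It is written in Python purely for illustration (it is not OI code), and it assumes the ANSI code page in use is Windows-1252; the function name is made up for this example.

```python
# Illustration only: assumes the ANSI code page is Windows-1252 (cp1252).
DELIMITERS = {
    "@Rm": 255,
    "@Fm": 254,
    "@Vm": 253,
    "@Svm": 252,
    "@Tm": 251,
    "@Stm": 250,
}

def delimiter_chars(codepage="cp1252"):
    """Return {delimiter name: the character that shares its byte value}."""
    return {name: bytes([code]).decode(codepage) for name, code in DELIMITERS.items()}

# Print one line per delimiter: name, byte value, colliding character.
for name, char in delimiter_chars().items():
    print(name, DELIMITERS[name], char)
```

Any text containing one of these characters is indistinguishable, at the byte level, from a delimiter.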

To get around this issue, Rev introduced something called CHARMAP - a special system-wide property that essentially relied upon the developer choosing characters lower down in the ANSI table that they didn't need, and using those to store the affected characters. Then, at display time, the lower characters were swapped back to the higher ones.

For example, if you wanted to be able to store ý (and what red-blooded occupant of Český Heršlág wouldn't?), you might decide that you're never going to need to use ™ (ANSI 153) and so instruct the CHARMAP to map 253 to 153. You'd enter Český Heršlág and OI would store Česk™ Heršlág - then when you asked it to display the town name, OI would convert the ™ back to ý before displaying it.

Clunky, but effective.
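The swap can be sketched in a few lines. This is Python for illustration only, not how OI implements CHARMAP internally; the function names are invented for the example, and it assumes the Central European code page cp1250 so that every letter in the town name can be encoded as a single byte.

```python
# Illustration only, not OI code. Mirrors the example above: 253 (ý) <-> 153 (™).
# Assumption: cp1250 is the ANSI code page, so "Č" and "š" fit in one byte each.
CHARMAP = {253: 153}

def _table(mapping):
    """Build a 256-entry byte translation table from {source: target}."""
    table = bytearray(range(256))
    for src, dst in mapping.items():
        table[src] = dst
    return bytes(table)

STORE_TABLE = _table(CHARMAP)
DISPLAY_TABLE = _table({dst: src for src, dst in CHARMAP.items()})

def to_stored(text, codepage="cp1250"):
    """Encode text and swap the high characters down before storing."""
    return text.encode(codepage).translate(STORE_TABLE)

def to_display(stored, codepage="cp1250"):
    """Swap the low characters back up and decode for display."""
    return stored.translate(DISPLAY_TABLE).decode(codepage)

stored = to_stored("Český Heršlág")
print(stored.decode("cp1250"))  # what the raw stored bytes look like
print(to_display(stored))       # what the user sees
```

The stored bytes no longer contain 253, so nothing in the data can be mistaken for @Vm. The price is that a genuine ™ can no longer be told apart from a mapped ý.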

With UTF-8 we no longer have this problem. ý is the Unicode code point U+00FD (ironically the same value, 253, as its ANSI code), but UTF-8 stores it as the two-byte sequence C3 BD, and bytes 250 to 255 never occur in valid UTF-8 text, so it doesn't stand a chance of being confused with ANSI 253. BUT, UTF-8 doesn't read minds. It doesn't know that when we ask it to display an ANSI 153 we really mean ý. And OI doesn't know that either - you might have intended to store a trademark symbol. So, if you are planning on moving data that previously used CHARMAP into a UTF-8 environment, you need to do some groundwork to avoid tying yourself (and anybody else trying to help you) into a series of complicated knots.
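A quick way to check what UTF-8 actually stores for ý is to encode it and look at the bytes. The sketch below is Python, used only for illustration, and the helper names are made up for this example.

```python
# Illustration only: inspect the UTF-8 bytes of a character.
def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as an uppercase hex string."""
    return ch.encode("utf-8").hex().upper()

def contains_delimiter_byte(data):
    """True if any byte falls in the OI delimiter range 250-255."""
    return any(250 <= b <= 255 for b in data)

print(hex(ord("ý")))                                # the Unicode code point
print(utf8_hex("ý"))                                # the bytes UTF-8 stores
print(contains_delimiter_byte("ý".encode("utf-8")))  # any clash with delimiters?
```

The code point is 0xFD, but the stored bytes are C3 BD, neither of which falls in the delimiter range.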

1. Use the SETNOOFDELIMITERS routine to set the value to be used to 0.

2. For every data row in the application:

   - Read the row.
   - Use Loop/Remove to grab the individually delimited pieces of data.
   - Convert your CHARMAPPED characters back to their original value.
   - Recast the string as UTF-8 using the ANSI_UTF8 function.
   - Add it back into the new string you are building.
   - When complete, write this string back over the original row. This means it will look strange in ANSI OI because it will contain multi-byte characters, but it will work properly in the UTF-8 OI app.

3. Fire up your UTF-8 enabled OI app.

4. Remove anything to do with CHARMAP.

5. Attach your data and all will be well with the world.
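The per-row logic above can be sketched as follows. This is a Python illustration of the idea, not Basic+ and not a drop-in tool: `convert_row` and `reverse_charmap_table` are hypothetical names, Python's decode/encode stands in for OI's ANSI_UTF8 function, Windows-1252 is assumed as the ANSI code page, and the SETNOOFDELIMITERS call and the actual read/write of rows are not modelled.

```python
import re

# Illustration only. Delimiter bytes 250-255 are captured so they survive the split.
DELIMITER_RE = re.compile(rb"([\xfa-\xff])")

def reverse_charmap_table(charmap):
    """charmap is {original byte: substitute byte}; build the undo table."""
    table = bytearray(range(256))
    for original, substitute in charmap.items():
        table[substitute] = original
    return bytes(table)

def convert_row(row, charmap, codepage="cp1252"):
    """Return row with each data piece un-CHARMAPped and recast as UTF-8."""
    table = reverse_charmap_table(charmap)
    out = bytearray()
    for i, piece in enumerate(DELIMITER_RE.split(row)):
        if i % 2 == 1:
            # Odd positions hold the captured delimiter bytes: keep them untouched.
            out += piece
        else:
            # Even positions hold data: undo CHARMAP, then ANSI -> UTF-8.
            restored = piece.translate(table)
            out += restored.decode(codepage).encode("utf-8")
    return bytes(out)

# Example: two fields, the second holding two values, with ý stored as byte 153.
row = b"Nov\x99\xfeCaf\xe9\xfdx"
print(convert_row(row, {253: 153}))
```

Splitting on the delimiters before undoing the CHARMAP matters: once byte 153 has been turned back into 253, the row can no longer be split safely until that piece has been recast as UTF-8.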

Hopefully this information will help you in the smooth transition of your data from ANSI to UTF-8.

And if you're new to the ways of UTF-8 don't forget this handy article for programming in UTF-8.

1 Comments:

In OI 9 the UTF-8 character 'ý' is stored as the multi-byte sequence C3BD - not 00FD. Has that changed in OI 10?
In OI 9.x, all bytes in a UTF-8 multi-byte character are high order (> 0x80). There is a good overview of OI's implementation of UTF-8 in the Unicode.chm file.

By Matt Crozier, At 11 June 2018 at 21:46
