Characterset conversion conundrums

Posted in: Oracle, Technical Track

Every now and then a database needs to be migrated from one characterset to another. The driver behind this is ususally to either synchronize all databases to a single standard or to support new characters or symbols like €, 🖖 or 💩 which you just cannot get in the most common 8 bit codepages. And yes, someone thought that UTF-8 needed a code for the pile of poo and put it into the standard. But what starts with the simple and reasonable request to change a database’s characterset can quickly escalate into a neverending story that can keep you busy for a long time. I want to look at and explain two of these pain points.

Lossy conversion

The documentation says that if the new characterset is not a complete superset of the source, then the conversion might be lossy as some characters cannot be converted. I was interested in which characters or codes were not mapped to UTF-8 but could not find a satisfying answer in 5 minutes of googling, so I built myself a small testcase. I created an Oracle 11g test database with an 8-bit characterset and populated a table with all possible character values, then ran csscan for a report.

SQL> SELECT * FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER=’NLS_CHARACTERSET’;
PARAMETER                      VALUE
—————————— —————————————-
NLS_CHARACTERSET               WE8ISO8859P1
SQL> create table abc (id number(3), symbol varchar2(1 CHAR));
Table created.

Now let’s fill the table with every possible character code in 8 bits.

begin
  FOR i IN 0..255
   LOOP
   insert into abc values (i, chr(i));
   end loop;
   commit;
end;
/

Sidenote: running SELECT * from that table will crash sqlcl and the run-script mode of SQLdeveloper at least in the April 2016 version 4.2.0.16.112.

We can now run csscan (install the needed packages first) against this table for a report on what would get lost in the conversion.

csscan brost/oracle USER=BROST TABLE=ABC TOCHAR=AL32UTF8 LOG=abc

The result gets written to a file abc.err which has entries for every row of data that has lossy data in a conversion.

User  : BROST
Table : ABC
Column: SYMBOL
Type  : VARCHAR2(1)
Number of Exceptions         : 32
Max Post Conversion Data Size: 2
ROWID              Exception Type      Size Cell Data(first 30 bytes)
—————— —————— —– ——————————
AAAVU2AAEAAAACsACA lossy conversion         <80>
AAAVU2AAEAAAACsACB lossy conversion         <81>
AAAVU2AAEAAAACsACC lossy conversion         <82>
AAAVU2AAEAAAACsACD lossy conversion         <83>
AAAVU2AAEAAAACsACE lossy conversion         <84>
AAAVU2AAEAAAACsACF lossy conversion         <85>
AAAVU2AAEAAAACsACG lossy conversion         <86>
AAAVU2AAEAAAACsACH lossy conversion         <87>
AAAVU2AAEAAAACsACI lossy conversion         <88>
AAAVU2AAEAAAACsACJ lossy conversion         <89>
AAAVU2AAEAAAACsACK lossy conversion         <8A>
AAAVU2AAEAAAACsACL lossy conversion         <8B>
AAAVU2AAEAAAACsACM lossy conversion         <8C>
AAAVU2AAEAAAACsACN lossy conversion         <8D>
AAAVU2AAEAAAACsACO lossy conversion         <8E>
AAAVU2AAEAAAACsACP lossy conversion         <8F>
AAAVU2AAEAAAACsACQ lossy conversion         <90>
AAAVU2AAEAAAACsACR lossy conversion         <91>
AAAVU2AAEAAAACsACS lossy conversion         <92>
AAAVU2AAEAAAACsACT lossy conversion         <93>
AAAVU2AAEAAAACsACU lossy conversion         <94>
AAAVU2AAEAAAACsACV lossy conversion         <95>
AAAVU2AAEAAAACsACW lossy conversion         <96>
AAAVU2AAEAAAACsACX lossy conversion         <97>
AAAVU2AAEAAAACsACY lossy conversion         <98>
AAAVU2AAEAAAACsACZ lossy conversion         <99>
AAAVU2AAEAAAACsACa lossy conversion         <9A>
AAAVU2AAEAAAACsACb lossy conversion         <9B>
AAAVU2AAEAAAACsACc lossy conversion         <9C>
AAAVU2AAEAAAACsACd lossy conversion         <9D>
AAAVU2AAEAAAACsACe lossy conversion         <9E>
AAAVU2AAEAAAACsACf lossy conversion         <9F>
—————— —————— —– ——————————

This shows that there is no mapping from characters between 0x80 and 0x9F from ISO8559 to UTF-8. This might not be very surprising as these codes are undefined in the ISO standard and if anything is using these bytes then they are not really ISO8559 text in the first place.

Column length

The other issue in cs migrations to keep an eye out for is the length of columns. By default, oracle character columns are defined in bytes. Varchar2(42) means that you are allowed to store strings with a maximum size of 42 bytes in it. In the world of 8bit charactersets, this also means 42 characters. But in UTF-8, only the basic characters take one byte while others are represented by up to four bytes. Take my name for example: ‘Björn’ is 5 characters long and can be stored in a varchar2(5 bytes) column in a database using ISO8559 or a similar 8 bit characterset like WE8MSWIN1252. But in UTF-8, the same string is 6 bytes long as the ‘ö’ takes up two bytes – 0xc3b6 to be exact. For short strings, this can be easily fixed by changing the definition of those columns to character semantic:

ALTER TABLE ABC MODIFY SYMBOL VARCHAR2(1 CHAR);

And from a development perspective that also makes much more sense because applications would usually want to restrict the number of characters in a string rather than how many bytes they occupy on disk.

4000 bytes is still the end of the road

But the character semantic does not solve all problems as Oracle will still impose a maximum limit of 4000 bytes per varchar2 column. In the next example I create a table with such a column of 4000 characters in size and I insert two strings that are both 4000 characters long.

SQL> create table longtext (id number(3), message varchar2(4000 char));
Table created.
SQL> insert into longtext values (1, rpad(‘Pythian rocks’,4000,’!’));
1 row created.
SQL> insert into longtext values (2, rpad(‘Björn rocks’,4000,’!’));
1 row created.
SQL> commit;
SQL> select id, length(message) from longtext;
ID LENGTH(MESSAGE)
———- —————
 1       4000
 2       4000

Ok, now let’s see what csscan says about that:

User  : BROST
Table : LONGTEXT
Column: MESSAGE
Type  : VARCHAR2(4000)
Number of Exceptions         : 1
Max Post Conversion Data Size: 4001
ROWID              Exception Type      Size Cell Data(first 30 bytes)
—————— —————— —– ——————————
AAAVVkAAEAAAAC3AAA exceed column size  4001 Bj<F6>rn rocks!!!!!!!!!!!!!!!!!!!
—————— —————— —– ——————————

The 4001 in the size column indicates that after converting the string to the destination characterset, it will occupy 4001 bytes of space which is not going to work, even in a column defined as varchar2(4000 CHAR). At this point you’ll have to accept that this data can only be converted to UTF-8 by truncating it to 4000 bytes.

12c extended varchar to the rescue?

You might be tempted to think that a new feature in Oracle 12c can solve this problem as there is now a way to support varchar2 columns of up to 32kB in size. And while it may be a solution to avoid truncation, I also want to warn that it might introduce yet new problems. While enabling support for 32kB character columns is as simple as setting a parameter, restarting the database and running a script, there is a huge change happening under the covers. All columns defined to accept more than 4000 bytes will then be stored out-of-line in securefile LOBs and come with some restrictions as Roopesh has pointed out in an earlier blog post.

tl;dr

Characterset migration projects can quickly escalate from trivial to very complex. But I showed two things one needs to know in this blog post. First I showed how to find and assess characters (or bytecodes) that will get lost in the conversion because no mapping from the old codepage to the new one exist.
And the other issue to look out for has to do with character and byte semantic. Understand that by default Oracle is using byte semantic but that symbols in multibyte charactersets may and will occupy more bytes than characters. An easy workaround is to convert columns to the character semantic but that will not avoid the hard limit of 4000 bytes for a regular varchar2 column. Oracle 12c offers a possible solution in extended varchar support for strings up to 32kB but this should not be taken too lightly and will need some testing and evaluation on a case by case.

email

Author

Interested in working with Bjoern? Schedule a tech call.

About the Author

Björn Rost is an Oracle ACE Director, and one of Pythian’s top Oracle experts. A popular presenter, he travels the world attending technology conferences, sharing insights, and learning with his wide network of peers. Björn also serves as president of IOUG’s RAC special interest group. He is always challenging himself, personally through physical activities including triathlons, and professionally through his ongoing quest to increase his knowledge of Oracle and other leading technologies.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *