Oracle UTF8 Encoding and String Manipulation Overhead


For one of our customers, I've recently reviewed the strategy of migration from a single-byte encoding to variable-length multi-byte UTF8 (AL32UTF8 in Oracle naming standards). These types of projects come up again and again, so I think they must be familiar to many of you and this bit might be useful. I'm also interested in your experience: perhaps you can run this simple simulation on your platforms and provide the results in the comments?

Back to the project… One area to assess was the overhead of string manipulation. Based on Pythian's experience as well as feedback from my network, I could conclude that nobody had observed any measurable performance degradation or significant CPU usage growth clearly attributable to a UTF8 migration.

Thus, I decided to simulate at least some operations and measure them. One of the concerns was sub-string extraction: with a fixed-length encoding, a sub-string is super easy to extract because the byte offset and byte length are known up front. With a variable-length character set, the byte offset and byte length are not known until the string has been traversed character by character from the very beginning.
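To make that concrete, here is a tiny illustration (a sketch only, assuming a session against an AL32UTF8 database where 'é' occupies two bytes): SUBSTR counts characters while SUBSTRB counts bytes, so the same positions can return different results.

[sql]
-- Character vs. byte semantics ('résumé' is 6 characters but 8 bytes in AL32UTF8):
select length('résumé')        as char_len,   -- 6 characters
       lengthb('résumé')       as byte_len,   -- 8 bytes: each 'é' takes two
       substr('résumé', 5, 2)  as char_tail,  -- 'mé' (positions counted in characters)
       substrb('résumé', 5, 2) as byte_tail   -- 'um' (positions counted in bytes)
  from dual;
[/sql]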

This is the PL/SQL block I came up with:

[sql]
declare
  s1 varchar2(4000 char);
  s2 varchar2(4000 char);
  i INTEGER;
begin
  s1 := rpad('a',3999,'a');
  for i in 1..10000000 loop
    null;
    --s2 := 'a' || 'b' || 'c' || 'd';
    --s2 := substr(s1,3000,1) || substr(s1,3001,1) || substr(s1,3002,1) || substr(s1,3003,1);
  end loop;
  dbms_output.put_line(s2);
end;
[/sql]

There are 3 cases and only one of them should be left un-commented when running the block. The null; case is there just to capture the baseline and measure the time required to run an empty loop. Concatenation, s2 := 'a' || 'b' || 'c' || 'd';, is the second case and, finally, substring extraction, s2 := substr(s1,3000,1) || ..., is the third.
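For reference, one simple way to run it and capture the elapsed time of the block is plain SQL*Plus timing (a sketch; the script name is hypothetical):

[sql]
SQL> set serveroutput on
SQL> set timing on
SQL> @utf8_string_test.sql  -- the block above, with exactly one of the three cases un-commented
[/sql]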

I ran it in the VMs on my laptop as well as on the customer's own physical servers to confirm the timing. Below are the results from my virtual machines, but the percentages were pretty much the same everywhere. The platform is Linux, on Oracle versions 10.2 and 11.1. Note that 11.2 is smarter and optimizes away the block with the NULL loop, so you would need to trick the PL/SQL optimizer a bit harder.
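One way to do that (an assumption on my part; I have not verified it on 11.2) is to turn PL/SQL optimization down for the session before running the block, so that dead code such as the NULL loop is not eliminated:

[sql]
-- PLSQL_OPTIMIZE_LEVEL = 0 compiles PL/SQL without the usual optimizations,
-- which should keep the empty loop from being optimized away.
alter session set plsql_optimize_level = 0;
[/sql]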

Character set    null   concat   substr
WE8ISO8859P1     0.5    1.8      8.0
AL32UTF8         0.5    2.2      9.9

The times in the table are the seconds the PL/SQL block was executing. I ran it a few times and averaged the results, but they were very consistent anyway. Also, skip the very first measurement to avoid the impact of parsing the block.

To interpret the results: concatenation alone (excluding the 0.5s of looping itself) accounts for 1.3s vs 1.7s (1.8 - 0.5 vs 2.2 - 0.5), i.e. about 30% UTF8 overhead. The SUBSTR calls alone (excluding the loop and the concatenation) account for 6.2s vs 7.7s (8.0 - 1.8 vs 9.9 - 2.2), which is only about 25% overhead.

I found the substring overhead rather small; to be honest, I expected an order-of-magnitude difference. However, 30% overhead on concatenation seemed a little too much, and I don't see why it should be that high.

Now, while the overhead is rather noticeable, I haven't seen databases whose workload is mostly string manipulation in SQL and PL/SQL. I think that's why, in the big picture, we generally don't see much performance impact from moving to a variable-length character set.

If you have access to databases with fixed-length and variable-length character sets on similar hardware, feel free to post your results below.



9 Comments

Jonathan Lewis
June 17, 2010 2:07 am

Alex,
It's possible that any "real-world" problems due to increases in CPU for string manipulation are insignificant compared to other side effects. I'm thinking particularly of (a) increased storage for the same number of characters and (b) changes in execution plans because of the way the optimizer calculates stats for character columns.

The optimization problem is particularly nasty since some of the figures are based (essentially) on the first six BYTES of the string representation – which means, in the worst case, the first 1.5 (?!) characters in a multibyte set that can use 4 bytes per character.

You might also examine the impact on parallel queries which do a lot of messaging between layers of slaves – with longer strings being passed around you get fewer rows into a PX message buffer.

P.S. Having commented on the 6 byte thing, I realise I haven’t checked 11g to see if it has changed the way it handles a multi-byte set.

Alex Gorbachev
June 17, 2010 8:32 am

Jonathan,

I agree regarding the impact on real-world workloads, and I tried to mention it. Thanks for emphasizing it again.

Regarding string sizes: this is something that's emphasized in all guides to multi-byte encodings, so it's tough to miss. Even then, as long as most of the text is in the US7ASCII range, the size increase during migration is usually insignificant, and this is what I've observed, but YMMV. What people often miss is how CLOB content doubles. Maybe that's something for another blog.

The histogram issue is a very tricky one, and I recall you mentioning it to me already. Maybe I'll get around to a test case just to demonstrate it.

Parallel messaging overhead: true, in case the text size does indeed grow significantly.

Alberto Dell'Era
June 17, 2010 2:22 am

It looks strange to me that there is a difference at all in the concat case; shouldn't concatenation be just a matter of

– summing the length “bytes”
– concatenating the encodings

whatever the charset is?

Moreover, ASCII characters such as ‘a’,’b’,’c’ and ‘d’ have the same encoding in both WE8ISO8859P1 and AL32UTF8…

Alex Gorbachev
June 17, 2010 8:33 am

Alberto, this is what I found strange as well, so it would be really nice to validate my findings if someone is able to post their numbers.

Jonathan Lewis
June 17, 2010 9:35 pm

How about using variables instead of constants? It seems unlikely, but perhaps Oracle is doing a strange sys_op_c2c() or similar on the constants before performing the concatenation.

Øyvind Isene
June 22, 2010 3:52 am

I ran the same tests and also repeated them for Norwegian characters (æ,ø,å); they are single-byte in WE8ISO8859P1 and two bytes in UTF8. I found no difference in response time between tests with (a,b,c) and (æ,ø,å) on either database. The UTF8 database is running 11.1, the other is running 10.2 (both Linux, but with different hardware).

Set    Test   Result (1/100 sec)
8bit   null    25
8bit   cc-E   146
8bit   cc-N   145
8bit   ss-E   574
8bit   ss-N   576

utf8   null    39
utf8   cc-E   119
utf8   cc-N   117
utf8   ss-E   479
utf8   ss-N   481

My concern with UTF8 is not CPU, but refactoring. Columns containing non-English characters stored in varchar2(4000) require migration to CLOB when the byte length surpasses 4000 bytes (e.g. when a string of close to 4000 characters in WE8ISO8859P1 contains Nordic characters, they expand to two-byte characters in UTF8 and the total length exceeds 4000 bytes). The limit is 4000 bytes, not 4000 characters, even if Oracle lets you define (foo varchar2(4000 char)) in UTF8.
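For example, something along these lines shows the limit (a sketch with a hypothetical table name, assuming an AL32UTF8 database):

[sql]
create table nordic_text (txt varchar2(4000 char));

declare
  v varchar2(4000 char);
begin
  v := rpad('æ', 4000, 'æ');           -- 4000 characters = 8000 bytes in AL32UTF8
  insert into nordic_text values (v);  -- fails: the column still tops out at 4000 bytes
end;
/
[/sql]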

Alex Gorbachev
June 22, 2010 12:05 pm

@Øyvind: This is odd. Do I read your numbers right that the UTF8 database is performing faster than the fixed-length single-byte encoding? It's also odd that the NULL loop is actually slower in the UTF8 DB. Something seems inconsistent. Are they on the same hardware?

Øyvind Isene
June 23, 2010 2:13 am

As stated somewhere in there, they are on different hardware :) I believe the null loop has not improved much from 10.2 to 11.1, so I imagine the 11.1 database with UTF8 is running on "slower" hardware and that the string functions used here have improved from 10.2 to 11.1. I ran the tests several times; there were deviations of course, but the average was quite stable. I initially wanted to show that your tests do not change much on UTF8 in the same database when one replaces characters encoded as a single byte in UTF8 with characters encoded as two or more bytes (as is the case for Norwegian characters); ref. the comments from that fine Italian :)


We are converting a WE8ISO8859P1 Oracle DB character set to AL32UTF8. Before the conversion, I was asked to check the implications for PL/SQL code that uses byte-based SQL functions. Which points should be considered while checking the implications for PL/SQL code? Searching Google, I could find three such functions: SUBSTRB, LENGTHB, INSTRB. What do I check if these functions are used in PL/SQL code? What other functions should I check? What do I check in PL/SQL if varchar and char type declarations exist in the code? How do I check the implications of the database character set change to AL32UTF8 for byte-based SQL functions? Please explain as if I know only CRUD statements in an Oracle DB.

Regards,
Rashmi

