Aligning ASM Disks on Linux

Posted in: Technical Track

Linux is a wonderful operating system. However, there are a number of things one needs to do to make sure it runs as efficiently as possible. Today, I would like to share one of them. It has to do with using ASM (Automatic Storage Management) disks.

In Linux, there are 2 major ways to create ASM disks:

  1. You can use the ASMlib kernel driver.
  2. You can use devmapper devices.

You could also use /dev/raw devices, but I don’t recommend this at all. I will write another blog explaining why.

Regardless of which approach you take, you have to create partitions on your LUNs. Starting with version 2, ASMlib won’t let you use the entire disk. You have to create a partition.

The reason to force the creation of this partition is to make it explicit that something exists on that device, and that it's not empty. Otherwise, some OS tools assume the disk is unused and could mark it, or just begin using it, and overwrite your precious Oracle data.
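For illustration, this is roughly how a partition gets stamped for ASMlib once it exists (a sketch: the disk label "DATA1" is arbitrary, and the path of the oracleasm script may differ with your ASMlib installation):

[~]# /etc/init.d/oracleasm createdisk DATA1 /dev/sde1
[~]# /etc/init.d/oracleasm listdisks
DATA1

Once stamped, the partition carries an ASM signature that other tools (and other administrators) can see.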


Most people would use the “fdisk” command provided by Linux distributions. This command is quite old, and so has some old-fashioned DOS-style behaviours built in.

When you create your partition, by default, the unit of measure is based on cylinders. Here’s a typical print command from fdisk on a 35 GB disk:

Command (m for help): p

Disk /dev/sde: 35.8 GB, 35864050176 bytes
64 heads, 32 sectors/track, 34202 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1       34202    35022832   83  Linux

Notice where it says Units. Each cylinder is 2048 sectors of 512 bytes each, which is exactly 1 MB. So your units are 1 MB. But that holds only when the disk is relatively small.

When your disk is larger — which is far more typical in the database world, especially with RAID arrays — the Units change:

Disk /dev/sdc: 513.5 GB, 513556348928 bytes
255 heads, 63 sectors/track, 62436 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       62436   501517138+  83  Linux

Take a look at the Units number. It's 8 MB minus 159.5 kB. This is a very weird number, totally misaligned with any possible stripe size or stripe width (stride).
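You can check that arithmetic straight from the shell:

[~]# echo $(( 16065 * 512 ))
8225280
[~]# echo $(( 8 * 1024 * 1024 - 16065 * 512 ))
163328

163,328 bytes is 159.5 kB: exactly how far each "cylinder" falls short of 8 MB.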

This, by itself, is not a big deal, since the best practice is to have 1 partition per LUN, which represents the entire device. However, this is not the end of it. If you switch to sector mode, you will see what the true start offset is:

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): p

Disk /dev/sdc: 513.5 GB, 513556348928 bytes
255 heads, 63 sectors/track, 62436 cylinders, total 1003039744 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63  1003034339   501517138+  83  Linux

Notice the start sector, the 63rd sector. That's the 31.5 kB boundary (63 × 512 bytes = 32,256 bytes). This value does not align with any stripe size. Stripe sizes are usually powers of 2, 64 kB and above.

The result is that, every so often, a block will be split between 2 separate hard disks, and the data will be returned at the speed of the slower (busier) device.

Assuming the typical 64 kB stripe (way too low, as I will discuss in another blog) and an 8 kB database block size, every 8th block will be split between 2 devices. If you do the math, that's 12.5% of all your I/O. Not a huge number by itself, but consider how disks are arranged in RAID 5: instead of a logical write being 2 reads and 2 writes (read data and parity, update, then write them back), each logical write could be 4 reads and 4 writes, significantly increasing your disk activity.
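If you want to sanity-check that figure, here is a quick shell sketch (plain arithmetic, nothing Oracle-specific): starting at the 31.5 kB offset, walk eight consecutive 8 kB blocks and flag any block that straddles a 64 kB stripe boundary. Eight blocks suffice, since 8 × 8 kB equals one stripe and the pattern then repeats.

[~]# for i in 0 1 2 3 4 5 6 7; do
>   off=$(( (32256 + i * 8192) % 65536 ))     # block's offset within its 64 kB stripe
>   [ $off -gt $(( 65536 - 8192 )) ] && echo "block $i is split across 2 disks"
> done
block 4 is split across 2 disks

One block out of eight: 12.5% of I/Os.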

The solution?

Before you create your partitions, switch to “sector” mode, and create your partitions at an offset that is a power of 2.

I typically create my partitions at the 16th megabyte (the 32768th sector). Essentially, I "waste" 16 MB, but gain aligned I/O for stripe sizes of up to 16 MB.
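If you script your storage setup instead of driving fdisk by hand, the same layout can be created non-interactively. A sketch, assuming a reasonably recent parted (the device name is illustrative):

[~]# parted -s /dev/sdg mklabel msdos
[~]# parted -s /dev/sdg mkpart primary 32768s 100%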

The procedure to create an aligned disk with fdisk is:

[~]# fdisk /dev/sdg
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel

When building a new DOS disklabel, the changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable.

Command (m for help): u
Changing display/entry units to sectors
Command (m for help): p
Disk /dev/sdg: 143.4 GB, 143457779712 bytes
255 heads, 63 sectors/track, 17441 cylinders, total 280190976 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot      Start         End      Blocks   Id  System
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (63-280190975, default 63): 32768
Last sector or +size or +sizeM or +sizeK (32768-280190975, default 280190975):
Using default value 280190975
Command (m for help): p
Disk /dev/sdg: 143.4 GB, 143457779712 bytes
255 heads, 63 sectors/track, 17441 cylinders, total 280190976 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot      Start         End      Blocks   Id  System
/dev/sdg1           32768   280190975   140079104   83  Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

This way, the disk is aligned. When the partition is aligned on at least a 1 MB boundary, the ASM files will also be aligned at the 1 MB boundary.
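One quick way to double-check an existing partition (assuming the sysfs layout of modern kernels) is to read its start sector and confirm it divides evenly by the boundary you care about; 2048 sectors is 1 MB:

[~]# cat /sys/block/sdg/sdg1/start
32768
[~]# echo $(( 32768 % 2048 ))
0

A remainder of 0 means the partition starts on a 1 MB boundary.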

This alignment also applies to ext3 file systems. ext3 takes it a step further, allowing you to provide the array stride as a parameter at creation time, optimizing write performance (I have not tested this). Look at the man pages for more information.
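For example (a sketch, with assumed numbers): the stride is the stripe size expressed in file system blocks, so a 64 kB stripe with 4 kB ext3 blocks gives a stride of 16.

[~]# mkfs.ext3 -b 4096 -E stride=16 /dev/sdc1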


About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.

26 Comments

Hi Christo

I think this paper from Oracle mentions skipping at least 1 MB as a best practice in ASM to avoid misalignment. Not sure if it applies to Linux, but I guess so.

Great post

Thanks

Reply
Christo Kutrovsky
April 1, 2007 7:22 pm

Yes, it does mention it, for Solaris.

The alignment applies to any OS. I must point out that the default partitions in Linux are the worst case for alignment.

Reply
Gustavo Tamaki
April 11, 2007 4:52 am

Hi Christo,

In your example, with a stripe size of 64KB, the number of unaligned I/Os is about 12.5%. If I consider the ASM stripe size of 1MB, can I assume that this number would drop to less than 1%?

For a storage configured in RAID-10, the effect would be negligible, since the misalignment occurs only at the stripe boundaries and there's no write penalty, correct?

Gustavo

Reply
Christo Kutrovsky
April 11, 2007 8:26 am

Hi Gustavo,

The alignment has to do with the RAID-level stripe size, not with the ASM stripe size.

There's always a "penalty" on misaligned I/O: you are using 2 devices when you could have used only 1. Note that using multiple devices is not bad by itself. What's bad is reading small amounts of data from multiple devices, as opposed to just 1.

You generally want to be doing at least 512 kB per I/O per device to achieve 90% of its sequential read speed (in my testing).

Reply

Hi Christo,
I was told by the SAN administrators here that there's no way to change the stripe size for an EMC LUN to 1MB, since it has already been set to 128KB for a big chunk of RAID5 that every LUN will come from.

In this case, there's no choice but to accept that as a fact. Is it still necessary to do the alignment as above for a 128KB stripe size?

Thanks,
Hai

Reply
Christo Kutrovsky
May 9, 2007 9:25 am

Yes, it is, because the default is not a multiple of your stripe size. As long as it is a multiple of your stripe size, you are good.

So if you offset at the 16th MB as I do, then you will be aligned for all stripe sizes that are a power of 2, up to a 16 MB stripe size.

Now, EMC has these hypervolumes with the weird stripe size of 960 kB, which completely destroys any alignment attempts :)

Reply

Thanks Christo. Since we aren't going to get any stripe size other than 128KB in this case, I guess I could configure the first sector to be 256 (128K) instead of 32768 (16M), so that it would be aligned for the 128K stripe size. Do you see any issues with doing this?
Thanks,
Hai

Reply
Christo Kutrovsky
May 9, 2007 12:36 pm

It's all good. However, consider this: ASM does its allocations in 1 MB "blocks". I wouldn't go under 1 MB.

However, given what I know so far, 256K will be fine. Remember, we're talking about saving ~760 kb here …

Reply
David Edwards
May 9, 2007 1:43 pm

[Editorial Note: Christo pointed out that the sidebar on this page was getting pushed to the very bottom in IE browsers. The culprit was the long raw URL in the first comment (by LSC). To fix it, I edited the comment, condensing the URL into the words “this paper from Oracle”. The rest of that comment is untouched.]

Reply

OK, thanks for the confirmation. I will go with 2048 (1 MB) then.

Reply

Great post indeed! Very hard to find information about this topic.

Another great page is the following:
https://insights.oetiker.ch/linux/raidoptimization.html

What would be great is a complete post on how to optimize the whole thing…

HW RAID -> Partitioning -> LVM -> File system options (ext3 stride, fs-type largefile4, etc.)

About the following:

> Essentially, I “waste” 16 MBs, but gain aligned
> I/O for stripe width of up to 16 MB.”

Based on the following page, I suspect "stripe width" isn't used correctly here…

“Stripe width refers to the number of parallel stripes that can be written to or read from simultaneously. This is of course equal to the number of disks in the array.”

https://www.storagereview.com/guide2000/ref/hdd/perf/raid/concepts/perfStripe.html

Reply

Why is your text so pale? Use plain BLACK, not gray. Don’t make your text hard to read.

Reply
Yury Velikanov
January 14, 2010 11:27 pm

Hi Christo,

It is a very cool article, thank you for sharing it with us. I do have a few questions/doubts about the approach and the way to validate it.

— In an Enterprise SAN infrastructure, LUNs are provided from the storage backend to Oracle, the consumer, without any reliable information (I assume here, please correct me if I am wrong) about the physical layout of the underlying disks. I mean that the SAN infrastructure hides information such as cylinders/sectors/etc., as each LUN isn't a separate physical disk anymore. It is a combination of disks and the stripe technology used by the particular vendor (with, hopefully, their own optimizations). By the time a LUN is presented to Linux/fdisk (I assume), it is a long way from being a simple disk. Therefore, even assuming that we are making the right offset to avoid double I/O in certain cases, I would like to have some good method to verify my configuration (see the second point).

— Let's imagine we made a 16MB offset when creating our partition. We made an assumption that it will help us to increase I/O performance by 12%. The question: is there any good way to validate our assumption? I would like to be certain that I am releasing a storage chunk to my production space consumers in the best possible configuration. Could you think of a simple test to validate our suggested assumption? I am guessing that iostat couldn't be used in this case, as we are talking about a lower-level I/O split at the device driver layer; or, in the case of a SAN, it is going to be hidden somewhere in the I/O backend controllers.

— It looks like, the way you put it, the suggestion applies to fdisk only. Most probably the other volume managers (like Veritas, etc.) might have the same issue; however, it should be checked separately (with vendor support, etc.).

— If the reason to introduce the fdisk activity is to "make explicit that something exists", I would introduce additional maintenance procedures for the technical staff, make sure that everybody is aware of the configuration chosen, and go with direct device usage (/dev/sdc rather than /dev/sdc1) to avoid an additional I/O layer (the partition) and the risk of losing 12% of I/O performance. Are there other reasons to use partitions? I assume here that fdisk doesn't help to address the device permissions and naming issues.

Thank you in advance,
Yury

Reply
Christo Kutrovsky
January 27, 2010 2:11 pm

Yury,

With metaLUNs and the like, it quickly becomes far more complicated than one can measure reliably. At least at this point.

The goal here is to ensure Oracle's reads are aligned to some power of 2 within the LUN.

The process I described exists due to old-style legacy disk partitioning (cylinders/heads). Today, we rely on LBAs and just treat the space linearly.

fdisk (and others), however, still operates under legacy assumptions for disks > 32 GB, and thus tries to align on something that doesn't exist anymore (cylinders).

Thus we align on a power of 2 on the LBA, under the assumption that the underlying storage will be able to serve us better and to better optimize its caching blocks internally.

Reply

nice article, thanks for sharing

Reply
Moluccan » Blog Archive » disk partitions
July 4, 2011 12:26 pm

[…] (Where ?? are the characters used to identify the device.) The reason we are using 32768 as the start block for the device is because of the way scsi/san volume mapping uses the first few blocks of a device and subsequently causes excessive write I/O. This is better explained in the following link: https://www.pythian.com/news/411/aligning-asm-disks-on-linux […]

Reply
geofrey rainey
October 3, 2012 6:29 pm

Hi,

Well written and easy to understand, thank you.

A small typo for you to fix, “63th” should be “63rd”.

Thanks.

Reply

Thank you for the article. This was a quick and good explanation.
Is there a way from the OS to find out the stripe size you are dealing with, or is that something you’d have to ask the storage team?

Reply

Hi Christo,
Awesome article, but I have a few doubts.
I am trying a 1MB offset on one of the raw disks (/dev/sdb). For the first partition (/dev/sdb1), I specified the first sector as 2048 and the last sector as +4G. When I try to create another partition (the second partition), I see the first sector value offered is 2. If I select 2, will the next partition also be at a 1MB offset?

Device Boot Start End Blocks Id System
/dev/sde1 2048 8390656 4194304+ 83 Linux

Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 2
First sector (2-209715199, default 2):

This is how it is asking, and I am not sure how to proceed. Please advise.

Reply

If a partition starts at sector 1 (units = 1 sector), is it still considered a misaligned partition?

Disk /dev/xvdb: 2199.0 GB, 2199023255552 bytes
Units = sectors of 1 * 512 = 512 bytes

Device Boot Start End Blocks Id System
/dev/xvdb1 1 4294967295 2147483647+ ee GPT

If sector numbering starts at 0, then I assume it is.
Although, when you create a partition in fdisk, the smallest start sector it lets you choose is 1.

Reply

Never mind. Sector 0 stores the partition table, so a partition can't start from sector 0.

Reply

“Notice the start sector, 63th sector. That’s the 31.5 kB boundary”

I did not get this. Can you please explain the 31.5 kB boundary, and why a block will be spread across two devices?

Reply


What an amazing article you shared!
A little typo for you to fix: "63th" should be "63rd".

Reply
Rosie Leonard
May 15, 2017 11:23 am

Thanks for reading and for the feedback, Agha!

Reply
