Full credit for this tale should go to my colleague Christo Kutrovsky for the inspiration and basic plan involved.
We recently migrated a large database from Solaris SPARC to Solaris x86-64. All seemed to go well with the migration, but in the next few weeks, we noticed some I/O issues cropping up. Some research led us to find that the ZFS filesystem used to hold the datafiles was killing us on I/O. The default “recordsize” setting for ZFS was 128k.
$ /usr/sbin/zfs get recordsize zfs-data
NAME      PROPERTY    VALUE    SOURCE
zfs-data  recordsize  128K     default
An Oracle database typically uses 8k for the block size, but in this case it was 16k. We saw basically the same thing that Neelakanth Nadgir described in his blog post, Databases and ZFS:
With ZFS, not only was the throughput much lower, but we used more [than] twice the amount of CPU per transaction, and we are doing 2x times the IO. The disks are also more heavily utilized. We noticed that we were not only reading in more data, but we were also doing more IO operations [than] what is needed.
The fix is to set the ZFS recordsize for a datafile filesystem to match the Oracle instance’s
db_block_size. We also read in the ZFS Best Practices Guide that redo logs should be in a separate filesystem with the default ZFS recordsize of 128k. We already had them separate, so we just needed to get our datafiles on a ZFS filesystem with a 16k recordsize.
The first step is to change the ZFS recordsize for the current filesystem. Easy enough, but the change in recordsize will only apply to new files written there. It doesn’t do anything for our existing terabyte-plus of datafiles. This means we need to move the files to a new ZFS filesystem with the correct recordsize already set. That’s where the magic happens.
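Creating the new filesystem is a one-liner. A minimal sketch, assuming a pool named zfs-data and a mountpoint matching the paths used later in this post (your pool, dataset, and mountpoint names will differ):

```
# Create a new filesystem in the same pool, with the recordsize
# matching the Oracle db_block_size (16k in our case):
# (pool/dataset and mountpoint names here are illustrative)
zfs create -o recordsize=16k \
    -o mountpoint=/u01/app/oracle/oradata/orcl/zfs16k \
    zfs-data/oradata16k

# Verify the property took effect:
zfs get recordsize zfs-data/oradata16k
```

Note that recordsize must be set at (or before) file-creation time for it to apply, which is exactly why the move is needed.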
My first thought was that we'd need a whole other slab of disk. Then I learned that we can simply create a new ZFS filesystem on the same pool of disks as the old filesystem. Obviously we won't be able to hold two complete copies of the datafiles, but this lets us migrate groups of files at a time (for example, a tablespace). When the move is complete, disk space is released from the old filesystem, so the sum of used disk is essentially unchanged.
The rest is just a matter of moving datafiles. Now, we all know you can’t just move or copy Oracle datafiles from a running instance. We also know that clients want as little downtime as possible, so doing a complete shutdown/move/mount/rename/open is out of the question. We could offline a tablespace, move its datafiles, rename them in the database, then online the tablespace again. Here’s a sample scenario, starting with tablespace and datafile creation:
SQL> create tablespace zfstest
  2  datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf'
  3  size 128m;

Tablespace created.

SQL> alter tablespace zfstest
  2  add datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
  3  size 128m;

Tablespace altered.

SQL> create table zfsobjects tablespace zfstest
  2  as select * from dba_objects;

Table created.
Assume the following:

/u01/app/oracle/oradata/orcl/zfs128k/ is a ZFS filesystem with the default 128k recordsize
/u01/app/oracle/oradata/orcl/zfs16k/ is a ZFS filesystem with a 16k recordsize
So here we have datafiles created on a ZFS filesystem with the default 128k recordsize, and a small table just to test data access. To move the entire tablespace in one go, we do this:
SQL> alter tablespace zfstest offline;

Tablespace altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf /u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf'
  2  to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';

Database altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf /u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
  2  to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';

Database altered.

SQL> alter tablespace zfstest online;

Tablespace altered.

SQL> select count(*) from zfsobjects;

  COUNT(*)
----------
     68844
Pretty simple. We escape to the OS once per file to run the mv command that moves the datafile. If you want to be safe, you could use cp instead of mv to copy the file; just know that disk space on the old filesystem won't be released until you delete the old files.
Note that this operation could take a very long time in the case of large tablespaces. We could try a somewhat similar approach of taking one datafile offline at a time and moving it. We'd just need to perform a quick recovery before bringing it online. That would look like this:
SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf' offline;

Database altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf /u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf'
  2  to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';

Database altered.

SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf';
Media recovery complete.

SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf' online;

Database altered.

SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf' offline;

Database altered.

SQL> !mv /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf /u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf

SQL> alter database rename file '/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf'
  2  to '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';

Database altered.

SQL> recover datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf';
Media recovery complete.

SQL> alter database datafile '/u01/app/oracle/oradata/orcl/zfs16k/zfstest02.dbf' online;

Database altered.

SQL> select count(*) from zfsobjects;

  COUNT(*)
----------
     68844
Seems fine. Note that your instance needs to be in ARCHIVELOG mode to be able to take datafiles offline. Obviously, if someone tried to access data on an offline datafile they’d get this error:
SQL> select count(*) from zfsobjects;
select count(*) from zfsobjects
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/orcl/zfs16k/zfstest01.dbf'
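Before starting, you can confirm that the instance is in ARCHIVELOG mode with a quick query (output shown is what you'd expect on a correctly configured instance):

```
SQL> select log_mode from v$database;

LOG_MODE
------------
ARCHIVELOG
```

If this returns NOARCHIVELOG, switch archiving modes first, which does require a brief bounce of the instance.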
So if a datafile is particularly large, this can still mean a few minutes of some data being offline. We have one more tool at our disposal that provides a quicker turnaround: RMAN's
BACKUP AS COPY command. As it says, this creates a copy of a datafile as a backup. The benefit is that it does this while the current datafile (and data) remains online. We then just need to offline, switch, recover, and online each datafile, which takes very little time compared to the time spent moving a large datafile. That process would be something like this.
First, we want to know the file numbers of our datafiles. This is because the RMAN procedure doesn't provide an easy way to preserve datafile names when we back up an entire tablespace (which we're going to do):
SQL> select file_id, file_name
  2  from dba_data_files
  3  where tablespace_name = 'ZFSTEST';

   FILE_ID FILE_NAME
---------- ---------------------------------------------------
         6 /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
         7 /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
So we have file numbers 6 and 7. Now we create our RMAN command file and execute it:
RMAN> backup as copy
2> tablespace zfstest format '/u01/app/oracle/oradata/orcl/zfs16k/%U';
3>
4> sql 'alter database datafile 6 offline';
5> switch datafile 6 to copy;
6> recover datafile 6;
7> sql 'alter database datafile 6 online';
8>
9> sql 'alter database datafile 7 offline';
10> switch datafile 7 to copy;
11> recover datafile 7;
12> sql 'alter database datafile 7 online';
13>
Starting backup at 18-AUG-09
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=134 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00006 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh tag=TAG20090818T140640 RECID=2 STAMP=695225208
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf
output file name=/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0 tag=TAG20090818T140640 RECID=3 STAMP=695225222
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:07
Finished backup at 18-AUG-09

sql statement: alter database datafile 6 offline

datafile 6 switched to datafile copy "/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh"

Starting recover at 18-AUG-09
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:00
Finished recover at 18-AUG-09

sql statement: alter database datafile 6 online

sql statement: alter database datafile 7 offline

datafile 7 switched to datafile copy "/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0"

Starting recover at 18-AUG-09
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:00
Finished recover at 18-AUG-09

sql statement: alter database datafile 7 online

Recovery Manager complete.
Voila! The tablespace is moved with minimal downtime:
SQL> select file_id, file_name
  2  from dba_data_files
  3  where tablespace_name = 'ZFSTEST';

   FILE_ID
----------
FILE_NAME
--------------------------------------------------------------------------------
         6
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-6_01kn0irh

         7
/u01/app/oracle/oradata/orcl/zfs16k/data_D-ORCL_I-1203425866_TS-ZFSTEST_FNO-7_02kn0is0

SQL> select count(*) from zfsobjects;

  COUNT(*)
----------
     68844
Obviously, you can generate all your statements from SQL scripts, and you'll definitely want to do this if your tablespace has a large number of datafiles. If you are scripting, you could opt to back up one datafile at a time if you need to preserve file names. In this case, I wasn't too concerned.
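A minimal sketch of that kind of statement generator, for the offline/move/rename/recover/online variant shown earlier. The function name, directories, and file names below are my own illustrations, not from the database:

```shell
#!/bin/sh
# gen_move_sql: emit the SQL*Plus statements to relocate one datafile.
#   $1 = old directory, $2 = new directory, $3 = datafile name
# (function and variable names are illustrative)
gen_move_sql() {
  old=$1; new=$2; f=$3
  echo "alter database datafile '$old/$f' offline;"
  echo "!mv $old/$f $new/$f"
  echo "alter database rename file '$old/$f' to '$new/$f';"
  echo "recover datafile '$new/$f';"
  echo "alter database datafile '$new/$f' online;"
}

# Example: emit the statements for two datafiles
for f in zfstest01.dbf zfstest02.dbf; do
  gen_move_sql /u01/app/oracle/oradata/orcl/zfs128k \
               /u01/app/oracle/oradata/orcl/zfs16k "$f"
done
```

Redirect the output to a .sql file and review it before running it in SQL*Plus; file names with spaces or quotes would need extra care.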
UPDATE: Teammate Marc Billette pointed out that the old datafiles are not removed when doing a datafile copy and switch. I had indeed left them on disk, still consuming space! Once we are done with all the datafile switches, our original datafiles become the copies, and we can list and delete them in RMAN like this:
RMAN> list datafilecopy all;

List of Datafile Copies
=======================
Key     File S Completion Time Ckp SCN    Ckp Time
------- ---- - --------------- ---------- ---------------
4       6    A 18-AUG-09       629982     18-AUG-09
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
5       7    A 18-AUG-09       630001     18-AUG-09
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf

RMAN> delete datafilecopy all;

allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=135 device type=DISK
List of Datafile Copies
=======================
Key     File S Completion Time Ckp SCN    Ckp Time
------- ---- - --------------- ---------- ---------------
4       6    A 18-AUG-09       629982     18-AUG-09
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf
5       7    A 18-AUG-09       630001     18-AUG-09
        Name: /u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf

Do you really want to delete the above objects (enter YES or NO)? YES

deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest01.dbf RECID=4 STAMP=695225227
deleted datafile copy
datafile copy file name=/u01/app/oracle/oradata/orcl/zfs128k/zfstest02.dbf RECID=5 STAMP=695225228
Deleted 2 objects
Thanks to Marc for catching this early in the published life of the article!
I hope this post can help other people who run into this problem with Oracle on ZFS (I've heard from one on Twitter already), or people just wanting to move datafiles. Again, thanks to Pythian DBA Christo Kutrovsky for laying this process out for us!
Note: this process was tested on Oracle 10gR2 (10.2.0.2) and 11gR1.
Great article with some interesting ideas.
One question though: Is there not the possibility that the datafile is being written to while the “backup as copy” is running (and until the point when the file is taken offline)?
How do we ensure that the *new* copy of the file is up-to-date?
Peter: That’s why you have to run the “recover datafile X” command after the switch and before putting the new copy online.
What you are suggesting here will not work for system level tablespaces such as SYSTEM, SYSAUX, UNDO. These will require the entire database to be down to move. They cannot be taken offline while the database is running. You’d still have a database outage!
Also, taking individual datafiles offline in a running system can cause logical data problems which might not be immediately recognizable in a high-throughput application with referential integrity. For example, you could generate an order record in one tablespace and then, when attempting to assign the order to a customer in an offline tablespace, end up with an orphaned order.
We are encountering this same issue right now. Luckily, the system is not live yet. Good articles such as yours led us to decide to change the ZFS recordsize before we went live (thank goodness)!!!
Michael, you can do it with UNDO. Simply create a new UNDO tablespace in the new ZFS mount. Then change the system undo setting. Wait for all operations to use the new UNDO and you’ll be able to drop the old one.
This is precisely what I did, same goes for TEMP.
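[Editor's note: a minimal sketch of the UNDO switch described above; the tablespace names, datafile path, and size are illustrative, not from the original system:]

```
SQL> create undo tablespace undotbs2
  2  datafile '/u01/app/oracle/oradata/orcl/zfs16k/undotbs2_01.dbf'
  3  size 512m;

SQL> alter system set undo_tablespace = undotbs2;

-- once no active transactions still reference the old undo segments:
SQL> drop tablespace undotbs1 including contents and datafiles;
```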
You are wrong. Even to switch UNDO from one tablespace to another, you have to shut down the database, update the parameter file, and start the database; only then will the database start using the new UNDO tablespace.
It is bad for me that I am facing the same I/O problem on my disk box. I have just migrated to the ZFS filesystem and the reads are excessive. I am planning to change the recordsize to 8K. I want to know if this solved the problem.
[…] Next, I limited the ZFS ARC cache to about 100MB and re-ran my tests. This time, I was able to get the disks to do something, but I maxed out at about 6.000 IOPS when I realized that the ZFS blocksize was set to 128kB which is probably a pretty stupid idea when doing random read tests on 8k data blocks. Oh, and I was not the first one to notice this, read about it in more detail from Don Seiler. […]