Recently, while I was working on a backup failure issue, I found that it was failing for a particular database. When I ran the backup manually to a different folder it would complete successfully, but not on the folder that it was directed to when the backup jobs were originally configured . This makes me suspicious about hard disk corruption. In the end, I fixed the backup issues in the interim so that in the future I would not get paged, as well as lowering the risk of having no backup in place.
Upon reviewing the Windows Event logs, it was revealed that I was right about suspecting a faulty hard drive. The log reported some messages related to the SCSI codes, especially the SCSI Sense Key 3 which means SCSI had a Medium error. Eventually, the hard drive was replaced by the client and the database has been moved to another drive. In the past month, I have had about 3 cases where I have observed that the serious messages related to storage are reported as information. I have included one case here for your reference, which may help you in case you see such things in your own logs.
CASE 1 – Here is what I found in the SQL Server error log:
- Error: 18210, Severity: 16, State: 1
- BackupIoRequest::WaitForIoCompletion: read failure on backup device ‘G:\MSSQL\Data\SomeDB.mdf’.
- Msg 3271, Level 16, State 1, Line 1
- A non-recoverable I/O error occurred on file “G:\MSSQL\Data\SomeDB.mdf:” 121 (The semaphore timeout period has expired.).
- Msg 3013, Level 16, State 1, Line 1
- BACKUP DATABASE is terminating abnormally.
When I ran the backup command manually I found that it ran fine until a specific point (i.e. 55%) before it failed again with the above error. Further, I decided to run DBCC CHECKDB which reports when a particular table has a consistency error at a particular page. Here are the reported errors:
Msg 8966, Level 16, State 2, Line 1 Unable to read and latch page (1:157134) with latch type SH. 121(The semaphore timeout period has expired.) failed.
Msg 2533, Level 16, State 1, Line 1 Table error: page (1:157134) allocated to object ID 645577338, index ID 0, partition ID 72057594039304192, alloc unit ID 72057594043301888 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header. The repair level on the DBCC statement caused this repair to be bypassed.
Of course, repairing options did not help as I had anticipated initially, since the backup was also failing when it reached at 55%. The select statement also failed to complete when I queried the object 645577338. The only option that I was left with was to recreate the new table and drop the original table. After this had been done, the full back up succeeded. As soon as this was completed we moved the database to another drive.
I was still curious regarding these errors, so I started looking at Windows Error Logs – System folder, filtering it to show only Errors and Warnings. However, this did not show me anything that attracted me to read further. Thus, I removed the filter, and carefully reviewed the logs. To my surprise, the logs show entries for a bad sector, but, this was in the Information section of Windows Event Viewer, System folder.
Event Type: Information
Event Source: Server Administrator
Event Category: Storage Service
Event ID: 2095
Time: 1:04:18 AM
Description: SCSI sense data Sense key: 3 Sense code:11 Sense qualifier: 0: Physical Disk 0:2 Controller 0, Connector 0.
There could be a different error, warning or information printed on your server depending what the issue is. Upon further review there is still much to be said in order to explain codes and descriptions.
You may have noticed that I have referred to this as CASE 1, which means, I will blog one or two more case(s) in the future. Stay tuned!
Photo credit: Hard Disk KO via photopin (license)
Learn more about our expertise in SQL Server.