Thursday, July 15, 2010

Replacing a degraded drive in a 3Ware Array Unit, and that crazy "u?" unit

Well, THAT was annoying. Last night I tried to replace a degraded drive in my array, and it was not as simple as my previous drive swaps had been. Here's what the array looked like before replacing the drive:

> /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     2793.94   ON     ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            ST31000528AS
p4    OK             u0   931.51 GB SATA  4   -            WDC WD10EACS-00D6B0
p5    OK             u0   931.51 GB SATA  5   -            WDC WD10EADS-00L5B1
p6    OK             u0   931.51 GB SATA  6   -            WDC WD10EADS-00M2B0
p7    DEGRADED       u0   931.51 GB SATA  7   -            WDC WD10EADS-00M2B0

Note that the drive hasn't actually failed; the controller has only marked it DEGRADED.
Next I pulled the drive and installed the replacement. Here's what tw_cli showed now:

# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     2793.94   ON     ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            ST31000528AS
p4    OK             u0   931.51 GB SATA  4   -            WDC WD10EACS-00D6B0
p5    OK             u0   931.51 GB SATA  5   -            WDC WD10EADS-00L5B1
p6    OK             u0   931.51 GB SATA  6   -            WDC WD10EADS-00M2B0
p7    OK             u?   931.51 GB SATA  7   -            WDC WD10EVDS-63U8B0

What does "u?" mean in the Unit column? It means that the controller is going to refuse to use it...

# tw_cli maint rebuild c6 u0 p7
The following drive(s) cannot be used [7].
Error: (CLI:144) Invalid drive(s) specified.
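
A quick way to spot this condition before attempting a rebuild is to scan the port table for "u?" in the Unit column. The following is only a minimal sketch in Python: it assumes the `tw_cli /c6 show` text has already been captured into a string and that port rows have the layout shown above (port name, status, unit as the first three fields). The function names `parse_ports` and `unusable_ports` are made up for this example and are not part of tw_cli.

```python
import re

def parse_ports(show_output):
    """Return {port: (status, unit)} for every pN row in captured `show` text."""
    ports = {}
    for line in show_output.splitlines():
        fields = line.split()
        # Port rows start with p<number>; unit rows (u0) and rulers are skipped.
        if len(fields) >= 3 and re.fullmatch(r"p\d+", fields[0]):
            ports[fields[0]] = (fields[1], fields[2])
    return ports

def unusable_ports(show_output):
    """List ports whose Unit column reads "u?" (hypothetical helper)."""
    return [p for p, (_status, unit) in parse_ports(show_output).items()
            if unit == "u?"]

# Sample rows copied from the output above.
SAMPLE = """\
VPort Status Unit Size Type Phy Encl-Slot Model
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0
p7 OK u? 931.51 GB SATA 7 - WDC WD10EVDS-63U8B0
"""

print(unusable_ports(SAMPLE))  # ['p7']
```

Splitting on whitespace keeps the sketch indifferent to column widths; it only relies on the first three fields of each port row.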


After hours of surfing the web, I never found a solution, but I did figure it out. Since the original drive hadn't outright failed, I decided to swap it BACK IN, forcefully remove it from the array, and then replace it.

# tw_cli /c6/p7 remove
Removing /c6/p7 will take the disk offline.
Do you want to continue ? Y|N [N]: y
Removing port /c6/p7 ... Done.


# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     2793.94   ON     ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            ST31000528AS
p4    OK             u0   931.51 GB SATA  4   -            WDC WD10EACS-00D6B0
p5    OK             u0   931.51 GB SATA  5   -            WDC WD10EADS-00L5B1
p6    OK             u0   931.51 GB SATA  6   -            WDC WD10EADS-00M2B0


Next I swapped the old bad drive for the new drive:

# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     2793.94   ON     ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            ST31000528AS
p4    OK             u0   931.51 GB SATA  4   -            WDC WD10EACS-00D6B0
p5    OK             u0   931.51 GB SATA  5   -            WDC WD10EADS-00L5B1
p6    OK             u0   931.51 GB SATA  6   -            WDC WD10EADS-00M2B0
p7    OK             -    931.51 GB SATA  7   -            WDC WD10EVDS-63U8B0


Ah, now that's better, it's no longer showing that "u?" in the Unit column.

Next I forced the rebuild using the new drive:

# tw_cli maint rebuild c6 u0 p7
Sending rebuild start request to /c6/u0 on 1 disk(s) [7] ... Done.


# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    REBUILDING     0%(A)   -       64K     2793.94   ON     ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            ST31000528AS
p4    OK             u0   931.51 GB SATA  4   -            WDC WD10EACS-00D6B0
p5    OK             u0   931.51 GB SATA  5   -            WDC WD10EADS-00L5B1
p6    OK             u0   931.51 GB SATA  6   -            WDC WD10EADS-00M2B0
p7    DEGRADED       u0   931.51 GB SATA  7   -            WDC WD10EVDS-63U8B0


Hooray. Rebuilding. A few hours later the rebuild completed.
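
For anyone who would rather not eyeball the output every so often, the unit row can be parsed to pull out the status and the %RCmpl figure. This is a rough Python sketch, assuming the `tw_cli /c6 show` text has been captured into a string and that unit rows look like the ones above (unit, type, status, %RCmpl as the first four fields). The helper name `unit_states` is invented for illustration, and the "(A)" suffix is simply ignored rather than interpreted.

```python
import re

def unit_states(show_output):
    """Return {unit: (status, rebuild_percent or None)} from captured `show` text."""
    states = {}
    for line in show_output.splitlines():
        fields = line.split()
        # Unit rows start with u<number>; port rows start with p<number>.
        if len(fields) >= 4 and re.fullmatch(r"u\d+", fields[0]):
            status, rcmpl = fields[2], fields[3]
            match = re.match(r"(\d+)%", rcmpl)  # "-" means no rebuild figure
            states[fields[0]] = (status, int(match.group(1)) if match else None)
    return states

# Unit rows copied from the outputs in this post.
REBUILDING = "u0 RAID-6 REBUILDING 0%(A) - 64K 2793.94 ON ON"
DEGRADED = "u0 RAID-6 DEGRADED - - 64K 2793.94 ON ON"

print(unit_states(REBUILDING))  # {'u0': ('REBUILDING', 0)}
print(unit_states(DEGRADED))    # {'u0': ('DEGRADED', None)}
```

Feeding it fresh output on a timer would give a crude progress monitor; how that output gets captured is left out here on purpose.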

Sunday, July 11, 2010

Another drive bites the dust..

This time, in 266 days:

WDC WD10EADS-00M2B0
6384 hours online

And by "bites the dust", I mean that my RAID6 card has now degraded the array and won't trust the drive anymore. And why should it, after these sorts of errors:

Jul 09, 2010 09:16:23PM (0x04:0x0023): Sector repair completed: port=7, LBA=0x262A8AE1
Jul 09, 2010 09:16:19PM (0x04:0x0023): Sector repair completed: port=7, LBA=0x262A8ADF
Mar 30, 2010 09:27:47AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED55DB
Mar 30, 2010 09:27:44AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED558D
Mar 30, 2010 09:27:41AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED550A
Jan 02, 2010 10:03:49AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x2E5FD932
Jan 02, 2010 10:03:45AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x2E5FD927


Plenty more where that came from... oh well....
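
To get a feel for how often a port is racking up repairs, log lines like the ones above can be tallied per port and per day. The snippet below is a sketch only: it assumes the log text has been saved into a string, that every line of interest matches the "Sector repair completed" format shown above, and that month names are in English (which `strptime` with `%b` needs). The names `parse_repairs` and `repairs_per_port` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

LINE_RE = re.compile(
    r"(?P<when>\w{3} \d{2}, \d{4} \d{2}:\d{2}:\d{2}[AP]M) "
    r"\(0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+\): "
    r"Sector repair completed: port=(?P<port>\d+), LBA=(?P<lba>0x[0-9A-Fa-f]+)"
)

def parse_repairs(log_text):
    """Return a list of (timestamp, port, lba) tuples for sector-repair lines."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            when = datetime.strptime(m.group("when"), "%b %d, %Y %I:%M:%S%p")
            events.append((when, int(m.group("port")), int(m.group("lba"), 16)))
    return events

def repairs_per_port(events):
    """Count repair events per port number."""
    return Counter(port for _when, port, _lba in events)

# Log lines copied from above.
LOG = """\
Jul 09, 2010 09:16:23PM (0x04:0x0023): Sector repair completed: port=7, LBA=0x262A8AE1
Jul 09, 2010 09:16:19PM (0x04:0x0023): Sector repair completed: port=7, LBA=0x262A8ADF
Mar 30, 2010 09:27:47AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED55DB
Mar 30, 2010 09:27:44AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED558D
Mar 30, 2010 09:27:41AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x29ED550A
Jan 02, 2010 10:03:49AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x2E5FD932
Jan 02, 2010 10:03:45AM (0x04:0x0023): Sector repair completed: port=7, LBA=0x2E5FD927
"""

events = parse_repairs(LOG)
print(repairs_per_port(events))                      # Counter({7: 7})
print(len({when.date() for when, _p, _l in events}))  # 3
```

Seven repairs on one port spread over three separate days is the kind of pattern that makes the tally worth a glance.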

Saturday, July 3, 2010