Thursday, July 15, 2010

Replacing a degraded drive in a 3Ware Array Unit, and that crazy "u?" unit

Well, THAT was annoying... Last night I tried to replace a degraded drive in my array, and it was not as simple as I had previously observed. Here's what the array looked like before replacing the drive:

# tw_cli /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 DEGRADED - - 64K 2793.94 ON ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p4 OK u0 931.51 GB SATA 4 - WDC WD10EACS-00D6B0
p5 OK u0 931.51 GB SATA 5 - WDC WD10EADS-00L5B1
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0
p7 DEGRADED u0 931.51 GB SATA 7 - WDC WD10EADS-00M2B0

Note that the drive hadn't actually failed; it was only flagged as DEGRADED.
Next I pulled the drive and loaded the replacement. Here's what tw_cli showed then:

# tw_cli /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 DEGRADED - - 64K 2793.94 ON ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p4 OK u0 931.51 GB SATA 4 - WDC WD10EACS-00D6B0
p5 OK u0 931.51 GB SATA 5 - WDC WD10EADS-00L5B1
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0
p7 OK u? 931.51 GB SATA 7 - WDC WD10EVDS-63U8B0

What does "u?" mean in the Unit column? It means the controller is going to refuse to use that drive:

# tw_cli maint rebuild c6 u0 p7
The following drive(s) cannot be used [7].
Error: (CLI:144) Invalid drive(s) specified.


After hours of surfing the web I never found a solution, so I had to figure it out myself. Since the original drive hadn't outright failed, I decided to swap it BACK IN, forcefully remove it from the array, and only then replace it.

# tw_cli /c6/p7 remove
Removing /c6/p7 will take the disk offline.
Do you want to continue ? Y|N [N]: y
Removing port /c6/p7 ... Done.


# tw_cli /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 DEGRADED - - 64K 2793.94 ON ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p4 OK u0 931.51 GB SATA 4 - WDC WD10EACS-00D6B0
p5 OK u0 931.51 GB SATA 5 - WDC WD10EADS-00L5B1
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0


With the original drive cleanly removed from the array, I swapped in the new drive.

# tw_cli /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 DEGRADED - - 64K 2793.94 ON ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p4 OK u0 931.51 GB SATA 4 - WDC WD10EACS-00D6B0
p5 OK u0 931.51 GB SATA 5 - WDC WD10EADS-00L5B1
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0
p7 OK - 931.51 GB SATA 7 - WDC WD10EVDS-63U8B0


Ah, that's better: the new drive is no longer showing "u?" in the Unit column. It shows up unassigned ("-"), which means the controller will actually let me use it.

Next I forced the rebuild using the new drive:

# tw_cli maint rebuild c6 u0 p7
Sending rebuild start request to /c6/u0 on 1 disk(s) [7] ... Done.


# tw_cli /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 REBUILDING 0%(A) - 64K 2793.94 ON ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 931.51 GB SATA 0 - ST31000528AS
p4 OK u0 931.51 GB SATA 4 - WDC WD10EACS-00D6B0
p5 OK u0 931.51 GB SATA 5 - WDC WD10EADS-00L5B1
p6 OK u0 931.51 GB SATA 6 - WDC WD10EADS-00M2B0
p7 DEGRADED u0 931.51 GB SATA 7 - WDC WD10EVDS-63U8B0


Hooray. Rebuilding. A few hours later the rebuild completed.
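For future reference, here's the whole sequence that worked for me, condensed into one session. The controller (/c6), unit (u0), and port (p7) numbers are from my setup; substitute your own.

```shell
# 1. With the ORIGINAL (degraded-but-still-working) drive back in the
#    bay, forcefully take it offline so the controller forgets it:
tw_cli /c6/p7 remove

# 2. Physically swap in the replacement drive. It should now appear
#    with "-" in the Unit column instead of the unusable "u?":
tw_cli /c6 show

# 3. Kick off the rebuild onto the new drive:
tw_cli maint rebuild c6 u0 p7

# 4. Watch progress in the %RCmpl column:
tw_cli /c6 show
```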

2 comments:

ritchiem said...

I also got the u? last night. Most annoying as it was for a drive I didn't touch. I had planned to pull the DEGRADED p5 drive and add a new p0 as the RAID 6 was only using 7 drives, but on inserting the p0 drive p1 has gone to u? and the auto-rebuild kicked in as two drives had failed. So annoying as it looks as though it will take 5 days to rebuild!

/c2 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 REBUILDING 12%(A) - 256K 9313.17 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 DEGRADED u0 1.82 TB SATA 0 - SAMSUNG HD204UI
p1 OK u? 1.82 TB SATA 1 - WDC WD20EARS-00MVWB0
p2 OK u0 1.82 TB SATA 2 - WDC WD20EARS-00MVWB0
p3 OK u0 1.82 TB SATA 3 - WDC WD20EARS-00MVWB0
p4 OK u0 1.82 TB SATA 4 - SAMSUNG HD204UI
p5 DEGRADED u0 1.82 TB SATA 5 - WDC WD20EARS-00MVWB0
p6 OK u0 1.82 TB SATA 6 - WDC WD20EARS-00MVWB0
p7 OK u0 1.82 TB SATA 7 - WDC WD20EARS-00MVWB0

Scott Marlowe said...

OK I figured it out guys! For future reference, you do this:

p18 OK u? yada...


tw_cli /c0/p18 remove
tw_cli /c0 rescan

Now the drive should show back up but be part of a non-existent volume:

p18 OK u4 yada...

In my case I had u0 through u3 but u4 was "new".

tw_cli /c0/u4 del

And now I can use the drive.