RAID1 both drives faulty but ok - bay issue?

G’Day all,

I’ve got an odd failure on my TS-EC879 Pro: Raid group 2 is a RAID1 on bays 5 and 6. When I started the server up today, both drives showed as not present. Bay 8 held an unused (empty) drive, so I shut down and moved one of the presumed-faulty drives into bay 8, and there it shows up as a perfectly healthy drive: status “free”, nothing on it.

So I tried the recover option in the Disks/VJBOD section of the storage app, which should recognise the volume. But computer said no. This feature has worked in the past so this is most peculiar.

dmesg shows the system flagged the drives in bays 5 & 6 as faulty. With one of them moved to bay 8, it’s reported as “non-fresh” and gets “kicked from array” (huh?).

Here are some of the console bits in case they help; sda is the drive moved from bay 5 to bay 8:

# qcli_storage
Enclosure Port Sys_Name      Size      Type   RAID        RAID_Type    Pool TMeta  VolType      VolName 
NAS_HOST  1    /dev/sde      9.10 TB   data   /dev/md1    RAID 10,512  1    64 GB  flexible     DataVol1(!),M_Vol(X)
NAS_HOST  2    /dev/sdf      9.10 TB   data   /dev/md1    RAID 10,512  1    64 GB  flexible     DataVol1(!),M_Vol(X)
NAS_HOST  3    /dev/sdc      9.10 TB   data   /dev/md1    RAID 10,512  1    64 GB  flexible     DataVol1(!),M_Vol(X)
NAS_HOST  4    /dev/sdd      9.10 TB   data   /dev/md1    RAID 10,512  1    64 GB  flexible     DataVol1(!),M_Vol(X)
NAS_HOST  5    --(X)         --        --     /dev/md2(X) RAID 1       2(X) 64 GB  flexible     DataVol2(X)
NAS_HOST  6    --(X)         --        --     /dev/md2(X) RAID 1       2(X) 64 GB  flexible     DataVol2(X)
NAS_HOST  7    /dev/sdb      9.10 TB   data   /dev/md3    Single       288  --     Static       Backup1 
NAS_HOST  8    /dev/sda      2.73 TB   free   --          --           --   --     --           --      
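
For what it’s worth, I can also pull more detail on the md layer straight from the shell; these are read-only and just show which arrays the kernel has actually assembled and what state it thinks /dev/md2 is in (if it exists at all):

# cat /proc/mdstat
# mdadm --detail /dev/md2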

Not fresh enough, so it gets kicked from the array:

# dmesg | grep -i sda
[    7.149653] sd 0:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[    7.149654] sd 0:0:0:0: [sda] 4096-byte physical blocks
[    7.149715] sd 0:0:0:0: [sda] Write Protect is off
[    7.149716] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    7.149740] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    7.207004]  sda: sda1 sda2 sda3 sda4 sda5
[    7.207687] sd 0:0:0:0: [sda] Attached SCSI disk
[   19.983605] EXT3-fs (sda1): using internal journal
[   19.994565] EXT3-fs (sda1): recovery complete
[   19.998916] EXT3-fs (sda1): mounted filesystem with ordered data mode
[   20.159368] EXT3-fs (sda1): using internal journal
[   20.159369] EXT3-fs (sda1): mounted filesystem with writeback data mode
[   21.299837] md: bind<sda1>
[   21.316676] md: kicking non-fresh sda1 from array!
[   21.321458] md: unbind<sda1>
[   21.329062] md: export_rdev(sda1)
[   21.392245] md: bind<sda1>
[   21.455971]  disk 0, wo:1, o:1, dev:sda1
[   22.921469] md: bind<sda4>
[   22.935529] md: kicking non-fresh sda4 from array!
[   22.940314] md: unbind<sda4>
[   22.948019] md: export_rdev(sda4)
[   23.067335] md: bind<sda4>
[   23.170208]  disk 0, wo:1, o:1, dev:sda4
[   41.409828]  disk 0, wo:0, o:1, dev:sda1
[   42.268572]  disk 0, wo:0, o:1, dev:sda4
[   45.786533] md: bind<sda2>
[   46.573174] md: bind<sda5>
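
As I understand it, “non-fresh” means the event counter in that member’s md superblock has fallen behind the rest of the array, so md refuses to use it. Comparing the superblocks should confirm that; something like the below, read-only, and assuming the usual QTS layout where partitions 1/2/4/5 belong to the small system arrays and partition 3 carries the data RAID:

# mdadm --examine /dev/sda1 | grep -i -E 'event|state'
# mdadm --examine /dev/sda3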

sdg is the “broken” drive in bay 6 (but also OK in bay 8):

# dmesg | grep -i sdg
[   10.000850] sd 10:0:0:0: [sdg] 1007616 512-byte logical blocks: (515 MB/492 MiB)
[   10.004473] sd 10:0:0:0: [sdg] Write Protect is off
[   10.004474] sd 10:0:0:0: [sdg] Mode Sense: 23 00 00 00
[   10.008097] sd 10:0:0:0: [sdg] No Caching mode page found
[   10.008098] sd 10:0:0:0: [sdg] Assuming drive cache: write through
[   10.009847] sd 10:0:0:0: [sdg] ATA PASSTHROUGH(16) ATA Identify failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   10.020222] sd 10:0:0:0: [sdg] ATA PASSTHROUGH(16) ATA Identify failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   10.023348]  sdg: sdg1 sdg2 sdg3 sdg4 < sdg5 sdg6 >
[   10.034721] sd 10:0:0:0: [sdg] ATA PASSTHROUGH(16) ATA Identify failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   10.034722] sd 10:0:0:0: [sdg] Attached SCSI removable disk
[   10.753327] FAT-fs (sdg1): bogus number of reserved sectors
[   10.758904] FAT-fs (sdg1): Can't find a valid FAT filesystem
[   76.680062] FAT-fs (sdg1): bogus number of reserved sectors
[   76.685639] FAT-fs (sdg1): Can't find a valid FAT filesystem
[   76.712186] FAT-fs (sdg1): bogus number of reserved sectors
[   76.717761] FAT-fs (sdg1): Can't find a valid FAT filesystem
[   87.624331] FAT-fs (sdg1): bogus number of reserved sectors
[   87.629899] FAT-fs (sdg1): Can't find a valid FAT filesystem
[   87.653830] FAT-fs (sdg1): bogus number of reserved sectors
[   87.659399] FAT-fs (sdg1): Can't find a valid FAT filesystem

Any input appreciated!

What about testing the drives externally?

Thanks, I haven’t tried that yet, but I have tried one other thing: plugged a known-good drive into one of the suspect bays and… not recognised.
Going through the drive details again, another oddity is that sdh does not exist. It’s an 8-bay unit, so we should have devices sda through sdh.
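
A quick read-only way to see what the kernel has actually detected, and which SCSI host each sd device hangs off, is sysfs (plus /proc/scsi/scsi, assuming that’s enabled in the QNAP kernel, which it normally is):

# ls -l /sys/block/sd*/device
# cat /proc/scsi/scsi

If nothing at all turns up on the hosts behind bays 5 and 6, that would point at the ports/backplane rather than the drives.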

To summarise

  • I have two drives making up a RAID1 Raid group / storage pool and one good spare drive (just empty, not configured as a spare as such)
  • Any drive placed in bay 5 or 6 reports as non-existent in QNAP Storage Manager
  • One of them shows up as /dev/sdg, the other doesn’t exist at all; there are only 7 sd* devices in the system
  • Any of the three drives is fine when placed in bay 8

So it seems to suggest a backplane issue with these two bays.

But the big problem remains: why will it not recognise my storage pool in bay 8? It sees a RAID volume on the drive but decides it’s not worthy, and the recovery scan doesn’t find it either.
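
In case anyone’s curious, this is roughly what I plan to poke at from the shell next, read-only first, to see whether the RAID1 member and the pool on top of it are still intact. The specifics are my assumptions: that partition 3 is the data member of md2 and that the flexible pool is LVM sitting on top of the md device, and I’d only attempt the assemble once I’m certain which device is which:

# mdadm --examine /dev/sda3                   (is the md2 superblock still on the data partition?)
# mdadm --assemble --run /dev/md2 /dev/sda3   (try to start the RAID1 degraded, with its one member)
# pvscan && vgscan && lvscan                  (does LVM then see the pool and DataVol2?)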

sdg looks like it may be the DOM. It’s the only device in a QNAP NAS that uses a FAT filesystem, and the reported size (~512 MB) matches too.
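
That should be easy to confirm from the shell: /proc/partitions lists sizes in 1 KiB blocks, so the DOM would show up at roughly 500,000 blocks, and mount should show whether any of its partitions are in use as the firmware/config filesystems:

# cat /proc/partitions | grep sdg
# mount | grep sdg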

Interesting, the DOM theory makes sense. It means that two drive slots are totally pfft, gone.

It still doesn’t explain why one drive of a RAID1 plugged into another slot isn’t recognised; that’s concerning from a reliability perspective.
I’m going to try slotting the two drives into bays 7 and 8, but that will have to wait until the weekend.

Hi @pokrakam

Thank you for providing the logs. However, for a more complete analysis, we require the full system dump log.

Please open a support ticket via the Helpdesk application on your NAS. This will automatically include all the necessary system logs for our engineers to investigate further.

Thank you!

Thanks, I have opened a ticket.

FWIW, I have swapped both of the RAID1 volume’s drives into the good slots 7 and 8 and it’s still not recognising the volume. It’s behaving as if the drives were locked to their slots - it still treats bays 7 and 8 as holding the volumes of the drives that previously occupied them, only now with errors.
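
Next thing I’ll check while the ticket is open: each member’s md superblock records the Array UUID and that device’s role, so comparing what’s actually on the drives now sitting in bays 7 and 8 against what the NAS claims lives there should show whether it matches drives by superblock or purely by slot. Roughly this, read-only, again assuming partition 3 is the data member (the sdX names will be whatever the drives enumerate as now):

# mdadm --examine /dev/sda3 | grep -i -E 'uuid|role|raid devices'
# mdadm --examine /dev/sdb3 | grep -i -E 'uuid|role|raid devices'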