Recover RAID 5 data after creating a new array instead of re-using the existing one

Ok - something was bugging me about your issue, so I fired up a VM to dive into the behavior that should be expected. I'll get to what was bugging me in a minute; first let me say this:

Back up these drives before attempting anything!!

You may have already done damage beyond what the resync did; can you clarify what you meant when you said:

Per suggestions I did clean up the superblocks and re-created the array with --assume-clean option but with no luck at all.

If you ran mdadm --misc --zero-superblock, then you should be fine.

Anyway, scavenge up some new disks and grab exact current images of them before doing anything at all that might do any more writing to these disks.

dd if=/dev/sdd of=/path/to/store/sdd.img

That being said.. it looks like data stored on these things is shockingly resilient to wayward resyncs. Read on, there is hope, and this may be the day that I hit the answer length limit.


The Best Case Scenario

I threw together a VM to recreate your scenario. The drives are just 100 MB so I wouldn't be waiting forever on each resync, but this should be a pretty accurate representation otherwise.

Built the array as generically and default as possible - 512k chunks, left-symmetric layout, disks in letter order.. nothing special.

root@test:~# mdadm --create /dev/md0 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

So far, so good; let's make a filesystem, and put some data on it.

root@test:~# mkfs.ext4 /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=512 blocks, Stripe width=1024 blocks
51000 inodes, 203776 blocks
10188 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
25 block groups
8192 blocks per group, 8192 fragments per group
2040 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 30 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
root@test:~# mkdir /mnt/raid5
root@test:~# mount /dev/md0 /mnt/raid5
root@test:~# echo "data" > /mnt/raid5/datafile
root@test:~# dd if=/dev/urandom of=/mnt/raid5/randomdata count=10000
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 0.706526 s, 7.2 MB/s
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Ok. We've got a filesystem and some data ("data" in datafile, and 5MB worth of random data with that SHA1 hash in randomdata) on it; let's see what happens when we do a re-create.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 21:07:06 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[2] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

The resync finished very quickly with these tiny disks, but it did occur. So here's what was bugging me from earlier; your fdisk -l output. Having no partition table on the md device is not a problem at all, it's expected. Your filesystem resides directly on the fake block device with no partition table.

root@test:~# fdisk -l
...
Disk /dev/md1: 208 MB, 208666624 bytes
2 heads, 4 sectors/track, 50944 cylinders, total 407552 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Yeah, no partition table. But...

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks

Perfectly valid filesystem, after a resync. So that's good; let's check on our data files:

root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Solid - no data corruption at all! But this is with the exact same settings, so nothing was mapped differently between the two RAID groups. Let's drop this thing down before we try to break it.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1

Taking a Step Back

Before we try to break this, let's talk about why it's hard to break. RAID 5 works by using a parity block that protects an area the same size as the block on every other disk in the array. The parity isn't just on one specific disk, it's rotated around the disks evenly to better spread read load out across the disks in normal operation.

The XOR operation to calculate the parity looks like this:

DISK1  DISK2  DISK3  DISK4  PARITY
1      0      1      1    = 1
0      0      1      1    = 0
1      1      1      1    = 0

So, the parity is spread out among the disks.

DISK1  DISK2  DISK3  DISK4  DISK5
DATA   DATA   DATA   DATA   PARITY
PARITY DATA   DATA   DATA   DATA
DATA   PARITY DATA   DATA   DATA

A resync is typically done when replacing a dead or missing disk; it's also done on mdadm create to ensure that the data on the disks aligns with what the RAID's geometry is supposed to look like. In that case, the last disk in the array spec is the one that is 'synced to' - all of the existing data on the other disks is used for the sync.

So, all of the data on the 'new' disk is wiped out and rebuilt; either building fresh data blocks out of parity blocks for what should have been there, or else building fresh parity blocks.

What's cool is that the procedure for both of those things is the exact same: an XOR operation across the data from the rest of the disks. The resync process in this case may have in its layout that a certain block should be a parity block, and think it's building a new parity block, when in fact it's re-creating an old data block. So even if it thinks it's building this:

DISK1  DISK2  DISK3  DISK4  DISK5
PARITY DATA   DATA   DATA   DATA
DATA   PARITY DATA   DATA   DATA
DATA   DATA   PARITY DATA   DATA

...it may just be rebuilding DISK5 from the layout above.

So, it's possible for data to stay consistent even if the array's built wrong.
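
To make the XOR argument concrete, here is a minimal Python sketch (not part of the test VM) that checks the three rows of the parity table above and then rebuilds every member of a stripe from the others. The names parity, rows and rebuilt are just illustrative.

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """XOR all blocks together; works for single bits or whole integers."""
    return reduce(xor, blocks)

# The three rows from the XOR table above: four data bits and their parity.
rows = [
    ([1, 0, 1, 1], 1),
    ([0, 0, 1, 1], 0),
    ([1, 1, 1, 1], 0),
]

for data, expected in rows:
    assert parity(data) == expected

# Rebuild a "lost" member: XOR of every surviving block, parity included,
# returns the missing block, whether that block held data or parity.
data, p = rows[0]
stripe = data + [p]
rebuilt = [parity(stripe[:i] + stripe[i + 1:]) for i in range(len(stripe))]
print(rebuilt == stripe)
```

The same call rebuilds a data block or a parity block, which is why a resync that believes it is writing parity can end up writing back the original data.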


Throwing a Monkey in the Works

(not a wrench; the whole monkey)

Test 1

Let's make the array in the wrong order! sdc, then sdd, then sdb..

root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdc1 /dev/sdd1 /dev/sdb1
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:06:34 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdb1[3] sdd1[1] sdc1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

Ok, that's all well and good. Do we have a filesystem?

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Nope! Why is that? Because while the data's all there, it's in the wrong order; what was once 512KB of A, then 512KB of B, A, B, and so forth, has now been shuffled to B, A, B, A. The disk now looks like gibberish to the filesystem checker, so it won't run. The output of mdadm --misc -D /dev/md1 gives us more detail; it looks like this:

Number   Major   Minor   RaidDevice State
   0       8       33        0      active sync   /dev/sdc1
   1       8       49        1      active sync   /dev/sdd1
   3       8       17        2      active sync   /dev/sdb1

When it should look like this:

Number   Major   Minor   RaidDevice State
   0       8       17        0      active sync   /dev/sdb1
   1       8       33        1      active sync   /dev/sdc1
   3       8       49        2      active sync   /dev/sdd1

So, that's all well and good. We overwrote a whole bunch of data blocks with new parity blocks this time out. Re-create, with the right order now:

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:11:08 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks

Neat, there's still a filesystem there! Still got data?

root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success!

Test 2

Ok, let's change the chunk size and see if that gets us some brokenness.

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:19 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Yeah, yeah, it's hosed when set up like this. But, can we recover?

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:21:51 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success, again!

Test 3

This is the one that I thought would kill data for sure - let's do a different layout algorithm!

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --layout=right-asymmetric --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:32:34 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 1 [3/3] [UUU]

unused devices: <none>
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
Superblock has an invalid journal (inode 8).

Scary and bad - it thinks it found something and wants to do some fixing! Ctrl+C!

Clear<y>? cancelled!

fsck.ext4: Illegal inode number while checking ext3 journal for /dev/md1

Ok, crisis averted. Let's see if the data's still intact after resyncing with the wrong layout:

root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=3 ctime=Sat Jan  7 23:33:02 2012
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Success!

Test 4

Let's also just prove that that superblock zeroing isn't harmful real quick:

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 12/51000 files, 12085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Yeah, no big deal.

Test 5

Let's just throw everything we've got at it. All 4 previous tests, combined.

  • Wrong device order
  • Wrong chunk size
  • Wrong layout algorithm
  • Zeroed superblocks (we'll do this between both creations)

Onward!

root@test:~# umount /mnt/raid5
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 --layout=right-symmetric /dev/sdc1 /dev/sdd1 /dev/sdb1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdb1[3] sdd1[1] sdc1[0]
      204672 blocks super 1.2 level 5, 64k chunk, algorithm 3 [3/3] [UUU]

unused devices: <none>
root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
root@test:~# mdadm --stop /dev/md1
mdadm: stopped /dev/md1

The verdict?

root@test:~# mdadm --misc --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1
root@test:~# mdadm --create /dev/md1 --chunk=512 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
root@test:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdd1[3] sdc1[1] sdb1[0]
      203776 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>

root@test:~# fsck.ext4 /dev/md1
e2fsck 1.41.14 (22-Dec-2010)
/dev/md1: clean, 13/51000 files, 17085/203776 blocks
root@test:~# mount /dev/md1 /mnt/raid5/
root@test:~# cat /mnt/raid5/datafile
data
root@test:~# sha1sum /mnt/raid5/randomdata
847685a5d42524e5b1d5484452a649e854b59064  /mnt/raid5/randomdata

Wow.

So, it looks like none of these actions corrupted data in any way. I was quite surprised by this result, frankly; I expected moderate odds of data loss on the chunk size change, and some definite loss on the layout change. I learned something today.


So .. How do I get my data??

Any information you have about the old system will be extremely helpful: the filesystem type, and any old copies of your /proc/mdstat with information on drive order, algorithm, chunk size, and metadata version. Do you have mdadm's email alerts set up? If so, find an old one; if not, check /var/spool/mail/root. Check your ~/.bash_history to see if your original build is in there.

So, the list of things that you should do:

  1. Back up the disks with dd before doing anything!!
  2. Try to fsck the current, active md - you may have just happened to build in the same order as before. If you know the filesystem type, that's helpful; use that specific fsck tool. If any of the tools offer to fix anything, don't let them unless you're sure that they've actually found the valid filesystem! If an fsck offers to fix something for you, don't hesitate to leave a comment to ask whether it's actually helping or just about to nuke data.
  3. Try building the array with different parameters. If you have an old /proc/mdstat, then you can just mimic what it shows; if not, then you're kinda in the dark - trying all of the different drive orders is reasonable, but checking every possible chunk size with every possible order is futile. For each, fsck it to see if you get anything promising.
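
For the drive-order part of step 3, a small Python sketch like the following lists every ordering to try. It only prints the orders and runs nothing against the disks; the device names are the ones from my test VM, so substitute your own.

```python
from itertools import permutations

def candidate_orders(devices):
    """Every possible ordering of the member devices."""
    return list(permutations(devices))

# Device names from the test VM above; yours will differ.
devices = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
orders = candidate_orders(devices)
print(len(orders))
for order in orders:
    print(" ".join(order))
```

Three disks give 6 orders, which is quick to work through by hand. The count grows factorially (8 disks give 40320), which is why multiplying it by every possible chunk size gets out of hand.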

So, that's that. Sorry for the novel, feel free to leave a comment if you have any questions, and good luck!

footnote: under 22 thousand characters; 8k+ shy of the length limit


I had a similar problem:
after a failure of a software RAID5 array I fired mdadm --create without giving it --assume-clean, and could not mount the array anymore. After two weeks of digging I finally restored all data. I hope the procedure below will save someone's time.

Long Story Short

The problem was caused by the fact that mdadm --create made a new array that was different from the original in two aspects:

  • different order of partitions
  • different RAID data offset

As it's been shown in the brilliant answer by Shane Madden, mdadm --create does not destroy the data in most cases! After finding the partition order and data offset I could restore the array and extract all data from it.

Prerequisites

I had no backups of RAID superblocks, so all I knew was that it was a RAID5 array on 8 partitions created during installation of Xubuntu 12.04.0. It had an ext4 filesystem. Another important piece of knowledge was a copy of a file that was also stored on the RAID array.

Tools

Xubuntu 12.04.1 live CD was used to do all the work. Depending on your situation, you might need some of the following tools:

a version of mdadm that allows specifying the data offset

sudo apt-get install binutils-dev git
git clone -b data_offset git://neil.brown.name/mdadm
cd mdadm
make

bgrep - searching for binary data

curl -L 'https://github.com/tmbinc/bgrep/raw/master/bgrep.c' | gcc -O2 -x c -o bgrep -

hexdump, e2fsck, mount and a hexadecimal calculator - standard tools from repos

Start with Full Backup

Naming of device files, e.g. /dev/sda2 /dev/sdb2 etc., is not persistent, so it's better to write down your drives' serial numbers given by

sudo hdparm -I /dev/sda

Then hook up an external HDD and back up every partition of your RAID array like this:

sudo dd if=/dev/sda2 bs=4M | gzip > serial-number.gz
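
Before relying on a compressed image, it may be worth confirming that it decompresses to exactly what is on the partition. This is a minimal Python sketch of that check, not something I ran at the time; sha1_of_stream and backup_matches are illustrative names, and reading a raw partition needs root.

```python
import gzip
import hashlib

def sha1_of_stream(stream, block=4 * 1024 * 1024):
    """Hash a file-like object in 4 MiB blocks to keep memory use flat."""
    digest = hashlib.sha1()
    for piece in iter(lambda: stream.read(block), b""):
        digest.update(piece)
    return digest.hexdigest()

def backup_matches(source_path, gz_path):
    """True when the decompressed image hashes the same as the source."""
    with open(source_path, "rb") as src, gzip.open(gz_path, "rb") as img:
        return sha1_of_stream(src) == sha1_of_stream(img)
```

For example, backup_matches("/dev/sda2", "serial-number.gz") should return True if the backup is complete and nothing has written to the partition since.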

Determine Original RAID5 Layout

Various layouts are described here: http://www.accs.com/p_and_p/RAID/LinuxRAID.html
To find how strips of data were organized on the original array, you need a copy of a random-looking file that you know was stored on the array. The default chunk size currently used by mdadm is 512KB. For an array of N partitions, you need a file of size at least (N+1)*512KB. A jpeg or video is good as it provides relatively unique substrings of binary data. Suppose our file is called picture.jpg. We read 32 bytes of data at N+1 positions starting from 100k and incrementing by 512k:

hexdump -n32 -s100k -v -e '/1 "%02X"' picture.jpg ; echo
DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2
hexdump -n32 -s612k -v -e '/1 "%02X"' picture.jpg ; echo
AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7
hexdump -n32 -s1124k -v -e '/1 "%02X"' picture.jpg ; echo
BC31A8DC791ACDA4FA3E9D3406D5639619576AEE2E08C03C9EF5E23F0A7C5CBA
...
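
The same sampling can be done in one go with a few lines of Python. This is only a sketch equivalent to the hexdump calls above, assuming 512KB chunks and a 100k starting point; sample_offsets and read_signatures are illustrative names.

```python
KIB = 1024

def sample_offsets(n_partitions, chunk=512 * KIB, start=100 * KIB):
    """Byte offsets of the N+1 probes: start, start+chunk, ..."""
    return [start + i * chunk for i in range(n_partitions + 1)]

def read_signatures(path, offsets, length=32):
    """Read `length` bytes at each offset and return uppercase hex strings."""
    signatures = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            signatures.append(f.read(length).hex().upper())
    return signatures

offsets = sample_offsets(8)
print([off // KIB for off in offsets])
```

Calling read_signatures("picture.jpg", offsets) would then return the nine bytestrings to feed to bgrep.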

We then search for occurrences of all of these bytestrings on all of our raw partitions, so in total (N+1)*N commands, like this:

sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sda2
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sdb2
...
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/sdh2
/dev/sdh2: 52a7ff000
sudo ./bgrep AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7 /dev/sda2
sudo ./bgrep AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7 /dev/sdb2
/dev/sdb2: 52a87f000
...

These commands can be run in parallel for different disks. Scan of a 38GB partition took around 12 minutes. In my case, every 32-byte string was found only once among all eight drives. By comparing offsets returned by bgrep you obtain a picture like this:

| offset \ partition | b | d | c | e | f | g | a | h |
|--------------------+---+---+---+---+---+---+---+---|
| 52a7ff000          | P |   |   |   |   |   |   | 1 |
| 52a87f000          | 2 | 3 | 4 | 5 | 6 | 7 | 8 | P |
| 52a8ff000          |   |   |   |   |   |   | P | 9 |

We see a normal left-symmetric layout, which is the default for mdadm. More importantly, now we know the order of partitions. However, we don't know which partition is the first in the array, as they can be cyclically shifted.

Note also the distance between found offsets. In my case it was 512KB. The chunk size can actually be smaller than this distance, in which case the actual layout will be different.
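
To see why the table reads as left-symmetric, here is a small Python sketch of the rule as I understand it: parity starts on the last member and moves one member to the left each stripe, and data starts on the member right after parity. The stripe numbers 7, 8 and 9 are arbitrary picks that reproduce the three rows; the real stripe numbers are unknown at this point.

```python
def left_symmetric_stripe(stripe, members):
    """Return (parity_member, data_members_in_order) for one stripe.

    Parity sits on the last member for stripe 0 and moves one member to
    the left per stripe; data starts right after parity and wraps around.
    """
    n = len(members)
    parity_idx = (n - 1) - (stripe % n)
    data = [members[(parity_idx + 1 + k) % n] for k in range(n - 1)]
    return members[parity_idx], data

order = list("bdcefgah")  # partition order read off the table above
for s in (7, 8, 9):
    print(s, left_symmetric_stripe(s, order))
```

The middle stripe puts parity on h with data running b, d, c, e, f, g, a, and the next one puts parity on a with data starting on h, matching rows two and three of the table.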

Find Original Chunk Size

We use the same file picture.jpg to read 32 bytes of data at different intervals from each other. We know from above that the data at offset 100k is lying on /dev/sdh2, at offset 612k is at /dev/sdb2, and at 1124k is at /dev/sdd2. This shows that the chunk size is not bigger than 512KB. We verify that it is not smaller than 512KB. For this we dump the bytestring at offset 356k and look on which partition it sits:

hexdump -n32 -s356k -v -e '/1 "%02X"' picture.jpg ; echo
7EC528AD0A8D3E485AE450F88E56D6AEB948FED7E679B04091B031705B6AFA7A
sudo ./bgrep 7EC528AD0A8D3E485AE450F88E56D6AEB948FED7E679B04091B031705B6AFA7A /dev/sdb2
/dev/sdb2: 52a83f000

It is on the same partition as offset 612k, which indicates that the chunk size is not 256KB. We eliminate smaller chunk sizes in a similar fashion. I ended up with 512KB chunks being the only possibility.
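
The check can be written down as a comparison of distances. This is a sketch only, using the two bgrep hits on /dev/sdb2: if two probes are as far apart on the partition as they are in the file, they sit in one uninterrupted run, so no chunk boundary falls between them. The function name same_chunk_run is illustrative.

```python
KIB = 1024

def same_chunk_run(file_off_a, disk_off_a, file_off_b, disk_off_b):
    """True when two hits on one member are as far apart as in the file."""
    return (file_off_b - file_off_a) == (disk_off_b - disk_off_a)

# Probes at 356k and 612k of the file, both found on /dev/sdb2.
print(same_chunk_run(356 * KIB, 0x52a83f000, 612 * KIB, 0x52a87f000))
```

Both distances are 256KB here, which is consistent with a chunk larger than 256KB.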

Find First Partition in Layout

Now we know the order of partitions, but we don't know which partition should be the first, and which RAID data offset was used. To find these two unknowns, we will create a RAID5 array with correct chunk layout and a small data offset, and search for the start of our file system in this new array.

To begin with, we create an array with the correct order of partitions, which we found earlier:

sudo mdadm --stop /dev/md126
sudo mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdb2 /dev/sdd2 /dev/sdc2 /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sda2 /dev/sdh2

We verify that the order is obeyed by issuing

sudo mdadm --misc -D /dev/md126
...
Number   Major   Minor   RaidDevice State
   0       8       18        0      active sync   /dev/sdb2
   1       8       50        1      active sync   /dev/sdd2
   2       8       34        2      active sync   /dev/sdc2
   3       8       66        3      active sync   /dev/sde2
   4       8       82        4      active sync   /dev/sdf2
   5       8       98        5      active sync   /dev/sdg2
   6       8        2        6      active sync   /dev/sda2
   7       8      114        7      active sync   /dev/sdh2

Now we determine offsets of the N+1 known bytestrings in the RAID array. I ran a script overnight (the Live CD doesn't ask for a password on sudo :):

#!/bin/bash
echo "1st:"
sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/md126
echo "2nd:"
sudo ./bgrep AB9DDDBBB05CA915EE2289E59A116B02A26C82C8A8033DD8FA6D06A84B6501B7 /dev/md126
echo "3rd:"
sudo ./bgrep BC31A8DC791ACDA4FA3E9D3406D5639619576AEE2E08C03C9EF5E23F0A7C5CBA /dev/md126
...
echo "9th:"
sudo ./bgrep 99B5A96F21BB74D4A630C519B463954EC096E062B0F5E325FE8D731C6D1B4D37 /dev/md126

Output with comments:

1st:
/dev/md126: 2428fff000 # 1st
2nd:
/dev/md126: 242947f000 # 480000 after 1st
3rd:                   # 3rd not found
4th:
/dev/md126: 242917f000 # 180000 after 1st
5th:
/dev/md126: 24291ff000 # 200000 after 1st
6th:
/dev/md126: 242927f000 # 280000 after 1st
7th:
/dev/md126: 24292ff000 # 300000 after 1st
8th:
/dev/md126: 242937f000 # 380000 after 1st
9th:
/dev/md126: 24297ff000 # 800000 after 1st
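
The distances in the comments are simple hex subtraction. A Python sketch that reproduces them from the bgrep hits (the dictionary is typed in by hand from the output above, with None for the string that was not found):

```python
def relative_offsets(found, first="1st"):
    """Map each label to its distance from the first hit; None if not found."""
    base = found[first]
    return {label: (None if off is None else off - base)
            for label, off in found.items()}

found = {
    "1st": 0x2428fff000, "2nd": 0x242947f000, "3rd": None,
    "4th": 0x242917f000, "5th": 0x24291ff000, "6th": 0x242927f000,
    "7th": 0x24292ff000, "8th": 0x242937f000, "9th": 0x24297ff000,
}
for label, delta in relative_offsets(found).items():
    print(label, "not found" if delta is None else hex(delta))
```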

Based on this data we see that the 3rd string was not found. This means that the chunk at /dev/sdd2 is used for parity. Here is an illustration of the parity positions in the new array:

| offset \ partition | b | d | c | e | f | g | a | h |
|--------------------+---+---+---+---+---+---+---+---|
| 52a7ff000          |   |   | P |   |   |   |   | 1 |
| 52a87f000          | 2 | P | 4 | 5 | 6 | 7 | 8 |   |
| 52a8ff000          | P |   |   |   |   |   |   | 9 |

Our aim is to deduce which partition to start the array from, in order to shift the parity chunks into the right place. Since parity should be shifted two chunks to the left, the partition sequence should be shifted two steps to the right. Thus the correct layout for this data offset is ahbdcefg:

sudo mdadm --stop /dev/md126
sudo mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sda2 /dev/sdh2 /dev/sdb2 /dev/sdd2 /dev/sdc2 /dev/sde2 /dev/sdf2 /dev/sdg2 
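
The shift itself is easy to get wrong by hand, so here is a tiny Python sketch of it. rotate_right is an illustrative helper, and the letters stand for /dev/sdX2.

```python
def rotate_right(seq, steps):
    """Cyclically shift a sequence to the right by `steps` positions."""
    steps %= len(seq)
    return seq[-steps:] + seq[:-steps] if steps else seq

trial = "bdcefgah"             # order used for the trial array
fixed = rotate_right(trial, 2)
print(fixed)
# h now occupies the slot where the trial array put parity (on d).
print(fixed.index("h") == trial.index("d"))
```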

At this point our RAID array contains data in the right form. You might be lucky so that the RAID data offset is the same as it was in the original array, and then you will most likely be able to mount the partition. Unfortunately this was not my case.

Verify Data Consistency

We verify that the data is consistent over a strip of chunks by extracting a copy of picture.jpg from the array. For this we locate the offset for the 32-byte string at 100k:

sudo ./bgrep DA1DC4D616B1C71079624CDC36E3D40E7B1CFF00857C663687B6C4464D6C77D2 /dev/md126

We then subtract 100*1024 from the result and use the obtained decimal value in the skip= parameter for dd. The count= is the size of picture.jpg in bytes:

sudo dd if=/dev/md126 of=./extract.jpg bs=1 skip=155311300608 count=4536208

Check that extract.jpg is the same as picture.jpg.
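
The hex-to-decimal step can be sketched in Python as follows. Note that the input 242947f000 is not quoted from my bgrep output; it is back-computed from the skip= value in the dd command above, so treat it as an assumption. dd_skip is an illustrative name.

```python
def dd_skip(bgrep_hex, file_offset=100 * 1024):
    """Decimal byte offset of the file start, for dd's skip= with bs=1."""
    return int(bgrep_hex, 16) - file_offset

# Assumed bgrep result, back-computed from the skip= value used above.
print(dd_skip("242947f000"))
```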

Find RAID Data Offset

A sidenote: default data offset for mdadm version 3.2.3 is 2048 sectors. But this value has been changed over time. If the original array used a smaller data offset than your current mdadm, then mdadm --create without --assume-clean can overwrite the beginning of the file system.

In the previous section we created a RAID array. Verify which RAID data offset it had by issuing for some of the individual partitions:

sudo mdadm --examine /dev/sdb2
...
    Data Offset : 2048 sectors
...

2048 512-byte sectors is 1MB. Since chunk size is 512KB, the current data offset is two chunks.

If at this point you have a two-chunk offset, it is probably small enough, and you can skip this paragraph.
We create a RAID5 array with the data offset of one 512KB-chunk. Starting one chunk earlier shifts the parity one step to the left, thus we compensate by shifting the partition sequence one step to the left. Hence for 512KB data offset, the correct layout is hbdcefga. We use a version of mdadm that supports data offset (see Tools section). It takes offset in kilobytes:

sudo mdadm --stop /dev/md126
sudo ./mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdh2:512 /dev/sdb2:512 /dev/sdd2:512 /dev/sdc2:512 /dev/sde2:512 /dev/sdf2:512 /dev/sdg2:512 /dev/sda2:512

Now we search for a valid ext4 superblock. The superblock structure can be found here: https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#The_Super_Block
We scan the beginning of the array for occurrences of the magic number s_magic followed by s_state and s_errors. The bytestrings to look for are:

53EF01000100
53EF00000100
53EF02000100
53EF01000200
53EF02000200

Example command:

sudo ./bgrep 53EF01000100 /dev/md126
/dev/md126: 0dc80438

The magic number starts 0x38 bytes into the superblock, so we subtract 0x38 to calculate the offset and examine the beginning of the superblock:

sudo hexdump -n84 -s0xDC80400 -v /dev/md126
dc80400 2000 00fe 1480 03f8 cdd3 0032 d2b2 0119
dc80410 ab16 00f7 0000 0000 0002 0000 0002 0000
dc80420 8000 0000 8000 0000 2000 0000 b363 51bd
dc80430 e406 5170 010d ffff ef53 0001 0001 0000
dc80440 3d3a 50af 0000 0000 0000 0000 0001 0000
dc80450 0000 0000                              

This seems to be a valid superblock. s_log_block_size field at 0x18 is 0002, meaning that the block size is 2^(10+2)=4096 bytes. s_blocks_count_lo at 0x4 is 03f81480 blocks which is 254GB. Looks good.
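
Reading these fields out of the hexdump by eye is error-prone because of the byte swapping, so here is a Python sketch that decodes the same 84 bytes. It covers only the fields mentioned here, with offsets taken from the ext4 superblock layout linked above; the function names are illustrative.

```python
import struct

# The hexdump output above: address column, then 16-bit little-endian words.
HEXDUMP = """\
dc80400 2000 00fe 1480 03f8 cdd3 0032 d2b2 0119
dc80410 ab16 00f7 0000 0000 0002 0000 0002 0000
dc80420 8000 0000 8000 0000 2000 0000 b363 51bd
dc80430 e406 5170 010d ffff ef53 0001 0001 0000
dc80440 3d3a 50af 0000 0000 0000 0000 0001 0000
dc80450 0000 0000
"""

def hexdump_to_bytes(text):
    """Undo hexdump's default word display: each word is two swapped bytes."""
    out = bytearray()
    for line in text.splitlines():
        for word in line.split()[1:]:      # skip the address column
            out += struct.pack("<H", int(word, 16))
    return bytes(out)

def decode_superblock(raw):
    """Pull out the handful of ext4 superblock fields discussed here."""
    inodes, blocks_lo = struct.unpack_from("<II", raw, 0x00)
    (log_block_size,) = struct.unpack_from("<I", raw, 0x18)
    (blocks_per_group,) = struct.unpack_from("<I", raw, 0x20)
    magic, state, errors = struct.unpack_from("<HHH", raw, 0x38)
    return {
        "inodes": inodes,
        "blocks": blocks_lo,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "magic": magic,
        "state": state,
        "errors": errors,
    }

raw = hexdump_to_bytes(HEXDUMP)
sb = decode_superblock(raw)
print(hex(sb["magic"]), sb["block_size"], sb["blocks"])
print(round(sb["blocks"] * sb["block_size"] / 2**30), "GiB")
```

The first 12 bytes of raw, printed with raw[:12].hex(), give the byte-order-corrected string that bgrep needs to find superblock copies.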

We now scan for the occurrences of the first bytes of the superblock to find its copies. Note the byte flipping as compared to hexdump output:

sudo ./bgrep 0020fe008014f803d3cd3200 /dev/md126
/dev/md126: 0dc80400    # offset by 1024 bytes from the start of the FS        
/dev/md126: 15c80000    # 32768 blocks from FS start
/dev/md126: 25c80000    # 98304
/dev/md126: 35c80000    # 163840
/dev/md126: 45c80000    # 229376
/dev/md126: 55c80000    # 294912
/dev/md126: d5c80000    # 819200
/dev/md126: e5c80000    # 884736
/dev/md126: 195c80000
/dev/md126: 295c80000

This aligns perfectly with the expected positions of backup superblocks:

sudo mke2fs -n /dev/md126
...
Block size=4096 (log=2)
...
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872
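
The match can also be checked numerically. The Python sketch below assumes the usual sparse_super placement (backups in group 1 and in groups that are powers of 3, 5 and 7), 32768 blocks per group and a file system start of 0xdc80000 in the array; it is a cross-check, not a replacement for mke2fs -n.

```python
def backup_groups(n_groups):
    """Block groups holding superblock backups under sparse_super:
    group 1 plus powers of 3, 5 and 7 (group 0 holds the primary)."""
    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < n_groups:
            groups.add(g)
            g *= base
    return sorted(g for g in groups if g < n_groups)

BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
FS_START = 0xDC80000             # assumed file system start in /dev/md126
total_blocks = 66589824
n_groups = -(-total_blocks // BLOCKS_PER_GROUP)   # ceiling division

blocks = [g * BLOCKS_PER_GROUP for g in backup_groups(n_groups)]
print(blocks)
print([hex(FS_START + b * BLOCK_SIZE) for b in blocks[:4]])
```

The block numbers reproduce the mke2fs list, and adding each one (times 4096) to 0xdc80000 gives the offsets that bgrep reported.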

Hence the file system starts at the offset 0xdc80000, i.e. 225792KB from the start of the array. Since we have 8 partitions of which one is for parity, we divide the offset by 7. This gives 33030144 bytes offset on every partition, which is exactly 63 RAID chunks. And since the current RAID data offset is one chunk, we conclude that the original data offset was 64 chunks, or 32768KB. Shifting hbdcefga 63 times to the right gives the layout bdcefgah.
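
The arithmetic in this step, spelled out as a short Python sketch (variable names are illustrative; all numbers come from the text above):

```python
KIB = 1024
fs_start = 0xDC80000          # file system start inside /dev/md126, in bytes
data_members = 8 - 1          # eight partitions, one chunk per stripe is parity
chunk = 512 * KIB
current_offset_chunks = 1     # the trial array used a 512KB data offset

per_member = fs_start // data_members
chunks_in, remainder = divmod(per_member, chunk)
original_offset_kib = (chunks_in + current_offset_chunks) * chunk // KIB

steps = chunks_in % 8         # 63 right shifts of 8 members act like 7
layout = "hbdcefga"
layout = layout[-steps:] + layout[:-steps]
print(per_member, chunks_in, remainder, original_offset_kib, layout)
```

A zero remainder is a useful sanity check: it means the file system starts exactly on a chunk boundary.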

We finally build the correct RAID array:

sudo mdadm --stop /dev/md126
sudo ./mdadm --create /dev/md126 --assume-clean --raid-devices=8 --level=5  /dev/sdb2:32768 /dev/sdd2:32768 /dev/sdc2:32768 /dev/sde2:32768 /dev/sdf2:32768 /dev/sdg2:32768 /dev/sda2:32768 /dev/sdh2:32768
sudo fsck.ext4 -n /dev/md126
e2fsck 1.42 (29-Nov-2011)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md126: clean, 423146/16654336 files, 48120270/66589824 blocks
sudo mount -t ext4 -r /dev/md126 /home/xubuntu/mp

Voilà!


If you are lucky you might have some success with getting your files back with recovery software that can read a broken RAID-5 array. Zero Assumption Recovery is one I have had success with before.

However, I'm not sure whether creating the new array has already destroyed the data, so this might be a last-ditch effort.