Filesystem Corruption on vm3
jamesotron
06-09-2005, 09:18 PM
Hi there.
I've been a unixshell.com customer for only about two weeks. This morning when I awoke I found that the filesystem had remounted read-only (I'm going to take that out of fstab) and was sitting there ignoring incoming connections.
This is the second time I have had file system corruption with this VPS in just two weeks. I'm running Debian sarge with kernel 2.6.11.10-xenU.
Any help/suggestions anyone can give would be appreciated.
Thanks
James
matta
06-09-2005, 09:30 PM
You're not the first to report something similar... based on prior discussion on the forum it's not a hardware or Xen problem, but rather something that happens with plain old Linux servers also. It might be entirely due to errors=remount-ro, although I have no proof of that. I removed that option from fstab recently and have had no complaints/problems so far.
As far as the hardware, vm3 has been up for almost 25 days as of now so a host crash is not responsible and the hardware RAID is running fine as shown by their command line utilities.
//vm3> /c0 show
Unit  UnitType  Status  %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    OK      -      64K     698.655   ON     -        -
jamesotron
06-09-2005, 09:40 PM
[QUOTE=matta]You're not the first to report something similar... based on prior discussion on the forum it's not a hardware or Xen problem, but rather something that happens with plain old Linux servers also. It might be entirely due to errors=remount-ro, although I have no proof of that. I removed that option from fstab recently and have had no complaints/problems so far.[/QUOTE]
Not so sure that I buy that. ext2/3 is probably one of the most tested open source file systems in existence. Sure, you get occasional fs corruption on your average Linux box - but it's usually caused by either unexpected power outages or flaky consumer-grade IDE chipsets. I think I will try a couple of different kernel versions and see if I can isolate it - that said, I'm not compiling these suckers, so it's hard for me to know exactly what's in them.
[QUOTE=matta]As far as the hardware, vm3 has been up for almost 25 days as of now so a host crash is not responsible and the hardware RAID is running fine as shown by their command line utilities.[/QUOTE]
That's good to know. How does Xen do the disk partitioning - is it just a filesystem image like it is with UML, or do we get direct access to a particular disk partition? I think I'll run another snapshot right now, but it makes me nervous - the reason I switched to a VPS is that I wouldn't have to deal with cheap, flaky hardware :)
matta
06-09-2005, 10:15 PM
We use LVM and you get direct access to the swap and root partitions. I forget the thread name for the previous issue, but it comes down to a software problem that has been known to happen on both VM and non-VM platforms. Try removing "errors=remount-ro" and report back on whether the problem occurs in the future.
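For anyone wanting to try this, here is a minimal sketch of stripping the option. It assumes the usual whitespace-separated fstab layout with mount options in the fourth field, and it prints the edited file to stdout instead of touching /etc/fstab. The function name strip_remount_ro is made up for illustration and does not come from this thread.

```shell
#!/bin/sh
# Print an fstab with errors=remount-ro removed from the options field.
# Assumption: whitespace-separated fields, mount options in field 4.
# Lines that are edited get rebuilt with single spaces between fields.
strip_remount_ro() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { print; next }   # keep comments and short lines
        {
            n = split($4, opts, ",")
            out = ""
            for (i = 1; i <= n; i++) {
                if (opts[i] == "errors=remount-ro") continue
                out = (out == "" ? opts[i] : out "," opts[i])
            }
            if (out == "") out = "defaults"          # never leave the field empty
            if (out != $4) $4 = out                  # only rewrite changed lines
            print
        }
    ' "$1"
}

# Usage: write to a scratch file, review it, then copy it into place by hand.
#   strip_remount_ro /etc/fstab > /tmp/fstab.new
```

One caveat: ext2/ext3 also store a default error behaviour in the superblock (see tune2fs -e), so if that is set to remount-ro, dropping the fstab option alone may not change what the kernel does.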
jamesotron
06-09-2005, 10:18 PM
[QUOTE=matta]We use LVM and you get direct access to the swap and root partitions. I forget the thread name for the previous issue, but it comes down to a software problem that has been known to happen on both VM and non-VM platforms. Try removing "errors=remount-ro" and report back on whether the problem occurs in the future.[/QUOTE]
Thanks for that. Will do.
matta
06-09-2005, 10:29 PM
For cross reference here is the URL for the other thread regarding this issue: http://www.unixshell.com/forum/showthread.php?t=208.
As you can see, users reported it on vm1, vm2, and vm3, which were 3 of the 5 servers we had at that time, so a hardware issue is very unlikely when it happens on multiple hosts. I'm still trying to reproduce this, as it has yet to happen for me.
speedbird
06-10-2005, 08:47 AM
This apparently is killing my VM upon booting:
modprobe: FATAL: Could not load /lib/modules/2.6.11.10-xenU/modules.dep: No such file or directory
and then...
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
[FAILED]
Give root password for maintenance
(or type Control-D to continue):
From there I'm stuck :confused:
werpon
06-10-2005, 09:09 AM
[QUOTE=matta]For cross reference here is the URL for the other thread regarding this issue: http://www.unixshell.com/forum/showthread.php?t=208.
[/QUOTE]
I posted on that same thread that I was having a similar problem with a "real" server, not a VPS. I finally could trace it to a bad memory module. Probably not the same problem here, but it might help.
jamesotron
06-10-2005, 10:01 PM
[QUOTE=speedbird]This apparently is killing my VM upon booting:
modprobe: FATAL: Could not load /lib/modules/2.6.11.10-xenU/modules.dep: No such file or directory
and then...
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
[FAILED]
Give root password for maintenance
(or type Control-D to continue):
From there I'm stuck :confused:[/QUOTE]
You need to enter your root password and then do:
fsck -fy /dev/sda1
mount -o remount,rw /
reboot
and it should come up happy. This is exactly what's happening to me.
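As a rough aid for the steps above: fsck reports what it did through its exit status, which fsck(8) documents as a sum of bit values. The sketch below decodes that number into text. The function name describe_fsck_status is only illustrative, and the wording of each message is paraphrased from the man page.

```shell
#!/bin/sh
# Decode an fsck exit status, which is a sum of the bit values in fsck(8).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ] && echo "errors corrected"
    [ $((status & 2)) -ne 0 ] && echo "reboot recommended"
    [ $((status & 4)) -ne 0 ] && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ] && echo "operational error"
    [ $((status & 16)) -ne 0 ] && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ] && echo "canceled by user"
    [ $((status & 128)) -ne 0 ] && echo "shared library error"
    return 0
}

# Usage, from the maintenance shell:
#   fsck -fy /dev/sda1
#   describe_fsck_status $?
```

If "errors left uncorrected" still shows up after a run with -fy, a reboot is unlikely to come up clean and the filesystem probably needs another pass.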
aws
The filesystem error condition seems to be related to taking a snapshot.
Several hours after taking a snapshot, the server fs is read-only. This has occurred on my server every time I've used snapshot.
This time the fs was so badly damaged (the ext3 journal and superblock were toast) that I tried to restore from the full system snapshot I made. The restore went okay, but most of the directories in /var are completely missing.
I have disabled errors=remount-ro and am currently reconstructing /var by hand and restoring the MySQL database from (thankfully regular) exports.
Is it the case that we cannot rely on snapshot to capture /var? What are the restrictions on how this process works?
jamesotron
06-21-2005, 12:31 AM
[QUOTE=aws]The filesystem error condition seems to be related to taking a snapshot.
Several hours after taking a snapshot, the server fs is read-only. This has occurred on my server every time I've used snapshot.
This time the fs was so badly damaged (the ext3 journal and superblock were toast) that I tried to restore from the full system snapshot I made. The restore went okay, but most of the directories in /var are completely missing.
I have disabled errors=remount-ro and am currently reconstructing /var by hand and restoring the MySQL database from (thankfully regular) exports.
Is it the case that we cannot rely on snapshot to capture /var? What are the restrictions on how this process works?[/QUOTE]
Crap. I just started a snapshot :/
jamesotron
06-21-2005, 05:27 AM
I can confirm that it's caused by running a snapshot - I had to reboot my VPS because the disk was "write-protected":
undies:/var/log/mysql# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /dev/shm type tmpfs (rw)
undies:/var/log/mysql# touch file
touch: cannot touch `file': Read-only file system
undies:/var/log/mysql# mount -o remount,rw /
mount: block device /dev/sda1 is write-protected, mounting read-only
undies:/var/log/mysql# dmesg
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
ext3_abort called.
EXT3-fs error (device sda1): ext3_remount: Abort forced by user
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
EXT3-fs error (device sda1) in start_transaction: Journal has aborted
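In the output above, mount still lists / as rw even though writes fail. One likely explanation is that mount is reading /etc/mtab, which can go stale once the kernel has aborted the journal, whereas /proc/mounts reflects the kernel's own view. The sketch below checks that view and counts abort messages. Both function names are illustrative, and the field layout is assumed to be the standard /proc/mounts format.

```shell
#!/bin/sh
# Report "ro" or "rw" for a mount point using a /proc/mounts-style file.
# Assumption: each line is "device mountpoint fstype options ...".
# The last matching line wins, since / can be listed more than once.
mount_state() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp {
            state = "rw"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") state = "ro"
        }
        END { if (state != "") print state; else exit 1 }
    ' "$mounts_file"
}

# Count journal-abort messages in kernel log text read from stdin.
count_journal_aborts() {
    grep -c 'Journal has aborted' || true
}

# Usage:
#   mount_state /
#   dmesg | count_journal_aborts
```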
matta
06-21-2005, 05:45 AM
That's very interesting if that is the case. For a snapshot the partition is mounted read-only on the host, and that is accepted as perfectly fine under Linux. We had used the LVM snapshot feature in the past, but that caused system crashes due to LVM problems.
I don't see more of a solution here than to stop the VM before performing a snapshot and then start it back up after completion.
This must be a fairly specific problem though, as I've run dozens of snapshots with no problems... on a personal VM of mine I run a live snapshot every 2 days, with no problems so far.
matta
06-21-2005, 04:52 PM
I have added an option to the snapshot system to stop the VM before performing a snapshot (and will start it back up upon completion). It is checked by default.
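The ordering described here (stop, snapshot, start) can be sketched roughly as below. This is not the actual snapshot system, whose implementation is not shown in this thread. The three commands are passed in by the caller, and stop_vm, take_snapshot and start_vm in the usage line are hypothetical placeholders.

```shell
#!/bin/sh
# Run a snapshot with the VM stopped, restarting it even if the snapshot fails.
# Assumption: the caller supplies the three commands; nothing here is the
# real host tooling.
snapshot_offline() {
    stop_cmd=$1
    snap_cmd=$2
    start_cmd=$3

    "$stop_cmd" || return 1              # do not snapshot a VM that failed to stop
    snap_status=0
    "$snap_cmd" || snap_status=$?        # remember the result, but keep going
    "$start_cmd" || return 1             # always attempt the restart
    return "$snap_status"
}

# Usage (hypothetical command names):
#   snapshot_offline stop_vm take_snapshot start_vm
```

The point of capturing the snapshot status instead of returning early is that a failed snapshot should never leave the VM down.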