Odd Problem


cmtech
04-24-2005, 06:51 PM
OK, I was sitting around fiddling with apache and suddenly the filesystem became read only. I didn't even know that the mounted root filesystem could suddenly become read only! So... I rebooted. Error: This disk will be checked for errors... oh fsck [sic] :P. So... shouldn't be too bad... and off it ran... merrily wiping system libraries off the disk. What was that about? I haven't even tried loading modules into xen or anything on there. Any clues? I'm a bit nervous it will happen again, and who knows what it will delete next time...

matta
04-24-2005, 08:11 PM
Anything in dmesg or /var/log/messages regarding this? A server randomly re-mounting its root fs as read only is definitely not normal.. from what it sounds like it doesn't have anything to do with Xen specifically... more like a general problem that could occur on a physical server also.

werpon
04-25-2005, 10:57 AM
Incidentally, it happened to me two days ago with Debian testing while doing apt-get upgrade. Rebooted with a live cd, fsck'ed, got the FS hosed again, fsck'ed again, and lost half of /usr/lib/.

I'm going to reinstall this server, and I recommend you back up yours and reinstall. Just in case...

EDITED to add that this happened to me on a real server, not a VPS.

msh
04-25-2005, 12:48 PM
This has also happened to me on my VPS (debian testing). I have had a few problems with the filesystem on my server.

Maybe I should reinstall, but it has been some time since the last troubles so I am hoping for the best.

aws
04-26-2005, 12:31 AM
This happened to me also -- actually it was not my server (which is on vm2 and has had no problems) but a friend's server which is on vm1.

I don't think anything significant was happening at the time; we were installing a bunch of software via apt earlier and setting up apache/mysql. Stuff started acting weird, and when I logged in I could not edit files anymore because the filesystem was mounted readonly, so I remounted it rw, but things were still kind of haywire so we rebooted and fsck'd the partition via the console. That fixed the fs errors and then it came back normally.

In the logs I can see where the fs was repaired on Apr 19th at 1800 hours, and there is a 21-hour gap before that with no entries in syslog, not even --MARK--. Before that there are a bunch of messages from mysql complaining about corrupted tables, but no kernel messages. You would probably have to look at the console to get any kernel messages b/c the disk was readonly, but I never did that.

There is one section in messages about 8 hours prior to the fs going readonly:

Apr 18 14:38:03 www -- MARK --
Apr 18 14:58:03 www -- MARK --
Apr 18 15:18:04 www -- MARK --
Apr 18 15:21:50 www kernel: Pausing... 5^H^H^H^H^H^H^H^H^H^H^H^HPausing... 4^H^H^H^H^H^H^H^H^H^H^H^HPausing... 3^H^H^H^H^H^H^H^H^H^H^H^HPausing... 2^H^H^H^H^H^H^H^H^H^H^H^HPausing... 1^H^H^H^H^H^H^H^H^H^H^H^HContinuing...
Apr 18 15:21:50 www kernel:
Apr 18 15:38:04 www -- MARK --
Apr 18 15:58:04 www -- MARK --
Apr 18 16:18:06 www -- MARK --
Apr 18 16:38:06 www -- MARK --

That's the only unusual thing I can see... not sure if it's related or not.

Actually, after repairing the fs from the console, I was so impressed with the console feature that I decided to buy my own unixshell account.
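For anyone wanting to find a silent stretch like the 21-hour gap mentioned above, here is a minimal sketch that scans syslog-style lines for missing --MARK-- intervals. The function name `find_gaps`, the 30-minute threshold, and the `year` parameter are assumptions made for illustration (classic syslog stamps carry no year, and the MARK lines quoted above are 20 minutes apart). It also assumes an English locale for month names.

```python
from datetime import datetime, timedelta


def find_gaps(lines, max_gap=timedelta(minutes=30), year=2005):
    """Return (previous, current) timestamp pairs that are more than max_gap apart."""
    stamps = []
    for line in lines:
        # Classic syslog lines begin with a 15-character "Mon DD HH:MM:SS" stamp.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation lines and other unstamped text are skipped
        stamps.append(stamp)
    # Compare each timestamp with the one that follows it.
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]
```

Feeding it the lines of /var/log/messages should list each pair of entries with an unusually long silence between them, which narrows down when the guest stopped logging.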

aws
04-26-2005, 12:51 AM
debian's root mount point is configured by default with the errors=remount-ro option, which is why the partition gets remounted readonly -- because it's the defined behavior for handling fs corruption. so, it's not just "spontaneous".

/dev/sda1 on / type ext3 (rw,errors=remount-ro)

some other distros use the default action, usually "continue" (ignore). (you can see this via 'tune2fs -l'). that can be risky though, I prefer debian's setup.
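As a rough sketch of checking this from a script, the snippet below pulls the errors= value out of `mount` output in the format quoted above. The function name `error_policy` and the regular expression are illustrative assumptions, not part of any tool. A result of None only means the mount line does not list the option, in which case the superblock default reported by 'tune2fs -l' is what applies.

```python
import re

# Matches lines shaped like: /dev/sda1 on / type ext3 (rw,errors=remount-ro)
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \(([^)]*)\)")


def error_policy(mount_output, mount_point="/"):
    """Return the errors= value listed for mount_point, or None if it is not listed."""
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match and match.group(2) == mount_point:
            for option in match.group(4).split(","):
                if option.startswith("errors="):
                    return option.split("=", 1)[1]
    return None
```

With the line quoted above, `error_policy("/dev/sda1 on / type ext3 (rw,errors=remount-ro)")` returns `"remount-ro"`.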

matta
04-26-2005, 03:53 AM
Ahh.. yes the errors=remount-ro option will do that. I actually use a standard fstab across all the distro images and all include that as I consider it a useful feature for the reasons previously stated. Users are free to remove it if they prefer it the default way.

cmtech
04-26-2005, 07:50 PM
Why is the filesystem getting corrupted in the first place? The past three days in a row the VPS has locked me out, including from the Teknic control panel and the ssh console feature, while I'm doing nothing more than sitting in vi. Every time it stays inaccessible for a few hours and then is apparently rebooted and comes back online. It's done this a few days apart ever since I got it a month ago, but now it's doing it daily. Sometimes I lose files. Usually what happens is I can connect to all the ports on the system but it doesn't go any further than that. I really should have taken a packet dump of a connection being formed: it will establish connections to the ports (both services on my VPS, Teknic and ssh console) but the connections just hang and no application data is transferred. The exception was today, when I was getting the SSH versions through but was still not able to form connections. Teknic was still hanging completely. I didn't investigate too hard because by now I've come to realise if I go away and come back in a few hours it will all be working again (minus a few files and the uptime) :rolleyes:

There is nothing more in syslog or klogd than a "restart" line from the syslog daemon followed by the normal boot sequence and the file system having various file repair operations done to it on reboot. I assumed the read-only mounting of disks was an error related to these other issues I'm having. Perhaps it is related, albeit in a positive protective rather than negative context, if they are remounted this way as a measure against further filesystem corruption? Or are these all completely separate issues?

Anybody else having these issues or am I just unlucky?

Edit: I just checked and I'm on VM3.

msh
04-26-2005, 08:05 PM
[QUOTE=cmtech]
Anybody else having these issues or am I just unlucky?

Edit: I just checked and I'm on VM3.[/QUOTE]

I had them (2 or 3 times) but it's been a while since the last time. I am on VM1.

gilesmorant
04-26-2005, 10:04 PM
The same thing just happened to me on VM2; emerging Asterisk on Gentoo:

EXT3-fs warning (device sda1): empty_dir: bad directory (dir #658597) - no data block
EXT3-fs warning (device sda1): ext3_rmdir: empty directory has nlink!=2 (0)
EXT3-fs warning (device sda1): empty_dir: bad directory (dir #658909) - no data block
EXT3-fs warning (device sda1): ext3_rmdir: empty directory has nlink!=2 (0)
EXT3-fs warning (device sda1): ext3_rmdir: empty directory has nlink!=2 (0)
EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1241158
Remounting filesystem read-only
EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1241159
EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1241169
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem
EXT3-fs error (device sda1) in start_transaction: Readonly filesystem


:-(

EDIT: Was running on kernel 2.6.11 AFAIK. Uptime of circa 16 days... A good test of my backup strategy...
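To compare reports like the one above across machines, a small sketch such as the following can tally EXT3-fs messages by severity, device, and kernel function, and note whether the read-only remount was logged. The names `summarize_ext3` and `EXT3_LINE` are made up for illustration, and the pattern is based only on the line shapes quoted in this thread; other kernels may format these messages differently.

```python
import re
from collections import Counter

# Covers both "EXT3-fs error (device sda1): func: ..." and
# "EXT3-fs error (device sda1) in func: ..." as seen in the log above.
EXT3_LINE = re.compile(r"EXT3-fs (warning|error) \(device (\w+)\)(?: in |: )(\w+):")


def summarize_ext3(log_text):
    """Return (counts, remounted): message counts keyed by
    (severity, device, function), and whether a read-only remount was logged."""
    counts = Counter()
    remounted = False
    for line in log_text.splitlines():
        if "Remounting filesystem read-only" in line:
            remounted = True
        match = EXT3_LINE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts, remounted
```

Running it over saved dmesg output gives a compact summary that is easier to paste into a support request than the raw log.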

cmtech
04-27-2005, 08:49 AM
Seems like I'm not alone here by any means, so the problem likely is xen/host systems related and not to do with my vm. If this isn't xen related then what's it about? I have a system running xen and when I used sparse files for disk images I had quite a few problems with it involving reboots and disk corruption. It also managed to bring down the host system a few times (kernel crash backtraced to the filesystem). Stopping using sparse files as disk images instantly solved my troubles. Why? No idea. It worked and that was enough for me.

matta
04-27-2005, 02:49 PM
The problem with sparse files is that data is cached both in the VM and on the host. Imagine a scenario where in the VM all data is sync()'ed, but it is not yet actually written to the disks on the host due to the cache. If the host crashes in this scenario then there will be data loss / corruption. We don't use sparse files, we use LVM which gives you raw access to the logical partition. A sync() call in the VM should go directly to disk.
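For reference, this is roughly what an application inside the VM does when it wants its data flushed. It is a minimal sketch: the helper name `durable_write` is hypothetical, and fsync only asks the guest kernel to push the data to its block device. Whether that reaches the physical disk depends on how the host backs that device, which is the point made above.

```python
import os


def durable_write(path, data):
    """Write bytes to path and ask the kernel to flush them to the block device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file's data out
    # Note: this does not fsync the containing directory, so the directory
    # entry for a newly created file may still be pending after a crash.
```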

I'll see if I can reproduce the problem, but a quick search shows it happening to a lot of physical servers also, so it may be a Linux/kernel problem and not necessarily Xen.

cmtech
04-28-2005, 12:37 PM
Yeah I can see why they are a bad idea for that reason and I'm glad you guys don't use them. They still shouldn't make the host system crash, but I'm just taking wild guesses at why it's playing up, and if you guys use LVM then it's definitely not that. It happened to me again just now: the filesystem got corrupted while online and the disk was remounted read-only. I don't think simply turning this option off would be a good idea. After reading the manpages and list posts from '98 or so, when debian were stealing the idea to ship their distro like that from redhat, I could easily end up losing more data that way...

Is there anything I could do to provide info on why this is happening?

msh
04-28-2005, 02:04 PM
The problem just happened for me again. Strange thing.

matta
04-28-2005, 02:58 PM
Is everyone experiencing this problem using 2.6.11? Was the problem seen under 2.6.10?

cmtech
04-28-2005, 09:49 PM
I don't remember having this problem with 2.6.10 but I hadn't long had it when you swapped it either. It's happened again just now... when you upgraded it to 2.6.11 I tried to change it back but it said the files were missing. Perhaps you could put 2.6.10 back on there as a choice and see if it solves the problems? Yet - why does it also kill Teknic? Think when it comes back I will downgrade to 2.4.xxx and see if that helps.