Was there another glitch?


shifting
10-05-2005, 05:18 AM
From about 21:23 to 21:56 (MDT), my monitoring system showed services randomly going online and offline, and I can't find anything in my server logs to support this. Could there have been some more network problems causing severe packet loss? (Or maybe just my end was screwed?) I'm on VM1, if anyone else noticed anything.

mroch
10-05-2005, 05:21 AM
[QUOTE=shifting]From about 21:23 to 21:56 (MDT) My monitoring system showed services randomly going on and off line, and I can't find anything in my server logs to support this. Could there have been some more network problems causing severe packet loss? (or maybe just my end was screwed?) I'm on VM1, if anyone else noticed anything[/QUOTE]

This is not an announcement. It belongs in Main > Networking.

matta
10-05-2005, 04:58 PM
It could have been anything. I don't show any outages on our end.

shifting
10-05-2005, 11:17 PM
It happened again. This time I was at my computer at the time so I was able to run some tests.

1) I couldn't get to unixshell.com either
2) I COULD get to Yahoo, Google, and a few other test sites
3) ping showed a packet loss of ~85%

Something isn't right here.

At least this time it was a shorter outage...only lasted about 5 minutes. (4:09-4:14 MDT)

Ben Eastep
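
For anyone who wants to log a figure like the ~85% above instead of eyeballing it, here is a minimal sketch that pulls the loss percentage out of a ping summary line. It assumes the common "N packets transmitted, M received" wording that Linux and BSD-style ping print; the sample line is illustrative and was not captured during this outage, and the function name is made up for this example.

```python
import re

# Matches both "5 packets transmitted, 5 received" (Linux iputils)
# and "5 packets transmitted, 5 packets received" (BSD/macOS).
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output: str) -> float:
    """Return packet loss as a percentage, computed from the ping summary line."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Illustrative summary line (an assumption, not real output from the thread).
sample = "20 packets transmitted, 3 received, 85% packet loss, time 19012ms"
print(packet_loss_percent(sample))
```

Computing the value from the sent/received counts rather than trusting the printed percentage keeps the result comparable across ping versions that round differently.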

shifting
10-05-2005, 11:25 PM
Problems are ongoing. I'm managing to sneak this message in during a brief burst of uptime.

matta
10-06-2005, 02:06 AM
Where did the packets stop on a traceroute?

matta
10-06-2005, 02:23 AM
OK, it was related to our upstream provider and only affected people coming in via Level (3). It seems Cogent and Level (3) de-peered and banned each other from their respective networks... battle of the backbones.

shifting
10-06-2005, 06:09 AM
I understand the de-peering stupidity, but I am curious...
1) The technology page says
"Our network consists of 6 gigabit links to 5 different premium Tier-1 providers"
Shouldn't that mean that even with Cogent and Level (3) trying to screw each other up, my traffic would find another way through?
2) 15% of my packets got through anyway...

or maybe 15% of my packets DID find another way, in which case, I wonder why the other 85% didn't...

whatever. I'm just glad things are back up now.

Ben Eastep

matta
10-06-2005, 03:29 PM
Yes... normally it would just re-route. The problem is that the BGP mix consisted of both Level (3) AND Cogent at the time, so we were connected to both backbones when they decided to block each other. In short, it created a big mess for a LOT of people (not just us/our upstream). The quick fix was to remove Cogent from the mix.

I'm still sketchy on how this all works out; we are just colo clients in the datacenter, so I don't run the core part of the network, just our own branch. According to the upstream, Cogent isn't going to be re-added. That is just fine, though, as we will be bringing in Cogent via our own ASN soon to enable IRC access and less expensive bandwidth for those who want it.

matta
10-06-2005, 03:34 PM
Some news stories regarding this.

http://www.theregister.co.uk/2005/10/06/level3_cogent/
http://ask.slashdot.org/askslashdot/05/10/05/2247207.shtml?tid=95&tid=187&tid=4

devnu11
10-06-2005, 06:56 PM
I am unable to reach my unixshell servers or the unixshell website through Time Warner.

#traceroute www.unixshell.com
traceroute to www.unixshell.com (65.254.53.254), 64 hops max, 40 byte packets
1 10.118.224.1 (10.118.224.1) 5.079 ms 10.045 ms 5.313 ms
2 pos6-1-0.albynyams-rtr01.nyroc.rr.com (24.29.32.213) 6.130 ms 7.307 ms 6.799 ms
3 pos1-0.albynysch-rtr01.nyroc.rr.com (24.29.32.41) 8.292 ms 5.675 ms 5.983 ms
4 pos2-0.albynywav-rtr02.nyroc.rr.com (24.29.32.153) 6.535 ms 5.946 ms 6.788 ms
5 son2-0-0.albynywav-rtr03.nyroc.rr.com (24.29.32.105) 7.785 ms 9.258 ms 10.289 ms
6 pop1-alb-P6-0.atdn.net (66.185.133.225) 9.092 ms 19.054 ms 7.842 ms
7 bb1-alb-P0-0.atdn.net (66.185.148.96) 66.013 ms 9.856 ms 9.320 ms
8 bb2-nye-P3-0.atdn.net (66.185.152.71) 12.217 ms 11.234 ms 12.207 ms
9 pop1-nye-P1-0.atdn.net (66.185.151.51) 44.766 ms 25.997 ms 10.957 ms
10 GlobalCrossing.atdn.net (66.185.151.62) 11.397 ms 12.121 ms 21.384 ms
11 so1-0-0-9953M.ar4.ATL1.gblx.net (67.17.68.234) 35.675 ms 32.768 ms 33.732 ms
12 * * *
13 * * *

Hmm, makes it to Atlanta, doesn't appear to be a Time Warner problem. Now what?
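
To answer "where did the packets stop?" without reading the whole dump by eye, a rough sketch like the one below picks out the last hop that actually replied. It assumes Unix-style traceroute output in the "hop hostname (ip) times" layout shown above; the function name is hypothetical, and the sample is the tail of the trace from this post.

```python
import re

# A responding hop looks like: "11 host.example.net (67.17.68.234) 35.675 ms ..."
# Hops that timed out ("12 * * *") and the header line do not match.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d+\.\d+\.\d+\.\d+)\)")


def last_responding_hop(traceroute_output: str):
    """Return (hop_number, hostname, ip) of the last hop that replied, or None."""
    last = None
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if match:
            last = (int(match.group(1)), match.group(2), match.group(3))
    return last


trace = """traceroute to www.unixshell.com (65.254.53.254), 64 hops max, 40 byte packets
10 GlobalCrossing.atdn.net (66.185.151.62) 11.397 ms 12.121 ms 21.384 ms
11 so1-0-0-9953M.ar4.ATL1.gblx.net (67.17.68.234) 35.675 ms 32.768 ms 33.732 ms
12 * * *
13 * * *
"""
print(last_responding_hop(trace))
```

Keep in mind that the last responding hop only shows where replies stopped coming back; the actual fault may sit one hop further along or on the return path.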

devnu11
10-06-2005, 08:49 PM
Amazing, I can connect again to unixshell. :D

matta
10-06-2005, 09:06 PM
I'm not sure what the problem is there... we have no reported outages ourselves. I do know that a lot of internet routing is messed up due to the Cogent/Level 3 fiasco.

More info:

http://www.theregister.co.uk/2005/10/06/level3_cogent

http://ask.slashdot.org/askslashdot/05/10/05/2247207.shtml?tid=95&tid=187&tid=4

fmccoey
10-07-2005, 10:57 PM
I have been real happy with unixshell since signing up. I now have about 6 clients on my machine and they have also been happy, but, in the last week too much has happened for me to dismiss it.

* Just this morning (Australian time) I could not connect to anything, unixshell or any of my hosts. Traceroutes stopped in gnax somewhere.
* The other days we have had outages for a couple of hours each.
* Recently all my ssh connections just drop after just a few minutes of idle (this does not happen when I ssh to hosts not on unixshell's (gnax's) network).

If this is all related and is going to be fixed (i.e. that firewall issue) then Matt, please let me know so that I have something to tell my clients as well. If not then please just let me know!

I would like to continue loving this service but just this morning I am annoyed. I feel a little better after venting this but some explanation would be great.

-felix

matta
10-07-2005, 11:01 PM
The outage over a week ago was our fault. Lately everything has just been spotty due to the whole Level 3 / Cogent thing. I already posted articles on this forum that explain the problem... in short, Level 3 has partitioned the internet and really shown how 1 backbone can wreak havoc on the internet. We are nowhere near alone in being affected by this.

fmccoey
10-07-2005, 11:19 PM
Thanks Matt, I realise that it is somewhat out of your hands. Thanks for the articles (which I have now read!). Keep up the good work. ;-)

prox
10-20-2005, 02:37 AM
It would seem all of unixshell was once again down for something like 15-30 minutes tonight (not sure when it started). Another firewall problem? Can anyone confirm?

joshmh
10-20-2005, 02:39 AM
Yep, I recorded downtime from:

2005-10-19 20:52:08 EST to 2005-10-19 21:30:21 EST.

Seems like unixshell.com was down, as well.

-- Josh
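
For comparing against the "15-30 minutes" estimate above, the length of that recorded window can be worked out with a few lines. This is just a sketch of the arithmetic; it treats both timestamps as being in the same zone (both are given as EST), so no time zone conversion is attempted.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Start and end of the recorded downtime, both as reported in EST.
start = datetime.strptime("2005-10-19 20:52:08", FMT)
end = datetime.strptime("2005-10-19 21:30:21", FMT)

outage = end - start
print(outage)                  # 0:38:13
print(outage.total_seconds())  # 2293.0
```

So the recorded window is a little over 38 minutes.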

orion2012
10-20-2005, 02:46 AM
Same for me. On Time Warner broadband, tried several of my university shell accounts as well and they couldn't access unixshell or my server.

yejun
10-20-2005, 02:58 AM
1 <1 ms <1 ms <1 ms 192.168.5.1
2 23 ms 23 ms 24 ms 10.32.37.1
3 23 ms 24 ms 25 ms 130.81.11.173
4 25 ms 24 ms 28 ms so-6-0-0-0.BB-RTR1.NY325.verizon-gni.net [130.81.18.88]
5 31 ms 33 ms 31 ms so-6-2-0-0.BB-RTR1.RES.verizon-gni.net [130.81.8.253]
6 32 ms 31 ms 31 ms 130.81.10.90
7 31 ms 30 ms 35 ms ge6-2.br02.ash01.pccwbtn.net [63.218.94.21]
8 49 ms 49 ms 48 ms gnax.ge2-3.br01.atl01.pccwbtn.net [63.216.31.158]
9 49 ms 51 ms 48 ms 209.51.149.109
10 55 ms 49 ms 51 ms 65.254.48.2
11 56 ms 49 ms 50 ms 209.51.131.102
12 209.51.131.102 reports: Destination host unreachable.

btm
10-20-2005, 03:26 AM
I couldn't get past GNAX either. Unixshell.com was down, as well as my servers.

matta
10-20-2005, 03:37 AM
You'll want to see: http://www.unixshell.com/forum/showthread.php?p=2471#post2471.