blob
drs' stuff

Thursday, 22 June 2006

Network Loops Considered Stressful

We now know that the problem was that a visiting academic plugged a fly lead from one network socket into the next. However, it took me about three panicked hours to deduce that, and about another three to fix all the problems it caused.

Around lunch yesterday, we noticed that the stack of new 48-port Nortel gigabit switches in AMM couldn't be pinged. On closer inspection, we saw that one switch had no link lights at all and some cascade links were down, which led us to suspect that the switch was faulty. Removing it from the stack didn't solve the problem: the multi-link uplink from the stack to the core switch was still not working. Since one of the links in the trunk was on the switch that had been removed, I presumed that I'd need to configure that link out of the trunk. That still didn't help. Removing the trunk configuration entirely on both the stack and the core switch got us no further, either.

By this time, the network technicians had arrived, and revealed that the reason there'd been no link lights on the suspect switch was that nothing had ever been patched into it! So it looked likely that the switch was fine -- scratch that theory. We then realised that there was an enormous amount of broadcast traffic, and that STP had not been enabled on the switches in the stack. Thinking that that sounded like a loop, I turned on STP, but the broadcast storm continued. Starting to doubt that STP was working on the stack, I began a binary search to find the troublesome port, disabling half the ports at a time. With the first iteration, the loop stopped, but re-enabling ports didn't reveal the culprit.
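For the record, the binary search I had in mind looks roughly like the sketch below. It is only an illustration: `storm_persists` is a hypothetical probe (in real life, a human disabling ports and watching the traffic counters), the port numbers are made up, and it assumes that nothing else changes the topology between probes.

```python
from typing import Callable, Sequence, Set


def find_loop_port(ports: Sequence[int],
                   storm_persists: Callable[[Set[int]], bool]) -> int:
    """Narrow down one port involved in a loop by halving the candidates.

    storm_persists(disabled) is a hypothetical probe: it disables the given
    ports (leaving all others enabled) and reports whether the broadcast
    storm is still going.
    """
    candidates = list(ports)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if storm_persists(set(first_half)):
            # Storm continues with the first half disabled, so a looped
            # port must be among the remaining candidates.
            candidates = candidates[mid:]
        else:
            # Storm stopped, so the first half contains a looped port.
            candidates = first_half
    return candidates[0]


# Simulated example: a fly lead joining (made-up) ports 17 and 18.
LOOPED = {17, 18}


def simulated_storm_persists(disabled: Set[int]) -> bool:
    # The loop is broken as soon as either end of the fly lead is disabled.
    return not (LOOPED & disabled)


culprit = find_loop_port(range(1, 49), simulated_storm_persists)
print(culprit)  # one of the two looped ports
```

On a 48-port switch that is about six rounds of disabling ports, provided each probe gives an honest answer.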

It turns out that turning on STP had no effect until a link state change occurred. Disabling and re-enabling ports had caused one, and STP had then picked up the loop and blocked one port. Drew and Mike found the offending fly lead, while I reinstated the multi-link trunk configuration. After a bit of testing, we called it a day.

Unfortunately, I didn't notice that the uplink to the smaller stack of 100 Mbps switches had also been knocked out by the loop. So we went back this morning, and enabled STP on that stack, too. (Why did neither of the stacks have STP enabled: was it the default setting, or did someone actively disable it?)

No sooner was I back at my office than it was reported that the phones didn't work. They couldn't call extensions on the rest of campus, but could call outside numbers via AMM's PRI to Telkom. Back up the hill to AMM. After checking the networking to the IP PBX and finding nothing wrong, I called our PBX support technician, who'd found that the problem was the IP-to-TDM gateway in the main PBX on campus. After taking that card out of service, rebooting it, and starting it up again, everything finally worked. It seems that the looping broadcast traffic had been carried across campus on the VoIP VLAN to the gateway and overwhelmed it.

Now we really need to find out which other switches on campus don't have STP enabled... hopefully before this happens again.

posted at: 16:13

Monday, 19 June 2006

Kiosk web browser setup

A majority of students in Rhodes' residences have PCs and pay for ResNet access, and thus have access to the network from their rooms, but there are still many students who can't afford that. So it was suggested that we provide a PC in the residence's common room to give those students basic e-mail access and meal booking.

We (systems administrators) had some concerns about unattended machines in res: lab machines are carefully set up with limited rights, etc., whereas these machines wouldn't be supported by I.T., nor necessarily be entitled to site-licensed software. So I set up a locked-down FreeBSD installation with X and Firefox, which is secure and completely free.

The machine boots into X (autodetecting video hardware at each boot), and starts Firefox. I've hacked Firefox's chrome so that it goes into fullscreen mode at startup, and removed the UI controls to go out of fullscreen mode, so you can't get access to the rest of the desktop. I picked Blackbox as the window manager, as it has no built-in keyboard support, so there's no chance of accessing the window manager's functions to start any other programs. The machine itself has been hardened with a restrictive firewall (which only allows outgoing connections to our proxy server), only /var is mounted read-write (for reliability purposes), and the browser user's home directory is restored from a tar archive every time the browser is restarted (so no-one can permanently change the browser's settings). It might not be completely bulletproof, but it's a whole lot better than Windows 98. :)
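The restore-on-restart trick can be sketched as below. This is an illustration in Python rather than the actual script on the kiosk; the function names, the paths and the browser command in the final comment are all assumptions.

```python
import shutil
import subprocess
import tarfile
from pathlib import Path


def restore_home(archive: Path, home: Path) -> None:
    """Throw away the browser user's home directory and unpack a pristine copy."""
    if home.exists():
        shutil.rmtree(home)
    home.mkdir(parents=True)
    with tarfile.open(archive) as tar:
        # The archive is one we built ourselves, so it is trusted here.
        tar.extractall(home)


def run_kiosk(archive: Path, home: Path, browser_cmd: list) -> None:
    """Loop forever: reset the home directory, then run the browser until it exits."""
    while True:
        restore_home(archive, home)
        # check=False: a crashed browser should simply be restarted.
        subprocess.run(browser_cmd, check=False)


# Hypothetical invocation (paths and command are made up):
# run_kiosk(Path("/var/kiosk/home.tar"), Path("/var/kiosk/home"), ["firefox"])
```

Because the reset happens before every launch, any change to bookmarks, preferences or the hacked chrome lasts only until the browser next exits.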

One of these machines has been installed in a residence so far. According to the warden of the res, it's been well-received, and our traffic graphs show that it's used almost continually. It's nice to see it being put to good use.

[Images: Webkiosk screenshot, Webkiosk usage]

posted at: 17:20

Wednesday, 14 September 2005

IBGP design and testing

The need for a dynamic routing protocol on the Rhodes network is becoming apparent. At the moment, we've set up a bunch of static routes between our two Nortel Ethernet Routing Switch (née Passport) 8600s, the Internet firewalls and the ResNet firewall. Next year, though, we'll be adding a third 8600 and a second ResNet firewall to the mix. The ResNet firewalls, in particular, will complicate matters as each will have about 20 not-very-aggregatable networks behind them.

Based on Guy's experience with the GINX BGP setup, we reckoned we'd set up IBGP rather than trying to get our heads around another routing protocol (such as OSPF, which we considered).

Guy's first tests with the 8600's BGP implementation weren't very fruitful: it flatly refused to advertise any networks to its peers. Now that we've upgraded the software on the 8600s, I've had another go at it, with much more success. So far, we've got two 8600s, an old Cisco 7200 and zebra on FreeBSD peering.

We're less concerned about reliability than about ease of configuration. We're hoping that Nortel's SMLT/RSMLT implementation will provide us with a resilient, triangular-shaped backbone segment, which will elegantly solve the reliability issues at layer 2 rather than layer 3.

Our intended BGP design tries to maintain our fairly flat routing structure: all routers will have an interface on the backbone subnet, and create a full IBGP mesh on that subnet. Full mesh should be easy enough to manage, because we should only need one peer group and n neighbour statements in each router. If that proves too painful, we could use the 8600s as route reflectors, with a full mesh between them.
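To get a feel for how the two options scale, here is a small sketch that counts IBGP sessions. The router names are made up (I've assumed a single Internet firewall for the count), and the route-reflector variant assumes every client peers with every reflector, which is only one possible arrangement. In a full mesh, each of the n routers peers with the other n - 1.

```python
from itertools import combinations


def full_mesh_sessions(routers):
    """Every pair of routers shares one IBGP session: n * (n - 1) / 2 in total."""
    return list(combinations(routers, 2))


def neighbours(router, routers):
    """The peers a single router must be configured with in a full mesh."""
    return [r for r in routers if r != router]


def reflector_sessions(reflectors, clients):
    """Full mesh between the reflectors, plus each client peering with each reflector."""
    mesh = list(combinations(reflectors, 2))
    client_links = [(c, rr) for c in clients for rr in reflectors]
    return mesh + client_links


# Hypothetical names for next year's routers.
REFLECTORS = ["ers8600-a", "ers8600-b", "ers8600-c"]
CLIENTS = ["inet-fw", "resnet-fw-1", "resnet-fw-2"]
ROUTERS = REFLECTORS + CLIENTS

print(len(full_mesh_sessions(ROUTERS)))               # 15 sessions for 6 routers
print(len(neighbours("inet-fw", ROUTERS)))            # 5 neighbours per router
print(len(reflector_sessions(REFLECTORS, CLIENTS)))   # 3 + 9 = 12 sessions
```

At this size the difference is small, which is why full mesh looks manageable for now; the gap widens as more non-8600 routers are added.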

Whiteboard Toys

posted at: 23:08

This work is licensed under a Creative Commons License