Bunnies, lollipops & rainbows!
What a wonderful week it's been.
Follow up:
Last Friday we had to do an in-situ server switch since the 4+ year old machine was spontaneously rebooting without any logs, errors, messages, warnings, or reason. Since it's the primary file server for the site, without it people start complaining they can't work. That would be a lie - since they can still send/receive email, access the SAP system, use the intranet apps (my kettle of fish, which still worked), use their phones, and use a pen... to write! Oh the wonders of technology!
But anyway, the server was dying in catastrophic fashion. Luckily the replacement fleet of servers had arrived a few days before and we had been in the process of prepping its replacement. So on Friday, due to worse and worse (and irregular) uptime, we made the call to do an unscheduled outage and quick-build the new server to be an in-situ replacement.
Amazingly enough it appeared to have worked. With only an hour of downtime we managed to seamlessly transition out the old server and usher in the new one with users none the wiser. Great stuff.
But Monday morning things started to unravel. Despite monitoring the server over the weekend, the start of the week would be when it really got tested. And it was then that the problems started to surface. Initially it was the automated chkdsk on reboot wiping out the carefully constructed share permissions. This meant they all had to be rebuilt by hand with a backup structure for reference, a mammoth task when there are roughly 800 user accounts and probably that again in critical common shares.
Then by Tuesday we were seeing corrupted files: people would open a file and see blank contents. They'd open a folder and there'd be nothing there... "what the hell?" So we'd restore from the backups and everything seemed to be peachy. Then more and more corruption came to light, and it quickly turned into a corruption cascade that was spreading across the terabyte faster and faster. We again had to drop the machine off the network to save what data we could.
The team stayed in the office overnight, working on it until Wednesday morning, when we were kicked out for working too many hours.
After a large amount of grief, several parts shipped in by courier, and a great deal of shuffling of servers, it was finally good as gold Thursday afternoon. Turns out the problem was due to having the network storage devices hooked up... to a network. Since the new server is the only new machine in there at the moment, it's the only one running the new company standard OS (Windows Server 2003), while the others are running a mixture of Windows NT SP4 and Windows 2000. So since all servers had access to the storage devices, when they decided to do a virus scan or otherwise access the device for any reason they were corrupting the file allocation tables of the new structure. What a bitch.
Simply providing a direct fast link between the file server and the network storage (nothing else) restored stability and halted the corruption. Crazy stuff.
Personally I think it's almost a good thing, since it may well get people to comprehend just how much they rely on the IT crew to keep things running. Without the primary servers running in top notch condition the fragile corporate world starts to rapidly crumble. So next time we ask for more funding, extra resources, or a new chair (I want a leather power chair), hopefully they'll be a little more responsive. Probably not, they'll just think we fucked up and will think worse of us - never understanding the Herculean effort that went in under immense pressure and with no support. But it'd be nice all the same.
The other thing that's occurred is that we've finally got a new boss. So I can stop doing the jobs of 2 or 3 different people at the same time. Yay!
He seems like a top bloke and straight up on the first day when we were discussing the usual things with a newish person (work hours, where's the beer at, what can you get away with) he said 'As long as you guys get stuff done out there, you can kick back and play Unreal Tournament in here'. Score.
So today I brought in Quake3 to play during lunch so I'll be prepped a little for Quake4 (which comes out in 3 or 4 days). I'm thinkin Quake4 is going to be huge, so I intend on getting all the practice I can get.
But it all goes to show that a so-called 'cushy-office-job' isn't really as cushy as one might think when the fit hits the shan. It's not all bunnies, lollipops & rainbows!