Tales From Tech Support 01: Percussive Maintenance
2020-07-26 04:04:00 Author: fyr.io(查看原文) 阅读量:296 收藏

If you’ve worked in IT for more than a year you probably have some crazy tales to tell. I certainly do – with nearly 15 years in the field I have seen insanity and hilarity – more than I could ever remember. So I thought to myself, why not write it all down somewhere?


This first tale comes from back in my early days.

As a PFY, I was trusted with little more than basic desktop repairs and printer toner replacements – a fairly common slice of life for many IT bods. I was relatively fresh faced, a few scars but nothing major. Eager to learn and eager to please, I was often the first to raise my hand and take any challenging job, despite the vast gaps in my knowledge. Our team was small – three techs (two general helpdesk end user support PFYs, one mobile device repairs) with one very isolationist “Network Guy” (who would now be called a SysAdmin) and one “Database guy” (although their only qualification was knowing what “SQL” stood for.)

It came as quite a surprise when, early on a sunny Monday, we were told that the Network Guy was leaving. And he wasn’t being replaced.

It quickly became apparant that the entirety of his duties were to fall on myself and PFY#2, who had been working at this place for a bit less than me. Networks Guy’s last day rolled around without much issue, nor much communication from him – in fact at one point I asked for his help with a network issue and he told me “I don’t care, I’m leaving!” So it came as a bit of a surprise (and equal amounts of relief) when he called me and PFY#2 up to his office to give us his handover document.

His handover document consisted of a single sheet of A4 with handwritten notes about a few things barely qualifying as useful. A few IPs and other miscellaneous details about servers and switches, the odd issue he knew about but hadn’t fixed, and maybe a password or two.

We received the document at the door to his office (we weren’t tolerable enough to go inside on this occasion?) and quickly shoo’d off to our regular helpdesk support duties.

That evening, he left and we never saw him again.

The following weeks and months were absolutely insane. I can’t recall much about what happened during this time. Myself and PFY#2 managed to keep on top of most of the helpdesk support calls and make a start on untangling the network. We quickly found that most switches were essentially unboxed and plugged in without any config changes, servers were unpatched and had uptime into the hundreds of days (you could tell when the last powercut had been by looking at the uptime) and we were getting very close to the limit on resources – maxed CPU and/or RAM, HDDs filling up. Group policy was a mess, roaming profiles reaching into the tens of gigabytes with nothing preventing their growth… it was a mess.

To top it off, out the back of Network Guys office was another small closet containing almost all the servers we had. Neither of us had ever been back there, and when we did we found dust, cobwebs and equipment that seemed to be switched on but we had absolutely no idea what it did. Helpfully the single sheet of A4 told us some server names and serial numbers.

We were fueled by Red Bull and the long (long) days began to blur into one massive learning experience. To this day I have never learned so much so quickly as I did back then as we fought to keep the place running, the users happy(ish) and continue to learn as much as we could.

There was one event, though, that utterly stumped us.

Cheerily and awesomely smashing the helpdesk as we were, we suddenly had calls coming in about email being offline. After a quick check we realised that, yep, email was down. We couldn’t RDP into the Exchange (2003) server either – something was wrong, clearly. Off to the dusty old Network Guys office we go.

We walk in, grab the single 4:3 CRT monitor in the room, stretch the cables across the room from one of the power extension cables with a spare socket to one of the nearby tables, and plug the crusty old VGA cable into the back of the absolute beast of an exchange server. I mean, this thing was huge. An old tower, black, solid steel everything. No idea why it was up in this first floor room when all the other servers were on the ground floor, but… whatever.

Flicking the monitor on we quickly saw everyones favourite screen. Yep, it’s blue, and it signifies death. The BSOD.

We panicked a little, probably downed another Red Bull each, then got to work trying to bring this thing back online.

Remember: we were literally flying by the seat of our pants here, and had been four months at this point. We had no idea what we were doing.

First things first – switch it off and back on again. We held the power button down, heard a massive CLUNK, the fans span down and the screen went black. Take a breath. Switch it on again, wait for the BIOS, wait for Windows to start booting, wait some more…. BSOD.

Crap.

We try again, switch it off and back on. This time, we hear a horrible grinding noise as the machine spins up. We get to Windows trying to boot and everything freezes – not even a BSOD. Off it goes once more, switch it on, grinding noise, BIOS doesn’t even finish loading.

Double crap.

Backups! Backups? There are backups, right? One of our jobs is to take the tapes out of the drive and swap them with the next numbered batch in the fire safe, surely we could restore this? But… Network Guy did the backups. Network Guy didn’t elect to write anything down about the backups! We don’t know anything about them! An oversight of the highest order!

We know the boss has the phone number of Network Guy, but we need to try fixing this ourselves first. We don’t want him shouting down the phone at us like the one and only time we called him for help before this…

More Red Bull, more diagnosing. On the odd occasion we can get Windows to try booting, and sometimes we can get to the login screen using Safe Mode, but no matter how quick we are, we can never get logged in, and even this doesn’t last forever. Eventually the server just stops trying to load Windows and we’re presented with some error about not finding a HDD. The grinding at this point is still going on and we’re faced with our fear that the grinding isn’t a fan, but the single HDD that all of our email is stored on.

There’s nothing else for it. We’ve gotta call up Network Guy and ask his advice. Neither of us want to do this though – he wasn’t helpful to us when he worked here, and especially not when he was leaving.

Is there nothing we can do?

I remember the moment – we were stood either side of this huge hulking great server with no more options (that we are aware of at any rate). Our eyes meet, and without saying a word we both think the same thing at the same time.

PFY#2: “Shall we?”

Me: “I dunno… I mean it’s not working so maybe?”

PFY#2: “I think we should.”

Me: “Okay. Let’s do it.”

PFY#2: “Go on then, you can try first.”

Me: “No way, you do it.”

We look down at this ancient black server, lined with solid steel frame, sides and front, touched with the occasional bit of cheap plastic and the odd faded sticker.

I sigh.

PFY#2 raises his foot, and kicks the bastard right in the side, smack bang in the middle.

The grinding noise changes in pitch audibly. It’s still there, still buzzing away in that audio range that you think you can put up with but slowly sends you insane without you realising it. PFY#2 reaches down and holds the power button in to switch it off. He switches it back on.

BIOS loads up, boots fully, screen goes black.

Windows loads. And loads. We stare. And it continues to load. The login window appears after what must have been at least 25 minutes. We’re in shock. No BSOD. No lock up. Buzzing? Yeah that’s still there, but the server has booted. What the f-?

We rush to a nearby office, get the first user to open up outlook. It connects. Emails from the upstream start flooding in to their mailbox.

We check our own, same thing – it’s working.

We’re buzzing. Our blood is filled with adrenaline, caffiene, sugar, and whatever the hell else they put in Red Bull, but it also filled with joy. We fixed a server by kicking it.

After spreading the word that email is back, we get right back to our helpdesk calls, of which dozens have appeared since our exchange issue surfaced.

Not long after this we do eventually employ a sysadmin who I work with to this day, but this exchange server didn’t get replaced immediately. I can’t remember exactly when it was retired, but it whirred on for a good year or more after this. We tried very hard to not touch it – it wasn’t perfect, and we definitely had at least one more very confusing issue with it, which I’m sure I’ll write about at some point, but the beast chugged on and kept our email flowing until it was eventually replaced by a younger sexier model.

Some say that if you can find your way into Network Guys old (and long repurposed) office, and if then you can manage to make your way into the back room, you can, on quiet days when email traffic is up, still hear the buzzing of that once-failed-but-then-recovered hard drive.

This is how I came to learn about and respect percussive maintenance.


文章来源: https://fyr.io/2020/07/25/tales-from-tech-support-01-percussive-maintenance/
如有侵权请联系:admin#unsafe.sh