So, you're supporting a server. It might be yours directly, it might belong to a customer. Doesn't matter. You've got an urgent issue - an alert, a ticket, an end-user reporting a problem - and you need to get moving. Where do you start? Everyone has a method, and if you haven't developed yours yet, here's one you can borrow in the meantime. :)
Step One - Chill!
The most important thing you can do first is not to panic. Acknowledge the problem, let the end-user know that you're aware and working on it, but stay calm. Even if you're out of your depth, the user doesn't need to hear that right now - they need to know that they can leave the problem with you. You may well need to escalate it or ask for help, and there's no shame in that; we all have something to learn. Even if you're worried and uncertain, try not to let them hear it: that will only stress them more, they will react by passing that stress back to you, and so on in what we call a 'stress cascade'.
Step Two - Analyse.
Right. You're breathing smoothly, the user is confident that you have this under control. Analyse the problem using the error report as a starting point. If you don't have a good error report (it's just running slowly, say...), do a quick once-over of the server, looking for unusual or concerning states. Is CPU massively up? Is RAM suddenly on fire? Having a specific error message to start from will make this much smoother, but we don't always have that luxury, so in those cases it's our job to go and find one.
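If you'd like that quick once-over to be repeatable, here's a minimal sketch in Python using only the standard library. Everything in it is my own illustration rather than a standard tool: the function names, the metric names and the 0.9 thresholds are assumptions you should tune for your own server. It leaves out RAM because the standard library has no portable memory reading; a third-party package such as psutil can fill that gap.

```python
import os
import shutil

# Hypothetical thresholds for illustration; tune them for the server in question.
THRESHOLDS = {"cpu_load_per_core": 0.9, "disk_used_fraction": 0.9}


def flag_concerns(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


def collect_metrics(path="/"):
    """Gather a few cheap readings using only the standard library."""
    metrics = {}
    usage = shutil.disk_usage(path)
    metrics["disk_used_fraction"] = usage.used / usage.total
    # os.getloadavg() only exists on Unix-like systems and can raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            one_minute_load = os.getloadavg()[0]
            metrics["cpu_load_per_core"] = one_minute_load / (os.cpu_count() or 1)
        except OSError:
            pass
    return metrics
```

Calling `flag_concerns(collect_metrics())` gives you a short list of what looks off, which is a starting point for digging rather than a diagnosis.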
Step Three - Liaise.
Right. You're calm, the customer is confident, and now you have an idea what the problem is. Or... do you? Perhaps there's nothing obvious, perhaps you don't know what to look for next. This step is probably the most complex, so bear with me here.
If you've found the problem, and are confident in the path to resolution, liaise with the user. Let them know what you are going to need to do and how that's going to affect them. If the fix needs a service outage or similar, ask them when you can schedule it - four hours of slow running is often less painful to your users than an outage in the middle of the day, and that's not your call to make - it's theirs.
If you've found the problem and are not confident that you have the solution, liaise with a colleague. If you don't have a colleague, google the error message you've found (excluding things like server/instance names - just the number and the software that's reporting it) and see whether Microsoft, StackExchange, or other reputable sites covering the software in question offer solutions for similar cases. Obviously if you're working with SQL, I'd suggest Coeo.com be considered reputable, but then I'm biased. Go back to step 2 with your new information and check the symptoms and errors against your new-found knowledge, then come back to step 3.
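To make the "just the number and the software" advice concrete, here's a small Python sketch that builds a search string from an error message while dropping environment-specific names. The function name is hypothetical, and treating the first standalone run of digits as the error number is an assumption that won't suit every product's error format.

```python
import re


def build_search_query(software, error_text, private_terms):
    """Build a search string from the software name and error number only."""
    cleaned = error_text
    # Strip server names, instance names and similar details before anything else.
    for term in private_terms:
        cleaned = re.sub(re.escape(term), "", cleaned, flags=re.IGNORECASE)
    # Assumption: the first standalone run of digits is the error number.
    match = re.search(r"\b\d+\b", cleaned)
    number = match.group(0) if match else ""
    return " ".join(part for part in (software, "error", number) if part)


# Made-up sample message and server name, purely for illustration.
query = build_search_query(
    "SQL Server",
    "Msg 18456, Level 14, State 1, Server PRODSQL01, Line 1",
    ["PRODSQL01"],
)
```

Only the software name and the number end up in the query, so nothing about your environment leaks into a public search.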
If you've not found the problem, or you're not confident in your solution and your colleagues can't help, you need to escalate (or 'call an adult', as a technician I know insists on saying). First, let the customer know that you're escalating the issue as it's more complex than expected and apologise for the delay - don't use a form email or boilerplate, you're communicating with a human being here.
Now, you've got a few options:
- Tweet #SQLHELP with your question (assuming you've got an SQL issue, skip this if not!) - there are a lot of people out there who try to answer these.
- Post your question on StackExchange - again, there's a huge community of people out there who will try to answer interesting technical questions.
Those are the best free options. Now onto options which might cost you:
- Contact the software vendor. This is often a long, painful process and will usually involve being asked to patch things and upload diagnostic data, but generally gets you the answer.
- Contact an MSP for the software in question - even if you don't have a support contract, most MSPs will be happy to consult with you and get to the bottom of the issue. In my experience, this is better than vendor support, but again, I'm biased.
Step Four - Move.
Finally. You're calm, your users are on-board with what you're about to do, and you may have just gained a new support partner to help you develop and reduce your stress levels in future. Implement your solution according to the agreed plan.
Congratulations, you just handled a major incident like a rock star. Keep doing it, and the confidence will build each time, letting you do this better and better, with lower stress levels - and keeping calm throughout.
Thanks for tolerating the puns, and hope you enjoyed this brief guide - if you'd like more detail on any of these steps, let me know!