[Note from Pinal]: This is a 98th episode of Notes from the Field series. Maintenance of the database is a very critical activity and I have always seen DBA taking it very seriously. There is a only one problem – there is no single solution or guidance for how to maintain the database. Everybody has their own opinion and way to do different tasks. System maintenance is very straight forward task but I have seen quite often experts even getting confused with the same. Many only focus on index maintenance, statistics maintenance and a few other tasks are common, but understanding the real philosophy of this task is something beyond indexes. When I asked my good friend Stuart about system maintenance, he came up with very interesting guidelines for system maintenance.
In this episode of the Notes from the Field series database expert Stuart Ainsworth explains about three simple guidelines for system maintenance.
This week, I was helping a client review their backup and disaster recovery strategy when they asked me how to handle an ongoing maintenance issue; it wasn’t really a SQL issue, but they were concerned about what to do for a virtualized server running an older, critical website. They’re in the process of replacing the site, but they have to maintain the machine for at least another 6 months.
A recent patch had crippled the machine, and they had to restore the VM from backup, which took the site down for a few hours. Thankfully, their backups worked, but a retrospective look at what happened showed some flaws in their process:
- They knew that the server was critical AND having issues before they applied the patch, and
- Their system admin was inexperienced with this particular type of restore, and that cost them some time during their recovery.
Automate what you can
Most modern system tools will allow you to do some form of automatic maintenance; SQL Server runs jobs on SQL Agent, Windows Server has Task Scheduler, and Linux has cron. Lots of third-party tools offer not only backup capabilities, but also restore verification capabilities and reporting. The security and time saved by using a tool are usually worth the investment; it frees your biggest expense (your people) up in order to focus on other things.
A software process that cannot be automated is only marginally better than no process at all.
— Stuart R Ainsworth (@codegumbo) September 18, 2015
If you must have a manual process, make it simple, and document it
Sometimes you can’t escape manual processes. For this client, the critical nature of this web site meant that they didn’t want to have an automatic reboot cycle, so while patches got pushed automatically, they rebooted manually. Given the fragility of their recent experience, they’ve decided to stop automatically patching, and doing the following steps each month:
- Snapshot the server (for quicker restores).
- Patch and reboot.
- If patch is unsuccessful, restore from snapshot.
- If patch is successful, delete the snapshot.
Simple enough. However, my suggestion was that they write those steps down. Operating from a checklist ensures that procedures are followed consistently, elevating the process to near automatic status. It also gives you an opportunity to periodically review manual processes, and automate pieces as situations and technology changes.
Validate, validate, validate…
So, if you’re automating your processes as much as you can, and simplifying your manual processes, what are your engineers spending their time on? They should be periodically validating that the maintenance plans you implemented are successfully working, including routinely doing recovery drills. Just like a fire drill, IT professionals need to practice what they would do in case of a recovery scenario in order to minimize the downtime associated with an actual recovery.
This is one of those activities that is well received in theory, but rarely done in reality. Support queues are always full, new projects and feature requests are always pressing, and maintenance issues are usually at the bottom of the pile. However, in the event of an outage, routines that have been well practiced are the ones that are the fastest to recover; a crisis situation is the absolute worst time for an engineer to feel like they don’t understand what to do.
Routine maintenance isn’t difficult, but it can be time consuming. To minimize time, automate what you can, and simplify your manual processes. Then, invest your time in preparing for the inevitable. Technology breaks; don’t let it break you.
Reference: Pinal Dave (https://blog.sqlauthority.com)