I recently had to see an ENT (ear, nose and throat) doctor because I had developed a hollow cough that wouldn’t go away. She has been my ENT for years, and she’s fantastic. I told her what had been going on and for how long, and that I thought I had seasonal asthma. She put a scope down my throat to my windpipe, made me make some weird sounds and cough, and said, “It’s not asthma. Your throat is narrowing. This could be serious. We need to figure out why it is happening.”.
Off I went for tests and scans that ruled out the more sobering options, and paved the way for surgery. The surgery would give her a chance to take a much better look at the narrowed area and fix it, as well as to help determine why it might be happening. I came out of surgery able to breathe. Like I said, she’s fantastic. She told me that it was likely caused by a prior surgery I had undergone several months before. I had experienced terrible throat spasms all night, and had been told that the anesthesiologist had a difficult time getting the breathing tube down. She said that it would be difficult (at best) to prove, but she knew my throat, what it had looked like before, and all the scar tissue she had to free up after, so it made the most sense. I’ve been fine ever since.
How was she able to help me so quickly? Because she knows my ears, nose and throat like no one else. She has treated me for years. She knows what is normal for me, and what is not, and when she sees a change and hears a history, the correlations work themselves clear.
As DBAs, our servers are our patients. There are things we should know about them. What is normal; what is not. We should have an idea of the history, so we can be proactive about preventing future issues.
I have a database that tracks these things throughout my production environment. Local databases gather the data I need, and then send it to my central hub. Info is kept for three months at a time, and I have a report that I have written to show me the trend over that length of time, so I can try and see what might go wrong (or whether we’ll need more space, or CPU, or code review – you get the idea). Basically, I track the following:
- Brent Ozar’s Blitz results and whether or not they were fixed (done once monthly and recorded to a table on each server, then sent to the hub for archiving)
- CPU utilization – both by SQL Server and non-SQL Server processes. I find breaking these out is helpful
- IOs by drive and server (again, breaking them out is helpful)
- PLE – here I’m looking for sustained low page life expectancy, rather than quick dips. Are the dips getting longer? What times of the day? Can I correlate them to application processes reliably?
- How long is it taking for Ola’s indexing and CHECKDBs to complete on each server?
- How many times have we had to revert either to a backup or a database snapshot?
- How many times has a server been rebooted?
- What are the top five waits on each server? Have they increased or changed?
- What are my top ten databases across production in terms of size, and how much are they growing by month/quarter/year?
- How much space is each server taking up on the SAN?
In the coming weeks, I’ll talk about ways of getting this information for yourself. Stay tuned.