People who have inherited a SQL Server they need to administer
You’ve just been handed a SQL Server, and you need to make sure there aren’t any skeletons in the closet. What things should you check to make sure everything is healthy?
In this presentation, we’ll run you through the checklist I use when I provide a health check to a customer. First, I check for any resume generating events (RGEs) like broken backups. Then I check for best practices like email alerts and MAXDOP settings. Finally, these checks turn into more a scuba diving expedition, where we look for things that seem off and dig deeper.
You won’t walk away with a perfect strategy, but you’ll have an idea of how one data professional does it.
Enjoy the Podcast?
Why I Want to Present This Session:
This is something that I do anyway, I might as well document it for the community.
Session Transcript: Introductions
Brent Ozar: Ladies and gentlemen, in this session at GroupBy, Eugene, a.k.a Eugenuis Meidinger will be talking about how to do SQL Server health checks. So take it away, Eugene.
Eugene Meidinger: Alright, thank you. I’m very excited to be presenting at GroupBy again. It was a lot of fun last time. This time’s on actually technical topic, so we’ll see how that goes. So just real quick about me, I’m a business intelligence consultant, so that’s what I do every day, and I’ve worked at All Lines Technology for six years now, so I’ve been working with SQL Server in some form or another for the past six years. I’m a Pluralsight author, so I have a course out now on Power BI, I’ve got another one coming out early next year, and I regularly speak at SQL Saturdays, I got to speak at SQL Summit, which was a huge deal, or PASS Summit, and GroupBy this year as well.
The most important line about me is that I went from SQL Newb to SQL Pro, and every time I give a presentation I give this story because when I started this job, I was not qualified for it. So it said .NET/SQL Developer. Like, great, I always wanted to do software engineering, this is going to be like, 80% .NET, 20% SQL. Well, it was the other way around, and I really – I had taken a course on databases in college, but that was about it. So I didn’t know the difference between a stored procedure, a view, or a function in SQL, and I remember Googling the first week, like what is the difference between all these things? But I learned over time and I also started to learn how to do database administration.
So we’re going to talk about SQL Server health checks today. We’re going to talk a little bit about why would you do it, why does it make sense to do it, we’re going to talk about kind of these different chunks of doing a health check. And in my mind, it’s kind of hard to break it out into these logical segments, because at least for me, whenever I’m doing a health check, there’s 20, 30 different things that I’m checking. And so it’s kind of all over the place, it’s very scattershot, and it depends a lot on the customer, how I approach it.
Prioritizing Your Health Check Results
But I’ve gotten kind of three buckets that I think is useful for prioritizing things.
The first one is what I call resume generating events. What are the things that might force you to update your resume? If you lose all the data in the database, that’s a resume generating event. If you have corruption, if the database – the server’s just down for an entire day, that could be a resume generating event, depending on the environment. So those are the most important things. If you get anything done in a health check, you want to take care of those potential resume generating events.
The next thing is best practices. There’s lots and lots of best practices, and so once you’ve taken care of the potential emergencies, there’s a lot of configuration changes or things to set up that are just good general best practices. They may not ever bite you in the butt if you don’t set them up, they may not ever cause performance or stability problems if you don’t take care of them, but as the name implies, it’s just best to do them. And one of the things that’s a challenge with this is that the list can get really, really long, but we’re going to cover some of the like, the top-hitters. But that’s one of the challenges with health checks is there’s always something new to learn, and it can start to blend into just database administration in general, which you know, can take years and years and years to become a senior DBA and really learn all these nooks and crannies. So it can be hard to know where to kind of cut things off.
The third thing, this comes up a lot whenever I do SQL Server health checks for customers, and it’s very hard to define, but we’re going to try anyway, and I call them server smells, based off the idea of code smells from programming, where it’s not necessarily there’s an issue with the best practices, there’s not necessarily something that’s broken per say, but something seems off, something smells funny. And there’s been a number of times where I’m going through event logs or just looking at how things are set up and something seems – like, something smells a bit odd. And it doesn’t necessarily tell me immediately that something’s wrong, but it starts down a path that eventually does turn to something that is worth addressing. And so there’s a lot, in my experience, when you’re doing a health check, of just digging, going down these rabbit trails and saying, “Okay, is this a problem?” And Googling something, and then a lot of times maybe it isn’t, and sometimes it is, and that can be useful to get a good horse sense of okay, is this – could this be a problem?
And then the last thing is okay, you’ve done the health check, well, now what? Every time I do a health check for a customer, they’re generally day long, an extra four hours for each extra server, and the last two hours are usually the write-up, and that’s me just writing up everything I found. Here’s the problems, here’s what it’s going to take to fix them, and here’s the urgency with them. And so the write up is an important part that comes at the end, especially because you’re going to have stakeholders you’re going to want to communicate to the business, here’s what we need to do, because often times, you don’t have just carte blanche permission to do whatever you want. Ideally, there is a change control process, or in my case as a consultant, what I like to do is basically say, “Here’s the stuff that you should fix, do you want my help?” And some customers say, “Yes, please just do everything on this list”, other customers say, “You know what, for this service pack installation, we’d like to have you around just in case something happens, but everything else I think we can handle it.” And so it just varies a lot by customer.
And then finally, I’ve got a demo. So I’ve set up a SQL Server box in Azure and I’ve intentionally messed some things up, and what I’m going to do is I’m going to go in and I’m going to do my best to sort of stream of consciousness explain what’s going through my mind as I’m doing it. Now, there’s going to be some challenges here for a couple of reasons. One of them like I said, is that when I do a health check, it’s really scattershot because what I’m doing a lot of times is just getting a lay of the land, and just looking at a bunch of little stuff all over the place so I can get an idea of what the biggest issue is, and then I kind of circle back. And so I get in these cycles where I check the wait stats to see where the pain points are, I make sure backups are working, I check the error log a little bit, I come back, and so I usually have a checklist but it’s not very linear for me. But I’m going to try to be as linear as possible, and I’m going to try to explain while I’m doing it what I would normally be thinking, and it’s hard too because I set it up so I know that a lot of stuff is broken. But we’ll see if there’s anything there I forgot that comes up during it. So that’s kind of the lay of the land, that’s what we’re going to be going through today. And feel free to ask questions at any point and Tracy will bowl them up to me.
Why You Should Do Health Checks
So why do we want to do SQL Server health checks? So SQL Server is not great out of the box. It’s a wonderful piece of software, it’s very powerful, but it does require a certain amount of maintenance. Even just to get it set up, there’s a number of times when I do a health check that in that eight hours, six of finding issues and two of kind of writing them up, there’ll be a lot of things that are one-offs, that once they’re addressed are good, and a lot of that is configuration, because many of the defaults for SQL Server aren’t the best, and they can cause issues with space and performance and stability and all of these things. And some misconfigurations can mean that you lose all your data. Specifically, I’ve seen customers where they think they have everything set up for their backups, and they don’t. Or they think that they have point in time recovery capability with their backups, and because of a third party tool, they don’t. Or you know, almost never do I see anyone with proper alerts set up, and so you could run into a database corruption issue and be unaware of it.
So these are all things that are really important, at least get SQL Server to a good, solid state. If you’re a small shop or something like that, then usually after that first visit, that takes care of a lot of things, but you still want to make sure that all that initial stuff is taken care of. And to be clear, the audience here applies for anyone, but I’m coming in from the perspective of a consultant, and also as an accidental DBA, so that’s kind of my target audience in the sense that what I usually deal with are shops where they don’t have a DBA. They have Frank in IT, and he is a sys admin, and so he got SQL Server installed and he set up backups and that’s kind of it, and he sort of knows what he’s doing, but it’s not really his bread and butter. So they want someone to come in and say, “Yes, it’s healthy” and what am I missing. But also as an accidental DBA because again, my primary role is business intelligence developer. I started out as a report monkey, and they went, “Well, you can spell SQL, so you’re in charge of our databases too.” And so I had to learn a lot of stuff and trying to maintain our servers starting out was very scary. So what I want to cover is for ideally small shop or people who maybe don’t have a lot of expertise in database administration.
So that’s why you should do a SQL Server health check.
Why *I* Do Health Checks
Now, why do I do health checks? So I’m – like I said, I’m a business intelligence developer. This is not my daily bread and butter. This is not what I do everyday. What I work with ideally is T-SQL, SSRS, Power BI, a little bit of SSIS, that sort of thing. I’m much more of a database developer, but I – you know, I have had to learn the administrative side as well. I am certified in SQL Server 2012, when it comes to administration. And so the reason I do them, the biggest thing is the first consulting that I ever did in my company was a SQL Server health check. In that time, I was not prepared.
What had happened was the company structure for where I work is bizarre, because it’s a fire protection company, so they sell fire extinguishers, sprinklers, you know, fire alarms, that sort of thing, and like, ten years ago they bought an IT company, All Lines. And then about three years ago, we bought another IT company called Lantech, and so when we bought Lantech, the sales people at Lantech went, “Oh, we’ve got a SQL person now. Hey, can you do this work for us?” Right? And so the first thing I ever did was a health check, and at the time, I was super nervous, I didn’t know a lot of this stuff, and it went well. But even today, you know, I’m nervous about giving this presentation because I know there’s a lot of professional database administrators, senior database administrators watching this presentation who know SQL Server like the back of their hand. And for me, I’ve learned it piece meal. I like to think it’s like those crabs that just put pieces of detritus on their shell and just build things out. I feel like that crab, where I’ve just build out this knowledge of here’s some things to check, here’s some things to look for, but it’s not what I specialize in. And hopefully, in some ways, that perspective is helpful because I know a lot of you might be accidental DBAs where you’re in charge of SQL Server and you don’t quite know what you’re doing and you need to start somewhere.
As to also why I do them, it’s an easy sell for customers. You take your car ideally to the mechanic every year to get it inspected, to get it checked out, right? Same concept applies to SQL Server because stuff does break down, but also it’s very clearly defined what the scope is. I can tell customers at this point, pretty confidently, “It’s going to take a day and we’re going to be able to tell you whether your SQL Server is healthy, and we’re going to give you a list of actionables at the end of the day that you can take action or not, but you’re going to have a clear-cut deliverable. We’re going to give you a piece of paper that says, ‘Yes, you’re good’ or ‘No, you’re not'”, and a lot of our customers want that, because again, we service a lot of small medium businesses, and so there’s a number of them where they’ve got an IT guy, he knows hardware, he knows networking, he knows his pieces, he’s kind of okay on SQL and really wants someone to come in, just give it the thumbs up, thumbs down.
Parts of a Health Check
So let’s kind of talk about different pieces of doing that health check. So the first thing that we want to take care of no matter what are these resume generating events. What could basically get you fired very quickly? That’s your biggest priority. If you get nothing else done, you want to look at these things. And so the first thing, the most important thing is backups. SQL Server without a proper backup and restore procedure is worthless, because the moment something goes wrong, the moment Susan in accounting deletes some sort of entry in your ERP system, the moment that there’s some sort of just, failure with the SQL Server and you need to migrate to a different one, you’re going to be in a lot of trouble if your backups don’t work. And I’ve seen multiple occasions with customers where they thought they had backups set up and they were misconfigured. Either they didn’t quite understand the distinction between transaction log backups and full backups, which we’ll talk about a little bit, or they were backing up certain databases and had added new ones but they weren’t automatically added to the maintenance plan.
The next thing related to the backups is corruption. You have to be checking for corruption because disks do fail, and sometimes they can fail just you know, a page here, a page there, something like that, and you need to be checking to make sure your data’s good because okay, you’ve set up backups and you can do restores, but if you’ve had corrupt data for a month, your backups probably don’t go back that far, and so okay, you can restore to bad data.
The other big thing, this is the top three, is you want to have a handle on security. And we’ll talk a bit more about that, but a lot of times in small shops, everyone has the SA password, everyone has an account that has system privileges, and so if someone breaks something, you may not know who did it, but also you just don’t have a lot of control, and you’re the one responsible for the database, for the database server, and so you need to understand who has what permissions.
And then the last thing here, and I’m going to talk about how whenever I do a health check, I don’t necessarily look a lot into high availability and disaster recovery but I always want to keep the top in mind because the business – no matter what, it’s going to have an implicit expectation of certain SLAs. They’re going to have an expectation of, “Okay, how available are we? What happens if there’s an emergency? What happens if somebody deletes something?” All these sorts of things, and so they may assume that you’re highly available when you’re not. I mean, it takes work to set up high availability, it takes work to set up disaster recovery. So again, the perspective I’m coming from is if you’re an accidental DBA or I’m consulting for a small business or something like that, there’s a good chance that none of this is set up, but the business may have certain expectations of how quickly can they recover from sort of disaster, emergency, or something like that.
Checking Your Backups
So the first one with backups; the big thing is make sure you’re taking backups. Are you positive you’re backing up everything? If you set up a maintenance plan with – and you pick some specific databases, if you add a database later, that won’t be added to the maintenance plan, and I had a customer where this happened, where I checked to see everything that was being backed up, and there were three or four databases that were not being backed up at all. And so if something had happened to one of those databases, they would have been completely out of luck. And really, you don’t know that your backups are working until you’ve tested a restore, and something that I see commonly is third party tools breaking the backup chain.
So the next slide we’re going to talk a little bit about the difference between full and simple recovery models, a little bit about the backup chain, but the big thing to know is that if you’ve got a full recovery model, then those backups are kind of like this conga line. And so everyone has their hands on the shoulder of the person in front of them, and if you want to backup to 8 am this morning, you need the full backup that happened before that, and then any transaction log backups since then. And if somebody sneaks in and takes a full backup in between, that can break that chain, and if you don’t have access to that full backup, then you can’t do that point in time recovery anymore.
Well, a number of third-party backup tools will backup the database to a virtual device. So if you’re looking into log, you’ll see this weird GUID, all these letters and numbers and dashes for the backup location. It won’t be the C drive, or the D drive or whatever, it will be you know, left squiggly brace, four, two eight, seven, dash, so on and so forth. And so I’ve seen that multiple times where they thought that they had point in time recovery capability, but because of the way someone had set up the recovery – the backup tool, third party tool, that had broken the chain. So again, next slide we’ll talk a little bit about that, we’re not going to go too heavily in detail, but the biggest thing is test your restore. Take a database, say, “Alright, we’re going to restore to yesterday”, test it, and make sure it works without errors.
So do you understand the difference between full and simple recovery model? For a number of years I didn’t, or at least I didn’t understand it fully. The short version that I want to give is that full is the default. So any time by default you make a database on SQL Server, it’s going to be in the full recovery model, and this is based off of the model database that determines the default settings for everything in – all the new databases in SQL Server. And so when you have full recovery, it allows for point in time recovery.
So again, Susan, or Mike in accounting, they delete an entry, and you want to go back to 3pm today. Full recovery model allows you to do that because it keeps track of, okay, we backed up everything for this point in time, and then we backed up all the changes since then. But for that to work, you have to backup the transaction log, and so you may not think to do that, and if you’re not backing up the transaction log on a regular basis, then that transaction log will grow and grow and grow and grow because it won’t be shrunk or reused unless you’ve backed it up so that SQL Server can guarantee that you have all the data that you need to be able to go back to any point in time. And so this can be a problem, this happens to people where they don’t have backups set up the way that they should, and they have an outage because the transaction log just grew out of control.
The other recovery model that you’re going to care about is simple. And so simple is good for low priority databases. We had a situation with one customer where their SLA, their service level agreement was basically 24 hours. They could lose a day’s worth of data for their databases and they’d be fine. Right? No one would be happy about it, but it wouldn’t be the end of the world either. And so simple recovery model is great for that because you can recover to wherever you have a full backup in place, and so if you don’t care about doing point in time recoveries, it’s a lot less maintenance burden and your transaction log almost never is going to grow out of control because as soon as it can write the data back to the data file, it’s going to do so and it’s going to stop growing.
I want to make a quick book recommendation, so SQL Server 2014 Backup and Recovery, it’s probably a little bit out of date at this point, but it’s by Tim Radney and John Sterrett, and this was the book that taught me how to do proper SQL Server backups, and actually, I have a copy here, signed, but this book was really helpful to me. You can get it for eight bucks on eBay, and once I had actually read a book about it, I understood the difference between full backups, transaction log backups, differential backups, but it honestly took me a little while to properly understand it, and this is like, the most important thing. If you do not understand backups and restores, that’s going to be the most likely resume generating event as a database administrator.
So next thing is corruption. You want to be checking for corruption. You want to be running DBCC CHECKDB on a regular basis and you also want to check for – you want to have alerts set up for certain – I don’t think they’re necessarily – I forget if they’re errors per say, but hardware alerts, the numbers 823, 824, and 825, and they will tell you if there was a hardware failure. And sometimes there can be intermittent disk failures, so SQL Server might try and write to disk, fail, try again, fail, try again, and then succeed. Well, I want to know if – even though it was successful, did it fail intermittently? I want to know about that.
So in addition to checking for just corruption using DBCC CHECKDB, which checks the database for corrupt data, you also want to have alerts set up so that you know, okay, our disks are getting a little wonky, maybe we need to replace them. So this is one of the other big things that you really want to be checking for before anything else. This is just making sure that there’s no corruption.
The third thing is security. Again, coming from the perspective of dealing with small shops, everyone has SA, right? And that’s not great. Or you might have accounts that have sys admin privilege, and maybe they don’t even work there anymore, right? I had an experience with a co-worker at one point in my life where he was joining our team, he was moving from one team to another, and literally the first day that he was on our team, like he had done some stuff for us before, but the first day he officially joined the team, by the end of the day he was fired. And that was because he had written a PowerShell script that would basically destroy the user profile of anyone else on the shared machine that they used, and I don’t know if he was just mad at people or trying to get revenge, but basically they would lose two or three hours of work because they had to recover their user profile, and customers were impacted by this.
So you’d like to think that everyone’s honest and ethical and that sort of thing, but I’ve had a personal experience with someone who is going to be on my team, working with me, who was a malicious employee. So you really want to understand who exactly has access, and do they all still work here.
Another thing I’d say is every SQL Server developer at some point in his or her career is going to accidentally delete a table. I’ve done it, my co-workers have done it, everyone’s going to accidentally run truncate because they think it means something else, and it’s a great learning experience, but if you got junior developers, maybe it makes sense to make sure that they have limited access to just the development environment, or just whatever that they need. So you want to have an understanding of, “Okay, who has security?” because someone else could easily mess everything up, and again, if you’re the database administrator, your responsible for that.
The last thing of these resume generating events. So high availability/disaster recovery, these are never a big part of a health check when I do them, but they’re always something that I bring up at the end of the health check. So I always bring it up as a discussion point along with SQL Server version, because these are inevitably things that haven’t been considered, or considered in a while, but businesses usually have a lot of interest in because if you can tell them, “Hey, we can set this up so that if something critical happens to this database you can backed up in an hour.” That’s a big deal. If you can set up availability groups or failover cluster instances, especially if you can do it cheaply, that can be a big win for the business, especially if they’re in manufacturing, or a hospital, or something like that, where it’s just critical that they’re up as much as possible.
So you really have to understand what the business needs are. Again, there was one situation where the customer, they could handle a day without too much issue, so we put their stuff in simple availability mode. They didn’t really probably need availability groups or anything like that, it just wouldn’t make sense, but then other businesses have these implicit expectations. Again, rarely do the business users think about, “Well, what if it goes down?” They just kind of expect it to work. And so it’s very important to communicate, well, the way you have things set up, if you’re down, it could easily take you another day or two to recover from that. Or if all their backups are on the same storage material as the rest of the server, then if the server goes down, they may not have access to the backups, and so they may not be able to restore at all. Right? So that’s more like the disaster recovery kind of piece.
So these are the things where it can be a resume generating event, because the business expects certain SLAs and because you maybe never level set with them what those are supposed to be, then you could get in a lot of trouble if there’s a big outage and you never took the time to say, “Here’s what’s going to happen if you don’t fork up the time and the money or whatever”. So whenever I’m doing a health check, you know, I’m just checking to see if they have anything set up at all, but usually, it’s more just the item at the end. So those are kind of resume generating events, and again, if you have any questions, feel free to post them in the Slack or in the chat window for GoToMeeting.
So once we take care of the big stuff, the stuff that’s going to lose you your job if it’s not properly maintained, the next thing is best practices. And we’re going to talk about a number of best practices, but the challenge is there are so many best practices.
An example of something – I’m not even going to touch in this presentation is SQL Server instant file initialization. So whenever you’re restoring a database, normally, it has to write out – it has to zero out all of the data for the data file so that it can actually use it. Now, the ideal would be that Windows just says, “Well, we know there was stuff written there before, but we’re not going to worry about it, here’s your file”, and you can set it up so SQL Server can do that, but you need to give it certain Windows permissions in order to do that. So in a recovery situation, you could greatly improve the time to recovery. That’s one of those niche, edge case kind of thing that’s good to get to, but there’s 20 or 30 or 40 different things like that, and it’s hard to know about all those things, it hard to learn about all those things. So best practices are hard because there’s so many of them, and again, if you’re not like a senior database administrator, you might not have a good idea of what is out there.
So we’re going to talk about some of the ones that I always try to get to no matter what, whenever I’m doing a health check. So that’s managing RAM, that’s managing parallelism, that’s configuring tempdb, and that’s setting up email alerts. But there’s a bunch of other things, there’s some things that might come up in the demo as well, but some of the ones I just want to talk about, max RAM, this is an example of where the defaults for SQL Server are just garbage. Garbage, garbage, garbage. The default max RAM for SQL Server is – if I did the math right with my calculator, two petabytes, which is a lot. I don’t know if they make machines with that much RAM, but I’m sure they’re getting close at this point.
SQL Server will use as much RAM as it can, and so you may have a situation where you’ve got 64GB Ram and SQL Server’s taking 60 of them, and it’s potentially choking out the operating system, or other procedures, or whatever, or it’s just really over-utilized. I mean, more RAM is always nice, it does cover a multitude of sins when it comes to poor database design and that sort of thing, but it is possible to overprovision, and so if it’s a physical box, okay, maybe there’s not much you’re going to do, I doubt you’re going to just pull out some RAM, stick it and say, “Alright, now you have 32GB.” Maybe. But if you have a virtual machine, that’s easy to do. You may say, “Hey, we don’t need all this RAM when we’re only really – I mean, our databases only go up to 20GB, we don’t need that much RAM.” Right? So you may decide to re-provision, but if you just look at utilization, it would look like SQL Server is using all that space and it needs it, when really it may not. But you really want to limit this, especially if you’re the pool soul that has stacked sequences.
So you have a regular instance and a named instance, well, I’ve seen situations where those two are just beating each other up, fighting for RAM, because no one ever set max RAM on either of those instances. So you want to limit that. Page life expectancy is a performance monitor counter that we’re going to do in the demo, they can help you understand, okay, do I have enough RAM in general? How do things look, and give you a better idea of some of the provisioning. But if you Google it, there’s formulas out there for what to set this to, and generally it’s something like – I forget exactly, it depends on how much RAM you have for your SQL Server but it’s like, you want to leave 4GB for the operating system plus another four for every 16, or something like that. I don’t remember off the top of my head and I don’t need to, I always just Google it whenever I have to change it. But there’s some great blog posts out there about it. So you want to limit the amount of RAM it’s using.
Parallelism: MAXDOP and Cost Threshold
You want to look at parallelism. So while this may not always be a problem per say, and when we start to talk about one of the wait stats called CXPACKET, people can make knee-jerk reactions to that, people can make knee-jerk reactions to parallelism. At the same time, the defaults are, by default, terrible.
So there’s two measures here that can be configured, and by default they’re terrible. So the first one is MAXDOP, which is max degree of parallelism. And that’s basically when you run a query, how many cores can it use? And the default is zero, which means infinite. If you’ve got you know, some HP Thunder Dome super mega X machine with 128 cores and you run a big enough query, it will potentially split it out amongst the 128 cores, or it can anyway with that default setting.
Microsoft’s recommendation is to generally limit it to eight, and then – eight, or I guess the number of cores you have is less than that, but limit it to eight and then kind of tune from there. You’re just not going to get a lot of benefit most times from splitting it out to 32 different cores. So MAXDOP a lot of times makes sense to change because the default is to use everything.
And then cost threshold is set to five, so that’s a measure of how expensive the query is to run, and it’s measured in what Brent likes to call query bucks, because it’s this arbitrary measure of how many seconds did it take to run this query in 1997 on this machine in a Microsoft employee’s desktop. So it’s not a very meaningful number in itself, like the units of measure are seconds in 1997. But the default of five for modern computers is very, very low. The recommendation that I see over and over is to set it to 50. So that way, if you’ve got a really small query, it’s not going to split it up over those eight cores or 16 cores. It’s the bigger queries, the more expensive queries that it’s going to split up. So the defaults can lead to heavy parallelism, which can lead to poor performance, because if you have an uneven workload or something like that, you may have it so that some of the work’s being done on some of these cores, and the other ones are waiting for it to finish. So parallelism is something that I almost always end up changing whenever I’m looking at a SQL Server.
Tracy Boggiano: Question. What if the VM only has four CPUs, what should your MAXDOP be?
Eugene Meidinger: So I don’t know 100%, but I would imagine that you would just set it to four. There’s not really much harm in setting it to eight because it can’t use more than it has, right? So – but I would say you want to set it to eight or whatever your maximum number of CPUs is. A related question is how likely do you think that it’s going to change, right? Maybe your VM team, after you complain about the performance with only four cores, they decide to re-provision it, or something like that. But you’re probably just best off just setting it to eight if you think they’re going to change it. That’s the Microsoft recommendation, is eight. You don’t really want more than that, and that’s just a good starting point. I don’t see any way that you’re going to have a harm in having more of a MAXDOP than the max number of cores you have. But good question.
Another thing is tempdb. So this is one of those things where it’s a best practice, but it’s not always an issue. It just depends on the workload on your SQL Server, but as a best practice, I like to split it up, tempdb in multiple files because you can run into contention on certain pieces of tempdb.
You generally want to split this out because I have seen at least one customer instance where there was tempdb contention because there’s certain pages in tempdb that specify allocations for the rest of it, and so those can become a point where there’s locking going on, and contention. But if you split it up into multiple files, if you split it up into eight different files, then there’s less contention because now you have eight tempdb files and so those pages, there’s going to be eight times as many, so to speak. So you want to split it up, and you want to split that up based on the number of cores that you have. Again, I would only go up to eight, and then see if you need to make changes from there. If you split it up into eight files, or how many cores that you have, that’s generally going to be a good starting point, and you really shouldn’t need to do it more than that.
Now, something really important is you want to make sure that if you’re going to split it up, you set a max size to each file. You just pre-provision them to a certain size and you limit growth, you stop growth because if one of those tempdb files starts growing, then it’s going to get a disproportionate amount of traffic. It’s going to get more traffic than the other files because of the size, and it’ll potentially keep growing. And so you can get into a situation where you had eight small files, one of them grew, and now there’s a bunch of contention on that one file. So if you are going to split it up, make sure that you just say, “Okay, each of these are going to be 10GB”, or something like that.
Something to be careful about is where you place tempdb. So tempdb, as Brent likes to call it, is the public toilet of SQL Server, which means all of these temporary objects, all of this temporary information gets dumped there, and I’ve seen queries run on a server that are very poorly defined, and just blow it out of control and take up gigs and gigs and gigs of space. And so you need to make sure that if that happens, it’s not going to interfere with anything else. Tempdb, you know, you can screw it up and generally, you’re going to be fine because any time you restart SQL Server, you get a new instance of tempdb. But if you take up all the drive space on C, maybe you’re not able to log back into the server, or if you take up all the drive space on your data drive, if you have one, hopefully, then maybe SQL Server is not able to write to the log, or something like that. So you either want to set a max size, or make sure that tempdb is isolated in some way so that it’s not going to just run rampant.
Tracy Boggiano: Tempdb question. Do you suggest having them on different disk drives or on a single drive?
Eugene Meidinger: Right. Well, actually, some clarification. Do you know if they mean like the multiple files, or tempdb as a whole?
Tracy Boggiano: The multiple files, if you put them on different drives or on one drive.
Eugene Meidinger: So this is where we start to get outside of my experience, because I work primarily as a database developer, but I personally wouldn’t see a ton of benefit in putting them each on their own individual drives. Unless you have a lot of money to throw at it and they’re literally mapped to individual disks, I don’t see a lot of benefit, and even then, what I would consider doing is going with SSD instead. The reason that you split up the multiple files is not necessarily to get better read writes, because tempdb is going to be kind of random access anyway. The reason you do it is to limit contention, because there’s certain portions of tempdb that basically say what you can do with the rest of that database, and so there’s contention in those specific masterpieces, master spots, however, you want to describe it.
What you would get more benefit from is if you can put it on some kind of SSD, because tempdb has a lot of reads and writes, and because it’s temporary, you don’t have to worry so much about – I’m trying to think how to put this, but like, as much with your RAID configuration or some of those other things, because if you lose tempdb, it’s an outage, it really sucks, but you haven’t lost any critical data most likely. So I wouldn’t see a lot of benefit from putting tempdb on individual disks, but I would see benefit from separating tempdb from the rest of your drives, because it’s going to have a specific read-write pattern, especially that’s different from your log files. So your log files are going to be sequential writes, generally speaking, and so if you can kind of partition out, okay, here’s the logic for our log files, or here’s the data disk for our log files, here’s the data disk for tempdb, awesome. But the best thing you can probably do is just get a cheap SSD and put tempdb on there, and see some better performance. Were there any other tempdb questions?
Tracy Boggiano: Yes, how would you go about sizing the files? That’s a tricky one.
Eugene Meidinger: I know. So what I’ve heard is – I’m trying to remember the numbers off the top of my head, but the two things that I’ve heard is twice – I think like twice the size of all your data, or like, four times the size, or somewhere around there, of the largest object. For me personally, a lot of that is just some experience with seeing how big tempdb tends to get, and looking in comparison to the database. But I would say, off the top of my head, I would start looking at anywhere from a quarter to twice as large as your data as a whole, and somewhere around there. And the ideal would just be to have a situation where you can grow it out if you need to, but it can vary a lot based on usage. So that’s kind of a ballpark estimate, I know there’s some articles out there.
Usually in my case, I can get a pretty good feel from looking at it because whenever I’m dealing with customers, when they’re small shops, you know, a lot of times we’re talking 10GB to 100GB. Right? And so for 10GB, if you have 10GB of data, I can just make tempdb 20GB, 30GB and call it a day, generally speaking. If it’s 100GB, then you know, I’m looking somewhere around – anywhere from like 30 to 70 as a starting point, and from just my experience of working with customers, that’s worked. So there’s always a chance that you might have to make some resizing, but that’s kind of the ballpark. I wish I remembered exactly off the top of my head the rule of thumb, but I remember something about a certain factor of the database size and a certain factor of the largest database object that could possibly need to be put into tempdb or something like that.
Eugene Meidinger: Email alerts. So this – do this. Just set this up because it’s never – it’s not configured by default, and so I almost never see this set up when I’m dealing with small customers. And you want to have two types of alerts set up, but you can do more, and I’ll explain a situation I did for a customer. So the first one is errors severity 16 and higher. So generally speaking, SQL Server has severity levels, I want to say all the way up to 25, or somewhere around there, and for the most part, anything under 16 usually you don’t have to worry about too much. But 16 and higher is where things are in actual issue. If I remember correctly, if you were to raise an error by hand, or a lot of the user based errors are level 16; once you start getting into like 20 through 25, that’s really bad, but you want to set up email alerts for all of those, from 16 and higher, so that if something is failing, you know about it. There’s been so many times that I’ve caught some sort of error with logic in a job, or some sort of SQL query, or something like that because of these alerts.
Now, you will sometimes get false alarms, and you just have to understand that and redirect the email to some folder that maybe isn’t clogging everything up. An example that I’ve seen is a severity 20 error about – I forget if it was like – it was something like a network handshake or something like that, or pre-login authorization. Well, it turned out it was our network guy running basically the equivalent of Nmap and just port scanning everything on the network. And so SQL Server went, well that didn’t login correctly, and I got an alert about it, and it turned out to be nothing. So there are some times where something comes up and it’s not a huge deal. But you really want to set up these alerts.
The other type you want to set up are these hardware alerts. So they’re not necessarily errors, they’re just alerts, and they’re numbered 823 to 825, and these tell you about potential corruption events. And that was what I was talking about before, where one example is okay, try to write to disk, failed, try to write to disk, failed, it’s exceeded. Well, as far as SQL Server is concerned, hey, we wrote it. Good. But you still want to know that your spinning rust is getting close to failure.
Now, you can set up other alerts, so we had one customer where they had what I like to call five car pileup, where they have an ERP system with 100GB, 200GB of data, somewhere around there, and so you’ve got tables with literally millions of entries in it. Well sometimes, these queries wouldn’t be as performant as they should, and so they would lock the entire table. And so I’ve seen situations where the entire table would get locked and then these other smaller queries would be blocked and be waiting, and you would get five, or even 10 waiting task kind of piling up. Well, you can set alerts for stuff like that. You can set up alerts for low disk space in SQL Server. There’s a number of these things you could set up, but if you don’t set up anything else, make sure that you set up for errors and for these hardware alerts 823, 824, and 825.
So again, this is just a best practice, and you really want to know what’s going on in your SQL Server. So those were the best practices. Like I said, there’s a bunch more, there’s probably some that are going to come up in the demo, but that’s kind of some of the heavy hitters that any time that I work on a SQL Server, I always make sure that I take a look at those.
Tracy Boggiano: Would you recommend an error 833, which is the one where the I/O is taking 15 seconds longer to complete on a file?
Eugene Meidinger: I have not heard of that. So I was expecting I would learn at least one thing during this presentation, so no I have not heard of error 833, but that makes a lot of sense. We could probably set it up on the server that I have because it’s running on Azure disk and you get about 500 IOPS from Azure, so it’s very, very slow. But no, that is a good recommendation.
Tracy Boggiano: I do have that set up on my environment, in my previous environment where I worked before, and I actually wrote a query that actually pulled it out of the error log so I could show my storage admin that the storage was slow. They wouldn’t believe me.
Eugene Meidinger: Yes, that’s always a fun fight with the SAN admin, where like, each is trying to prove that the other person’s having the problem. Okay, so those are best practices, and like I said, one of the challenges, like you’re always learning more stuff, but that’s one of the reasons I do enjoy things like this and SQL Saturday because there is a lot you can learn by just attending and listening.
SQL Server Smells
The next kind of portion are what I call server smells, and these are really hard to define because it’s just a lot of gut instinct, it’s a lot of poking around, and so there’s going to be a lot of noise, but you start digging around, you go, “Something seems off”, right? And so part of doing health check, part of being a good database administrator or database consultant like me is learning to get that gut intuition to say, “This doesn’t seem right.” And so there’s a number of different things you can look at. The first one is activity monitor. I don’t like this a lot, I don’t use it a lot, but there’s some pieces that I like there.
The next is wait stats, which is kind of the best possible tool for telling what’s going on in SQL Server that you need to address, because wait stats tells you every time SQL Server was waiting on something, so it can very quickly give you kind of a lay of the land of okay, here’s what’s wrong. And so in our demo we’re going to look at wait stats and see, “Okay, what’s wrong with my server? Where’s the pain points going on?” So wait stats are invaluable, but as we’ll talk about, they’re kind of a pain because there’s so many of them, and they have these names and it feels like you’re catching Pokemon because there’s just so many wait stats and who the heck knows what a CXPACKET is, right? Like, how would you even know that from reading what it means? Or SOS scheduler yield. What’s SOS?
The next thing is the logs. You want to just dig through the logs and there’s a lot of noise here but it’s very useful. I’ve found serious potential issues by just trolling through the logs and seeing what I can find. There’s the SQL error log, there’s Window event logs, and there’s – I like to look at the job history. You’d be surprised how many times someone set up a job and the employee’s not even here anymore, and either the job doesn’t run anymore, or maybe it was set to only run on demand, and that sort of thing. We had a customer where they had a job for index rebuilds, but it didn’t have a schedule associated with it, it was on demand. So they thought they were rebuilding their indexes, but they just weren’t.
And then the last thing that’s always a good smell is SQL version. So I mean a couple different things by that. So there’s whether they’re on Standard or Enterprise. If they’re on developer edition, you may want to let them know they’re in violation of licensing. If they’re on SQL Express, you may want to point out this only supports 10GB. I believe there was a customer where we had to upgrade them because they were like running up against that. I forget. But in addition to that, there’s what year. SQL Server 2005 isn’t supported anymore. SQL Server 2008 R2 is kind of on its last legs, right? But even beyond that is service packs and cumulative update. So I know for 2017, and I don’t know if they’ve made the change for 2016, but they’re moving to cumulative updates only. Service packs are going away entirely. But regardless, for older versions, if you’re not on the latest service pack, you’re not on a supported version of SQL Server.
So we talked about corruption. Well, let’s say we run into corruption. Well, we’re going to do that in the demo. We run into corruption. Well, I’d want to open up a ticket with Microsoft. Well, Microsoft comes in and says, “You’re not on service pack four, we’re not going to help you.” That’s really important to communicate to the business that, “Hey, we really should take an outage and take this 15-minute procedure to update to a new service pack so that if something happens we can call Microsoft and get help.” When you explain things in terms of business value and the implications, it can be a really easy sell.
How to Spot SQL Server Smells
So going through some of these things. Activity monitor, this gives you an idea of you know, the traffic, the I/O, the CPU. The thing that I honestly only care about for the most part on here is the number of waiting tasks, just because I like to know are things being blocked, because like I said, I’ve seen those ten car pileups where one query was causing eight or nine other queries to have to wait and be a problem. So you know, this can be useful starting out, but I just found whenever I was working with this as an accidental DBA that I could see some of this information but I didn’t know how to get more, I didn’t know how to act on it, so it was kind of a frustration point for me. But it still can be a useful starting point.
Next is logs. So when I’m looking at the logs I’m looking for errors, I’m looking for failed logins, I’m looking for anything that looks weird. And so there’s a good chance you’re going to have a lot of false positives, you’re going to do a lot of Googling, like, be careful with this rabbit hole because you can spend hours just digging through this stuff and Googling these things. The way that I structure a health check when I do it is we talked about critical stuff, we talked about best practices, and then – so usually the first four hours is me just going through the stuff that I know I need to get done today. And then the next you know, two hours is me just poking around and going, “Well, that’s weird” or “That doesn’t seem right”. And so you want to kind of time box this and limit it because it can be hit or miss, there can be a lot of noise in the logs. There can be a lot of false positives. I think I have an example in the logs for this one, this demo, as an example.
Interpreting Wait Stats
Next is wait stats. So wait stats tell you the pain points of your SQL Server. What is slowing down my SQL Server? This is the best tool for understanding what’s going on and it’s completely free. It’s just built in to SQL Server. It’s really impressive. It’s a problem learning it because you have to kind of learn it ad hoc, because again, there’s all these different types of wait stats, some that are very rare, some that are hard to look up, and so you just start getting better and better at it, but it can feel like you’re capturing Pokemon or something like that.
Three of the kind of wait stats that I look for, there’s some other ones, whenever I’m looking at things, the first is PAGEIOLATCH_XX, and there’s a couple different varieties. I’ve just xx’ed there. PAGEIOLATCH says we were waiting to read something from disk. We had a lock on something on disk and we wanted to read it in the memory. Well, it would seem like PAGEIOLATCH means that you have slow disk, which you could. But usually from my experience, from what I’ve read, it means that you have memory problems, and I’ve seen that, where there wasn’t enough memory provision and so we were getting PAGEIOLATCHes, and a lot of times, if you can just put everything in memory, if you can buy 30 extra gigabytes of RAM or what have you, suddenly this can go away.
The next one is CXPACKET. So CXPACKET is kind of a boogie man of wait stats because people see them and they go, “There’s a problem.” If you’ve got more than one core in your server, you’re going to see CXPACKET. That’s okay. That’s fine. Whenever it’s the biggest wait stat, I like to at least take a look, but usually what I do is I change those defaults we talked about, so I change max degree of parallelism to eight, I change cost threshold to 50, CXPACKET goes down to – maybe it’s only 20% of the waits, and I’m usually pretty happy. So if it’s just dominating your wait stats or you haven’t changed any of the defaults, definitely look into it.
Now, Tracy, I don’t know if you know anything about this. I saw a picture from summit of they’re adding a new wait stat called like CXCONSUMER, to indicate whenever it’s just the monitoring thread that’s waiting or if it’s actually one of the active threads, the worker threads that are just sitting around, so we’ll actually be able to tell if this is a problem or not. Do you know anything about that?
Tracy Boggiano: I’ve heard about it, yes.
Eugene Meidinger: Okay. So hopefully that should be coming out to give you a better idea about this sort of thing, but I like to see where it’s at and see if it’s a problem because again, the defaults for parallelism are just kind of crud.
The third one that I like to look for is SOS_SCHEDULER_YIELD. And so my understanding of this, and I don’t understand one of those internals well, is that SQL Server is kind of this mini operating system and it does its own scheduling internally. But occasionally, if it’s taking too long on something or there’s other stuff going on on the CPU, Windows has its own scheduling and it says, you know what, you’re going to have to wait. You took too long, we’re taking the CPU back, we’re going to give it to Fred, or whatever process you have on the Windows server. And so this can be an indicator that maybe you have something else on SQL Server that’s taking up a lot of CPU. And so SQL Server is losing control of the CPU frequently because it’s fighting for access.
Ideally, the only thing you have installed on SQL Server is SQL Server. Ideally, you’re not running SQL Server on you know, the act trajectory domain controller and your exchange server or whatever, right? But sometimes people stack a bunch of stuff on the same server. So these are three of the waits that I look for a lot of times, but again, it’s more of an art than a science, and this is why it’s under that server smells because you just look at it, you do some Googling, and you get a feel for what’s going on.
Wait Stats Questions
Tracy Boggiano: Couple questions for you. So would the PATCHIOLATCH_XX stat happen when you limit your SQL Server RAM?
Eugene Meidinger: Yes, and in fact, spoiler, we’ve done that in our demo, and we’ll see that as one of the big wait stats.
Tracy Boggiano: Man, spoil the video.
Eugene Meidinger: I know, it’s ruined. It’s ruined forever. One of the ways to tell too is page life expectancy. So we’re going to look at page life expectancy, which is a performance monitor counter, and it tells you how long when I read a page from disk into RAM, how long does it live before it gets pushed out of RAM and kicked off the island, right? Well, if you have limited your RAM to say, like, 4GB, and you’re just cycling through all the time, your page life expectancy is going to be very low. So yes, if you artificially limit the RAM, which I tell you to do, you could see these stats, and page life expectancy can also give you an idea of, okay, have I properly configured my RAM. So that was a good question. Any other questions?
Tracy Boggiano: Second question. How should we monitor the wait stats?
Eugene Meidinger: So the – it depends, right. That’s the common IT answer. The way that I like to do it is I’m coming in as a consultant, and it’s going to be different for you DBAs, but for me, I try to sell a health check as an annual thing. So what I like to do is we’re going to run a query by Paul Randall, he has a blog post called Tell Me Where It Hurts, that gives us some nice output of the wait stats that are the top 95%, and they’re organized by that. And so what I like to do personally is I copy all that and put it in Excel, as part of one of my deliverables. I give the customer this document saying here’s the stuff you should change. Your defaults are terrible, let’s change it. We make the changes, we give it a week to like kind of burn-in. We’ll probably reset the wait stats, and then after a week or so, we run it again and see, okay, how has it changed. And then moving forward, every time I do this annual health check, I take a look at it. So that’s how I do it, but I’m a consultant. I’m not managing stuff every day. You very easily could create a table and dump that information on a daily basis, weekly basis, that sort of thing. In my personal experience, there’s a certain amount of marginal utility. Ideally, your wait stats aren’t going to change that dramatically on a daily basis. And in addition, ideally, you’re going to get to a point where – what I like to see with wait stats for a system that we’ve tuned is a lot of the stats are stuff that’s kind of unavoidable, so I like to see backup start to show up, because I mean, it’s – you know, we have to do backups anyway. Backup could be a disk issue potentially, but generally, it’s just you have to do it. Same thing with a little bit of write log. We’re going to see a lot of write log on my machine, which is bad, but a little bit’s good, and I like to see that the wait stats are broken up into a bunch of them for the top 95%. I don’t like seeing one or two big ones that are taking up everything. So there’s a good chance that after you’ve started to make the changes and you performance tune it, that you only really need to check it every month or every year, or something like that. So that’s what I do is I just copy it into Excel, I do a comparison to see how it’s changed and I check it every year for our customers. You can usually dump it into an Excel – into a SQL table if you wanted to monitor it more closely. Anything else?
Tracy Boggiano: Nope, that was it.
Out of Date Versions
Eugene Meidinger: Sweet. Okay, I think this is the last – yes, this is the last server smell. Again, there’s going to be stuff in the demo that we’ll probably get into that we didn’t talk about here, but server version is always a smell for me because it’s almost always out of date, right? And I’m bad about this too at times because if you’re doing things right, any time you want to do an update, you’re planning for an outage, you’ve got a back-out plan, there’s change control, so it can be kind of burdensome, but it’s important to note, if you’re not on the latest service pack, Microsoft isn’t going to support your SQL Server if you’ve got a problem. If you’re not on a fairly recent version, Microsoft isn’t going to support your SQL Server. I have an in-house application that runs on SQL Server 2000, and I know if something happens, I’m SOL, right? We’ve tried a couple times to upgrade it and the project never kind of takes off the ground. Eventually, we’ll get round to it. But so if you’re on an old version, you may not be able to get support from Microsoft. But even worse, Microsoft aren’t omniscient and they screw up. SQL Server is made by humans just like you and me, and sometimes they ship with bugs that cause big issues. The one that’s on my mind is in SQL Server 2012 and 2014, they had a regression where there was a corruption bug, and it was in a very limited set of circumstances. You basically had to be on Enterprise edition, so that you could run online index rebuilds and you had to be doing it with parallelism, and in those circumstances, it could cause corruption, and this was actually even an issue in some of the early service packs. But if you update your SQL Server if you have 2012, and you update to the latest service pack, then you’re good, right? If you don’t, if you’re running like release from manufacture, or even service pack one, this could be an issue.
Now, again, this is kind of a niche problem, but this stuff happens, and Microsoft has changed how they treat updates moving forward, like I said. Before it used to be that they would say, “Alright, service packs, always apply service packs. Cumulative updates, we didn’t test it as much, so only apply it if it has a specific hotfix or a problem that you need fixed.” Well, that’s changed. Now they’re saying no more service packs, just cumulative updates every month for the first year and then on a broader schedule from that, and so you know, it gets a little bit murkier, but the most important thing is you want to be updated. And again, when I’m dealing with small shops, they’re generally not updated.
Writing Up the Health Check Results
A couple other things and then we’re going to jump into the demo. So you’ve done the health check, now what? First thing is you’re going to do a write-up. You’re going to prioritize based on the severity of the issue, you’re almost inevitably going to have to plan an outage, and then you’re going to fix stuff.
And again, as a consultant, usually what I do is I do the write-up and I prioritize it and then I let them figure out what they want to do. And again, some customers are like, “You seem like you know what you’re talking about, can you just fix everything?” Some customers are like, “Thank you for pointing this out, thank you for coming out here, we’ll take care of it.” I’ve seen both ends of the spectrum, it just depends.
But if you’re the DBA, then you should probably fix this stuff, or at the very least, you’re going to want to have a conversation and say, you know, “Hey, we should limit the max amount of RAM” and the business says, “Well, why?” and you want to communicate it in a way that says, “Here’s how the business is impacted if we don’t do this.” For max RAM, the business there’s a 1% chance maybe that SQL Server is going to take up too much RAM and cause a crash or something like that. You know, some of these – it’s some guesswork. But some of the other stuff, like if we don’t fix these backups, then we’re going to just completely lose our data if anything happens.
So tempdb, breaking that down into files, the business case is a little bit harder. It’s a best practice but unless you’re running into contention currently, then you know, the chances of that being an issue might be low. So you have to organize it. So the write up, what I like to do whenever I’m providing this deliverable to the customers, and remember, literally a fourth of the time that I spend is doing the write up. So think about that. It’s easy to think that it’s all about the technical piece, but I go out to a customer site, I’m out there for eight hours, generally speaking. Four of them are my punch list, here’s what I need to cover, two of them are just poking around and seeing if I can find something you know, crawling under a rock, and then two of those hours are writing up a deliverable. And I take screenshots and I put data into Excel from queries that I ran, I give them the queries that I ran. And so the way that I organize it is by severity and difficulty of mitigation.
So what I like to do is I have high, medium, and low. I say high, if you don’t fix this, you’re going to be in a world of hurt. Medium, you should probably fix this. Low, it would be good to do this. You know, something like backup compression. It’s a best practice in most circumstances, and for backup compression, you’re basically trading CPU for disk space, because what you’re doing is you’re compressing your backups before you write them to disk. 90% of the cases, unless you’re like very high utilization, you probably want to do it because a lot of times you have CPU to burn. This is going to be somewhere between medium and low. If they don’t have a ton of disk space to work with, it’s probably a medium issue. If they do, if they’ve got 100GB free space and their backups are a gig each, it’s a low issue. It’s not necessarily a big deal. But you want to organize that you can communicate here’s the impact to the business, here’s the value-add, or here’s the business risk if we don’t address this.
And then the other thing is the severity, but then there’s mitigation. So to give you an example, max RAM, we can change that in the demo. We probably will. And we don’t have to restart SQL Server. There’s certain settings that you don’t need an outage to implement, right? There’s certain things you do. If you were going to install a service pack, plan for an outage and plan for some kind of backup, right? Take backups before you do this service pack because it’s probably not going to be an issue, but you want to be prepared for a situation where you just hose the entire server. And if you’re virtualized, your backup plan is tell the VM admin to take a snapshot, still probably take backups, and then just be ready to roll that back. If you’ve got a physical box, you may want to be ready to rebuild it if necessary. So the risk of mitigation is important as well, because a lot of times, there’s quick fixes. If you’re not taking backups properly, that can be a change to a maintenance plan. That can be done very quickly, whereas again, other things require outages. And overall, the biggest thing is communicate, how does this benefit the business. No CEO is going to care that you didn’t install these error alerts because you know, Brent told you to, or something like that. What they can about is that you can say, “Well, if we have some sort of hardware failure happening, we’re going to know immediately” or, “If we’ve got some sort of malicious login attempt or whatever, we’re going to know immediately”, “If we’re starting to run out of disk space, we’re going to know”, and because of that, here’s how it impacts the business.
Planning the Outage and Emailing the Users
So you did the write up, they approved the changes, now you’re going to have an outage, most likely. A lot of these things require an outage, and you want to have a change control process. Write it out. Write out a plan. Here’s the steps I’m going to take. Ideally, most places have some sort of change control process, and if they don’t, maybe now’s a good time to implement it. There’s so many times where I’ve seen in my life, people just wing it for whatever. You know, not even just databases, but like SQL Server stuff, and they screw something up. And if they’d just written down the steps, someone else could have peer reviewed it, or they could have just like, thought it through, or they could have double checked things. So much better to have a change control process in place even if that’s just you writing down here’s the steps I’m going to take.
It’s also nice in terms of the kind of like, post-update email, where you can say to all the stakeholders, “Okay, you know, we had this outage, here’s the things we changed and here’s how it benefits the business.” People like that. They like to know what did you do. They don’t care about the nitty gritty, right? I remember when I started my job, I would fix someone’s PC or fix some report and I’d be like, “Do you want to hear how I MacGyvered it?” And they’re like, “Is it working?” I’m like, “Yes”, they’re like, “No, I’m good.” A lot of people view us as car mechanics. They’re happy we’re there, they’re happy we know our stuff, but they just want to know is it working, right? But it is nice to be able to say in an email, “Here’s what we did, here’s the value we added” very succinctly.
You always want to have a back-out plan. How can we undo this change if we screw something up? A lot of them are simple. Like I said, max RAM, you change a number, it takes effect immediately. You know, certain settings with the database take effect immediately. Service packs, it’s a little bit harder to back-out of that, right? So you want to understand before you make these changes what do I – if I screw up, how do I revert that?
So now we’re at the demos portion. Do we have any other questions that popped up before we just dive into this?
Tracy Boggiano: No.
Eugene Meidinger: Cool. Okay, so now we’ve got a demo, so I have – give me one second. I just want to get my personal checklist up now. I’m going to try not to rely on it too much. But I have a personal checklist of things I want to make sure I get to, but what I’m going to try to do, like I said, is kind of stream of consciousness, what’s going through my mind as if I have never seen this machine before, and I’m just kind of jumping into it.
Demoing a SQL Server Health Check
So we’re logged into this machine, and one of the first things that I see that’s concerning is we’re running Windows server 2008 R2. Is ther an R2? We’re running Windows server 2008 something, and that means that we’re probably on SQL Server 2008 R2. So automatically, we’re on a version that’s not going to be supported by Microsoft for much longer, right? So they’re not on the latest version, they’re probably not up to date, that sort of thing. I also see some icons on the desktop. In this case, it’s stuff I installed to make my life easier, but you know, hopefully, no one’s browsing the web from the SQL Server, or doing any kind of weird activity that they shouldn’t be. You also want to keep an eye out for antivirus, to make sure that files aren’t potentially getting locked by a web route or something like that.
So I’m on here, just to let you know, I’m running a tool called HammerDB, and what it’s doing is it’s simulating a load against the database, right? So I’m using this, so we actually have a load and we can have some more slightly more interesting wait stats, but it’s based on the TPC-C standard, although it’s not formally approved, and that’s kind of this neutral body for comparing databases based on kind of a generic OLTP kind of system. That sort of thing. So it looks like the matrix, you don’t need to worry too much about it, but we’re running that so that we can simulate a load.
Connect In and Poke Around
So what I’m going to do is I’m going to hop into SQL Server and I’m going to connect and I’m going to start poking around, right? So the first thing that I see is I’m looking at this SQL version here, and that looks like way too round of a number, and so I’m going to see, okay, what’s 10.50.1600, right? The 10 is for 2008, and the 50 is for R2, I know that. But that looks like a pretty even number, so I’m going to look and see, okay, what’s 10.50.1600. So this is released to manufacture, this means there’s been no service packs applied. So already I know that they’re probably not performing regular maintenance on this server, right? They had a full-time DBA, ideally this would be up to the latest service pack, and in this case it doesn’t seem to be the case. So you know, I have a feeling that there’s a lot of stuff that’s probably not set up properly. So let’s take a look and see if we’re doing backups.
So I see two maintenance plans, this is a good sign. So let’s take a look at the hourly to see what’s going on here. So we’ve got this hourly task, and something’s a little bit funky about this. The type that’s going on is a full backup, so every hour, they’re doing a full backup of the database, and they’re also doing specific databases. So here, TPCC was added to the database after this maintenance was created, so it was never included, and I’ve seen this happen with customers. So instead of selecting all databases, they’re just backing up just a couple of them. They’re also doing a full backup every hour, which is weird. That really shouldn’t be necessary. If we take a look at nightly, here, they’re doing a transaction log backup but they’re just doing it every night. I literally had a customer that was doing this. I had a customer where they were doing a full backup every half an hour and they were doing a transaction log backup every night, and really, the reason was that they had certain SLAs, they didn’t know how this stuff worked, and they probably set up the transaction log backup so that the transaction log didn’t keep growing out of control. Well, this database was like 10GB or 15GB. So that means every half an hour, they were taking up 10GB of space. Now, they’re only doing it during business hours, so that was 16 of those. So everyday they were backing up 160GB of data, that was not great, right? So that’s an issue.
Another thing that we have going on here is shrink database. A lot of times when people do set up a maintenance plan, they just check all the boxes, and one of the boxes is shrink database, and it seems like, “Oh, this would be great”. Well, you really – unless you know for sure that you want to be shrinking your database, you don’t want to do it because it’s just going to grow back anyway in many cases. You’re not really reclaiming that much space, and it can cause potentially fragmentation. So it’s just not an ideal sort of thing. So of resume generating events, one of the things I found already is okay, this backup piece doesn’t seem right. Now, let’s take a look and let’s run a query and see okay, what has been going on with the backups.
Querying Backup History
So there’s a bunch of queries out there that look at backup history. I grabbed one by Tim Ford on MS SQL Tips, and this is just going to tell us just some basics about okay, what’s been run, that sort of thing. So we can see that just these two databases seem to be getting backed up, and we can see that last night there was one log backup. Now, for – and that was for the corruption one. AdventureWorks, it doesn’t run for it because it’s in simple recovery model. So we talked about those different recovery models, and so if I expand this out and I go to properties, we can see it’s in simple recovery model. And so what that means is that as soon as those transactions are saved to the log to disk, they’re written back to the data piece so we don’t have to backup the transaction log. So again, this – if this was actually AdventureWorks, that should be great because it’s a development database. There’s certain circumstances where again, the business doesn’t need point in time recovery.
One of the reasons that like to run this type of query is I told you about those third-party tools, so if we look here at physical device name, we’re writing to C, which isn’t great, and we’ll talk about that in a second. But at least we’re writing to a disk. I’ve seen multiple times where someone has Backup Exec or Veen or Barracuda, I think that’s the name, set up and it’s backing up to a virtual device, and then when I as the database administrator wants to do a restore, I don’t have access to that virtual device so my backup chain’s been broken, right? But we also see here, it’s being written to the C drive, which is the default, and it’s not great. Ideally, if possible, you want to have a separate drive for your backups that’s ideally on separate storage, because think about it. If you have a disk failure and this server just dies, all of your backups are on the same drive as the server. So you really want to have that either separated out or you want to be making copies, or you want to be doing something. So the backup situation here is not great. So that’s something to be aware of, and whenever I’m going through this stuff, what I would be doing is I would open up Notepad, or I would open up Word and I would start taking notes. So I would say, you know, maintenance plan being launched, and I’m going to try maintenance plans, backup location, you know, these sorts of things. So I’m constantly taking notes as I’m going through because like I said, I kind of work in these cycles. I’ll start poking around, seeing what is off, checking some of the basics, and if I see something that’s a problem, I’ll usually come back to it again later. But very often when I’m starting, I’m just trying to get a lay of the land, figure out, okay, how screwed up is this SQL Server? Sometimes, you know, go to a customer thing and you say, “You know, this is a little bit overprovisioned, but overall things look good.” So it varies. They’re not always terrible.
Checking for Corruption
So we talked about backups, another thing we talked about is checking for corruption, right? So one of the things is I can say, “I want to check for corruption”, so I can run DBCC CHECKDB, and there’s specific options you can run with this. You want to have it as part of a maintenance plan of some regular jobs, so you’re checking on a regular basis. But I can run it, and you want to be mindful of the load on the server, because it’s going to add some amount of load either to RAM or disk, depending on how much RAM you have. But if there’s not a lot of load, so if we were to look and check activity monitor- so we can see I’ve got a fake load so I do have quite a bit of load, so I may decide to do this after hours or something like that so it’s not too much of an impact. But it’s a demo so I’m not too worried about it here. So I’m going to run in AdventureWorks, DBCC CHECKDB, actually I already ran it, but so there’s no issues there. Well, we’ve got a database here called CorruptionChallenge1 by Steve Steadman. I am sure it will be fine. And so it can take a while, depending on the size of your database and the load. Whenever I wasn’t running a load it just took a couple seconds to run this. So I’m scrolling down, and then crap, there’s stuff in red.
Now, I can tell you that I’ve been lucky enough to never have to deal with corruption. So me personally, I would not have a great idea of what to do in that moment, but I would do a couple things. Anytime I see an issue of this severity, I see that backups are just not set up at all, or there’s corruption, or something that’s just if we don’t do something, the business could dramatically suffer. I immediately grab the stakeholders, I let them know the situation, I talk it through with them. Some of the things I do know about corruption is I would be very careful not to restart the SQL Server. Again, I’ve never had to deal with it, but my understanding is that you may have it so that the data’s fine in memory, but it’s the disk is messed up, and so you may have the information in the buffer pool or something like that, and you can save it or potentially do a backup or something along those lines, but if you restart the SQL Server you should be screwed, or it could be in some sort of suspect state. It’s possible to have corruption where some of the header pages for the database are messed up and you can’t even start up the database, right? Paul Randall has some corruption examples like that.
So I know I would try and do a couple things. I would make sure not to turn it off, I would immediately take a backup, and I would immediately try to open up a ticket with either Microsoft or a consultant shop that I trusted. So again, thankfully in my work, I’ve never run into database corruption, but you want to be checking for this stuff because it may not be screaming at you, hey something’s wrong. I was able to restore the backup for this database fine. It didn’t complain at all from what I saw. You know, sometimes this corruption can be hidden especially depending on what kind of CHECKSUMs are set up for SQL Server, because in older versions of SQL Server, the CHECKSUMs that check to see if all the data on the page matches what it’s supposed to be weren’t as comprehensive. So again, you can have scenarios where this catches you by surprise. So you want to be checking for – you want to be running CHECKDB. There’s a script you can run to basically dynamically run against all the databases, or if you’ve got maybe a weekly or nightly plan or something like that, you can add an option here. You can add the check database integrity task and do that as part of your regular maintenance plan, something along those lines.
Checking Wait Stats
Okay, so we’ve looked at corruption, we’ve looked at backups. One of the big things I always go to is wait stats. I want to know how is my server performing. Let’s close that. So I’m going to open up a query by Paul Randall. If you literally just Google SQL Server wait stats, it’s the first thing that comes up and he has a really nice query for this. It’s on the SQL Skills blog, and so what it does is it pulls out all the waits information. It filters out stuff that we don’t care about because it’s usually not an issue. And then using a CTE, it’s comparing itself to itself to say, okay, what’s 95%? What are the biggest hitters? Because if you were to say 100%, there’s going to be a ton of stuff, but if we run this we’re going to see two big wait stats, and because we only see two, I might change this to 99% and see what else we got.
So we’ve got four big wait stats, 99% of the time SQL Server is waiting, it’s these four. And the parts that you really care about is wait type, percentage, and then on SQL Skills, Paul Randall started having an encyclopedia of wait stats with cool graphs and lots of information, and so that can be useful as well if you’re running this query by him. But the big thing I care about is, okay, my biggest wait stat is write log. Now, you expect a certain amount of the database activity to be writing to disk. That’s fine. But I don’t expect it to be 82% of the waits. In this case, what it means is my disks are slow. I’m running Azure disk, you get 500 I/O operations per second. It’s not a lot, it’s not great. I didn’t pay for SSDs or premium disks or anything like that.
And then the next thing we have here is page I/O latch, and we talked about that, that can be a sign that it’s reading from disk because there’s memory pressure. There’s not enough RAM available. So let’s take a look at how much RAM it’s using, let’s take a look at page life expectancy to dig further. So this is a good example of that whole server smell thing I was talking about. You poke around a little bit, you get some information, and then you dig further based on that. There’s so much whenever I do a health check that’s ad hoc. You know, the first couple hours are yes, I’m going through my checklist and make sure I didn’t miss anything, but there’s so much where I’m like, “Okay, well that doesn’t look right”, or “Alright, why am I getting that wait stat?”
So the first thing is I’m going to take a look at task manager, and I know task manager can be a dirty, filthy liar, and you can check some of the stuff with perfmon, but it’s still useful, so I just want to see, alright, how much RAM is SQL Server taking up? And it’s taking up you know, 2GB, 2.5GB, which isn’t a lot because we’ve got 32GB allocated to this machine. That’s not bad. And we’ve got eight virtual cores, so we have a decently sized SQL Server machine for at least you know, small to medium use, right?
So I would expect that if this was SQL Server, the RAM would be near the top. It would be using as much as it can. Now, maybe we only have a gigabyte database. Let’s take a look. So if we – AdventureWorks, let’s see how big that is. 90MB. Okay, that’s small. Let’s see how big CorruptionChallenge is, 3MB, okay. Let’s take a look at this TPCC. Okay, that’s 9Gb or 10GB. That’s way more than two. I have the RAM, I want it in RAM, right?
Looking at Perfmon Counters
But let’s take a look at page life expectancy to see how that’s impacting the performance of the database. So there’s a built-in tool, you can find it through administrative tools, I have it bookmarked there, called Performance Monitor. Performance Monitor is a great way to see some of the underlying performance stats for Windows server and SQL Server, just as a whole.
Now, whenever I’m doing a health check, literally the only counter I care about is page life expectancy. There’s other valuable ones, but that’s usually more when I’m trying to debug a specific issue, and you know, it can vary for each of you. There’s great stuff I’m looking at disk performance, but literally, I’m looking at wait stats and a bunch of other stuff in SQL, the only performance monitor stat or counter that I care about whenever I’m doing a health check is page life expectancy. So I open up perfmonitor, and I’m going to say I want to add a counter. And then I’m going to go to SQL Server buffer manager, and it can be overwhelming. There’s tons of these and a lot of them really don’t make sense what we’re doing, but page life expectancy is really, really nice because it tells you do I have enough RAM for my SQL Server.
So I’m going to right-click on it and I’m going to tell it scale selected counters, and that just makes this little chart shows up. And actually, I’m going to spread it out so that we can see more over time. So I’m going to tell it we want a thousand seconds, and it will take a little while for the chart to catch up with us. And we can see over here it’s slowly rising, so you would expect it to keep rising like second per second. Something must have dropped it down, but normally, page life expectancy like I said, is we’ve read something from disk into RAM, how many seconds do we expect that page to live in RAM before it gets pushed out by other work that we’re doing.
So it’s going up, but looking at what we have here, the page life expectancy right now is about 800. So that’s 12 minutes. That’s not great. That means that we’re cycling through our RAM every 12 minutes. Now, if we’re dealing with a couple gigabytes, okay, that’s not a big deal. If you’ve got a 100GB database, if you’re cycling through 30GB of data you know, in a ten-minute span, that’s not good. There’s kind of this old folklore number of 300 for page life expectancy, and that hasn’t been true for the last decade or two, and even then, I’ve heard rumor that that was a number that was pulled out of someone’s butt just so that there was a number. Where I like to page life expectancy is in thousands, and generally, at least an hour. If data is staying in RAM for at least an hour, then I feel pretty decent that we’re not just pounding the disk.
If it’s here and it’s at 900 or 800, I don’t have warm fuzzies about that, and it’s something I’d want to look into. Maybe I need to increase the RAM or something like that. So we had a question – you can see it just dropped. So something just ran and actually I have a job that runs every five minutes, so that’s what did it, but something just ran that just killed our page life expectancy. That’s the trend that you’re going to see a lot of times is it keeps going up and up and up and then something crushes it. And you want to like, match up those times with jobs and understand what’s doing that. Did something cause a ton of reads? Or something along those lines. So there’s something funky going on and right now again, we’re seeing around 700 seconds. That’s not great. That’s like, 12 minutes. So that’s perfmon, so let’s take a look at the RAM, because we saw that we have 32GB available, but we only should be using two.
Configuring SQL Server Memory
So if we go to options, properties for SQL Server, then we go to memory, there we go, we can see here that someone has changed max server memory to be 2GB, and that was me. And so that’s really, really low. Like the minimum amount of RAM you need to be able to run SQL Server is 1GB. And so that’s why we’re seeing those page I/O latch waits. So I’m going to change that to be – well, I don’t want 32, I’m probably going to want something more like 28. Again, there’s formulas out there, I don’t memorize it because I don’t need to, but it’s generally something like give the OS 4GB RAM and then for every eight or 12 you give it something more gigabytes of RAM, or something like that. But even as a good starting point, I feel comfortable about giving the operating system 6GB to work with, right? And this is one of those settings that takes effect immediately. But again, this is a demo, so I’m going to change it, right? If you’re doing a real health check, don’t change it. At least whenever I’m consulting a customer, I want to get the okay before I change anything, and I want to make it clear that this is a health check, this isn’t any kind of change. O
ne, so there are more of these, but two, because of scope [create]. Because what’s a risk is that you keep going down these rabbit holes and you keep changing this stuff and you didn’t keep track of what you did, and what I said to the customer was going to take eight hours, now takes two days and we’ve only got half of it done. So what I say whenever I’m doing a consulting is here is the deliverable. I’m going to write a Word document at the end of the day is going to tell you everything that we can address. I make exceptions sometimes, we’re like, yes, just change it, but in general, I try to avoid mixing the actual mitigation with the analysis.
So we’ve changed the amount of RAM that it can use. Let’s double check that, I’m pretty confident we got that right. Yes. So page life expectancy, when I’m – I’m expecting that’s going to keep going up and up and up for quite a while now. I don’t think we’re going to run into any issues because we can fit the whole database in RAM. We should also see that SQL Server is going to start just using more and more memory. Yes, so it’s steadily increasing as well, and again, that’s going to keep going until at a bare minimum it has – well, anything that we’ve queried from the database in RAM at the very least, but SQL Server will use as much RAM as it can, generally speaking.
Okay, so we talked about wait stats and how often do you query them and that sort of thing, and so we made a change. How do we know if the change helped at all? Right? So what we can do is we can clear out wait stats, and again, this is a change, you want to avoid doing it if you don’t need to, but if we pull up that blog post by Paul Randall, he has a lot of great detail, some other post to go in a bit more, but he also has how we can reset those wait stats. So again, you want to try to avoid making changes, you don’t want to get down this rabbit hole of tuning, tuning, tuning, tuning, you’re just trying to do an analysis, but this is a demo. So what we’re going to do is we’re going to clear out the wait stats and we’re going to see, has anything changed. So right now it’s 82%, it is write log, and then 14% is page I/O latch, that sort of thing. So I’m going to clear that out, and now it’s 99% write log, right? So that went away. We will check back in on this a little bit later because I have a job that’s causing some of those page I/O latch waits, so we’ll see how that’s affected, it should be running at about a minute or so. It runs every five minutes.
So let me take a look at my list real quick to see what else I wanted to cover, but you get the idea that in a lot of ways, you want to have a checklist because you’re going to forget things, it’s very ad hoc where you start digging in, you see, okay, here’s a lot of activity, or here’s these different wait stats, let me dig a bit further, right? So it’s very much as much an art as it is a science.
Is the Server Properly Provisioned?
Okay, some other things to kind of look at, one of the things I like to look at is is the server properly provisioned. So we looked at RAM, and it seems like we have enough. We certainly have more RAM than we have database, so that’s usually a good sign. If I’ve got more RAM than I do the entirety of all the databases, I feel confident that I have enough of that. We’ve got eight CPUs, so considering we’re barely touching CPU, that’s probably more than enough. Well, let’s take a look at disk space and see how we are there. So disk space is okay, we’ve got 90GB, and right now we’re not doing the proper backups, so let’s force one and see how much space this TPCC takes up.
And again, you wouldn’t do this in production without talking to the stakeholders, but this might be something where you’re like, “Hey, we’re not backing up everything” and they go, “Oh, crap.” So I added the TPCC database, and now I’m going to say, hey, I want to run this right now. And again, this could be resource intensive, depending. So we’ll let that run, see how big the backup is. But I generally like to keep an idea of, okay, how’s my RAM, how’s my CPU, how’s my disk, right?
Let’s see, okay, so a couple other things we talked about. We took a look at max RAM, our tempdb file. So if we go to SQL Server, although we could also just right-click on tempdb and look here as well to get some information about okay, there’s just the two files, but you can also look and go to data and say, okay, here’s the log file, here’s tempdb, you know, we only have one file. Again, you probably want to split it out. I don’t think you technically require an outage for it but it’s still good to plan for it anyway to do that sort of change.
Let’s take a look at the max degree of parallelism, take a look at that. So if we go to I want to say, advanced, here again, the default is five, and usually, I’m going to change that to 50, and MAXDOP is usually zero, I’m going to change that to eight. And those are settings that you can just change without having to restart the server or anything, so it’s good to understand what kind of changes you can – take effect right away.
So let’s see if our backup finished while we were waiting for that. I think they might have disappeared. I think it might still be running. But I like to be able to see how much space we have and how long – like, I want to be able to predict, are we going to run out of space any time soon, because I’ve seen that happen for customers.
Job Duration & Failures
One thing that’s related is the jobs. So maintenance plans, they run based off of SQL Server agent jobs, so when you make a maintenance plan, it creates this sub-plan one to run, and one of the server smells I like to check out is the job activity monitor. Has everything been running recently and has it been running successfully? So we can see that these have been running on a regular basis, I’ve got this read heavy job that’s running every five minutes. We had a customer where they would get like, a ton of activity every 15 minutes, and so we looked at which jobs were running exactly every 15 minutes, because one of the performance monitor counters you can look at is number of transactions. So that could be an issue, that’s a smell potentially if there’s a description that tells me, yes, this might be a problem.
And then I’ve got this ETL process. So I’m going to want to take a look at that and see, well, okay, this failed, like it was run today, it was run 10 minutes ago, but it failed, let’s take a look at the job history for it. Okay, it’s been consistently failing and it’s trying to run every hour. So if I go ahead and take a look at why, alright, there’s a divide by zero error. So the business may think that some process is running or it’s been orphaned or whatever, and you have errors. So these are smells, you want to check through the logs for this kind of stuff.
If I take a look at the step, this was just a mockup where I’m literally selecting one divided by zero, but you can pretend this is a more complicated script or something like that. So I want to look at – when I’m just rooting around trying to find whatever, I want to look at the job activity, I want to look at the Windows log, so again, you can take a look at administrative tools, event viewer, and again, this is my – former boss would describe this is poke and hoe, right? I’m just trying to see if I can find anything interesting, so I’m going to filter on all the errors and warnings, and so I’ve got this confusing error about SQL Server and Windows Manage [rotation] and the privilege account, this looks scary. I have – me personally, like not pretending, I kind of have an idea what this is about, I’m not super worried about it, but this would be good to maybe follow up about later, right? And this, this is something else going on with – I honestly don’t know. It looks like a performance counter of some sort, but I would have to dig into it more.
So there’s some errors, there’s some weird stuff, stuff I might come back to, but you really have to apply kind of the 80/20 rule. What’s going to give you the biggest bang for your buck? What 20% of the things you run into are going to give you 80% of the value? And in many cases, that’s backups, that’s corruption, that’s being on the right SQL version. When you start to deal with this stuff, you can get out into the weeds, so let’s take a look at the SQL error logs.
Reviewing SQL Server Error Logs
So we’ve got SQL Server logs, one of the things that I like to look at is what are the dates for the archives, because a neat trick is that whenever you restart SQL Server, it cycles the logs, and actually, you can tell it how many times it’s allowed to recycle. You can have it so it goes back 99 times if you want, or something like that. The default is six. But I like to look at this so I get an idea of how often are they rebooting SQL Server, has this been up for months or I had one customer that was literally rebooting it every Sunday, and they said it helped, I didn’t believe them but I didn’t press them on it, but when you reboot SQL Server, everything that’s up in RAM can fall out, because it has to pool it all back from disk again. So it’s not ideal if you’re rebooting SQL Server every week. But if we take a look, let’s see if we can find anything just goofy in here.
So there’s going to be a lot of stuff where you may not know what it means, and that’s fine. Like, I can tell you right now, I have no idea what this means. I know a little bit about file stream, but I have no idea what this means. What I believe is going on is that some of this stuff is just because we changed some configuration settings and so it – this information comes up. So I don’t know if this is a big issue or not, a lot of these things, but I’m just looking through, trying to find things that look like more of a problem. So one of the things that’s weird is it keeps saying starting up AdventureWorks. Now, that’s a little bit bizarre, so what I would do is I would make a note and I would say starting up AdventureWorks, and later on if I don’t know what that is, I might Google it if I have time. You have to prioritize these things. In this case, I know that AdventureWorks keeps starting up because I set it to be on autoclose, which you would never do. So if I go to properties, options, I have it set so autoclose and autoshrink are true, and you would never have this. Autoclose makes sense if you’re a web post provider and you’ve got like, ten thousand databases that are very rarely used or something like that, autoshrink just – I don’t know if it ever makes sense, but it automatically tries to shrink the database and keep it as small as possible, which isn’t really much of a value add.
So I might end up Googling that message and find out, someone set it up this way. I had a customer recently where they had half a dozen databases set with autoshrink and I explained to them like, this isn’t helping. So going back to the log, let’s see if there’s anything else goofy that we can see here. So there’s some configuration changes, I changed the memory, there’s backups, there is a trace flag you can set so you don’t get all these alerts that backups were successful, because ideally you know about that already, and you want to know when backup fails. So it can clutter up the log there.
Okay, so like, I would be nervous at first if I see that amount of stack dump, but that’s because we did CHECKDB. There’s a lot of stuff where it’s like, you may not know what it is and that’s okay, but you dig through and you find something like this. Someone tried to log in as SA and failed, and it seems to be happening about every hour. That’s something I would bring up with people. That could be a malicious user, that could be a program where it had a specific password and it no longer works. Could be a bunch of things. The customer, they had an account, basically called like, abcuser, and it was failing to log in. So you can find some interesting stuff in here but there’s a lot of stuff you’re going to be like, “I don’t know what this is” and you may not have time to dig into it. So you get a better sense for it as you go through, but this is scuba diving mission, this archeology dig, you’re just seeing, can I find anything useful in here. So that’s kind of digging around in the logs.
Revisiting the Perfmon Counters
So let’s see how our performance monitor is doing. So now that we added more RAM, you can see these trawls every five minutes, right? Or maybe a little bit longer than that, but that’s because of that one job, but now it’s going up. So before it was stuck at 600, at the minute you can see 650. If I scale this out, we can see it’s starting to make its way up, and now it’s at, you know, 13000. So it’s starting to get to a point where I’m feeling more comfortable. An hour’s going to be 3600, so we’re like, 20 minutes or something like that right now. So now I added more RAM, my page life expectancy is doing better.
Let’s take a look at wait stats, so again, we’re going to clear it – well, no we don’t want to clear it out because we did the thing that would cause page I/O latches, so let’s run this. Okay, so now we see it is still showing up but it’s better. So it is still a bit of an issue, and we also have CXPACKET but it’s less than a percent. I’m not worried about that, but I’d bring up with the company, hey, write log should not be 93%. Maybe 10%, maybe 15%, fine, 93% tells me that your disk is probably really slow, and so you might be able to – you can either pull up from performance monitor counters to take a look at that, or you might go into resource monitor. So taskbar, resource monitor, and you might take a look at disk and see okay, how many writes per second am I doing right now. And so it looks like you know, one or two megabytes per second right now, and you want to ask, is that normal amount of usage? Is that what we want? Do we need faster disk because it’s maybe taking longer than we expect?
Now, so we had the question if you set max RAM can that cause page I/O latch, so we saw that. We changed it, it’s not as prevalent. Also what I’m doing is I have a job that I set up as part of this demo, where I’m doing some selects with some aggregates that are basically reading the entire table for these big tables and forcing a bunch of reads into RAM, right? So I’m kind of forcing that for the purpose of the demo.
Using the Checklist
So we’re kind of at the point where, again, having a checklist is really useful because I’ve been kind of bumping around and doing a bunch of things, I’ve been taking notes, and I want to make sure I didn’t miss anything, so I’ve got a checklist – I can’t show you because it’s covered up. Actually, hold on. So I’ve got a checklist that I have, of just stuff I’m trying to make sure that we cover here.
And it’s useful because you’re going to forget about things, so one of the important ones is logins. So we talked about security. Well, who has access and what access do we have? So if we go to security, go to logins, let’s take a look at what’s here. So there’s some system accounts, there’s SA, there’s some built-in accounts, there’s my account, and then there’s Mike who was fired. So we let Mike go about a week ago, you know, he was doing some stuff that he shouldn’t, and let’s see what permissions he has. Looks like the account’s still active. Let’s see, good, he has sys admin. So if Mike was mad at us for firing him, and he had some way to get access to the server, he could just scorch earth everything, and we wouldn’t – we may not even know because maybe he would change the SA password and login as SA. We probably don’t have any auditing set up for any of that. So you know, immediately, I would tell the stakeholders, “Hey, we probably want to do something about this” and maybe they say, “Well, we deactivated his Windows login so it’s not an issue”, okay, but I can take away his role, I can say I want to you know, disable that, that sort of thing.
Now, you might be tempted to delete this and say, “Well okay, we know he doesn’t work here, let’s delete it, let’s just get rid of it.” Well, let’s see if he owns anything. I don’t think – let me see. See if I can – that doesn’t help me. There’s ways to find out who owns what. He happens to own some jobs, so Mike was in charge of our database and if I go to this plan, we can see that he owns the job, right? Same thing for I think the ETL process. Well, guess what happens if you delete Mike’s account. Your backups stop working. I’ve literally had this happen. Had a situation at my current company, there was a gentleman called Alan who did all the backup kind of stuff, he had set up the maintenance plans way, way back, he left the company, he’s actually come back ironically, but he left, and then a while later I was like, well, we don’t need his account anymore in SQL, I’m going to delete it. And then silently backups stopped happening, and I only happened to stumble upon it thankfully, but you have to be careful about who has ownership. You want to make sure that SA should own all the jobs, not people, right? So I would change that to SA for everything. So there’s lots of these things that you just learn about piece meal, and it can be hard, it can be frustrating.
Let’s see what else we have on the menu here. We went through the logs, yes, so just a couple other things that I want to go over. One of the things is high availability. So like I said, it’s not really part of my health check but I always take a note, do they have failover cluster instances, mirroring, log shipping, availability groups, anything. A lot of times they don’t. You know, do they have full recovery model, do they have transaction log backups, so that I can have a conversation with the stakeholders, saying, “Is this something you’re interested in?” And very often they are. Very often, if I can tell them, “Hey, we can get it so that we can get you up and running in an hour”, that’s really meaningful to them, and that may be worth the money.
Another thing and like I said, the list of best practices just goes on and on, let’s see if I can remember where this is. For the backup settings, let’s see, this should be – you know, I’m not able to remember off the top of my head, but one of the things I like to set is backup compression. Let me just poke around a little bit. This is where Google’s nice. Database settings. Compress backup. This is nice because I’ve seen 5x or 10x improvement in the size of the backups just by enabling this, and again, you’re trading CPU for disk space. This is one of those lists of things that you know, it’s a one-off, you have your checklist, you do it, and you’re good.
Alright, so just trying to think, is there anything else that we did not cover. Okay, so I think that’s nearly everything, I’m going to stop this tool. So again, this is HammerDB, this is nice for doing some load testing, although it’s very confusing, the interface. So one other thing that is kind of a no-brainer but I didn’t want to do it at the start because it makes things a little bit too on rails, is I have installed sp_Blitz, which is a great piece of open source software by Brent and his team. And so I can execute this, and there’s a bunch of options, generally, the defaults are just fine, but sometimes you may have a lot of user databases and you want to ignore those, or there’s certain checks you want to ignore. And I want to run sp_Blitz to see if there’s anything I missed, and this is really useful for me, just as an it’s a built-in checklist for things to be looking into. So we already know that we have issues with backups, we’ve got full recovery model where we haven’t been doing the backups for logs, we haven’t been doing CHECKDB, autoshrink, autoclose, not good.
We’ve got stuff on the C drive, right? So this is an issue where if one of your databases grows out of control, then you may not be able to log into Windows. If tempdb, if one of your log files, something goes out of control, you may not be able to log into the server to fix it. So that’s something to be careful about. So not all these things you’re going to necessarily care about. See this is useful because something that I forgot was setting up emails, right?
So right now, if we take a look, we haven’t set anything up in terms of database mail. If I say, okay, it’s not even turned on at the moment, but in addition to that, if I go to SQL Server agent, there’s no alerts set up, there’s no operators, there’s none of that sort of stuff. So there’s nothing set up to tell me that there was a severity 16 error, there’s nothing to tell me that there was error 833, which I just learned about for long waits with the disk. Nothing.
And one of the cool things with sp_Blitz is there’s a URL for each and every on of these so you can say, “I want to learn more about what this issue is” and for the alerts one, they actually provide a script that you can literally run to set up all of those errors and alerts, right? So a lot of stuff we talked about, there’s some configuration stuff, we talked about how maybe Mike who was fired shouldn’t be SA, we’ve got all the databases run by me, just a bunch of different things. So what I like to do as well is I – all these things that I run I put into Excel as part of my deliverables, right? So I take a snapshot of the wait stats, the backup information, the sp_Blitz information, all that kind of stuff, and I put it together into one big deliverable. sp_Blitz is a great way to kind of automate a lot of this stuff and help you learn if you’re still learning these bits and pieces.
So I’m just double-checking to make sure there wasn’t anything else that I missed. I’m checking my list. Again, this is why you want to have a checklist ahead of time and a plan because you’re going to forget something, especially because it’s just so ad hoc and you’re kind of circling around.
Brent Ozar: There you go. Perfect. Well, warm round of virtual applause for you, Eugene there, very nice session there on SQL Server health checks.
Latest posts by Eugene Meidinger (see all)
- Why Power BI isn’t for You: Pitfalls and Limitations - January 17, 2018
- Just Enough Database Theory for Power BI - June 15, 2017
- Keeping up with Technology: Drinking from the Firehose - June 2, 2017