You sometimes have to jump on servers and need to diagnose and fix them quickly. Let’s streamline this process to make it faster and more efficient.
You’ve just been given a server that is having problems and you need to diagnose it quickly. This session will take you through designing your own toolkit to help you quickly diagnose a wide array of problems. We will walk through scripts that will help you pinpoint various issues quickly and efficiently. This session will take you through;
- What’s on fire? – These scripts will help you diagnose what’s happening right now
- Specs – What hardware are you dealing with here (you’ll need to know this to make the appropriate decisions)?
- Settings – are the most important settings correct for your workload?
- Bottlenecks – We’ll see if there are any areas of the system that are throttling us.
By the end of this session, you should have the knowledge of what you need to do in order to start on your own kit. This kit is designed to be your lifeline to fix servers quickly and get them working.
All code we’ll go through is either provided as part of this presentation or are open source/community tools.
Enjoy the Podcast?
Why I Want to Present This Session:
I’m a performance tuning DBA that loves optimization. I regularly have to jump on (customer) servers I’ve never seen before and fix them. I’ve built my own toolkit over time to make this process quicker and would like to share this information with the community.
I’ve contacted all authors of open source/community tools that I’ll use in this presentation to obtain permission to include them here. I’ll also publish all of my scripts, links to others scripts and the slide deck on my blog.
Full Session Transcript:
GroupBy Webcast – Rich Benner
Brent Ozar: Alright, in this first session at GroupBy, Rich is going to be talking about how to make your emergency toolkit. So take it away, Rich.
Rich Benner: Okay, hi, welcome to the presentation. I’m Rich Benner (StackOverflow – Github – LinkedIn). I work for a product vendor. I’m a performance specialist. I primarily deal with all your refactoring, the architecture of a database, and basically making our tool faster. One part of what I do is our clients tend to host their own SQL Server instances locally, and we obviously act as a support agent to the customers. Sometimes if there’s an extremely urgent issue or a high priority issue, something that needs fixing quickly, they’ll come and get me. I’ll drop on and have to fix it as quick as I can, get the client back up, get it working and then we can triage and get it fixed at a later point.
One thing that has made my life a lot easier working here is creating a toolkit that I’ve got in a few different places that will allow me to quickly jump on, it gives me a prioritized list of what I need to do in what order, and I can work through them until the issue is resolved or until it diagnoses the problem that I can then go and fix. The presentation will be on richbenner.com/toolkit, there’s some contact details if you want to get a hold of me. I’ll put them up again at the end.
Why You’re Here
So why are you here? You are probably the main line of defense against the SQL Server. The idea of this is that you can build it at whatever level you’re at. You can either be – you may be an accidental DBA who has to do these things not regularly, but when you do them, you want to get it done quickly. Or you may be a senior DBA who knows what they’re doing but again, if you could put a little bit of work in at the beginning, save you a lot of time later, you can get things fixed and look really good to your managers. You don’t have time to waste, I know how busy everybody is nowadays. If you’re not busy, then it’s not going to be long before you get more work dumped on your plate. So you want to make things as streamlined as possible.
Again, you might not have a lot of knowledge at this point, but the purpose of these scripts and some of the sample scripts I’ll show you is that you don’t necessarily need to have a lot of knowledge on the internal SQL Server. But these scripts that are written by people who do have the knowledge of SQL Server will point you towards the resolution, how to fix it, and hopefully, as quickly as possible.
The Contents of Rich’s Toolkit
So let’s make a toolkit. So it’s a set of scripts that will help you diagnose issues, we’ll number them so that you have that prioritized list. You can literally start at script one, script two, script three, and work through until you worked out the problem and you can resolve it. You want to make sure it’s kept up to date. A few of these scripts are released monthly, some of them are released slightly less, but it’s generally a good idea for these tips to keep them as up to date as possible. You’re always getting new SQL Server versions released, new features, and a lot of these scripts will have to be slightly modified or there may be enhancements to them for later versions. So generally, you want to be keeping on top of it, maybe check it once a month, see if there’s new versions of the scripts.
But when there are new versions, you want to make sure you know how they work. You want to make sure that either they work the same way they did before, or you want to know what parameters you want to use or what the output’s going to look like for the latest version. A couple of the scripts sometimes have major revisions so we’ll be going through some of the first on the kit scripts, and sometimes when the new version’s out, they’ll rework it and maybe move some functions into a different parameter. So you need to be able to know how they’re going to work and the outputs. You don’t want to just be downloading the latest version every time you’re running it unless you are confident that you can deal with potentially it not looking how you expect it. But again, spending 10, 15 minutes when the new version’s out to work out how that works is going to mean you don’t get surprises when everybody’s panicking when a server’s on fire. You need to fix it.
So these are all community tools, there’s a couple of scripts that I’ve either written on modified from scripts I’ve found, the rest of them I will point you to the original authors, where to get them, and how to get them. So there’s a few scripts that are on blogs, we’ll go through them later. But there’s a few that are compact enough to be on blogs. We’ll go through Adam Machanic’s sp_WhoIsActive, I’ll show you why you want to be using that instead of the in-built ones. And we’ll go through a couple of scripts from the First Responder Kits as well. If I have time, there’s a few more scripts to go through, but I don’t want to rush through these ones, so we’ll do these and then there’s a link at the end to where I’ll host this and I’ll put some more scripts on there that might be helping.
So you get given a server, you need to fix it as quickly as possible. So you need to prioritize. There’s a number of different priorities that you can do. We’ve got the server’s on fire, it’s melting, the CPU’s maxed out, basically, people are panicking, it’s down, you want to get it up as quickly as possible. You can have intermittent outage issues, so some user’s having issues, some particular pages are having issues, slightly less urgent but it’s obviously still you want to fix it as soon as possible and get it back up. Or there’s another issue where people report it’s slower. It’s been slowing down over the past month or two, it’s not as fast as it used to be. I’m sure you can think of reasons why that might be, but we can get some scripts to try and help us analyze that.
So what I like to do is start with some diagnostics of what I’m working with because there’s a 90% chance I haven’t seen the server before, and the wide variety of servers that it could be, hardware spec, software specs, I like to run a diagnostic so you can get a baseline of what the server looks like that you’re working with, how big it is, some information like that. So we care about some database settings, specifically hardware and software. Once we know what we’re working with, we want to look at what’s happening now. So we can use sp_WhoIsActive for that, we’ll go through that in a second. But yes, that tells you what’s running on the server right now, which once you know that can kind of help you dig down a little deeper if it’s an issue with what’s running right now, and it’s not an issue that nothing’s running right now.
We’ll go through sp_BlitzFirst, which has a – it’s a massive script, it’s got a load of parameters. We’ll just go with the basic script at this point, and you can have a look through things like expert-mode and setting like that if that interests you but we won’t specifically go over that. But these two scripts are on GitHub. Again, I’ll point you towards where they are. You can go through, and I do implore you to have a look at these scripts and kind of step through them yourself when you have a bit of time and try and work out where they’re getting the information from or what it means. And then it’ll really help your knowledge of A, how the internals of SQL Server work, and B, how to get it back up running, how to help the performance.
We’ll look at wait stats as well on the server, there’s a blog script that I’ll point you towards for that, that’ll help you analyze that helps tell you if you’ve got a specific CPU issue or an I/O issue or locking. That will help narrow down a little bit further as well. And then finally, we’ll look at disk performance. Whether you’re on spinning rust or on hard disks, or potentially, network issues will get picked up when you try and see how your disk’s performing.
So going through the basics, on the site where I’ll point you towards at the end, script one is called Environment.sql. This is what I use to try and work out what basic hardware and basic settings are. It’ll give you things like the instance name, machine name, what product version we’re working so you know if it’s 2008 versus 2017, you might approach fixing these slightly differently. Whether we’ve got a HADR solution on there, it’s clustered, availability groups, how long it’s been up – one of the fun things that people like doing with the SQL Server going slow is to reboot it, and then you’ve lost a bunch of motion like performance counters, which don’t persist through a restart.
If you’ve been very good and they’re on 2016 plus and have the query still switch on, you can still get some information from the query store because that persists through a restart, but a bunch of stuff doesn’t. So I like to check server up time so that if they’ve rebooted it this morning, I know to take some of the perf counters with a pinch of salt. However, if it’s up for months, great, we’ve got more information we can dig through.
How many cores we’re working with and what sort of process potentially we’re working with. I had a client issue today, two core SQL Server, and it was complaining for SOS_SCHEDULER_YIELD wait type, and that is CPU. So straight away, seeing that, two cores, yeah, I bet wait stats are going to be an issue and the CPU. But at least you know what you’re starting with, and obviously the amount of RAM you’re working with. Guys, here, most of you will know that SQL Server will use as much RAM as it can. It’s greedy, it’ll just take it all, but knowing how much we’re working with will help, especially when we get to the point of checking things like our database sizes. So you know how much of your data you can cache, and then some of the perf counters might help as well.
So all you have to do with Environment.sql is get the script, put it on the server, run it. It’s going to give you results something like this. The script is very simple, you can see what it does and where it gets its information from. But knowing that this particular instance here is SQL Server 2016 SP1 Enterprise, with 40 cores and 250GB of RAM, it’s been up for 10 days. Hasn’t been up for that long because it’s only a test server, but understanding that this is a fairly decent sized server, 40 cores and 256GB of RAM, you’d approach that differently than you would for a server with four cores and 16GB of RAM. From here, you want to be asking yourself if there’s anything out of the ordinary. Is there an issue here? Are they running Web edition? Are they running Standard edition and it’s not going to utilize all the cores? This will help you get to the original, to the start problems.
So you want to have a look for any immediate issues. Now you know what you’re working with, let’s try and have a look what’s happening. So in the First Responder Kit, which again, is firstresponderkit.org, or I’ll point you towards the GitHub directly, we’re going to start with sp_BlitzFirst. Now, one of the prerequisites for BlitzFirst is that you need BlitzCache. All these are, two SQL scripts, put them on your server, run it, and it’ll install the stored procedure. You can then call it and check your results. So you need to stick on sp_BlitzCache, and then sp_BlitzFirst. If we have time for demos, I’ll show you actually how that work in practice slightly later.
Once you’ve installed the stored procs, you just call them, execute the stored proc. Now again, there are a number of parameters you can pass through. Things like expert-mode equals one will give you more information, it’ll give you four, five result sets other than this standard one. Depending on the use case for your toolkit, if you are confident that more information will help, by all means, set your scripts to run it in expert-mode. However, if for example, you wanted to give this script to a support desk, to a help desk, you may give them this script but don’t tell them to run it on expert-mode. There is – it does give you a lot of information, which is very useful, but if you’re not particularly confident in SQL abilities, then perhaps don’t drop into expert-mode. It’s a bit of an overload.
The results of BlitzFirst will give you a prioritized table like this. Priority naught is always there, priorities 200, 250 and above, are informative. Closer to one is prioritized as more important. If you’ve got a one, you’ve got a major problem, but it’ll tell you what it is. On this particular server, I ran it during a data load, it gives us a priority 40, which is pretty high, but it doesn’t mean the server’s on fire. This particular script tells us that forward fetches are high. Now, on there, there’s a how to stop it column, and opening that will tell you what you need to do.
Now, if you know much about indexing, you’ll know that forward fetches is because you’ve got heaped tables that haven’t been rebuilt in a while. So you either want to rebuild your heaps potentially, or look at adding clustered indexes to those heaps to try and avoid the forward fetch issue, but the links there will give you a step by step of how to fix it, or a link to somewhere that will tell you how to fix it. So it’s very good for narrowing down a first point of what’s happening, and again, if you wanted to provide this script to your help desk team to try and stop the volume of cores coming to you, you can give them this and they can say, “Right, so the output of that is that it’s currently running at high forward fetches.” I’ll show you, again, if we have time for demos, if I don’t talk for too long, we’ll go through a couple of demos later and I’ll show you other examples of what can go on.
But if you want to know everything it’ll report on, have a look at the script. Just go through step by step. It’s an awful lot of rows of data, talking thousands, but the way it’s built is very logical and very sensible, so you can just dig through and have a look at what it’s going to report on and how it would know these issues. And yeah, don’t worry about the size of the scripts. They are big but they’re a lot less intimidating than you think they are the first time you see them. Yeah, so the higher on the list your issue is then it’s going to appear higher, it’s going to tell you things like you’ve got high CPU, if you’re running it out of memory, you’re going to have different flags in there depending on the symptoms.
What’s Running Now: sp_WhoIsActive
Right, so when you’re looking at what’s running right now, everything’s active. You’ve got two options. I’m going to tell you the first one, I’m going to tell you why not to use it and why to use the second one. So Microsoft built in sp_who and sp_who2 that are in-built system stored procedures that you can just call, it’ll tell you all active connections to the database right now. You literally just exec sp_who2, and it’s going to give you an output something like that, which will tell you all connections to the database, but you’ll notice the vast majority of these are either background processes or they’re sleeping, they’re not doing anything right now. This is fine but I now – if what user processes are running right now, you can have thousands of rows of data on here. It’s an awful lot of data to dig through, which when you’re under time pressure, you don’t want to be messing about with.
So I suggest we don’t use this. We look off and we run sp_WhoIsActive. Now, this is a stored procedure written by Adam Machanic years back. It is maintained, but what this will do is give us a much nicer view of what’s happening. It’s going to give you an output something like this. This was run a few days ago on a server I have. It tells you all processes that are running right now. Things like – it’s sorted by how long they’ve been running for, so you can see I’ve got a process that’s been running here for 60 days, which is fun to find out. But what this also does is it gives you the query text of what the process is doing. If you run sp_who2, if you want to find out what processes SPID is running, you have to run things like DBCC INPUTBUFFER, which just takes time, especially when you’re in a rush. So this you can just click on a query text and it’ll open up a new XML window and show you what query it’s running.
Now I know that that 60-day process was just a diagnostics process that was in the background. It’s not actually doing anything. If you have a look, it’s used zero CPU, 20 reads in 60 days. It wasn’t actually an issue, but seeing 60 days was a bit of a surprise. And you can see all the other currently active connections of the database, things like how much CPU they’re using. We’ve got one there that seems to be using a lot of CPU and doing a bunch of reads for this database, that’s a bunch of reads. And it’ll tell you things like that status, so whether they’re actually taking up CPU, whether they’re running, it’s just much nicer to use this than it is to try and dig through sp_who2. And this will help us with things also like blocking, which I’m sure you’ve all come across it, it can be a massive pain. But having this, instead of having to scroll up now thousands of rows to try and work out what’s causing your blocking, you’ve only got nine rows of data here, depending on your server. It’s just much nicer, I highly suggest you get sp_WhoIsActive, download the latest version. Make sure you know how it works and go run it in your instance to see what’s happening rather than use sp_who2 because nobody’s got time for sp_who2 all the time.
Wait Stats Script from Paul Randal
Once you’ve been through there and you know that – you look at things like blocking, you know, it’s not an issue. If this doesn’t resolve the issue, then we need to go to the next step, a step further. We’re going to look at wait statistics. Now, this is – it uses a lot of system counters, but basically, this tells us where potential bottlenecks might be in our SQL Server. This particular script I’m going to use was written by Paul Randal at SQLskills.com, you can get it from here, don’t worry about writing that down, I’ll give you a link to this at the end. So you can follow that.
One thing to note is from this script is the default that it scans wait stats for is 30 minutes. What it does, it takes a snapshot of the data, waits 30 minutes by default, takes another snapshot and compares what’s happening right now. Obviously, when a server’s on fire, we don’t want to wait 30 minutes for a result. I changed mine to 30 seconds as a default. If you feel like that’s not enough or too much, you can have your own version in your folder, which is whatever time you want. If you only care about five-second snapshot, what’s happening right now. I found 30 seconds is more of a sweet spot for me, but that’s in my toolkit. When you build yours, if you want a different parameter, go for it, stick it in, that’s not a problem.
So the output of that will look something like this. It sorts your wait types by the total wait time over the time period you’ve given it. Now, this gives – your total wait time will be wait time per core. So if you’re running a 40-core processor and you run the processor at a second, you could potentially have up to 40 seconds of wait over that one second period because it counts per core over CPU.
Looking at this, if you’ve not seen wait stats before, it might not mean a lot. Even if you do know wait stats, it might not mean a lot. What is write log, what is LATCH_EX, there’s dozens of different wait types and very few people would be able to tell you what all wait types are off the top of their head. So I keep a link to the particular page, which again, SQLskills, they have an excellent list of all known wait types. You can find your wait type, click on it, it’ll tell you what it is, what sort of category it is, write log is fairly self-explanatory. It’s going to be your log files, LATCH_EX is again, it’ll tell you some more information about it. CXPACKET means parallelism’s happening, but CXPACKET doesn’t really mean much by itself. You’d need to know that that is a parallelism issue. I’m sure some of you are seeing SOS_SCHEDULER_YIELD and having – hyperventilating a bit. That’s a CPU contention flag but again, if you don’t know them, you wouldn’t be expected to know them.
As you run this more often, you’ll get the hang of what your most common ones are and your sorts of environments. But if you’re not sure what any of them are, you can go to that link and just drill down and it’ll tell you what they are and that will give you an idea of either how to fix them or where you need to go next.
Daniel Hutmacher: We have a comment from Brent, he says, “Bonus, even on a 40-core box you can have more than 40 seconds if multiple processes are waiting for the same – I guess, other kinds of problems…”
Rich Benner: Yes, you can stack them. But yeah, they’re kind of – the point was that you can have more than a second of wait time in a one second period. You can have multiple. So don’t necessarily worry that you’re seeing – if you’re seeing 40 seconds or 100 seconds wait times in a second. That just means like, you have more than one process waiting and it’s per core as well.
Daniel Hutmacher: Sorry.
Rich Benner: Go on.
Daniel Hutmacher: We were also laughing at the fact that your SA account is called Totally Not SA, and we had at least someone say – J.D. Walker says, “Stolen, I’m going to rename the SA account on my dev instance.”
Rich Benner: Nobody calls it – nobody ever uses SA to log on. All of my demo machines, it’s totally not SA. I’ll give you three guesses on what the password are for my demo machines. Password, yeah, that’s right. I feel like we’re all doing worst practices for demos. Log on as SA, use password as password, yeah, just all the terrible things.
Daniel Hutmacher: [crosstalk] password was blank, that was nice.
Rich Benner: Yes. Yeah, that was interesting. Yeah, no, when we go through the demos you’ll notice that as well. I’ll always use totally not SA as my SA password because nobody should do that in real life, nobody ever does that. I’m sure nobody’s ever seen anybody using SA other than all the time.
Daniel Hutmacher: I worry that it might crash a service pack installer some things – best practices to rename back to SA installed in the update or C or SP and rename it back again.
Rich Benner: [crosstalk] no, no, it’s fine. It’s these sorts of things, you come across them, don’t you, but I like to mess about when it comes to demos. Be a bit more reasonable on live servers and production, but for demos nobody really cares, do they? None of these are network available, so I don’t have to worry about security as such because they’re all local. Okay, any more questions before I carry on?
Daniel Hutmacher: No, just loud.
Bert Wagner: I don’t see any more.
Rich Benner: Okay, excellent. So SQLskills.com, excellent resource for what wait types are until you get a handle on what they are. Half of these I could probably tell you off the top of my head, half of them I don’t see them often. PREEMPTIVE_OS_AUTHENTICATIONOPS, so I could guess but if you want to be sure, go to that web address and you can tell what they are.
Analyzing Storage Performance
So if that doesn’t fix it, then you probably want to look at disk performance. If you’ve got – you might have a wait stat that is showing high network I/O waits, one of the earlier – the BlitzFirst script might tell you you’ve got network issues. I have to double-check that. But if we suspect disk performance, then we want to have a script for checking our disk performance. We don’t want to be digging around and make something go offline. We just want to be able to chuck something on and get it working.
Now, few things about disks. SSDs are cheap nowadays. Spinning rust is slow. So a lot of people, we were talking about this before the webinar, a lot of people are still on hard drives for servers, whereas most laptops nowadays will be SSDs. It’s crazy that people have SQL Servers on spinny disks, partly because of their performance, partly because you’re spending so much on SQL Server licensing and then you go and put it on cheap hardware. It’s daft. But we want to be able to check what we’re working with.
Is network speed an issue? Even if it’s on SSDs, you may have – they may have tripped out a RAID controller, and your network speeds aren’t up to scratch. But we can check that whilst checking our disk performance. It will give us a general idea and we can then go and call up the network guy and say disks don’t seem to be performing that well.
Disk performance can cause a lot of issues. It’s the architecture of SQL Server, writing and reading. You might sometimes get away with it. If you’re reading a lot of data just out of memory, you’re doing a lot of small reads and it can load all the data into memory, disks might not actually be too much of a concern, although still not great, but in other workloads, depending on what the product is, what SQL Server’s for, they can be a lot of problems caused by some disks.
There’s a script written by a guy called David Pless from Microsoft called – my version’s called Disk_Speed_Check.sql. David’s didn’t do quite what I wanted it to do for my applications, so I’ve modified his script for my toolkit, which I’ll give you my version of, but it’s still very much the brainchild of David Pless. I’ll give you a link to where it’s from as well, it’s embedded in the script to say who the original author was and the fact that it’s been adapted slightly from that.
What we’ll see from these checks is something like this. It will give you per file the general performance, and it will give you a great idea for how you’re looking on your boxes. Now, he takes Microsoft recommendations and has kind of built them into the script. And that’s the read performance and write performance here are extremely handy, especially if again, you were going to give this to say, a support team or first line help desk. Knowing that they might not know what a good read stall is, in fact, we’ll just go through it quickly for read stall. So read stall and write stall is what you’re going to focus on mostly when looking at your disks. That is basically – it’s a lot more complicated than this, but it’s how long a process takes to request information from the server and request it back, and receive the data back. That’s the purpose of reads. For writes, it’s how long it takes for it to send data and get the confirmation it’s been received by the server. That’s – again, it’s more complicated but as a general rule of thumb, it’s how your disks are performing.
You can see on this server, we’ve got mostly good. Good writes, I’ve got one particular file I potentially want to look at. It’s the live database data file. I’ve had to remove data, to file names here, but ultimately, if we were worried about disks, we’d look at that one and see perhaps why that disk is running slowly or that file is running slowly. But this tells me that actually, probably the disks aren’t the issue. Again, knowing what this output looks like and then what you’re looking for is pretty essential. You don’t want to be running this for the first time on a client machine or on a server you haven’t seen before, and then try to sit there and work out what this means. Having a basic understanding of how it works is always a great idea.
And what you can do with any of these scripts, you take your copy of them, say from in your toolkit, you can fork a copy of the GitHub ones if you want. If you’re going to use them yourself and want to add more functionality, you can modify them. If you want to give it to a support team, and it needs to be a bit more in-depth, or maybe they don’t need so much information, you can take parts out or put comments in and give guidance in these scripts to try and make them as performant as possible. For example, one of the scripts that I’ve used is BlitzFirst. But I’ve given a copy to our support desk with a number of functions removed. It’s still BlitzFirst, but it’s not as big as it was, and it’s basically the basics to try not to overload the input so that the version of BlitzFirst I used for the support desk toolkit, different to the one I use. But again, that’s because I’m probably more comfortable in having a bunch of data fired at my eyes than a support desk would be. They just want to – they generally don’t have the time to dig in and you want to try and quick wins. So any of these scripts just modify, but don’t be horrible and just take them and take them as your own. They are written – this script for example is David Pless’s script. I’ve just made slight modifications so it suits my needs. Just remember where you got your data from, it’s only nice.
What we’ll do, I can show you a few of these scripts and how they work in some demos because I have talked the legs off this. So let me just get rid of that, and open Management Studio. Now, one of the things that I’ve done in my time is I’ve had most problems that you can imagine. I’ve seen things that you would only see in movies, and if you were joking. So I’m using – I’ve got a couple of demo machines here, they are the Stack Overflow 2010 database, which was on BrentOzar.com. If you have a Google for Stack Overflow database, you’ll find a torrent of the full thing, but you’re talking 140GB I think unzipped, for the entire database. This file is about 1GB zipped, and about 10GB unzipped. It is the website Stack Overflow’s database, but the data from the first two years I think, when it existed, so 2009, 2010. It gives you some basic fields and you can build a number of queries on just the – I like it for demo purposes.
I’ve got – let me just get this machine, make sure it’s happy. In fact, what I’ll show you before, the demo I couldn’t show you – there’s a demo I can’t run because it maxes out the CPU and I think it’ll crash the total webinar. So I ran it before and I will show you the results of it now. So I had a server which when I checked it was high CPU. Now, I know these servers are single core 4GB RAM servers because they’re running on my laptop. So I didn’t run the first check, but I did run BlitzFirst on it. Now, this you can see I had high CPU utilization on my server. Now, you can also see there’s high CPU utilization, not SQL Server.
If you need some more information, BlitzFirst will always give you a link to a blog post and you can go and dig in a little bit further. But yeah, I knew this one was high CPU because – high CPU on a server but it’s not SQL Server. Now, the reason for this, it’s something I’ve legitimately seen live, is somebody running a game on a SQL Server. I’ve had somebody who had – they were playing a game installed on the SQL Server because it was nice and highly powered compared to their little desktop. So I ran this remotely from my machine, high CPU, not SQL Server, and then you come to something like them playing a game.
Legitimately happens, and I went down with the permission revoke for said developer because they thought they were being smart and running games on a SQL Server that’s monitored by IT is an excellent idea. But this is the sort of thing…
Daniel Hutmacher: That’s the first time in something like 20 years that I’ve ever heard this happening in real life. It’s the standing joke. We’re all in IT, right? Let’s just make a PlayStation out of this 64-core machine.
SQL Server Bingo Card
Rich Benner: Exactly. I’ve seen all sorts of nonsense. I’ve actually made myself a bingo card, which I’ll show on the next slide. But you can have for a week – I’ve got a little board here, I can put it up, and see how many – see how long it takes to get bingo. There’s all sorts of nonsense. I’ve seen – I’ll go through the bingo card in a minute, but it’s – there’s some silly things on there and none of them are made up. They’ve all happened, it’s crazy.
Daniel Hutmacher: Brent says, “Put it up. Bingo’s awesome.”
Rich Benner: I’ll go find it. This isn’t in this role, just to preface that. It’s in my DBA career. But if we go through – if we can finally get there.
Daniel Hutmacher: I crashed Rich’s presentation.
Rich Benner: I didn’t minimize it, I closed it didn’t I, so I’ve got to get back to where I was. Right, there, here’s my bingo. So it’s things like, “Can you have a look at this server?” “Oh yeah, what is it?” “Oh, our Oracle server.”
Daniel Hutmacher: I’ve got bingo already.
Rich Benner: Yeah, it’s – if you can find a board for it, just keep it on there. Legitimately, I’ve seen somebody using it as a torrent box, people running their own web servers from a live server, just some of these which are probably more common, things like having percentage file growth stats. That’s a lot more common. People mining stuff, even – they weren’t caught doing it, but the software was caught on there. Also, people come up to you and say things like, “SQL Server’s using an awful lot of memory.” Yeah, it’s going to do that. But a lot of people don’t understand – if you’re not a DBA and you don’t understand how SQL Server works, it may be surprising that SQL Server’s using a lot of memory, but trying to explain that to people – it will use as much memory as it can. It’s greedy. It’s going to fight things, it will crash Windows to get it if it has to.
Daniel Hutmacher: There’s so many war stories here.
Rich Benner: Yeah. People browsing on SQL Server so they can bypass and get on Reddit. If the content wouldn’t allow certain websites, worked out you can do it from a server, bless them. But then you’ve got things like auto shrink and default MAXDOP and cost threshold. They are very common, but they allow you to fill up your bingo. Oh yeah, the developer locking the database for testing and leaving the transaction open. Bless.
So I’m aware I am probably running out of a bit of time, I don’t want to overrun into the next session, but let’s close this. And we can run a quick live demo, something that isn’t going to crash the machine. So let’s get rid of that, don’t want that anymore, let’s connect to our demo one system. Now, I’m going to run – I’m using a – in case anybody wonders how this works, I’m using a tool called SQL Query Stress, which if I remember correctly, is from somewhere like querystress.com or sqlstress.com, but I’ll put that on the website. It’s a tool that you can give it a script, point it towards a database and tell it how many times you want it to run a script, as how many users, and it’s a load testing tool. You can stress SQL out, you can run stuff.
For this demo, I am doing just a select top 10 for posts, but ordering by new ID, which I know is not a great idea. But it’s going to put some load in our server. I’m running it a thousand times as 200 threads. So for this little VM I’m using it’s going to hurt so we’ll keep that running. This is my machine, so we’ll go through our toolkit. So this is the toolkit as is, we’ll start at the beginning. We run our environment, we can see what we’re working with. I know it’s demo machine, 2017 RTM developer because I’m not running it live, it hasn’t been up for long because I’ve been messing about with demos. One CPU and four cores. Now, if this was live, I’d be having a heart attack, but for the purposes of demos, it’s not out of the ordinary.
So if we want to see what’s happening right now – that’s interesting, I’ve deleted the script. Have it installed. But basically, you would run the scripts, install it, and then just run sp_WhoIsActive. I can see I’ve got one thing running at the moment, it’s been running for an hour. Fun fact, before this, query stress crashed on me before the demo and I had to reboot it, kill it. It looks like it never actually finished and never actually closed the connection, so I can see that I’ve got a process that’s been running for an hour. If I want to see what it’s doing, it was running an update, which in fact, I can probably kill that now. I was messing about locking a database. So as that’s rolling back, WhoIsActive, you can see it’s doing a rollback, it’s been doing that for four seconds. Once that’s done, if it completes, then we’ll be able to see the other processes catching up. They would have been blocked by my developer leaving a transaction unblocked. I’ve shot myself in the foot. Right, what we’ll do as well, we’ll move on to the next script while we wait for that to finish.
So BlitzCache, this is the sort of thing that BlitzCache will allow. You can see the size of it down in my toolbar here. all you do, execute it, it installs, once that’s installed, we’ll do the same thing with BlitzFirst, run it. If it’s already there, it doesn’t matter. It will just create stored proc and then it will alter it. If it’s already there, skip this bit and alter it to the latest version so you don’t have to worry about if you’re running this more than once. You can run it as many times as you want, it’s not going to make a blind bit of difference.
In fact, what we can then do, if we run BlitzFirst, you can see what’s happening. I’m just going to go back here, see if that’s – yes, you can see my rollback, and then we can see session ID 261 running, which is my tab up here. So this gives me a nice list saying I’ve got a query currently rolling back. And if you want to go to this link here, you can find out some more information about why it’s rolling back and why rolling back transactions are an issue.
This is the sort of thing you’ll see when you click on how to stop it. This particular script basically tells you don’t panic and restart your server. You’re going to have to wait for it to finish with the rollback. Please don’t restart your server, you’re going to cause yourself all end of nightmares. But the problem with the rolling back, it’s ultimately – you have to wait for it to finish, which is a pain. This server we can see also sitting there at 97% CPU. So it’s less of a priority than a query rolling back, the fact that your CPU on your server is sitting there at high may not actually be an issue if that’s your workload, although I wouldn’t be comfortable at constantly sitting at 97%. It probably means that it’s underutilized, unless you’re worried about things like Azure and you want to make the most of your bang for your buck.
If we run this again now, we can see that our query is still rolling back – I would have thought that’d be completed by now. If we run this again, if it wants to, yes, so we can see that the rolling back process is finished now, and it’s telling us high CPU at 94% of the server, and we’ve also got low page life expectancy now. Again, how to stop it, add more memory to the server or find the query that’s reading a lot of data. Now again, I know for this server that it’s not got much provision for memory, so actually, I know that’s going to happen. But if you don’t know what’s happening, again, prioritize list, the rollback was probably the most important, priority one, and then this gives me an idea of I want to look at CPU and I want to look at my memory. But it gives me a bit more information on what to go on.
So just to see the output of a few more, if you check my disk speed for this, again, it’s – this is the blog post where I originally got the script, from David Pless. It’s a Microsoft blog, but ultimately yeah, I just messed about with it a bit and made it suit me. If I run that, it’s all running on a local SSD on my laptop. It’s got an M.2 drive, it’s fine. We can see in my read stall, they’re all fine, they’re all good. I’ve potentially got one write in my tempdb file, which I’d look at. But for the purpose of this demo, it doesn’t matter. Six isn’t terrible. You want that to be less than 10 ideally, but if this was live, I’d go, right, tempdb, log file, so I know that I’d go and have a look at what drive that’s on, what’s happening in tempdb. Am I doing a lot of stuff in tempdb? Have I got a lot of spills? But knowing this information will allow you to go further.
Just have a quick look at the final two scripts. So we’ve got – this is the wait stats script directly from Paul Randal, the primary tool I talked about changing from 30 minutes to 30 seconds is just there. What we’ll do, we’ll run this. Now, this is going to take 30 seconds at least. If you’re not running this yourself and giving it to somebody else, it might be worth putting a disclaimer at the top to say this is going to take at least 30 seconds, and then follow up with but, if it takes more than two minutes, something’s wrong. Just so that people don’t panic because if they run a script and they just sit there and it’s just taking ages, they – people can panic. So again, depending on your audience, it might be worth putting a little comment at the top to say, yeah, it’s going to take 30 seconds.
So it’s finished now, we can see the only wait type is SOS_SCHEDULER_YIELD. If you know what that is, great. It means you can go and look at your CPU. If you don’t, it gives you a link directly to SQLskills site, which if I mess about and do it in SSMS, I can show you what it looks like. So it gives you a brief description of the wait type – what I’ll do, Management Studio isn’t known – say that again? What I’ll do, I’ll Chrome it because Management Studio isn’t know for being a great browser. Yeah, you can see what I do when I’m messing about.
So what it’ll tell you, SOS_SCHEDULER_YIELD, gives you a brief description of what it is, and when it gets added in. Ultimately, without reading through this in a live demo, it is a CPU contention issue. It’ll allow you to go into a bit more detail as to how to fix it. Now, just to briefly show you the list of different wait types on here, there are alphabetized, but you can see that’s quite a list of wait types. If anybody knew those off the top of their head, I’d be extremely impressed, considering also, they’re adding new ones all the time. With every different version you’ll see a new type of wait stat and they may not be very well documented by Microsoft. Yeah, that’s quite the list. I think we’ll probably call it there so that we’ve got time before the next presentation. Are there are more questions, guys?
Daniel Hutmacher: Not really questions, there was a discussion about who was allowed to log interactively on the server, which is a problem, and why this problem would be urgent. Yeah, it’s going back and forth between bad management policies, consultants, not me, but all the other consultants, and so on.
Rich Benner: Yeah, I’ve seen it before. Things like using Management Studio on the server when you’ve already got high CPU, you want to make sure you can use Management Studio on a box that isn’t your server and just connects. So for example, all of these, Management Studio is on my laptop, and they’re just connected into the server because otherwise, you’ll see all sorts…
Daniel Hutmacher: Also, I’ve seen quite a few developers who just won’t install Management Studio on their local machine, probably because they think they’ll need to install all of SQL Server. And more commonly, I’ve seen consultants, when they bring their own laptop, that laptop isn’t on the client’s domain so they can’t use SQL Server authentication without a few tricks. Many people don’t know those tricks so…
Brent Ozar: They don’t make it easy for you. They don’t make it easy at all. Well, thanks a lot, Rich. Nice job. Very well done. Thanks everybody for – lots of people asked where they can get your session and where your website is.
Rich Benner: Yes, just there. Sorry, I never got to that. Yeah, if you go to there, I’ll be adding on to that throughout the rest of the afternoon. But you’ll have the slides and a link to most of the scripts we talked about on there. if you’ve got any questions for me, you can grab me on the Slack community or email me, email@example.com if you need it.
Brent Ozar: And Doug asks, “Can you address the dedicated admin connection?”
Rich Benner: Okay, yes. I specifically didn’t mention that here because you have to have set that up before things are burning. But ultimately, what the dedicated admin connection is is your – you as a DBA, your backdoor into SQL Server if everything’s on fire and everything’s melting, the dedicated admin connection is your connection as an admin to be able to get in and find out what’s happening and try and fix it. With certain things, yeah, you won’t be able to connect as a user. You won’t be able to get in the front end because it’s just such a mess. Having that dedicated admin connection is great, but for this toolkit, if you don’t set that up before hand, you can’t set that up in emergency. You need to have it set up in advance. So yeah, it’s great. You definitely want it set up, but because the assumption for this presentation was that you haven’t seen a server before, and I always assume that everything’s going to be set to default, that you might not have that opportunity. But yes, you definitely want to set that up if you can get on a server in advance, and when you’re setting up servers, if you set up that connection, excellent. Yeah, definitely do that.
Brent Ozar: He says, “That makes sense, thanks.” CV URL – I have to look up CV URLs. Elsimer says, “Isn’t the DAC available by default but the remote DAC requires set up?”
Rich Benner: Possibly, yeah, I’d have to double-check.
Brent Ozar: Yeah, you can’t – usually when there’s an emergency, you can’t remote desktop in. URL says, “On your bingo card,” he’d like to add something. Last week he had a guy uploading a picture to be inserted into the database by drag and drop, via RDP. He had remote desktop…
Rich Benner: That’s definitely a new one. I’ll update my bingo card.
Brent Ozar: So awesome. And then J.D. keeps being blown away by all these things that we’re finding like the things that the audit scripts detect and that are in your bingo card, he’s like, “Oh my god, I would use an [unintelligible] on the staff.” Well, the things that you learn to transition or that you have to go from being a full-time DBA to working with other people’s servers, Rich, I know you’re doing a lot of this because you’re taking over other people’s servers or working on other people’s servers. Do they have a DBA? Do they not have a DBA? What’s that look like?
Rich Benner Depends on the client. We are – we’ve got a variety. We’ve got – we make software for architecture firms. Quite often, you’ll only have a three, four-people office where they just don’t have an IT person, let alone a DBA, up to companies that have 40, 50 people in their IT department, but they have more – they’re more concerned usually with the backups, your restores, your clustering, your availability, rather than performance issues. There’s quite a lot of – we’re the product vendor, we sell our own product and usually, the server that it goes on, or they use their own server they pre-provisioned. But yeah, going on there and sometimes – as I mentioned a few times, I’ve given a few tools to our customer care team who can just quickly fix stuff and then sometimes the client might have somebody to fix it or it comes to us. It’s a real preference, but yeah, it’s such a wide variety of clients that we have and that I deal with. It’s interesting. It’s varied.
Brent Ozar: If for any of you out there who haven’t worked for a software vendor before, just know that when you take that job going to work for a software vendor, you’re going to learn more about the crazy things that other people are willing to do with their servers. It is jaw-dropping. Aaron drops in and nods. Aaron, what’s the strangest thing you’ve ever seen on someone else’s server?
Rich Benner: This should be good.
Brent Ozar: He’s not allowed to say. Or he says he’s muted. Well hold on, let me unmute you. Here we go. Alright, now try now. He probably doesn’t even have a microphone.
Aaron Bertrand: Yeah, I’m kind of prohibited from talking about most of the crazy things that customers do, but let me tell you, they know crazy.
Brent Ozar: They know crazy. Well, thanks a lot, Rich. You have a lot of great, positive messages over in Slack there, check that out when you’re done. Thanks for hanging out with us today and have a good week.