Enjoy the Podcast?
You’re responsible for the care and feeding of production SQL Servers, including index maintenance.
For years, you’ve heard that you’re supposed to reorganize your indexes to make SQL Server go faster. It sounds like it makes sense – keep things in order, right? But you keep doing it, and SQL Server isn’t getting any faster. You’ve even heard that setting fill factor will help prevent fragmentation, and you’re doing that too – but your indexes still keep getting fragmented every day, and users aren’t happy with performance.
This advice made a lot of sense at the turn of the century, but today, things are different – and we’re not just talking solid state drives. In just the first 15 minutes, you’ll have a series of ah-ha moments when you realize that your daily index maintenance jobs might just be making the problem worse instead of better. Then, you’ll learn what you need to do instead.
Why I Want to Present This Session:
The cool thing about the SQL Server community is that there’s over a decade of blog posts, presentations, and videos out on the net. The bad thing is that a lot of that is outdated, and now we have to combat that now-bad advice. I want to put something out there to make it clear why those nightly windows might be better spent checking for corruption instead of shuffling pages.
Brent Ozar: So this morning, in the first session at GroupBy I’m going to be talking about why defragmenting your indexes isn’t helping. I’m Brent Ozar; I’m a Microsoft Certified Master and that just means I’ve made a whole lot of really expensive mistakes, and now I try to help others avoid some of those same mistakes. I’ve been a database person since the late 1990s. I got my start on SQL Server 7, which sucked, and then gradually went up newer and greater.
You’ll hear a lot of advice out here around performance, but a lot of the stuff is really old. You see blogs from people who’ve been doing it since, like, SQL Server 7 or whatever and they don’t update their old blogs. I was one of those people too. I would put all kinds of crappy blog post advice and then never go through and update it. Well you’ve one out and you’ve listened to some of that advice, and you’re here because performance still sucks. You keep defragmenting and rebuilding your indexes all the time, nothing seems to get any better. And what I’m going to define success as, is a performance boost that end users will actually notice; that people wake up and will focus on.
Now, if you go out and read those SQL Server 7 type blogs, they’ll tell you it’s because fragmentation sucks and you need to rebuild your indexes every day, you need to do defrags and reorgs. I’m going to talk about why that fix doesn’t make things better; it usually ends up making things worse. So everything I’m going to talk about, the slide deck, the demos, all of it is already up there at BrentOzar.com/go/defrag. So you can download this exact slide deck from slide share and get the demos, you name it.
The Two Kinds of Fragmentation: Internal & External
So, let’s start with data pages in SQL Server. Everything in SQL Server is in 8K data pages. Things start to get a little bit different when you start talking about Hekaton or clustered column store indexes and the way that they store data. But here, let’s talk about old school, conventional row store indexes. Every 8K page is kind of like 8K pages in the phone book. If you take phone books, that thing that Grandpa used to keep on top of the refrigerator and call his old friends from the war – and they never answered, because they were all dead. The White Pages up on top of the fridge, that’s what we’re going to be talking about in here at first for an example. Visualize the phone book organized by last name and then first name.
Now, in my town, I have a whole bunch of folks like Mr. Hall through Mr. Loggins is on this page, Mr. Loggins through Messina is on this 8K page. And Mr. Messina through Mr. Oates are all on this 8K page. Now, my phone book is just like the White Pages of the phone book that Grandpa had. The pages are packed in completely tightly. If you visualize the clustered index with the data on it, there’s no space left for new rows. I’ve rebuilt this index, it’s crammed in perfectly; 100% packed in and all of the 8K pages are in order. In a data file, it’s just a bunch of 8K pages all lined out in a row. So when I rebuild an index, under best case scenario, all of these 8K pages are lined up in one row in the file; page number one, page number two, page number three all in order.
Now, what happens when you have a phone book? Within a matter of minutes, someone’s going to move in.
In this case, Michael McDonald moves into my town and I need to have space for him in the phone book, but there’s no empty space on these White Pages. It’s not like I can just suddenly make a spot for Mr. McDonald. I’m going to have to do something by ripping one of these pages apart. I’m going to have to take the page that used to have the data where he needed to go, and I’m going to need to turn it into two different pages. I have this one 8K page here that has half of the residents from that page, and another 8K page with the rest of the residents.
Now, that new page isn’t page 4097 and a half; that’s not how this works. That new page has to be somewhere else in the database. If I go allocate a new page – Slack, [S-Handler] says, “Hall and Oates, Page Eater” – she’s a page eater…
So there’s two kinds of fragmentation on this slide. One kind, internal fragmentation, is empty space on our 8K pages. The second kind of fragmentation is external fragmentation. Outside of our 8K pages, they are not in physical order on disk. So when people say that there’s fragmentation, there are really two different kinds with two different causes, two different measurements, and two different fixes.
The first thing is, whenever I have one of these page splits, whenever SQL Server has to put some of the data on one 8K page and some of the data on another 8K page, there’s a way that you monitor that. A Perfmon counter called page splits per second.
When you want to track external fragmentation, pages that are not in order on disk, there’s a DMV you can check. There’s an average fragmentation percent on that DMV; that shows external fragmentation. And to see internal fragmentation, there’s a separate field that says average page space used, in percent. Now, tracking these and getting a really accurate picture on them involves scanning all of the pages; so people don’t like to do that.
Predicting Fragmentation with Page Splits per Second
Instead, they think about tracking the first thing in the list. They think about tracking page splits per second. It’s a perfmon counter, super low overhead, and when you see a lot of pages splitting, then that must mean that you’re having problems, right? Not so fast…
Let’s start by creating a table. I’m just going to create a table:
CREATE TABLE TheHeart
(ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
It has a primary key on the identity field of our table, and I just have a field in there called groove – the groove is in the heart. So an ID field starts at one and goes up to a bajillion, and then before I put any data in it, I’m going to go check my page splits. You can monitor SQL Server’s perfmon counters from inside SQL Server. You can run T-SQL queries and say things like, give me all of my perfmon counters where the counter name is like splits – doing the splits.
This gives me page splits per second, my SQL Server has had 39. Wait, that’s not right. This says page splits per second, this is actually cumulative. As you continue querying this over and over, as you’re having page splits, that number goes up. The name is a little misleading. It says page splits per second, but what it really is, is page splits, period. If you want to track it per second, it’s up to you, dear reader, to do some kind of differential over time and track this number over time.
Now, in my SQL Server here, I’m not doing anything else. so if that number goes up – and it just goes up when SQL Server starts up, because SQL Server’s doing its own thing – when that number goes up, I’m going to know that a page split happened. So if you’re going to try and do this demo along at home and follow along with scripts, hit F5 on this a few times, just to make doggone sure that that something else isn’t going up because you have other activity happening on the system. Here, I don’t.
So I’m going to put one row in the table. It’s a brand new table, there’s nothing else in it and I put one single row in it.
Then go back and check your page splits again – and they’ve gone up.
Because there’s a dirty secret that the page splits counter is hiding from you: adding a new page is also called a page split!
So if you’ve been monitoring page splits over time and going, “Oh my god, I’ve got so many of them. I must be having fragmentation problems.” No, you could be just adding data to tables, with an identity field, adding new data at the end of the table, allocating new pages. SQL Server calls that a page split.
It Gets Worse:
“Fixing” Page Splits with Fill Factor
I know people who look at that counter and say, in order to fix this, I’m going to fix fragmentation. I’ve got to make sure that my pages are all in order and that there’s empty space on the page so that I can avoid doing these nasty page splits. So they start playing around with knobs and buttons and they never go back to see if page splits per second actually went down. Even worse, what these yoyos do is they go back and look at the page splits counter and go, “Oh my god, it’s going up again. I’d better do more fixes.” And they shuffle more setting around, thinking that they’re making things better, when they’re usually making things worse.
So let’s zoom out a little further and ask is external fragmentation having your pages out of order on disk even bad? Is there even anything wrong with it? So let’s think through this in terms of a pop-quiz. First, will your maintenance tasks take longer if your pages are out of order? If I have a bunch of 8K pages on disk and they are kind of scattered around for the same object, are my backups going to take longer? What about my index and maintenance; what about CHECKDB? What about queries reading from RAM? If I have data in RAM, is that going to be slower? What about queries reading from disk? If I seek in – if I need to do an index seek down on disk and read just one 8K page off of there, is that going to be slower depending on where the page is? What if I need to do a scan; if I need to scan the entire object start to finish?
As you start to think through these, the answers are kind of surprising.
If I have a million pages on disk, just to pick a random number, and I need to back them up, it doesn’t matter if the objects inside are scrambled all over the place. As long as I’ve got a million pages and they’re in order on disk because the database is all together, the fragmentation inside that file makes no difference whatsoever. Same thing with doing index rebuilds and stats maintenance; it’s just going to take a really long time when you’re reading off of disk.
Will queries reading from RAM be slower? Well, what does RAM stand for? Does it stand for sequential access memory? No, it stands for random access memory. It doesn’t matter where those 8K pages are in RAM, accessing them is roughly the same cost. There are edge case gotchas around NUMA nodes and local and remote access; that has nothing to do with fragmentation. Rebuilding your indexes doesn’t fix that.
If you need to seek one individual row off of disk, it doesn’t matter whether his neighbors are in order or not. And then if you’re scanning, this is an edge case. So, if I need to scan a table from start to finish, it actually does matter if those pages are in order or not, because the closer they are in order, the more that SQL Server will do larger read ahead scans, grab larger chunks of data. The storage may be working sequentially instead of randomly.
But, before you get all excited about it, you need to understand what faster means. There’s a great example of it from a book, Systems Performance by Brendan Gregg. This guy’s a Linux performance tuner, not SQL Server, but he’s a genius in terms of Linux performance tuning. He has just great ways of explaining concepts; and Brendan needed to explain the difference between CPU speeds, memory speeds, and disk speeds. But he needed to put it in ways that human beings can understand. We don’t think in terms of nanoseconds or microseconds, we think in much longer human time scale; so these numbers don’t mean anything to us. What Brendan said was, imagine for a second, literally, that one CPU cycle takes one second. It doesn’t, it’s obviously much faster than that, but just to put it into human terms. If every CPU cycle takes one second, then accessing Ram takes a latency of six minutes.
We like to think that CPUs are fast but memory is really fast. That’s not true; Memory hasn’t even come close to keeping up with the incredible speed improvements out of modern CPUs. This is why modern CPUs have this storage cache on board, and it matters a lot for some SQL Servers that you stuff as much cache as you can inside those chips because that’s much faster than going out to RAM. Brendan says if you go out to solid state storage, it’s this fast, which equates to two to six days. Out in reality, a lot of us, especially when we overload solid state storage, drifts closer to one millisecond, which is a month. And a lot of us don’t have the best-shared storage, and our shared storage drifts out to, say, 40 or 80 milliseconds, which is four to eight years. Users don’t notice the difference if you cut latency from, say, three months to two months. Where they notice it is where you cut latency from here down to here.
No one ever went carrying their DBA through the halls of the company on their shoulders going, “Yay, fragmentation saved me one millisecond on my queries, it’s amazing.” No, they get really excited when you can service that data out of RAM, but anything else feels pretty doggone slow. There are exceptions. You get blazing fast solid state storage from companies like Pure. That stuff really is quick, but then it matters even less about the fragmentation. So when I look back at this pop-quiz, putting the data pages in order doesn’t make your maintenance tasks shorter, it makes them take longer.
It Gets Even Worse:
Fixing Fragmentation Temporarily
So we’ve got two ways to fix this; we could either fix it via reorgs – reorgs just move around one 8K page at a time. Rebuilds build a whole new copy of the index from scratch; perfectly packed in, crammed in as tightly as possible. Well, as tightly as you allow it to be; more on that in a minute. Are all the physical pages in order when we’re done? No, not if you do reorgs. Reorgs are just shuffling one individual 8K page around at a time, and there are not enough empty pages in your database to put them all next to each other in a row. SQL Server’s just finding little holes in the data file where it can go and shove these things in to.
Sure, it can be done online, but look at these drawbacks. It’s actually hitting your transaction log, which slows down your database mirrors, availability groups, transaction log shipping, differential backups. You’re making all of your maintenance worse instead of better. Rebuilds really do fix it. It can put your entire index in order on disk, but people often don’t do rebuilds because they are so time intensive.
J.D. Walker says in Slack, “Jeez, with the way that this looks, why would you even bother reorganizing indexes? Just rebuild during times of low traffic.” You’re starting to get it.
Users hate the fix that we’re doing. Our maintenance tasks are actually making things worse, and they’re not making it so that tasks are any faster when we run them out of RAM.
If you’re going to fix it – and I’m not even sold that you should fix it – the way that you would fix it would be to rebuild these indexes periodically. You don’t want to do it every day, you simply don’t have outage windows for that.
Oh Yes, It Gets Even Worse:
Fixing Fragmentation with Fill Factor
Instead, people start to think, “Well, there’s another way that I could fix this. I’ll set fill factor.”
Fill factor is the amount of empty space that you leave on each of these 8K pages. Say, for example, I have an *K page chock full of data, if I set fill factor to, say, 80%, I’m really leaving 20% free space on every page. This sounds smart at first, because this way, when Michael McDonald moves to your town, instead of this page being completely packed in, there’s an empty space on the row just waiting for Michael McDonald to just go slide right in.
The default for SQL Server fill factor is 100%. SQL Server, by default, crams the data in as tightly as it can when it goes and rebuilds indexes. Sometimes you’ll see people set fill factor at 0%; that’s exactly the same thing, SQL treats them both the same way. The concept of this, that these rocket surgeons back from SQL Server 7 will tell you, is go set fill factor to a lower number so that there’s space available on the page; when Michael McDonald moves in, you’re not screwed.
Let’s think through that in a little bit more detail. Those of you who’ve been through my free class, How To Think Like the SQL Server Engine, will recognize this page. This is an 8K page from the users’ table at Stack Overflow. The users table clustered index is clustered on ID. It starts at one and goes up to a bajillion. It’s an identity field. Every time that you insert a new row, guess where it goes.
Now, what do you want for a fill factor on this page? How much empty space do you suppose that we should leave behind?
Do I ever need empty space on this page for an insert? No. Unless someone goes and turns on set identity insert on, and they purposefully backfill some deleted rows. But unless they do that, I’m never going to insert rows on a previously full page. What about updates? Maybe, maybe one of these fields that’s NULL, for example, maybe we would go back and populate. Or maybe fields that have a variable width, like display name or location, maybe someone had a tiny value before and they’re going to update it to a wider value.
So what’s the right fill factor for this? I don’t know. How could anyone know? There’s no way you could know unless you, A, measure page splits accurately and only count the bad page splits and then change fill factor in tiny infinitesimal amounts, trying to find the exact sweet spot where you never have bad page splits to begin with. Newsflash, if you didn’t even know about this before, there’s a whole new world of index tuning things that you want to be focusing on long before you go and play around with that one; you talk about diminishing returns there.
And then, what about deletes; do you ever need to leave empty space behind on the page for deletes? Well in the case of Stack Overflow, you just don’t delete rows; that’s not allowed by that particular system. But, maybe in your system, you do allow for deletes and deletes actually leave space back for you for people doing updates. It’s kind of like a healing type thing. So taking this clustered index, man, it’s really hard to make a case for a fill factor of anything other than 100%. It’s not going to help inserts, it’s not going to help deletes, it’s only going to help updates and only in the case of variable fields or NULLable fields.
Let’s take another index. So this is also from How to Think Like the Engine:
This is a non-clustered index on last access date and ID. When you have an index on those fields, here’s what it looks like on disk. You’ve got a bunch of last access dates sorted by last access date. Every time that someone logs in or touches a page on stack overflow, they go and flick the last updated date to a last access date, to be current. So how about this one? Do you want empty space on this page to account for new users? Whenever someone logs in for the very first time, what’s their last access date? Is it in the past? Well until Doc Brown logs in with this DeLorean, people aren’t logging in in the past. We don’t need any empty space on past fill tables.
What about updates? Every time that a user logs in, touches a page – what do we do with their row? We pick it up from wherever it was and we move it all the way to the end of the table; the current section of the table. People don’t log in in the past, they log in in the current. So I’m always picking up rows off of this object and moving them to the very end of the table. After half an hour even, this page is going to look like Swiss cheese. There’s going to be so many holes in it; it’s almost like SQL Server’s purposefully stuffing the fill factor on this table. And I do need to rebuild this periodically because some users never log in again. Roughly, out of the five million users in the current data dump, three million of them have a reputation of only one point. They logged in once to ask a really crappy question with a throwaway account; probably posting their company’s source code out there so that everyone could see it. And I probably need to rebuild this in order to keep it nice and compact to get it closer to 100% fill factor, because by default I’m leaving all kinds of empty space on these pages.
So What’s the Right Default Fill Factor?
What’s the right fill factor for this? Just like most objects, you want 100% fill factor. And when you choose to fill your pages with anything less than 100% data, guess what you’re really doing – you are purposefully inflating the size of your object by leaving empty space behind on these 8K pages.
You’re causing internal fragmentation.
When you set fill factor, you’re the one that’s slowing your SQL Server down.
So, let’s look at the pop-quiz again. Hat now, if your pages have empty space, how are these things impacted? CrashDan says, “The main reason I rebuild and reorg my indexes is to appease vendors when they say they want to look at performance and they blame fragmentation.” Yes, we’ll talk about that in more detail as well.
So now, will maintenance tasks take longer? Yes; if you purposefully inflated your database. Because by leaving 20% empty space on every 8K page, you just made your table 20% bigger. So now your backups take 20% longer, your index rebuilds 20% longer, updating stats, it’ll take 20% longer to scan those tables.
CHECKDB, I’d like my maintenance to take 20% longer, please. I’d like my RAM to be 20% smaller, please. Because every one of these 8K pages, it’s the same unit; whether it’s on disk or in RAM. When you leave 20% of your 8K pages free, guess what you’re saying – I want 20% less RAM.
Will reading stuff be any faster? No. your queries scanning from disk will now take longer. You are purposefully slowing your SQL Server down by setting fill factor, AKA setting fragmentation. The problem is actually us. We’re the ones who’ve been monkeying around with this all along. By rebuilding and reorganizing our indexes daily, we’re making our maintenance windows take longer, we’re inflating the size of our backups. Users aren’t any happier because it’s not making any big of a difference; and by setting fill factor, we’re making things even worse.
I like to joke and modern airplane pilots like to say that, modern airplanes are so advanced, there are only two things in the cockpit; there’s the pilot and there’s a dog. It’s the pilot’s job to feed the dog; it’s the dog’s job to bite the pilot if he tries to touch any of the knobs or buttons. SQL Server needs a dog because we’re in there – ooh, let me play with fill factor, let me change this setting, let me try this sp_configure button, ooh I read this was faster once. Ooh, ah, this will make all kinds of difference… And we’re flipping things in the cockpit left and right, when really my job, as a consultant, often is to go, “Get your hands off that; don’t touch that. It’s not for you. Go back to basics.”
We need a dog to keep us from touching all of these advanced options, that to make sense in very specific scenarios, but they are not global, across the board type stuff.
So How Do We Really Improve Performance?
So what are we going to do in order to make things faster? Because clearly, we’ve been defragging and rebuilding our brains out and things aren’t getting any better. Here’s the short answer about what you’re going to do when you get back to work.
First, you’re going to go set fill factor at the server level back to 100%. Sometimes I see people set fill factor at the entire server, it’s one of your right click on the server and go properties options, or right click on a database type options. Don’t touch that. Leave that back the way it was. There may be specific indexes where you need to touch fill factor, but until you can measure those indexes and their page splits, get your hand off that object.
Second, use sp_Blitz and sp_BlitzIndex to check for your existing index fill factors, because some rocket surgeon may have gone around behind you – let’s be realistic, it was you, but we’re going to blame it on the last guy. Some rocket surgeon might have been playing around with fill factor and set it on specific indexes. We’re going to need to go through and rebuild those indexes manually. I wish I could tell you a story about when someone accidentally set fill factor to 10% instead of 90% and blew out the size of their database. And I wish I could tell you the story about how all we had to do was set their index fill factor back to 100%, and immediately after rebuilding those indexes, the whole database fit in RAM. I wish I could tell you that story, but the problem is, I’ve had three of them. Three of those where the DBAs have gone in and played around with those buttons and gotten the numbers wrong.
After you set that, you’re going to go through and rebuild your indexes. You do have to do this to get your new fill factor to go through and take effect. And then, we’re going to pick the right metric to monitor; the thing that’s actually going to make a difference, put fill factor down and just ignore fragmentation for a minute – let’s focus on a much bigger metric that will actually make a difference in terms of your performance tuning.
Step Back and Focus on Wait Stats
In order to explain it, I’m going to take the world’s simplest SQL Server. It’s only got one core in it, and that’s it.
If you only have one CPU, and I mean one core, not even one quad core CPU, just one core – as a query starts running, it’s going to churn a long through – say I’ve got a 1TB restaurant table and I’m going to start churning through it – while other people pile in behind me, maybe they’re waiting on something, waiting on disk, waiting on CPU, memory, latches, whatever. My query is going to keep going until one of three things happens. One, I guess, is that it finishes. Two, is that it uses CPU for four milliseconds straight. When it uses CPU for four milliseconds straight, it’s going to step aside; or three, when it needs to wait on something.
Let’s say, for example, that my restaurant table is too big to fit into cache. It’s a 1TB table, we’ve got all kinds of stuff inside there. Whenever SQL Server says your table needs to wait on disk, it’s going to yank me aside like the gong show and someone else can step inside in front of me. As a result, I care a lot about whatever’s on the right-hand side here, in this right-hand column. How many queries are waiting, what are they waiting on? Is it disk, is it CPU, network, memory, what?
Waits roughly break down into three categories. The top category, resources, you can think of as things that you would walk into the data center and touch; that is if the guard wouldn’t shoot you. You walk into the data center and touch, plus latches and locks when someone else has the object that you want. These are the common performance tuning scenarios.
The second two, yeah, they’re important, but they’re not usually the things that end users care about. Remember what our goal is, I want to make a difference that end users are going to notice because a server doesn’t give you a raise, users give you a raise. Users are the ones that you want to focus on making happy.
So if I’m going to measure this – if I’m going to track how bad a SQL Server is, I hate using adjectives like, “Well your SQL Server’s working really hard. Or, “This SQL Server’s really slow.” I want real numbers that I can track back down to. Is this SQL Server working hard? Kind of, but not really. CPU could go to 100% for four milliseconds. It would drop momentarily, 100% again, but realistically if everything I need is in RAM, this query’s just going to run until it’s done and it’s not really working that hard at all.
What about this server? Say that Jeremiah’s waiting on disk and Kendra’s waiting on CPU – this is working harder.
What about this one? Adam’s running sp_whoisactive, Ola Hallengren’s trying to backup a database, Paul Randall’s trying to run CHECKDB. This poor one core SQL Server is working harder. If I start tracking out how many things are in the right-hand column and what they’re waiting on, the amount of waits that I have in any given time span is starting to drift longer. And notice that some of these are stored procedures; like Adam’s running a stored procedure here, Ola usually backs up his databases with stored procedures. CHECKDB is not a trivial amount of work.
When I was thinking about just one table queries, if I, say, just select stuff from table, that looks pretty simple. But that’s not how your real world queries look. Out in the real world, your queries are heinous. You start seeing things like select this table and this table and this table. And in this execution plan, I’ve got one temp table scan that’s gone parallel across multiple cores. I’ve got another clustered index scan gone parallel across multiple cores. Another clustered index scan gone parallel. This hash match went parallel. This clustered index scan went parallel. This hash match went parallel.
When I’m dealing with a live running SQL Server, it’s not just that I have simple queries that are each spending a little bit of time waiting – I can have one heinous query that’s trying to do several things simultaneously. For one core, with even just one crappy query running, it’s not unusual to see in the right-hand columns there, in the waits list, several tasks all stacked up.
Alright, so if my SQL Server looks like this for one second straight, say that it looks like this for exactly one second on the clock and I’ve got 15 tasks waiting on storage, I’ve got three tasks waiting on locks and I’ve got one task waiting on CPU. If it looks like this for one second, how many seconds of waits do I have? Well, I’m no data scientist but I can add those numbers together and I would have 19 seconds worth of waits in one second on the clock; even with just one CPU core.
Tracking Wait Time Per Hour
That’s why, when you look at any monitoring app, they will all have a graph that looks like this. it’s the number of hours of wait time that you have for every hour on the clock. Say that, from the hours of 6 am to 7 am, I had 18, 19 hours of waits on the clock, because I had lots of things piled up waiting. Now, in my graph, I have something really similar, simple here. I have the waits just tied down to some simple categories. In yours, you’ll have confusing things like CXPACKET, SOS scheduler yield. SQL Server 2017 starts to group waits together by category, and we’re introducing this in sp_Blitz and sp_BlitzFirst soon.
So six o clock, I had lots of hours of wait time on the clock for every hour on the clock. Seven o clock, things got a little bit worse, eight o clock, things got even worse, nine o clock, things are still pretty bad. And then all of a sudden at ten o clock, we spent much less time, in this case, waiting on storage. So the question becomes, what happened between nine o clock and ten o clock. And there’s not one answer, there’s a bunch of really good answers. I’ll watch not in Slack to see what you guys come up with here.
So what do you guys think happened between the hours of nine o clock and ten o clock? I’ll put Slack up so that you all can see it here. So now you can see – and too, if y’all are watching online too, you can go to SQLSlack.com. SQLSlack.com will get you into the live chat that people can see.
T. Friedman has a good answer. He said that, “Your maintenance tasks finished. What if you were running CHECKDB, index rebuilds…” Because god knows you all love index rebuilds. “And all of a sudden, you didn’t have to wait on storage anymore.” That’s a very good answer as to why your waits would drop between nine and ten. LJeb and SQLCWBY both say that there’s less user activity. The users went to early lunch; maybe I have sales people, they decided to start drinking. It’s 9am, 10am, time to go out for margaritas. Shannah says, “Set single user.” I like that a lot. Truncate tables would be another good answer there. Daniel says, “The data warehouse load completed. Maybe we had a bunch of scheduled reports that finished overnight. Server backup – MikeJWhitty has a good point, “What if all the databases had been loaded into memory? Say maybe that I did a SQL Server restart at 4am, my buffer pool had nothing in it. Now we’ve gotten done priming that buffer pool. We don’t need to wait on storage anymore. SQLCWBY says, “The stats were updated and we got a better plan.”
Isn’t it funny that in a room full of database people who are supposedly data professionals, nobody said we tuned a query, nobody said that we tuned an index? The hell is wrong with you people? Why are you focusing on rebuilding indexes? Get to work. Tune queries, tune indexes. This is your job. This is your metric as a data professional. When you do your job right, these waits drop. When you do your job wrong these waits stay the same or heaven forbid, they even go up. Shame on your people – holy smokes.
I like – SQLCWBY says, “I’m going to be out with the sales team drinking.” I would certainly not blame you there. That’s why I like working in sales teams.
Now, to get this number, that metric – that graph is super important. And all the wait and all the monitoring tools have it. It’s not always on the top, sometimes you have to go poking through a monitoring tool to go find it. There’s a DMV that you can go query, sys.dm_os_wait_stats; but this thing is pretty doggone near useless. All it does is give you a cumulative number since your SQL Server was restarted. Hello page splits per second, same exact problem.
So what you have to do is you have to go break this out over time. You have to go figure out what it looks like for business hours versus after hours, and I’ll show you how to do that here in a minute.
When this number goes up, it’s typically correlated to how many requests your SQL server is asked to do. The queries that are running, backup tasks, maintenance jobs, BI reports, all those things you all were talking about in the Slack channel, when that time goes up, it’s often related to batch requests per second. The more work that’s being done, the longer we’re going to have to wait as these tasks take on. However, it could also go up if your end users are running crappier queries or if someone drops the wrong index; or if the vendor changes the application and queries that used to do an index seek suddenly now have to do a key lookup.
When this goes down, this is your success metric. This means you are doing things right. Any time I show a number, people always want to know, what’s a good version of that number? What’s a bad version of that number? When your server’s wait time is zero or near zero, SQL Server just isn’t working that hard. That doesn’t mean that every query is fast, it just means that the server overall isn’t really working that hard.
If you only see one hour of waits for every one hour on the clock, SQL Server, still not really working all that hard. But if you see multiple hours for every hour on the clock times the number of cores, now we can tell that the SQL Server’s really working hard. So let’s go measure your SQL Servers. And the way that we’re going to do it is with sp_BlitzFirst – sp_BlitzFirst is our performance health check stored procedure. Totally free if you go to BrentOzar.com and click on tools up at the top. You can go grab our First Responder Kit. The one I’m going to be focusing on right now is just sp_BlitzFirst, and I’m going to show you how to use it to monitor or check your waits since your SQL Server started up.
Demoing sp_BlitzFirst to Check Your Waits
I’m going to switch over and show my VM here. Let’s look at this guy. So what I have here is SQL Server 2016 – everything I’m going to show you works on all supported versions of SQL Server; 2008 forward. If I run sp_BlitzFirst – if I just run it by itself, it takes at least five seconds. Because what it’s doing is taking a snapshot, or shnapshot, as Sean Connery would say – hopefully, Doug Lane is in here with his shtored prosheduresh. It takes a snapshot of your DMVs including sys.dm_wait_stats and then waits for five seconds and then takes another snapshot and compares the difference to see what happens during that time. I won’t make you put up with any more of my horrible Sean Connery.
If you every get the chance, Doug Lane does an amazing Sean Connery; just absolutely fantastic. Doug Lane is also the man that thought me that Sean Connery would pronounce this is SHEXPACKET. It’s not CXPACKET, it’s SHEXPACKET.
So when I run sp_BlitzFirst, what it tells you is things like right now no problems were found on my SQL Server because there’s nothing going on on the SQL Server; it’s just sitting around bored. But look at what it’s got here in terms of a dashboard. It tells me that the number of batch requests I’ve got per second is zero and my wait time per core per second is zero because this SQL Server isn’t doing any kind of hard work. That’s not fun, let’s go make our SQL Server do some hard work.
I’m going to open up an app called SQL Query Stress, and this is an open source utility originally written by Adam Machanic; the same Boston area MVP who wrote sp_whoisactive. If you Google for this name, SQL Query Stress, you’ll find a blog post that I wrote on how to use this to do load testing. I’m going to use the Stack Overflow database here, just because it’s so much more fun than AdventureWorks. But I’m just going to load up the same scripts that I use in my SQL Query Stress article and have this thing go start hammering my SQL Server, to put it under significant load. Immediately when I show this to data professionals, they’re always like tell me more about how this works. Just go read the blog post on SQL Query Stress, and I go into details about how this load tuning query works.
But now let’s go back over to sp_BlitzFirst, and let’s go run that again. Now you’ll notice it usually takes a little bit more than five seconds because my SQL Server is under horrific load. So it’s not doing more work, it just takes more time during the same kinds of analysis queries.
SQLCWBY says, “I owe Brent a beer the next time I’m in Chicago.” That sounds good, I like that. JDWalker says, “I’m astounded that something so useful and needed is actually free. I’ll buy Brent all the beer he wants if he and I are ever in the same place.” Oh, I got to remember that. And then AB says, “That’s what makes this such a great community.” Yeah, absolutely. Now, this thing, holy smokes, it returns a lot more info. You’ll notice that there’s URL columns – I’m going to hide those, for now, just to make this easier to see on the screen. If you ever have questions about what each one of these things means, you can just copy and paste it into your web browser. I wish I could make a clickable link in SSMS go to a URL, it just doesn’t work that way.
So here, now we have our same dashboard section. Here we have batch requests per second, we’re doing 48 batch requests per second. And look at wait time per core per second, 303. Think about that right hand column in the PowerPoint where I talked about the number of things that are waiting in any given second. On any given core I had three hundred and three seconds of waits piled up in the right hand side. Tell me, Brent, what were those waits? I’m glad you asked. If you see up here in the wait stats section, now I can click to see details on my top waits. I’m going to click on the top one, CXPACKET – and it says for 22,798 seconds over the last five seconds, SQL Server was waiting on this particular bottleneck. And I know what you’re thinking when you see that number; Brent, your math sucks. And you would be right, I did fail the daylights out of pretty much every math class I had in college. Did really well in high school, then went to college and discovered that I had my own dorm room and there were women around and alcohol and suddenly my grades did not do so well. Not that there weren’t women around in high school, and not that there wasn’t alcohol around in high school… Look, I had a hard childhood; or an easy childhood, depending on how you look at it.
So the thing here isn’t that my math sucks. 22,798 seconds over five seconds, I had a lot of stuff in the right-hand column and I have a lot of cores processing queries all al exactly the same time. This bottleneck, whatever this bottleneck is, is what my SQL Server is waiting on. Pro tip, there’s no fragmentation wait type. There’s a lot of wait types, none of which are named fragmentationEX.
So I want to go find what my SQL Server’s waiting on. Now, in this example – I’m going to go stop my load test, just to give my poor server a break. In this example, I happen to query my SQL Server as of what’s happening right now. But out in the real world, you need to see what’s been happening recently. So that’s where the since startup equals one parameter comes in. when you run it with since startup equals one, it doesn’t even bother taking a sample; it just says since the SQL started up, what have your wait types been. Here, in my since 2.1 hours that my SQL Server has been up, I’ve had 2.7 hours of CXPACKET, which might sound like a lot, but I got a lot of cores here. So my per core per hour still ends up being near zero. On mine, it’s not that big of a deal, but on yours, you will have high wait types and now you want to go find out what’s causing those high wait types.
What Common Wait Types Indicate
This is where I’m going to pop back over into the PowerPoint and show you that. Show screen number two, there we go. So in here, some of the waits are like poison. You happen to see in my sp_BlitzFirst, I had a poison wait at the top of the list, those are really nasty and you’ve got to go focus on those first. If you follow the links, we give you much more information about how to solve those poison wait types. Further on though, you have to work through the list and see alright, if it’s CXPACKET, what does CXPACKET mean and how do I make that go better.
Here’s your little common decoder ring for the most common wait types that are out there. If you see CXPACKET, that means you’re having parallelism on the SQL Server, which may not be bad. In order to figure out if it’s bad, you got to start going and looking at the queries that are typically going wildly parallel. The easiest way to find that is to go run sp_BlizCache. sp_BlitzCache will let you sort the most resource intensive queries by different things like reads or by CPU. If you have locking queries, your best bet there is usually to sort by the queries that are running the longest or doing the most reads. If you have page I/O latch, that means you’re reading data pages from a data file. There’s the fragmentation wait he says. If the pages were in order, they would be faster. Not so fast, speedy. Let’s go take a look.
Now in this one, say that you see you’re waiting to read from disk, being page I/O latch, let’s run sp_BlitzCache, sort it by reads and see what happens. I’m going to pop back over and show my VM again, and I’m going to say, sp_BlitzCache sort order equals reads. What this does is it analyzes your execution plan cache. You don’t have to run a profiler trace, you don’t have to turn anything on, stuff is just on by default, it just always works. Now, it doesn’t necessarily always – of course, SQL Query Stress is still going. So this is a thing with opens source. When it breaks, you get to keep both pieces. So I’m going to kill it forcefully with a vengeance, I’m going to use sp_whoisactive to go use Adam’s stored procedure to find the queries that Adam’s app left behind.
He is such a nice guy, he has amazing restaurant recommendations and beverage recommendations as well. Highly recommend it, but sometimes his app doesn’t quite clean up after itself. He will tell you too that he open sourced it and it’s someone else’s problem now. Eric E. J on GitHub, thank you, Eric, you’re amazing for managing this.
So this shows us the top ten most resource intensive queries on our SQL Server sorted by reads. It’s really tempting to go in and try to copy paste this query text out, don’t do that. This is only just to give you a rough idea of what it’s doing. If you want to see exactly what it’s doing, scroll further over to the right and you get the query plan. You can then go click on this query plan and see more details about that particular query or that particular stored procedure.
Wait, what? Missing index? This missing index would have made the query 99.7% faster? Fragmentation won’t make your query 99.7% faster. Are you telling me that I’ve been doing the wrong thing? Yes. Yes. For the love of god, yes. Stop shuffling deck chairs on the Titanic, and start looking at what are the most resource intensive queries by the wait type that you have and look at what their problems are. sp_BlitzCache, Erik Darling on our team has done a phenomenal job of analyzing these execution plans and throwing warnings in here to say, “You have a query that’s got one missing index on it. You got another query that’s got forced serialization. Compilation time outs.” And if you don’t understand what any of these means, he’s got a decoder ring down at the bottom that map up each warning to giving you a whole webpage telling you more about tuning that particular query.
As Hannah says, when we do create that index for 99%, then we can rebuild it all the time. Yes, of course, then you can rebuild your brains out. So this is what really makes an amazing difference in terms of performance. It’s not shuffling the pages. It’s finding the queries that are causing your biggest waits and removing those.
Trending Waits Over Time with sp_BlitzFirst
Now, I did tell you one thing that I skipped over. I said for sp_BlitzFirst, only tells you either what’s happening on your SQL Server right now, or since start up. But there’s more that it can do to generate that little graph over time to show what your waits are over time. If you go search for sp_BlitzFirst, and you read the documentation – I know, I’m making you read the documentation, what a horrible person I am. If you go to sp_BlitzFirst and read the documentation – if you don’t like documentation, you can see my foul little face there wearing one of my favorite t-shirts. If I scroll down further, there’s got this section, storing sp_BlitzFirst into a table.
You can dump your wait stats, perfmon stats, file stats into a table. Do it every say, one hour, or 15 minutes. Dump it into a table and then you can do back and graph your wait stats over time. You can even build that same graph that I’m dealing with over in my PowerPoint. So lots of people in Slack are like, “Oh my god, wait a minute, I’m starting to understand, I don’t have to rebuild my indexes all the time.” Or at least that’s why it’s not helping.
When I show people that they can dump sp_BlitzFirst into a table, one of the questions that folks always ask is, “Alright, well how often should I do that and how much of an impact does it have?” I wouldn’t go more than every five minutes because you’re not going to make decisions based on granularity that small. I’m fine with running it say, every 15 minutes to a table and that at least gives me a rough idea of what my waits look like at different times of day.
The problem with since start up, whenever I run sp_BlitzFirst since start up equals 1, I bet your wait types have things like backup I/O and backup buffer, which means of course that they’re happening overnight. You can also separate out CHECKDBs waits because it’s causing its own heinous waits when that thing goes and runs.
If you have questions about what any of the wait types mean, your best bet is to take a picture of this, the result set right here, what your wait stats are out of sp_BlitzFirst since start up. Include all the way out to average milliseconds per wait and then post that in Slack. I’m pointing over there as if you could see my Slack window. If you take a picture of that and post it over in Slack – I’ll show this now – to join in the room and the discussion, go to SQLSlack.com. You can post your screenshot over in there and some of the other fine folks will be able to answer some of that, and I’ll go through it after my presentation finishes too.
You’ll notice there’s lots of other cool notes or things to say in there in Slack. There’s also rooms for consulting if you want consulting help. This is for other consultants who want to give advice to each other about our first responder kit, how to get started with presenting or speaking, DBA tools, the super popular PowerShell scripts, and even the high altitude SQL Server project where Drew Furgiuele recently launched SQL Server up in a balloon up in the air – I can’t remember if it was an Intel dual I think that thing was.
So now, back over to our presentation here. Show screen two. I’m going to be all over the place today. Let’s look – I’ve been looking at Slack. Let’s look at the GoToWebinar Q&A. Let’s see here. People are doing all kinds of “Could you tell me more about CXPACKET waits and how to optimize them?” No, your best bet there because I’ve got a whole presentation on it, is go copy paste the link out of sp_BlitzFirst and you can see way more details.
What If My Vendor Tells Me to Defrag?
So now what if someone wants to play doctor? This question came up in Slack a few times here already and in the GoToWebinar list. So say that you’re on vacation in the South of France, enjoying the castles and the Loire Valley, and all of a sudden you get a text. “All performance is bad and Ted says that it’s because you don’t have an index rebuild job. He says fragmentation is really high and he thinks it’s your fault. They’re not going to touch anything else until you go do fix fragmentation.”
Here’s how I react. “So just to make sure, person on the other end of the line, you’re guaranteeing me that rebuilding the indexes will fix that.” “That’s not exactly…” “Okay, what metric should I monitor to know that the user experience got better?” “Well, you can check fragmentation at…” “No, users don’t notice that. Tell me the metric that end users will notice.” “I suppose you could look at how long the queries take in order to do run time…” “Great. Give me a query. Give me the query that you’re believing fragmentation will cause to go slower or faster, and then together on the screen, we’re going to run the query, measure how long it takes. I’m going to rebuild the index, we’re going to measure it again.”
Now, those of you who are familiar with parameter sniffing problems will also know that you need to check for that as well. Check to make sure you’re getting the same shape of the plan before and after. And when you do, because if it’s a parameter sniffing issue, that’s totally different. But if you get the same shape of the plan optimized for the same values, then wrinkle that up, print it out on a piece of paper, wrinkle it up and shove it in their little mouths and make sure that they never ask you to rebuild indexes again.
So What Do I Do with My Nightly Maintenance Window?
So instead what should you do every night? If you’ve got this maintenance window that you’ve got all this time to do every night, what should you do instead? Let me show you something truly horrible, and I’m not happy with this, but it just is the way that it is in relational databases. So I’ve got here my SQL Server Management Studio. I’m going to go look at my list of databases and make sure I drop my old database from the last time I did this demo. Those of you who are looking at my screen will notice that I have some very amusing database names because why wouldn’t you, right? Why wouldn’t you have a database with the shrug emoji? You’ll also notice that I have a database named Period. You can also create a database name Space if you ever really want to freak people out.
Here’s what I’m going to do. I’m going to go create a database. It’s all you have to do to create a database. Just run the create database command with the name attached. You don’t have to say the files. You don’t have to say the autogrowth. You don’t have to say anything. Create database and the name of it. Now, I’m going to go back it up because I’m a good database administrator. Go back up my database, I’m going to take a full backup and I’m going to take a log backup.
Now, I’m going to go in and create a table. In my brand new car, I would like leather seats, sunroof, and racing stripes. So I’ve created a table called options with those three things in it. Leather seats, sunroof and racing stripes. Just to be really anal retentive, I’m going to go ahead and backup that database again. I’m going to do a backup of the log now that I’ve done my thing.
Everything is safe, right? Kind of. I’m going to go back into object explore, I’m going to refresh the list, and now I have brand new car here. I’m going to detach it. Now, I’m going to detach it not – well, I’ll just take it offline. I’m going to take it offline not because this is when corruption happens, but because storage things happen underneath SQL Server while it’s working and I need to have SQL Server look the other way for a second while I go play around inside that database.
Let’s go take this database offline. Drop all the active connections, go. Now I’m going to fire open something called a Hex Editor. The Hex Editor that I’m going to us is called XVI32. If you’re going to try and do a demo like this at home, you need to run it as administrator because this way, you’re going to be able to open the data and log files that you don’t necessarily have permissions to grab normally.
I’m going to run it as administrator and then I’m just going to go open that data file for my brand new car. When you look in Hex Editor at your data file, it looks like a bunch of gobbledygook. This Hex Editor is kind of like notepad but for binary files and you might go, “This is some kind of weird proprietary format, I can’t possibly imagine what kind of stuff’s inside. Wait, here it says brand new car right there. That’s plain text inside the database.” It gets worse. Click search and find and type in sunroof just because that’s one of the options that I put in my brand new car. And there’s the data in your database.
Sometimes people think that data and log files are somehow magically encrypted or that no one could see what’s inside their database. Dude, if you don’t encrypt it, it ain’t encrypted. And people can go in and see things like social security numbers, credit card numbers; anything that you store in there, in plain old text fields.
So let’s type over it. Instead of leather seats, I’m just going to type on the keyboard, vinyl, and then instead of racing stripes, I’m going to say racing – I don’t know, CD deck, and hit save. Close and now over on my SQL Server, I’m going to right-click and I’m going to bring this database back online. What this is simulating is that storage screwed something up underneath. Storage doesn’t tell SQL Server when it screws something up underneath. Bits just rot on their own. Bring online, success.
And if I refresh that database – database list still fine. I still see my database, I still even see my tables list. SQL Server doesn’t check for corruption whenever it goes and looks at a database. It can’t. Your database is too large. There are too many things that it would have to go look at. In order for it to find corruption, I need to do something like select star from DBO options. I need to go read that 8K page that I built and then, the only way you find out if there’s corruption is one of two things. SQL Server has to read that page for user query, like someone has to go and run a select. Or someone at SQL Server has to read that page for a system type query, running a CHECKDB or a backup with CHECKSUM.
If I just run a backup right now, will that detect that corruption? Let’s go see. Going to go all the way back up to the top. Here’s my backup database command. I’ll go run it again. Backup database successfully processed. SQL Server will happily backup databases with corruption all day every day, and not throw an error.
Now, how do you recover? If you’re like most database administrators out there, you’re not running CHECKDB all that often, and you’re either deleting your older backups or deleting like writing over the same names. Look what I just did. That data’s gone. If I either deleted that backup or I overwrote on top of it, it is out of here.
Should I be worried about corruption? Let’s go back and look at the PowerPoint deck for a second. So now we can see what people have been saying in Slack. Daniel says, “Here comes the Hex Editor.” T. Friedman says, “That’s a little scary.” SQLCWBY says, “With that discussion how” – I’m going to come back to those, like the fragmentation pieces at the end. I’ll go through and do the corruption checks stuff first.
So I showed you something awful there, but we need to think about too, what causes corruption. People think it’s just not going to happen on its own. If you go through any SAN vendor’s firmware list, you have Dell, HP, EMC, Hitachi, whatever. Get your SAN admin to give you the most recent list of changes inside their firmware and they give you a horrifying list of bugs. Spectacular list of bugs where the storage caused corruption and they went back and fixed it. “Well, they know about those bugs, right, they’re fixed.” Only if you’re applying the firmware updates, and if you’re not, you’re not.
It’s not just the storage either. SQL Server every now and then has logical or physical corruption bugs. Back here, data corruption occurs when you run an online index rebuild and it goes parallel. Over here, non-clustered index corruption may occur when you run an update statement with a NOLOCK hint.
Well first off, if you run an update with a NOLOCK, you kind of deserve exactly what you get. So when we go through and look at fixing this, if I had this as my real production SQL Server, I’m down. I’m downtown Julie Brown. This thing is out for the count. In real life, if you go to BrentOzar.com/go/corruption, I have a checklist of exactly what I would do in your shoes when I hit database corruption. Don’t call me for corruption help because you’re inevitably going to face it at Friday at 10pm when I’m hammered on tequila at Frontera Grill here in Chicago. Say for example that you run into a typical timeline where you’re doing backups, say every single day, you’re doing fulls everyday, but you’re only doing CHECKDB once a week. You’re doing log backups say, every hour, and on Sunday at 10am, one of your clustered indexes became corrupt. Storage doesn’t tell you when this happens. Storage just corrupts itself and no one knows about it until we read that page. Time keeps on moving along, Monday we have a full backup, we have more log backups, and of course by default, that full backup on Monday doesn’t send any kind of alerts, just succeeds.
Then sometime on Monday, someone run a select command and they get that red error back. They’re like, “That’s horrible, I don’t know what’s going on with that. I’m just going to close that and I’ll come back later and hopefully, it’s fixed.” Or if it’s any kind of app problem, our apps rarely surface the entire message to the end users. End users just see some kind of error and go, “Guess that’s not working today, I’ll come back to that later.”
And time continues to roll forward, and in our shop, let’s say that we only keep the last 24 hours of log backups online. And we have a job that goes through and deletes the older log backups. We still have our fulls, but we don’t have the log backups anymore. Now, finally, sometime on Tuesday, we get a call from the end users. “I’m having all these errors when I try to select from this table. Can you see what’s going on?” And you go in to run either CHECKDB or run the query, and you’re like, “Sweet potato, we have corruption.”
Now, how much data are you going to lose? When this happens to you, assuming that it’s a clustered index that’s corrupt and you don’t have non-clustered indexes to rebuild that field, which is a huge, spectacular amount of work to begin with, what you’re likely to do is try to restore Tuesday’s backup, run CHECKDB against it, it will fail. You’re going to try to run Monday’s full backup, restore from that, run CHECKDB against it, it fails again because the corruption happened earlier. Then you’re going to go all the way back to Sunday, you’re going to restore this backup and the CHECKDB passes, but you already deleted these log files. This is when you say to the boss, “Hold on a second, I’m on the phone with Microsoft”, and you’re over there putting your resume on every website that you can possibly find.
Forget Defrags: Check for Corruption Nightly If You Can.
Now that you know about this scenario, what could you do to lose less data? First off, you could go through and make sure that every time there’s an error from corruption, that you get an email. Good news, that’s totally free, it’s easy to do, I’ll show you how to go and do it.
Second, you could keep more log files. If you’re only doing CHECKDB once a week, you need at least log files going back all the way to the last clean CHECKDB. Not when you run it, but the last clean one. Because when it runs on Sunday and it fails, you may have to go back to the previous Sunday in order to get a good, clean copy.
You can also run CHECKDB every night. Remember that index rebuild job and reorg index job that you were shuffling all those pages around every single night? You’d stand a much better chance of keeping your job if you had CHECKDB going every night. I usually see database administrators with their priorities exactly reversed. They do CHECKDB once a week and index rebuilds every night. Dude, you got it backwards. Nobody ever got fired for slow performance.
Here’s the thing. When you have really slow performance, you don’t get fired. You get training money. You get consulting help. You get hardware budgets, or users just, “Stupid DBA doesn’t know what he’s doing.” The reason that you get fired is for losing data. I want to make sure that you keep your job. That means running CHECKDB as frequently as practical. I don’t mean every hour, but every day isn’t a bad option.
So, we’ve covered a lot of stuff inside this session. What you’ve learned here so far. First, and I say so far not because I’m going to start another round, but because we’ll switch to Q&A. External fragmentation means your 8K pages are out of order on disk. This only matters when you’re reading from disk and when you’re reading from disk, your queries are slow anyway. It’s not going to make any kind of significant difference; measurable difference to end users. End users don’t notice a 5% difference in query performance.
Internal fragmentation is empty space on pages. This matters a lot, and unfortunately, we are the ones who are screwing it with it by playing with fill factor. We are the ones actually turning performance down by setting fill factor to say, 80% or 70%. I don’t have a problem with people doing index maintenance. If you want to reorg, rebuild your indexes, that’s totally cool. It just doesn’t solve most wait types. You’ve got to figure out what your top wait type is with sp_BlitzFirst and then focus on that wait type to make it go faster.
On your own server, what you want to do is set the fill factor back to 100%, back to the default, then use sp_BlitzIndex to check for which indexes have lower fill factors, set them back to higher numbers, rebuild them so that all that good data is crammed into as little memory as you can possibly fit it in. Then monitor your wait time per hour going forward and use the scientific method. Pick one way that you’re going to change that wait type and see if your change actually improves your waits.
Everything that I talked about here, the entire slide deck is all over at BrentOzar.com/go/defrag, including the scripts and the slides as well. And when you’re in there, I would humbly suggest, that’s a pretty good place to go for advice. And don’t read as much advice from blog posts that haven’t been updated since SQL Server 7. Back in SQL Server 7, when you only had one database per server, and they had magnetic rusty spinning Frisbees, this kind of advice made vaguely more sense.
Questions & Answers
Alright, now let’s switch over and start answering questions out of Slack and out of the Q&A, so let’s go – I’m going to show screen two so that you guys can all see Slack. There we go. So let’s see here. So someone said – SQLCWBY said, “With the discussion that the hypothetical person who guarantees defragment will fix the world, if you defrag with a rebuild like they want, it will update stats which could improve run time. How do you prove to them that it was the stats being updated that fixed it and not the defrag rebuild?” What you want to search for is slow in the app, fast in SSMS. I should put his name in here. SQLCWBY, slow in the app, fast in SSMS, by Erland Sommerskog.
For the people who are listening in as opposed to watching the live webcast, slow in the app, fast in SSMS is an epic post by Erland Sommerskog. It seriously is like 50 pages long, so it’s one of those that’s kind of hard to digest. We also cover it in our training classes. We have like a two-hour module on how to identify parameter sniffing, how to troubleshoot it and how to fix it long-term. It’s not a small topic. It is a seriously big topic, and unfortunately teaching it to other people, it’s hard to do it to a vendor in say, 15 minutes. You might say to them, “Here’s this post by Erland Sommerskog. Now, if you want me to walk you through it, I’d be glad to, but we’re going to need to schedule like an hour in order to do it.”
Let’s see here. In the GoTo in the questions – GoToWebinar, Peter says, “Great session Brent, I learned something that I had forgotten many moons ago with index fill factor.” That’s cool. Justin says, “What queries can fragment indexes the best?” The best way to fragment indexes is you wanted to do it on purpose for a demo, for example, would be to create a table with a bunch of NULLable fields and then go back and update them to wider values. That will cause SQL Server to have to fragment them like crazy.
Raid says, “Sorry if you answered this already” – I did, so you can watch the video for that. Vlad says, “Don’t run reindexes update stats.” Yes, stats and parameter sniffing is a totally different issue. Let’s see here. Mandy says, “I’ve been told that page splits are expensive and using a non-100 fill factor is… that not the case?” Yes, that is not the case. We covered that thoroughly inside there.
AlesEDBA says, “Hey, I’ve got this random webpage that returns a 404.” Yes, that probably makes sense to go email that in. You can guess that I’m not going to fix that live during a webcast. Josh Dreed says, “Any suggestions on tuning queries generated by an app framework? Let’s say Entity Framework, where the code is almost unreadable and uses many variables.” Yes, I don’t know that it’s public out. Richie Rump has this session. Richie Rump, I’m Googling, framework. So there might be – yes, there is. Sweet, perfect. So I’m going to paste his URL into here. So Richie Rump has this session on don’t be afraid of Entity Framework, and it shows you lots of different ways to tune the code by Entity Framework either by changing Entity Framework or by changing the way that you do stuff in the database. The presentation itself, the video isn’t available online. He does it for SQL Saturdays every now and then, but the deck is on there, you can go download the deck. He also includes sample codes so you can see how that stuff works.
JDWalker says, “Thank you Brent, does he cover JPA as well?” No, only Entity Framework. If you want to deal with different specific ORMs you have to talk to people who use those ORMs. Elsimer says, “If I’m using backup with CHECKSUM, will that surface the corruption issues?” Yes, now, it will find the corruption issues that I’d caused, there during my own demo. It doesn’t surface every corruption issue. If you have a logical corruption issue, for example, the things caused by the merge statement bug or the update with no lock, you still have to do good old fashioned CHECKDB in order to surface those. I just look at backup with CHECKSUM as like a free way to get kind of sort of CHECKDB but not really.
Let’s see here. SQL Gene says, “Is there a dumb question?” There’s never a dumb question, only dumb attendees, Gene. “This is a really dumb question, is there a way to defrag a heap without adding a clustered index?” Yes, you can say, and I’ll say SQL Gene, alter table blah, rebuild. That will rebuild the heap. Alter table, whatever the table name is, and then rebuild. Just know that it will also have to touch all of the non-clustered indexes on that heap. So it is kind of invasive. If you’re under the gun performance wise, if you’re on Enterprise edition, what you could do is disable those non-clustered indexes, rebuild the heap and then rebuild the non-clustered indexes one at a time, but that’s like super – if you cared that much about it, I would just set a clustered index on it. Like Beyonce says, if you love it, put a clustered index on it.
Ade says, “When you hit that page with a query, will that page show in msdb suspect pages?” Let’s show it to you so you guys can see. So over here, I did my select from the options table. Jump down to that guy, so select star from options, and I get the corruption warning. Yes, I can do select star from msdb.dbo.suspect, and my autocomplete actually works that time.
Suspect pages is a system table where Microsoft or SQL Server tracks the number of pages that have been seen as corrupt. The odd ball thing is to get these cleared out of there, you have to delete them. It’s the only system table I’ve seen that where you’re actually supposed to delete the rows out. So it says in here, database ID 8, file ID number 1. There’s been a couple of errors on that particular page. Of course, being the guy who started sp_Blitz, I much prefer having sp_Blitz catch it. If you run sp_Blitz – get a load of this, sp_Blitz for those of you who haven’t seen it before, is my health check that gives you a prioritized list of issues you need to fix on your SQL Server.
If you’ve had any corruption on your databases, look at what you’ll get inside your priority one warnings. I’ll move this way down so you guys can see more of it. Any time I say that, I’m always like, “Seymour Butts.” I shouldn’t be allowed to talk just off the cuff. So I got all these warnings backing up to the same place where my databases reside. If it’s on the same drive, or even on the same server as your SQL Server it’s not really a backup because when SQL Server Windows won’t start you are boned. These databases haven’t been backed up recently. Here I’ve got some databases in full recovery without log backups and whammo, database corruption was detected in my brand new car database.
This kind of corruption could also be fixed automatically by always on availability groups and by database mirroring. However, of course, that doesn’t mean that you’re safe. You still want to fail out to get out of the other – to your other SQL Server as quickly as possible because of course, that corruption is still happening. Just because you fix it once doesn’t mean it’s fixed for life.
Donald says, “Were you going to show us how to set up an email alert for corruption?” Kind of, yes. It was in these slides. If you are on sp_Blitz and you scroll down through all of the warnings, it’s down further past priority 50, but if you keep going down even further, you get right down in here, warning you don’t have alerts configured, you don’t have a failsafe operator, you don’t have any agent jobs set up with errors.
So if you go in here, it says no alerts for corruption, and it’s got a link. We’ve also got more explanations there. I try to give you guys links for everything so that if you copy paste that, and then you go paste it into your web browser, see what it says here? Blitz result, no agent alerts configured. To fix the problem, here’s an exact script that you can go copy paste into production because I know you people, that’s exactly what you do with all of my scripts, is you copy paste them and go live in production. You can’t see that, let me go show it. Just show you my main screen, there we go. It’s going to be big, I apologize. It is what it is, I have a large monitor. Size doesn’t matter, I use it incompetently. No one is ever satisfied with the work that I do on my monitor.
So to fix the problem, here’s what you do. You go through, and what I did was I copy pasted the results out of sp_Blitz, and then there you go, here are the exact scripts in order to set up those alerts. We set up alerts for severity 16 through whatever it is, like 25. Plus, specifically ones for error 823, 824 and 825. You could argue that there’s some redundancy in there, but I’m a database administrator, I’m all about redundancy.
Now, let’s see here. AB says, “Thanks, sp_Blitz equals win.” Let’s also go back over to GoToWebinar and see what questions we have out of there. Peter says, “It’s the vogue in some database environments to only use identity columns as primary keys. Would you say that except for edge case scenarios fill factor is now best left untouched?” Yes, you missed about 15 minutes there. Go back to – this is what I get for not reading the questions first. Go back and watch the video. Watch the video.
Peter also says – Peter has a good question too. Peter, that last one was so crappy. The good question was, “Do you have to shrink the database to recover disk space freed up by rebuilding your index?” Well, this is kind of tricky. Say that your fill factor used to be 80%, and now you’ve rebuilt it with 100%. You’re going to have some empty space in the data file, but that doesn’t mean it’s a lot, and in most cases, most databases are growing anyway. I would argue that you shouldn’t shrink the database to reclaim that space unless it’s say, half of the size of the database. If 50% of your database is now suddenly empty, then it probably makes sense to collapse it back down so that you get faster backup times, faster restore times, when you go restore this thing into development. But if it’s less than that, if it’s say, 10%, 20%, 30%, unless you’re on a multi terabyte database, it’s hard to argue for shrinking because here’s the thing. When your database continues to grow back out, make sure you have instant file initialization turned on, but even with that, as you grow back out, you may be growing out in a fragmented way because you have lots of little databases that are all growing at different times. Then you end up with on disk fragmentation that’s pretty hard to cure.
UCSB says, “Why would empty space be 50% after a rebuild?” Go back to that story we had where someone accidentally set fill factor to 10% instead of 100%. In one of those cases, they had a 2TB database that ended up going down under 250GB after we did the rebuilds to 100%. We won’t name names. Actually, I’ve forgotten the names.
SQLCWBY says, “Brent, what about Windows fragmentation? Should I automatically run that on the table and schedule it or should I even bother doing it?” Windows won’t move around a SQL Server data and log file that’s actively in use, so the only way that you’d be able to defragment those would be either shut SQL Server down – who on Earth is going to do that? Or, use some kind of third party tool that claims to move files around even while they’re in action. My advice would usually be, when you find something that where Windows fragmentation is a really big concern, like it can be with very large databases, when I’m trying to deal with CHECKDB issues – when Windows fragmentation is that big of an issue, your much easier fix is to go take a different SQL Server, restore the databases on to there and go live. Man, does that sound like a consultant answer, right? “Just build another SQL Server, restore your databases over there, how hard can it be?”
But think about it. It’s just something that you’re going to note later for when you go to do a failover, when you go to rebuild somewhere else, you’re going to move to SQL Server 2016 or 2017. As you restore the databases over on to there, you’re fixing Windows fragmentation anyway. sp_Blitz will warn you when your auto growth sizes are too small and that’s the thing that really causing Windows fragmentation.
Ade says, “So slightly off topic, but we once found corruption by taking a backup. This was log corruption as opposed to data corruption. No trace of this in suspect pages. Any advice for identifying log corruption?” The times when I see that, it’s usually as soon as you go to write the pages, people pop out with a log corruption warning. It’s just a matter of monitoring those alerts that we set up alerts for in that sp_Blitz script. That will catch any kind of log file corruption, which would also surface when you’re doing log backups.
I’m going to say this and it’s going to sound weird. The fix that people will tell you is to shrink the log file down to nothing and then regrow it back out. Now, technically, that fixes the corruption but whatever corrupted that disk will happen again. So whether you see any kind of corruption, system pages in a data file, data pages in a data file, or any pages in a log file, it’s like Eddie Murphy said back in the 80s, “Get out” when he was talking about people in a horror movie. Whenever you hear there, there’s one kind of people who go, “Get out.” Sorry, got to go, I’m out of here. There’s another kind of people who are like, “I should investigate and see what the root cause is.” You do that root cause analysis in your own time, getting failed at SQL Server out of there as fast as you can, because that corruption can get worse.
Alright, Peter says, “Cheers for the two answers. Good to laugh on a Friday here, it’s the UK.” He says, “Man, my typing is bad today.” Well, so many UK jokes. See, I would say I’m like a dumb American. You could be typing things correctly, I just assume that your spelling is correct and me as a dumb American that my spelling is wrong.
Well, thanks everybody for hanging out with us in the first session at GroupBy. That draws this session to a close.