Getting Better Query Plans by Improving SQL’s Estimates

Target Audience:

You want faster queries, and you’re starting to read execution plans to figure out what to do.

Abstract:

You’ve been writing T-SQL queries for a few years now, and when you have performance issues, you’ve been updating stats and using OPTION (RECOMPILE). It’s served you well, but every now and then, you hit a problem you can’t solve. Your data’s been growing larger, your queries are taking longer to run, and you’re starting to wonder: how can I start getting better query plans?

The secret is often comparing the query plan’s estimated number of rows to actual number of rows. If they’re different, it’s up to you – not the SQL Server engine – to figure out why the guesses are wrong. To improve ’em, you can change your T-SQL, the way the data’s structured and stored, or how SQL Server thinks about the data.

This session won’t fix every query – but it’ll give you a starting point to understand what you’re looking at, and where to go next as you learn about the Cardinality Estimator.

Audio podcast:

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS
Leave us a review in iTunes

Resources:

To follow along, either print out the first page of this PDF, or click on the below image:

The Clustered Index

That’s a sample page from my free How to Think Like the SQL Server Engine class where I introduce foundational concepts of SQL Server internals, index design, and query tuning.

We’ll be using the free Stack Overflow data export, and here are the queries we’ll be working with:

Have fun!

Session Transcript:

Brent Ozar: Well, in this session at GroupBy, I’m going to be talking about how to get better cardinality estimates – how to get SQL Server to make better guesses about how many rows are going to be involved. I’m Brent Ozar. I do performance tuning for a living. I’m a Microsoft Certified Master. I just make SQL Server go faster and be more reliable.

Everything I’m going to talk about here today, the slide decks, the scripts, it’s all already up there. If you go to groupby.org/go/estimates, the code is up there, the slide decks are up there, and I’ll put the URL up again at the end as well. So let me start by telling you a little bit about myself. My favorite champagne is Francois Montand, and it only looks expensive. It looks like it’s something fancy-pants. I mean, half of all champagnes and wines look like they’re expensive with the fancy bottles, but this stuff is like ten bucks a bottle. Ten bucks a bottle – I mean, you could drink it at lunch if you wanted. You could obviously drink it at breakfast; it’s great with mimosas.

A friend of ours introduced it to us. It’s like this really nice to drink everyday champagne. And so, as a result, I usually keep at least half a dozen, if not a dozen, bottles around the house because you just don’t feel guilty. When it’s ten bucks, who cares? Ten bucks, just go drink it whenever you want. So a typical weekend errand in our house, if I have to go get in the car and my wife sends me off to go do stuff – maybe she just wants me to leave the house – she’ll send me to Binny’s, our local beverage store here in Chicago. And if they have our champagne, we usually buy all of it.

Now, that makes me sound like I’m some kind of baller, but the thing is, this $10 champagne, it’s fairly rare to see come up at the stores. So I’ll usually just swing in to see if they have any. If they have any, I’ll pick up all of it, but all is usually only like six bottles, seven bottles. It’s not that much and we only have a small wine cooler in our house anyway.

Then after I go to Binny’s, my instructions are, if they had champagne then let’s go ahead and go get a lobster. We’ll have lobster and champagne at night. Otherwise, if they didn’t, we’ll go ahead and just get one pound of salmon. So now, of course, Erik pops on camera because he’s like, lobster, champagne…

Erik Darling: You’re talking about all of my favorite things. I just had to show up. Plus you looked lonely.

Brent Ozar: Always, especially when there’s champagne involved. And this would be a bad idea for an errand; you wouldn’t really want me to say, if they had champagne, go and get one lobster for every bottle of champagne. That would be financially suicidal. We can’t eat that much lobster. The lobster would be crawling out of the pots long before we would have a chance to cook it. But these are the kinds of things that we do inside SQL Server queries all the time. SQL Server queries are a lot like errands. We’ll tell people, go over to the users table and go find the most popular location. Then based on what you find in there, I want you to go run another errand and I want you to sort all these people by display name.

They’re very multiplicative; for every one row that you find, go do lots of other work. And SQL Server’s pretty good at cardinality estimations. SQL Server’s pretty good at guessing how many bottles of Francois Montand there’s going to be at the shelf. It’s pretty good at guessing how many lobsters are going to be alive. So when you’re looking at execution plans and you see that estimates are reasonably close to actual – and I’m going to talk about what reasonably close means – then you’re probably getting a good plan.

It doesn’t mean necessarily that the query is fast. For example, if you tell SQL Server, go get a billion rows out of this table and then join them to a billion rows out of that table, you’re going to have a bad time. But if the estimates are close, at least SQL Server’s going to go in with wide open eyes, understanding how many rows are going to be involved. However, when you look at estimates and actual and they’re 100X or 1000X off, that usually means you’re getting a bad plan; a plan that doesn’t properly associate how much work is going to be involved with how much data you want to get.

Sometimes SQL Server dramatically overestimates how much data you’re bringing back. You get huge memory grants, SQL Server goes wildly parallel for a query. Other times, SQL Server wildly underestimates how much work is going to be involved and it only estimates a single core, it ends up spilling data to disk. To understand how that happens, I’m going to give you a really simple query and we’re going to walk through it in plain English.

This query that I’m about to give you uses OPTION (MAXDOP 1). I am throttling SQL Server down to just one core. Now, that isn’t a good idea generally, but I’m doing it here in the demo because I want to keep my query plans really simple so they fit on the page. Don’t worry – at the end of this hour, my demo advice isn’t going to be, if you turn on parallelism, everything’s going to go faster. This is specifically about the estimation of rows.

I’m going to use the Stack Overflow users table. The users table at Stack Overflow holds exactly what you think it does, and it’s the white pages-style table that I deal with in my How to Think Like the Engine course. If you haven’t read or watched How to Think Like the Engine, I’d highly recommend it. It introduces the basic concepts of things like clustered indexes. This is the clustered index of the users table at Stack Overflow. And the location column in Stack Overflow’s users table is a plain text field where you can put in anything that you want: in the back alley, Chicago, at your mom’s house, at the health clinic again, whatever it is that you want to put inside there.

If I told you to figure out the most popular locations – if I said, select location, which is a free text field where people can type in anything that they want, and then count the number of people in that location, group them by location and order by count star descending – we’re going to see the most popular locations down to the least popular locations. There are a few things you’re going to need memory for in order to accomplish this. If I’m doing this in plain English and all I have is the clustered index, it’s just a bunch of pages with locations on there. I’m going to be yelling out locations at high speed – Chicago, your mom’s house, at the health clinic again, under the basement – yelling out locations while somebody else tallies up how many of each location it sees.
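In T-SQL, that demo query looks roughly like this – a sketch based on the description above, using the Users table from the Stack Overflow export (the exact scripts are in the downloadable resources, and the column alias is mine):

SELECT   Location, COUNT(*) AS PeopleInLocation
FROM     dbo.Users
GROUP BY Location
ORDER BY COUNT(*) DESC
OPTION   (MAXDOP 1); -- throttled to one core only to keep the demo plan simple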

I need to know how many pages they’re going to be working with, how much they’re going to have to write down, and then they’re going to have to have those sets of pages again to sort all of the data. Let’s go see what SQL Server’s execution plan looks like. Now, the way you start reading execution plans is right to left. As Guy mentioned in the last session, it’s also often useful to read from left to right. We’re going to start from the right-hand side and read across to the left. And what I’m going to do is show you pictures where I hover my mouse over each arrow in the query plan – for example, if I hover my mouse over the clustered index scan.

The thing that I want to look at is SQL Server’s estimate on the number of rows. Now, because SQL Server knows how many rows are in the table with a fair degree of accuracy, it knows that about 7.6 million rows are going to be in the table. The arrow that’s coming out of that clustered index scan, that arrow is really just a yelled list of locations; Chicago, Alabama, Drew’s mother’s house. There’s going to be hardly anybody at Drew’s mother’s house because he says, “My mom is a classy lady,” over in Slack.

Now would be a wonderful time to make a joke about how my mom has been married seven times, but my mom follows me on Twitter and she’s going to see this at some point. She may have lost interest by now; who knows? My mom’s house won’t be one of the top locations. I mean, it doesn’t compare with India, but it could be up there.

So estimated number of rows, 7.6 million – that’s the big long list of locations that are coming out of here. Next up, we’re going to be putting them into buckets. I adore these little icons on execution plans because they kind of give you a rough idea of what’s happening inside the plan. For every row that comes in, we’re going to put them into a bucket. Not sorted in any way, just throwing them in: oh, Chicago, I’ve seen that before; Brooklyn, I’ve seen that before; Indiana, I haven’t seen that before, I’m going to put that in a new bucket.

This is where the estimates get a little tricky. Now, SQL Server knows how many rows are coming in. It doesn’t know how many buckets are going to come out. It doesn’t know how many locations are in those seven million people. Until we tell it somehow, it has no way of knowing. There’s just a stack of pieces of paper, where maybe everybody has their own unique location, or maybe everybody has exactly the same location. By default, SQL Server has no idea. So it comes up with a row estimate of 7478. It believes that out of seven million users, there are going to be about 7,000 different locations.

Now, you and I know better. You know, if you have a text field and you let anybody type in anything that they want under location, they’re going to get real creative. People are going to put in 1111111. Some people are going to put in their exact address. People are going to put in, “I’m flying in a black helicopter…” just all kinds of crazy stuff, because that’s what human beings do. SQL Server, god bless it, doesn’t really know that much about human beings, so it just estimates 7478, and that’s wrong. That’s going to become important later.

The next step, compute scalar, it’s kind of mysterious how SQL Server does this – you’ve got to go way deeper into properties in order to understand what’s coming out of here. What this really is, is the count star; the number of people who are in each location. And then the last thing that SQL Server does is sort all of them because we asked for them by count star descending. Finally, on the way out the door, SQL Server estimates that 7478 rows will come back and that was all driven by that hash match, trying to guess how many distinct locations were inside the Stack Overflow database.

SQL Server costed out an estimate of 147 for this query, which is just a rough guess of how much CPU and I/O it thinks is going to be required in order to execute this query – how hard it’s going to have to work. A long time ago in a galaxy far, far away, this meant the number of seconds that it would take for a query to execute on one guy’s Dell Dimension or Dell OptiPlex desktop over in Seattle. These days, it has nothing to do with CPU and I/O; it’s all a bunch of hardcoded estimates on what different operators cost. So we jokingly call these query bucks. A couple of years ago, we had a set of query bucks made up with different people’s faces on them, different denominations. For example, this is Paul White, AKA SQL Kiwi, who does all kinds of cool stuff with the debugger. You can go download the PDFs for query bucks. They also have fun backs on each one of them. We each got to choose an operator and the kind of query plans that best described us. Paul White liked a compute scalar on the back of his query bucks.

Now, query costs are just an estimate. They don’t really mean anything. And what we have to figure out is whether or not our estimates are right. The only way to get an actual plan – and I even use actual in kind of air quotes – to get an actual plan, you’re going to have to go run the query and you’re going to have to watch what happens while that query runs. At the top is the estimate for this query. It has the same shape as the actual plan down on the bottom, but there are differences. The one down at the bottom understands exactly how many rows moved from one operator through to another. This is where SQL Server is suddenly throwing up its hands, little yellow exclamation points going, oh my god, we have a problem. I have a problem with this hash match here. I have a problem with the sort here. Drew’s back on camera now because he wants to make sure I don’t say anything bad about his mom again. Your mom’s nice, I’m sure. I haven’t met her yet. I’m sure she’s sweet.

And if I hover my mouse over those things with the yellow bangs, now I can suddenly start to see what SQL Server is complaining about. If I hover my mouse over the hash match, there are things like estimated rows on the estimated one, but over on the actual plan, it says actual number of rows. If SQL Server’s estimates were right, it would have understood how much work was going to be involved and the plan would have been predictable. But you can’t get that good juicy stuff like actual number of rows, actual number of executions, spills, et cetera, out of estimated plans because it’s going to be different every single time the query runs.

Maybe somebody truncates the table right after you run it one time, so your actual number of rows can be completely different. When you run plan cache tuning tools like sp_BlitzCache, which analyze what queries have run the most frequently, we can’t show you actual numbers. SQL Server can’t show you actual numbers because those numbers change every single time the query runs. So let’s go look at our GROUP BY query and see how the estimate went. If I go look at the first thing, the clustered index scan, SQL Server did a really good job: it estimated that about 7.6 million rows would come out of here, and in actuality, 7.6 million rows come out.

That part is easy, just estimating how many rows are in the table. But let’s keep moving across and let’s go over to the hash match. For the hash match, estimated number of rows, I think there’s going to be about 7400 rows come out of here. There’ll be about 7400 different distinct locations. Not even close. There are 110,000 different locations. Now, I’m not saying that’s SQL Server’s fault. There’s no way it could have predicted this… There’s lots of ways it could have predicted it, but they all involve reading the table to go figure it out. And when you build an execution plan, you can’t go read the table first in order to go build the plan. You’ve got to roll with what little information you have.

There are two problems with what just happened here. First, SQL Server underestimated how many rows were going to come out of this. And second, because it underestimated how many rows would come out, it also underestimated how much memory it would need for the hash match. And further upstream, I kind of think of this as a data tsunami – this whole plan was designed for only 7400 locations, but because a whole data tsunami is coming out of this hash match, everything else upstream is going to be affected because it didn’t allocate enough workers, enough memory, et cetera.

Now, I don’t blame SQL Server. It’s hard. If I would have asked you to guess how many different locations would have come back out of the users table, I bet most of you would have overestimated. Because if you think about seven million people, what are they going to write in the location field, I bet you might estimate half a million or a million different locations. And you would have had a different problem. You would have been like, oh my god, this list is going to be huge. I better allocate all kinds of RAM. I better get all of the worker threads to all work on this simultaneously, and your SQL Server would be constantly falling over because you would be clearing out the buffer pool, allocating all these worker threads every time even a trivial query runs. The answer isn’t to wildly overestimate everything, because that causes different things like poison waits like thread pool or resource semaphore.

As we go up through this plan, getting out to the sort, with our bad estimates where we estimated 7400 rows and in actuality 110,000 came out, we end up writing all of this data down to tempdb again a second time. So getting estimates as accurate as we can is going to make a huge difference in terms of our performance. It’s going to make a huge impact. One of the things, god bless it, about SQL Server is – I always use a guy’s voice for SQL Server, “Trust me, I got this,” because he’s real stubborn, he refuses to ask for directions, he always swears he knows the way to the final solution and he’s completely wrong.

So SQL Server is going to stubbornly reuse the same freaking plan every time going, “Trust me I got this, 7400 rows; that’s exactly what’s going to come back…” and just keep doing the same thing over and over again. Now, we have a bunch of ways that we can make this query plan a little bit more accurate. We can update statistics. We can slap an option recompile on there. We can change our cardinality estimator. We can upgrade to SQL Server 2017 and get adaptive memory grants. We can add a statistic or we can add an index.

I’m going to walk through each one of these one by one and I’m about to pass judgment on some of them, but I need you to get that I’m only passing judgment on this example query. There are times when you’re struggling with cardinality estimation that lots of these can help and lots of them can be powerless. My goal here is to sketch out what your toolbox is and have you understand what’s going to work in this particular situation. Each query that you go and tune may be totally different.

Would updating statistics here help me? No because I don’t have any. By default, I don’t get any statistics unless I explicitly create them or unless SQL Server sees me running the same query over and over again and goes, “Wow, man I notice that you sure do query location a lot. I should probably create a statistic on that field.” SQL Server does a good job of that. I’m a huge fan of that auto create stats feature. Sometimes people turn it off for things like SharePoint but I’m like, no I want to give SQL Server as much of an edge as it can get when it’s starting to process query results. Here, I can update my brains out; it’s not going to make any kind of difference.
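For reference, refreshing statistics looks like this – just a sketch; as noted above, with no statistic on Location yet, there’s nothing here for it to improve:

UPDATE STATISTICS dbo.Users; -- refresh every statistic on the table with a default sample

UPDATE STATISTICS dbo.Users WITH FULLSCAN; -- or read the whole table for the most accurate histograms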

Two, could I throw option recompile on it? Option recompile tells SQL Server, I want you to drop and build a brand-new query plan for this statement every time I run it. But the thing is, this is only one query – just one statement, a batch all by itself – and SQL Server has to build the whole query plan at once. It doesn’t build the query plan, get halfway through it, and then build a brand-new query plan from scratch. There are some really cool things in 2017 that start to make this easier, but only for certain kinds of queries and only for certain kinds of decisions. It’s not like it’s going to go rebuild the rest of the whole entire plan from scratch. So here in this example, option recompile isn’t going to be the thing that gets me across the finish line.

Next up is changing my cardinality estimator. Now, several years ago, Microsoft made a lot of noise about a brand-new cardinality estimator, which is the piece of code that guesses how many rows are going to come back from different parts of different execution plans. This was before the marketing team caught on to the cool idea of “it just runs faster”; otherwise, they probably would have used that tagline for the new cardinality estimator here. If I right-click on my database and I go into properties, I can change the compatibility level, pop it back and forth, and I’ll get a different cardinality estimator. If I go back to 2012 compat mode, it would have estimated 7500 rows, and in 2014, 2016, and 2017 compat mode it would have estimated 7500 rows. Look, there are cases, rare as they may be, where the new cardinality estimator will make a dramatic difference and solve your problems for you. There are just as many cases where it puts a bullet in the gun and invites you to shoot your own foot off. It’s not a magic bullet, it’s just a bullet; it’s just a different bullet than the one you’ve been using before. So it’s not going to be what gets us across the finish line here.
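If you’d rather script the compatibility-level change than click through the database properties dialog, it’s a one-liner – a sketch assuming the database is named StackOverflow:

ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 110; -- 2012-era (legacy) cardinality estimator

ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 140; -- SQL Server 2017 cardinality estimator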

SQL Server 2017 does start to have a little interesting advantage. It has this thing called adaptive memory grants, and it doesn’t help us guess how many rows are going to come out. It still makes the bad decision of, “Hey, I think 7400 rows are going to come out of these locations…” when in actuality, 110,000 do. But it does notice over time, if you run those same queries over and over again, it’s like, “Dude, I’ve been spilling to disk every single time this query runs. Maybe I should amp up and give myself a little bit more memory for query workspace.” It doesn’t happen on the first execution; it doesn’t happen on the second execution, but when your queries run often enough, SQL Server wises up and starts to give them more memory. It can also go in the opposite direction and give you less memory.

Unfortunately, right now in 2017, this isn’t going to solve my problem because it also wants a column store index in the query. So I’m going to set that aside for now. I am a huge fan of what’s going to come in the future versions of SQL Server. Tara says, “I’m not sure which words just caused it, but Brent just woke up my Google Home device just now.” So let’s see what I can do here. Alexa, order me some champagne. And the other one is, hey Google, I would like to play an adult movie. The poor people who are watching on YouTube are so screwed. I would say something about Siri, but she never gets anything right so it’s pointless to even try and trigger her. I cannot even believe how bad Siri sucks. That’s the one thing stopping me from getting an Apple HomePod because I can be like right here, will you listen to me? It’s terrible.

Now, will adding a statistic help? I put this up on the screen. I don’t think I’ve ever done this in real life. I’ve done it in demos, but I don’t think I’ve ever come to a solution where I go, you know what we really need to do here is add a statistic. Now, I hear from other query tuning friends of mine that this does get them across the finish line from time to time. So I feel the need to describe it, even though it’s never gotten me across the finish line. When I create a statistic, it is one 8K page worth of metadata. It’s data about the table. It’s not the entire copy of the table sorted like an index is; it’s just an 8K page worth of metadata that helps to inform SQL Server’s decisions so that it can start making better guesses about how many rows are going to come back.
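Creating that statistic by hand looks like this – a sketch; the statistic name here is just made up:

CREATE STATISTICS Stats_Users_Location
    ON dbo.Users (Location)
    WITH FULLSCAN; -- read the whole table so the histogram is as accurate as possible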

But all that does is change your cardinality estimation, and in that case, I’m still going to be putting all of my rows into buckets and I’m still going to be sorting them all. So usually, if I want a statistic, the thing that I’m more tempted to go do is create an index. An index is a copy of your data sorted in the particular way that you query it, and you get a side effect: for every index that you create, SQL Server automatically creates a statistic with the same name on the same fields. And so adding an index, in this case, has two functions. One, it gives me a different way to access the table. An index on location is a list of locations sorted from top to bottom – whatever location starts with an A, Brent’s mom’s house, Drew’s mom’s house, et cetera, Brooklyn, whatever – they’re all in order.
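A sketch of that index – the name is arbitrary, and SQL Server will create a matching statistic on Location as a side effect:

CREATE INDEX IX_Location
    ON dbo.Users (Location);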

And so now SQL Server goes, I can just scan that index, and instead of having to put stuff into buckets, now I can just keep a running total of how many people I’ve seen in each of these locations. The data is already sorted; this is going to be way faster. And yes, there’s a yellow bang, but hold on a second; we’ll come back to that. If I hover my mouse over the stream aggregate – this is the part where we’re counting up how many people are in a location – check out that estimate; cha-ching. Now SQL Server is saying, “Well, I estimate that 110,000 rows are going to come back.” And what do you know – 110,000 rows actually come back. Now, I do still have to sort the data that comes out of here, because this is just a list of locations from A to Z, and then I have to turn around and sort them by popularity. So the sort still needs to happen and unfortunately, it’s spilling to disk.

Now, it’s not spilling a lot of pages to disk. It’s only spilling 808, but SQL Server still didn’t allocate enough memory. My job as a query tuner: every plan operator is trying to guess how many rows are going to come in and then how many rows are going to come out, and my goal is to make both of those right and keep influencing them throughout the plan. Generally speaking, when we talk about reading the query plan, I read these from right to left, top to bottom, find the place where estimates go to hell in a hand-basket, and that’s the place where I want to focus on fixing. When I know that it’s wildly off, that’s the part where I have to either change something about the query or change something about the database.

In this case, what I did was change how much data will come in – or rather, how SQL Server predicts how much data will come in – by way of the index. When I built an index, SQL Server suddenly had this statistic that told it exactly how many different locations were out there, so it came straight in, in order, knowing how many rows were going to come out. That’s not necessarily the only way to tune queries though. There are lots of different ways that you could go tune queries. For example, I could change how much data is going to go out to the next operator.

Here’s my initial query. And whenever I see a query and it has a lot of results – and to me, a lot is more than would fit on one screen in my application – I start to ask tough questions of my developers. Wait a minute, what do you mean you need 100,000 locations sorted by popularity top to bottom? It’s just like in real life: I really shouldn’t go to the wine store with “buy all of the champagne” as a hardcoded rule. I should really say top ten or top 12. Sooner or later they’re going to have more than one case of Francois Montand in stock, and it would be spectacularly stupid for me to go through and buy all of them.

What I should do instead is pick some number. And just for the sake of picking one, I’m going to say top 100 here. If I say only get me the top 100 locations ordered by popularity, the query plan still looks basically the same. The query plan is still going from right to left, scanning all of the locations, getting the counts as you move through them. Now there’s a sort though, and because it’s only top 100, SQL Server doesn’t have to keep all of the locations. It doesn’t have to do a big ginormous memory grant in order to hold all of them; it just sorts the top as it’s coming through. If a row is in the top 100, it gets kept; otherwise, it gets discarded. SQL Server needs dramatically less memory now, and it’s even funny. SQL Server is complaining, “Oh my god, I gave you way too much memory. I detected an excessive grant. That’s way too much.” Hey, SQL Server, how much is too much to you? “1MB, like grandpa’s floppy. You remember that floppy disk grandpa used to have that he would keep next to his IBM XT. I gave you 1MB of RAM. It was purely wasted.” You know, like those vendor flash drives for 4GB that you throw away because they suck; you don’t want to keep them, they’re so tiny. Here, SQL Server is complaining about a 1MB memory grant. That’s how good our query has gotten. “I just gave you too much.”
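That TOP 100 version is the same query with one change – a sketch:

SELECT   TOP 100 Location, COUNT(*) AS PeopleInLocation
FROM     dbo.Users
GROUP BY Location
ORDER BY COUNT(*) DESC; -- the top-N sort only has to keep 100 rows around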

The key there is that I use two different sorts. One of them was a top sort, whereas the other one was just a regular sort that kept all of the data. The amount of memory that gets used by that memory grant can change depending on what your top number is. We have a blog post that goes into more details about that.

So what we’ve learned so far with a fairly simple query is that SQL Server needs to guess how many rows are going to come out, not just of the table but out of each operator. In our case, SQL Server knew exactly how many rows were going to come out of the table, but it had no idea what was going to come out of some of the operators inside there. And when that operator number was way the hell off, SQL Server built a bad execution plan; a plan that wasn’t right for the amount of work we were asking it to handle.

Now it’s going to get a little bit trickier. You know how I said I have these two-part errands and I said that what I would do is go to Binny’s, get champagne, and then a really stupid errand would be to go over to Whole Foods and for every bottle of champagne I get, go get one live lobster. Well, that’s exactly what your queries are, so let’s build a multi-part query.

I’m going to start by declaring the top location. I don’t know what it’s going to be. It’s obviously not going to be Drew’s mom’s house if she’s a woman of fine repute or whatever it is. It might be Erik’s mom’s house, but he’s one of my employees, so I’ve got to assume that she’s probably okay. Then I’m going to go select that variable. I’m going to set that variable as the top one location. In order to protect myself from nulls – because it turns out, there are a lot of people who are paranoid at Stack Overflow and they don’t put their location in. Who could blame them, because some of those questions and answers that I see on Stack Overflow, I want to go to their house and I want to beat the hell out of them for doing such a – why would you think that option recompile is good for every query? Where location is anything other than an empty string, group by their location and order by count star descending.

This is exactly what we did before. Then, I want you to go find all of the users in that top location and order them by display name. So let’s start from top to bottom. This is fairly simple. When I have a variable, it’s only going to have one row in it. There’s not a lot of cardinality estimation here; this variable is only ever going to contain one row. Now we’ve got the second part. Go get me the top one location from the users table where location is anything other than nothing, group by location.

You and I know, as we look at this query, there’s only one row coming out and that’s it; I just don’t know what the row will be. Now, you and I, if we knew that we had an index and if we had statistics on that, maybe we could guess. Maybe we could guess what that top one location is, but SQL Server does not. It just knows that there’s one row coming out of here – one value that’s going to go into our top location variable. So now the next part of our query runs: select star from dbo.Users where location equals top location, order by display name. How many rows is that going to return, and how much memory am I going to need to allocate in order to sort all of these users by display name?
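Put together, the two-part batch described above looks roughly like this – a sketch from the transcript; the variable name and its data type are my assumptions:

DECLARE @TopLocation NVARCHAR(100);

-- Errand #1: find the single most popular non-blank location
SELECT TOP 1 @TopLocation = Location
FROM dbo.Users
WHERE Location <> ''
GROUP BY Location
ORDER BY COUNT(*) DESC;

-- Errand #2: fetch everyone in that location, sorted by display name
SELECT *
FROM dbo.Users
WHERE Location = @TopLocation
ORDER BY DisplayName;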

Here comes a tricky part. SQL Server has to build the query plan for the entire query at once. You know how GO is a batch separator in SQL Server – you put in a bunch of queries and then you have a GO. The GO at that point defines an entire batch that SQL Server is going to compile all at once. You and I would wish that it would build the execution plan one step at a time as it ran and make different branching decisions. That is not what it does. It builds the estimates for everything all at once. So first it says, go get me the top one location and then find me all of the users in that location sorted by display name. If I hover my mouse over the second part – so we’ve already optimized the top query earlier – over that index seek: hey, SQL Server, how many rows do you think you’re going to find in the most popular location at Stack Overflow? “68.” What? “Yeah, I think the most popular location in the world would have 68 people in it.” That makes no sense, but in fairness to SQL Server, what he’s really doing is saying, “I think that because I don’t know what the location will be, the average location has 68 people in it.”

That’s fair, it’s just kind of stupid. SQL Server built this whole entire execution plan at once and it’s not like us as human beings, where we would take an unlimited amount of time to stitch these statements together and figure out what was going on. SQL Server ain’t got time for that. So I could, if I wanted, make SQL Server drop and build a brand-new execution plan at any given point in the query. For example, we talked about our list of tools that we could go use, building an index, building statistics, changing the compatibility level; all kinds of things that we can go and do. I’m going to use one tool out of the toolbox in here to illustrate how I can help influence cardinality.

When I slap an option recompile statement on this particular line, then what SQL Server does is it builds this query plan, but at the point where it hits this statement, it says, “Right, now I’ve been told to go build a new query plan based on everything I know so far.” And at this point, it might actually know what the top location is. Maybe it’s going to consider what that location is as part of understanding how many rows are going to come back. It does. Now, at first it looks bad. Like Guy talked about in the last session, he said all of a sudden when you see a clustered index scan, sometimes that can look bad, but that can actually be smart as opposed to an index seek plus a gajillion key lookups, which is what we had before when SQL Server was going, “Oh my god, there’s only going to be 68 people in here. I could do a seek plus a key lookup. Millions of reads, not a problem.” Here, SQL Server knows, well, I know there’s going to be a lot of people in that top location, so I should probably just scan the whole table. And yeah, I have to sort all these people, however many people were in the top location, but you know what you don’t see on that sort is a yellow bang, because SQL Server went in with eyes open going, “Oh, the biggest location, like India you mean. Oh, now I know that the top location is India, that’s what you’re looking for, there’s going to be a ton of people in here. I better parallelize this because there’s going to be a lot of work involved in the sort, and I better throw all kinds of RAM at this to make sure this thing doesn’t go to disk. And while I’m at it, what I should really do is tell you to add an index on location. And I would like you to include every single field on it; all of them.”

Well, easy now Clippy. Clippy is asking for an index with every single field in it because he wants to make the select easier; that’s Clippy’s only mission. He’s very shortsighted. He’s always focused on the now. He’s not concerned with what anyone else’s work does. That’s okay. I’ll forgive Clippy for that because at least he built a good enough query plan that has the right amount of work. What option recompile did was it said, as you work through this query – Clippy, you’ve done a great job with building the top part of the query plan, but when you get down here, when you get to the bottom part of the query, it’s not your fault. There’s no way you could have known what the top location would be. But now that you’re down here, let’s go rebuild a brand-new query plan because building a query plan – it has its plusses and minuses – if the plan really sucked then option recompile may be a better idea, especially if the thing is dramatically variable.
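Concretely, that statement-level hint is just the second query from the batch above with the recompile appended – a sketch:

SELECT *
FROM dbo.Users
WHERE Location = @TopLocation
ORDER BY DisplayName
OPTION (RECOMPILE); -- build a fresh plan here, now that @TopLocation's value is known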

The key is in knowing where to put those option recompiles at. You don’t want to pepper them all through here like me with my table champagne. You only want to use them where they’re very appropriate, like Erik with his expensive whisky that he likes. You don’t just drink that with every meal; you only drink it with the good memorable meals. He’s lying – that’s not Pepsi Clear either.

So another way that you could fix it – and I’m going to throw out a couple of other ways that you can fix cardinality estimations here – another way that you can fix it is to combine queries. Sometimes, when you’ve got a couple of different queries and they’re doing different things, sometimes you can give SQL Server a better idea of what’s going on if you collapse them all down into one query so that SQL Server can connect the dots with what’s going on instead of treating every single query separately.

So for example, a CTE – let’s say that I build that original top one query up here in a common table expression. I don’t know why they call them common table expressions, because they’re still fairly uncommon to see, even though they’ve been out for more than a decade. If I take this CTE and I say, with the top location as select top one location from users, go get that top location and then join to the users table. The thing is though, SQL Server is still going to build this entire execution plan all at once. It doesn’t know what the top result is going to be until it goes through and executes it. So when I go through and run this execution plan – we read these plans from right to left, top to bottom – SQL Server is doing exactly the same thing that it was doing before in two separate queries; it has just mashed them down into one query plan, which can be easier to read.
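Here’s a sketch of that CTE version, joining the top location back to the Users table:

WITH TopLocation AS
(
    SELECT TOP 1 Location
    FROM dbo.Users
    WHERE Location <> ''
    GROUP BY Location
    ORDER BY COUNT(*) DESC
)
SELECT u.*
FROM TopLocation AS tl
INNER JOIN dbo.Users AS u
    ON u.Location = tl.Location
ORDER BY u.DisplayName;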

I’m going to read from top right going down through the bottom. Top right, go scan through the list of locations. SQL Server is really good at this. It knows how many locations it’s going to find. It’s going to find about seven million. Then, go sort them. Find the most highly ranked location. It knows exactly how many rows will come out, but it doesn’t know what that location is. It could guess. You and I know that it’s got statistics on location. Microsoft could kind of take a leap there and say whatever the top number one is – but if you think about it, what happens when you start doing top ten? What happens if you start putting a where clause on there, like where the users have logged in recently? Who knows what’s going to come out of there in a real-world type query.

Then comes the second part of this – let’s go find how many people are in that location; the number of people who are in India here on Stack Overflow. Now, go make a list of all of them and then for every one that you find, because our query asked for select star, go do a key lookup and go grab all of the rows that you find. Now, when you look at an execution plan, one of the things that I hammer into you in How to Think Like the Engine is that a seek looks like it’s only going to return one row. When we human beings think of the term seek, we think, I’m just going to dive-bomb in and pluck one individual row out. And you think that a scan is just going to start at one end of the table and read the whole entire thing. That’s not what they mean. A seek refers to the starting point. I know where I’m going to start, but not necessarily where I’m going to stop. I could read a whole hell of a lot of data. And scan means I’m going to start at either end of the index; I just don’t know what those values will be.

So when I look at the index seek, when I seek in to see how many people I’m going to find, SQL Server is just like, “Oh, I think 13 rows. Whatever the top location is, there’s going to be 13 people inside that top location.” That’s ridiculous. That’s not even close. In actuality, there may be, say, 37,000 people who live in India who use Stack Overflow – and of course, there’s more than that, but people put in all kinds of different things for their location, like Bangalore, India, or Hyderabad, India. So those estimates are wildly incorrect. When we estimated 13 rows were going to come out of here and actually 37,000 did – we’re reading right to left, top to bottom – this data tsunami is now coming upstream, and for every row that we’ve found, now we’re going to go do that many key lookups.

If we found 37,000 users in here, we’re going to go do 37,000 key lookups. Here comes SQL Server. As Marci says, it’s Cookie Monster’s SQL Server voice: “I think I’m going to need about 13 rows. I’m going to do this about 13 times here. Oops, my bad.” This is a little hard to troubleshoot because Microsoft seems to have two ways of laying data down on the screen. They take one of two approaches; either they line up everything in perfect alphabetical order, which has nothing to do with how you want to look at it, and they aren’t even consistent about how they do things alphabetically. Like in here, sometimes they’ll say actual number of rows. Here, they don’t put the actual in front of the number of executions. So this isn’t even alphabetical and it’s not arranged the way that you would want to read it.

For example, actual number of rows is up here, estimated number of rows is way down here, whereas estimated and actual number of executions are right here next to each other. So either they alphabetize everything or they load up the data shotgun and they just spray the data at the screen and wherever it happens to land, they call it good. This is one of those data shotgun screens where they just slap the data anywhere that it happened to fit. So it’s really hard to understand what some of these things mean.

I’ll give you a little decoder ring. Estimated number of executions, “I’m only going to have to do this 13 times.” Actual number of executions, “My bad, I did it 37,000 times.” How many rows do you think are going to come out of here? “Well, the way that I’m going to display estimated number of rows is I’m going to tell you one, but I’m kind of keeping a secret from you. What this really means is that it’s one for every single operation that I estimate that I’m going to do. So I really estimate that I’m going to get 13 rows back, but I just want to make your job security even harder.” Actual number of rows, “I’m going to show you this in the total, as opposed to earlier I showed you per execution. Here, I’m going to show it to you in total. Just a job security thing – you’ll thank me later. And by thank me later, I mean drink.”

So of course, the rest of SQL Server’s execution plan, again, is going to be completely off. Now, when I’m down to the sort, SQL Server has no idea how many rows it’s going to need to sort, and as a result, it wildly underestimates the amount of memory it’s going to need for that execution. Sometimes, when you’re tearing queries apart or combining them together, sometimes you need to take a really large query and break it up into pieces. Sometimes, you need to take some of those pieces and put option recompile on them, for example. Other times, you need to take big ginormous batches and merge them down into just one single query.

My job as a query tuner is to read this execution plan from right to left, top to bottom, figure out where estimates went to hell in a handbasket and why. Is there something I can do? Is there a thing like option recompile, for example, that’s going to make this thing easier for SQL Server to interpret? You have a lot of tools at your disposal, not just option recompile. And my goal inside this hour was to sketch out some of the tools that you have and some of the times when you need to use those tools. There are lots of ways to accomplish this, and very often, as I’m teaching people query tuning, they’ll say things like, should I use a CTE or a temp table? Should I take a big query and tear it down into pieces, or should I combine multiple queries together? And it’s just like a carpenter’s bag of tools or a bartender’s array of tools as well.

It’s a matter of knowing which one is right for different scenarios. And I can’t possibly teach you all of them inside the span of a one-hour session, so what I’ve done is I’ve given you a bunch of resources over there at groupby.org/go/estimates which will take you off to other ways you can go through and learn how to tune queries. If you like, we also teach huge classes. So both Erik and I teach a mastering query tuning class, three days long, coming up here in a month or so. We’ve also got a PASS Summit pre-con coming up where we do nothing but demos one after another and go through and show you what some of the gotchas are inside of SQL Server’s execution plans. You can go and learn more about those over at brentozar.com/training. So, that is everything that I wanted to teach y’all today.

