Anyone who stares at query plans for a living.
The optimizer has options.
Not every option is optimal.
Optimistic query tuners hope that the optimizer will make optimal choices.
Often, the optimizer will stick an optional operator in a plan that just doesn’t belong.
Sorts, Spools, Lookups, Tops, Aggregates, and Bitmaps can all end up where they don’t belong.
Opt out of bad PowerPoint decks with this demo-only session!
Joe Obbish and Erik Darling will show you when the optimizer’s optics might require a trip to the optometrist.
Why I Want to Present This Session:
Look, you’ve been watching the same query tuning session for years.
There’s a function, there’s a table variable, there’s parameter sniffing, there’s a cursor, and oh lord, something just wasn’t SARGable.
But what happens when you’ve already fixed that stuff, and you’ve still got problems?
ERIK DARLING: Coming up at the one o’clock slot, with crap on his shirt from something, we have Mr. Joe Obbish and Erik Darling doing the problem with optional operators. Take it away, Erik and Joe.
Now, the thing that I have to say very first is that this is what happens when you let people write introduction slides for each other and they’re on a very friendly basis. So, this is Joe, and this is what I think of Joe. And you can just take a moment to scan that list. And this is me, and this is what Joe thinks of me. I’ll give you a moment to read that list, and then we’ll move on.
JOE OBBISH: That’s a pretty short moment there.
ERIK DARLING: We don’t need long moments for these things. That’s why we wrote them that way. So we’re here to talk about optional operators, and we’re going to start this talk – we’ve agreed, barely agreed. It took a long time to come to terms – I was going to explain what an optional operator is. An optional operator comes from the query optimizer being able to make choices about how it’s going to construct a query plan, and it makes all these choices based on how much different things cost. And it measures those costs in these imaginary CPU and I/O units that are only known to the Sages of Redmond. The cost of a plan does not dictate how long a plan is going to take to run. It used to mean that a long time ago, but not anymore. Nowadays, the cheaper plan doesn’t mean the faster plan and it doesn’t mean the best plan possible.
When you look at an operator and you see the cost on it, the cost is just like a price tag. And the only thing SQL Server cares about is how much that operator costs. If it found a really cheap way to do something, it’s going to stick with that cheapskate way to do it. It doesn’t care about bells, whistles, dingdongs, flags, streamers, cool anti-lock brakes, sunroofs. It doesn’t care about any of that. All it cares about is getting the cheapest thing possible.
There are times when getting a bad operator is your fault. And bad choices aren’t just limited to query plan operators, those things that show little pictures that show up in there. Sometimes, some of the details of the plan can also be bad. And all of those things come from bad guesses. So if we write a query that the optimizer can’t make a very good guess about, the optimizer is just going to sit there and say, you want to know how many posts in this table? You want to know how many of them have more than 10 years between when they were created and when the last activity was? Well, say about 5.1 million. Even though zero rows actually qualify for that, SQL Server just made a dumb guess because we wrote a query in a bad way. And so, if we ask the same question, just about a slightly different number, like, how many have less than a year between them? Well, 5.1 million, even though about 14 million actually qualify for that. So SQL Server’s just throwing BS numbers at us. It’s making bad guesses because we kind of wrote a query terribly.
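The kind of query Erik describes looks something like this against the Stack Overflow Posts table. His exact script isn’t shown, so the DATEDIFF predicate here is a sketch; the point is that once both columns are buried inside a function, the optimizer can’t use their statistics and falls back to a fixed guess:

```sql
/* Both predicates get roughly the same ~5.1 million row estimate,
   even though one matches 0 rows and the other ~14 million. */
SELECT COUNT(*)
FROM dbo.Posts AS p
WHERE DATEDIFF(YEAR, p.CreationDate, p.LastActivityDate) > 10;

SELECT COUNT(*)
FROM dbo.Posts AS p
WHERE DATEDIFF(YEAR, p.CreationDate, p.LastActivityDate) < 1;
```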
Bad guesses lead to bad choices. Look at my life. If you were this guy and you were dressed like this and you were going to a café, you have made a good choice. When you end up in the toilet section of a hardware store, you have made a bad choice. This is no longer appropriate. You should no longer be dressed this way; stripy shirt, little scarf, floppy hair. You’re out. You do not belong here, sir. Those bad choices carry over to other parts of the plan. So if I now have this query that we know is going to make a bad guess and we now want it to do other things, some slightly more involved things, those bad guesses are going to impact stuff in the plan. Like, for instance, when I run this query and now I’m going to do some sorting and some grouping and stuff and I’m going to run this query, and we’re going to get a little warning in the plan. We’re going to get that little ding on the select operator. And we have to be very careful here because this may impact reliability. That 26MB of memory is going to impact it. Be careful there, sir. You have impacted the reliability.
When we ask that query to do now something bigger with more rows, you can imagine what’s going to happen. This is just going to get worse. But don’t worry, because it doesn’t impact the reliability. We just spill all that crap to tempdb, no big deal, no one cares about that anyway. No one’s worried about tempdb. But let’s face it, when it comes to this sort of bargain basement material, you’ve heard all this stuff before. You’ve heard, write SARGable queries, oh no, there was a function, isn’t that cheeky? No one ever saw that coming. No wonder the query is so slow. Maybe you used a table variable. Maybe you just didn’t create an index, you sly dog. How did you ever come up with that? Maybe you did update stats. Maybe you just should have used a temp table. There’s all this stuff that’s just like, yeah, I’ve heard it all a million times.
Granted, people keep making those same mistakes, and that’s kind of why I have a job, but it’s not what I always want to talk about. The reason that me and Joe are friends and we can call each other doody-heads is because, when we talk, we like to talk about when the optimizer makes bad choices on its own, like when we don’t screw up, when we do everything right. And we’re like, hey, wait a minute, this thing is still totally wrong. When it’s not your fault, well, the optimizer can choose all sorts of weird things to stick in your query plan that are really, really bad choices. Some of the ones we’re going to look at today are spools, lookups, sorts, a top operator, and bitmaps, so all interesting fun things that you don’t hear about in every single presentation when you get into performance tuning talks. A lot of them will just focus on some silly thing that you did, some best practice that you violated. This is when the optimizer goes and screws the pooch on its own. I went one slide too far, so I’m going to pass this over to Joe at my first sign of complete irresponsibility. I’m going to make Joe presenter, and Joe is going to show us some interesting stuff on his screen. Hopefully he has hidden the anime girls before starting this up.
JOE OBBISH: They’re all hidden. I have hidden all the evidence of everything. So we have a very simple fact table here with some orders in it. This table was born and it never really grew up. It’s still a heap. It’s still in its natural state. We have some pretty typical boring columns. You’ve got an order date, number purchased, the price, the unit ID, and then this unit class ID. Five columns. Also using a date dimension table from Aaron. And here’s the URL to create that…
ERIK DARLING: You want to take advice from a guy who just ran over a screw? That’s your big idea?
JOE OBBISH: Well, that was today and I think he wrote this article a long time ago, so that really shouldn’t apply.
ERIK DARLING: He may have been sober then, who knows? I still have to figure out what’s on my shirt.
JOE OBBISH: Okay, so let’s say today’s task is I want to take my [inaudible] table, join it to my dimension table, GROUP BY the last day of each quarter and my unit ID. And also, in the same result set, I want to do something similar, except here I GROUP BY the unit class ID instead and I have just this one simple aggregate here. Now, I’m looking at this query and I remember, oh, well there’s this thing called GROUP BY GROUPING SETS I’ve seen on the internet. Some people say that that’s good, right? So this query is the exact same thing as before, [crosstalk]. Right, I want to write as little code as possible. That is one of my primary goals in life.
So, the way you interpret these grouping sets is the same query as before, where we GROUP BY the first one, UNION ALL, and then we GROUP BY the second one, right. So I’ve got a nice compact query here. I would run it, but it would take too long. So here’s our actual plan instead.
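The grouping sets query Joe describes has roughly this shape. The table and column names (FactOrders, DateDim, and so on) are paraphrased from his description, since the exact script isn’t read aloud:

```sql
/* One pass over the join, two grouping sets: columns not in the
   active grouping set come back as NULL in that branch's rows. */
SELECT
    d.LastDayOfQuarter,
    f.UnitID,
    f.UnitClassID,
    SUM(f.Price) AS TotalPrice
FROM dbo.FactOrders AS f
JOIN dbo.DateDim AS d
    ON f.OrderDate = d.TheDate
GROUP BY GROUPING SETS
(
    (d.LastDayOfQuarter, f.UnitID),
    (f.UnitClassID)
);
```

The optimizer can implement this either as two separate aggregates concatenated together, or – as it did here – as one join feeding an eager spool that gets read once per grouping set.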
ERIK DARLING: Optimizer, what are you doing?
JOE OBBISH: It took 30 seconds. One of the things that I do is I take queries and make them run faster. 30 seconds, I don’t think, is very good. So, let’s figure out why this query is slow. So, this is a row mode serial query. I can look at the actual time stats per operator, and some of the time here is for that operator, and then that also includes all of the child operators as well. So my scan on the date dimension, that takes one millisecond. That’s probably not the problem, right? My scan on the fact table is 1.2 seconds. Okay, that’s not the problem either. We can look at this hash join. So that’s three seconds for the hash join and the two scans.
Look at this spool though; 24 seconds minus three. So, SQL Server took 21 seconds to write that data from the join into the spool and then the rest of the time it spent on the aggregates.
ERIK DARLING: There ought to be a trace flag.
JOE OBBISH: There ought to be, and maybe there is and we don’t know about it. So what’s the strategy behind this query? So, the optimizer really only has two choices here. It can run each aggregate individually and union them all together, or it can take the joined result set before aggregating, put it into a spool, and then apply the aggregates to the spool twice. So if you imagine – let’s say that this query was actually very expensive to calculate and it didn’t produce that many rows. The strategy of take the result set, put it into a spool, and run aggregates on the spool as many times as you need to, that could be a good strategy. But here, I’m throwing nine million rows into this and I’m spending two thirds of the time just writing to the spool.
ERIK DARLING: Joe, we have our first great question. Lee asks, “The spool took 21 seconds. Why is the cost estimation zero? Is time to do a thing not considered in the estimated cost?”
JOE OBBISH: What? That’s a good question. So these are just estimates, right? They can always be wrong. Now, I don’t know why this is costed as zero. It’s kind of a weird thing here. Reading from the spool has a cost of 20%. So, the cost percents are often – these cost percents are based on the estimated plan. You could put it that way. This is our actual plan. Yeah, I don’t know why this has a cost of 0%. That must be some kind of weird optimizer trick. But in any case, SQL Server thinks that this is a good enough plan. I would say it’s not.
So a few disadvantages here – writing to the spool is always at MAXDOP 1. All of the aggregates of that spool are always at MAXDOP 1. And there’s a lot of overhead for writing to the spool, even compared to writing to just a normal temp table.
I want to speed this query up. I don’t like what the optimizer did, so I’m going to, instead of using a spool, I’ll just take that initial result set, write it to a temp table, and then I will aggregate on that temp table. And then, to keep things fair, I’ll have both queries be MAXDOP 1. I’ll go ahead and run that. Now, you might think, well this shouldn’t be faster. I mean, writing to an eager spool, writing to a temp table, why should that matter? But it turns out that it is faster, once it finishes.
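The rewrite Joe runs is along these lines (again with paraphrased names): materialize the join once into a temp table, then run the grouping sets aggregate against the temp table, with both statements pinned to MAXDOP 1 to keep the comparison fair against the serial spool:

```sql
/* Step 1: do the join once and materialize the result. */
SELECT d.LastDayOfQuarter, f.UnitID, f.UnitClassID, f.Price
INTO #joined
FROM dbo.FactOrders AS f
JOIN dbo.DateDim AS d
    ON f.OrderDate = d.TheDate
OPTION (MAXDOP 1);

/* Step 2: aggregate the materialized rows. With a single-table
   source, the optimizer splits the grouping sets into a pair of
   aggregates instead of spooling. */
SELECT LastDayOfQuarter, UnitID, UnitClassID, SUM(Price) AS TotalPrice
FROM #joined
GROUP BY GROUPING SETS
(
    (LastDayOfQuarter, UnitID),
    (UnitClassID)
)
OPTION (MAXDOP 1);
```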
ERIK DARLING: We have one execution plan, so we’re good there.
JOE OBBISH: Okay, so we went from 30 seconds to 22, and I think some of this time was spent doing I/O. But the only change I made was – okay, I don’t want a spool. I often don’t like spools, so a temp table instead. And here, for the grouping sets query, I mean, I’m just pulling from one table, right? It would be totally pointless to take that table, put it in an eager spool, and then aggregate based off that. So SQL Server doesn’t do that. It’s smart enough to split up this grouping sets query into a pair of aggregates, and we saved a bit of time just by [inaudible] the spool.
Now, there’s really no reason to keep MAXDOP 1 hints, right? That’s kind of an artificial thing to keep things fair. So I’ll go ahead and get rid of that hint, and also combine my aggregate into one column, because why not do that? Let’s see how this query does. The same query – I’m letting the optimizer run it in parallel, if it so chooses.
ERIK DARLING: Oh, there’s an execution plan, boy.
JOE OBBISH: So we went from about 21 to 14. Interesting, I thought it would be better.
ERIK DARLING: That’s what they all say.
JOE OBBISH: Yeah, it was running faster an hour ago. What is the deal here?
ERIK DARLING: It’s Webex probably. Well, you get all those CXPACKETS. It takes time to fill those up.
JOE OBBISH: CXPACKETS, got to fill up our packets, yeah, I don’t know what SQL Server’s…
ERIK DARLING: You tried turning on jumbo packets?
JOE OBBISH: Not yet. So, I’m on SQL Server 2017. I can do parallel [inaudible]. There is a lot of parallelism going on here. There’s a lot of parallelism going on here too. So I’m not saving any CPU time. In fact, I’m probably gaining a little CPU time through these repartition stream operators. But the query ends up running faster. Okay, so 14 seconds isn’t bad. I think we can do better. What do you think, Erik? Can we do better?
ERIK DARLING: I always think we can do better.
JOE OBBISH: Alright, so I’m not a fan of these repartition stream operators. They don’t scale well…
ERIK DARLING: They’re also optional…
JOE OBBISH: And they’re optional, yes, that’s one reason I’m talking about them. So what SQL Server is doing here is, this is a row mode parallel hash join. And the way it’s implemented is, you can think of an independent hash join on each parallel thread. So for the join to work, we need to make sure that, on thread one, all of the corresponding rows from both tables are on that one thread. If they weren’t, then the join would end up losing rows. So what SQL Server often does is it will partition using an internal hash function to make sure that we don’t have any mismatched rows.
So here, we’re doing additional work. We might be sending rows from thread one to thread two, or thread two to thread three. Now, this is a totally normal thing to see with parallel row mode plans. It’s part of the deal you make if you want parallelism…
ERIK DARLING: Deal with the devil…
JOE OBBISH: Pretty much. And here, I mean, here for the aggregates, it’s the same thing, right? So, for the aggregate, you can think of it as independent aggregates, one on each thread. Again, if we didn’t redistribute our rows with a hash function, we could get wrong results. So this adds some overhead. I’m only running at MAXDOP 4, because I’m running locally, so it’s not that bad, but it’s still time we don’t need and it’s time caused by an optional operator. Now, running it at MAXDOP 1, of course, makes all these things go away, but now you’re running at MAXDOP 1.
ERIK DARLING: Which is a bummer because you pay a lot of money for all those cores.
JOE OBBISH: Well, I didn’t pay anything, but my understanding is some people…
ERIK DARLING: Nice people out in the real world pay money for those cores.
JOE OBBISH: In the real world, yeah. So I’m going to do something that is unsupported and join to an empty CCI with a join condition that can be optimized out…
ERIK DARLING: Tell the nice people what a CCI is.
JOE OBBISH: A CCI is a clustered column store index. I’ll go ahead and run that. So I’m on 2017. If I want my query plans to be eligible for batch mode processing, I need to have at least one column store index referenced in the query. So before, it took us 14 seconds. Now it took five, okay. That’s quite an improvement, right?
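The trick Joe is using generally looks something like this – an empty clustered columnstore table joined in with a condition the optimizer can throw away. The names here are hypothetical, and as he says, the technique is unsupported:

```sql
/* An empty table whose only job is to carry a columnstore index. */
CREATE TABLE #batch_mode_helper
(
    ID INT NULL,
    INDEX cci CLUSTERED COLUMNSTORE
);

SELECT d.LastDayOfQuarter, f.UnitID, f.UnitClassID, SUM(f.Price) AS TotalPrice
FROM dbo.FactOrders AS f
JOIN dbo.DateDim AS d
    ON f.OrderDate = d.TheDate
LEFT JOIN #batch_mode_helper AS b
    ON 1 = 0  /* contradiction: the join is optimized out, but the
                 plan is now eligible for batch mode on SQL Server 2017 */
GROUP BY GROUPING SETS
(
    (d.LastDayOfQuarter, f.UnitID),
    (f.UnitClassID)
);
```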
ERIK DARLING: So we’re down from 30 something seconds to five seconds. I think we’re doing well.
JOE OBBISH: We’re doing alright. We’re not done. So this is all parallel, but our repartition stream operators are gone. I mean, there’s the mandatory one at the very end, because whenever you’re selecting, all the rows have to go through one thread. That’s just how it works. But the difference here is, this runs in batch mode now. So batch mode processing, it can be more efficient. The big benefit here is that batch mode operators, they all internally handle thread management on their own. So there isn’t the need to move rows around between threads. That’s just built in.
ERIK DARLING: It just runs faster…
JOE OBBISH: Sometimes it just runs faster. Now, batch mode is especially suited for aggregates and for hash joins and for parallel queries. So this is a very similar plan to before, it’s just everything runs in batch mode, pretty much. We’ve gotten rid of those repartition stream operators, and things on the whole are just more efficient. Can we do better?
ERIK DARLING: What’s better than five seconds?
JOE OBBISH: So we’re taking about 3.2 if we’re writing to the temp table. And then we’re taking – oh boy, that would be, what is it, 1.7 for the rest?
ERIK DARLING: You went to college, come on, man.
JOE OBBISH: So, the only reason this temp table is here is because – well, the optimizer chose a spool. I was trying to keep things similar. We don’t need the temp table. I’m going to get rid of grouping sets.
ERIK DARLING: It’s just rented space anyway.
JOE OBBISH: Yeah, I’m going to get rid of grouping sets. So I’m writing this out the long boring way. Let’s see what happens. As far as I know, the optimizer will not pick your spool plan here. It’s just not coded to do that. Alright, so it took 11 seconds. There’s no longer a spool, so it performs this join twice instead, which is a strategy that it can choose for grouping sets, but this is the only strategy if I don’t use grouping sets. Alright, I mean, so before it was 30 seconds, now it’s 11. That’s not bad, right? So we’re back in row mode. I got rid of my empty CCI. So why not add batch mode back, right?
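Written out the long boring way, the grouping sets query becomes a UNION ALL that performs the join twice – something along these lines, with the same paraphrased names:

```sql
/* Without GROUPING SETS, the optimizer won't pick a spool plan;
   the only strategy is to do the join once per aggregate. */
SELECT d.LastDayOfQuarter, f.UnitID, NULL AS UnitClassID, SUM(f.Price) AS TotalPrice
FROM dbo.FactOrders AS f
JOIN dbo.DateDim AS d
    ON f.OrderDate = d.TheDate
GROUP BY d.LastDayOfQuarter, f.UnitID

UNION ALL

SELECT NULL, NULL, f.UnitClassID, SUM(f.Price)
FROM dbo.FactOrders AS f
JOIN dbo.DateDim AS d
    ON f.OrderDate = d.TheDate
GROUP BY f.UnitClassID;
```

Joining twice sounds wasteful, but here it beats writing nine million rows into a serial spool.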
ERIK DARLING: That repartition stream is like the middle part of query plan choices. It’s bad.
JOE OBBISH: Two seconds, no longer any repartition streams, pretty much everything runs in batch mode. I think every operator does, except for the scans, because these are rowstore tables and we’re on 2017…
ERIK DARLING: What do you think this is, 2019?
JOE OBBISH: These can run in batch mode too, and I think everything else as well – this is in row mode, but we’re not spending any time here. So I started with that grouping sets query. The optimizer thought, well, instead of doing the join twice, maybe it would be better to take that join and throw it into a spool. That’s not a good fit here. There’s a lot of overhead when writing to the spool. We write to the spool at MAXDOP 1 and all of the aggregates of that spool are at MAXDOP 1. So, to get from that to a faster query, I got rid of grouping sets, added batch mode, and that’s really all I did.
So I wouldn’t say that grouping sets is bad, but if you need the best possible performance and if you have a spool, in my experience, there’s often a way to get rid of that spool to make performance better. That’s the end of the spools demo. Are there any questions?
ERIK DARLING: We’ll wait a second to see if anything comes out. One person’s typing, so we’ll let that slide in. Hiren wants to know why you use the column store hack on the top query and not the second one as well.
JOE OBBISH: By the second one, do you mean the union all query?
ERIK DARLING: Yeah, because it looks like you’re joined to the batch mode table there, but not on the second one.
JOE OBBISH: Oh, well, sure, so this is one query. So all I need is one column store index in the query. Before, when I was writing to a temp table, these are two separate statements. So if I only had batch mode here, then the second query wouldn’t be able to get batch mode.
ERIK DARLING: No essential parts of the second query were in batch mode – like, in other words, that one column store table being present allowed any operators that could be in batch mode to be in batch mode for the entire query, not just for that part of the query?
JOE OBBISH: Yeah, so I could add this, if I wanted to, but that wouldn’t make a difference here. But that’s something I could do.
ERIK DARLING: If it would make you feel better. Sometimes you just have to write queries that make you feel better; drop database, all that stuff. Alright, I’m going to steal presenter back and I’m going to go into my garbage queries. I’m not nearly as smart as Joe, so my queries aren’t as good as his. You’re going to have to forgive me. He has a degree in some sort of math. I, you know, I can count to her… that’s my gift.
JOE OBBISH: Not even 10?
ERIK DARLING: No, because then I’d have to look over here and I’d lose my place. I’d just get confused somewhere in the middle. I don’t know where it goes. So before I get into my demo, I need to show you something about the votes table in the Stack Overflow database. The only two vote type IDs I care about right now are number one and two. I don’t care what they mean. I don’t care what they are. I just want to show you the numbers involved here.
Vote type ID 1 has 3.7 million entries in the table. Vote type ID 2 has 37 million entries in the table. That means that’s how many times people have cast that vote in the Stack Overflow database. When I run this query, where I ask for the top 2500 rows from the post table joined to the votes table where vote type ID equals two and post type ID equals one ordered by creation date, this runs very quickly. You barely even notice that that did anything. If we go down and look at the messages tab, we did about 29,000 logical reads from the post table, about 29,000 from the votes table, and we hardly even touched our CPUs. We didn’t do anything.
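The fast query is something like this. The select list and the ordering direction aren’t read aloud, so those details are my guesses; the shape – TOP, a join from Posts to Votes, filters on both type columns, ordered by creation date – comes straight from Erik’s description:

```sql
SELECT TOP (2500)
    p.Id, p.Title, v.BountyAmount, v.CreationDate
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id
WHERE v.VoteTypeId = 2   /* ~37 million rows in the table */
AND   p.PostTypeId = 1
ORDER BY v.CreationDate;
```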
SQL Server asks for a missing index, which is a stupid missing index when you look at it, because it wants a column that we’re joining on and it wants a column that we’re ordering by in the includes, and that is just plumb useless. That is not where I want columns that I care about in my indexes. Joe doesn’t believe in indexes unless they’re column store. He uses heaps in clustered column store only. He doesn’t touch anything else. I’m an OLTP guy. I like to have stuff in the key of my indexes when I care about how I join and filter and order and all that stuff. He works in a data warehouse where he just has leisurely amounts of time to do all his funny little tricks.
So, I’ve got that query that runs very quickly, but when I turn this query around a little bit, things are going to change. Now, the indexes I have in place here on the votes table, I have the creation date column, because I’m ordering by creation date, and I have the vote type ID column because I’m filtering on vote type ID. And I have the post ID column because I’m joining on post ID, and that’s from the votes table. And then, in the post table, I have a key in my index on post type ID right there. I don’t need the ID column in the index definition because it’s a clustered index and it’s going to be implicitly present in all my non-clustered indexes. And then I’m just including these three columns because all I’m doing is selecting them up here. I’m not doing any filtering, joining, ordering, grouping, anything fun. I don’t even know how to write a grouping sets query, so you don’t have that to look forward to.
But when I run this query, where I’m looking for vote type ID 1 instead of vote type ID 2, things are going to change a little bit. This query is going to run long enough for me to text a kissy face to Joe’s anime girlfriend, and it’s just going to keep on going. This query, all of a sudden, starts taking about 15 seconds just to look for a vote type ID that has fewer entries in the table than the other one. This one only has about 3.7 million. Why would this take longer to return absolutely nothing?
JOE OBBISH: [inaudible] index…
ERIK DARLING: What? Yeah, I know, I do. I’m sorry about that. I should have just used a heap. Everything would have been better. So, when I look at what this query got up to, all of a sudden, I’m doing about 11 million reads from these apiece. That’s about 23 million reads total. And this query runs serially, so the CPU time is just about dead even with the elapsed time. And this is also pretty god-awful.
What happened? Well, it used the exact same query plan as the other one. It did the exact same thing, it just chose really poorly to do the exact same thing. Now, there’s no parameters here. There’s no parameter sniffing, there’s no cached plan. I’m using literal values – for each of these queries, SQL Server can make whatever plan it wants. It is free to choose, and yet it chooses to do this; this big dumb idiot of a plan. This is like me. This is like my life as a query plan. So right now, it’s estimating. It’s looking at this and saying, look, I only think about 7000 rows are going to end up coming out of here, and yolo all day long. Even though about 3.7 million rows come out, it doesn’t change its mind at any point. It doesn’t say, wait, okay, back up, bad strategy. 7000 rows, this is okay; 3.7 million, not so much. And now we have this happening, and because this produces 3.7 million rows, this key lookup to go get the bounty amount column out of the votes table, well this ends up executing 3.7 million times, even though SQL Server was only like, 7000, no biggie.
What happened? Why did SQL Server do this? And, more importantly, why did we do this index seek another 3.7 million times to produce not a single row? What was on your mind, SQL Server? The crack must have been fantastic. Where did you buy those drugs from? So we did 15 seconds worth of work. CPU and elapsed time were just about the same, 23 million reads. The optimizer chose, freewill, to do a key lookup there. It could have done anything else in the world. But it chose to spend its time doing a key lookup. That was a bad choice. Somehow, this query that processes 3.7 million rows and does all this work does not cost enough to go parallel. It only costs about 45 query bucks on my home laptop doo-dad setup – thanks, Brent. I have cost threshold for parallelism set to 50 because I’m a half-decent DBA. I change that setting when things get started.
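Cost threshold for parallelism is a server-level setting; bumping it from the default of 5 up to 50, as Erik describes, looks like this:

```sql
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

/* Plans must cost more than 50 query bucks before the
   optimizer will consider a parallel plan. */
EXEC sys.sp_configure N'cost threshold for parallelism', 50;
RECONFIGURE;
```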
So that query doesn’t cost enough to go parallel. It stays serial and it runs for a heck of a long time and it doesn’t return anything. The other query that had to go through 37 million rows finished in, like, milliseconds and was just done. This one here, forever, 3.7 million, nothing. What the heck? So I think that plan is dumb and I want to find out what happens if I tell SQL Server, no, instead of using that non-clustered index and doing that key lookup, what happens if we just use the clustered index instead. And I know Joe is also allergic to clustered indexes. I’m not going to take that personally. But if I go and I run this query, where I tell SQL Server, no, no, just skip that key lookup and use the clustered index, this is going to finish a bit quicker. But to return nothing, it still takes five seconds, and that’s pretty weird.
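Forcing the clustered index on Votes – skipping the key lookup – can be done with a table hint. This is a sketch of the hint, not necessarily Erik’s exact syntax; `INDEX(1)` means the clustered index, and naming the index works just as well:

```sql
SELECT TOP (2500)
    p.Id, p.Title, v.BountyAmount, v.CreationDate
FROM dbo.Posts AS p
JOIN dbo.Votes AS v WITH (INDEX(1))  /* 1 = the clustered index */
    ON v.PostId = p.Id
WHERE v.VoteTypeId = 1   /* only ~3.7 million rows, yet far slower unhinted */
AND   p.PostTypeId = 1
ORDER BY v.CreationDate;
```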
The plan changed and now SQL Server has chosen to do something else stupid. SQL Server has chosen to sort this data for some reason. It’s going to take all the rows that come out of the votes table and it’s going to sort them. And that sort is going to spill off to disk. Why did SQL Server choose to sort that data? Well, there are all sorts of different kinds of nested loops joins out there in the world. This one here just happens to have this funny little doo-dad called with ordered prefetch equals true. And now, what SQL Server has done is it’s resorted the data that came out of the clustered index scan to match the order of the join column that we’re hitting in the post table. And when it did that, it spilled a bunch of data out to disk. SQL Server’s being a big dummy today. It chose to ask for a fairly large memory grant. It asked for about 150MB or so of memory, underestimated that, spilled stuff off to disk just to sort this data to put it in order. Why, SQL Server, would you take 3.7 million rows and run a nested loops join 3.7 million times to produce zero rows? If we go and we look at this index seek over here, we can see, the number of executions, all 3.7 million that came out of there, number of rows, zip, zero, zilch, none, nada. You could write a Disney song about this query.
So, what happened? What if we tell SQL Server, no, I think nested loops was bad, I think you’re a little drunk. On top of the crack use, I think you’re also a little intoxicated. This query cost 324 query bucks, which is way more than the 45 that the other one cost. So we can see why SQL Server, internally, was like, no, we’re not going to pick that. But now, we can clearly see that we have a query that runs for about a third of the time that costs way more. So cost and execution time are just not even related. They moved to different countries in the Neanderthal period. They have nothing to do with each other.
What if we tell SQL Server, okay, that sort is dumb, that nested loops join is dumb? What if we just get away from the nested loops? What if we do something else? What if we say, like, you can do literally any other kind of join in the world that you want to do? So you have the option to do a merge join or a hash join, you just can’t do nested loops. That’s the only thing that’s off the table for you. If we run this query, well, that’s going to finish in two seconds. So avoiding nested loops, avoiding that sort, and avoiding that other key lookup, we have now gone from 15 seconds down to two seconds just by saying, SQL Server no thank you, I like what you’ve done in the past but I’m not sure about this. I don’t think you’re quite on point today.
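Ruling out nested loops is a query-level hint: listing the join types the optimizer is allowed to use excludes the one you leave off. Again, this is a sketch with my assumed select list:

```sql
SELECT TOP (2500)
    p.Id, p.Title, v.BountyAmount, v.CreationDate
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id
WHERE v.VoteTypeId = 1
AND   p.PostTypeId = 1
ORDER BY v.CreationDate
OPTION (HASH JOIN, MERGE JOIN);  /* nested loops is off the table */
```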
Now, when we look at this query plan – SQL Server chose a hash join, a big bad hash join, big strong hash join, big strong like bear – this is nice because it’s not repetitive. We’re not stuck in nested loops hell for 3.7 million rows to find nothing. It’s good, right? We didn’t waste a whole bunch of time to come up with nothing. This is a much smarter plan to me. I like this. Joe’s going to complain about bitmaps at some point. I’m okay with this bitmap. This is a two-second bitmap versus a 15-second nested loops join. You can give me that bitmap all day long.
When we look at the metrics for this one, we did a whole lot less work than we did with those nested loops plans. 243,000 on the votes, 96,000 on posts – we went parallel so we used a little bit more CPU to reduce the elapsed time, and that’s cool, but boy howdy, this plan was expensive. If we look at the cost of this plan, SQL Server says, oh man, this thing’s going to cost 330 query bucks to run, there’s no way I’m choosing this. I’m like, okay, but it finishes in two seconds, so can we move this up a little bit? I will gladly pay for the faster query, just tell me what to do. I will solve this problem with money, like bail.
So, now we have this query that SQL Server will never choose naturally. It will never say, oh, cheap query, good, we’re going to get this one. Under two seconds, zero spills, 330 query bucks. Well, it’s interesting because when we had SQL Server choose a different join type, it came up with the more expensive query, but it finished a lot faster. What if we just think that SQL Server chose the wrong join order? Because remember, I joined posts to votes up here, and when SQL Server wrote that first query with the nested loops, the first thing it did was hit the votes table and do that key lookup and then do a nested loops to posts. So what if I just think SQL – you just put the tables in the wrong order, SQL Server. You got a little confused in your inebriated state and things just got weird.
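Forcing the written join order – Posts before Votes – is one more query-level hint, sketched here under the same assumed select list:

```sql
SELECT TOP (2500)
    p.Id, p.Title, v.BountyAmount, v.CreationDate
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id
WHERE v.VoteTypeId = 1
AND   p.PostTypeId = 1
ORDER BY v.CreationDate
OPTION (FORCE ORDER);  /* join the tables in the order written */
```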
If I do this, what’s going to happen? Well, that finishes in about two seconds too. SQL Server, you could have just put the tables in a different order, pal, and everything would have been fine. We wouldn’t be here talking about this. I wouldn’t be pulling your pants down. This is a totally different query plan from the last one. Now, SQL Server has chosen a merge join. It says, no, if I have to do things in this order, we’re going to merge things together, I’m going to sort everything that comes out of the votes table, we’re going to do this merge join because merge needs sorted input from both sides. We can’t just have things floating around willy-nilly.
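For reference, the join order experiment can be sketched roughly like this, using the Stack Overflow Posts and Votes tables. The exact select list, TOP value, and vote type aren’t visible here, so treat those details as assumptions:

```sql
/* Hedged sketch: make the optimizer keep the written join order,
   with Posts listed first. Columns and predicate values are assumptions. */
SELECT TOP (2500)
       p.Id, v.CreationDate
FROM dbo.Posts AS p
JOIN dbo.Votes AS v
    ON v.PostId = p.Id
WHERE v.VoteTypeId = 4 /* assumed: a vote type that matches few or no rows */
ORDER BY v.CreationDate
OPTION (FORCE ORDER); /* keep the tables in the order they were written */
```

FORCE ORDER is a blunt instrument for production code, but as a diagnostic it answers exactly the question being asked here: would a different join order have been better?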
Now, it’s a parallel merge, which is making Joe cringe. I can actually see his bones shaking when he looks at a parallel merge join. But I’m okay with this because, still, we’re at two seconds. We’re not at 15 seconds anymore. SQL Server, why? Well, this query costs 443 query bucks for SQL Server to run. Boy howdy, I’m starting to see a pattern form here. Every query that does better costs more. Now, SQL Server is like, no but I got this nice cheap query over here. I’m like, but it takes you 15 seconds to return no rows, SQL Server, why do you do this to us?
Well, part of the reason is that the optimizer is clearly watching too much anime. The optimizer does not live in reality. The optimizer thinks that it has big muscles and spiky hair. The optimizer is just all wrong about everything that’s going on. These queries that cost way, way more to run where SQL Server chooses good operators are way, way quicker than that cheap plan where SQL Server chose all sorts of dumb operators like key lookups and sorts and other things. I bet if I had downloaded Postgres, Postgres would make much better choices. I think that might work, except Brent doesn’t let me have downloads at work. He blocks all downloads after some incidents – a couple minor security incidents where there were some EXEs involved. But I thought they were video games, so you can’t really blame me.
So, hinting the query in the way we did is a good way to figure out if the optimizer is wrong. Like, maybe you just did something weird. We’re going to review this together. We’re going to figure out what went wrong. And what I kept coming back to when I was looking, maybe we can just change the way the optimizer looks at our query. You remember when we ran the slow query – and I’m just going to go get the estimated plan for it – so when we look at this, we have this top all the way over here. So every time this top runs, SQL Server goes and asks for a row all the way over here and it has to do all this and it has to do all this. And SQL Server’s just futzing around like, can I have another row? Can we try this again? Can we try this again? And this top ends up popping off 3.7 million times because it’s like, I only want 2500, but it keeps not finding anything, so it has to go keep looking.
What if I put that top somewhere else in the plan? What if, at the end of the day, it was kind of my fault? What if I was a bad influence on SQL Server? What if there’s a way that I could make that top show up somewhere else? Well, there is. There are ways that you can set funny row goals in other places of your query so that SQL Server will think about the way it’s going to write that query plan for you a little bit differently. So if I run this query, I’m going to stick my top 2500 in a CTE, I’m going to order by, I’m going to sort of section this thing off, tell SQL Server, maybe just stick this in a corner. Maybe let this hang out by itself.
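The CTE rewrite being described has roughly this shape – a sketch only, with assumed column names; the point is that the TOP and its ORDER BY get sectioned off into their own part of the query, which moves the row goal:

```sql
/* Sketch: the TOP now applies to Votes alone, before the join to Posts. */
WITH v AS
(
    SELECT TOP (2500) v.PostId, v.CreationDate
    FROM dbo.Votes AS v
    WHERE v.VoteTypeId = 4 /* assumed value */
    ORDER BY v.CreationDate
)
SELECT p.Id, v.CreationDate
FROM v
JOIN dbo.Posts AS p
    ON p.Id = v.PostId;
```

One caveat worth stating plainly: moving a TOP around like this is not always semantics-preserving, so you have to verify the rewrite returns the same results as the original – which is exactly what gets checked a moment later in the demo.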
Well, I like how quickly that finishes even better than two seconds. Over in the messages tab, if I look at how much work this thing did, it’s about 7000-8000 reads on each table. That’s much better than we were doing before, and we’re back down to using almost no CPU. This is a pretty good time. I’m enjoying this. Thanks for that, SQL Server. And a big part of why that happened is that top is no longer over here saying, try again, try again, try again. It’s over here now and it’s just going to get 2500 rows in the order that I want over here and take those rows and then try to figure out that 2500 times from over here, and that’s a much better way of doing things to me.
I could also do this with a derived table. There’s really no difference. I could stick that top in a derived table and I could cross apply all the results out to posts. And if I do that, well that’s really quick too. And I know what you’re thinking. I know that when you’re looking at this, you’re like, but he’s not returning any rows. There’s no way these results could be right. There’s no way you could possibly do this. I’m still right, miraculously, somehow. I don’t know what happened. Probably because my wife wasn’t involved with this, I wasn’t arguing with her. But I’m miraculously right here because when I change the vote type ID that I’m looking for, when I change it back to 2 – remember that original one that I looked for that actually returned results – I actually get the same results from both of these and it’s amazing. It’s amazing because I’ve been using SSMS for about a decade and I still hit the wrong button to run a query sometimes.
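The derived table flavor is the same idea with different scaffolding – again a sketch with assumed column names:

```sql
/* Sketch: TOP in a derived table, then CROSS APPLY out to Posts. */
SELECT p.Id, v.CreationDate
FROM
(
    SELECT TOP (2500) v.PostId, v.CreationDate
    FROM dbo.Votes AS v
    WHERE v.VoteTypeId = 4 /* assumed value */
    ORDER BY v.CreationDate
) AS v
CROSS APPLY
(
    SELECT p.Id
    FROM dbo.Posts AS p
    WHERE p.Id = v.PostId
) AS p;
```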
If I run these two back to back, I get my data in the exact same way. So SQL Server actually comes up with the exact same query plan for both of them and it says, yeah, if I just stop making bad decisions because you are a bad influence on me as a query writer, all of a sudden, I can make better choices and I can actually have these queries finish in a reasonable amount of time. So, thank you for rewriting that for me. Because we proved that the optimizer didn’t know what it was doing with that first batch, but as soon as we changed a few things, SQL Server was able to come back and really make better choices for us and run queries faster. So that’s my gig. You ready, Joe? Do we have any questions on that? No, alright, cool. I’m going to pass it back over to you. Everyone’s too busy trying to give their SQL Server a blood alcohol test to ask questions. That’s fine with me. I got stuff to do later. Alright, so I’m going to pass presenter back over to you and you can…
JOE OBBISH: Am I good?
ERIK DARLING: You’re good. I can see all of your – oh god, put that away. Just kidding.
JOE OBBISH: Okay, there’s three tables here. We don’t really need to know much about them. We have a small table, a medium table, and a large table. Now, Erik hacked my computer and he added a few indexes. That’s pretty unfortunate, right? We still have a heap, fortunately. So there are no clustered indexes involved, so that’s good. We have two non-clustered indexes, one on the IDA column and one on the IDU column. I’m going to go ahead and turn off result sets. So the query that we want to run is this one. So I’m joining my small table to my large table on the IDU column. I’m joining my large table to my medium table on the IDA column. I’d like for this query to take a few seconds. And it doesn’t.
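The demo query has this rough shape. The table and column names come straight from the description (three heaps named small, medium, and large, with single-column nonclustered indexes on large over IDA and IDU), but the select list is an assumption:

```sql
/* Hedged sketch of the three-table join from the demo. */
SELECT COUNT(*)
FROM dbo.small AS s
JOIN dbo.large AS l
    ON l.IDU = s.IDU   /* nonclustered index on large (IDU) */
JOIN dbo.medium AS m
    ON m.IDA = l.IDA;  /* nonclustered index on large (IDA) */
```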
ERIK DARLING: I’d like a drink. We’re both screwed.
JOE OBBISH: 16 minutes – that’s how long you’ve got left. 23 seconds. Now, this plan is a bit more complicated than you might think. So I’m joining three tables together and I don’t really have indexes that work well, so I’m making a lot of hash joins. That makes sense. SQL Server is saying 81 query bucks, so it’s a parallel plan. Okay, that makes sense. Here’s my scan of small, here’s my scan of medium, but for large, we have this index intersection plan here. The query only needs the IDA and the IDU columns, so we have two single column non-clustered indexes. We don’t have one covering index. We have two indexes that, when combined, can be covering. So one of the choices the optimizer has is it can join those indexes together. And it might do that when the tables are very large and your indexes are relatively small. So the optimizer thinks, okay, the overhead here of joining together two non-clustered indexes is smaller than the scan.
Now, this query took 23 seconds. I have these bitmap operators. So all a bitmap operator really is is a shortcut – it’s normally for hash joins. The basic idea is that, instead of doing the full hash join algorithm for every row on the probe side, it will create this lightweight little bitmap structure which allows us to skip some of the rows on the probe side. So if my hash side has a lot of rows, or if my probe side has a lot of rows not present on the hash side, you might see this bitmap. And these bitmap operators are almost always good, except when they’re not. You can’t directly turn them off, so sometimes you have to live with them.
Now, I have a spill here. That’s not good. One weird thing is this table has eight million rows and I got eight million rows back. So despite this bitmap on column IDA, which was created here – no, it was created here, IDA – none of the rows were filtered out. And if I look at this scan, I see something weird. It’s a parallel plan, but there aren’t numbers for each thread. And if I go to the parent operator, it shows a DOP of zero – the query ran serial, and that looks like a serial query. So what happened here? I had resource governor turned on.
Resource governor restricts every query to MAXDOP 1. The optimizer doesn’t know that. It doesn’t know that resource governor is restricting it to MAXDOP 1, so it can’t factor that into its plan choices. So it thinks it’s going to get a parallel plan and it creates these three bitmaps. Now, row mode bitmaps are only eligible in parallel plans. Now, two of these bitmaps are optimized bitmaps. You can see with the OPT here. All an optimized bitmap is is a bitmap that’s created during query optimization and can influence query optimization, whereas the other kind – they call them post-compile bitmaps or something – gets added in at the last second. They don’t really change the plan, it’s just a way to reduce the cost a little bit.
Now, this table has eight million rows. SQL Server thought the bitmap would reduce the rows to 1.2 million. And that might have happened if the query ran parallel, but it didn’t. So this operator doesn’t run, this operator doesn’t run. The bitmap isn’t applied, so we get the full result set spilling into tempdb. So if I’m stuck at MAXDOP 1 – say some angry DBA who doesn’t like me put it there – then sometimes, putting in that MAXDOP 1 hint can make a big difference. So I’m giving the optimizer more information. I’m letting it know, okay, you have to run in MAXDOP 1. So we’re in row mode. There are no bitmaps. The optimizer is under no illusion that there will be bitmaps. And even better, our eight million row table, it’s only on the probe side. So before, since I had that index intersection plan, I had an eight million row result set on the hash side, and now I don’t.
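The fix being described is just the same query with an honest MAXDOP hint – sketched here with the assumed names from the demo description:

```sql
/* Hedged sketch: tell the optimizer the truth about parallelism up front,
   so it doesn't plan for row mode bitmaps that will never run. */
SELECT COUNT(*)
FROM dbo.small AS s
JOIN dbo.large AS l ON l.IDU = s.IDU
JOIN dbo.medium AS m ON m.IDA = l.IDA
OPTION (MAXDOP 1); /* serial plan: no bitmaps, no index intersection surprise */
```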
So all I did was change MAXDOP 4 to MAXDOP 1. I’m not tricking the optimizer anymore and the query is much more efficient. Now, I can turn off resource governor. Your queries might get downgraded in DOP for other reasons. You might run out of parallel threads, for example. So even if resource governor isn’t involved, you might have this problem still. What can you do about it? Well first, let’s see how things work in the ideal case where I don’t get downgraded. It actually wasn’t fast. Let’s try running that again.
ERIK DARLING: You need more memory.
JOE OBBISH: I probably do. Okay, it’s a little faster now. It’s a little faster. It’s probably not that competitive with a serial plan. It’s a little faster. It’s smaller than 21 seconds, right? Now, what if I want to make this query better? Well, you might know about batch mode bitmap operators. So these work a bit differently than row mode. One of the important ways they work differently is they’re eligible for MAXDOP 1 plans. So you might think, okay, if I’m worried about DOP downgrades, maybe I should throw a CCI into my plan. Maybe that will help. We’ll go ahead and try that. It’s slower. I added batch mode and it got slower. How is that possible? I thought batch mode was always better, right?
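One common way to sneak batch mode into an otherwise rowstore query – and, hedging here, this may not be exactly how the demo does it – is to left join to an empty table that has a clustered columnstore index. All names below are assumptions:

```sql
/* Assumed helper table. The join condition never matches, so the results
   are unchanged, but the presence of a columnstore index makes the query
   eligible for batch mode processing. */
CREATE TABLE dbo.batch_mode_helper
(
    i INT NULL,
    INDEX cci CLUSTERED COLUMNSTORE
);

SELECT COUNT(*)
FROM dbo.small AS s
JOIN dbo.large AS l ON l.IDU = s.IDU
JOIN dbo.medium AS m ON m.IDA = l.IDA
LEFT JOIN dbo.batch_mode_helper AS b
    ON 1 = 0; /* intentionally never true */
```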
ERIK DARLING: Well, it is in 2019. That’s what I keep here. [crosstalk]
JOE OBBISH: People say a lot of things…
ERIK DARLING: Especially when they sell software.
JOE OBBISH: Yes. So, part of the query runs in batch mode. A lot of it doesn’t though. Now, SSMS isn’t nice enough to tell you at a glance, hey, these operators are in batch mode, these operators are in row mode. It’s unfortunate. But when you see these repartition streams operators, that’s almost always a sign that operators around there are not running in batch mode. There’s a limitation on batch mode: index intersection plans are not eligible. I end up with a mostly row mode plan.
Now, this spills to tempdb suddenly and my estimates went way down. It’s only 120,000 rows. So what actually happened here is this batch mode join is a bitmap creator. It’s creating a batch mode bitmap.
ERIK DARLING: All these strings and creators – this is like Ghostbusters, man. I don’t know what to do.
JOE OBBISH: You can sit back and relax. That’s all you have to do. Here is our batch mode bitmap. It’s in this filter operator. It can’t be pushed all the way through to the hash join. And, I mean, that makes sense. So this filter is on the IDU column. This index scan is IDA. This index scan is IDU. So, of course, this filter can’t be pushed, right? The column isn’t there yet. The column isn’t in the result set until here. It’s just not output yet.
ERIK DARLING: Son of a gun. That seems suboptimal to me.
JOE OBBISH: What’s really suboptimal is that the expected cardinality reduction from this bitmap is pushed. 120,000 rows. It’s the same thing. So, this is a bug. The optimizer makes the mistake wherein it pushes the cardinality estimate reduction down, but not the filter itself. So, as a result, our estimate is way off. This operator just doesn’t get enough memory.
ERIK DARLING: SQL Server is dumb.
JOE OBBISH: So, bitmap filters, or bitmap operators, they’re normally very helpful. There are some edge cases where they’re not helpful. There are bugs sometimes with batch mode variety. And if you get a serial plan, whereas SQL Server thought it would be parallel, you can run into problems. We have time for quick questions if there are any.
ERIK DARLING: No questions yet, but we do have someone typing, so who knows?
JOE OBBISH: I hope they type quickly.
ERIK DARLING: I don’t know, it’s only Niko next. If we go a little over, that’s fine. What’s he going to do? Get on a boat and hit us?
JOE OBBISH: He does live very far away from us.
ERIK DARLING: He does. It would take him a long time to punch either of us. Alright, no questions, so we are going to trudge forward.
JOE OBBISH: No questions – this is the last demo of the session. We have a very simple table here, three columns, sales date, the amount sold, and was it a robot sale. So I employ both robots and non-robots. Some of the robots make sales too. The robots work very hard. There are no indexes except the column store. So I have a column store index defined on my table. Now, I have some suspicions about my non-robot workers. I feel like, when we get to the last month of the year on Friday, they don’t work very hard. So I’m going to write a query where I want all of the Fridays of this month where I didn’t make any sales that were by a human.
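A hypothetical version of the table, matching the description – three columns and a clustered columnstore index, with names that are guesses:

```sql
/* Assumed schema for the sales demo table. */
CREATE TABLE dbo.sales
(
    sale_date DATE NOT NULL,
    sale_amount MONEY NOT NULL,
    is_robot_sale BIT NOT NULL,
    INDEX cci CLUSTERED COLUMNSTORE /* the only index on the table */
);
```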
ERIK DARLING: Lazy humans.
JOE OBBISH: Lazy humans. Maybe the robots deserve our jobs after all.
ERIK DARLING: Someone ought to annihilate them.
JOE OBBISH: The humans or the robots?
ERIK DARLING: Humans, useless.
JOE OBBISH: Oh, okay.
ERIK DARLING: I don’t know if you saw the career talk, but we’re all destined to be in sales, so this is the worst possible outcome.
JOE OBBISH: Now, robots might have infinite patience, but I don’t. I mean, 12 seconds, [inaudible] I’m only getting back four rows. I don’t want to hurt anyone, right? 12 seconds for four rows, that’s like three seconds per row. I could write those rows out faster with a pen and paper. So there’s a top operator here. I didn’t ask for a top; the optimizer added one. This top is really unfortunate. This is a left anti-semi join, so the inner side would quit early anyway when it finds the first row. The optimizer just adds this top for optimizer reasons that aren’t very good.
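The demo query presumably has this shape. The source of the Friday dates is an assumption (a calendar table is one reasonable guess), as are all the names:

```sql
/* Sketch: Fridays in the month with no human (non-robot) sales.
   NOT EXISTS compiles to a left anti semi join, and the optimizer
   adds its own TOP on the inner side. */
SELECT d.the_date
FROM dbo.calendar AS d
WHERE d.day_name = 'Friday'
AND d.month_number = 12
AND NOT EXISTS
(
    SELECT 1
    FROM dbo.sales AS s
    WHERE s.sale_date = d.the_date
    AND s.is_robot_sale = 0
);
```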
The result of this top is that it reduces the cost of the query plan more than it should. So, SQL Server thinks this query plan is only 19 query cents, but it took 12 seconds. So there’s something way off with the cost estimation. And it says this scan is 59% of the cost, but the scan took one millisecond, right? So it’s crazy. This is not the kind of plan you want to see with column store. I have my column store on the inner side of a nested loops join. It’s running in row mode. I have 36 million rows but I read 129 million. This is horrible. This is a horrible plan.
ERIK DARLING: Everything went wrong.
JOE OBBISH: Pretty much everything went wrong, that is true. Now, this is caused by a row goal. So the row goal optimization isn’t always bad. It can be very helpful. The idea is, if at any point in the plan I don’t need the full result set, maybe there’s a more efficient way to get only the rows I need. So, if you imagine a well indexed query written by Erik, if I only need the first row, it might be faster to build a plan designed to get that first row instead of all of the rows.
Here, there is a row goal added. It’s terrible. It causes problems. How can I make this query faster? Maybe I’ll just brute-force it, right, and add an undocumented trace flag and make it run parallel. Parallelism is good, right?
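The undocumented trace flag in question is very likely 8649, which nudges the optimizer toward a parallel plan regardless of cost. A sketch with the same assumed names as above – strictly a demo trick, not something to ship:

```sql
/* Hedged sketch: force a parallel plan for the anti-semi-join query.
   Trace flag 8649 is undocumented and unsupported. */
SELECT d.the_date
FROM dbo.calendar AS d
WHERE d.day_name = 'Friday'
AND d.month_number = 12
AND NOT EXISTS
(
    SELECT 1 FROM dbo.sales AS s
    WHERE s.sale_date = d.the_date
    AND s.is_robot_sale = 0
)
OPTION (QUERYTRACEON 8649); /* demo use only */
```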
ERIK DARLING: Always…
JOE OBBISH: Always, the man says. Alright, it got a little faster, but not really. [crosstalk] So I’m running at MAXDOP 4. If I have no waits and perfect efficiency, I want to see a CPU time to elapsed time ratio of 4.0. Instead, it’s like 1.3.
ERIK DARLING: Lofty goals from you, huh?
JOE OBBISH: I demand the best. We can look at waits, but there aren’t any waits here. We have like 100 milliseconds of waits, so it’s not waits. Optimizing by waits just doesn’t work. Look at thread two and thread three. They did nothing. They didn’t help at all. They might as well have been run at MAXDOP 2, so why did this happen? On the outer side, which drives the loop, threads two and three don’t get a row. So I’m only pulling back four rows from this scan. There are only four Fridays in this month, after all. And the smallest unit that the demand-based scan can pull back is a page.
So if I have four rows, three rows on one data page and one on the other, there has to be two threads that don’t do anything. There just aren’t enough rows here. The data isn’t spread out well enough to get even distribution of work over all my threads.
ERIK DARLING: I thought you just let them join a thread union.
JOE OBBISH: I am going to twist SQL Server, bend it to my will. I’m going to add a top. [inaudible] with the top, what I’m doing is I took that date dimension query, put it in a derived table, add an arbitrarily large top. You can use your phone number. I didn’t because I don’t have one. You can use the bigint max, the int max, whatever you want. Now, SQL Server has to guarantee correctness, so it will enforce this top. And when we enforce the top, we add this gather streams, this round robin distribute streams, so now the rows are forced to be evenly distributed, one for each thread, and the query is faster.
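The rewrite being described, sketched with the same assumed names: an absurdly large TOP on the date side has to be enforced, which adds exchanges that redistribute the rows evenly across threads:

```sql
/* Hedged sketch: the giant TOP forces round robin distribute streams,
   so each worker thread gets a Friday to process. */
SELECT d.the_date
FROM
(
    SELECT TOP (9223372036854775807) d.the_date /* bigint max; any huge number works */
    FROM dbo.calendar AS d
    WHERE d.day_name = 'Friday'
    AND d.month_number = 12
) AS d
WHERE NOT EXISTS
(
    SELECT 1 FROM dbo.sales AS s
    WHERE s.sale_date = d.the_date
    AND s.is_robot_sale = 0
);
```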
Now, it still isn’t a good query, right? I mean, we’re still reading 129 million rows, it’s just more parallel. If I rewrite my date dimension filter so I’m filtering directly on the date, which is the column I’m joining on, the optimizer can push that filter down. So I’m pushing the filter down, and I loaded my data in order of sale date, so SQL Server can use the metadata for the compressed rowgroups to skip almost all the I/O, which is what you see here. So the query is very fast. So this is still like a bad column store plan, but I’m reading so little data from it that the top isn’t nearly as harmful.
I haven’t really fixed the fundamental problem, I’ve just made the query more efficient. The fundamental problem is this row goal optimization. The top isn’t helpful. The top reduced the cost of this plan way too much. So, this apparently happened often enough that Microsoft created a use hint just for this. I’ll go ahead and run it.
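The hint being referenced is `DISABLE_OPTIMIZER_ROWGOAL`, available through USE HINT on SQL Server 2016 SP1 and later (trace flag 4138 has the same effect on older builds). Sketched with assumed table and column names:

```sql
/* Hedged sketch: tell the optimizer not to apply the row goal at all. */
SELECT d.the_date
FROM dbo.calendar AS d
WHERE d.day_name = 'Friday'
AND d.month_number = 12
AND NOT EXISTS
(
    SELECT 1 FROM dbo.sales AS s
    WHERE s.sale_date = d.the_date
    AND s.is_robot_sale = 0
)
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));
```

The other option mentioned a moment later – telling SQL Server outright to use a hash join – can be done with `OPTION (HASH JOIN)` at the end of the query instead of the USE HINT.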
ERIK DARLING: Don’t lie, it was just for you.
JOE OBBISH: It was not for me, I had no influence here. So we’re up to four query bucks, but this query is even faster. We had a hash join. Hash joins are generally what you want for column store. We only scanned the column store once. So here, the row goal that was introduced by the not exists was very unhelpful. It tilted the optimizer to pick a bad plan. Another option you have is you can just say, okay, I know you want a hash join, give me a hash join. Any questions? Otherwise, you can lead us out.
ERIK DARLING: Nope, Kenneth commented, “That sounds like a great way to confuse a coworker, by putting that top bigint max in there without a comment.” But, you know, comments take up space and time and we don’t have unlimited space and time.
JOE OBBISH: Yeah, if you find yourself stuck in a situation where that’s the way to fix the query, having a comment’s probably what you want to do.
ERIK DARLING: I don’t know, I don’t put comments in any of the Blitz code because, I don’t know, if anyone’s unfortunate enough to read it, I don’t think they’re going to have fun anyway.
JOE OBBISH: If you use your phone number, then the coworker can just call you with questions.
ERIK DARLING: Yeah, I’m not doing that one bit. I’m going to use someone else’s phone number. Alright, so, what did we just do? We just saw SQL Server make a bunch of really bad choices based on what it thought made sense from a cost-based perspective, where every time we ran a query, SQL Server went, no it will be much cheaper if I do this, if I take this shortcut, if I use this operator. But every single time, SQL Server was just wrong. The optimizer makes these choices again based on how much it thinks things will cost to do. That cost is an arbitrary measurement of CPU and I/O, which don’t really mean anything anymore. It’s just these funny little decimal numbers that it sticks in plans to look fancy, I think. I don’t really ask why.
But just keep in mind that the optimizer’s job isn’t to find a perfect plan. It’s not the optimizer’s job to say, every single time you run a query, I’m going to go to the bank on this, I’m going to figure out what the absolute best plan is, I’m going to test different plans, I’m going to do all this stuff. It doesn’t do that. It’s like you at last call, and it’s just like, I want a cheap plan, I want it good, and I want it fast and that’s it; like, just good enough, get out the door, done. It does not spend a lot of time thinking about you.
Sometimes, that bad plan is your fault for all the blah-blah stuff that you’ve heard a million times about the best practices and the functions and the indexes and the yada-yada. Boring, right? Sometimes, when that bad plan isn’t your fault – well, that’s because the optimizer has that jar of desk whisky too. Right now, there are still no robots that go and fix this stuff. There’s no one popping up to say, oh I have a really good idea here. I’m going to go behind the scenes. I’m going to go tweak and tune this thing and make it better.
If you follow those best practices from the blah-blah stuff and you still have these problems, congratulations, you have an interesting job. Because remember, these are choices that the optimizer thinks were really good. These were things that the optimizer was just like, they’re going to be blown away by this, but it ends up being terrible. It’s up to you to track this stuff down in query – A, track the queries down, and then track it down in the query plan, do all this stuff and figure out what’s wrong and really get your little baby fingers dirty on that stuff.
If you can do that, you are safely employed, at least until the robots have come along and assimilated you entirely. That’s all for us today. I think we finished in a pretty wonderfully on time prompt manner.