You’ve been writing T-SQL for a while, but you want some easy pointers on how to make it go faster.
I want to show you how just a little neater, tighter code can also run a lot faster. These are some of the instant performance improvements that won’t change a lot of code, won’t generate weeks of testing and validation for your colleagues, and can be relatively quick to implement.
But most of all, they’ll make the code prettier and easier to understand for the next person.
We’ll look at common ways that null values, implicit datatype conversions and foreign key constraints can radically affect the performance of your queries. It’s inspired by an actual client case, where I tuned a 20-hour query down to a few minutes with only a minor code change.
This presentation is all code, some execution plans, and no PowerPoint.
Brent Ozar: Alright so ladies and gentlemen, in this hour of GroupBy, Daniel Hutmacher will be talking about not just polish – how good code also runs faster. So take it away, Daniel.
Daniel Hutmacher: And there goes the adrenaline. So hi everyone. I already have a heart rate of 150 or something my Apple watch is telling me like, you’re having a cardiac arrest. I’m fine. Totally. So for those of you who don’t know me, I run the blog SQLSunday.com, which was sheer convenience. SQLSaturday.com was taken when I started the blog and I didn’t even know what it was.
Today, I’m going to talk about how neat, pretty code can sometimes run even better than elaborate, complicated code, because the common wisdom – and even I say this – is that the simple code is often the code that isn’t very efficient, and when you optimize it, for all sorts of purposes and indexes, the code tends to become a little more complex. So I spent a couple of days harvesting some really terribad queries that I’ve seen at clients, trying to anonymize them and run them against the Stack Overflow database. If you don’t know about the Stack Overflow database, Brent has prepared a huge database – in my sense huge, it’s like 100GB or something?
Brent Ozar: Yeah, it’s up to like 350 now too.
Daniel Hutmacher: Alright, mine’s old so mine’s just 100GB and I run database compression on everything. And so this gives you the chance to run some seriously heavy queries and experiment with indexes and what makes a difference really in terms of performance. So in my abstract I wrote no Powerpoint, only T-SQL. And this is one of two Powerpoint slides, that’s it. So if there are any questions or if there is anything you see, this is a very ad hoc session. I have a couple of scripts I’m going to run through so stop me any time, ask questions, and we’ll see where it goes, see how many scripts we run through before our time runs out.
So this is my Management Studio, and I have a demo server with – it’s a pretty beefy Azure machine. I ran out of CPU allowance, they wouldn’t let me have more cores without actually opening a support ticket and I thought of this this morning, but it has something like eight cores and 50GB of RAM. You can see that this query here has been running for 23 minutes odd. It’s still running and I’m pretty confident that by the time this session is over, this query will still be running. So I’ll just leave that running here and maybe you can hear my laptop as well.
So I’m going to run through a couple of patterns and see some of this is going to look like a hobo stayed in it for two years or something. I didn’t clean up after him. This is actually stuff – I didn’t make this shit up. I see this at clients and even some – let’s call them senior BI professionals – write these types of queries. And so they just hurt my eyes and I always fix them instead of doing what clients actually pay me to do, but on the plus side, their servers run faster when I’m done.
So let’s start off with some date functions. Surprisingly, a lot of people who work in T-SQL don’t really know how to work with dates, and I’m just going to show you some simple examples. So here’s a pattern that I find quite often, and if you look at it, it’s like what does this code actually do? And you look at it and you think, okay, how many dates difference from zero to the date is it, and then you add those days back to zero. And what you get is that you’ve just rounded off the date so you get the date part instead of the full date time with hour, minute, and seconds. And there’s a much, much simpler way to do this. You can just cast it as a date. That’s it. It’s that simple. This is much more readable. So if you’re thinking, is there a function that will round off the time parts of the date time, just cast it to date. Previously on SQL Server 2000 I think, you needed to cast it to float, floor and then cast the date time back before you had a date data type.
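The two patterns being contrasted look roughly like this (a sketch, assuming a datetime value; the variable name is made up):

```sql
DECLARE @d datetime = '2018-03-15 14:37:22';

-- The pattern he keeps finding: count the days since day zero,
-- then add that many days back onto day zero.
SELECT DATEADD(DAY, DATEDIFF(DAY, 0, @d), 0);   -- 2018-03-15 00:00:00.000

-- The readable equivalent: just cast to date (SQL Server 2008+).
SELECT CAST(@d AS date);                        -- 2018-03-15
```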
Brent Ozar: That is terrible.
Daniel Hutmacher: Yeah, I see this everywhere. I thought it was just the one person, but I kept seeing it at different clients, in different code – you look at it and your brain can’t figure out what it actually does. On the plus side, I will say, when I experimented, you can change the day part to pretty much anything, and that’s where it comes into its own, so you can round to the nearest hour for instance. It’s still not readable, but there is some kind of purpose, I suppose. Here’s another one that I found. Say you want to convert a date to the first day of the month. So you could – I’m saying could – convert the date column to a character with eight-character length. Using the convert function, you can specify the format so you’re sure it will always be the same. Then you keep the first six characters, which are the year and the month, and you append a zero one. Now we’re working with a varchar, right? And then you convert it all back to date. And that looks something like this. So here we have the year, the month, and the first of the month for each date.
Brent Ozar: I have totally done that.
Daniel Hutmacher: Yeah, I guess you can look at this and make sense of what it does, but wouldn’t it just be easier to use the DATEFROMPARTS function, which is new in SQL Server 2012? You just feed it a year, a month, and a one. That does exactly the same thing, and it’s easier to read. In the same vein, last day of the month gets even messier, because reading from the inside out, we convert to a character, keep the first six characters so we have the year and month, and then we append a zero one. Turn it back into a date, add a month, and then subtract one day from that. So we have the first day of the next month, you subtract one day, and what you get is the last day of the current month, right? Or you could use the EOMONTH function, which is also new in SQL Server 2012. So if you’re on 2008 or 2008 R2, EOMONTH won’t work.
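Side by side, the string-mangling versions and their modern equivalents (a sketch; DATEFROMPARTS and EOMONTH require SQL Server 2012 or later):

```sql
DECLARE @d date = '2018-03-15';

-- First day of month, old style: convert to yyyymmdd (style 112),
-- keep the year and month, append '01', convert back.
SELECT CAST(LEFT(CONVERT(char(8), @d, 112), 6) + '01' AS date);   -- 2018-03-01

-- First day of month with DATEFROMPARTS:
SELECT DATEFROMPARTS(YEAR(@d), MONTH(@d), 1);                     -- 2018-03-01

-- Last day of month, old style: first of next month, minus one day.
SELECT DATEADD(DAY, -1, DATEADD(MONTH, 1,
       CAST(LEFT(CONVERT(char(8), @d, 112), 6) + '01' AS date))); -- 2018-03-31

-- Last day of month with EOMONTH:
SELECT EOMONTH(@d);                                               -- 2018-03-31
```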
Brent Ozar: But at the same time, the good part of that, like 2008 and 2008 R2 are out of support next year anyway so it’s one of those times where…
Daniel Hutmacher: Yeah, no one’s working with them anymore. None of my clients are. And on the same note, if you can use EOMONTH to calculate the last day of the month, you can also use DAY(EOMONTH()) to get how many days there are in the current month. So I saw this one client who had a distribution function where he has transactions on a date in a month and he wants to split those transactions into even-sized rows, one for every day of the month. So for March, that would mean dividing by 31, and the calculation was just mind-blowing. It involved something like this.
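The days-in-month trick described here is just DAY over EOMONTH:

```sql
-- How many days are there in the month of @d? Take the DAY part
-- of the last day of that month (SQL Server 2012+).
DECLARE @d date = '2018-03-15';
SELECT DAY(EOMONTH(@d));  -- 31 for any date in March
```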
Brent Ozar: Yeah, that’s awesome.
Daniel Hutmacher: So that’s for the date functions. I’m just going to blow right through all these scripts and just run until someone stops me or asks me a question. Just let me know if there’s anything.
Brent Ozar: There’s a good one over in the Slack. Mick says there’s a related question on the Database Administrators Stack Exchange; he says, “When I cast stuff to dates, is it sargable? Should I use it inside the where clause in order to get index utilization or not? Do I run into problems with it?”
Daniel Hutmacher: I would generally say whenever you cast something, I think you’re screwed. I haven’t tried. I think I’ve tried it and I think you’re screwed basically.
Brent Ozar: Yeah, if it’s in the where clause.
Daniel Hutmacher: With regards to sargability. You know, readable code and all.
Brent Ozar: It’s a different problem.
Daniel Hutmacher: Yeah. Next up – again, you’re not going to believe this stuff is real, but this is actually code that I’ve taken from a client. It’s not that exact code, but it’s in the same vein. So you have two common table expressions, comments and badges. On Stack Overflow, there’s a badge called Commentator, and you get that badge if you have 10 comments or more on posts. So you can select from comments, group by the user, having 10 or more comments. And you can select the users that have the Commentator badge, and now I want to full join these two. So I want all the users with 10 or more comments and I want all the users with the Commentator badge, and I want to see where there’s a difference. So the obvious solution to this would be to do select from comments, left join badges, right? And then union it with badges where the user isn’t in comments, right?
Brent Ozar: How did they – I would like to see a diagram of that thought process.
Daniel Hutmacher: Yeah. I guess my reasoning here is I think this person didn’t know how a full join works because full joins come with caveats as well because you can really easily screw one up if you put the where clause in the wrong place or things like that.
Brent Ozar: Or just read advice somewhere once that full joins are bad.
Daniel Hutmacher: Yeah, because it said so on the internet. And this is an ugly and expensive query, and you don’t want to go there. So instead, you can just full join. Instead of a left join you can full join, and that means we get all the rows from comments, all the rows from badges, matching where they do, and otherwise you just get null values. I’m not even sure if this will run if I’m maxing all my eight cores. Oh, and I love the new live query statistics thing. If any one of you is working with live query statistics, it’s a fun gimmick. It’s really cool to show off. One out of 10 times, Management Studio will crash on you. They fixed a lot of problems in Management Studio 17.4, I think; they still have a few issues to work on. If you’re working for Microsoft and you want a repro of a query that crashes live query statistics, let me know.
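A sketch of the cleaned-up version, using the Stack Overflow schema (the exact column names and the 10-comment threshold are assumptions based on the description):

```sql
WITH commenters AS (
    -- Users with 10 or more comments
    SELECT UserId, COUNT(*) AS CommentCount
    FROM dbo.Comments
    GROUP BY UserId
    HAVING COUNT(*) >= 10
),
badged AS (
    -- Users holding the Commentator badge
    SELECT DISTINCT UserId
    FROM dbo.Badges
    WHERE Name = 'Commentator'
)
SELECT c.UserId AS CommenterUserId,
       b.UserId AS BadgedUserId
FROM commenters AS c
FULL JOIN badged AS b ON b.UserId = c.UserId
-- Keep only the rows where the two sets disagree:
WHERE c.UserId IS NULL OR b.UserId IS NULL;
```

One full join replaces the left-join-plus-union pattern, so each set is read only once.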
Brent Ozar: You should put that in at that feedback site.
Daniel Hutmacher: Yeah, I’m going to do that someday. So you can see how it counts up the rows and adds up and…
Brent Ozar: Shows your progress live, and shows you how wrong the estimates are. You know, when it shows 200% done with something.
Daniel Hutmacher: Exactly. Anyway, that’s it for the – I don’t know what to call this. Is this a fake full join or an ugly full join? Next up, cross apply. For those of us who work in business intelligence, quite often we want to unpivot a table. So if you have three or four measure columns, you want to add a dimension called measure type or metric or whatever, and you want to unpivot the data so you have one row for each combination of dimension intersection and metric.
One way to do that is just to query the table multiple times so you can read the data. In this case, I’m splitting up the tags – basically I’m splitting the tags just to give SQL Server something to do, because who wants queries to go fast, right? So we have a metric called answers, where we sum answer count, and we have a metric called comments, where we do a sum of comment count. And while this query may be readable, it’s not really pretty, and when I timed it, it took 49 seconds to run. If you look at the execution plan, it’s a number of concatenations, and we basically seek or scan the index once for every query. So there’s a lot of I/O going on here.
Brent Ozar: Not a small table.
Daniel Hutmacher: Not a small table. It’s a very, very large table. Instead, you could just scan the table once. This is our split thing and then you could cross apply the values tag. I just learned a couple of years ago, this is very recent knowledge for me, that you can add a comma here to have multiple rows in a values clause.
Brent Ozar: That you can build your own table. Build your own table on the fly.
Daniel Hutmacher: Exactly. Except with cross apply, you can reference columns that aren’t hard coded. So I’m referencing columns in the posts table. So for every one row that comes out of posts, I create five rows in my cross apply.
Brent Ozar: That’s sweet.
Daniel Hutmacher: And believe it or not, this query looks a whole lot simpler. You have a single scan, and then the nested loop means that once for every row that comes out of the posts table, you perform the string split, and then for every one of those, you split it into those five rows, and then lots of aggregation happens. Disappointingly – and I was totally betting that this query was going to be so much faster; in theory it’s a fifth of the number of reads from the table, unless the aggregates screw it up, so in theory it should be like five times as fast, and mostly it is – it wasn’t for me. So I have a theory, and what I came up with was this: if I did the split on the tags first, so I get one row for each tag, then aggregate these five columns separately, put them in a sub-table, and then do the cross apply thing – so I aggregate first and only then multiply the rows – that would give me a considerable performance boost. It went from two and a half minutes down to 25 seconds, and 25 seconds is technically faster than 49 seconds, but yeah, it’s a Pepsi Challenge whether you think this or this query is more readable.
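The CROSS APPLY (VALUES …) unpivot being described looks roughly like this against dbo.Posts (the metric names are chosen for illustration):

```sql
-- One scan of the table; CROSS APPLY fans each post out into five
-- (metric, value) rows instead of querying the table five times.
SELECT x.Metric, SUM(x.Value) AS Total
FROM dbo.Posts AS p
CROSS APPLY (
    VALUES ('Answers',   p.AnswerCount),
           ('Comments',  p.CommentCount),
           ('Favorites', p.FavoriteCount),
           ('Views',     p.ViewCount),
           ('Score',     p.Score)
) AS x (Metric, Value)
GROUP BY x.Metric;
```

Unlike a plain VALUES table constructor, the rows inside CROSS APPLY can reference columns from the outer table, which is what makes this an unpivot.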
Brent Ozar: Now, how long did it take you as you were experimenting through these to get from that first query to the third query? Like hours, I would guess.
Daniel Hutmacher: Not minutes. I actually did try unpivoting with as small a data type as I could, because this data type expands into – this is a varchar something, and it takes up a lot of bytes. So not only do you get five times the number of rows; if you look at the query plan here – if you imagine that this is 130MB, this is 464GB, I’m not saying anything – and this is 2314GB. So you would just put a one, a two, a three, a four, and a five here. That would be just four bytes. Or if you put 0x1, that’s a binary one, which is a single byte, and that really improved performance for me. I got it down to like a minute or something. But completely eliminating the whole unpivot step until after the aggregation, that was what got us down to 25 seconds. So yeah, I have too much spare time. I took the entire week off, basically, just to build this presentation. See, no boss. No one’s asking, where have you been all week? Why aren’t you at the office?
Brent Ozar: You get to learn anything you want to, you get to build anything you want to.
Daniel Hutmacher: You guys should all totally run your own businesses. So in the next seminar, we’re going to talk about how to get your business running.
Brent Ozar: Why work 40 hours a week for someone else when you can work 80 hours a week for yourself?
Daniel Hutmacher: This one’s fun. Can you tell what this does?
Brent Ozar: Why would someone do a count star from a sub-query?
Daniel Hutmacher: Yeah, okay, so the count star is just so I don’t get two million rows back into my Management Studio. This is actually the query. But we all recognize this pattern, right? Where ID is less than max ID – so for each combination of user and badge, I just want to keep the most recent badge. You could actually type equals here or something, it’s the same principle. The problem is that the equality here is fine, but because you have the less-than and the order by here, this all turns into a nested loop join. And for those of you who are familiar with the Stack Exchange or Stack Overflow database, badges has something like 10 million rows, I’m guessing. So we’re going to make an index seek 10 million times against badges here, which is probably not a good idea. A sorted index seek, by the way.
So this query performs surprisingly well given how hopeless it is. But there’s a much better way to write it, using a window function. Essentially, a window function like LEAD returns a value or an expression from the next row – I’m telling it to look for this expression one row forward, and if the next row doesn’t exist, it just returns a one, the default. So if I partition this window function by user ID and the name of the badge, and order it by name and then ID as a tiebreaker, I will get a zero if there is a next row and a one if this was the last row for this user ID and name. So maybe the syntax isn’t entirely intuitive if this is the first time you’re seeing this kind of construct, but I use this all the time, I look at it all the time, I think it’s perfectly normal – and I can see people opening up their scotch bottles right now.
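The LEAD construct walked through here, sketched against dbo.Badges. The zero/one flag is exactly the trick described: LEAD returns the constant 0 from the next row, or the supplied default 1 when the partition has no next row.

```sql
WITH b AS (
    SELECT Id, UserId, Name, Date,
           LEAD(0, 1, 1) OVER (
               PARTITION BY UserId, Name   -- one window per user and badge name
               ORDER BY Name, Id           -- ID as the tiebreaker
           ) AS IsLastRow
    FROM dbo.Badges
)
SELECT Id, UserId, Name, Date
FROM b
WHERE IsLastRow = 1;   -- keep only the most recent badge per user and name
```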
So this is the one with the nested loops, where it has a loop join – an index seek on badges for every one of those 10 million rows here. If we look at the query down here, we just scan through all the badges and sort them, obviously, by our partitioning expression and our ordering expression. So if you want to optimize this query, you would change the index on badges to match these conditions, and it would be super, super fast. At this point it’s just regular fast. And the segment and sequence project here, those are the window function – the LEAD function. And I honestly can’t say why we have a segment and a sequence project and then a segment and a window spool. In my mind, it should be just a segment and a window spool, but the point is the same.
So we’re ordering the output from badges, which we’re scanning just once – we start at the beginning of the table and just scan right through everything. This query up here ran for 30 seconds or something, and this query down here ran for 10 seconds when I tried it. So I’m running my terrible query in the background – we’ll see, let’s try the live query plan.
Brent Ozar: This is in all supported versions of SQL Server I think? This even goes back to 2008 or was this 2012 that that came out in?
Daniel Hutmacher: The window functions? I think you got the first window functions back in 2005, and you got ordered window functions – which I suppose Microsoft calls something else – in 2012. LEAD and LAG were also new in 2012.
Brent Ozar: Yes, this ran 25 seconds even with the intensive load going in the background.
Daniel Hutmacher: Yes, such a terrible load. Right, no questions, right?
Brent Ozar: No, you’re good. Tod says, “I’m literally staring at code with similar bad DATEADDs – that bit there made the whole webinar worth it for me.” Yeah, so thumbs up for him.
Daniel Hutmacher: So here’s another one. A client of mine had a table with – it wasn’t an ID column, it was kind of an accounting period column, but the thing was, they made the accounting period column a varchar. No, sorry, this is wrong. Same procedure, wrong chapter. Let’s stay with the BI thing here. We have an imaginary metric on Stack Overflow where you’re rated according to how many questions you’ve answered divided by the number of questions you’ve asked. So we’ll declare a variable called ratio, and we’ll select the number of answers and divide it by the number of questions. So far, all good. I’ll just select that ratio. But bad things happen, so we’ll put all this in a try catch block – and if you haven’t seen try catch in SQL Server, it’s pretty much the same thing as in regular procedural programming, except there’s no finally. And I never got the point of finally anyway.
So if something goes wrong between begin try and end try, the code executes the catch block here, which in this case returns a null value to the caller. And then throw – throw is like raiserror, but throw without an argument re-throws the error that happened in the try block. Because if something happens in try and you catch it, there’s no error message to the caller; you literally caught the error. So you might want to use throw, just like that, without an argument, to re-throw the error, so that if someone else is calling the stored procedure, they’ll know something went wrong. You just wanted the chance to clean something up before you re-threw that error.
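The try/catch pattern described here, as a sketch. The procedure and column names are made up, and CREATE OR ALTER needs SQL Server 2016 SP1 or later:

```sql
CREATE OR ALTER PROCEDURE dbo.GetUserRatio
    @UserId int
AS
BEGIN TRY
    DECLARE @ratio numeric(10, 2);

    -- Answers divided by questions; this divides by zero if the
    -- user has never asked a question.
    SELECT @ratio = 1.0 *
           SUM(CASE WHEN PostTypeId = 2 THEN 1 ELSE 0 END) /
           SUM(CASE WHEN PostTypeId = 1 THEN 1 ELSE 0 END)
    FROM dbo.Posts
    WHERE OwnerUserId = @UserId;

    SELECT @ratio AS Ratio;
END TRY
BEGIN CATCH
    SELECT NULL AS Ratio;  -- give the caller something to read...
    THROW;                 -- ...then re-throw the original error
END CATCH;
```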
So I’m going to create the procedure and I have a demo user called something. And we can see that this user has a ratio of 3.0. That means that this user has answered three times as many questions as he or she has actually asked, which is a nice ratio is say, this matters for some reason. Let’s try a different user. Have you guys seen the with result sets?
Brent Ozar: No.
Daniel Hutmacher: This was fairly new to me. So if you execute a procedure, I’ll just run it like this, and select ratio and I didn’t specify a column name or anything, if I add with result sets, I can say anything that comes out of this stored procedure follows this column naming and this data type. You still can’t put it in a view, but it’s useful for stored procedures where you want to change the column names I suppose and I don’t know, an SSIS job or something.
So if we run the same stored procedure on this user, we get a null, which is okay. It’s a null. But what’s really going on here? If we just run it without the WITH RESULT SETS…
Brent Ozar: The name of the column name is throw.
Daniel Hutmacher: Yeah, because I didn’t put a semicolon here. Select null as throw, right? So at the risk of sounding a bit like Aaron Bertrand: please put semicolons after each statement, because if I just put a semicolon after the null here and run this statement again, I get the ratio, null, and divide by zero. So you can actually screw things up unintentionally by leaving semicolons out. And throw made its debut in SQL Server 2012, I think.
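A minimal repro of the missing-semicolon trap; trace the parse and you can see why THROW silently becomes a column alias (THROW is not a reserved keyword, which is exactly why this can happen):

```sql
-- Buggy: no terminator after SELECT NULL, so the catch block parses
-- as "SELECT NULL AS throw" - the error is swallowed, and the result
-- is a single NULL column named "throw".
BEGIN TRY
    SELECT 1 / 0 AS Ratio;
END TRY
BEGIN CATCH
    SELECT NULL
    THROW
END CATCH;

-- Fixed: terminate the statement, and THROW re-raises the
-- divide-by-zero error after the NULL row is returned.
BEGIN TRY
    SELECT 1 / 0 AS Ratio;
END TRY
BEGIN CATCH
    SELECT NULL AS Ratio;
    THROW;
END CATCH;
```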
Brent Ozar: I think you’re right.
Daniel Hutmacher: And I’m betting they’re not going to change any of this in a way that breaks code, but rather add more features like these. I’m also thinking about common table expressions that start with WITH – Erland thinks you should always put a semicolon before throw, and people do the same before WITH. It’s an ugly pattern. Yes, it works, but I don’t think you should need it. So I’m betting that even if they don’t add anything new, terminating your statements makes code prettier and you can avoid dumb accidental mistakes like these. So write pretty code, put semicolons on the end of everything.
I absolutely adore that they brought the try catch block to T-SQL. That has saved my ass on countless occasions. And the re-throw is really, really pretty. It’s a handful, though – some of my clients build stored procedures that call stored procedures that call stored procedures, and you nest everything, and sometimes they even declare a temp table in the top procedure and then build a whole stack that depends on that temp table existing, which makes it hopeless. But yeah, throwing and re-throwing and re-throwing, it’s a handful. So design it correctly from the beginning and you’ll be fine.
Brent Ozar: Make sure your code never throws an error. Never has an error, then you’re good.
Daniel Hutmacher: My kind of code, actually. My kind of code. I like it. Right, same procedure – I’ve added all the semicolons. You will notice that you can’t add a semicolon after END TRY; I figured this out the hard way. Let’s try this: incorrect syntax near semicolon. Bonus tip – did you know you can double-click the error message and go to the row where the error occurred? So apparently, END TRY BEGIN CATCH is a single statement in T-SQL, because if you add a semicolon there, it won’t work. So I’ve created the same procedure with the semicolons and everything, suffixed with underscore problem, because I’m querying a different table. I’ve made a copy of the posts table that I call problem posts, and I did it in a database that I call Stack Overflow mini, which contains only a fraction of the number of rows.
For this one, I’m going to turn on statistics, because I’m not going to sit here with a stopwatch timing performance. So if I run the regular query first, this one does eight logical reads, two scans, which makes sense – one scan for this, one scan for this. It seeks, because post type ID is the first column in the index and user ID is the second column. So if we look at the query plan, this is a really, really efficient query plan.
Brent Ozar: It looks big but it’s fast.
Daniel Hutmacher: This is actually the magic. So we do one index seek on answers, one on questions, aggregate them because this is the count of how many questions and answers you have, and then we just – because they’re one row each here. One row there, one row there, divide one by the other and select the output. This is a super fast query.
What happens if we do the same thing with our problem query, where I’ve changed the data type on post type ID to char(1)? Because someone figured out that post type ID is always a single-digit value, right? Why waste an integer, which takes four bytes, when you can have a single-character column, which takes one byte to store, right? So if we look at the table…
Brent Ozar: Who does this?
Daniel Hutmacher: So we have post type ID, character one. This is pretty common, I think. Or maybe it’s just that I work in Sweden, I don’t know. We have polar bears too. Praise the lord. Now, if I run this query – it’s three seconds, it’s four seconds, and that’s because we did 96,000 logical reads as opposed to just eight before. We did 18 scan counts, and that’s probably because of parallelism. So it took somewhere in the range of four seconds to complete the same query on a different table.
Now, those of you who know about sargability already know the answer to this one. If you do an implicit conversion between a character and an integer, the integer wins: you always convert the varchar or char value to an integer. And in this case, that means you can’t perform an index seek, because you have to take every row in the table and check – does this convert to one, does this convert to one, does this convert to one? And that takes a long time in a big table.
Brent Ozar: The character wins and we all lose.
Daniel Hutmacher: If you run into this in the real world, rather than perhaps rebuilding post type ID and crashing every application every dev ever built, you can just add apostrophes here.
Brent Ozar: And Hiram asks too, when you get a second, show how you quickly pulled up the table definition.
Daniel Hutmacher: Oh – as standard in SQL Server, you can press Alt+F1, and Alt+F1 will run a stored procedure called sp_help. All this stuff comes out of the box with Management Studio and SQL Server, just the way you’re used to. I built a replacement for this called sp_ctrl3. It’s a stored procedure that works pretty much like sp_help, but it gives more detail. So I can copy index definitions, and it says nvarchar(100) instead of – see, the nvarchar here has a length of 200 – and it says computed: no. I found this stuff pretty hard to copy-paste when I wanted to recreate tables. So if you’re interested in the sp_ctrl3 stored procedure, it’s on my blog, SQLSunday.com – look for downloads. But this is my homemade tool, because I didn’t think sp_help was good enough. You can just copy all of this, put it in a new window, and you have the create table syntax ready to go. I never use other people’s tools ever. I rebuild everything from scratch. This is my personality.
So I added quotes, so this is now a char/varchar/I-don’t-care, but the point is we don’t have any implicit conversions anymore, and when I look at the query plan now, there’s an index seek here, and it doesn’t even go parallel. It finishes in eight milliseconds, which is good enough considering it had to build a plan. Eight logical reads, two scan counts – just the way we want it. So for the love of all that is holy, avoid implicit data type conversions. I’m okay with funky data types. I’m even okay with having transaction periods in a char column or an int column or whatever floats your boat, but avoid the implicit conversions, because they will screw up your sargability, which is a killer when you have a large table.
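The fix being applied is just matching the literal to the column’s type (a sketch; the _problem table name follows the demo’s naming):

```sql
-- PostTypeId here is char(1). Comparing it to an integer makes SQL
-- Server convert the COLUMN (int outranks char in data type
-- precedence), so the predicate can't use an index seek:
SELECT COUNT(*) FROM dbo.Posts_problem WHERE PostTypeId = 2;

-- Quoting the literal keeps the comparison char-to-char, and the
-- index seek comes back:
SELECT COUNT(*) FROM dbo.Posts_problem WHERE PostTypeId = '2';
```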
Alright, so this one’s still running. It’s one hour now – I started it a little ahead of the show, so this thing isn’t going anywhere. If we look at the CPU, it’s still maxed; the memory is still maxed out. This query is going absolutely nowhere, so I’m going to stop it here. This is the query we were looking at – I put DNF, did not finish, here – and what we’re trying to accomplish is, for each user, I want to see if they have asked a question, meaning post type ID one, that hasn’t been answered within 24 hours. It’s not really a trivial question. So I want to see how many questions each user has asked that haven’t been answered within 24 hours. I’m basically looking for post type ID one as the question, where the ID of the question is not in the parent ID of any answer whose creation date is less than or equal to 24 hours after the question’s creation date.
So this is probably a standard report that some manager wanted you to write. Or the developer just showed up at your doorstep and said, my query’s slow, and he drank all your Snapple and left. So if we look at this query, here’s what our query plan looks like, which is strange, right? Because we have a merge join between users and posts, which is fine, but the problem here is that I’m doing a nested loop join on the posts, and this post here – you can see the scalar operator two at the bottom, that’s the answers. So when we look for answers, we use a nested loop join. And I’m going to be completely honest, the query plan isn’t exactly the one I prepared. I was going to demonstrate another problem, which is this: if you look at the table dbo.posts, the parent ID column points from the answer back to the question it actually answers, and you’ll notice that this column is nullable. You can put a null value in parent ID. So if you write something NOT IN a nullable column, SQL Server has to verify whether there is a single null value anywhere, because x = NULL is never true, so a NOT IN over a set that contains a null can never return true. You’re just making it harder for SQL Server, because it has to do this extra check. Let me see if I can fix this – it should have done a check for null values, but I’ll leave it at that.
Anyway, if you suspect this – if you do NOT IN and you select from a nullable column – you can just add “and parent ID is not null”. That’ll simplify it. That’s the obvious choice if you know there are no null values, which is the case here, because when there’s no parent, the developers of this database put a zero in. So in this case, I know from experience, or because I built the web app or whatever, that there will never be a null value – it’s just an incorrect definition of the column. Let’s say you can’t change the definition. You can just tell it that parent ID is not null. And if we run this query, let’s see what it does to the CPU. We’re rocking.
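The workaround described here, sketched against dbo.Posts; the exact report shape is an approximation of the query on screen:

```sql
-- Questions (PostTypeId = 1) that got no answer within 24 hours.
-- ParentId is declared nullable, so the extra "IS NOT NULL" predicate
-- tells SQL Server it doesn't have to prove the absence of NULLs on
-- every probe - safe here, because the column never actually is NULL.
SELECT q.OwnerUserId, COUNT(*) AS UnansweredWithin24h
FROM dbo.Posts AS q
WHERE q.PostTypeId = 1
  AND q.Id NOT IN (
        SELECT a.ParentId
        FROM dbo.Posts AS a
        WHERE a.PostTypeId = 2
          AND a.CreationDate <= DATEADD(HOUR, 24, q.CreationDate)
          AND a.ParentId IS NOT NULL
      )
GROUP BY q.OwnerUserId;
```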
Brent Ozar: And it’s especially useful in third party databases where you can’t control the schema.
Daniel Hutmacher: Yeah. So this is the same query that ran for the entire session, for around an hour, and didn’t finish – and basically what it’s doing right now is just hammering my local CPU and my bandwidth, because it’s like a million rows, but it returned instantly. I blew it up, sorry. So to recap: if you use NOT IN with a nullable column, it has to check whether any of the values is null. That check is fine and dandy if it happens just once, but because we did a less-than-or-equal, it has to do this check every single time, which really screws it up for us.
Brent Ozar: Nicholas says over in the Q&A, “That’s cool, thank you.”
Daniel Hutmacher: Thank you. So at 46 minutes, that pretty much wraps up my session. I'm open to questions for another 14 minutes or whatever. Also, please leave feedback. I created a Google form just for you, dear viewer, that's you. Please, because I'm booked for this session two more times this autumn and I'd like to get it just perfect. So let me know how it went.
Brent Ozar: So let’s see, very nice job too. Excellent. Man, some of the date stuff was just awesome. On that last query, CT asks, “What happens if we use not exists instead of not in?”
Daniel Hutmacher: Interesting. Let's make a copy of this one. This stuff always gets me. WHERE NOT EXISTS, and t.Id equals ParentId, right?
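The NOT EXISTS rewrite being typed here could look something like this (a hypothetical sketch; the surrounding report query is assumed, not shown in the transcript):

```sql
-- NOT EXISTS is null-safe: a NULL t.ParentId simply never matches
-- p.Id in the equality comparison, so the optimizer needs no extra
-- null-check, unlike the NOT IN version.
SELECT u.DisplayName, COUNT(*) AS UnansweredQuestions
FROM dbo.Users AS u
JOIN dbo.Posts AS p ON p.OwnerUserId = u.Id
WHERE p.PostTypeId = 1          -- 1 = question
  AND NOT EXISTS (
        SELECT 1
        FROM dbo.Posts AS t     -- answers pointing back at this question
        WHERE t.ParentId = p.Id
      )
GROUP BY u.DisplayName;
```

With no NULLs present, this returns the same rows as the NOT IN variants; the plans and costs can differ, which is what gets compared below.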
Brent Ozar: I think you’re right.
Daniel Hutmacher: Smooth.
Brent Ozar: It looks good. It looks good.
Daniel Hutmacher: Looks decent enough, because let’s look at the original query. The cost of the original query was – I’m not going to go there. It’s like, I have to count two zeros. 251 million query bucks as opposed to my suggestion here with the is not null, which clocked in at 299 and this one is a promising 327.
Brent Ozar: Not bad.
Daniel Hutmacher: Let’s see how it goes.
Brent Ozar: Fingers crossed. I like it.
Daniel Hutmacher: You’ll see my CPU stack down there, it’s working very hard right now. Yeah, so I think they’re about even.
Brent Ozar: Nice. And then Vasel says…
Daniel Hutmacher: Implicitly, if I'm making an equality comparison to a nullable column, that automatically excludes all the null values. So you cannot join on something and expect null values to come out. So this problem is really isolated to NOT IN. You were saying, sorry.
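The point about equality comparisons can be shown in isolation. This little sketch is self-contained (no Stack Overflow tables needed) and just demonstrates the three-valued logic involved:

```sql
-- NULL = anything evaluates to UNKNOWN, never TRUE, so equality
-- predicates (including join conditions) silently drop NULL rows.
SELECT CASE WHEN NULL = 1 THEN 'matched' ELSE 'not matched' END;
-- returns 'not matched'

-- The same rule is why a join on a nullable column excludes NULLs:
-- rows where ParentId IS NULL never satisfy ParentId = p.Id, so only
-- NOT IN has the null trap.
```

That is the whole reason the trap is isolated to NOT IN: there, a single NULL in the list poisons the entire predicate instead of just dropping one row.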
Brent Ozar: No, it was the next question was Vasel asks, “Could we split the post table into two different tables, questions and answers, then the questions don’t need parent ID and the answers will have a parent ID as not nullable?” I would say do a select from post types – do you have the post types table?
Daniel Hutmacher: Yeah, I think post types is one for questions and two for answers.
Brent Ozar: Yeah, so the problem is if you wanted to split this up into different types, you’d end up building like, seven, eight different tables and the number of tables just keeps growing as soon as – and wouldn’t it be nice for you to have just one table for everything? Okay, yes.
Daniel Hutmacher: But I mean, technically I think for posts, that’s – I think parent ID is zero, so in practice, they’re using zero as a fake null here. So we know that there are never any null values. It’s just nullable by definition because that’s how it was created once back in the 2000s.
Brent Ozar: It also makes comments a little bit easier, so you can leave comments on either a question or an answer, and so there's just one comments table with a post ID attached to it that joins to the posts table. Yeah, there's pros and cons of all those.
Daniel Hutmacher: Pretty sure one of our viewers designed this database so I’m just going to keep it down low.
Brent Ozar: When the original one started, I think it was Jeff Atwood and Sam Saffron, or Geoff Dalgas, who designed it? And the thing that always blows people away is looking at the Tags column, so the Tags column on there in the output. On Stack Overflow, you can tag questions with up to five tags, so C#, .NET, datetime, et cetera, and you look at that one field with five tags in it and the database people in the house are scratching their heads going, this should never happen, and that's exactly how it went live. Databases aren't perfect when they first go live.
Daniel Hutmacher: Yeah, but you get better performance actually. Well, no, maybe you don’t if you want to search by tag. No, you don’t.
Brent Ozar: No, it was terrible. That was the reason why we ended up having to separate out the tags into a separate table. Geoff was like – and people immediately look at that and go, can't you use full-text search to find any question with, say, the WinForms tag? But with tags like C# or .NET, full-text just doesn't work as well there. Plus there are tags with versions, like – or CSS 3, .NET 3.5, it's just tough. Alright, well, lots of people saying thank you, great presentation over in the chat. Really nice job.