Database Administrators or Developers who want to create useful and thoughtfully designed indexes
Creating indexes is one of the most powerful ways to improve performance in your SQL Servers. It’s not easy to form indexes for a large and complicated query, and things only get more complicated when queries can come from vendor applications, ORM tools like LINQ, or any number of other sources.
When it’s time to create the perfect index, there’s guidelines and strategies to get it right, and then prove that it’s helping. However, performance is not always about just one index, and general index strategy can also achieve great results. We’ll focus on how to get the perfect index or indexes for your workload.
Enjoy the Podcast?
Why I Want to Present This Session:
I love the power of a good index in improving performance, but I see a lot of controversy over what the right index is.
Arthur Daniels: Alright, so this is my presentation, The Perfect Index, and to prepare ahead of time I put all the slides and everything that I’ll be talking about on my website. I’ve got this link here, dba-art.com/groupby-index. I’ll have this link up a couple times. You don’t need to write it down right now. We talked a little bit about myself before, but I wanted to mention more about myself.
So my name is Arthur Daniels. The easiest way to reach me is on Twitter. I have Twitter as @arthurdanSQL. Right now I work as a database administrator and besides the normal database administration duties, backups, restores, automation, stuff like that, what I really enjoy doing, what I get to spend time doing is tuning SQL Server and tuning queries in SQL Server and tuning indexes in SQL Server.
So if you go and you download the slide deck, what I intended to do with this deck is I’m trying to give the information that you’d need if you had a query and you were thinking to yourself, “This query could use some tuning,” or, “I can’t tune it at all, but I need to make sure that I have the right indexes in place to make this query really blazingly fast.” And I wanted to give you the information so you can make that decision. Because a lot of times I’ll see developers or people who are stuck with a slow query, maybe it’s a report writer, and they’ll go, “Can an index be added here?” Maybe they’ll see a missing index hint and they’ll go, “Well, would an index make this page faster? Would it make this query faster?” So hopefully by the end of the presentation, I’ve answered some of those questions. I want to go over how you decide if you want to create an index and I want to talk about how to decide if the index is right.
So when I called this presentation The Perfect Index, what I wanted to talk about was the idea of an index being perfect for the situation you’re in right now. I wanted to talk about the current query you’re looking at, sometimes that query can change, but I wanted to give you the tools and guidelines you’d need if you’re going to be creating an index so you can judge for yourself, is this the right index for my scenario.
So when I’m talking about indexes, what I wanted to talk about when it comes to indexes or what are called non-clustered indexes in SQL Server. So on the left here you’ve got your table, which is where all your data is stored. It’s all stored on these data pages in SQL Server. I want to focus on the non-clustered indexes. So these are copies of the data. We define what data we want to copy over into our non-clustered indexes. I want to tell you how to know what data to copy over, what will make your query really, really fast. And how do you decide, maybe some columns don’t belong there and how do you make that judgment call?
So let’s talk about indexing. And in this session, I’ve kind of split it up into three parts. A lot of what I’ll focus on will be things that can make your index tuning easier. So to do that, we’re going to work through some examples of indexing simple queries. Now, some of these queries are going to be really overly simplified, but I tried to pick examples of some more complicated queries that you can see when you’re looking at SQL Server so you understand how different tables or different indexes interact when you’re querying a large complicated query inside your SQL Server. Then I’ll be talking about creating the right index when you have optional parameters. So I see this more and more. Either someone’s got dynamic SQL that they’ve written where sometimes the where clause changes, or they’re dealing with this scenario like they’ve got link to SQL or another ORM tool that generates dynamic SQL from the application, and those scenarios, what columns are being queried, what columns are being returned can change dramatically, so I’ll talk about how to index for those scenarios. And then finally, I’ll talk about reasons to not index columns. So in the start I’m giving you a lot of reasons to index, and then I kind of want to walk it back and give you reasons maybe we shouldn’t be indexing. Maybe indexing is not the right answer, or there’s not really a good solution for the scenario you’re in.
So let’s get started by talking about the create non-clustered index statement. So we’ve got our table and it’s got all the columns inside our table, and we want to create a non-clustered index on top of that, and this is just the syntax. I don’t want to spend too much time on syntax. What I want to point out to you are the areas that we can define our non-clustered index. So you have the create non-clustered index, you give it a name, you define the table you want to create your non-clustered index on. Then you define what are called the key columns and then you include what are called the included columns. These are the two most important parts, and they have some very significant differences, so where you put the column in your index matters a lot, especially if your query will be changing, and we’ll talk about that, why you might want to change where a column is, whether it’s inside the included columns or inside the key columns.
So if you wanted to get started and you want to create one of these indexes, you should really start with a query that you come from, the query that made you think that things are slow, that you need to create an index for your query. What you need to do then is you need to assess, hey, what columns is my query requesting, what is it filtering on, and what we’re going to use is the term predicates a lot of the time. Usually, that’s just your where clause, your joins, things that can eliminate or exclude data or include data from your query. We’re going to focus on what filters are being applied and then determine what data is needed to return to the application. That’s how we can determine what will go in our key and included columns by reviewing what the query is really asking for.
So the example I’ll be using today when we’re talking about these key and included columns is I’m using stuff from the Stack Overflow database. Now, I’m only going to be using two tables to keep things really simple, but these tables have a lot of columns in them, so I’m selecting to show you some of the bare essentials, some of the things that you really need if you’re going to understand this Stack Overflow schema. So when you’re a user and you create an account in Stack Overflow, you’re creating a new record in the users table. Then you decide on your display name, you type in the location, and when you start out in Stack Overflow, you have very low reputation. But what happens is as you’re posting things to Stack Overflow, as you’re answering questions or giving questions yourself, you’re asking them, you’re going to accumulate reputation. Well, that reputation comes from the posts table where you’re posting these questions and answers. And each post has a score. So if you wanted to know, let’s say from me, if I wanted to look at my total reputation and then look at the table where the reputation came from, look at those posts, I would be going from the users table and I would join on the owner user ID. That’s the owner of the post and I would look for the score of that post to see how I scored on that post, whether I was downvoted or upvoted, what people thought of my answers.
So I’ll show you some examples with these but I wanted to make it clear that this schema, you can query between the users and posts table and you can join between those on different columns. What I want to focus on is joining between the users ID to the owner user ID. We’re finding the owners of the posts. So let’s start looking at queries. Let’s look at an example of a problem query. And what I’ve done here is pretty much what I described in the last slide. I’ve got a query, I’m looking at the users table where I’m retrieving the reputation, I’m looking at the posts table where I’m retrieving the score of the post, I’m joining the two together on the owner user ID and I’m going to be filtering for this user. So if you don’t create a display name in Stack Overflow you’re kind of given a random username, it starts with user and then it’s got some numbers behind it. I picked this specific user because I know about the information behind here. I know how many posts this user has and I know the reputation so I could create this example query. But really, this could be for any parameter. You could change the display name. We’ll just use that for our example because we want to assess this query. We want to decide, are the existing indexes good enough if there are any, and how do we create good indexes for this query?
So let’s talk about that. Now that we’ve kind of taken a look at the query, let’s go back to the key columns and let’s talk about how you could decide your key columns for this query. What you want to keep in mind when you’re creating your key columns is that key columns are sorted. They’re sorted first by the first key column, then by the second key column, then by the third key column and so and so forth. In that way, they’re almost layered. If you’re going to be accessing this index, you need to access the first column of the key columns otherwise you can’t seek to that index. You’re going to have to scan the other key columns in the index. And I’ll show you an example of that later. Right now we’ll just focus on having a single key column, and that comes into play when we’re talking about the last part of this slide. We’re talking about making data access easy. When you want to make data access easy, you want to index the predicates that we’re looking at, the filters that we’re filtering our query on.
Now, included columns can cover the needs of our select statement. They can do some other things that I’ll talk about later but for this example, let’s just use the included columns to meet the requirements of the select query we’re running. So let’s take a look here at the two parts of this query. I’ve split it up into these two logical parts. Starting on the left side where we’re looking at the reputation from the users table, where the user’s display name is equal to that user number that I showed earlier. The second logical part is then joining on that owner user ID to the posts table and retrieving the score of all the posts from this user.
So when we’re looking at this, we can’t just create one index because an index can only be on a single table. We really need two indexes to solve this problem so let’s take a look at how these queries run in our demo and we’ll talk about this and we’ll show how these queries are retrieving the data, how the predicates are being applied and then we can determine if we do a good or bad job indexing for this query. So let’s talk about how we can assess that and we can assess it using two main tools. I’m going to pick on statistics I/O. So this tool shows us the number of data pages that had to be read for SQL to retrieve the query. That’s important for us because more data pages is a lot more work that SQL Server has to retrieve. It’s more information being processed to return our query results. What we need to use in coordination with that is we need to look at the execution plan. The execution plan will be able to show us, well, what’s going on when it had to do those data page reads. What are the statistics I/O actually talking about within SQL Server?
So let’s get started with demos. I really want to start looking at this data. So let’s run some demos and we’ll take a look at this. So the first thing we need to do is we’ve got the Stack Overflow schema so we’ll go in here and we’ll look at the actual execution plan from Stack Overflow and we’ll enable the statistics I/O, and we’ll go over here to the top and we’ll include the actual execution plan. So let’s run this query and see what results we get back with no indexes on the table whatsoever. And while we do that, let’s split the query up into the two logical parts and see the results that we get from that.
So if we go in here and we look at the first logical part, we’ve removed the part where we’re doing the join, we’ve removed the score, we just want the reputation from the user. When we run this query, we get the user’s reputation back and actually I need to include the execution plan here as well so we’ll start off and we’ll look at the messages tab to see how many pages had to be read to retrieve this data. So it’s telling us it’s 7345 data pages that had to be read. And I’ll copy that out here. And let’s look at the execution plan to show us what predicates had to be used to retrieve that data for us. So if we go to the execution plan and we look at this index seek, we can go down here on the index seek and look at the predicate that was applied. What I want you to pay attention to is what I’m highlighting here. We’re looking at the Stack Overflow database, we’re looking at this display name predicate. So this is kind of the index that we want to create. This will be our key column because this is the predicate that SQL Server applied, we know that when we go back to create an index, this needs to be a key column that we’re going to use. Then when we look at the bottom here and we see the output list, we want to output the reputation. So this will be our included column, so we’ll just create a very simple index to meet the needs of this query. Because we’ve ran the query without any indexes, we know what logic it wants to apply, we’ve split it into smaller chunks so we can understand what’s happening in the query easier, so let’s go back and create that index and see the performance improvements that we got.
So if we scroll down here, I’m just going to create a simple index on display name as the key column over here on the left and then reputation as the included column. So we’ll run that query and now that that’s completed, let’s rerun the exact same part of the query. This time, we go to the message tab, what I want you to see here is there’s three logical reads happening within SQL Server, the part I’ve highlighted down here at the bottom. So that’s immediately an improvement from 7000 to three. But let’s take a look at the execution plan to see what happened. Now that we’ve got our new index, SQL Server did an index seek and if we look at what’s happening inside this index seek, if we go down to the bottom, we can see that it now has what’s called a seek predicate. So beforehand, and I’ll show you this in another example, but beforehand, it was applying a predicate to the index scan. Now it’s doing a seek predicate, so within this index seek, it was able to seek to the value of display name. And as expected, we have the output of reputation right above that. So this was generally a really efficient index. This was great.
Now let’s go back to our main query and we’ll see what happens if we add in the second part of the join, the second logical step that needs to be done. So I haven’t changed anything, I just have this single index, and let’s see what the performance looks like when we run this query. So if we run this query and we look at the messages, this time we have a lot difference logical reads. We have new logical reads coming in on the posts table for 800,000 logical reads. Now, the users table is still done efficiently. Now, to understand this fuller, we have to go into the execution plan. We have to see what actually happened but just right off the bat, we know that we’re still probably using the right index because of this low number of logical reads but there’s still a problem with the posts table. So there’s still work to be done. So let’s take a look at this execution plan and we’ll go in here and let’s ignore the missing index recommendation completely, and we’ll just look at this index scan on the posts table. So what it’s telling us down here is it had to look through the clustered index, unfortunately. It’s telling us, hey, it applied a predicate and this predicate is the owner user ID. But there was no index or it shows not to use an index, it could be either one. So there’s ways to improve this query with indexing. If it had to go to the clustered index, we can try to create an index and see what happens with the results.
So let’s take a look at this. Let’s create a non-clustered index on posts owner user ID and include the score column because that’s the one we’re selecting. So we run this query and we create the index and then we’ll go run our query again. This is the same full simple query that we had from the start and when we run our query and we look at the reads, all of a sudden, everything’s back to normal or back to a good state. We’ve got only three logical reads for both scenarios, so judging by this, I would say our execution plan has changed. So let’s go in there and let’s check out that as well. Let’s look at the execution plan and look at this index seek that occurred and down here on the posts table, now we have a new seek predicate. So not only is it telling us that it used this object up here at the top, it’s saying it used the index that I created but it also seeked on owner user ID.
So we’ve managed to index this query. Now we’re down to a total of six logical data page reads, just meaning we had to read six 8KB pages on SQL Server. Let’s take a look at what happened to the original query since that’s probably completed by now. Let’s go over to this tab and it looks like it took about a minute to run. I’m looking in the bottom right here. This was before I created any indexes. And before I created any indexes, it had to do the summary of both logical parts. It had to do 7000 reads on the users table, it had to do 800,000 reads on the posts table and knowing that there’s no indexes at this time, if we go to our execution plan, what we can probably predict happened is on both examples it had to do an index scan on the users table over here on the left it did an index scan and then below it it did an index scan on the posts table. So this is the value in indexing these two columns together. And we want to look for things like logical predicates like when the join occurs, joins have the potential to only include the rows that are specified in the join. So we had to eliminate all the other rows or look for just the rows with this owner user ID. That’s why when we provided an index on that, it was a lot more efficient and it was able to run a lot faster.
So let’s go back and let’s take a look at this example and then we’ll work with a more complicated example. So we use statistics I/O, we use the execution plan to look at this query. What happens when there’s an additional where clause? And actually, before I go into this next part, I want to ask if there’s any questions about using these two tools because I’m going to rely a lot on statistics I/O or execution plans. Are there any questions over in Slack, Daniel?
Daniel Hutmacher: I don’t think so. A thought that I had was about key lookups, but I’m assuming you’ll get to those later.
Arthur Daniels: Yeah, feel free to go ahead and ask it and I can tell you if it’s coming up or we can include it.
Daniel Hutmacher: Yeah, so I noticed it uses the clustered index instead of your index because your index didn’t cover all the columns you needed and sometimes SQL Server does a key lookup and I was just thinking if you wanted to elaborate on that.
Arthur Daniels: Yeah, absolutely. So in this scenario, if we go back to this example, and if I had – so this was the query before I had created any indexing. If we go over here and we look at the query where we’re indexing, SQL Server was able to do index seeks and retrieve all the needed information. But let’s say I request something else from the posts table. For this example, I’ll just request the body of the post, which is the text that you would get when someone creates a post. So looking here we have this body that’s all the post’s actual information and if we go to the execution plan, because this isn’t an included column and it’s not included in my index that I created, we now have what’s called a key lookup. So what happens in a key lookup is SQL Server takes information from this non-clustered index that I created. It uses that to go back to the base table and say, hey, I have this identifier for this row, but I need to know what the information for the body of the post is. So at this point, it has to retrieve this output list of the body. And if you look here, the seek key is actually the clustering index key itself. So this is one example of a key lookup and I’ll show you an example actually a little later on where other predicates other than just the clustering index key are applied in this key lookup, but that’s a great question. I don’t actually specifically cover what happens if you don’t include columns and this is what will happen.
Daniel Hutmacher: There is actually a question from Amy. The question is, “So it’s okay to create a non-clustered index on the same column as the clustered index?”
Arthur Daniels: That’s right. That’s absolutely right. And because it’s going to be subset of data, it’s actually going to be smaller than the entire index so as long as you define your key and included columns and don’t include every single column in the table, a good example of that is let’s say we wanted to count all the IDs in a table, and that’s the clustering index key, what I would think would happen is if you create that second index on the clustering index key, since it’s a smaller subset of data, it will have to read less pages and have to do less work to get the results you want. You can create a non-clustered index with any key columns and any included columns within – there’s some limitations that I’ll talk on later on – but within reason, you can create almost anything.
And I wanted to show here really quickly, when we did the key lookup, we went from reading three pages to reading 215. We had to retrieve additional data when that key lookup occurred. I’ll show examples of key lookups a little later on, but I didn’t cover just selecting a value that wasn’t in the index. So thank you.
Alright, so let’s add a wrinkle to it. We’ve looked at it just very basic but what happens if we’ve got a dynamic query? And for some reason, when I was looking for the owner user ID and the display name, someone wanted to filter on and they only wanted to look at the questions. So what this means is we need to filter on the post type ID of post type ID one. Now, there’s a table in Stack Overflow. It’s a list of the post types, there’s about five post types. But for the sake of simplicity, we’re just going to stick to these two tables and this time we want to filter on the post type ID. So let’s talk about how we can index for this scenario where there’s kind of an optional parameter that someone’s adding in where occasionally they just want questions but occasionally they want all the answers.
So let’s go back to our demos and we’ll take a look at this, and we’ll leave the existing indexes in place for this demo. For some of them I’ll be dropping them. But for this example, let’s leave it in place and let’s run this query that I just showed you. We’ll retrieve the actual execution plan and we’re filtering on post type ID. So when we run our query and we go to our messages tab first, we can see the pages read, and we’re looking at that 215 pages read again. Now, the users table wasn’t impacted by this but if we look at the execution plan to see where those extra 215 page reads come from, we can see that information. So we have the two index seeks that we expected to see and then down here at the bottom, at the key lookup, what we see is there’s actually a non-seek predicate. What I was just talking about inside a key lookup is there’s actually an additional predicate being applied. Once it’s found those IDs, so it does this seek first and it finds the posts that match the IDs that you’re looking for, but it’s found more than it needs. And we can tell that because the number of rows read up here is 53 but the number of rows returned, the actual number of rows is 48. So given some basic math, that’s five other posts that we didn’t read. Well, those are either answers to other people’s posts or things that users posted that aren’t questions or answers.
So what we need to do in this scenario is we need to consider how can we remove this predicate from the key lookup because it’s obviously increased our page reads by requesting 215 pages rather than three, we’re doing extra work. And if we were scaling up our data over time, if we just kept adding pages, you know, a 215 to three ratio is really bad. The more pages we get in that table, the more work is going to have to be done. So let’s create a covering index for that scenario. Let’s cover this post type ID as well. Now, what you could do is you could try either order here. You could try with post type ID first or you could try with owner user ID first. Now, in my index here, I’m selecting owner user ID first and then I’m doing post type ID. The reason I’m doing that is this execution plan started at the users table and it found just the row for that one user. Then it went to the posts table where it retrieved the information based on that owner user ID and it limited it down to 53 rows. So I want to search after I’ve done that initial elimination down to 53, and this is kind of using selectivity, which is a concept I’ll talk about in a little bit. But try it out either way on your own systems. Try it out and see which one is being used first, or which ones you need to use more efficiently.
What you can also think about when you’re indexing this scenario is if one of them is an optional parameter, should it go first in the key values or second. And I’ll show that in a really good example later on, but for now I just want to create this index on the posts table, so we’ll give that a second to complete. And once that’s done, we’ll run our query again. And we want to make sure we have our actual execution plan on. We’ll run our query and we’ll look immediately at the reads. So we’re back to that good number of reads. So if this was a scenario where post type ID was an optional parameter or dynamic, I would want it second, after the parameters that I can rely on, which was joining the table. And we’re seeing that immediately improved inside our statistics I/O. But let’s take a look at the execution plan to make sure that we got rid of that key lookup, we got rid of that extra work. We go back to our execution plan and we look at the host index seek, and we go down here to the bottom. What I want to show you is the seek predicates. The seek predicate on owner user ID and on post type ID were applied to sort through this data. So this is where reading the query and understanding what it’s going to be requesting really helps, and it helps us understand what order to put the key columns in and why to have columns inside the key columns of our index.
So let’s talk about some other problem queries and we’ll go back, but if anyone has any questions on creating this scenario where you have to have multiple filters and one filter is in your join, another filter is in your where, which is a scenario I see fairly often or any other questions, please feel free to ask.
Daniel Hutmacher: Basel says, “Note, we can also use where (filtering predicate) in non-clustered index for columns with small amount of distinct values.” So I’m guessing he’s on to selectivity, which you’re going to talk about later. Like post type ID in this example.
Arthur Daniels: Yeah, that’s great. The other thing I want to know is if he’s talking about filtered indexes. because if he’s talking about a filtered index, that can add some additional wrinkles and because of their complication I’m not going to cover filtered indexes today but if we have some extra time at the end that’d be a great time to talk about filtered indexes and how they play into your general indexing strategy here.
Daniel Hutmacher: Also, Matt says…
Arthur Daniels: Go ahead.
Daniel Hutmacher: Matt says, “Is there guidance for number of pages we should shoot for? Low number ratio, et cetera?”
Arthur Daniels: That’s a great question. This is a hard one to answer without seeing your environment. Now, in this example, I’m running with a very small database. I think it’s under 10GB. So the pages that I want to see read are very low. But it depends on the importance of the query, so think about does this query need to run a lot, does it need to run very frequently, and think about how many pages are in the total index. So in the first example, I was showing that it had 7000 page reads on the users table. If you had a table that had millions of pages, maybe a thousand reads isn’t such a big deal. So it depends on the scale of data you’re talking about and the important of the query to determine how many page reads is a good number for your index. Anything else?
Daniel Hutmacher: We’re all caught up.
Arthur Daniels: Alright, great. So let’s talk about another problem query. So we’ve kind of covered – what I’ve tried to cover so far is creating a covering index where you know the predicates and where they come in a certain order that you can split out logically. In this scenario, I want to talk out a problem query where all the values, there’s no joining to another table, we’re just looking at the users table and we’re looking for where it equals all the values. We’re looking for the reputation of all users who live in the United States who are 34 years old and have the display name Chris. Now, we want to be careful with this one because when we talk about the display name Chris, as a display name, it’s not unique. There can be more than one value for the user Chris. So how do we decide what order to create our indexes in for this? How do we decide the right key order for this index?
And we need to understand selectivity to do that. The secret to picking these column orders is the idea of selectivity. So what we need to know when something is selective is that unique on every single row. A good example of that is an ID or an internal account number. Hopefully we aren’t storing real people’s like, financial account numbers, but hopefully it’s like, an account number for our system or a display name, something like that where it’s unique on every single row is a great example of a highly selective value. When SQL Server looks for this value, it can only find one or very few rows. Examples of things that aren’t selective on the other end in the spectrum, well, United States. I was filtering for users that said they lived in the United States. There’s over 300 million people living in the United States. Now, they don’t all have Stack Overflow accounts, but they could. So that’s very lowly selective, or not very selective at all. Same with state, or in the example I was showing, post type. A good way of thinking about it is if multiple people can identify with a column, if multiple people belong to the same company, belong to the same post type, that’s not very selective. But here’s the wrinkle on this: if you have a query that’s changing, you have to be careful with picking the key column order on the rows that you can guarantee to be in the query. And I’ll show you an example of where that can go wrong, but first let’s create an index for our query. Let’s create an index in order of selective values.
So let’s go into a demo here and look at these queries and look at the data I’m looking at with location, age, and display name. So to do that, what I’ve done is I’ve counted the distinct values for us. Let’s look at the distinct values for the age, the location, and the display name that we’re looking for in the users table. So there’s 86 distinct ages. That’s pretty good, the high end. I think the longest a human’s lived is maybe 120, maybe 130 years. So unless they’re counting half years or decimals, as long as these are integers, I would expect to have at most, 120 distinct values for age. When it comes to location, because anyone can fill in any location they want in Stack Overflow, this has the potential to be infinite. But in the static database that I have where everything’s frozen and there’s no one creating new accounts daily, there’s only 11,000 unique locations down here at the bottom. What about display names? Well, even though you can have duplicate display names, when we look at unique display names, we actually have 200,000 of these. So these would go in order of selectivity. Age is not very selective location is very selective, and display name is incredibly selective. It’s a great first key column value.
So let’s take a look at a query. Let’s go back to that problem query and let’s try to create an index for it. So if we start here at the top and we run our query, let me drop my indexes first to make sure we have nothing interfering. Alright, great, so we’ll run this query and let’s look for this location, age, and display name conditions. And actually I’ll include the execution plan so because there’s multiple display names for Chris, we have three values here with three reputations. And let’s look at the messages tab. Let’s see how much work it had to do. Well, with no indexes on the table, with just the clustered index, we have 7000 total page reads. That’s a lot of pages read, that’s the entire index. So when we go over to this execution plan, as expected, we have a clustered index scan but if we didn’t know the predicates in our query, if this was a more complicated query, running this even without creating any indexes helps us understand what is SQL Server searching for. And we can do that when we look in the execution plans at the predicates. We’ve got the age, the location, and the display name as the predicates, so we know we need an index on these and we just learned the concept of selectivity.
So let’s go in here and let’s select an index with display name, the most selective first, then location, then age. And just to make it covering so we don’t have any key lookups messing with our page reads, I’ve included reputation and let’s create this index. So reads all the pages in the index and creates a new index for us, and then let’s rerun our query. Let’s look at the execution plan for that. So when we go to our messages tab, we’ve improved the logical reads that our query needed, we’re down to four, and then when we look at our execution plan, we see that it was able to use our index. It had to read three rows, and we see that all the predicates are contained in the seek predicates. It did a display name search, then it did a location search, then it did an age search, which is really great. Once it did that display name search it had already eliminated it to just the three people named Chris because it was such a selective value. But let’s talk about maybe we’ve got a dynamic query. Maybe we’ve got an application page where you can type in a display name or you can leave it blank. What happens if that query is generated and they’re only looking for users in the United States and the age of 34? Can it still use this index? What if its display name is an optional parameter?
Well, we haven’t changed our select clause, but let’s run this query again and see what happens. Remember, we had four logical reads and we’re able to use display name. We execute this query and we go to logical reads, we’ve ballooned up to 1800. Now, keeping in mind this is in scale to the rest of my database, this is about a seventh or a sixth of the entire index that it had to read. So how did it do that? Well, we’ll see our answer inside the execution plan. And what we have now is actually an index scan. Because an optional parameter was the first layer of our index and it wasn’t provided to the query, SQL Server couldn’t get to the other layers of the index, the other key columns are layered underneath that. Now, it can still access them in the form of a scan and it didn’t have to read all the pages so it’s still a fairly efficient operation. It’s better than reading the whole table, but we don’t have seek predicates any longer. These are just defined as predicates, where it’s a predicate on the age and a predicate on the location.
So what would I do in this scenario if I had a query where it’s rapidly changing and I know what columns are set and I know what columns are optional? Well, in that kind of case, where we know that we could occasionally include display name, let’s create an index with display name as the last key. And the reason I include it as a key is because I don’t want SQL Server to have to go read the included pages to find this. If there’s a display name that it needs to search for once it’s read those thousand pages, I want it to be able to find it efficiently. So let’s create this index and we’ve changed the order. So we no longer have the most selective column first. This is kind of a twist on the idea of selectivity and this is because queries can change. So we run our query and let’s look at the number of pages read once we run this query to create it. So we look at the messages and we’re down to six. So this is an interesting twist. I haven’t included display name. All I’ve done is change the order of key columns. It’s this exact same data inside the index but the layers are ordered differently. And because they’re ordered differently, now we see a more efficient number of logical reads for this scenario and if we go to the execution plan and we look at this index seek, we can see that it changed the predicates from being just predicates on a scan to being seek predicates.
This is how you could index. You would want to look what’s going to be guaranteed requested by the application. Sometimes you can tell maybe we only want to look for a buyer code or something like that. That would be a great first key if you can rely on that parameter always being provided. This is kind of the wrinkle that we get when we have more complicated applications or changing parameters. So I think that’s all I have for this demo. What I want to do is I also want to show what happens if you include Chris. So now we’ve gone back and now we have this parameter again. What we’ll do is we’ll drop the index that we created at the very start where we created it on display name first, display name, location, and age, and now that we’ve just got the second index, the only index on there is the one starting with location and age, but it ends with display name. We run our query again and we get the same number of great logical reads. It’s back to three. Even though we didn’t order it in order of selectivity, this index can cover both scenarios. It can cover optional parameters or required. So that’s how I would design an index when I knew that maybe sometimes they’re searching for that value. But if you don’t have a scenario like this or maybe you have a fixed set of columns, you can rely on the selectivity to make your index use the columns in the most efficient order possible. Do we have any questions on selectivity or anything like that, Daniel?
Daniel Hutmacher: Yeah, for starters, you’re getting lots of love for being practiced and smooth and very methodical, so there’s lots of love in the Slack channel. They’ve given up their nerd fight now.
Arthur Daniels: Thank you. I was just going to say, if there’s any feedback, anything you guys want to give me, feel free to tweet me and I’ll check it afterwards. I’ll also try to read through the section and when I show my blog post later, I believe I have a section for comments so I’d just love to hear any feedback or thoughts or corrections even if you guys have anything. So thank you very much for the praise, guys, I appreciate it.
Daniel Hutmacher: And you also got a tricky question, which you can decide if you want to answer now or later. It’s one that I recognize a lot having worked with BI. If you have a database table with like, a billion rows, it’s a wide table also and a lot of queries read and search a large number of these columns. We need to limit the number of indexes for best performance base usage, et cetera, so the question is how do you design indexes to be used across multiple queries where different columns are used?
Arthur Daniels: That’s a great question. This is not my area of expertise. I deal most of the time with transactional systems where you can have about maybe five indexes, and I’d love to hear your thoughts on it, Daniel, and what my first thought would be is that in that kind of scenario, you can actually – I think you can have more indexes in data warehousing type scenarios. You’re able to index more but I still – I haven’t dealt with too many scenarios where it’s a really wide table. What do you think, Daniel?
Daniel Hutmacher: I’m not going to hijack your session. Column store indexes.
Arthur Daniels: Great point. Column store. So whoever asked that question, Nico has a great session, I think it was on GroupBy. Look up…
Daniel Hutmacher: He says it’s actually more of an OLTP table.
Arthur Daniels: Oh, interesting. Okay, so…
Daniel Hutmacher: Don’t try to shoot for the column store actually because the column store isn’t sensitive to column ordering. It just includes all the columns and you can pick and choose which ones you want to use. So try that.
Arthur Daniels: Yeah, there’s a GroupBy session on creating column store indexes and if you want to create them for transactional systems, I think there’s some extra caveats. You want to be careful. I don’t have any experience with those so I can’t actually speak to that.
Daniel Hutmacher: And it’s sensitive to versions as well. It gets better with each SQL Server version simply put.
Arthur Daniels: Oh great. So first upgrade, right?
Daniel Hutmacher: Yeah. Lots of questions, I hope you have the time. Tell us how the index is placed in pages, from Anton.
Arthur Daniels: So I’d be curious to know if he’s talking about the physical storage structure or…
Daniel Hutmacher: I’m assuming yes.
Arthur Daniels: Okay, so I don’t want to go too much into physical storage structures simply because I don’t necessarily know all the internals of indexing, but when I’m referring to the layers of indexing, that’s the physical structure to an extent. Non-clustered indexes use a B-tree structure, so that means at the start, you’ve got the first key column in order and then on the layer of the B-tree below that, you’ve got the second key column and so on and so forth. So the physical structure of the index is what’s preventing it from being able to use the other parts in an index seek when you removed that first key column as a parameter.
Daniel Hutmacher: Yeah. So questions are flooding in here, you’ll just have to tell me when you want to move on.
Arthur Daniels: Let’s do one more and then we’ll move on.
Daniel Hutmacher: Alright, so Brian says – and by the way, Corey with the column store thing, he says he’s on 2012 so column store is not a great choice. Brian asks, “If you remove the predicate for display name and leave the other two, will it use the index with the display name as the first key value? Perhaps scan or skip altogether?”
Arthur Daniels: Yeah, and I think I tried to show that in the demo, but yes, when the data is available in the index and it’s a smaller subset of data than is on the clustered index, SQL Server will often try to find the smallest index possible that still has the data it needs, even if it’s not in the right order to do a seek. So yes, it will do a scan in that case.
Daniel Hutmacher: Yeah. And finally, just one short one. Amy says why you’re included max stop in your create index statements.
Arthur Daniels: So that’s a really great question. Some of that is habit. So I work in a system where we want to specify, we don’t want create index statements to go out of control. And it’s just really a habit that I’ve developed, so I just include max stop when I create my indexes. I didn’t mean to imply that that should be part of your syntax.
Great, so let’s move on with the presentation, I think we’ve still got plenty of time and we’ll try to stop for questions in a little bit again. I’m loving – it’s really great to hear these questions.
Daniel Hutmacher: Really engaging session. It’s really cool to see all the questions.
Arthur Daniels: Well thank you, thanks guys on Slack. So let’s talk about a different problem query. So, so far we’ve covered what happens when you do a join and you’ve got a join with additional filters or you’ve got multiple equivalent columns and sometimes you remove one of the columns and sometimes you add it back. Let’s talk about a problem query where you have different types of predicates. So now we’re looking at the same users table, but I want users who created their account between December 2008 and December 2009. I want users who live in the United States, and I want to do a wildcard search, just show me all the ones with the user name like user. So one thing about the Stack Overflow, the whole website, you don’t have to create a display name and if you don’t, you end up with a display name that looks like user with some numbers after it.
So let’s take a look at this problem query but before we jump into a demo, I want to show you how I would break this down logically if I had this query. What I want to do right off the bat is I want to split this query up into here’s some greater than or less than predicates, here’s some equal predicates, and here’s some like predicates. So your query – some of you have some very complicated queries, but what’s important is to identify what’s being searched. And we’ll throw out the complete option of optional. We’ll just consider this as a static query, all these columns will always be requested by the stored procedure, by this query.
So based on that, how do we create an efficient index? Well, let’s look through some demos and we’ll see what the best index for this kind of scenario is where we have different types of predicates, we want some information, maybe the wild card can search, it’s like a user can search in – it’s like they’re doing a Google search and they type in whatever they want. So let’s take a look at the demo and let’s look at multiple predicates where different types of predicates are applied.
So first I’ll make sure that we’ve cleaned up all our indexes and then we’ll go over here and we’ll look at different predicates. So what I want to do is mentally I split it out in the order that someone supplied them. This isn’t necessarily the best order, I just want to logically get an idea of what needs to be done for this query. So first we want only users from those creation dates, and we run that query. Let’s get the execution plan. We know in the execution plan it’s going to do a clustered index scan because I dropped all the other indexes possible. And as expected, it’s reading the entire table. 7000 pages. So let’s go down here and let’s create an index on creation date and we’ll include reputation. So we’ve created our index, we’ll run it again, and what’s cool about this is greater than or less than are indexable columns. When we’ve created this index, we’re able to get an efficient index seek on these values. Creation date, it’s a great index key column. It worked for this query, we got a seek predicate, not just a predicate, so I would say this is overall much more efficient. We’ve created an index for this query.
But let’s look through the other scenarios. Let’s add in the second predicate that we have. This time, we’re going to look for the predicate location United States. So we run our query, we look at the messages tab. Unfortunately, we’re back to our logical reads. It’s back to reading the entire table. Let’s look at that behavior. Oh, it went back to a clustered index scan and what it’s doing is it’s applying both predicates in this clustered index scan, which makes sense. I didn’t include location as an included or a key column, so to get the data it had to go to the clustered index. So we won’t look too much at this missing index request. What I want to point out is that missing index request like this can appear when you don’t have any indexes in your table, or when SQL Server just doesn’t have enough information.
So let’s create this non-clustered index but instead of ordering it like the missing index where they’re ordering the location column first then the creation date second, let’s try with creation date first because that’s the order it’s written in, right? Let’s create that index and let’s rerun that step. We’re just rerunning location, United States, those users again, and we go to our messages tab, hey, 344 logical reads. It’s a lot better than 7000. That’s a really good amount. Let’s look at this execution plan, let’s see how it used our index. And we still have a missing index request, which is kind of interesting. So when we look at our index seek, we want to see that there’s actually two operations here. First, not the first operation but just first in the order that we’re seeing it is this predicate on location. And then the seek predicate that it did before that was on creation date. So judging by this, in this scenario, if I saw something like this after I created an index, I would want to try reordering the predicates in that index. Because what I’m seeing is I want everything, if it’s going to be a covering index, if it’s going to fit that query specifically, I want everything to fit inside the seek predicates. So let’s run that example. Let’s try the other way around. Let’s try it with location first and then creation date. So we create that index and then we run our query yet again, and we were at 344 logical reads in the previous example. Now when we go to our message tab, we’re down to 11.
So the cool thing about this is if you’re testing in an environment where you can create indexes as much as you want and you have good enough data to test on, it doesn’t have to be real live data, what you can see is when both indexes are created, SQL Server picked the one that I had created second, and what it picked was this index that I created on location and creation date, and when you look inside the execution plan, this is the crucial step, now we have all seek predicates. So SQL Server knew the order that it wanted the indexes in or the key columns in, we just had to try things out, just experiment. When we had indexes as a predicate, not a seek predicate, we had to try things out, see if we can get a better execution plan.
So that’s guaranteed that we have location, but we haven’t added in the third predicate yet, the like clause. So let’s add that in and see what happens. So we’re looking at 11 page reads for our last example and let’s run this again and see what we get. When we go to the messages tab, we’ve ballooned up, we’re at 3679 logical reads. That’s a big difference from 11. And I imagine we’ll see something in this execution plan, so let’s take a look. Oh, there’s our huge cost, right in this key lookup. So in the previous example, it did the seek on location and then it did a seek on the creation date. But what it had to do was then it had to apply this wild card. It had to go look for display names that were like user. And that’s where a lot of the extra page reads came in. If we look up here in the top of the execution plan, the number of rows read, it read 1000 rows. So while that’s not a lot in this example, this example is scaled down fairly. What I want you to know is when you don’t include a column that could be used for searching and SQL Server has to go do a key lookup, this predicate and the term for it properly is residual predicate is fairly expensive. This is an expensive example.
So knowing that we want location first because we tested the index and we know that location’s going to result in less logical reads, then we want creation date second. How do we index for a wild card? Well, the stinky thing about wild cards, the thing that really stinks about this is that wild cards can’t be seeked. When you have a wild card on both sides, I should specify. There’s some special cases and if you have to deal with wild cards a lot, this is something that you should look more into. But if you apply wild cards on both sides of the statement, even if we create this next index and we walk through this next example where we’re searching through the same query, what we’ll see is we got a better amount of page read. We have 14 instead of 1000 or 3000, whatever it was in that last example, but inside our execution plan, we weren’t able to move the predicate to being a seek predicate.
Now, you’re welcome to try this out, try different orders of things, but the thing about a wild card is it has to look through each one to make sure it’s not anywhere in the string. It has to look through the entire set of the string. So it’s not going to be able to seek to that value, so you’re always going to have this additional predicate. But we reduce the amount of work that it had to do by including it in the key columns. You could also add it as an included column. Now, this is something you’d want to test out. So this is kind of an example of how you can order the predicates. Now, what’s important to remember is it’s not the order in which you type them in the query. It’s going to be the order in which SQL Server chooses to search on them, and it’s worth experimenting layering the key columns differently to see if you can get a more efficient execution plan and if SQL Server can search through things faster. So at this point, that’s my multiple predicates demo. Are there any questions on multiple predicates or wildcards or anything like that?
Daniel Hutmacher: No, just more love for your presentation style. I think people are really thinking you’re methodical and – let me see. No, it’s all just love. No questions.
Arthur Daniels: I appreciate it, guys. I’m looking forward to reading the comments afterwards. Thank you, everyone.
Daniel Hutmacher: Oh, lots of people are typing. Will you show what happens if you take out the first percent sign, asks Kevin.
Arthur Daniels: I will do that because I want to show you what happens. So this is kind of a cool case where SQL Server, when you change things around, sometimes they can get more efficient. Now, I can’t guarantee because I’ve tried a lot of these demos out and I made sure the results were the same. I’ve read some stuff online and I’d encourage you guys to go read more because I’m not an expert in wildcard searching because that’s not something I do often. Now, this is not going to be something that’s usually efficient in SQL Server, but I know that when moving the first wild card, sometimes SQL Server can search through these first few characters and do them more efficiently. Now, I don’t know if I have the tools, but let’s see what it did. So, unfortunately, it’s the same number of logical reads for this example that I’ve got up right now. If we go to this execution plan, it’s still reading the same amount so this isn’t a good example of that.
Daniel Hutmacher: What was your index definition?
Arthur Daniels: That’s a great question. So my index definition was location first, creation date, then display name.
Daniel Hutmacher: If you move display name before creation date, that’ll work.
Arthur Daniels: Okay. Well, let’s give this a try and if this doesn’t work, what I’d love to see is I’d love to see if someone has the blog post – I think there’s a couple blog posts about indexing for these wildcard scenarios or I know there’s some information on why you shouldn’t do wildcards within SQL Server, but let’s take a look at this query and see if you’re right.
Daniel Hutmacher: Our users demand wildcards so it’s not for us to choose.
Arthur Daniels: That’s a great point. So yes, that did – that worked a lot better. Thanks, Daniel. Five logical reads instead of 14. And if we look here at the index seek, unfortunately, now it’s doing a creation date predicate. But because it was able to search through those users first, in this example, it was less logical reads. And I can’t say too much to how that’s effective in other scenarios, but that’s a great question and I learned something too.
Daniel Hutmacher: So what it does, it converts the wild card – when you end the wild card with a percentage, it converts the wild card to a greater than smaller than. I think you can see it in the plan even.
Arthur Daniels: Yeah, and that’d be a great thing. Let’s talk about that afterwards. I want to show some things that I just talked about and why I did the things that I did. So in this next slide, I wanted to talk about what I was trying to do. So when I was trying to create an index on this table, what I tried to demonstrate was key columns do a few things really well. What they do really well is when you know what the value is, when your parameter is equal to a certain value. They also do greater than or less than, they do these really efficiently. These are index seek events. You can seek on these values. The things that they don’t do so well and the things that they can do okay on, and obviously I was showing that they can get better is they can do like, wildcard searches.
So what function do included columns do? Well, included columns do some things really well. If there’s a value that you want to select and you’re selecting that value every single time you query a table, perfect for an included column. That’s absolutely amazing. They’re also good for being there, and the reason I say being there – let’s say you have a dynamic application, sometimes we want to select some columns, sometimes we want to select another. If you have included values, you don’t have to worry about a key lookup. Now, key lookups aren’t the worst thing in the world. You need to understand how much work is being done, how many pages had to be read. But if you want to really cover – if you want every single condition covered, sometimes you have to include them as included columns, and I’ll provide a counterpoint to this next. But if you think about it and you need them to always be there, they can always be there as an included column. You can also scan the included columns. So I mentioned very briefly and I don’t have a demo for this, unfortunately, but what can happen is SQL Server will look for the smallest index to use and it can scan through the included columns on that index. So if you have these very, very large tables but you only have a certain number of rows like locations, even if I have to scan through 11,000 locations, it’s still better than reading through every single page on the entire index.
So why wouldn’t you want to include all these included columns? Well, think about it like this: we’ve got our posts table and the first index we created was on the owner user ID, then we included the reputation on that, but for the sake of this just pretend it’s an index on a single key column. Think about this in terms of data sizes. Each of these cubes represents more and more data. An index that includes owner user ID is going to take up very little data. Then when I include more information, when I include the tags of that table, we’re dealing with a larger index. Let’s say we grew from 1GB to 5GB. But if we go crazy, if we decided my index – my query could use anything, I want anything that’s possible, well in those scenarios, if you include every single column possible, you can end up creating an index that’s the same size as your base table. Now, that’s a decision that you have to make. That’s a tradeoff. But I want to provide it as a word of warning. You’re creating a lot of overhead, a lot of size inside your database, and unless you really need those columns, unless you need this query to be mission critical fast, it’s often a bad idea to just include everything. Usually someone ran a select star and they’re saying, well, I need all the columns. Well, you don’t necessarily need all the columns.
So I’m reaching the end of my presentation and I want to talk about the main key points that I wanted to include. When you are doing index tuning, you want to start off and you want to interpret the predicates of your query. If you have query tuning skills, apply these ahead of time. Use the information you know about the query and then run it to see how SQL Server interprets your request. See what it wants to use. A lot of what I did to start every single one of my demos was I ran the query and saw how SQL Server used that data or how it searched the data like it was. Then you want to order your key columns and maximize their usefulness. Sometimes this is in the most selective order, but sometimes if your query is changing or if you want certain things to go before other things in your query, meaning you want account number first or you want location first, make sure that the order of the key columns you’ve experimented with, you’ve tried things out to make sure it’s the most useful key column order. And finally, include or move included columns for your scenario. I’ve seen a lot of effective indexes that have very few or no included columns. It’s not a necessity for performance. And like in the example with the data sizes, the more included columns is the more weight.
So thank you, everyone, thanks for all the kind words in Slack, I’m looking forward to hearing some feedback. I have the link up right now. If you don’t want to download, I have a link to SlideShare where I’ve uploaded the slide deck. If you do want to download, I’ve got the slide deck, I’ve got the demos, everything’s already out there on my website. Feel free to reach out to me, I have LinkedIn, I have all that stuff. The best way is usually Twitter, @arthurdanSQL, thank you, everyone, for attending my session. Thanks for having me here, guys, for GroupBy.
Brent Ozar: Very nice job, sir. Yeah, you definitely want to read the Slack comments. Tons of people putting in thanks, mind blown, thanks for the talk. Very nice presentation. We even had a couple people asking are you an MVP and I said, well – because they said you clearly should be given how much you know about the topic, I said, well, here’s a link to go nominate him for MVP. There you go.
Arthur Daniels: Wow. I’m blown away guys, thank you so much.
Brent Ozar: You bet.