Columnstore Indexes: What are we waiting for?

Target Audience:

DBAs and Query Optimisers – dive into the invisible part of the Columnstore Indexes.

Abstract:

Every time you see a Columnstore Index getting involved in an execution plan, do you realize that there are whole execution plans behind those Index Scans? Did you ever ask yourself what those strange and weird HT_* waits stand for? Why do we wait seconds or minutes on something like HTBUILD while seemingly nothing happens? Why do we have a ROWGROUP_VERSION wait on one server, while another allows queries to run faster?

This session focuses on answering those questions – helping you understand the reasons and the conditions behind every single available wait for the Columnstore Indexes and the Batch Execution Mode.

Audio podcast:

Enjoy the podcast? Don’t miss an episode: subscribe via iTunes, Stitcher, or RSS, and leave us a review in iTunes.

Why I Want to Present This Session:

There is no real official documentation on the topic of the Columnstore Indexes waits, and I believe it is geeky fun to dive into the matter and understand the implementation as closely as possible to the real code. A lot of times people will see the waits but will have no idea what they mean or how they can potentially tune them – this session should provide the insights.

Additional Resources:

http://www.nikoport.com/columnstore

Session Transcript:

ERIK DARLING: Coming up at 2:15pm Eastern Standard Time, God’s standard time, is the wonderful, beautiful, and talented Niko Neugebauer with Columnstore Indexes: What Are We Waiting For? What is who waiting for? Is it we or is it you?

ANTHONY NOCENTINO: I don’t think Niko’s waiting for anything.

ERIK DARLING: No, Niko’s ready to go. Alright with that, I’m going to bow out and finally use the restroom. Niko, good luck, break a leg.

NIKO NEUGEBAUER: So ladies and gentlemen, welcome to the worst session of this conference, which is Columnstore Indexes: What Are We Waiting For? So gentlemen, should I introduce you, or will you introduce yourselves? Both – two fine gentlemen who are accompanying me here.

ANTHONY NOCENTINO: I’m just along for the ride, Niko, so you go ahead and run with it. I have one question. Do you want questions throughout the session? Do you want us to hold them to the end? What are your thoughts there?

NIKO NEUGEBAUER: Let’s do them as we do the presentation. Just don’t interrupt me when I am saying something which I pretend to be of some meaning, but at the end of every slide I’m absolutely open to questions. Alright? Okay, let’s go.

So, what are we waiting for? Waiting for this presentation, ladies and gentlemen. My name is, as you can read, and my title you can also read, and you can read where to find me on LinkedIn or on Twitter to follow me. Very fine stuff. If you Google for my name and put maybe something like SQL, you should be able to find me. Nothing interesting. I’m breaking the rules. I enjoy breaking the rules. The good rules.

So what are we talking about today? We’re talking about logical fragmentation? No. We’re talking about the dictionary creation? No. We’re talking about the bulk load API? No. We’re talking about root cause analysis? Maybe. Are we talking about the wait types? Yes. So what are we talking about actually? What are we waiting for? Are we talking about the columnstore wait types? We will be talking about the columnstore indexes and the batch execution mode. So please, raise your hand if you have never heard about the columnstore indexes. How many people have raised their hand, Anthony?

ANTHONY NOCENTINO: I don’t see any in the interface.

NIKO NEUGEBAUER: Okay, that’s fine. We can’t see them in any case. I’m talking about the columnstore indexes. I’m not talking about the kind I call unfinished – still-can-be-useful-but-you-have-to-push-them-against-the-wall columnstore indexes of 2012. I’m talking mostly about the clustered columnstore indexes and the nonclustered columnstore indexes: starting with 2014 for the clustered columnstore, and 2016 for the updatable nonclustered columnstore.

So you know, this picture is in almost any presentation of mine when I’m talking about the columnstore indexes. We’ve got all these fine structures: there are row groups, they are compressed with a different degree of success. Most of the time they will have different kinds of data. We have the delta store; we can have one or multiple delta stores per index or per partition. Actually, it’s a sort of huge misunderstanding and a huge part of the columnstore indexes problem. What are those delta stores? It’s a typical way of loading data into the columnstore indexes – but let’s say not the efficient way. These are the trickle inserts, because actually, if we talk to somebody who’s running a different type of columnstore indexes outside of the Microsoft Data Platform, they will laugh, and sometimes they will even make fun: yeah, so you’ve got the columnstore part, which will be the row group, and you’ve got the row store part – because the delta store, starting with 2016, is an uncompressed source of data. This is how you land new data into the columnstore if you are loading with regular inserts, or with the bulk load API and less than 102,400 rows. So delta stores are not an efficient way of getting data into the columnstore, but they can be open, they can be closed. We’ve got this deleted bitmap; it’s a b-tree structure, basically one bit for every single row within the row group, where we mark the row as active or deleted.
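
To make that 102,400-row cutoff concrete, here is a minimal sketch (the dbo.Sales table and CCI_Sales index name are hypothetical, not from the session demos): a small insert lands in a delta store, a large single insert goes straight into a compressed row group, and the physical stats DMV shows where the rows ended up.

-- Hypothetical table, purely to illustrate the 102,400-row bulk load cutoff
CREATE TABLE dbo.Sales (SaleId INT NOT NULL, Amount MONEY NOT NULL);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sales ON dbo.Sales;

-- 10,000 rows: a trickle insert, lands in an open delta store
INSERT INTO dbo.Sales (SaleId, Amount)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 1.0
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

-- 102,400 rows in one single insert: a compressed row group right away, no delta store
INSERT INTO dbo.Sales (SaleId, Amount)
SELECT TOP (102400) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 1.0
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

-- OPEN = delta store, COMPRESSED = row group
SELECT row_group_id, state_desc, total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.Sales');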

We’ve got the delete and update operations we can run, where update works as marking the row within the compressed row group as deleted and then inserting; and with merge, we update the data into the delta store. Basically, why should you ever run merge and update against the clustered columnstore if you can avoid it? First of all, merge is buggy. Buggy in the way it works with clustered columnstore indexes; it is beyond slow, and I’m just – I’m scared of thinking of, of seeing a merge operation. An update is slow. Why? Because it doesn’t work in the batch execution mode, and because it will not use something which is called a hidden row group, which is how you load the data fast into the columnstore indexes. You will be using 102,400 rows or more, and this is a very efficient way to load data into the columnstore indexes.

So what changed in 16 and 17? We’ve got more structures. Sorry, really sorry for the ugly picture. This one I was drawing while I was blogging a couple of years ago, so it’s ugly. I beg for your forgiveness. Essentially, in 2016, 17, and 19, we have the possibility of not only building the clustered columnstore on the table, but also adding nonclustered secondary row store indexes. And these are the compressed row groups, this is still the deleted bitmap, one, multiple, or none of the delta stores, and here we have one or multiple nonclustered row store indexes. This one, this is a really smart thing, which is called the mapping index, and I actually confess to trying to torture it in a good way, trying to catch it with any bug or make it the source of problems. I never managed to do that, and diving into its implementation bits and bytes – a really, really smart improvement.

So what does this mapping index do? It helps you connect a row within the b-tree with a concrete row group and the row locator within that row group. I’m sorry if this is a little bit of a fast and advanced presentation, but this is how it’s supposed to be. Nonclustered columnstore indexes in 2016 and 2017 got updatable. It became a finished nonclustered columnstore index: we have the b-tree structure of the clustered row store index, and you get, for your nonclustered columnstore index, the row groups, the deleted bitmap, one or multiple or none of the delta stores, and you’ve got a deleted buffer.

The deleted buffer is actually not one – in the real implementation there are two of them, to serve you in the best way possible. Every time you delete a row within your principal row store structure, instead of scanning the deleted bitmap and slowing down the insert, update, and delete operations – if you are updating or deleting a row within the compressed row group, the key for the operation will be stored within the deleted buffer. And when you run alter index reorganize, or when the tuple mover is running, it will synchronize the deleted buffer with the deleted bitmap and everything will work as before. So, a question: do you see any potential source for the problems here so far in this architecture?
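
If you want to see these internal structures on your own system, sys.internal_partitions (SQL Server 2016 and later) exposes them; a quick look, assuming the same hypothetical dbo.Sales table from before:

-- Delta stores, the deleted bitmap, and (for a nonclustered columnstore) the deleted buffer
SELECT internal_object_type_desc, rows
FROM sys.internal_partitions
WHERE object_id = OBJECT_ID('dbo.Sales');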

MARTIN PEREZ: The updates become pretty expensive.

NIKO NEUGEBAUER: Updates are always expensive. So actually, I was talking to some very smart gentlemen from the Netherlands and the UK. Some of them were telling me that in some huge installations on PDW and APS, you do everything you can in order to avoid any updates. Or merges. You do delete plus insert, I’ve been told. That was many years ago, and I try to follow this principle. So, just a little bit on the processing part for the nonclustered columnstore index. We read the data from the row groups, we read the rows which are marked as deleted; if the row locator is not marked in the deleted bitmap, the row passes through, if not, we eliminate it. Then we join the information with the deleted buffer and we filter out the keys of the removed data, and then we finally join with the delta store.
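
As a rough sketch of that delete-plus-insert pattern instead of an update (both table names here are made up for illustration):

-- Instead of UPDATE (row mode, delta store churn): delete the affected rows,
-- then re-insert the corrected versions in bulk from a staging table
BEGIN TRANSACTION;

DELETE s
FROM dbo.Sales AS s
JOIN dbo.SalesCorrections AS c ON c.SaleId = s.SaleId;

INSERT INTO dbo.Sales (SaleId, Amount)
SELECT SaleId, NewAmount
FROM dbo.SalesCorrections;

COMMIT;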

I’m not even asking this as a question, I’m making an affirmation. The expectation for the nonclustered columnstore index is not to be faster than the clustered columnstore index, because if you rewind here, in this architecture you see there is one deleted-rows structure. This one, it’s here on the left side. When we go to the nonclustered columnstore indexes, there are two of them: the deleted buffer and the deleted bitmap. And this part, the deleted buffer, is an extra operation, because when you’re processing, when you’re reading the data, you will ignore the mapping index even if you have the secondary nonclustered row store indexes. So nonclustered columnstore index processing will be equal or slower compared to the clustered columnstore index processing. So will we have some huge problems and differences between them here? I do not expect so. I never saw them, and I do not expect them to appear.

So, just a slide about the in-memory columnstore. Here we do not have the deleted bitmap structure in the architecture. We have a deleted rows table, where we just store the IDs, the keys of the rows which were deleted or updated. In-memory columnstore is a different game, and I hear about some people who are starting to use it in production and they’re actually doing pretty well, with good success, but I guess the in-memory is still kind of not the favorite kid on the block, which I think is quite unfortunate. But let’s hope Microsoft will start investing a little bit more into it.

So let’s take a look at the point of our presentation. I covered the basics. If you are not up to speed on the principal architecture when I mention it, I recommend – if you Google for Niko and columnstore, you will be able to find my blog. I have over 130 blog posts on the matter; pick up a bit and then get back to this part, because I’m trying to look a little bit deeper. Any questions before we go into the real presentation?

MARTIN PEREZ: None so far.

NIKO NEUGEBAUER: People are probably sleeping. It’s late here. Alright, so I think it is very important to understand there is a difference between the word heal and the word cure. There is a difference, and it is the one between patch and resolve. When you’ve got trouble, you can patch. But a patch is not the solution for the problem. A patch is a hammer job. If you have a boat and you have a leak, you take some metal or whatever they apply, you hammer it onto the place where you have the leak, and is it solved? Well, it’s not. It’s patched. It’s different. And I try to start with a patch if it hurts – which would be the heal – but I tend to go into the resolve. This is what interests me. Because patches are only capable of staying for some amount of time, and depending on the kind of problem, a patch can be very, very temporary – we control some very exotic solution for some query, and is it a resolution? Sometimes it is, but a lot of times it’s not.

A cumulative update or a trace flag will interfere with the next SQL Server version, and you’ll get into trouble. So how do SQL Server and SQL DB work? We have an application, whatever it is – you’re running your Management Studio application – you send a request, it sits in the queue in the SQL Server memory, and then a worker thread from the thread pool will pick up the request, compile, and execute it. And then the statement can read and modify the data. You can basically see that we start high, conceptually outside of SQL Server, and we go down, down, down, and every level gives us a better understanding of the problem, and from a better understanding we can go one step further and decide how we resolve the problem. Maybe also how we patch the problem – but “I Googled, I found it on Stack Overflow, I found the solution” – well, it’s not a solution, you found a patch. And I tend to go into – I’m interested in finding the solution, the cure.

So when we are already running our statement, reading, modifying, inserting our data, we see the things that make us wait a lot of the time. Unless some of the waits appeared before – the locking: I bet everybody working with SQL Server for more than a couple of months has already faced the problem of locking. And then we go potentially into the latching. These are the internal waits – time spent waiting for certain system resources, most of the time.

And we finally go into the waits, because for me, waits are the way of understanding what is going on. What are we waiting for? Are we waiting for data to be read into the memory? Are we waiting for the query to receive a memory grant to start processing, because it’s going through the biggest, hugest gateway, because whatever – because we need, like, five terabytes of memory, whatever. Or are we waiting for the data to be read from the disk into the memory? Or maybe some of the system tables are blocked and we cannot get to the resource we are requesting. Maybe it’s locked, maybe we’re waiting for the latch. Or maybe we have a chance of looking into the internal processes and understanding if those internal processes – the real programming code – are doing something they’re not supposed to. Maybe it is a bug. But these are the things I consider to be important. These are the things I try to understand. Do I understand them all? Of course not. I don’t know – Bob Ward, Joe Obbish, Paul White, Dmitri [Belugin] – these are the guys who I’m looking up to. These are the guys who understand the real stuff. But I’m trying. I’m trying.

So this is the hierarchy of the performance analysis. It’s not [inaudible] but it’s the process – it’s the execution plan, it’s locking and latches, and finally it’s waits. With every year, overall, I tend to go and look more and more into the waits, because I think that the waits are telling some more detailed story, and sometimes when looking at the higher levels we see something that we interpret in a different way than when we’re looking into the wait stats.

So how do we analyze the execution in SQL Server? Well, my query is slow; okay, let me run this big query on production again and see if it’s slow. That was fast. Try it again. Oh, this is slow, nobody can connect to SQL Server. Okay, we found the problem, this query is slow. Very good, you just killed SQL Server. Maybe you can extract a backup, restore it, run it, analyze it; maybe you use DBCC CLONEDATABASE. You can see the execution plan, you understand that you are getting a very ugly execution plan with wrong statistics and so on. You need to run them. That’s the first step. Well, maybe sometimes if you have errors, you can look at the errors. If you’re on 2019 SQL Server, which is still not released by the time I’m presenting this session, or maybe you’re looking at 2017, which already has the trace flag in order to see what is being truncated, maybe you find out what the exact problem is. About what to do – if you can’t get to the error message, or if everything is fine but the query runs slow, and maybe it runs slow only sometimes: locks, a very popular way of analyzing, and I guess we all do it when we run something like monitoring for the queries, looking at the locks – whether this process is being blocked by somebody and is waiting to get the necessary exclusive lock.

Latches – we can analyze them, but well, I think it’s not the thing; as strange as it will seem, I think that the latches are much less accessible, rarer, than the waits. And the waits are what I enjoy more and more with time. And we can especially enjoy the wait types more in the recent SQL Server versions, because Microsoft was so fine that they inserted the possibility of seeing them even within the execution plan. I think it’s actually brilliant. There are items we can discuss, whether it’s completely what we are looking for or not, but actually – I’ll praise Microsoft once again for delivering us and the [inaudible]. Incredible job once again.
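
On 2016 and later you can also pull the waits per session right after running the query under test – a simple way, using sys.dm_exec_session_wait_stats, which complements the WaitStats shown in the actual execution plan:

-- Waits accumulated by the current session (SQL Server 2016+)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = @@SPID
ORDER BY wait_time_ms DESC;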

So, waits. Well, all of you can read. If you are just listening to the recording then you are out of luck, but if you want, I’ll read it to you – maybe because I didn’t talk about this possibility. So, when a thread cannot proceed because some resource is not available, it will have to wait for this resource, and this time is defined as the wait. An example of the resources a thread can wait for is a disk. For example, we send a request: read me some information from the disk, scan this table, get this data, seek this index. And the time while the disk is not returning the information to the process – this is the wait time. Or maybe we are waiting for the memory grant, because, well, the resources are not unlimited. Maybe we’ll have to wait for another thread to finish, and regarding the subject of this presentation, this is a very typical situation: waiting for another thread to finish is typically parallelism. I’ll group it under one umbrella of parallelism. You will wait for parallelism because, as you all know, you cannot get all the threads to finish at the same time. You’re running with eight cores, 16, 32, 64, whatever; however big your pool of threads is, you will be waiting for them – two, five of them finish at one time, and some of them can take some significant time to finish.
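
For the wait types this session cares about, a rough instance-wide filter over sys.dm_os_wait_stats looks something like this (the LIKE patterns are just a convenient approximation, not an official grouping):

SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'HT%'          -- hash table waits
   OR wait_type LIKE 'BP%'          -- batch processing waits
   OR wait_type LIKE 'ROWGROUP%'    -- columnstore architectural waits
   OR wait_type LIKE 'COLUMNSTORE%' -- build throttle, migration task, etc.
ORDER BY wait_time_ms DESC;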

SQL Server tracks hundreds of them, of the wait types. I don’t know, 1,500. Does any one of you gentlemen have an idea of how many wait types there are right now in 2017? Am I already alone in the presentation?

ANTHONY NOCENTINO: No, we’re still here.

NIKO NEUGEBAUER: Let’s agree – I think in my head it’s like, 1,500, 1,800. You can Google it. I can’t, so I can’t impress people with my knowledge, but this is…

ANTHONY NOCENTINO: I’m Googling.

NIKO NEUGEBAUER: And so what does the Bing tell you?

ANTHONY NOCENTINO: Andrea says she was just reading the slides. She says there’s well over 1000 in 2017.

NIKO NEUGEBAUER: Okay, I think actually 1,300 was 2014 or whatever. Anyway, let’s advance. So, execution process lifetime. We have three main states for any process: we’re running, we’re suspended, or we’re runnable. When we’re running, we are not running continuously, because, I mean – if you started four single-core processes on four cores, they would run all the time, and you would not be able to start a fifth process at all unless one of those previous four processes came to an end. So you have your 15 minutes of fame, which in SQL Server is four milliseconds. This is the quantum of execution. You have your four milliseconds, then you go suspended.

When your thread becomes suspended, your process becomes suspended, it will check: will I wait for some resources, do I have all I need to keep on running? Then it will become runnable. When it becomes runnable, it will have to go up through the queue of the runnable processes until it is picked, and it will be moved into the four-milliseconds-of-fame phase, and it will circulate until it finishes.

So imagine I have a session ID 100. We start, we have all the resources, we don’t have to wait; we go into runnable, we go through the queue; all the other processes went through the execution phase, we are at the top of the queue, we jump into running, we get our four milliseconds of fame, we go into suspended – and now we do not have the resources. Maybe we’re waiting for some thread to finish, and we will spend the time waiting, and these are the waits we can observe. Some of the waits. I’m not going into the details of the signal waits, which would be the time we spend going through the queue; I’m mostly talking about the resource waits, which happen when we are in the suspended phase, waiting for some real resource to give us permission to advance. Any questions?
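
You can watch those states live in sys.dm_exec_requests; something along these lines:

-- Who is running, runnable, or suspended right now, and what they are waiting for
SELECT session_id, status, wait_type, wait_time, last_wait_type
FROM sys.dm_exec_requests
WHERE session_id > 50;   -- skip most system sessions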

MARTIN PEREZ: No, this is good.

ANTHONY NOCENTINO: I’ll give you a question. So is that queue, the runnable queue, is that a fixed FIFO queue or is there some intelligence in selecting the processes that go from runnable to running, as it pulls things out of the queue?

NIKO NEUGEBAUER: So, to my understanding – I think it’s a great question because I will not be able to answer it in a competent manner – you have different queues. You have the different queues for the execution, which are connected with the resource of the [inaudible] gate – it’s like when you’re running a small query, a medium query, and a huge query. A small query will advance much faster through the queue than the medium one. The medium will take more time because it will consume more resources, right? And the big queries – in fact, huge gateway I believe it’s called – before some fixes in 2014, 2016, you could run just a couple of those big queries. And my expectation, not my knowledge, but my expectation, is that going up through the queue will correspond to the gateway you have passed, based on the amount of resources you are consuming. Because SQL Server does this in order to make all the queries advance, and there are decisions and a trade-off. A great question, and I’m admitting – I wish I knew a little more about it. I will research. But I am pretty confident that it works this way.

ANTHONY NOCENTINO: Yeah, Linux actually – that’s how it schedules and picks processes to be selected to be scheduled. It’s a similar mechanism, where they increase the interactivity of the smaller queries but also don’t starve the larger queries, and vice versa, so there’s a little – maximum throughput, I guess you could say.

NIKO NEUGEBAUER: So there must be some weight – not a wait as in time, but depending on how heavy the query is, it’s assigned some sort of number: let’s say one to the small query, 10 to the medium, 25 to the big, and 100 for the huge. And then there is an algorithm which will say: we pass three 100s, three 50s, 10 of the others, and 100 of the ones. So I kind of feel there is an equilibrium, and unless we are throttling the queries to make them appear to consume fewer resources, like decreasing their memory, it’s difficult to influence the execution process lifetime. We will definitely increase the speed when we move from suspended into runnable. Eliminating the resource wait – that is essential.
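
A place to peek at the memory-grant side of this queueing is sys.dm_exec_query_resource_semaphores; a quick look (this shows the grant semaphores, which is related to but not the same thing as the scheduler’s runnable queue):

-- Memory grant semaphores: typically id 0 = regular queries, id 1 = small-query semaphore
SELECT resource_semaphore_id, target_memory_kb, available_memory_kb,
       granted_memory_kb, grantee_count, waiter_count
FROM sys.dm_exec_query_resource_semaphores;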

So finally, after whatever it is, 35 minutes, I’m starting to talk about what I want to talk about. I want to talk about the columnstore indexes wait types. And even before there was batch mode on the row store, which will be hugely awesome in 2019 – and don’t expect it to be backported to a service pack of 2017, first of all because there are no service packs on 2017, and secondly, I think it will be worth good money for a lot of people to migrate to 19 – well before it was announced, for me, the separation of the wait types was by the batch mode, because it’s always been there: Joe Obbish showing, with the columnstore injection, that you could mix the row store and the columnstore with the batch mode well before 2019. So the processing waits are something which is mostly connected with the batch execution mode. And please reuse this presentation – a lot of the points I expect to still be valid for the 2019 batch mode on the row store.

And the other thing is the architecture. The architectural waits – they are really connected to this complex architecture that I was showing you in the columnstore index. And do not expect that you can influence a lot of these things. You are able to influence some of them, but it’s mostly about your understanding of what is running, and maybe you can understand that bigger parallelism, especially beyond 16 cores, should be used with great, great responsibility, because the more parallelism you get, the more synchronization you get, and the more waits you will get.

I mean, if you are running everything with one single core, you will not be seeing the waits, but your execution time will also not be the fastest for big amounts of data, which is also logical.

ANTHONY NOCENTINO: Niko, is that something that you’ve observed on your own? Is there a theoretical constraint there with regards to capping at 16 processors or threads in a query?

NIKO NEUGEBAUER: I have definitely observed this, and I used to tell the clients, the enthusiasts, the friends, when we’re talking about the query execution: if you look at the curve – because I can’t see myself, just one screen here, just tell me if I’m showing within the visible space for you – of course I’m extrapolating, there are exceptions to the rule, but the rule is: one, two, four – there will be the linear progression we expect, unless you’re hitting limitations. You get to eight and it’s not the same line; it will start going a little bit below the linear progression. You get to 16 – and a lot of times you run the execution with 32 cores and you expect it to go a little bit higher, but you will see a lot of 32-core executions go slower compared to 16. And no matter what the feature is, you will observe this curve. And I wish it would be a different way on the big machines, but as all of us know, there are all the jokes about the parallelism problems. Parallelism is hard.

ANTHONY NOCENTINO: I’m trying to pull it from the back of my brain, but I remember studying that in graduate school, where there’s a plateau between 16 and 32 in multi-processor systems – even in big computers as well – so it’s interesting to see that that’s still the case in cutting-edge technology.

NIKO NEUGEBAUER: I guess we shouldn’t forget that – because we’re talking here specifically about SQL Server – I can tell you you can observe this not only on Microsoft; you can observe a lot of similar stuff on other operating systems, like Linux, and not only on SQL Server, but on other databases as well.

Cutting edge is one thing, but if you look at what SQL Server does, it is, for me, the best general-purpose database. If you removed the general purpose, I guess the query optimizer team, all the architects – they could blow the socks off with the performance, but you need to balance, and general purpose means you keep happy the guys running small little queries on small databases. I mean, I’ve seen things like 18 instances on a single VM. 18 instances of SQL Server. I’ve seen one today, 350 databases on a single instance of SQL Server, and it functions. It delivers, and people are happy. So you need to balance those extreme cases against the cases where you’re running a huge data warehouse, and I think if you are in the business of processing huge amounts of data – whatever big data means, there is also “huge data”, as some people say – you need to look at the MPP systems, because MPP systems are focused on processing very huge queries. They are optimized for that. They are not optimized for running joins. If you’re running on SQL DW with “I have 10,000 rows, it’s a big data table” – well, see if it finishes. But if you put 10 trillion rows in a table, SQL DW is a platform I would be looking at to process it. Maybe you would go to Hadoop, maybe you’ll do Spark. You need to understand the need.

And so this 16, 32 area – I have to believe that by 2020, 2023, I would expect – I’m not a scientist, whatever, just an expectation – we will start catching up in this space, where the queries will start improving, because as the data grows, Microsoft will invest progressively more and more and make those queries run faster. Alright, I love talking about technology, my friend. It’s so easy to – I’m like that dog from the cartoon, I’m like, squirrel. Any questions? Or I will try to advance.

ANTHONY NOCENTINO: No, I don’t see anything right now, sir.

NIKO NEUGEBAUER: Alright, it’s the first time. So let’s take a look at the batch mode waits. These are not only batch mode – I treat them as batch mode waits when I’m running the columnstore indexes, but columnstore is not in the name, because HT does not stand for columnstore. For that we would use a CC prefix, or, as you will see, the exclusive batch mode BP batch processing waits; HT is the hash table view. These are the waits: hash table delete, hash table memory, hash table reinitialization, hash table repartition. You’re working with a hash table, and if you’re working with columnstore indexes and you’re working with the batch execution mode, even on row store, you will be working with a hash table, because batch execution mode is not supported for the merge join nor for the nested loop join – and let’s join our hands together and say: Microsoft, please do not waste the precious time of the engineers trying to make those two other types of joins work with the batch execution mode, because it doesn’t make any sense.

We’ve got the BPSORT. It’s a batch processing sort wait. We’ve got a pretty exotic one, I’ve been seeing more of this one, which is PWAIT_QRY_BPMEMORY – a pre-wait for query batch processing memory. You will see this one more on busy, complex systems running a huge number of parallel processes. For the columnstore indexes, the architectural type of the waits is ROWGROUP_VERSION. I actually have an ongoing exchange, a different conversation, actually more than one, on this one. People are seeing this one being unreasonably popular, especially on complex, availability-groups-based systems. Then the row group operational stats wait, and shared delta store creation. Also waits you will see, but these are the waits which are connected to the delta stores, and when are we using delta stores? When we are loading data into the columnstore with the trickle inserts. I mean, I wish there would be, like – I don’t know, maybe actually I’ll throw an idea for the execution plan one day to include a kind of a warning, like: you are not using an efficient way to load into the columnstore. Because this is the source of a lot of problems. Not only locking problems but wait problems, because when you’re loading the data, do not expect, like, “I have one process which will be loading.” Yes, you will be loading, and 10,000 people potentially will be accessing the same delta stores in parallel, and you will have to share the access time and have the synchronization process, and this is what will make the access to those structures very ineffective.

This one – I mean, if you ran the 2012 and 2014 SQL Server, well, extended events were the only way you could actually understand, with this columnstore index build throttle extended event, the reason for the row group trimming. The row group trimming is the process when we do not reach 1,048,576 rows, which is the maximum, the perfect size for a columnstore index row group. The throttle is when, externally to the row group, the engine will say: you know what, I’ll cut it at 500,000 rows. And this is something which will not make me, or a lot of people, happy.

Columnstore migration background task. And I’m not talking about any illegal immigrants, because it’s not an immigrant background task. I’m talking about the tuple mover and the migration of the closed or open delta stores into the compressed row groups. The tuple mover, just a reminder – this is an advanced presentation – is the background task (that’s why it comes in this part) which runs by default every five minutes; it looks for the closed delta stores, the ones that reached the maximum size of over one million rows, and it will take them and compress them into row groups. And these are important waits, and there might be cases where you will say: you know what, I’m getting deadlocked and I’m getting huge wait times because of the tuple mover. And the answer for it is trace flag 634, documented and supported, with which we can completely disable – shut down – the tuple mover process, which can be pretty useful.
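
A sketch of that setup – the documented trace flag 634 plus taking over the tuple mover’s job yourself (the index and table names here are the same hypothetical ones as earlier):

-- Disable the background tuple mover instance-wide (documented trace flag)
DBCC TRACEON (634, -1);

-- ...and then compress the delta stores yourself, e.g. at the end of the ETL window
ALTER INDEX CCI_Sales ON dbo.Sales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);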

This one is listed as a possible columnstore wait type, but we’ll see whether it is or it is not. I’m not doing pretty well on time, so I will speed up. Any questions?

MARTIN PEREZ: None so far.

NIKO NEUGEBAUER: Everybody is probably like, Niko is doing an introduction to the session – almost one hour and he’s still introducing this stuff. Hash joins, ladies and gentlemen. The HT wait types. They’re not exclusive to the hash joins, but you know, grouping data using a hash function will produce them in a similar way. So we have a build side, we have a probe side. On the build side, we build a table with hash values; on the probe side, we build a hash value for every row. We compare it with the hash value from the build side, and we say: yeah, it’s a match, it’s good; or: well, it’s not a match, then probably these are not the same values.

It can get more complicated than that, but that’s just the theory. So let’s take a look at HTBUILD. This wait type is related to the process of building the hash table. This one. So we’re on the build side. It’s the time we spend on building the hash table and making it available for every thread. And if you’re used to the 2012 columnstore indexes, you will be surprised, because it’s absolutely different. It was a total rewrite of the process. Right now we have a shareable hash table, which is shared between the threads, while in 2012, every single thread would build its own copy – think about the memory it would spend, but it would be hugely effective when you have enough memory to keep it there. You do not have the collisions.

The synchronization of building the hash table is something that will be a time spender. So let me switch to my presentation, which is this one. I’m not running the queries, I will be just showing them. They take just a couple of seconds, but I’ll try to speed it up. So I’m running here. I’m sorry, I forgot something which is called zooming. I am very well organized, as you can see. Any questions so far? Please fire them away as I’m proving my uselessness.

MARTIN PEREZ: You may get to this later, but are there types of data that are just inherently not good matches for columnstore?

NIKO NEUGEBAUER: Yes. Strings. All of them. All of them. All those strings make me unhappy. Niko starts crying when he sees strings in the data warehouses, and he sees them. Some will say, “I see dead people” – I say, I see strings in the data warehouse, and I cry. I actually mentioned in my last YouTube video for my columnstore batch execution channel that I wish that row group elimination would get some love for the strings, because it’s just a mess. Avoid strings. When you can convert to anything else, avoid them. And I have a demo for that. Any more questions?

MARTIN PEREZ: Nope, nothing in Slack.

NIKO NEUGEBAUER: Good. So I run this query, let’s see what we’re waiting for. PAGEIOLATCH, right, not interesting. A latch – wow, we’ve got him: the HTBUILD. Two wait types, 28 milliseconds. When will I see this number ballooning, going very, very high? When this number goes high. And when will this number go high? When I have a very big number of hash match operations and a huge number of cores. On this VM – because I’m running on Azure and I’m trying not to overspend – I’m running just four cores, and I’m limiting the execution to two cores sometimes. But as you see, which iterator might have caused this wait? Uneducated guess: this one. The hash match operation. In my case, I mean, I got just, like, one load, whatever, because I’m doing here a very good – you can see the actual number of locally aggregated rows is 12.6 million rows for the FactOnlineSales table, which is the whole table. I’m doing the aggregate pushdown, so it’s very efficient, but I still got some waits.

But when you see this HTBUILD wait – take a look at your hash joins, your hash match operations; take a look at what the parallelism is; take a look at whether maybe you can lower the parallelism and it solves your problem. This is like the basic tool I can give you. So with building the hash table, it’s the build side where the things are happening and making us wait while we are synchronizing the processes. Because you know, the work is not distributed equally between the threads most of the time.
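
A sketch of that kind of tuning, assuming the ContosoRetailDW sample database from the demos, with a clustered columnstore index on FactOnlineSales (the exact demo query is not shown in the transcript):

-- A typical aggregation producing a batch mode hash match; if HTBUILD dominates,
-- try stepping the degree of parallelism down and re-measuring
SELECT p.BrandName, SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactOnlineSales AS s
JOIN dbo.DimProduct AS p ON p.ProductKey = s.ProductKey
GROUP BY p.BrandName
OPTION (MAXDOP 4);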

HTDELETE – well, once you finish the job, once you finish with the joining, you need to delete your hash table. You clean up the mess you’ve built. Let’s see. There is an HTDELETE, there is another – when you see the nested loops you’re like, what kind of a query are you building, Niko? It’s a very efficient query for what I want to show to you. We suffer here more different kinds of waits. We still have HTBUILD, we have the CXPACKET. We have HTDELETE. But you know, it was all fine. It’s one millisecond that we wasted. A reasonable time.

I don’t have here any demos that are doing unreasonable things. I’m showing you the easy-peasy, all-perfect world. My goal here is to explain. So, does anybody want to guess – because we just have nested loops – where does the HTDELETE wait take place? Okay, I’ll do a guess, because people are shy. It’s one of those hash matches. I wish we could go and say, like: can you show me the waits here? I’m not suggesting – but I do suggest: one day I would love to have this opportunity. But it’s also the thing of how expensive that would be. But maybe I can do some debugging, some extra option, where I could determine that.

I mean, I’m really fine with and grateful for being able to access the wait stats here. This is good enough for me. You will see sometimes huge times for HTDELETE, especially on 14 and 16 SQL Server. In the later versions there seems to be less of this wait, and I know that [inaudible] and all his engineering team are doing an incredible work of making sure that every version gets some under-the-hood, non-documented improvements for the engine – and some of those we will probably never find out about.

HTMEMO – well, it’s also something which has to do with the synchronization point before scanning a hash table. Do I have a demo for that – no, I do not, because I wanted to limit it; we are already going pretty far. But it’s something to really look at. It’s a non-action item, as Joe Sack mentions. Something you can observe. It’s the particular way the engine deals with the hashing of the table. HTREINIT – you see this one again as a synchronization point, at the end of scanning the hash table, when you are throwing the output of the non-matching rows. Do we have an output of non-matching rows in our execution plan? Well, no, which makes it one of the most exotic wait types. But if you see something like this one being a problem, this is probably something that you need to address in a very exotic way. Or you have a bug.

HTREPARTITION – well, just think about this as the repartitioning for the parallelism, something you do not want to see, but you will be seeing it. It’s repartitioning; as you can read, it’s used after the build side repartitions in the hash match aggregate. Unless it’s a huge bug, you’re out of luck. This is where you need to start looking at the more exotic solutions. Anything I can show you is this – not a particularly complicated query; I just removed the foreign key to make it work a little bit more. This one actually produces a lot of the wait types I want to show, but – it’s not like you get all these wait types all the time.

Well, you see, I ran it and I didn’t get it, actually. I was drunk, as Erik would say. Let me see if I can get it here. I got more wait types here. These are really the exotic unicorns most of the time, so I’m out of luck, sorry. Probably, actually – I don’t know why it happened, so sorry about that.

The batch processing sort – BPSORT. This wait type relates to the batch mode sort operator, and actually, in the batch execution mode on row store this one becomes an even better example, but this one works as well.

So when you have the order by, or when the query processor inserts the sort iterator, this is where you will sometimes see the wait types going into BPSORT. You see, this one was part of the bigger one, but the PAGEIOLATCH will be the bigger worry in my case. If you’re seeing BPSORT on your instances and you’re running nonclustered columnstore indexes, sometimes there is an idea: maybe you don’t need the nonclustered columnstore indexes on this instance. If these are blowing out of proportion – these are the hybrid HTAP, transactional-analytical processing, or operational analytics systems – you will want to tackle those.
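
A query shaped to produce a batch mode sort looks something like this (again assuming the ContosoRetailDW sample; the exact demo query is not shown):

-- Hash aggregate feeding a batch mode (Top N) Sort on a clustered columnstore
SELECT TOP (1000) s.ProductKey, SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactOnlineSales AS s
GROUP BY s.ProductKey
ORDER BY TotalSales DESC;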

You also have the trace flags: 4199, 9347, 9349. These are the batch mode sort and top N sort iterator disabling trace flags, documented and supported starting with 2016, and tested well. I guess we all need to test more, to be able to reproduce the complex workloads in order to do a better job. And another one – I’ve seen some problems with it, but I still love it; in some cases it’s really awesome: starting with 2016 cumulative update three, if my memory doesn’t fail me, there’s the dynamic memory grant for your batch mode operations. It can help when you are failing with the sort operation – maybe you just simply need more memory. But on the other side, the intelligent query processing in 2017 and in 19 will be tackling some of those problems. So it depends; you need to use your own head.

This one – I’ve seen this one – this is the wait type related to the batch mode memory operations, and my current understanding is that it’s the wait time until we get enough memory to proceed with our operation. You’ll see this one in the wild with higher frequency, but with very insignificant wait times. Unless you’re seeing something exotic and you’re getting into memory problems – but even then, [tiger] will point you in the right direction.

Alright, this one. This one is a good one. So, delta stores – you still remember them? The delta store: when you are not using the bulk load API and you’re not loading a couple of hundred thousand rows, when you insert data, it goes into the delta store. What might happen is you’re loading the data and you have somebody who is trying to access this data. He’s scanning the whole columnstore, and he will scan the delta store. And he’ll scan it, and other processes scan it, and you load data, more trickle inserts. You reach 1,048,576 rows and your delta store will become closed. Then the background process, the tuple mover, will come and convert this delta store into a compressed row group. Everybody understands that? Easy peasy lemon squeezy, right?

So what if we are using the snapshot or read committed snapshot isolation levels, like you will use automatically in Azure SQL Database or anywhere using the Always On availability groups? Well, you need to access the same version of the data, right? So SQL Server will have to keep a version of the delta store that you were accessing when you started your process. And it will need to keep both versions: the non-compressed delta store and/or the compressed row group. And this is where you will see this wait type. As people are running more and more columnstore indexes on the availability groups – starting with 2016, for obvious reasons, because availability groups and clustered columnstore were a non-item on 2014 – people will see more and more of this kind of problem.
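
A quick way to check whether your databases are in that versioning territory:

-- Databases where snapshot or read committed snapshot isolation is in play
SELECT name, snapshot_isolation_state_desc, is_read_committed_snapshot_on
FROM sys.databases
ORDER BY name;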

The solution is: once you are loading the data, load it in bigger chunks. Group it together, more than 102,400 rows. It will go into the already-mentioned, so-called invisible row group. It will get compressed. No delta stores. I’m actually – I’m anti-delta store. Call me that. I’m in favor of doing, whenever possible, a better job. Of course delta stores are unavoidable. Somebody runs it: oh, I’ll correct just these little 10,000 rows and everything will be fine. Oh, you got a delta store. If you want to avoid them, once you finish your process, run alter index reorganize and use the documented and supported option COMPRESS_ALL_ROW_GROUPS = ON to get rid of the delta stores. Don’t leave them hanging. Any questions? I expect questions on this item.

ANTHONY NOCENTINO: So that’s limited to nonclustered columnstore indexes that use versioning in some way, right? Combined with either availability groups…

NIKO NEUGEBAUER: No, ROWGROUP_VERSION is not limited. It’s universal – for in-memory I have no idea, but I would not expect it to be; but for the clustered columnstore index, let me just rewind here to my architecture slide. So I can have some support.

So imagine you are loading the data. You’re loading 10,000 rows. Where do they land? Will they land into the compressed row group? Never. They will land in the delta store. For the clustered columnstore or for the nonclustered columnstore. If you were loading 102,400 or more rows in one single insert, they would land into a compressed row group on their own.

ANTHONY NOCENTINO: Got it.

NIKO NEUGEBAUER: And if you are running updates, that hits your delta store. Your information will go into the delta store.

ANTHONY NOCENTINO: We are coming up against the end of the session, Niko.

NIKO NEUGEBAUER: I’m ready to go to sleep and to cry in the shower. So, whenever you’re dealing with delta stores – and by default, unless you have a very clear ETL process which is finished and during this ETL window you’re not querying the data, you are dealing with the delta store – that’s not a good thing. Run, with a higher frequency, alter index reorganize with COMPRESS_ALL_ROW_GROUPS = ON. It will tackle most of the problems.

I remember, for some clients a couple of years ago, I wrote a script – I actually need to publish it in my library – which would basically… I knew that some ETL processes written by other people would ignore all the good rules, so I would simply run a job after I knew that all of the ETL processes were done, and I would go to all the clustered columnstore indexes – from sys.indexes I would pick all of them – and I would run alter index reorganize with COMPRESS_ALL_ROW_GROUPS = ON.
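
A minimal sketch of such a job – Niko’s actual script is not published yet, so this is only an assumption of its shape:

-- Reorganize every clustered columnstore index after the ETL window
DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql += N'ALTER INDEX ' + QUOTENAME(i.name)
            + N' ON ' + QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME(o.name)
            + N' REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);' + CHAR(10)
FROM sys.indexes AS i
JOIN sys.objects AS o ON o.object_id = i.object_id
WHERE i.type = 5;          -- 5 = clustered columnstore index

EXEC sys.sp_executesql @sql;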

Just a word of warning. You need to be running this with Resource Governor, because unless you’re confident that you’re doing the right thing, alter index reorganize – if you have this problem, which I’ve been talking to Microsoft about for one and a half, two years – by default it can grab up to the limit of your Resource Governor maximum configured memory, which is 25% by default. So imagine you have four terabytes; you run alter index rebuild, it will take, I don’t know, 20GB of memory to rebuild your table, because all it needs is to fit a couple of row groups in memory while processing. And then you run alter index reorganize and it will grab one terabyte, because the statistics told it to do so. So scary. Yeah, it is, because alter index reorganize is an online process. It still takes locks on some things, and grabbing one terabyte of memory means that you’re not going through the smallest of the gateways. It means the query will be throttled. That’s tough. But if you configure a process for alter index reorganize and run it after all your ETL finishes, it can do miracles for you. This is a production story, this is nothing like a theory or whatever, untested. It really does good stuff for you.

Alright, so we’re trying to advance. Row group operational stats. All of you reading – you can read; for all of you not reading, I will read it for you. This wait type is associated with looking up row group metadata stats for a given row group ID. It accumulates the value for different lock types, blah blah blah. It’s one of those particular ones – it will help you understand the problem, but you will not be able to tackle it. It’s useful; you say, well, that’s how it is implemented. But it’s not a very practical one.

Shared delta store creation: so if you are running, by default, no process isolation, no bulk load API, or failing or throttling down because of the lack of a memory grant, you will be creating a shared delta store, which means multiple processes will try to insert into it. This wait type has to do with that, and you see it a lot of times with the nonclustered columnstore. It’s something that should not be a problem most of the time. That’s at least my experience – and I’m not as experienced a nonclustered columnstore indexes user as I am a clustered columnstore index user, because I’m more of a data warehousing and bigger-installation guy, and you try to avoid nonclustered columnstore there.

This one, this is basically a wait type which you will get when you are building columnstore indexes and you are being throttled. How can you get throttled while building columnstore indexes? Building or rebuilding them. Very easy. Let’s say I’m a bad person – some people easily agree with that statement. So, a columnstore table where we have this one 800-byte column. Anybody expect any problem? I will insert, with REPLICATE, 750 of the same letter – anybody expect a problem here with the data when we create a clustered columnstore index on this table? It will take, I think, around a minute to run, to insert 1.1 million rows into it. Anybody expect a problem? No? Can we go?

MARTIN PEREZ: Based on your earlier comment, I’m expecting a problem just because of the warning against text data.

NIKO NEUGEBAUER: Yeah, actually I failed myself. I gave you the tip. Well – I’ll throw you a number, and don’t ask me why. I’m ready. I will see the execution plan for that, for the index creation, because I can. Anything beyond 32 characters, with data which is similar but partially has some differences, and the VertiPaq engine – which is pretty efficient, actually; a fascinating topic for some presentation, just to dive into what we can learn from it, from its behavior – will consider giving you the result that you are not wishing to have.

Oh, I have some waits. Let’s see what kind of waits I have got. Latch, latch, right. Take a look at just one of them. Just one. 719 milliseconds. The COLUMNSTORE_BUILD_THROTTLE. Why would it be considered a poisonous wait? If I see an instance with this, I already know that there is a problem. That’s how bad it is. You can consider this one to be a poisonous wait. Why? Let’s take a look at the physical stats of the row groups. I loaded 1.1 million rows. What would be the minimum number of row groups for that? Two. Take a look at the total rows. Every time someone says, “We get one million rows per row group,” I’m like, see you around. I have a row group with 6590 rows. What is the reason for trimming? Two of them are residual, and you can see I have two cores, which are doing a very good job of distributing the task between them. All equal. But most of the time I’m being trimmed because of the dictionary size. And it’s interesting, because if we look at the dictionary, we still cannot see the real size of the dictionary, even though we’re getting throttled once the dictionary gets to 16 megabytes. Does this look like 16 megabytes? Not to me. The first number is seven. You know why? Because it’s the compressed value. We do not have the real size of the dictionary. It’s page compressed.
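
A rough reproduction of that demo (the exact strings Niko used are not shown; mostly distinct long values like these reliably put pressure on the 16 MB dictionary limit – table and index names are hypothetical):

CREATE TABLE dbo.BigStrings (Id INT NOT NULL, Payload VARCHAR(800) NOT NULL);

-- 1.1 million rows of long, mostly distinct strings
INSERT INTO dbo.BigStrings (Id, Payload)
SELECT TOP (1100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       LEFT(REPLICATE(CONVERT(CHAR(36), NEWID()), 21), 750)
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b CROSS JOIN sys.all_columns AS c;

CREATE CLUSTERED COLUMNSTORE INDEX CCI_BigStrings ON dbo.BigStrings;

-- Expect far more than the ideal two row groups, trimmed by DICTIONARY_SIZE
SELECT row_group_id, total_rows, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.BigStrings')
ORDER BY row_group_id;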

So we don’t know; supposedly it hit the limit. That’s what it is. And so we trim, and because of that, when we have these smaller row groups here, we will not get a very fine execution in the batch execution mode. I can guarantee that. Because if you are riding a bicycle through the city, it can be a pretty efficient way, right? If you get a Ferrari and you go onto some Formula One track – bicycle versus Ferrari, that’s the batch execution mode. Let’s go back into the city. You’ve got your Ferrari – well, you’re out of luck. Turn to the left – oh, you cannot advance. That’s what will happen with the batch execution mode. That’s why it’s supposed to be one million. This is one of the huge, right now unsolvable – not reasonably solvable – problems for the columnstore indexes.

The other demo I will not use, because I’m already overspending your time. Thank you everybody for the patience, but I’ll upload the slides for everyone’s enjoyment there. You throttle the memory – in a lot of different ways: you will just rebuild something, some table, and you will see it get a huge amount of wait time, and the trim reason will be the memory size, not the dictionary size, because we don’t have enough memory to fit the whole row group. This is why seeing that wait is always poison. It’s just avoidable – please avoid it if you can. Please avoid.

This one is the tuple mover; I already explained it. If you see a lot of the columnstore migration background task wait, take a look at what the tuple mover is doing, and maybe don’t wait for it. I like – when I finish the ETL and I know I’m the only one – alter index reorganize on the columnstore index with COMPRESS_ALL_ROW_GROUPS = ON, so no closed delta stores, no open delta stores. That closes up the problem.

This one – a unicorn. Apparently not even implemented, but I hear sometimes from people saying that they see things that are known to be impossible, so maybe if you see this, take a picture of it, take a selfie, send it to me. We all will enjoy the conversation with the developers, because they say it’s non-existent, and we will bring them the unicorn.

Overall recommendations at the end: keep your statistics up to date – they will cause fewer problems. Create additional statistics. I mean, multi-column statistics is the most underrated fix for a lot of problems, like, ever. And you know, somebody take this topic, make a presentation, become famous, and buy me some good drinks later, please. The amount of parallelism: sometimes lowering the parallelism, going even from eight to six, can do miracles. Patch your SQL Server, if I don’t have it here: on 2016, because it was probably one of the, if not the, greatest releases in terms of features since 2005 definitely, patch yourself up. Patch it up. There are so many fixes on top of fixes. They can make your query execution significantly better. I’ve seen it in a lot of places.
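
For the multi-column statistics tip, the syntax is simply this (the column choice is a made-up example on the Contoso fact table):

CREATE STATISTICS st_FactOnlineSales_StoreKey_ProductKey
ON dbo.FactOnlineSales (StoreKey, ProductKey)
WITH FULLSCAN;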

If you are observing the throttling waits, get rid of the delta stores by using alter index reorganize with COMPRESS_ALL_ROW_GROUPS = ON. Think about the isolation levels. Triggers – actually, I didn’t mention, they will also get you into the snapshot thingy situation, so try to avoid them. ETL processes: bulk load API, group the data together before loading, do the staging, use the partitioning, all the good stuff. Type of index: sometimes you can solve some of the problems by changing the index, but I’d rather say that this is a very theoretical thing. I considered doing this once, but no, we solved it in a different way. But I can imagine where it would do a good thing – a particular system which was more of a hybrid one than a defined data warehouse, in that case. Nothing more, nothing less. Thank you for your patience. Shall we have questions, please?

ANTHONY NOCENTINO: Well, Niko, thank you very much for this incredibly deep dive session into waits on columnstore. It’s much appreciated. There are no questions, but if you have anything else you want to add for the audience – or you’re good to go, sir.

NIKO NEUGEBAUER: Thank you very much. If you’re looking to find more, please go find me, find my blog posts, shoot me some questions. Right now I’m pretty available for answering them, and thank you very much for your patience. Both of you gentlemen, thank you very much for accompanying me, and I hope everybody will have a great time.

Niko Neugebauer is a Data Platform Consultant. A SQL Server MVP with over 20 years of experience in IT, he is passionate about the Microsoft Data Platform and community. Founder of the Portuguese SQL Server User Group and the main organizer of the first SQLSaturday event outside of North America (#78 Portugal), Niko speaks regularly at events such as PASS Summit, SQLRally, SQLBits, and SQLSaturday events around the world. Niko loves sharing information and knowledge and has authored over 130 blog posts on Columnstore Indexes, and regularly contributes to the open-sourced CISL library focused on Columnstore Indexes.