Columnstore Clustered vs Nonclustered, when to choose what

Target Audience:

Solution Architects and DBAs, responsible for the implemented architecture.


The incredible Columnstore Indexes can increase your analytical query processing speed multiple times, they are updatable (Clustered from SQL Server 2014 and Nonclustered from SQL Server 2016 respectively), but they keep on supporting different sets of the functionalities – such as Change Data Capture (Nonclustered Columnstore) or LOBs (Clustered Columnstore), and this brings a great confusion onto the table.

This session will light up your path on when to use what functionality to use and when, even though sometimes one of the type of the Columnstore Indexes does not seems to appear as a default choice for your scenario.

Audio podcast:

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS
Leave us a review in iTunes

Why I Want to Present This Session:

There is a great confusion on when to choose which Columnstore Index, and having faced real-world implementations with the “wrong” Columnstore Indexes, I wanted to help everyone to discover the right solution.

Also, given the limitations and supported features, I see an incredible need to explain which features can be used in which architectures (think HA/DR scenarios, HTAP reporting scenarios, etc)

Additional Resources:

Session Transcript:

Brent Ozar: In this episode at GroupBy, Niko Neugebauer is going to talk about how you choose between clustered columnstore indexes and nonclustered columnstore indexes, and how it’s different with different versions of SQL Server from 2012 until current.

Niko Neugebauer: Ladies and gentlemen, welcome to GroupBy. This is a session from me, Niko, and I will be talking about columnstore indexes. I love columnstore indexes and I have heard that in the recent SQL Server versions, or maybe either SQL DB or maybe even in the SQL data warehouse, you can build the columnstore indexes and then some of these products you can build clustered and nonclustered. Would you care actually about? Does anybody from the cohost care about the clustered or nonclustered indexes?

Brent Ozar: I know I’m supposed to, and I have no idea when I’m – like, I think I’m supposed to for data warehousing it’s supposed to be clustered and then for some OLTP tables I think it’s supposed to be nonclustered.

Niko Neugebauer: I’m leaving the session, thank you very much. I’m so done with this session, with this conference, and Brent, personally with you. I’m just like, I’m…

Brent Ozar: For the rest of the time he’s going to talk about vacation.

Niko Neugebauer: Okay, yes, talking about the vacation. So there are some exceptions to the golden rule, and let me walk you through some of these things and some of these thoughts. And I would like to make this session in a way a little bit more like dynamic and would like to hear if you have some opinions, maybe even strong opinions about the technologies involved and so on. So don’t be shy, and please ask questions.

Alright, so my name is Niko Neugebauer, as Brent already mentioned. I describe myself as a guy who works with data platforms, mostly Microsoft data platform and basically that’s it. So you can see the rest of this stuff here, I work for a wonderful company which is called OH22, and in my unfree time, I am taking care of a couple of user groups here locally in Portugal, in Lisbon, and I have something which is an open source library. For the rest, I blog about the columnstore and you know, all my Twitter, LinkedIn contacts if you want to – and if you care to connect, please do.

So today’s agenda will be index types. There are apparently two of them. Actually, it’s a lie. There is a third type, which is called no indexes. It’s a different type, we don’t go there, it’s too advanced, it’s too far. Then we talk about columnstore indexes. I’m just like, trying to bring everybody for a little, little, you know, base. Then we talk about the OLTP architecture. So what about the people, who loves OLTP, who works with OLTP? Please raise your hand. Okay, very good. What about the data warehouse, BI, or reporting application architecture? Please raise your other hand.

Brent Ozar: I leave that to the smart people like you all.

Niko Neugebauer: I totally – through my camera, I can see you all. Alright, so Brent, please do the count and we will present the end numbers at the end. So it’s very interesting regarding the columnstore indexes is that – and it gives a lot of – it gives a huge pleasure to all consultants to know that with every SQL Server version starting with the original one, more or less successful because it was only Enterprise, 2012, with every single version, Microsoft would deliver innovation and new features for the columnstore indexes. But sometimes, sometimes you would expect some of the features to work and well, you might not get what you would expect. So we will go through different versions currently available, 12, 14, 16, and 17. We will have accompany the development of the applications because my goal with this session is to give you an opportunity when you are designing, or even influencing the solution, you can look forward and say, maybe we’re running 2012, maybe we go 2016 by the end of the year. But will the technology that we are currently using, will they allow us to be still integrative with the columnstore indexes? Or maybe we need to go to 17, or maybe we need to wait for the vNext.

Azure SQL database, just a couple of words and like, maybe one sentence about Azure SQL data warehouse. But I guess the most – the smartest of you all, so already have understood that if we’re talking about the data warehouse, then maybe, maybe, maybe OLTP is not a goal there at all.

Alright, any questions so far? Ask them, fire away. I’m going straight into the content. Index types. As was mentioned before, there are two types of indexes. At least I understand them this way, I’m not talking about the exotic stuff like XML indexes or JSON indexes – oh, sorry, there are none. Anyway, so for traditional B-trees – please.

Brent Ozar: That was good. I liked that, that was very subtle.

Niko Neugebauer: So subtle that since we’re pointing at this, nobody will even understand. I’m playing it safe. So the traditional rowstore indexes, we have clustered and we have nonclustered indexes. So what is the difference? For the clustered indexes, we have to – or we auto-magically include every single column from our table. So take a look, I have a table with just four columns, ID, first name, last name, and salary. And if I create a clustered index, it will include all the columns from the table. So it’s – for clustered index, it’s a physical structure of the table. This is how the table is stored in the storage engine internally, right? Easy. How many clustered indexes can you have? One. Ein. Uno. Just one. You cannot – so I know I have shown three, but it’s the wrong number. It’s just one. Gosh, it’s – I need more coffee.

Brent Ozar: You’re on vacation though, so you can be forgiven.

Niko Neugebauer: Oh, thank you. It’s the second time you mentioned. I will be famous for vacations.

Brent Ozar: Portuguese – especially during the summer, you guys just don’t work that hard, that’s what it is.

Niko Neugebauer: Yes, we stole it from Mexicans. Oh gosh. From Latino – sorry guys, I didn’t want to offend. It’s just me, I never work, I hate working, I just – forget it. Anyway, so nonclustered indexes. You can select from one to all of the columns within your table. So for example, I can create a nonclustered index on ID directly, right, so just one column will be – it’s very good because if I don’t want to write the copies of the data over and over and over again, I will select just the columns which will feed my workload.

So for nonclustered indexes, you can include even all columns of the table, making it effectively a kind of – or in the terms of the data, is similar to clustered index. But just because it can, it doesn’t mean that you should or really don’t do this, alright?

So because we can put different combinations of the columns, you can choose and you can create multiple nonclustered indexes. So how many indexes, nonclustered indexes can you create for the recent – I believe since 2008 version of the SQL Server, who knows? It’s a magical number.

Brent Ozar: A lot more than you should.

Niko Neugebauer: That’s the magic number. If you ever create more than 354 indexes, which is the magic number for one index, there is something wrong with you. But you can create potentially…

Derik Hammer: Don’t do that.

Niko Neugebauer: You can change every index every day, it will be fine, or it won’t. So you can create actually until 999 index nonclustered, plus clustered 1000. Don’t do this. Just don’t do this. That’s it.

So now talking about the rowstore and the columnstore. So columnstore is how I always say that the non-IT people think that the data is store. Columnstore stores the data from one column together. So for the rowstore, we store the value and imagine this is a clustered index, including all of the columns. We store the first row together, ID, first name, last name, and salary. When we go to the columnstore, we don’t go horizontally, we go really for the column. We store one, two, three, four, 50 gazillion, whatever, whatever you store a columnstore index will do its best to store, compress, and optimize it. Because let’s be honest, the values within one column can be pretty similar, and there’ll be huge number of repetitive values. And that’s one of the ways how columnstore indexes can achieve this a lot of times, incredible compression. I’ve seen over 20 times improvement over beige compression just by enabling columnstore. It’s a very highly repetitive values table but it was just mindblowing.

So columnstore, vertical data, rowstore, the old store – it’s not the old store, it’s not fair. It’s the traditional store. It’s stored horizontally. So who can tell me the number of columnstore indexes you can create on one table?

Brent Ozar: Can you – and I don’t know this. Can you do filtered columnstore indexes yet or no?

Niko Neugebauer: Yes, you can do the filtered columnstore indexes starting with SQL Server 2016.

Brent Ozar: So – because before I was always like, well, you would only ever want one, but now I guess I could see that you would want more than one, and I don’t know how many you can have.

Niko Neugebauer: Alright, so I’ll give you a picture. And I know, Brent, you are old enough to know this. I guess there will be some viewers will be shocked by this picture, but this is the answer on the number of columnstore indexes for the table.

Derik Hammer: I recognize that picture.

Niko Neugebauer: So that can be only one. You can have either clustered columnstore or nonclustered columnstore index. You cannot have both of them. And if you go for nonclustered columnstore index, you can have either nonfiltered nonclustered columnstore index, or filtered. You cannot mix them. So just, you know, take a screenshot, take a mental screenshot. The Scottish Spanish guy will come after you and will scream there can be only one. You will have to submit just one index on the table.

Derik Hammer: I just do it up on select from reader and remember it.

Niko Neugebauer: Yeah, so it’s just – I will put this in my curriculum. It’s just – you know, international fame achieved. I will be as popular as Sean Connery, hopefully. One day. Hopefully not.

Anyway, so let’s talk about the OLTP scenarios, and by all means, please, you can punish me on Twitter, on your blogs, I’m not trying to cover every possible scenario here. I’m trying just basically to give you an overview what – you know, just like, measure – being the wind, I will point whatever type of the OLTP scenarios there are.

So one of them is online transaction processing. You’re running a website, you’re running an online betting company, you’re doing your best web app ever with HTML 17 and you are absolutely invited to have zero columnstore indexes because you do not need to process massive amounts of data. You do not. There are some already known tricks for speeding up some queries with the filtered nonclustered columnstore index, but please, trust me, you do not want to go there. Just because the batch execution mode can speed it up, it doesn’t mean that it will speed it up for you. It doesn’t mean that your sorting, your hashing operation in your execution plans will not crash you completely. So if you’re running a pure OLTP, just ignore the columnstore. Just – and I had a gentleman who actually said that he is a unicorn. Do you see my unicorn? So I heard that people would run clustered columnstore indexes on online transactions or processing, and I’m saying like, you want to live on a prayer, you’re welcome, but I don’t think you should, alright?

So there is a different scenario where you run the OLTP and this is so called hybrid transactional analytical processing, HTAP. The year before it was called operational analytics. Basically, you run your OLTP application and you need to process massive amounts of data. Maybe you need to have an idea how many transaction to place today or in the last hour and you’re dealing with millions of rows. You can consider a columnstore index if your table is hundred million rows, you should consider columnstore index. The question will be which columnstore index, but you can consider it.

Derik Hammer: Does that let me run like, operational dashboards off the OLTP system as well?

Niko Neugebauer: Thank you, that’s a very good question. It is thought to run some operational dashboards on your OLTP processing, but you got to be careful. So there is a line which separates both of them, right? So massive amounts of data, you will control the memory grants, you will control the CPU spendings because it will change. It will become more in some cases, which in the terms of the memory, you will spend more memory, and you will need to see how it will affect you. But yes, that’s it. Absolutely this scenario where you can consider using the columnstore indexes.

And if you look at the vNext, what was already announced, what the wonderful Joe Sack and Kevin Farlee, all the incredibly team of the processor are working on is the approximate calculations, which will allow you to have even the less impact with your queries and view the better dashboards. So stay tuned, Microsoft hopefully will be showing this cool stuff very soon. I’m very excited about it.

So data warehousing. Data warehousing, BI, reporting app, and what about the data warehouse? I used to say in the last six years, you should consider columnstore indexes. Now I say why are you not using? You must. There is no way – I mean, there are some exceptions, I do admit. There are some exceptions even for data warehousing, you will not use, but this is the 0.000 something percent. And you are just 0.1%, right? So you’re not there most probably, or maybe you are. You will see this in the slides.

But as a rule of the thumb, you should consider absolutely you should be needing the columnstore indexes. For the speed of the data, of the processing, and if you want to save on the storage because I know the storage and the memory they are very cheap, but it still costs money, right? The same will apply to the BI and reporting scenarios. And I will even include the data marked here. You will need one columnstore index. This will be a very tough question. Which one will you need? Because as you will see in a number of minutes, the technologies that we are loving such as replication – please raise your hand if you love replication.

Brent Ozar: No thanks.

Niko Neugebauer: And if you love merge replication in the complex architecture please raise all of your hands, right? Okay, so but it’s still useful. A replication is still incredibly useful in a lot of scenarios, or maybe you’re maintaining the old architecture and you’re not going to mess with this part because you don’t want it to be your fault. So you will need to take a look, which of the columnstore indexes is comparable in which version to take this decision appropriately. Because it’s not like we put the columnstore indexes everywhere, it will be faster, it’s all a party. You know…

Derik Hammer: Niko, we got a question that just came in. Marek says, “How can you tell by which column the data would sort it at the time a columnstore index was built?”

Niko Neugebauer: So the question is very good, by which column the index was sorted. So one of the key properties of the columnstore indexes is that the data within the columnstore indexes is never sorted by default. The reason why, and it’s called segment clustering, and if you Google it, I blogged about it, Microsoft blogged, everybody, blogged about it. Let’s talk about the clustered columnstore because it’s easier. But you can apply a similar technique on a nonclustered with less success, but still try.

So you can create a clustered index, sorting on the column the data you want, clustered rowstore, traditional index. And as the next statement, create – execute, create, cluster it, columnstore index with name on the table, and if you have enough possibility created with DOP1 with just one thread, so because you will not want the query processor to spread the data a little bit to skew it, which it does by default unfortunately for so many years, and if you create it with drop existing on, replacing the pre-sorted structure, you will have the data sorted.

Otherwise, every time you execute outer index rebuild, there is a process which is – I call data shuffling, the VertiPaq agent. It will shuffle the data in a different way in order to get the best performance. And that’s an important thing about the columnstore indexes because if somebody has seen what Oracle offers since 12.2 of their in-memory option, the columnstore indexes, they even have a bigger clarity with five different algorithms for the columnstore indexes. They can say better compression and they will use more methods to compress data even better. Or you can go the opposite way and say I want the data to be structured as fast as possible. Do not apply so many compression methods. Do not optimize. So it’s like best compression and fast extraction.

And so what this VertiPaq algorithm does, it tries to find the most optimum way by default, giving you no control whatsoever so far. And actually, I shouldn’t be saying, but if you look at the error messages of 2016 or 17, you will see some strange messages such as you cannot sort the column of the columnstore indexes in tempdb. And I’m wondering, why would somebody put an error message? I mean, we need this functionality, but there is no – nothing right now, and I will still keep asking and begging and praying but well.

Brent Ozar: It’s why I want you to never go to work for Microsoft is that you can say things like this. If Microsoft ever somehow sucks you up into the mothership, then it’s – we’re going to lose a genuine resource here.

Niko Neugebauer: Brent, for two years – I’m sorry, this presentation will take a little bit more time and people will be complaining but I will tell you this one. So if Microsoft will ever integrate or manage to integrate me in their – or considering, there is a picture of me from the year 2000 and if it ever gets public – it’s not public of course, so I will never be able work not even for the new Microsoft, I guess. As you can imagine.

Brent Ozar: Wow. Well, note to self, go through Niko’s photo albums when he’s…

Niko Neugebauer: Exactly. But Microsoft announced that it hires Niko, what for, but just publish it. Anyway, so and there is something about the hybrid systems. And hybrid systems, it’s the systems which are basically non-IT people design, which in my point of view, it’s most of the systems. IT systems are designed by non-IT people, it’s basically – it does OLTP and at the same time it’s a reporting and it’s data warehousing application. So it’s mixed, it does everything. And in this case, I wish you luck. There is nothing I can do for you most probably, and you will need to dive into the specifics to experiment. And I know you’re running development and testing for sure, I mean, I’ve never seen any other situation. I swear, where do you develop? Here. Like, yes, okay, let’s test it. So be careful. If your application has these qualities of both systems, you cannot use the machine learning, you cannot use the artificial intelligence, you have to use your brain. I know it’s very old and traditional, but that’s how it works. Alright, so…

Brent Ozar: That’s a good line. You need to choose your friends carefully. That’s really…

Niko Neugebauer: Thank you. Just another badge I will get for myself. Alright, so ladies and gentlemen, we came to the end of the session. You know about the columnstore indexes, Brent already before I started, he told you when you use which one. So this is the end. Well, or is it? Let’s take a look at the more detailed granular level.

So 2012 SQL Server, wonderful. Wonderful release. Nonclustered columnstore index has appeared for all of us mortals, nothing more APS or PDW fancy, no more millions to play with their reasonable amounts of data, and you can see this diagram which is pretty trivial. It’s basically if you are running a data warehousing, you can use a nonclustered columnstore index. And if you have any non-supported features such as a higher precision than 18 for the decimals, if you’re using the LOBs you cannot use any nonclustered. You cannot use any columnstore indexes at all. You’re going into the none bucket. And if you’re running any OLTP hybrid system, you go to none instantly, there is no discussion. Does – can anybody help me to understand why this triviality happened?

Brent Ozar: I want to say that this – the nonclustered columnstore indexes were read-only in 2012?

Niko Neugebauer: Absolutely right. You have read some beautiful documentations, sir. Microsoft marketing did a great job so it was – and I want to make it very clear. In 2012, 2014, nonclustered columnstore indexes was non-updatable. But you can say hey, hold your wolves…

Brent Ozar: How many do you have over there? Have you got like, a whole tray of different – I probably don’t need to see them all at once.

Niko Neugebauer: Yes, you don’t. It’s just – it’s all out of these animals. Even strange ones. Don’t ask me. So it’s up to you update the read-only index. So there are a couple of ways. First of all, you can disable the nonclustered columnstore index, update the table, recreate, rebuild your nonclustered columnstore index, and keep the reading access successful. The other, a very good option, would be to use partitioning because you can use partitioning, switch in and out for the partitions, and this functionality is supported for every columnstore indexes since the very first release. Any questions on that, on 2012?

Brent Ozar: No, it makes sense. JD says your collection of animals is faintly disturbing, but…

Niko Neugebauer: I will have to talk through with my daughter about that. Alright, so 2014, we’ve got clustered columnstore index, and the difference to the nonclustered columnstore index was that the clustered columnstore index became updatable. So you could write directly into the table with the clustered columnstore index, you could modify the schema, you could build a lot of stuff. You would have a lot of limitations, we’ll see in a second which of them are the most important, and there will be like, you know, I say and I claim it, there are a couple of exceptions why would you go for the nonclustered columnstore index. If you would have any of nonsupported features and OLTP scenarios, I mean, unless you are a unicorn, you should not run OLTP with a clustered columnstore index. Nonclustered columnstore would stay in 2014 read-only. So now my question to you, beautiful public, what would be the exceptions when you would run a data warehousing on nonclustered columnstore index? Because it’s a secondary. It’s not your principle physical structure of the table. What would be the exception for that?

Brent Ozar: I don’t get why you would if it’s read only. I don’t know.

Niko Neugebauer: Yeah, I mean, that sounds like crazy, right?

Brent Ozar: JD says business intelligence, but I don’t know what he means by that.

Niko Neugebauer: I have the artificial intelligence, sir. I’m just – business is for the smarter guy. So yes, it’s business intelligence, the way that it’s data warehousing, but technically speaking, there is at least three great reasons which just – if somebody says, we’re running data warehouse in 2014, which index should be choose, clustered or nonclustered? I just need to ask about those technologies and when they give their answer, I know that I can tell them no clustered columnstore index for you. It’s over. You can go nonclustered or you can actually ignore the columnstore indexes at all.

So the next slide is – here are the exceptions. So snapshot and read committed snapshot isolations levels, unfortunately, are not supported by clustered columnstore index. So you run your read committed snapshot isolation, you cannot change this. You’ve got to be using nonclustered columnstore indexes. Period. They were only implemented in 2016 and I think it’s – I cried every day. I genuinely cried not – not joking. I had the project which I really, really badly wanted and we almost went as far as starting planning it and then just, oh, rewrite the availability groups and we will use the readable secondaries. I’m like, no, no you can’t. What do you mean you can’t? Like, in 2014, it just lookup, snapshot, read committed snapshot. Always on availability groups, which isolation level do they need? You’ve got – you can’t escape this.

So if you are running a high availability or disaster recovery, whatever it is, always on availability groups, if you need to keep them readable – actually, if you want to use the columnstore index, if you want the – the speed of the batch execution mode, nonclustered columnstore index is the only solution. The other thing is as we have already spoken, by nonclustered index, you can choose your column, right? So if you’re running something like LOB or CLR, you can avoid including these columns into the index structure, and this way you can use nonclustered columnstore filtering, not including the column. You need the secondary rowstore indexes. So clustered columnstore 2014 is awesome, it’s beautiful. I think it’s just like it was really – it gave the light to most of the people understanding like, the potential power, the speed improvement, the – even storage improvement in the terms of the real money that it can bring.

But if you have the short range scans, so like 10,000 rows, would you use columnstore index? No, it’s too expensive. If you have just five rows to read, would you use a columnstore index? No, you need a regular index and clustered columnstore indexes in 2013 – and it was solved in 2016, but in 2014, you could not use the clustered columnstore index. So that’s another scenario.

This one was a – and still for people running 2014 is one of the most difficult things. I mean, I can argue with you until the conference is over that you should be guaranteeing the quality of your data through the load process but if you have a primary key, if you have a foreign key, these are not compatible with clustered columnstore in 2014. So no choice. You will have – you want the speed of the batch execution mode nonclustered columnstore index, over.

Another one which solveable but not guaranteed is unique constraints. So you can actually create an index view with a unique index over the table with the columnstore index and first you will indirectly guarantee the uniqueness of the data batch as well. You better take a look at your workload and what this means that your index view will be maintained, because you know, in 2016, you can build a nonclustered columnstore index on the index views. Just raise your hand who knew this. Do you know why so few people know about this? Because breaking news, because on index views, you can view the nonclustered columnstore index. Did you know that, Derik?

Derik Hammer: No, I didn’t know.

Niko Neugebauer: Yes, do you know why you didn’t know about that? because it’s not documented.

Derik Hammer: Got to love those, right?

Brent Ozar: Are you implying that we read the documentation?

Niko Neugebauer: Okay, you get for that – let me see, you get a different unicorn.

Brent Ozar: I don’t want to know where you’re going to put that unicorn.

Niko Neugebauer: I’ll put it back on the table where it was, just making it professionally very clear. So basically, I even argued as far as in May – in March of this year, so three months ago, I was talking to the team and I said in my point of view, two years after this feature is developed, no matter what is the actual status, it’s – it should be documented. But I’m not writing documentation and I love the Microsoft product. I really, really love, and I built a significant part of my professional life around it, but I’m not writing the documentation for Microsoft, sorry.

Alright, so 2016, we have this same picture and it’s wrong, actually, for some reason. Because you can… I’m sorry, I – let me correct it live because…

Brent Ozar: Oh yeah, do it.

Niko Neugebauer: It just – you can see how good I am at preparing my presentation because why would you see at 2016 data warehouse slides OLTP. I mean, makes no sense, right? Alright, so in 2016, we can clearly separate between the data warehousing, BI workloads, and OLTP, and operational analytics. So data warehousing will stay like this. By default, you try to go clustered, you take a look at the exceptions, and there are other exceptions in 2016. Different ones. Some of the technologies which were not available at all in 2014, they were not available to work together with columnstore indexes, they become available, but just for the nonclustered columnstore indexes. So, which makes it basically the league of the possible scenarios with nonclustered columnstore grew significantly bigger. But I would argue that as I mentioned before, if you are running something, not a data warehouse but maybe data mark, so you extract some part of the information, you make it available, and some of those technologies you will be applying there. So don’t be afraid on your data marks, on your BI solution to consider nonclustered columnstore. Yes, you will not get the perfect optimization of the storage compression, but yes, you can make it work. And it’s more valuable.

So here we are, first of all, change data capture, change tracking. I was actually shocked in Poland one month ago that when I was talking about this, there were like, whatever, 300, 400 people in the room, but they came to the wrong room, obviously. I told them, there are better presenters, just go, there was two people left, I was disappointed. I said – I mean, I was expecting two to stay. Whatever.

I asked the question about change data capture and change tracking, and nobody implied that they were using that. And I was very shocked. I use change data capture, I use change tracking, and if it’s not Oracle change data capture, I love it. Because well, the integration with Oracle, it gave me more work than I was hoping for this year.

Anyway, so if you are intending for your data mark to extract the data in an incremental way, in the Microsoft supported way, you can consider change data capture, change tracking, and you will need to use the nonclustered columnstore indexes, which become updatable, which is woo-hoo, great news. It’s like, I would call it actually not a nonclustered updatable columnstore, I would call it a finished columnstore and 12 and 14’s nonclustered columnstore is unfinished one.

The same subset that – of the data types which are not supported and obviously, since 2014, we can go as far as 38 on the precision for the decimal types, but it’s – some of the data types are still not supported and you know, CLR, LOBs, you can avoid it including the column.

And as Brent mentioned, filtering the index can really be interesting. Imagine you have one trillion tows in your table. And you just need two billion rows to be processed. All the rest is unimportant and you don’t care, for some reason, you don’t want to compress it, you just enjoy your life, whatever, or it’s different partitions. Create a filtered nonclustered columnstore index and you will be aggregating much less amount of information. And if your queries are selective enough to use the filtered index, it can be a good solution. It’s really – you should be considering it.

The other two players which are my favorite ones, and I know William Dirkin likes them as well, is the replication. Transactional and merge replication. 2016 supports both of them. It doesn’t mean you should use them, it’s not, but sometimes you just want to replicate your data for your data mark somewhere and you do not want to write with BIML some of your custom integration services packages for whatever reasons, or you want to do some different – I still could even transactional and merge replication that they do – if anybody can enlighten me what would be a good substitution I’m really ready to hear.

Brent Ozar: I got nothing. I got nothing.

Derik Hammer: Nope.

Niko Neugebauer: And I argue that Microsoft should actually either build something to substitute them or start fixing them. Hi guys.

Brent Ozar: That’s why they would hire you, just to make you stop talking.

Niko Neugebauer: And because I’m so incredibly polite. Don’t forget that part. So it’s just – I actually should add it since Brent allows me to do this live blogging. And I would say really crazy, crazy – see how crazy it is.

Brent Ozar: You know as presenters, we always joke around you should never type in a demo. Like, your IQ drops by 20 points and you’re – all of a sudden your fingers are made of butter.

Derik Hammer: You just…

Niko Neugebauer: No, I disagree, I disagree, I think it drops by 100 points. Not just by 20. Not just like, you know…

Brent Ozar: Charisma goes up. I think I get more charismatic when I get in front of a screen.

Niko Neugebauer: I think you can put any animal in front of my keyboard and we will have a new challenging competition, right? We’ll be doing better. Anyway, so as I already mentioned, you can build on index view, nonclustered columnstore updatable, but I mean just like, I run the test in 2016, when I accidentally had this crazy idea I wanted to test it and it worked. So the story goes actually, I was going on this SQL Saturday in Munich for speaker dinner and so I put like, two million rows, three million rows in a table, and I just run the update. And I went to speaker dinner and I was talking to Kalen Delaney, if you don’t know who she is, she is the person who spent the most time working with SQL Server. She’s an amazing person, and I was so excited and I’m telling her, she’s like, “Wow, really?” Yes, and I’m excited when I will be back home after the speaker dinner I will see how fast it goes. So after we went for all the drinks, all the conversation, all the fun, I came back and it was still coming to an end. Which gives you the idea of, you know, you can but doesn’t really mean that you should. True story, true story.

OLTP, 2016 OLTP, well, I say OLTP but I still imply the operational analytics. I mean, if you have like I have a big table, like, yes, tell me more. It’s 53,000 rows. Well, columnstore starts considering, you know, officially one million rows. I argue that my rule of thumb, 500,000 rows. I’ve seen real advantages on dimension tables and data warehousing on 100,000 rows, but yeah, I mean, sometimes, at some point, the table becomes so tiny small like…

Brent Ozar: Here comes the pony again. My little pony.

Niko Neugebauer: Your little pony is here, and he is – I would say she is really small so don’t do it to her. No, it’s not fair. So you are running a really aggressively aggregations or calculations, and you got to speed it up and you have resources to spend on it because there are even trace flag documented and supported which will disable for example, batch execution mode for the sort interlayers in 2016 and 2017. Because people weren’t like, oh there is some stack overflow, I can build it, it will run fast. Yes, but you will spend the resources maybe you do not have anymore, so you got to really, you know, don’t expect – what is it? The quantum computing to help you out in the next couple of years at least. So you got to use this. Louder headphones, the brains. So it’s really, really important.

Even I recommend and say it works, yeah, test it. I fail more times than I’m right. That’s not a joke. That’s absolutely so. So nonclustered columnstore is a rule of thumb. In-memory, please raise your hands, and I’m just letting you switch my screen here. If anybody, please raise your hands – no, it’s not this. It’s this. I’m sorry. Who is using in-memory for your replications? Yes, half of the public. Jos de Brujin, for you, if you hear me, you made it. It’s half of the people. I don’t see other people, I don’t care.

Brent Ozar: And too, I put up a poll in Slack, so for those of y’all who are in Slack, you can put in yes or no, and at this exact moment, Niko, you’re right, it’s 50%. Oh no, it’s now down to 33%.

Niko Neugebauer: No, if I didn’t see it, it didn’t happen. That’s the rule of my kids, right? At least – I modify it for my kids in the way that if I hear that you were playing the iPad at the time I didn’t allow you, it still counts as if I have seen you. But well, they still claim it, you didn’t see it, you never know. There is an app for that or whatever.

So if you’re using in-memory and if you’re using data warehousing, I mean, the most underrated feature of in-memory is schema only tables. The way you can land the data without touching on the disk, without touching – without hammering the tempdb – just a word of warning, there are some serious implications with the parallelism and they blogged about this and there are trace flags to help you out for some of the cases, but just, you know, consider it.

But if you’re running over the in-memory OLTP application, you can still successfully stay there by using the in-memory columnstore index. It will – it’s in-memory columnstore, it’s clustered in the way that it includes all the columns but it’s not clustered because it’s not the physical structure of the table. So we don’t have another 40 minutes for me to explain this but it just – you can create an in-memory columnstore but there are some problems. Does anybody know what are the problems with in-memory columnstore?

Derik Hammer: No, I don’t.

Niko Neugebauer: So beautiful thing is if you play with in-memory 2014, once you build your table, you cannot modify it, right? So you go to 2016, you upgrade, you can modify the table, you can alter the functions, life gets beautiful, you build an in-memory columnstore index and from this moment on, you cannot modify your table anymore.

So if you switch on the in-memory columnstore, your table becomes auto-magically non-modifiable, which is – yes, and if you want to modify it, you need to delete the in-memory columnstore, modify the table and re-add it. If you’re running 2016, do not do this operation until service pack one because for me, the biggest columnstore feature of service pack one is that adding the in-memory columnstore went from single threaded to multi-threaded. A huge feature. Really, it was like Microsoft didn’t really play on it, I was like, yeah. I’m a fanboy. Don’t care about it.

Anyway, so in-memory columnstore, if you’re using in-memory and watch the recordings with Ned Otter, he did a beautiful job around one year ago here on GroupBy, and I believe Sunil was one of the persons present on his presentation. He jumped in. Very, very interesting. Ned Otter’s view on the constraints and why – well, why some of the design solutions might – should improve for the in-memory.

And this is really a very important thing. Since the beginning of time, we kept on hearing use natively compiled stored procedure, right? And actually, use stored procedure always, and I tell you, do not use natively compiled stored procedure because first of all, you do not see the execution plans unless you’re running vNext or Azure SQL DB or the trace flags, and you think, oh it will be – there is no way a natively compiled stored procedure can use clustered columnstore index. None at all because there is no path even internally. It’s not implemented. Only interop. And it’s just one of the key things, you’re just like, oh it works, it’s fine, we’re going into production. Please don’t. Please don’t because it will cost you so much time to rewrite the things, rethinking them in panic, no. So please avoid natively compiled stored procedures for using the columnstore index on in-memory solutions. Alright, any questions?

Brent Ozar: Nope. I’m not going to tell you what the current poll results are, you know, like how many people…

Niko Neugebauer: You want to make me cry?

Brent Ozar: Yeah, no. I’m not going to do that.

Niko Neugebauer: I will show you a little kitten.

Brent Ozar: That’s better.

Niko Neugebauer: It’s not bad. I’m actually getting out of the – I don’t have any more animals to show you. might be meaning something. People probably happy about it. Alright, so going into 2017, 2017 was a kind of a big release from the point of view from the Linux edition, Linux support, and I think it was pretty underappreciated because some of the very, very juicy and important fixes, improvements went into the engine, went for the columnstore indexes. And this is the same picture as it was in 2016. You run data warehouse and BI and reporting data mark, you try to go clustered columnstore index because the biggest advantage you compress the data and all of your columns will be included. But when there are exceptions, you go for nonclustered columnstore index, and what is the most noticeable exception on 2017? Which one was really badly asked and cried for 2016 for the operational analytics? It’s the online rebuild operation, which is only available for nonclustered columnstore index. If you want an online rebuild, you got to be using nonclustered columnstore.

Do you want to hear the good news which I can publicly say because I can point on the record at Microsoft employee saying that? vNext is getting clustered columnstore index online table rebuild.

Brent Ozar: Nice.

Niko Neugebauer: Raise your hands up if you’re happy about it. Because I mean, because I guess it’s kind of a bragging, but I can – I’m the only person that you can ask and it has in his headphones, brains, whatever, all the possible scenarios where you can use which index when. Because otherwise people okay, there is an online rebuild for the columnstore indexes, we go clustered columnstore. They go into production, Houston, we’ve got a problem. And I’m a fan of the old times in the sense that at that time, the pictures were delivered consistently for all types of scenarios, and this is what I would like to have, but it’s not the case right now. So this is what – actually, this is the only difference between 2016 and 2017 slide, it’s the online rebuild. But you already – you know, keep in mind that because vNext, and who knows which will be 18, 19, 20, I guess it will not be 18 because we are already way too far into 18…

Brent Ozar: I don’t know. Yeah, I could see them trying to push it, like, December 31st.

Niko Neugebauer: But I mean, for marketing perspective, please marketing people, I love you all. How many installations will you get in 2018? It’s just like, I imagine somebody desperately typing through all kinds of stuff on their keyboard to make it and I don’t know. But just keep in mind, if you need an online rebuild for your data warehousing, and you’re still not using the columnstore indexes, maybe the next will be a very, very good fit for you. Maybe. Hopeful to see it soon.
Alright, and then there is something about this, what I mentioned a second before. There is something that really, really – I think it’s just a personal opinion, it’s kind of disturbing because you can have large binary objects, the LOBs, in the columnstore index, but only if you’re using clustered columnstore. If you’re using nonclustered columnstore, you cannot. So it’s basically it’s like, you have clustered and nonclustered columnstore, and you see the feature and you think you go over there, then you see the other feature that you need and you go there, and where you stay, you stay somewhere in the middle with nothing, right?

So while LOBs and computed columns, non-persisted only, for whatever reason, I mean, makes sense because it’s more difficult to implement, but these two are only supported for clustered columnstore index, which is pretty upsetting. I would like to see the LOBs and the computed columns to be fully supported by both indexes. I have no expectation in the vNext this would change. If it is, please guys, Microsoft, I love you even more, really. Because this is – this data warehousing solution in the health industry, in the financial industry, the computed columns is the backbone. It’s just – it will not go away like, ever, because you do not want to leave the complex calculations to any new developer in your team. You want to define them and leave them and guarantee that they are the way they are.

So OLTP, still the same, I mean, you have the advantage of the online rebuild, which is only for the disk-based nonclustered columnstore indexes, and these are the same. I will highlight it again in the beautiful, I would say, purple, whatever it is, color. I’m colorblind, by the way. If you didn’t notice it before. No I’m not, but my choice of color can be bad. I recognize that.

And going to Azure SQL DB. I know we’re really a bit over time but it’s just because of this unicorn situation.

Niko Neugebauer: That’s right, they always make everyone late.

Brent Ozar: You can always have this or that, and we still expect to find out who this animal is. I will keep it a mystery until the next GroupBy session. My session, which means – I don’t know, maybe in a couple of months, maybe in a year.

Derik Hammer: Guess you’re coming back.

Niko Neugebauer: No, I am.

Brent Ozar: Every time he goes on vacation. Every time he goes on vacation he’s going to schedule in.

Niko Neugebauer: Exactly. So I talk to my wife it’s like, I wanted to present for Brent’s conference, can we go on vacation? It’s a divorce strategy. So Azure SQL DB, it’s the same like in 2017, with the expectation that the online clustered columnstore index rebuild should be making Azure SQL DB much smoother and I didn’t try it in the last month or two, and maybe – you know, maybe somebody will run and do a beautiful blog post that’s already there. Who knows? Maybe even in the managed instance. I need to blog about it. Oh, you can.

So it’s the same reasons. Just Azure SQL DB is the – largely on the storage engine, it’s good new SQL Server with the newest feature. There are some very, very cool with the three features with the database scope configuration, which are already available for Azure SQL DB, which are not in 2017 that are new. Be aware, some of the features they are different, such as there is no support for application. You can use the Azure SQL data sync, and I do use it but it’s not a replication, to say the least. It will replicate the data but not in the same way and not with the same guarantee. Still beautiful thing, I wish it to grow, there is so much space for the improvement, but it starts – you cannot draw your replication, you know, you cannot draw your design schema and input the replication. I mean, you can, but it will not work.

And the other is the bulkload and simple recovery models. It’s all running fulls (recover model) with the read committed snapshot isolation, so your data warehousing will function differently if you’re running on the simple recovery model, to say at least.

So here is for the OLTP operation analytic scenarios. You can go in-memory, just remember not to use the natively compiled stored procedures and don’t forget that in-memory are only available for the premium instances. And that’s the point where I mentioned that if you’re running S2 or better, so Standard edition of the Azure SQL DB, you can have – since two months, you can have the disk-based columnstore indexes, which is really, really awesome. So at least you can do the development for your data warehousing, for your BI solution right now on the Standard edition. You don’t have to go premium, which is really, really – I like it. I’m really grateful to Kevin Farley for him, for you know, working and pulling it through. He did a great job, as always.

Azure SQL data warehouse, well, there is only one type of indexes you can build there and it’s clustered columnstore, and it’s default type and if you have had never an idea, it’s like, oh, we’re running a data warehouse, we call it to Azure SQL data warehouse, like, besides that, take a look at a number of connections you can establish.

So it’s not like you can – I have 16,000 users, you know, issue in the queries. That’s not the Azure SQL data warehouse was thought for or designed for. And the other little thing is there is something which is called distributions, and I will be blogging soon enough about the Azure SQL data warehouse, and the thing is, unless you have in the table 60 million rows, you will not get a full segment. So just an idea like, I have 10 million rows, I can go to Azure SQL data warehouse. Yes, your row groups will be very small, you don’t want this, it’s not really efficient. So 60 million rows, it’s like the rule of thumb, it’s around there, you’re starting to, you know, use the advantages of the Azure SQL data warehouse. But there are obviously exceptions. Follow somebody like James Rowland-Jones who is – who went the right way and joined Microsoft, besides organizing SQL Beats and you know, doing so much awesome stuff, and he’s the guy who’s pushing Azure SQL data warehouse forward. So follow him for the news on this. So at this moment I say stop and fire away your questions.

Brent Ozar: Hammer time. Alright folks, feel free to fire away and ask any questions that you got either in Slack or in GoToWebinar. I’m going to ask you because you’ve obviously worked with Azure SQL data warehouse at least a little, what’s the ballpark connection limit now? I want to say it was like, 30, but what’s the current connection limit? Do you know?

Derik Hammer: Concurrency?

Brent Ozar: Concurrent queries.

Niko Neugebauer: Concurrent queries, I am – I know that it’s more than 30. I am confident – pretty confident about 40. But it’s not thousands.

Derik Hammer: Yeah, they recently came out with 128 but it’s really misleading to say 128 because there’s concurrency slots that each query takes up and when you’re doing anything above the smallest resource class, they take up more than one slot at a time, even though there’s technically 128, it’s really a dynamic equation about how big your warehouse is to see what you could actually achieve.

Brent Ozar: Nice, okay. Cool.

Niko Neugebauer: I think to describe the Azure SQL data warehouse, it’s an incredible platform for storage and for processing massive amounts of data. This is why it’s true to NPP but it’s not – the name is the kind of challenging right now and I’m confident that Microsoft is more than aware with such smart people as JRJ, Orloff, Uscher, that they will be working on making it work even for the traditional data warehousing scenarios.

Brent Ozar: Are you just saying that to make sure the picture doesn’t get out? Now, one other thing I should ask you. So now when people want to go learn more about columnstore, where should they go? Like, which of your websites do you want to point them to?

Niko Neugebauer: I would like to invite you to, or if you Google for Niko and columnstore, there is a page.

Brent Ozar: Yeah, pull it up there.

Niko Neugebauer: Yes, so I have 120 plus blog posts and my site is back online after some trouble for a week. And I hope it’s still online. Let me check.

Brent Ozar: Yeah, it was good this morning. You’re in the weekly newsletter from Monday, so I was like, watching to make sure you didn’t get hacked again.

Niko Neugebauer: Okay, so you are my angel. It’s the same joke – no, not that.

Brent Ozar: Guardian – like a My Little Pony but different.

Derik Hammer: We had question pop in. I’m not sure how to pronounce this correctly but The Bassist asked, “Does automatic index recommendations in Azure SQL database take columnstore indexes into consideration?”

Niko Neugebauer: I confess that I have no idea. Would be a great subject actually to consider. I would be kind of surprised because if you look at the DTA – what is it? Database tuning advisor, they came with a support for the columnstore indexes pretty late, like, a year ago or something for 2017. It went live, and I’m guess that if you are somewhere in the – around the tipping point where it’s worth having columnstore and it’s not – it’s pretty – it requires a human to kind of understand this, it’s not only – my argument is always that this is a sign that you have like 122 blog posts. Yeah, so you know, you have a solution. One query is not really efficient, and if some – its total consultants will come and say let’s put an index on it and you know, it’s wonderful and it’s running fast. He does not have an idea why is it slow in the economical point of view.

So for example, imagine, you have a CEO of your company, and CEO and director is using a dashboard. Those queries got to be running as fast as they can. And if the others, like, I don’t know, your boss is suffering or you yourself or some support people take five seconds to see the dashboard, nobody cares. And that’s what machine learning cannot process because it doesn’t have this – you know, physical word context, right? And I’m just – if it’s supported, I’m excited about it to find it out, but I guess I will be having more work.

Brent Ozar: Nice.

Niko Neugebauer: Thank you, just really appreciate it.

Brent Ozar: Well, thanks for coming and talking to us today at GroupBy. Thank you very much, Niko, and I would totally encourage everybody to go to Niko’s site. He is like, the place to get started with columnstore indexing. Fantastic.

The following two tabs change content below.
Niko Neugebauer is a Data Platform Consultant. A SQL Server MVP with over 20 years of experience in IT, he is passionate about the Microsoft Data Platform and community. Founder of the Portuguese SQL Server User Group and the main organizer of the first SQLSaturday event outside of North America (#78 Portugal), Niko speaks regularly at events such as PASS Summit, SQLRally, SQLBits, and SQLSaturday events around the world. Niko loves sharing information and knowledge and has authored over 130 blog posts on Columnstore Indexes, and regularly contributes to the open-sourced CISL library focused on Columnstore Indexes.
Previous Post
Voting Open Now for GroupBy’s September Lineup
Next Post
How to Monitor Everything

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.