SQL Server 2016: It Just Runs Faster

Download BitTorrentMagnet – Audio podcast:

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS
Leave us a review in iTunes

Target Audience:

DBA, Developers, IT Pros, and IT Decisions Makers.


Based on the popular blog series, join me in taking a deep dive and a behind the scenes look at how SQL Server 2016 “It Just Runs Faster”, focused on scalability and performance enhancements. This talk will discuss the improvements, not only for awareness, but expose design and internal change details. The beauty behind ‘It Just Runs Faster’ is your ability to just upgrade, in place, and take advantage without lengthy and costly application or infrastructure changes. If you are looking at why SQL Server 2016 makes sense for your business you won’t want to miss this session

Why I Want to Present This Session:

I want the community to learn about some of the less known enhancements we have made to SQL Server 2016 that allow an application to just run faster with little or no changes. I also want to explain why we have made changes to the SQL Server Engine to keep pace and stay ahead of modern hardware changes to allow applications to scale.

Additional Resources:


Session Transcript:

Intros and How Bob Got Here


Brent Ozar: Good morning. In this session of GroupBy, Bob Ward is going to be talking about why SQL Server 2016 just runs faster. Take it away, Bob.

Bob Ward: Brent, thank you so much. Guys, I’m so honored. Brent, thank you. Honored for the folks that rated my talk to be part of this first-ever GroupBy event, so excited to be here. As Brent talked about, my session to start this whole event off talks about SQL Server 2016. There’s a lot of cool features in SQL Server 2016, such as Always Encrypted and Query Store and so forth, but my goal today in the next 70 minutes, with ten minutes at the end for questions, is to convince you that with very little changes to an application, moving to SQL Server 2016 can actually make your application run faster.

I’ve done this presentation at a couple other different events, I’ve actually got a new twist today to add an additional demo. We’ve got a couple demos as well as we go through the deck. Couple logistics: Brent mentioned about questions, certainly put them all in there. Knowing me, I’m committed to answering any question you have. We’ll have about ten minutes at the end to answer questions but if we don’t get through those, we will find a clever way to make sure we post those on my blog or I’ll make sure I can even answer those directly, bobward@microsoft.com.

I’m a big believer in getting my materials to you. If you go right now to aka.ms/bobwardms, there’s a presentations folder, it’s a OneDrive folder. This deck and the demos that I’m going to show you are all out there for you. So no questions about how to try to get all this stuff. It’s all there. If you see me running queries or something of that nature, there’s no need to take screenshots and those kinds of things. You’ll have it all available to you.

How “It Just Runs Faster” Came About

How we got here

Let’s get right into the material. A quick story about how we got to this whole situation. Myself and my colleague, Robert Dorr, we run a fun little blog called BobSQL. His name is Bob Dorr, I’m Bob Ward. A couple years ago, when we were both on support team and we started looking at what the product team was doing with SQL Server 2016. We noticed that there were some performance enhancements that they were making to the core engine. We started digging into this, going why are they doing this? It kind of dawned us that if you think about SQL Server 2016 a lot of the fundamentals of the engine date back to SQL 7.0 in the late 90s. Technology has kind of moved on.

If you think about faster I/O, faster networks, denser core CPU type technology, we have to make sure that the core algorithms and some of the features we’ve got in the engine keep up or stay ahead of the pace of that technology. That’s what got born out of the situation. Bob started doing interviews with members of the product team to say, “What about this change you made here? Why did you do this?”

What I hope to show you today is a little bit of the history or the result of the history of that research behind it. When Bob came into my office and was talking about this concept, he says, “You know, Ward” – he calls me Ward – he goes, “I think this thing just runs faster.” I said, “There’s a brand name. Let’s stick with that. Let’s go out on the road and show people about these kind of type things, why SQL Server 2016 just runs faster.”

The second point here in the slide is, this is how we ran into this. It was all about customer experience. This customer issues reported to support, MVPs telling us scenarios through the Connect site, or just MVPs telling us about problems they are running into, or customer benchmarks we were doing us technology like XEvent and Xperf. The themes that we found through this investigation of what we change, you know, it’s all about scalability.

I have this new term for you called reverse scalability. What I mean by that is customer’s expectations of moving to new hardware, and perhaps even a new version of SQL Server, or perhaps a new version of SQL Server with the same hardware, and not getting the scalability they expect but it’s actually worse. Not even getting the performance as the same as before but getting worse performance. So moving to bigger hardware and actually getting worse performance before, what I call reverse scalability. We ran into that.

Topics We’ll Cover Today

All kinds of things just run faster

Some of the techniques that you see on this screen are what you are going to hear about today. Partitioning is a big concept about scalability. You’re going to find several scenarios where we partition things in order to achieve that scalability. Parallelism. You all know about parallel queries in SQL Server but we have certain situations in the engine where we’ve just decided to do other things in parallel, different than queries, and allowing us to get the scalability we’d like from the engine.

Sometimes we just decided to do things larger. I’m literally serious that we had this concept called logwriter in the engine and we said, “What if we actually just added more of those? What if thought on a scalable system with NUMA nodes if we just added like one per node, would we be faster?” Sure enough, we were. We tried to make sure that the engine was attuned accordingly to run with more logwriters. But just by doing some of those types of analysis and experimentation allowed us to get faster.

Dynamic response is an area where we said in some cases we don’t want to automatically partition things to start with but perhaps we could dynamically respond to a problem. We could observe a problem that’s going on in the engine and without the users really kind of noticing there was an issue, we’d get ahead of it and actually partition on the fly.

Then, finally, improved algorithms. In some cases we just said, look, we’re using a spinlock in this part of the SQL Server code. We don’t even need to do that anymore. We’ll use some sort of lock-free approach like we’re doing in in-memory OLTP. Let’s just do that instead and see if it’s actually got better performance. That actually is some examples we did there.

Before I show you some examples of what this actually looks like, one thing that is very important to note here, nothing that I have in this deck today, none of the features I’m talking about, the enhancements I’m making and showing you, have anything to do specifically for an edition of SQL Server.

Nothing Here is Enterprise-Edition-Specific

Now you may know now that in SQL Server 2016 SP1 we actually have enhanced or lit up some features that you didn’t have available for in things like Standard Edition. But anything I’m showing you today is not specific to Enterprise Edition. It could be there is still a feature that is Enterprise only that we’ve enhanced but none of the changes we’ve got listed out here are specific to a specific SKU.

Before I go into these features, I want to make sure you’re aware of two things. I added this slide after I first started doing this presentation because I think a lot of people that even if you’re just fundamentally familiar with SQL Server are not aware of these two technologies: columnstore indexes and in-memory OLTP. Both of these available before SQL 16 and have been enhanced in SQL 16, but these query performance numbers that I’m listing to the right over here are not just made up stuff. These are actual observations from the lead program managers of these features.

When I talked to them I said, “Hey, what kind of numbers are you seeing with actual customers?” These are the type of improvements you can see from using these technologies by adopting them. In these particular cases, these are one in some cases that do require some changes, as opposed to some of the things I’m going to talk about. But I want to make sure you’re aware of these because these could absolutely make your application faster, your SQL database faster just by adopting these technologies. In one particular case, we’re going to talk about enhancement, we’ve actually made the columnstore in SQL 16.

This is the list:

You’re going to see throughout my deck here these hyperlinks throughout the deck. When you get them, click on them. They actually go to reference information. In this particular case, this hyperlink that says “just runs faster” actually points to the BobSQL blog and a series of blogposts that Robert Dorr started and I’ve actually enhanced that talk about all these functionalities and features. There’s no way in the 60 minutes now that we’re going to be able to cover all this stuff so I’m going to pick a selected few from this list and go over them and talk about why we did it and how they work and how it can make you faster. Then, of course, we’ll do a couple of demos.

The little stickers that say “new” right there are just things that I’ve introduced since I first started doing this series that may or may not have a blogpost yet that we need to actually produce. This is the list. I’ve got a sticker here that says “there is more” because there is more. There’s actually more that are out there on the blog that I don’t even have on this list because I’ve run out of space on the slide now. There’s more that is listed at the back of the deck that we don’t even have a blogpost yet, we haven’t even researched yet but we know about that there’s a feature.

Another thing that’s really interesting when you get this PowerPoint deck and look at it, if there’s a particular feature on here, like multiple logwriters, that I don’t cover today, I have at the back of the deck something called bonus material. Bonus material is a set of slides that I’ve built that do talk about these features that you can read over and look more at, also pointing to the blogposts in the case that you see something today that you wish I’d cover but you want to learn more about. A lot of good information available in this PowerPoint deck when you get it, both for just viewing this information in this presentation and for reference for later.

Intro to Non-Uniform Memory Access (NUMA)

Let’s dig into core engine scalability and talk about NUMA. If you think about NUMA technology, I harken back to when I look at this slide here to machines that I used in the mid to late 90s, SMP machines, that had eight CPUs. I thought I was the king, man. I had a Unisys server with eight CPUs back in SQL 7.0 and I was doing testing and support. I thought I was like the greatest guy in the world.

If you think about this technology, as CPUs grew more in these machines, to say like 32 or even 64, bottlenecks started to arise within the hardware itself, right? Thus was born this NUMA architecture where we effectively partition CPUs into their own little mini computers with their own local memory bus and local memory access to make things faster, hence that term partitioning you see come into play here.

In SQL 2005, a rare fortune that Slava Oks and the developers of SQL OS, they thought of NUMA way back then and said, “Let’s build into SQL Server the capabilities of recognizing NUMA and taking advantage of those technologies.” Most of these original designs had no more than eight CPUs. Well, multi-core takes hold into the industry—first of all, dual core then hyper-threading kick in—but then there’s multi-cores and you can get CPUs on the market now easily with 24+ cores for just a given socket. Now what you see is sometimes these NUMA nodes experience the same bottleneck within the node that occurred in these SMP machines. So you think like spinlocks that we use inside SQL Server, they don’t scale sometimes in these scenarios.

Automatic Soft/Virtual NUMA in SQL Server 2016

So in SQL Server 16, we decided to do something very interesting. We decided to further partition the NUMA concept in what I call kind of a soft or virtual concept. If you’ve got a hardware NUMA node scenario where we detect more than 8 physical processors per your node, by default SQL Server will partition these nodes into further nodes into less number of processors for these nodes than the physical configuration. Now we’re not changing the hardware configuration, we’re just changing SQL’s view of what NUMA looks like.

You might ask yourself why would you do this. Any code in our engine that benefits from NUMA partitioning can get a boost. For example, in the slide here I list something called IOCP worker, which is called the I/O completion port worker. That’s a dedicated thread that handles like TDS packets coming into SQL Server. Well, by default we build one per node.

So if we add more nodes to the hardware configuration of NUMA nodes, then we’ll get more of these workers and therefore theoretically get a boost in connectivity and batch throughput just because we have more of these workers. The results are interesting and you look at these slides, they’ve got these little green shapes that I’ll show you in the right-hand corner in many of these. These are some of the performance enhancements that we actually observed by making some of these changes.

You look at the last one here, a workload derived from our TPC benchmarks. We simply were taking some of these workload benchmarks we have internally, we would actually stick it on a server, we would then just turn on Auto Soft NUMA partitioning and see a 25 percent increase. So the core engine itself has several other areas where it uses NUMA node partitioning. Just by using this technology, we’re seeing performance enhancers for the server. It’s on by default and there’s an alter server configuration parameter to turn it off if for whatever reason you see that there’s an issue with it.

Visualizing NUMA Changes in 2016

I thought it would be interesting for you to see more of a visual about how this actually works. This is based on a machine that I have on the labs here at Microsoft that I use often. It’s a four socket 18 core hyper-threaded machine, which what that means is that for each node you get 36 logical CPUs, so total of 144. This is a picture of a hardware node, a memory node. These numbers represent the CPU IDs. I listed them in this way on purpose. A zero and a one vertically represent one physical core of the CPU. So 36 logical CPUs, 18 cores.

NUMA nodes on a 4-socket box

What SQL Server does when it sees this configuration, it will take this particular memory node and it will break it up into these four logical nodes within SQL Server’s own code to actually see it that way, all for the purposes of scheduling worker threads. If you look at the pattern of these numbers, there’s a method to the madness here. There’s a rhyme for the reason why we did this. In node 0, you’ll notice that it’s 0, 2, 4, it’s not 0,1,2,3 consecutively. You’ll also notice here that for CPU 1 it doesn’t show up on node 1, it shows up on node 2. Here’s why: we first try to avoid putting two logical CPUs from the same core on the same soft node. That’s because we’re trying to do as much as we can to avoid any conflicts of running worker threads on the same actually physical core.

In addition, we try to avoid putting physical cores on sequential nodes because for degree of parallelism for actual parallel queries, that’s often how we just schedule the work. We just say, “Hey, start running these subqueries on consecutive node numbers.” Yeah, you still could spill over, depending on your degree of parallelism but we’re trying to get smart about how we actually configure these nodes in a logical way. If you look here, we’ve taken this concept of 36 and broke them into four nodes of nine. The method there is that we’re going to try to get close to that magical 8 number as possible. So we are one over 8 at 9, but that’s how we actually break these things up.

If you go look at things like the DMVs, like DMOS nodes or DMOS memory nodes, you can see a mapping of the fact that there’s a hardware node out there that we can’t ignore from a memory perspective, so there is a mapping to these. We try to make sure that these four logical nodes still appear and mapped in SQL Server on the same hardware node, which is memory node 0. You’ll see it in there.

In fact, I actually after doing this talk and getting a lot of questions in this space, especially when it comes to like virtual machines, I’ve got a blogpost that is linkable off this slide where you can go in there and see how I took various types of machine configurations, both this one, a VM, and another machine I have, and you can see what the error log looks like, what the DMVs look like, and the binary mapping to CPU IDs. You can get a picture of what that all sees here. It is my contention that out of the gate, you move to SQL 16, we use this NUMA configuration and you can get much great benefits from it in these high dense core NUMA scenarios.

2016’s Changes to Dynamic Memory Objects

Dynamic memory objects, you can see how this is going to go today. We’re just going to take a tour through some of these types of features and talk about how they work, why we did it, what kind of benefit you can see from it. Dynamic memory objects is an example of this dynamic response concept that we built in SQL Server that I mentioned earlier in the talk. Memory objects are a concept in SQL, if you don’t know what they are, if you think about most of the memory allocated in SQL Server, most of it comes from things like buffer pool or procedure cache, which is, you know, things like buffer pool are like 8k pages.

Well SQL has other memory needs for its own purpose, for different data structures and overhead within SQL, things like connection memory of that sort. So we use this concept called memory objects, which are a little bit like Windows heaps, if you’ve ever done Windows programming before, variable size types of memory we need to allocate. If you’ve spent any time understanding how memory management works in an operating system, when you allocate memory, you must track this memory. You must track who is allocating, how much have they allocated.

What happens is you literally have a list of structures in memory that track the memory you’re allocating, things like linked lists and so forth that track all this information. Whenever you build a system like that and you have multiple threads that are global in nature, you have to protect it so that one person doesn’t try to step over the memory that’s being allocated by a different thread.

CMEMTHREAD Waits: Fixing ‘Em Automatically

We use this concept in SQL Server for these global memory objects called CMEMTHREAD waits. When you have a lot of different threads that are all trying to access these memory object structures and they run into hot contention, you will get a CMEMTHREAD wait problem as you see from things where you normally see wait types and dm_exec_requests or OS wait stats. We consider a hot memthread to be one where you get a lot of waits and maybe a higher average wait time per wait. As it turns out, we built an infrastructure in the engine to create these memory objects and these structures so that they’re automatically partitioned by node or CPU but there’s a little bit of overhead to do that so we don’t do it by default for a lot of these.

We also built in the concept that if a particular memory object is already partitioned by a node, you can promote it by using this trace flag 8048 and that’s something we’ve blogged about and people have actually used in these scenarios. But here’s what happens: customers can contact support or somehow the product team finds out that one of these memthread wait problems are occurring because of a given memory object and we end up creating a hotfix for it. Customer has a scalability problem, we see a wait, it’s on a memory object that we didn’t think was going to be a scalability issue, we had—spatial was one we ran into a year or so ago. We run into these from time to time.

We said to ourselves there’s got to be a better way. Why do we have to create a hotfix every time somebody runs into one of these things? Why can’t this just work and respond to if a problem occurs? Either partition these ahead of time or as you see a problem occurs, why don’t you partition on the fly? That’s exactly what we do. So behind the scenes a CMEMTHREAD wait scenario could happen, but it happens in such a tight timeframe that you don’t even notice it and we would automatically partition by node and eventually by CPU to try to make this a very scalable solution.

In fact, this is a small little perfmon graph and you’re going to see an example in a second, a demo I’m going to do. This right here represents the inflection point where the batches go up from a scalability perspective and the waits go down automatically to almost zero. It’s my contention here, this just works now, you don’t even notice this is occurring. In many tests we’ve run, we’ve seen 3x improvements in memory allocation with less CPU being used, which I’ll explain in a second, just by doing that.


You know what? Let’s take a look. Why have slides the whole time you’re doing a presentation, why don’t you see a demo? I’ve got a perfmon chart here that shows you thread safe memory object waits per second, which is equivalent to the CMEMTHREAD wait concept. I have batches per second, I have processor time. Now this is the funny twist to this whole thing here. This feature when it’s turned on by default, which is the default when you install SQL Server, this problem if you run it it happens so fast, like you can’t even detect the problem. So there’s this undocumented trace flag here to turn off the feature so you can see the problem first and then I’ll turn the feature back on.


I have a stress scenario to run here that you can have access to yourself after this presentation. I’m going to run this; I’m going to start seeing the waits that actually occur here. Let me go ahead and just scale these counters so you can see them a little bit better. Basically what you have here is you have waits here occurring and you have batches per second occurring. Let’s scale this one a little bit differently. You see here I’ve got batch requests per second happening at a certain rate. I’ve got these memory waits occurring here at a rate as well when I’m running this repro here.

Let me go back here and this is kicking in here. By me now going ahead and turning on the scenario here—let me go back and look here. I’ve got a DMV, and this is a fairly complicated query for you to take a look at here. This is actually going to show you for a particular memory object that some of the contention that can occur because of this and also the fact whether this is partitionable or not. This particular scenario, what I’ll do is I’ll actually go ahead and turn on this feature and let’s observe the results.

What could happen in this scenario here as these batches are running is all of a sudden these particular waits can go down to zero as you see here. Notice here that the particular batches went up. Now it isn’t significant with this particular demo here but can you see here the inflection point as the waits go down, how the batches go up here? It used to be at some sort of rate here, around 2000 or so, and now it’s up in the 3000 type level range. So SQL Server was observing waits, they occurred and batches go up. Now if you go back up here and look at this query that we were talking about for the DMVs and you run it again, you can see here that automatically we’ve partitioned this by CPU. This contention factor is a new column we’ve added to this DMV that you can see the actual waits and the contention occurring in the system.

Now in a normal scenario when you’re running SQL Server, you don’t even see that this is happening. I’m actually showing you an example here using this trace flag so you can observe the behavior. I turn off the trace flag, which basically turns on the default behavior and automatically partition things. So it’s our intention here that this core design of the engine is something that you don’t have to worry about. We partition on the fly and if you were in this demo without turning on that trace flag, this happens so quick that you don’t even see the scaling factor occur within the engine itself. So pretty interesting feature that we’ve added, something that just works by default that you don’t normally see but you can see now that what we’ve added is the intention of keeping things scalable and dynamically responding.

New in 2016: Parallel Redo

Let’s talk another core engine scalability feature called parallel redo. If you don’t know enough about or are not sure you’re sure of the details of how we do recovery in SQL Server, there are basically three phases. Here’s a pointer to a primer about how it actually works. Effectively, it’s the analysis phase, a redo phase, and undo phase. Analysis is going and scanning the log to determine what we have to do, what transactions do we have to go commit that were in the transaction log but the pages on disk don’t reflect those changes, and what transactions we have to roll back. Where changes on the pages themselves have been reflected but the log says they weren’t committed so we have to undo those.

If you think about the redo phrase, it’s all about applying changes to pages. There’s committed transactions in the log itself, we look at the page on disk, it’s not reflective of that actual change so we need to redo that change on the page to make things consistent. Typically, it’s I/O bound because we’re reading these pages from disk, we’re going and making the changes here, there’s a lot of that work to do to make that happen. So we’ve never really had to worry about doing anything special in the redo code in the past because I/O devices have been at such a latency level that I/O is usually the bottleneck.

But if you think about the typical fast I/O devices that are on the market now with latency sometimes less than a millisecond, we’re running tests and all of a sudden realized maybe our redo code is the bottleneck now. So we said, “What if we did this in parallel?” If we did things in a parallel fashion, we could utilize more of the CPU and actually make this faster. That helps scenarios like secondary replicas where you’re continuously running redo scenarios so that you can go read and see the changes happening in a secondary fashion. But if you’re thinking about doing things in parallel in the log and redoing, you’ve got to still keep things consistent.

How It Works: Pool of Redo Thread Workers

So here’s what we did, we built a parallel pool, a redo worker pool. These parallel redo tasks you see right here are the same as the command you might find in dm_exec_requests when you see this occurring when we run recovery. Now this worker pool is simply SQL Server worker threads that we dedicate during recovery to doing this task, so it’s not some special different pool that we’re building, We’re just dedicating certain ones from the normal worker pool to do this task.

We also have something called the dirty page table inside SQL Server that’s used for recovery purposes. Here’s what we did that’s quite clever. We said, let’s take the dirty page table, which is a list of pages that we must go make changes to to do commits for redo phrases and let’s partition it by page ID. Then let’s take each redo parallel worker and assign it a page ID. Then in order, apply the necessary log changes to make things consistent. So one parallel worker cannot touch the changes to a different page. This keeps things a very consistent way, but yet in a parallel way. So I get page one, page two, and page three. I need to go apply all the changes in order to page one and I can do this at the same time and still maintain consistency because I’m not taking any changes and applying them from one page to the other across these log sequence numbers.

Now, versus standalone recovery where I don’t do things in parallel, I do it in serial, I’ve seen up to 80 percent increase in performance. I’ve got a little pointer here on the slide as well, you may be asking, how many of these things do I get? Does it get out of control? Well it’s basically the number of cores in your system, what these pools look like per database. There’s a max per db and a max for the instance itself, only used when we’re actually running recovery which would occur of course when you’re doing a restore, continuous redo, or when you’re actually restarting SQL Server.

Parallel Worker Thread Redo Demo

Let’s see this in action. In order to make sure we use our time wisely today, I ran the scenario, it takes several minutes so instead what I’m showing you here is a couple things. One is this is the result of an extended events session showing off a new feature of SQL 16 with an event called database recovery trace that shows you the difference between a serial execution and a parallel execution of redo on a database. So first of all, this is actually just handy for looking at recovery scenarios.

If you’ve got a slow recovery scenario and you want to trace some of the details of it, you can see that by using this extended event trace session. If you’re asking yourself how do I turn on an extended event trace session if it’s starting SQL Server, it’s like a chicken and the egg concept, you can actually build this xevents session, and it’s in my demo scripts, where you can say startup equals on, we will start up this session before we run recovery. So the session kicks in, it’s already running when recovery runs you’ll get all the information you need.

Using undocumented trace flags and Extended Events to watch serial redo

The way I ran this in serial, by the way, is another undocumented trace flag, not because we’re trying to hide something but because you don’t normally need it, but it’s in the deck in the notes session. It’s there just for testing purposes so you can see the difference between a serial execution and a parallel execution. In the serial case here, you can see the phases of recovery: here’s like the analysis phase, here’s the amount of log we’re trying to process, which is like 7 gig in this case. If you see here, with the redo phase starting, it starts here about 9:43:22 seconds, here’s the phase starting. If you go down here, you can see that it finishes about 16 seconds later. So about 16 seconds of redo time and actually in the error log and in this trace it tells you the breakout of how long it takes to do analysis, redo, and undo, so 16 seconds.

Here’s the parallel case where we’re not using the trace flag. First of all, notice here that there’s additional entries talking about the fact that parallel redo is enabled for this database and the different workers that are started for this process. If you look here at the phases of redo, this one here starts at 9:35:59 and you can see here about ten seconds later we now finish. So a ten second execution time for the same type of recovery that’s occurring here versus 16 seconds. Now that doesn’t sound all the impressive just from running this scenario, but this is very small database and as you get into a scenario where we actually can add more parallel workers, depending on how many cores, for a large amount of redo, this actually can get faster and more scalable.

Again, this just works by default. This is no turning any kind of feature on. You don’t have to observe and make any changes. We just run parallel redo by default, we take advantage of the number of cores in your system, and database recovery trace, a new, nice diagnostic feature for you to take advantage of in SQL 16.

2016’s Faster DBCC CHECKDB

Let’s move on and talk a little bit about DBCC. Nobody cares about DBCC, I’m sure. Certainly it’s something that in support I’ve certainly paid attention to over the years. In 2008 timeframe, I had this passion to make check faster. I was just tired of some of the time I was hearing about CHECKDB that was running. I thought to myself, what can I really go after? Running a full CHECKDB is a little bit difficult because of all the different index checks we have to make from a logical perspective. It’s kind of hard to put your head around how to make that faster but I thought to myself, could we make CHECKDB with physical only faster? Here’s why: CHECKDB physical only is pretty much reading all the pages from disk and doing an audit on the pages, that’s really what it comes down to. So if that’s what’s going on here, how do we make that as fast as possible?

I’ve got something I call the backup to nul test, that’s kind of my benchmark to try to compare how fast we can do this. If you didn’t know this trick before, this is an old trick that’s been around for a while, if you backup a database to a device called N-U-L—that is not a typo by the way, nul is just one L. Backup to disk equals nul, that tells SQL Server go ahead and read all the pages from the database to back them up and basically do nothing after that. Don’t write it anywhere because nul is like an empty bit bucket device. So in a way, you’re really effectively testing the disk speed of how fast we can read pages from disk. In 2008, we made some changes to make that faster. We also made some other changes in that timeframe, between 2000 and 2012.

What we observed is is that we could improve the design of the multi object scanner latch—which I’m sure you’ve heard of before—it’s an internal latch we use to protect data structures as they’re reading pages from disk over multiple objects. In 2016, we said we can make this better. What’s interesting here, as I talk about the no-lock approach is that the developers of the Hekaton project, in memory OLTP, they’re used to not doing latches, of getting rid of latches in their design. They took a look at this and said, “Look, you don’t even need that latch anymore internally, guys, doing CHECKDB. You can use a no-lock approach and you can improve your read ahead capabilities on these pages.”

Benchmarking Faster CHECKDB in 2016

Here’s the observations: on an SAP type database, which is a large number of objects database, one terabyte, we out of the gate—with no changes to anything—got 7X performance gains from CHECKDB with physical only. The more parallelism the better to a point. There’s a threshold here, depending on your system you could run into, but we were achieving scalability results here, not reverse scalability but scalability results here just because of this change. In fact, with a small database of five gig, I’ve been able to see 2x faster performance using physical only.

This is a chart that one of the developers gave me. This chart is not what you think it may be. The bars, as they get bigger, are actually a total duration. You can see as you go across the board with degree of parallelism, this is a reverse scalability concept, things get worse as you go across parallelism versus getting better. In the case of the no-lock approach, which are the orange bars, things get better or stay the same as you move across parallelism. Again, 7x faster performance for a one terabyte database with multiple objects.

Our ability to go into the core engine, you make no changes, you still just run CHECKDB physical only, you get faster performance. A full CHECKDB could see some advantage of this but I think it depends a lot on what other things CHECKDB has to do, with how many indexes you have, and so forth. But certainly physical only is something that has had vast improvements from this change.

TempDB Improvements in SQL 2016

Tempdb, another topic that nobody really cares about, has never been talked about in the SQL Server community. In fact, is it probably the most contentious point in the history of the product? Maybe. I certainly have been a part of that discussion. In fact, there’s a link on here from a PASS Summit talk I’ve done on inside tempdb where I talk about this concept of multiple data files. You know, it was my contention back in 2011 and I still have that feeling today that one file per logical processor up to 8 and then add 4 until it doesn’t help anymore, is kind of a motto I’ve used and I think it has been successful in looking at tempdb multiple file performance scenarios.

The thing to remember here about multiple files with tempdb is that it’s really not about I/O. Sure, it’s true that you could take tempdb files and spread them across disks and when we have to do I/O for tempdb, which remember, for tempdb user tables, we don’t do checkpoints on those, right? So it’s really only in the case where we have to spill those to disk because of a memory pressure problem or sort spills or hash warnings and else forth where we do I/O. So you could get an advantage from that, but the whole reason to add a bunch of files here is all about allocation contention for pages like GAM, SGAM, and PFS. So if you really think what you’re doing when you’re adding multiple files is you’re partitioning the allocation contention for tempdb. You’re helping create scalability by partitioning our allocation structures across multiple files.

So what have we done in SQL 16? First of all, by default, we have gone in and said we’re going to give you by default one file per logical CPU up to 8. Then in setup, you have a new tab here and you have also the setup options through the unattended install choices to add more, if you like. We also made the autogrow autofill for tempdb to be 64 meg. That aligns perfectly what we call a PFS interval. So by default if autogrow does kick in, we have less contention in the PFS page because we actually are going to grow by one PFS interval by default.

We also allow you to spread your file across drives in setup by simply adding a set of folders here. We’ll just round robin add these files across these folders or drives. Then the transaction log now has an 8 meg default as well. So, new defaults in SQL 16, setup offers you new choices to configure tempdb to your liking during the setup process. None of these existed before SQL 16. Does anybody even care? Does it even really matter?

Benchmarking TempDB in 2016

Benchmarking TempDB

Look at this chart that I dusted off back from my 2011 talk. I took that repro and I ran it again against SQL 16. Notice the difference here, seconds being the total duration of this workload that I did using OSTRESS. You can see here just by going to 8 files a measurable performance increase. Then in this particular machine, going to 32, and then things kind of tapered off. Now you may be asking by looking at this chart here, what about this trace flag here. What is this 118 thing mean? Well in the back of the deck there is a slide talking about trace flags.

Trace flag 118, which has been used forever is on now by default. In fact, funny story is if you turn the trace flag on, we actually give you a warning in the error log saying, “Hey, you don’t even need that anymore, so we’re not even going to recognize your trace flag.” Trace flag 117, which is used for autogrowing files at the same interval, that’s also on by default for tempdb. Both of those options, for 118 which is uniform extent allocation, and 117 for multiple tempdb files, both of those particular options are available now through alter database commands for user databases. For tempdb, we just turn them on by default.

So what about just simple scenarios? My little poor man’s laptop here with 8 logical CPUs, if I run a tempdb stress of 2014 out of the box, 16 out of the box, I get 2x type performance benefits. It is true that if you configure 2014 to have all the proper number of data files as SQL 16, you’ll get similar performance. But out of the box—without making any changes to 16 and 14 on my 8 CPU machine—I got 2x performance. 2x performance out of the box on an 8 CPU box. Again, if you take a look at many different scenarios, it is even conceivable that from a file perspective and a trace flag perspective, you may not have to make any changes to tempdb except for maybe the size of the files to avoid autogrow when you’re first going out of the gate. So there’s tempdb out of the box, better performance in 16 by default choices we’ve made and giving you the ability to be more flexible in configuring it during the setup process.

How Instant File Initialization Impacts Performance

I/O is an interesting topic, lots of different things in I/O we’ve done. Some of these are listed in the back of the deck. I picked out two in particular that I think you’ll have interest in. Instant file initialization is one of those. You know, I say in the slide here this has been around since SQL 2005 so you may be asking, “Why are you even talking about this in SQL 16?” There’s a couple reasons. There’s an important change we made during setup that I’m going to talk about in a second but also because I think a lot of people miss this feature. They miss turning this on and they miss a big performance boost. Also it’s important for you to know how does this work, I mean, why is it even important.

Prior to 2005, the speed to create a database based on the size was the speed to write zeros to the disk. The reason is that’s how you create a file of a given size. If you say, hey, I want a 10 gig file, create me a 10 gig database file for the operating system, you’ve got to create the file and then you have to grow the file. So we would write zeros to this thing to grow it out to 10 gig. Well on slower disks systems you know that takes a while.

Discussing Instant File Initialization

So Windows came along and introduced a new API call that we took advantage of in SQL Server called SetFileValidData(). SetFileValidData works this way: instead of you creating a file and writing out zeros to create and grow your file, you would just tell Windows, hey, I want a 10 gig file and Windows would come back and say, “You know what? Here you go. You got 10 gig” in the snap of your fingers. In fact, creating a file for a database became almost the same speed regardless of size. A 10 gig file to create a database was almost the same size as creating a 1 gig file. This is quite handy for us. This is a concept that we started using in SQL 2005.

You may ask yourself, for create database, I don’t care. I’ll go create a database here at my desk, I’ll go to lunch, I’ll come back and it’s finished, big deal. But you do care about restore. You care about how fast it takes to do that. You definitely care about autogrow because this same process to autogrow a file has to take place and that can be a major blocking problem. So if you’ve got a major blocking problem because of autogrow it occurs because you’re trying to run a transaction that’s inserting a row for example, we have to allocate a new page. We don’t have space. We’ve got to autogrow the file by x number of gig. So you could block the world just because we’re trying to grow this file by a certain amount of size. So instant file initialization could be a great benefit to you.

Security Implications of Instant File Initialization

There’s a catch to all of this. First, the session or the actual security role for the task that’s trying to create this file must have the performance volume maintenance task privilege. And of course you all memorized that, right? Everybody knows what the performance volume maintenance task privilege is. I forget it half the time, right? That’s something that’s a little bit of a pain if you to have to remember what that is. By default, we do not do that for the SQL Server account. The reason we don’t do that is because some people perceive this as a security concern. Now I don’t believe it is that way but let me explain why they think that. If you have this privilege, you can see any bytes in the space on disk that were previously there when we created this file. So if you create a 10 gig file and Windows comes back and says, “Hey, here’s 10 gig.” What it’s really done is it’s reserved for you the space on the disk for that 10 gig but it didn’t touch the bytes on disk. So if you have the privilege, you can actually see whatever was allocated there previously on disk. Now if you don’t have that privilege, Windows would only present zeros to you. So if you were just a user, you would not actually see those bytes. So the only people seeing these bytes are the session that actually has these privileges, which by default are Windows admins or somebody that you actually assign access to, which could be the SQL Server account, right?

Now the transaction log cannot take advantage of this and there’s a lot of myths around why that doesn’t occur but the main reason is because we have to rely on a known byte pattern. We can’t worry about the fact there were previous bytes on disk that may have existed there. We have to look for a known byte pattern to find the tail of the end of the log to see how to scan the log. That’s why the transaction log can’t rely on it.

For SQL 16, the reason why it’s important, you even talk about this is because instead of having to worry about what this privilege is, instead of having to worry about what Windows tool called the Local Policy Editor you have to do to go turn on the privilege, we just give you a little checkbox. So you’re running setup and you’re actually configuring the service accounts, you just click this checkbox and here all of a sudden now we’ve assigned the SQL database engine account to that privilege. It exists also in the silent install as a switch for you to turn on in the configuration file. The new installer for Developer, Evaluation, and Enterprise Editions will turn this feature on actually by default.

If you’ve never looked at the new installer, I encourage you to do that. You can install now Developer Edition like in three clicks where you couldn’t before, it’s 200 percent faster, easy for you to test. You can take a 10 gig database on your laptop, you cannot run this privilege and run this privilege and see the performance differences. Again, restore and autogrow are ones you really care about.

The Persisted Log Buffer: Using New SSDs for Less WRITELOG Waits

“You know, that’s an interesting feature, Bob. We’ve heard about that before and it’s nice to setup out of that, but what about something cool?” Well, this one I think is pretty cool. It’s called persisted log buffer. It’s actually a good example of us not just taking advantage of existing hardware technologies but staying ahead of the game. In many scenarios, specifically like an OLTP application, the speed of running the log is really the bottleneck for the speed of the application.

If you think about the evolution of storage here, if you think about traditional hard drives and even early SSD drives, the latency or the transfer speed to these disks are thought in terms of milliseconds. You might take an older type of drive and say, oh yeah, it typically takes about 10 milliseconds to seat the drive, to transfer bytes to the drive. The older SSDs, and some of them are like 1 or 2 millisecond type timeframe, right? So if you think about this new technology of PCI SSDs, in fact, my laptop has one of these. If you’ve never seen it before, I’ve opened up my laptop to look at it, it’s a little memory card looking device. It looks like memory chips. It’s actually a drive. In many cases, the latency for these are less than a millisecond, that’s what that symbol is, that’s microseconds.

You know, that’s interesting, but if you’re really tired of writelog waits, if you actually are saying, look, I really want to maximize performance for any transaction log latency scenarios, there’s a new technology called NVDIMM. This symbol is nanoseconds and in fact, if you go look a little bit of searching on the internet, this is also referred to as persistent memory. This is also a technology that in the hardware level almost looks like memory but really is persisted storage with battery backup and the whole type of thing. Now today this technology is smaller in size. The ones you’ve seen by some of the manufacturers out there, and you can look for this, are like 8 gig in size. But I think you’re going to see this get bigger.

Persisted log buffer

So we saw this technology, we said, “How can we take advantage of this technology?” I mean 8 gigs is not a lot, right? But we thought about the transaction log and how it exists in memory and said, you know what, we can take advantage of this technology. So the Windows Server team in 2016 also recognized this technology. You can actually configure one of these as a drive, just as a drive letter on your machine.

Normally with Windows Server 16, you can just support it as a normal drive and you’ll get a performance boost but there’s a new interface called DirectAccess or DAX with Windows 16 that can recognize these drives. Here’s kind of how it works. You take one of these drives that you configure with Windows and you format the NTFS volume with a special parameter called /dax. Well SQL Server in SP1 for 16 will recognize this. So if you’ll take a second transaction log file, you build your normal transaction log file on your normal drive, maybe it’s a PCI drive, maybe it’s some sort of SSD of a certain size. That’s your transaction log. Maybe it’s 100 gig, I don’t know. This smaller drive you format as a second log file on it, now SQL Server automatically recognizes it as NVDIMM drive or one of these DAX formatted drives.

Now the tail of the log, which is the active log in cache can now be a memory copy to make your commits very fast. So effectively, we can take a look at the active log that we’re going to flush to disk and we say, you know what, we’re just going to copy that to one of these NVDIMM drives, which is like a memory copy. We’re going to let your commit proceed. Then lazily behind the scenes, we’ll take that data from the smaller tail of the log drive and we’ll flush it to your normal transaction log drive. So in a sense, it’s persisted and it’s guaranteed to survive any kind of restart of the machine. You’re never inconsistent and when we restart we go look for any files on that drive and we copy it to the normal transaction log. Now, writelog waits become zero.

In fact, it’s kind of a funny story. When we’re looking at these benchmarks here, we’re running these things, we’re looking at dm_exec_requests and we’re seeing a wait type of writelog and a wait time of zero. Now that’s not like a bug. That’s just because the writelog waits are less than a millisecond. So all of a sudden we’ve got to reevaluate our DMVs now and decide a millisecond is not the new standard now, right, for these things. So there’s some interesting videos for you to take a look at the actually demos these technologies. It’s now available in SP1, fully supported for you to use. And Kevin Farley has done a really nice job of a really gory detail blogpost to show you visuals of how we actually take advantage and use that technology. You know, we can get 2x speeds using this over even this fast new PCI SSD type technology. So cool example of us taking advantage of new functionality, new technology inside hardware, and staying kind of ahead of the game.

If you think about a really good scenario here is in-memory OLTP. In-memory OLTP as you know is in memory, right? But, we still have to write logs to the transaction log. So if you’re looking to maximize performance for a scenario using that type of feature in SQL Server, using this technology is something to certainly take a look at.

Batch Processing for Windowing Functions

So this is something new I added to the talk that I haven’t done before. We’ll talk a little bit about query processing here and the fact that window functions are going batch now in SQL 16. I’ve hijacked the Microsoft Windows symbol here, not because this has anything to do with the Windows operating system, because it has to do with window functions.

Batch mode for windowing functions

Now I have to admit to you that before I did this slide, I wasn’t a big person who knew deeply about window functions, but I’m a huge fan now after looking at this technology. In fact, I’ve given you a pointer here to Itzik Ben-Gan’s blogpost about what this technology is. If you’re somebody that does a lot of time doing aggregation or doing a lot of analytic type queries and you’re struggling with all these sub queries and self joins and limitations of groupby, you’ve got to take a look at these T-SQL type functionality. It’s something that could be very powerful for you.

We introduced these functions back in 2012 even and we’ve iterated on them over the releases. Now, starting in a previous release with columnstore technology, we introduced this concept called batch mode. If you think about SQL Server, it’s a very row-oriented type technology. But there are many scenarios where we discovered as we’re doing query processing where we should perform things at the batch level, not the row level. Why not take a batch of 900 rows and perform an operation against it versus a row-by-row type thinking. We’ve seen great performance benefits for this batch mode in columnstore since we started doing that.

So now we kind of combine the two worlds. When you’re using these window functions with SQL Server, let’s take a look and use batch mode on top of that. This is an example of what a window function query could look like, this over clause is key to this entire scenario, where you’re providing a partitioning scheme and you’re providing what’s called a frame type of syntax here and then providing an aggregation on top of that window function.

Demoing Batch Processing of Windowing Functions

After going through the demo I’m going to show you in a second, I think I’m going to put a blogpost together called like “Window Functions for the Rest of Us.” Maybe Itzik can explain more what those look like but after going through and breaking down how this works, it’s kind of quite clever how these work.

On the right here is an interesting chart. This is the chart we’ve seen from scalability using this technology. With window function and aggregation and other type of window functions, the red line represents the row processing speed and the blue line represents the batch mode speed. You can see here, as parallelism has increased how scalable we can get across the board.

Let’s go take a look at one of these things. Now, when you look at these queries here, you’re going to say, hey man, there’s no way I understand just from your demo here what this all looks like. These are actually two CTEs, common table expressions, you’ve got a with clause here and another here and a select using it. So the second CTE uses the first CTE and the select uses the second CTE. But here’s the use of these functions, here’s the lag function and here’s an over clause. Here’s a sum function and an over clause. What this demo is showing you, it’s kind of simulating clicks on a website, like trying to do some analysis of how many times people have clicked in certain intervals of time for what we call a session for a given IP address.

Execution plan showing batch mode processing

What I’ve got here in the demo you can see, there’s two tables. One called click stream row and one over here called just click stream. I’ll kick off the click stream row query and then I’ll click off the click stream normal query. The difference between the two tables is very simple. One has the columnstore index and one doesn’t. Click stream has a columnstore index, click stream row here has not a columnstore index. If your run these particular queries, you’ll observe their performance. In fact, the click stream one with the columnstore index will come back a lot quicker than the click stream row query. When you take a look at the results of these, like this one came back here at about 19 seconds. This one is going to take longer.

When you go take a look at the operators here, this is the key to this whole technology here. We’ve introduced a new operator in plans that you’ll recognize called window aggregate. You’ll see here the estimated execution mode is called batch. So whenever you see this new window aggregate operator, this is an example where we’ve taken these window functions that you may use in your queries and we’ve implemented this new batch mode when you have a columnstore index. You can see here, it says columnstore index scan as part of the query. If you come back here, you can see here, this took like 46 seconds to run. The only difference between these two tables is the columnstore index. That’s it.

Linux? You’re Soaking In It!

Just to tell you how serious we are about our technology here, if I go over here and run @@version against this server, you’ll notice here this is vNext, but not just vNext, this is vNext on Linux. So I’m running these queries on Linux VM on this laptop and just to know you how I built this demo, I took the demo scripts that I had on SQL 16 on Windows, I propped up a Linux VM in about 15 minutes, that’s how long it took me. I installed the databases, I ran the queries, and that was it. That’s all it took.

Surprise! We’re showing vNext on Linux.

So in 30 minutes, roughly, I had a Linux demo running with no changes at all to the SQL syntax or to the tables. So that’s how serious we are about [inaudible] Linux. Things can work just as they do on Windows. So that’s a quick little demo of windows aggregation. Take a look at window functions for analytic queries, very powerful technology.

The last topic we’re going to talk about here today and then we’re going to leave plenty of time for questions. Just to let you know, I’m going to let Brent moderate those questions. I’m sure there’s a ton of them on there but I’ll let him ask them. I love having dialogues with Brent anyways, so it will be fun to have that interaction.

Always On Availability Groups Improvements in 2016

Let’s talk about Always On Availability Groups and why I think they are turbocharged in 16. I think the Always On Availability Group design is pretty good. We started that work in 2012 as an enhancement we did in database mirroring. Back in those days, the hardware designs were being used in primary machines and even on secondary machines. It was pretty good but not lightning fast compared to today’s timeframe. So what we noticed in 2014 was customers coming to us saying, “Hey, here’s what I’m doing. I’ve got an Always On system I’m trying to prop up. I’m putting on a single sync replica as a secondary. All of a sudden, my overall transaction performance is tanking. This is not what I’m looking for. I need high availability, which is great, I’d love to have that but I can’t have performance go that bad on my primary just because I have a single sync replica propped up.”

So we started taking a hard look at some of these customers and saying, “Okay, what is wrong here? Is it an I/O bottleneck? What’s going on?” The customer is saying, “Wait a minute now, I purchased the fastest secondary machine I could find. I put a really fast disk on there. I’ve got a high-speed net connection between the two. I’ve got a multi-core machine on there. What else do I have to do to make this fast?” So we first said, okay, it’s got to be a hardware problem and then we took it back and said maybe it’s not. So we took a look at the architecture of Always On Availability and said there’s some bottlenecks in our code here. There’s some problems we need to take a look at.

There’s a couple other factors that we looked at as part of this. In-memory OLTP as we talked about transaction log performance is king here to making that fast. So if you want to put that into an HA scenario with AGs, you’ve got to make that really fast. Then if you didn’t know this or not, the replication technologies you see behind the scenes like GO replication for Azure SQL database, it uses the Always On Availability Group technology behind the scenes. So those guys were saying, “Hey, this has got to be fast. You’ve got to make it really really good.” So you can see here there’s a need to take a look at making our code and making the algorithm’s design better.

AG Goal: 95% as Fast as Standalone Servers

We created a couple of mantras. We said, hey, we want to be 95 percent as fast as standalone speed doing our benchmarks with a single sync replica. Well, what is standalone mean? Standalone means I don’t even have an AG, so I just prop up a machine, I run a transaction workload and a benchmark. Then if I add AG and a sync replica to that mix, I want to be 95 percent as fast as that original scenario. HADR sync commit is a wait type that you can see when there’s a latency and thus trying to actually harden the log on a secondary. So we wanted to make sure that some of those latencies we see, at least on medium type workloads, would be less than a millisecond. Those are benchmarks we internally created for our own team. We said we need to meet those to make sure we’re keeping up with today’s technology. So this is the kind of stuff we did.

This is an interesting one, the first one. We literally white boarded out the entire design. We said, “What is the number of context switches for threads that have to take place to start a transaction on the primary, ship it over to the secondary, and bring it back?” It took 15 worker thread context switches. We said, okay, we can make that better. We took it down to 8. It takes 10 with encryption, a couple extra.

We improved the communication path. The first one is an interesting point. The logwriter is the background thread submitting these changes to disk. So we said, hey, if that secondary can keep up really quickly, instead of farming off the logwriter work to a separate set of threads, to go ship it to the secondary, let’s let logwriter submit it directly to the secondary. Now that’s only going to work if this is really fast, right? Because we don’t want to keep the logwriter from doing other work on the primary. Another example of us streamlining the process.

Other things we did, a lot of things in parallel you can see here, right? Streaming log blocks in parallel. Multiple logwriters on primary and secondary. I mentioned that as a scenario earlier on in the talk. Parallel log redo, something we’re adding on to these secondaries because it’s doing continuous redo. Then just us taking a look at the code and saying we don’t need a spinlock here, this algorithm could be better, a lot of it using benchmarks and using things like xevent, xperf, other technologies behind the scenes to see where in the code could improvements be made. All these things taking place. This is kind of the results.

Always On AG Benchmark Results

I did a blogpost on this called Always On Turbo Charge where we publicly listed these results here. This is an OLTP workload based on our benchmarks that we’ve done internally here. Let me explain how this graph works here. The number of users are here on the x axis. The log throughput on the primary, which is a way to measure the overall throughput of the scenario in the workload is here on the y axis. The blue line represents the standalone workload. So you can see here the performance of a throughput, it does bottom off but as we get to a certain number of users on our test, you can see how the blue line on the standalone works. The yellow line represents the throughput for 2014. Not good. You see the difference between the yellow line here and the blue line, that’s not what we’re looking for here. The orange line represents SQL 16. If you actually take these numbers, I did it myself before I published this, this difference is about 5 percent. So we are achieving now this 95 percent goal of standalone speed. The grey line represents with encryption.

The results here, a single sync replica at 95 percent, two replicas 90 percent, with encryption 90 percent, and two replicas 85 percent. These latencies were down to less than a millisecond. This is the specs of the machines we used, both for the primary and the secondary. Yeah, they’re fast machines when you look here, but they’re not just like the most expensive machines on the planet to build. So, yeah, somebody asked me when I did this before, “Can I get that kind of speed?” My response is, “Sure.” I think you can.

The key I think is the secondary. A lot of people build primary machines out in these AG scenarios that are pretty quick but typically in the secondary scenarios, if they’re using it as a sync replica, they’re not thinking in terms of a fast machine. But if your goal is to achieve and maintain HA but keep the primary transactions at a high speed rate, then you need to make an investment on that secondary machine. Now if you’re doing async replicas for read only purposes and other scenarios, those don’t have to be near as fast obviously, right? But if you have the right hardware in place, our design we think can keep up with the pace of what hardware you’ve built.

Recap, And What to Do if 2016 Isn’t Faster

That’s it. There’s a lot of stuff I talked about today. But here’s other stuff we’ve blogged about. These are things that are pointed directly onto our BobSQL blog and here’s the other stuff we need to brag—I mean blog about. Here’s other things within the engine that I haven’t even talked about today and we haven’t even blogged about yet but we know from our notes and our interviews we did with the product team, there are other things that we’re going to introduce and we’re going to keep that blog series alive by adding new types of things.

I’m really serious about trying to get feedback from people in scenarios where 16 is not faster. One of the things I do internally at Microsoft since I talk a lot on this topic is I pay attention to things where people report, “I moved to 16 but this isn’t as fast.” One scenario that I’ve seen and you may need to be aware of is the fact that some people are moving from SQL 2012 or 2008 to 16 and we’re using this new cardinality estimation model that we have for query processing, that is one scenario where I’ve seen customers struggle some and there is compatibility levels to help offset any issues you have there and technologies like Query Store to go look at and observe the differences between those. So I’m trying to do a little more investigation there to see how much do I need to get out to the community about scenarios where that may not be faster for you to just be objective about the fact that not every single thing you’re going to do is going to be fast but I think I’ve given you a lot of examples today where things are faster.

Here’s some research for you, if you look at this moniker aka.ms/SQL2016faster, that actually points to every blogpost that we’ve tagged and that were going to tag that deal with this topic. The SQLCAT team has some really cool blogposts about things in SQL 16 that are really interesting. Some are about performance, some not. Some overlap with some of the things we’ve talked about. Then this is just a pointer to what’s new in the database engine. I’m sure you have some questions.

Questions and Answers

Brent Ozar: They do.

Bob Ward: This is my email. This is my Twitter handle, and this is a hashtag I’ve been using for a lot of my tweets. This is a pointer to our new SQL called BobSQL, you can just search for that on the internet. Then behind the scenes here, I’ll go back to this slide, this is all the different bonus material slides for details about things that we didn’t talk about today. But, hey, Brent.

Brent Ozar: Yes.

Bob Ward: How much can I pour into a session for you, man?

Brent Ozar: That’s fantastic. You know, it’s great, technical, lots of people are commenting over on Twitter things that they’d seen out there, that is very slick. Two people, Dianne and Dan both said, “Love the shirt. Go Cowboys.”

Bob Ward: That’s probably the best comment I’m going to get the entire day today.

Brent Ozar: Dexter asks, on tempdb file growths he said, “Why the 1 gig limit on file size during setup?”

Bob Ward: That’s a good question. I’ve had that question before. I’m looking into that one with our team that runs the setup team. I think the reason why by the way, Dexter, is that we don’t use instant file initialization until you’ve installed, so I think what I’ve heard before is during the setup process instant file initialization has not kicked in so we were worried about the time it takes to run setup with a very large database size for tempdb. I believe that’s the technical details, but that’s one I’m still trying to follow up on. I’ve heard that complaint before.

Brent Ozar: Josh asks, “When you change the service account for SQL Server, do you have to manually perform that—perfmon instant file—”

Bob Ward: I know, Brent, isn’t it crazy, nobody remembers… The answer is yes. If you’re going to change the service account, you must go into what’s called the local policy editor and make that change. I need to just post a blogpost on the PowerShell script to automate it, right? I’m so tired of figuring it out. I myself have to remember, “What was the name of that app I have to go in and run?” But, yes, if you change the account after you run setup you must go in and make the change yourself.

Brent Ozar: Matan Yungman asks, “Just out of curiosity, why was the number 900 chosen?” I think he’s thinking back to the windowing functions demo. I can’t remember where the 900 was.

Bob Ward: 900. Yeah, I don’t remember where that was in the demo. Maybe they can tell you more about the details. I don’t remember where 900 was used. Oh, I talked about batches. Yes, sorry. That was just an example. 900 is not a fixed batch number we use for batch mode processing. The batch mode is just going to vary depending on how much we think we can process. So I just picked 900 as an example but 900 is not a fixed number for how we do batch mode. I think that’s what he was talking about.

Brent Ozar: Yeah, batch mode vector size. Nico chimes in and says something about 912 rows as well.

Brent Ozar: Michale asks, “Are there any plans for making these batch mode operators for windowing functions even when there’s no columnstore index present on the table?”

Bob Ward: I knew I was going to get that question. I don’t know. Great question. I can go back and ask the team that has built that functionality. They’re the ones that actually we are talking about adding this to the deck, but I don’t know of any plans to do that but that’s a question I still need to go back and ask those guys. I don’t think there’s any plans right now that I’m aware of.

Brent Ozar: Mohammad asks, “Regarding CHECKDB being faster, are there any real world production examples or stats that we can read about?” Man, I would push that back on you, Mohammad, go get yourself a dev server. Now that 2016 is free, man, it’s so easy to go download and install.

Bob Ward: I want to hear about it.

Brent Ozar: Yeah, it’s crazy.

Bob Ward: There’s nothing out that I’m aware of. In fact, I think—there may be, a lot of the guys on the CAT team that deal with the SAP software suite, they’re a lot of the guys that help in many cases look at these. They may be posting something on their test using SAP databases but I’m not aware of a specific one that’s posted out there.

Brent Ozar: Alexander says, “Hi, Bob, from Brazil. When will SQL Server implement parallel rollback?” Oh, wow.

Bob Ward: Here’s the funny thing for you. If you go look closely at my xevent session, there’s an actual line that says “parallel rollback enabled.”

Brent Ozar: Really?

Bob Ward: It is, it’s in there, but that’s a little bit of a faux pas, because it’s not enabled. It’s an [inaudible] we added to this trace because we were taking a look at that. So I’m not aware of the details of when we might implement something like that but if you see that in the xevent trace, it doesn’t mean it actually is enabled, but there was some thought when we first started looking at this about doing that, but that doesn’t exist today.

Brent Ozar: Okay, cool. John says, “Where can I download the slides?” You can go right on the URL there on the screen. Jim says, “How much coffee did Bob have this morning? He’s a fast talker and I’m only on the first cup.” JD asks an interesting question. “For SSIS, do you know about any improvements in performance for SSIS in 2016?”

Bob Ward: That’s a great question, JD. I’m not aware of any performance improvements. By the way, quick answer, it’s my third cup. But I’m not aware of any, specific on performance, I’m not aware of any enhancements for SSIS.

Brent Ozar: Sam says, “Would you still recommend using domain accounts for the SQL Server service account or if I use the NT service accounts, can I get instant file initialization with those?”

Bob Ward: The security choices you make for domain accounts or service accounts is independent of performance, right? So there’s all sorts of security considerations when you use domains but as far as the instant file initialization answer goes, Windows admins are the only ones by default that get that privilege. So anything else would have to be added yourself.

Brent Ozar: JD says he’s an army reserve chief warrant officer and there’s no such thing as too much coffee.

Questions, answers, and coffee

Bob Ward: I did go pretty fast this morning I have to admit. It’s pretty early for me and I did kind of crank it up a little bit on the speed. I apologize for that. Hopefully everybody got the—that’s why you have the recording, Brent.

Brent Ozar: Exactly. That is exactly what I was going to say and you’re going to give our transcriptionist a workout as well, yes.

Bob Ward: That’s right, exactly, probably.

Brent Ozar: All right. I want to thank Bob for volunteering to speak here today. He totally came here out of his own loving heart to help you guys out and teach you about SQL Server 2016. I would have everybody give him a round of applause but of course he can’t hear you.

Bob Ward: Golf clap, maybe?

Brent Ozar: Golf clap, yes, in front of the webcam. You can follow him over on Twitter over at @bobwardms and thanks a lot for volunteering today, Bob. I really appreciate it.

Bob Ward: Love being here, Brent. Thanks.

The following two tabs change content below.
Bob Ward is a Principal Architect for the Microsoft Data Group (Tiger Team) which owns the development and servicing for all SQL Server versions. Bob has worked for Microsoft for 23 years supporting and speaking on every version of SQL Server shipped from OS/2 1.1 to SQL Server 2016. He has worked in customer support as a principal escalation engineer and Chief Technology Officer (CTO) interacting with some of the largest SQL Server deployments in the world. Bob is a well-known speaker on SQL Server often presenting talks on internals and troubleshooting at events such as SQL PASS Summit, SQLBits, SQLIntersection, and Microsoft Ignite. You can find him on twitter at @bobwardms or read his blog at https://blogs.msdn.microsoft.com/bobsql.

Latest posts by Bob Ward (see all)

, ,
Previous Post
SQL Server Health Checks
Next Post
SQL Server and PowerShell: Let’s Get Serious

5 Comments. Leave new

Bob really fills the session with content – I will want to watch this more than once to let it sink in. I am also struggling to find his deck and demos, having hit his blog and the site here.


Very good presentation.


Great presentation…do you have other presentations where you dig down into the engine internals more?


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.