This is intended for those who want to dive deep inside internals of In-Memory OLTP
Based on the successful 500 Level talk at PASS 2016, this session will provide an internals view of how In-Memory OLTP works for SQL Server 2016 and 2017. Because this session is Advanced, it is intended for those that understand the basic fundamentals of In-Memory OLTP but want to learn how the “Hekaton” engine works behind the scenes to deliver the amazing 30x performance improvements seen for customers moving to In-Memory OLTP. This session will include a look behind the scenes at threads, data and index design, transactions and concurrency, logging, storage, and natively compiled procedures.
I’m a big believer in visuals and demos so you will see plenty of that as I describe how the Hekaton engine is truly lock-free and latch-free. And of course since this is an advanced talk, the Windows Debugger will definitely make an appearance during the session.
And since this will be an advanced level session, of course the demonstration of the looking at the Hekaton engine with the debugger is a must.
Enjoy the Podcast?
Why I Want to Present This Session:
I feel that by learning how things work you can better learn how to use them and react when problems occur. Plus its just plain fun to learn how things work and understand how to use the Windows Debugger.
Brent Ozar: Well, in this session at GroupBy, Bob Ward will be explaining what’s inside SQL Server’s in-memory OLTP. So take it away Bob.
Bob Ward: Hey, thanks, Brent. I just want to tell everybody, thanks for attending these sessions. When I first saw Brent post this, I think Brent was like late last year that you were saying you were going to do this. I was kind of excited because as much as I enjoy travelling and going to different conferences, it’s pretty cool to sit here in my office in Irving, Texas, and be able to reach a lot of people at a large scale, but still show off the same type of things I show at a conference when I travel. So thank you, Brent, for organizing these events. It’s cool stuff.
Brent Ozar: My pleasure.
Bob Ward: So we’re here to talk about in-memory OLTP today and by the way, I’m not going to use that word for 90 minutes because who can say in-memory OLTP three times really fast? I get tired of using that term and I wish we had not coined it that because this technology is so much more than just OLTP, and we’re going to talk about that today.
But the reason I picked it and the reason I built some slides on this and I’ve been talking about this topic for almost a year now is I was so fascinated how we built this technology. The amazing engineering that went behind creating really high-speed transaction data and just in technology was pretty amazing, and some of the smartest people we’ve got on our product team were behind this. I put this number, 1.2 million batch requests per second because that’s a number that one of our customers achieved using this technology. Step back and think about that for a second, that’s not trivial to get that kind of throughput from SQL Server.
A lot of people ask me for decks and demos and things of that nature, and I know Brent has a way for you to see all this stuff, but if you go right now to this URL, aka.ms/bobwardms, there’s a folder called Inside In-Memory OLTP, already out there right now, with this deck and a zip file of all the demos I’m going to show you today. So I’m a big believer in making sure that you get to see this information, and also this is actually based on a talk I did last year at the PASS Summit. That was one of those – I think people call it – brain-melting sessions, that was three hours. People actually sat through this for three hours, I still can’t believe it, where I had a lot more information than I’m showing you today. But what I did is I took this deck and I took that deck, because I’d made some modifications to it since then, and I’ve got what’s called a bonus material section at the back. So if there was some content that I covered back then at the talk last year, I’ve put them in the bonus material so you won’t be missing that.
And finally, I love talking about SQL Server. So email@example.com, I can’t promise you that I would respond to every single email within minutes, but I do like questions about this technology. If you go through this yourself and try it and say, “This isn’t behaving the way I expected”, send me a note. @bobwardms is my Twitter handle, I am not a social media giant like my friend Adam Saxton, but I do get on Twitter some and talk a little bit about these technologies and retweet others about technologies. And finally, #bobsql, my colleague Robert Door, his name is Bob Door, Bob, or Bob Door, we just call ourselves Bob SQL. Bob works on the Linux project for us for SQL Server. We do have a blog called Bob SQL. We post some very interesting stuff about our product and if you just look for #bobsql, you’ll see a lot of good information about that.
So let’s kick this off. You know, here’s what we’re kind of going to cover. We’re going to cover a little bit of what is this technology. I know that Ned Otter did a talk last week on this, which I’m so happy. Ned is one of our champions in the community and I love reading his posts about this technology. I want to talk a little bit more about behind the scenes though, why we built this technology, what it is, give you my perspective on why is this different than just using normal SQL Server, and then we’re going to go on to the internals a little bit. We’re going to do some demos, we’re going to bring up the debugger, what is an inside session without using our Windows debugger, but those are the kind of topics we’re going to cover. The management and diagnostics section, this is a 90-minute thing we’re going to go through. I may spend a little less time there because I want to make sure at the very end I give you my perspective on some of the learnings I’ve done since I started these sessions back a year ago.
But hey, if you go through this and you didn’t see Ned’s talk last week or you haven’t even studied this technology, I’m a big believer in giving you reference material. So when you get this deck, look for these little hyperlinks that I’ve got in my slides. If you’re just starting out, you’re like, “I don’t even know how to start with this stuff”, go watch this video where Jos De Bruijn, our lead PM in this technology talks about it, and he’s got some great case studies where you can see practical examples of people using this technology.
Finally, everything I’m going to talk about today is based on SQL Server 16 and 17, and what we’ve got running today in Azure database for this technology. 2014 as you’re going to see in a second is kind of where we started this journey, and so some of the things I’m talking about do apply there, but when you have to do internals and you look at the source and you have to really give details about how things work, you got to use a baseline, and so I use 16 and 17 as the basis for this.
Okay, let’s talk a little bit about the path real quick. When you’re my age, you love talking about history, I guess that’s what happens when you get older, you just want to talk about history, right? It’s funny, we were just joking here before we started about some of my fun CDs I’ve got from SQL 60 and 65 back in the day, so I do like talking about the history of SQL Server. This technology that we built is not something we just invented out of nowhere. There’s been research in the industry by folks like Michael Stonebreaker on how do you improve transactions in a way that’s high-speed, especially when you get into systems that have a lot more memory footprint. And back in 2007, with Project Verde, that’s what we found. We found that there was a trend in the industry where people, hardware manufacturers, and customers buying hardware, were starting to purchase machines with larger amounts of RAM that were cheaper. And so we were like, how do we take advantage of this, and how do we improve transaction processing in a way that does, and so thus was born really this initial research called Verde really turned into what’s called Hekaton today. That’s really the code name for this project, and the real earnest work started in the 2010 timeframe, and where it finally landed as a product was in 2014, and we called it in-memory OLTP.
But really, to me, it was really unleashed with 16 and some of the restrictions that we had in 14, some of the things that customers gave us feedback on about, “I’m really not sure this is really ready for primetime”, I really feel like we sured those up in our 16 release. You can see the variant, I’ve got some comments about our 17 release about some enhancement and tunings we’ve made to it.
If you see – and I’m going to use Hekaton probably throughout this session if that’s okay with you, because that’s easier to say, but if you see these acronyms, XTP and HK, we’re talking about this technology, and the reason I mention that is you’re going to see sometimes in error log entries, error codes, dynamic management views, these two acronyms can show up, so if you see them and you’re like, “What is that stuff?”, you’ll know what we’re talking about. XTP and HK both being related to this technology of in-memory OLTP, or Hekaton. So that’s kind of the path and the history behind this product, and we’ve had some really, really intelligent people throughout this timeframe spend a lot of time trying to make what it is today.
So what really is this? You know, people say it’s an in-memory table, that’s what I think I read a lot about people talk about, and a lot of people talk about in-memory technology with column store, and I don’t know if they’ve really seen and created the difference between the two of these. What we’re really talking about with Hekaton technology is optimized access to memory tables. I can take many workloads, or many tables in SQL Server and cache them, and based on my memory, keep them cached pretty much, and I can get fast access to that, but can you get super fast access? And so what we really tried to achieve is 30x performance gains, as much as that, in the transactions that are accessing these memory based tables. So when you think about Hekaton, yes it is all in-memory. That’s one of the requirements for it to make it fast, but it’s not just that. What I hope to show you today is what we’ve built on the engine to make it optimized to access that memory and why it’s different than accessing tables that have their pages in cache the entire time.
So a couple walk away thinkings for you, what we’ve built really is an engine within an engine, and what I mean by that is that we’ve built a series of DLLs that make it what we call Hekaton, that run in the context of the SQL process, interact with it, but really in its own engine. And we have this kind of state that we made, it’s lock, latch and spinlock free, and I’ve got some slides and some demos to kind of prove that we actually did it that way. So the engine itself, the Hekaton engine is lock, latch, and spinlock free, and that’s one of the ways we achieved this optimized access.
The tables themselves look and feel like normal tables, but internally completely different. I’m going to show you why that is today, and then we have this concept called native compiled stored procedures, that is something if you want the peak maximum 30x performance that we’re talking about, you would use this concept called the native compiled procedure, and I’ll talk about what that is.
And so you know, the reason I said it’s unfortunate that we called it in-memory OLTP is that I think it’s a lot more than OLTP. Sure, many customers that require low latency OLTP transactions love this technology, but consider some of these other scenarios. What about ETL? What about staging tables that you really can throw away, or if the server gets restarted you can easily rebuild them? You can use a concept in our Hekaton technology where you make them schema only, where there’s no durability to them and they’re extremely fast. IOT scenarios; we even have some discussions on our documentation about how to convert your use of tempdb to use this technology, and one of the more popular things that I’ve seen done recently, especially I saw MVPs, is take their table types and turn them into memory optimized with a very simple syntax change, and then also behind the scenes, we’re using this technology. And then if you think of the word cache, if there’s any time that you want to use SQL Server to cache something temporarily, maybe it’s session state information from an application, this could be a great target for you to use this technology.
But what is it really? What really is this technology? Well, first of all, it’s made of several components that are different than what you’re going to see with a normal SQL database. We use a normal SQL database as the context for this, but the number one thing you’re going to find out when you use this is you’re going to create a database, or alter a database to add something called a memory-optimized file group. This is a different filegroup than your normal SQL file groups, and one of the big differences you’ll find out is, you just give us a path or a folder where you want the filegroup, or you can give multiple paths so we can spread it around, and we’re going to write inside this file group a series of what we call checkpoint files. These checkpoint files are not the same as the standard SQL Server checkpointing process, and we’ll talking about that.
A couple things we do differently, number one, as I just pointed out, these files are independent of your database files, your MDF and LDF. We still need a MDF file even if you use a database just for Hekaton, because the schema of your actual tables and the initial text of your stored procedure we’re going to keep inside the MDF file, and we still need logging, transaction logging. We’re going to talk about how we use it, so we need an LDF file, but the separate checkpoint files will be in a separate folder that you’ll tell us when you create or alter the database.
Second of all, you’re going to then go through a process of creating what we call memory-optimized tables. A lot of times when people talk about this technology, they just use that term, memory optimized tables. It’s easier to say than in-memory OLTP. So it’s going to feel like a normal table, your syntax is going to look very similar, you’re just going to add on a keyword and say this is memory optimized, and when you do that, we’re going to kick in this optimized access to index and data and then have a totally different internal structure. And you’re going to have a choice to use a non-clustered index. Every memory-optimized table must have at least one index, but you’re going to have a new choice called a hash index, we’re going to talk a little bit about what that looks like.
So that’s two components. File group and these tables, and those are your optimized access to your data and in-memory. And then I mentioned before the ability to create a natively compiled T-SQL stored procedure, and what that looks like is you creating a SQL stored procedure with a special syntax, but the code inside the procedure looks like your procedure, T-SQL code. We are then going to – when you create this, we are going to build an actual dynamic link library, a DLL, and put it on disk and load it. And that’s going to contain code to tell us how to access and run your stored procedure. Think of taking like a query plan, and us optimizing the query plan by baking it into a DLL, with the actual code to do it, and you’re going to find out what that looks like. We’re going to step through a debugger session to see what one of those things actually looks like.
Now, all our other access to these memory-optimized tables, we call those interop queries, and what we mean by that is that you’re going to run a standard SQL statement; it could be a stored procedure, it could just an ad hoc query, and the SQL engine is going to run it, but it’s going to know that you have a memory-optimized table, and we’re going to call that an interop query, and it’s going to have to use something called a schema DLL, which is similar to this DLL for natively compiled procs, to understand how to map what the data looks like to access that. And I’ve got a pretty interesting diagram to show you the structure of what that looks like, and we’re also going to kind of step through one of those as well.
So three components so far I’ve talked about, and then finally, I said this was an engine within the engine, right? So if you look at the SQL architecture, here’s Bob’s crude SQL architecture drawing of the day. Obviously a lot more complicated than this, but if you think about query processing, storage engine in HA, major components within our code of the engine, living inside this entire process space are these three components. Our HK runtime, that knows how to run these natively compiled procedures, our compilation engine that knows how to build these things, and then the HK engine itself, which understands how to interact with the SQL engine to do things like insert data into memory, but then also go to the transaction log and write out any logging we need.
And it’s these three components that are latch, lock and spinlock free because that’s where the speed matters the most. When we actually have to access these tables, that’s when we have to be as fast as possible. If we need to go off and do HA, we can still let HA do its thing with our data, but we don’t have to worry about being latch, lock and spinlock free. If we go off and write to the transaction log, we’ll let log writer handle that. It’s not going to be latch, lock and spinlock free because we don’t need it to be.
And then finally, at the foundation of this is SQL OS. I’ve talked about this topic for many years before. We still use SQL OS, so all the HK components that need to create threads, worker threads, access memory, they’re going to use SQL OS, which is very important because we want to cooperate within this whole SQL process, when it’s threading mechanism and it’s memory mechanism. So really, if you think about what is Hekaton, this is it right here. It’s a file group with these checkpoint files for durability, memory-optimized tables for this optimized access, these DLLs that we’re going to compile, and then these runtime components which are going to turn out to be actual dynamic link libraries that run inside SQL Server.
Okay, so how is this different then? I mean, I kind of get a little bit what you’re talking about, but there’s got to be some things I should be aware of. Well, number one, all data is stored in-memory, but they’re separate from the buffer pool. So when you actually access normal SQL disk – we call them disk-based tables now – that’s a page structure that we store in the buffer pool. Any memory from memory-optimized tables are completely separate than that, and notice here, a couple of comments I’ve got here. That amount of memory could end up being two times the size of your actual data. That’s because of the concurrency model that I’m going to talk about that we use for this, plus we use deferred garbage collection for deleting – rows that you delete, and then we have overhead of our indexes as well to build.
So keep in mind that the memory requirements – so we’ve got some documentation to help you plan this out, but the memory requirements could be two times the size in which you normally store for your data. And then the default, because I have this question come up, like, “Well, do you just use all of memory and squeeze out the buffer pool?” And the answer is maybe. The default is a percentage of the target of your memory, like your max server memory, but you can control it. So we have resource governor capabilities where you can say, okay, just fashion off this 64GB of memory and that’s for my memory-optimized tables, and then the rest of memory is for buffer pool procedure cache, et cetera.
So because it’s memory optimized access data and it’s unlike our buffer pool, there is no paging to disk under memory pressure. So what would happen if you had memory pressure and we ran out of memory, you would get like an out of memory error. Literally, if you put too much data into Hekaton and you don’t have enough memory and we can’t expand it further, or you can’t expand it further, you would run out of memory. It’s like running out of disk space effectively, and you would have to remove some data, or find a way to add more memory to store the data that you need. So there’s no paging system like there is from the buffer pool system.
We use completely different internal structure. There is no pages at all in these deals access. We’re going to talk about what this internal structure actually looks like, and then I mentioned about this new index called a hash index. I’ll just make a quick comment, I’m not sure – like, I wasn’t able to attend Ned’s session last week but from my opinion and what I’ve seen before, if you’re new to this technology, don’t start with hash indexes. It gets a little bit complex in figuring that out. But if you really want to optimize single row operations, like you’ve got a transaction system where you’re going to insert, update, and delete data, and you have a very finite key that you access, these hash indexes if built right can be extremely fast and perform better than a non-clustered index. And so that’s really what optimized access is all about. It is using this memory based system and doing something completely different than the buffer pool, but all cooperating within the SQL process space.
What about I/O? So when you build a memory-optimized table, you have a choice. You’re going to say, “I want a schema and data based memory-optimized table”, that’s one choice. The other choice is schema only. So schema and data mean the following: even though it’s all in-memory, we’re going to back up your data, not a paging system under memory pressure, but we’re going to make sure if SQL gets restarted, you can reload the table. We will reload the table, and we do that by using a combination of the transaction log and these checkpoint files. That’s the main reason for having those. The checkpoint files become effectively the backing store of your schema and data memory-optimized tables.
But let’s say you’re in one of these caching scenarios where you don’t really care. You’re like, “I want maximum speed, and if SQL got restarted I can easily repopulate these things, so I don’t care if it’s backed up”, and so that’s where you use the schema only table concept, which means that we do optimized access in-memory to the tables, but we never even write them anywhere. There’s only persisted – they’re only kept at in-memory format.
Another thing we do from an I/O perspective that’s a little different from normal disk space tables is that we only log at commit time. There’s no undue process that happens with Hekaton. There’s no undue from a log recovery standpoint, so redo is the only thing we ever do when we try to recover the tables for Hekaton database. And in fact, what I mean here is not only did we only write the transaction log entries at commit time, we don’t even write them to transaction log cache at commit time. So there’s absolutely no logging happening to your transaction log until you commit your transaction.
There’s a separate background process that we use to actually write out these checkpoint file data for you, and of course we do that to speed up startup. And it’s a continuous I/O operation, and it is independent of your checkpoint process that runs for your database, or the background recovery writer for indirect checkpoints. So separate background processes, and I’ll show you some threads that look like that in a second.
And then disk speed can matter though. If you use schema and database tables and you’re going to go through durability process, when disk speed really becomes a factor for you for that file group is database startup, because the database startup, what we’re going to do is we’re going to go through these checkpoint files and we’re going to load all that into memory, and then look at the transaction log and apply anything that didn’t get – that was changed, that wasn’t in those checkpoint files. So what we’ve seen from customers is if they don’t put their file groups on fast disks, database startup is pretty slow, and we have to load everything into memory. If you notice this comment here, all data loaded into memory are database startup for memory-optimized tables.
And then finally, another reason why we achieve speed here is concurrency. So if you’ve not heard of the term optimistic versus pessimistic concurrency, SQL Server by default kind of has a pessimistic model. In other words, we assume that you’re going to get blocked. So we use this – we assume there’s going to be conflict and you’re going to get blocked. That’s pessimistic, right? Optimistic says, I’m going to assume that I never get blocked, and so what happens is we don’t block. We don’t block these transactions between memory-optimized tables, but what’s going to happen is – which you’ve not seen before is you’re going to have to have conflicts that might occur between transactions. So you’re going to have some error handling in retry logic, which you may not have had on your application before.
And then finally, one of the ways we achieved this optimistic is we use versioning. We use a multi-version concurrency model for this optimistic model, but the versions are not stored in tempdb like versioning for SQL Server tables. We actually keep the versions in line with the actual data itself, and we use that for update purposes for versioning, and we use the deferred garbage collection model not only for deletes, but for any versions from updates that are considered stale and not being used.
So that’s the difference, right? Different memory structure, all within the engine. It all plays within max server memory, but it’s different than buffer pool. A different I/O model, except we do use the transaction log model from SQL Server, and then we use a different concurrency mechanism which has ramifications for your application itself. Too fast? Too slow? Do we need to like speed this up? Any questions so far guys?
Daniel Hutmacher: Not really that I can see. They’re talking about cars.
Bob Ward: They’re talking about cars? Okay, nobody really cares, like I don’t care about this. I know – they’re like, guys, I already know all this stuff. So if anybody in the phone knows all this stuff, you know, I’d love to talk to you because I should have you present this other sessions.
So one thing I’ve had customers get confused by is speed. Like, they go out and they will set up one of these systems and then they’ll complain that they didn’t get the speed that they thought, and then it’s because something else in the lifecycle of their query was the problem.
Like, we don’t speed up client-server communication for this thing, right? It’s only about this native compiled procedure execution and data access that it matters. So you’re going to be pressed to go take a single insert, one insert against a disk space table and compare it to Hekaton and go, “Hey, how come it wasn’t faster?” I mean, that’s not a good comparison, or if you have a client application with a web server and you prop this up but your bottleneck’s on the network side, you know, you’re not going to see any speed, right? So make sure when you take a look at how this is faster, you’re actually taking a look at the optimized access part of this model and that it’s a concurrent workload.
You’re not going to find again, a single use workload that’s really going to speed up tremendously from this, but we’re going to show you a demo in a second that when you prop up multiple users you can see a big difference. Also before we do a demo, let me just talk about some of the threading model a little bit about what’s behind the engine. In fact, it’s so different that we actually have a separate DMV called dm_xtp_threads that if you go look at it, we’re using the standard SQL OS worker thread model, we’re borrowing threads from overall SQL OS and SQL Server, but we dedicate them specifically to Hekaton. And the first type you’ll see is called user, and that’s going to be just the normal user transaction you’ll see in dm.exec_request but if you see them separately in this DMV, they just mean user transactions.
But this is where it gets interesting. You’re going to see some threads show up in dm.exec_request or dm.os_workers, and you’re not going to know what they mean, but this DMV can help tell you what they are. So for each database, you’re going to have a series of threads. You’re going to have something called a checkpoint controller, a checkpoint close worker thread, and then multiple log flush workers depending on how many nodes you have, and this is the one actually doing the I/O. When I say checkpoint I/O, I mean the one doing the actual writing those checkpoint files.
Unfortunately, we called them log flush threads, but they’re not going to the transaction log. The log writer thread is still doing all our log work, but these are the threads per database doing that actual I/O work. We also have per numa node, a garbage collection thread, so we dedicate a background worker to do garbage collection. We still do some in-line garbage collection if we notice that there’s deferred records that we’ve got to delete. So some worker threads user transactions can do some garbage collection. Not totally unlike SQL, but we dedicate these per node to try and make sure we’re spread across nodes and their job in life is to go clean up any deleted or old version records from updates.
And then finally, we do something called an in-memory thread pool. So each database has these dedicated threads, but for the entire server, we’re going to build a – we’re actually going to dedicate a series of worker threads that are going to cooperate with each database that may be on the system to go through the process of doing one of these checkpoint operations. And one of the biggest takeaways from this slide about this architecture is the following: if you’re used to cheating by running a query saying session ID greater than 50, which I do all the time, I cheat, I do it all the time; get ready because as soon as you install a Hekaton database, you’re going to notice multiple of these threads are going to have a session ID greater than 50. So unfortunately – and they’re going to be called like task manager or something, or XTP offline checkpoint or something, so unfortunately, using the cheating method of, hey, only user sessions are greater than 50, which couple versions ago you’ll have been told not to do that, that really shows up when you look at this threading model.
So when I said that there’s an engine within an engine, what I meant was is that yes, there’s optimized access to these transactions to these DLLs, but in those DLLs are these background threads that do work behind the scenes to support things like this checkpointing operation. Now, what’s important for you to know is if anybody gets on and they’re really smart about debugging SQL and they go look at a log flush thread for XTP, they say, “Hey, I caught you Bob, there’s a latch this thing is using”, you’re right. We do have latching and concurrency for these background threads, but when it comes to user transactions running in the Hekaton kernel, we don’t at all. So when talking about latch, lock, and spinlock free, it’s only for the portion that we actually access the user transactions that really where we – and that’s all we care about. These background threads are important but it’s not that important that they’re latch, lock, and spinlock free.
Okay, let’s do a demo. But I’ll ask my hosts – I’m going to put pressure on my hosts, right, so what does this mean? Do you guys know what this little symbol is? Anybody know?
Daniel Hutmacher: I don’t read Greek.
Bob Ward: Well, I’m impressed you know it’s Greek though. That’s Greek for…
Daniel Hutmacher: Hekaton, right?
Bob Ward: Yes, it’s Hekaton, and that means a hundred. So the reason why we created the name Hekaton for our project is our original goal was to be a hundred times faster for transactions. We’ve only achieved 30 at this point, but that’s the impetus behind the codename Hekaton. But let’s do a fun little demo. So what I’ve got here, so James, Daniel, can you see my screen still okay?
Daniel Hutmacher: Yes.
Bob Ward: Okay, great. So I’ve got perfmon up and here’s what I’m tracking. I’m tracking processor time, batch requests per second, non-page latch and page latch waits because I want to show you the difference between disk-based and memory optimized based when it comes to latching and throughput. And these demos are all available to you. I’ll just quickly kind of just show you real quick here a couple of databases I’ve got. Being the sports fan that I am, I’ve got a database called gocowboys, let’s get it on, Cowboys this Sunday night against those vaunted Giants. And if you’ll notice here, here is the new syntax that you would use to create a memory-optimized database.
So I’ve got the same database called goeagles, as gocowboys, except it’s not using memory-optimized tables, and then if you go look at the table itself, I’ll just quickly show you that syntax, it looks like this, and basically with memory optimized and our durability option, that’s how you specify one of these. And if you notice here, I’ve got another one called supercowboys, which is going to use schema only, and I’ll show you that in a second.
So I’ve got that going, I’ve also got a native proc, I’ll show you what one of those look like real quick. This is the native procedure, and here’s like the syntax, with_native_compilation, and you have a couple other keywords you have to use, but inside this you’ll notice that what I’m doing is just basically a loop of a hundred times of inserting rows into this table. Very simple insert, right? Hundred inserts in a transaction, and I’ve got the same procedure for eagles except it’s not a native compiled procedure.
So let’s just go see how well these perform against each other. I’ve got – I’m using this program that we use all time called OStress, you can download this for free on the internet, and I’ll go kickoff a goeagles workload for eight users and I put my server name. This is just a command script I’ve got, and then I’m going to run it.
Then I go to perfmon over here, I’m going to see the behavior. If you’ll see, batch requests per second or something about 300 and you can see there’s some waiting here, right? There’s some non-page latch waits and page latch waits and this finished in about five seconds. That’s not bad. Eight users, all running those 100 inserts in five seconds. Let’s see how the cowboys perform using the same structure, except it’s memory optimized, and it’s not very big but it’s an interesting little test. So I run this, and I go over here, and if you’ll notice, I got up to 1500 batch requests per second as opposed to 300, and it finished in less than a second. So that’s interesting, but what if I push this a little bit. Let’s go to the eagles and push them to 25 and see how they perform.
By the way, I have an eight-processor laptop that I’m running on, and if you’ll notice here, batch request per second don’t get all that exciting. In fact, it doesn’t get much higher at all, in fact, it’s a little ragged, and my waiting goes way up. So I’m not hitting any disk here except for the transaction log, but you should know that the log for this laptop is a PCINVM e-card where I can get write log waits less than a millisecond. Like, I can get really fast transaction logging on this laptop. But still, because of the latching between this, I didn’t achieve very high performance here.
Let’s see how the Cowboys perform with 25 users. They did about 1500 before, I think that’s what we saw. Let’s see, let’s go to 25 here. I know you notice very quickly that those guys got up to 2300 batch requests per second, and the CPU was up way high. Now, why was the CPU high? Because everything was in-memory. All access was in-memory, now, this was all in-memory too though, so why wasn’t CPU high in the disk space scenario? That’s because there was a concurrency problem with latching. But I have no latching here, like zero latching. You saw there’s no waits, so it’s completely latch-free, I can peg the CPU as high as I want to go.
Let’s try the supercowboys. That was the one that was like, no transaction logging. Because of the fact that this is a very fast log, I don’t know how much faster I’m going to get, I’m just going to try 100. I’ve never tried this before, let’s see what happens. Let’s see, so it went up to about 3300 if you’ll notice using just a schema only, and it ran in seven seconds. Just to let you know, this is a great little demo you could do on a multi-cored box. I could actually make that run about ten times faster if I had more cores. I completely hit CPU at 100%, so I can’t go any further because I’m just bottlenecked by CPU running 100, but if I’m running on 100 cores, I guarantee you I could really knock that number out of the park.
Now, just to – and by the way, you have all these demos, you can run this stuff yourself and see it and serve it yourself because I’m just using T-SQL and this OStress tool. But just to kind of show you, I did an actual x event tracing of latches and locks between these two environments, and I just rather than run it, I’m just going to show you that you can look through looking at lock and latch required for this eagles proc, and you’ll notice all the different things that have to take place like page latching, other types of latches, other locks that have to take place, to run those 100 inserts, right, with multiple users. But if you go look at – and this was a single user by the way. I took a single user of that and said what all happens from an x event perspective, but go look at the Cowboys. This is it.
Now, this blob access lease renewal latch is not really – that’s a bug. It’s actually a latch on a structure called DB table, which is requires because of the fact the reason the SQL engine, and then we just have database locks, and that’s it. This is the full execution of that one procedure. There’s no latches, no locks, and I actually went to the source code of all the Hekaton DLLs themselves, and I actually did a search for our spinlock class code, which we use throughout our engine, there wasn’t a single instance of it being used. So when we say that, we’re not really lying. We really mean that we really have optimized access of running these transactions.
So I think what that really proves to everyone here is the Cowboys are far superior than the Eagles. I’m sure somebody in the phone is an Eagles fan that I’m going to hear from, but that’s okay, I don’t mind. At least I didn’t use the Giants; I should have though, because they’re playing the Giants Sunday night.
Daniel Hutmacher: J. D. Walker said that you should get rid of Tony Romo and the Cowboys will get instantly better.
Bob Ward: So he needs to keep up with the real world, because Tony Romo is retired. He’s an announcer now, right? And Dak Prescott’s our guy. So we’ve followed through with that person’s suggestion.
Okay, so let’s dive a little more internally. I think that to me was more of a 200, 300 level discussion. Let’s kind of amp it up and make this a little more internal. So let’s bring the debugger out because we got to do something fun today, it’s a Friday, right? So what I talked about optimized access to data, and it’s different structure internally. Maybe some of you from Paul Randall or other experts, Brent, other folks in the industry have shown you like a page structure before, like DBCC PAGE, they’ll show you what the rows look like and they’ll crack it out for you.
Well, you can’t really do that for Hekaton. There’s no pages, right? So what I’m showing you – and this is from Kalen Delaney’s internal white paper. If you guys want to get a great backdrop to taking these slides and complimenting it with a white paper, Kalen Delaney did a really good job of an internal white paper, a pdf file that you can – that compliments this talk. So I encourage you to go read that, and I told Kalen I was going to borrow her photo. She had a really good image of what a row looks like in-memory optimized tables.
And so what we mean by this row header payload is the following. In every Hekaton row, no matter what, is this header. There’s these time stamps and these other values. The payload is your data, and to the Hekaton engine, we just call this a bag of bytes. In other words, it’s unknown to the Hekaton engine. Now, if you think about DBCC PAGE, and DBCC PAGE goes in and takes these rows and it cracks out the values for you and the column types, it does that because it has metadata stored in SQL Server that’s cached in-memory that it needs to access when it accesses a row to understand the format of it.
Now, some of the formats encoded in the row structure, like where are the offsets and so forth, but if you want to get the types and everything else, you’ve got to metadata. Well, what happens when you go to metadata is that that’s got to be a shared structure that is accessed by multiple threads, which causes latching. So what we did in Hekaton was the data format is not known to the engine, but it’s known to the DLLs. So the native compiled procedure DLL and the schema DLL are going to give a mapping to the Hekaton engine on how to deconstruct or reverse and crack this payload. That’s one of the beauties of the speed is the Hekaton engine didn’t have to have that logic, and therefore didn’t need this latching concept or shared memory structure. We just used the DLLs.
Let’s talk about – and we’ll talk about that in a second, let’s talk about the header, and why do we need these fields, why does every Hekaton row need it? Well, first of all, we use timestamps for concurrency purposes. So a timestamp of the insert is this beginning timestamp, and it’s when we commit the transaction that we fill this in. The timestamp of a delete is the in timestamp. So if you delete a row, we use these timestamps to show the visibility of the row to transactions. If it’s not deleted, it’s just – we use this little term for infinity, meaning not deleted, but it’s actually a different value encoded in the eight bytes. So when you see a visualization, if you see a beginning timestamp but not an in timestamp, you’ll know it’s an inserted record at that point.
So updates are deletes plus inserts. That’s how we do versioning. So when you do an update statement, we’re going to go mark a row deleted with this timestamp, insert a new row with a beginning timestamp and not an in timestamp. That’s how we do versioning.
So the statement ID is used for Halloween protection. You can do read about Halloween protection, it’s kind of a separate topic, but we need that to avoid those problems, and this is where it gets interesting. So the row itself, there’s no concept of a clustered index in Hekaton. What happens is that there’s a single instance of this data, of this row in-memory, but in this header is going to be a ray of pointers to link lists, and that’s effectively a chain where this data shows up in different indexes. So the number of indexes that you have for your table is going to be indicative of how many pointers you’ll need and what this value will be. So number of pointers equals number of indexes. You got two indexes, this would say two, and you’d have two eight-byte pointers in this row header. So the row header is fixed except for the end of this portion here is variable in size based on how many indexes in the table. And the row will show up in chain through these indexes through these pointers.
So the other thing about these chains is it points to another row in the chain, and what I mean by that is that if a row shows up on an index, it may be chained differently to other data depending on the index structure. If it’s non-clustered, it’s a different chaining mechanism. If it’s hash, it’s a different chaining mechanism. I’m going to show you an example in a second of what a hash looks like.
Another thing is rows don’t have to be contiguous in memory. We can scan these things even though they’re not contiguous in memory but it is an important point to note here that scanning an entire Hekaton table is not particularly interesting, and it’s not why we built this. So when we say optimized access to memory, we do mean that, but we usually mean because you’re going after a certain set of data.
If you’re out to just scan an entire table, we didn’t built Hekaton for that purpose. I’ve had customers contact me and say, “Hey, I’m trying to scan this Hekaton table and it seems to be slower than going after a disk space table. Not significantly slower” but I’m like, okay, well that’s not why we built this technology. So we can support scanning, but it’s not going to be – you’re going to see lights out faster. If you want that, you should use a technology like column store.
So that’s what a row looks like, and if you want to know more about versioning, there is a great set of documentation on internals here, and I actually borrowed it from Ned. Ned Otter has a really good discussion about how these versions look and how you can take a look at them. So that’s what a row looks like. Now, we talked before about the fact that to crack this row, the payload part, the Hekaton engine needs help, and the schema DLL, which I mentioned earlier in the talk is how is one of those ways.
So if you’ve got a table where you don’t have a native compiled procedure, every table’s going to get one of these DLLs, and you’re going to find on disk, in your installation folder, in the data section, under XTP, it’s going to be a database ID, and in there is going to be a set of files that make up these schema DLLs. So any time you build a table, alter a table, you restart SQL Server, we’re going to go build this DLL, and we’re going to leave around a trail of the source code itself and information about the actual compilation, because we actually run the visual compiler behind, in the background, as we build these tables.
So first of all, these file names have some encoding that will help you, the database ID is part of the name. Second of all, like I said, we keep the C source around; will show you that in a second. This symbol file is there, that’s going to be very helpful for us today so you can actually step through this. It’s not going to be a lot of meaningful to you, but we do leave the symbol file around. And then the object ID is another part of the encoding of this, so you could tell from here which table this belongs to by looking at the name. And then we have the output of the compiler for troubleshooting purposes, and for some reason that failed, which I have not seen that before. We could use that for the support team. And then this is the other part that’s interesting. So whenever you see one of these, a T means it’s a schema DLL for the table, and a P means for a native procedure. So some of these files will be the same for either one, but the P or T will tell you is this for a native compiled proc or for schema DLL.
And then very important, when the database comes online, we always rebuild these, and the reason we do that is for security reasons. If you think about it, we’re sticking a DLL out here on disk, and the C source file, right? Now, the DLL is already built, but if the server just loaded this DLL each time it ran, somebody could theoretically get in and access that DLL and hack it, right? Interject some problem that we don’t want, and if we just loaded it, we would have an issue. So what we do is anytime you restart SQL or take the database online and offline, we’re going to actually rebuild those DLLs to avoid any security intrusion issues. So that’s pretty cool.
What about this row payload? So inside a schema DLL what can I expect this to actually look like? Well, here’s kind of what it looks like, an example of one. We have a table here called letsgomavs from my basketball team, the Mavericks, and I’ve got an integer column that is a primary key. It’s based on a hash index. That’s why we set this hash with bucket count, that’s the syntax for doing a hash index. It’s got another fixed integer column, and then a variable column that’s nullable, 100 characters.
So what this looks like is the following. In the DLL, we use these C structured concepts to denote what the format looks like. So you’ll notice here that this structure is representative of my table. I’ve got two long values which map to these integer columns, I’ll show you this in a second. Here we go. So call one map to this guy right here, and past the header, this fixed value will be here, past that will be the next fixed value and any other fixed columns will show up in the beginning. It’s a little bit like the SQL row structure where all the fixed ones show up first.
Then we have this bytes to tell us that we have no columns and then we have something called deep offsets, which represent this array right here. The number of values in this array will be depending on how many variable linked columns we have. This case we have one, so we need two values. Why do we need two values? Well, first of all, the first part of this array is a value that says the offset in this row is the start of this variable linked column, and the second one is the end. So if we had another variable linked column, we have a third entry that would point to where is the end of the next one. So that way, you could go to this deep offset array and you would know where are all the variable linked columns in this table.
So that is really kind of how we support showing the metadata that we mapped at the Hekaton engine. The Hekaton engine can actually get a mapping of the C structure and use it to decode what does your payload look like. So before I show you a demo of what that looks like in a debugger, let me stop and talk about this hash index, because that’s how these rows pointed to each other.
So imagine I’ve got a table where I have a character column called name, and one called city in the same table, and then I have two hash indexes, one on name and one on bucket count. Now, this hash index concept is different than a non-clustered index. It’s not a B tree concept. It’s more of like an indexed array. So the way this works is you have an array of hash – this of these as number entries. We call these buckets. So in this case, I only said there was eight of these entries. So what’s going to happen is every time I get a value for a name, I’m going to apply a mathematical computation on the string and I’m going to come up with a place in this list of buckets of where to put this value. And what I’m going to build is effectively an index list of how to go find data based on these buckets, because any time somebody wants to go search on one, you’ll run the mathematical computation again and you’ll figure out where to go find it.
So this is kind of what it looks like visually. Let’s assume that I had somebody called Ryan from Arlington in this case, and this number here, remember that timestamp I talked about, indicates kind of an ordering sequence. So the first one’s Ryan on the list. And when Ryan gets hashed, he’s going to show up in bucket four. This is just a simulation. It may not be a four, but it’s something like that. So remember we only have eight of these. So Ryan shows up as a four. Now, let’s say Bob, who’s from Dallas comes in after Ryan, but he also hashes to a four. Bob and Ryan are two different strings, but because we only have eight possibilities, our mathematical computation takes that into account and says I only have so many possibilities here, they end up hashing to the same guy.
Now, what happens is because they’re a part of that same bucket, I need to chain them together, and this is where that chaining comes into play, and it’s going to look something like this. So when Bob’s inserted after Ryan, if I need to go find Ryan, I’m going to run the computation on Ryan, I’m going to go to the fourth bucket here, then I’m going to go to this list and go, is this Ryan? I’m like, nope, it’s not. Let me go to the chain and find – it is Ryan. The chain ends here, so I’m done. I found all my Ryans by making this list, but I’ve had to make two hops. I went to the bucket, I went to Bob, I went to Ryan, now I found Ryan.
Now, if these buckets were better spread out, I could directly go to four and find Ryan immediately, but because I only have eight of them, I had to go to this chain to find Ryan. Now, let’s assume that Bob comes in again but from a different city. So I got two Bob’s now, he comes in later. Now I build a chain like this. Now to find Ryan I’ve got to go through three hops, and to find Bob, I go through two hops, right? So that’s what – and if Ginger comes in and if anybody knows my family, I’m picking family names here. If Ginger’s from Houston, she happens to go to a five, and that’s really nice for her because it’s very easy to get to her, just one entry. But if another Ginger comes in, now I’ve got two entries as well.
So this is a possibility, right? I’ve got a couple of Gingers, I go find them through its chain, I call each of these chains, anything that goes off one of these buckets is called a chain, and I walk this list to find those values.
Daniel Hutmacher: Sorry, are you versioning Ryan and Ginger and Bob?
Bob Ward: That’s for asking Daniel. In this case, notice that the symbol I’m using here is this infinite symbol, so there’s no versions that have happened here. I’ve not done any updates. All I’ve done is inserts at this point. If I had versions – remember, like an update, it would be a delete and an insert. So if I updated Ginger and said Ginger now is in San Antonio, I would end up having two rows. I would have a Ginger with a delete timestamp and another chain to another Ginger that points to San Antonio with just an insert timestamp.
Daniel Hutmacher: The reason I was asking was that you’re updating the pointers to the next record on…
Bob Ward: Yes, thank you for asking. So I have to update pointers to rows to insert even new rows, because these are chains, right? That’s the only way to build these lists, but I’ve not updated the values of Ginger, Bob, Ryan or these cities. So the updates are happening internally to insert new rows, but no logical updates are part of this, so there are no versions. The only time you see a version is if you have this delete symbol here that would show up on the actual timestamp, because that would show you the time sequence of when the version exists. Like, when is that an actual record that is visible to another transaction? That’s a good question.
Okay, so what about city? So I got the same type of possibility. Remember, these rows exist here, so I’m showing this in a time sequence as though city happens later, but this all happens at the same time because I’m inserting these rows, so I have to maintain the hash index for both. I’m just making it easier for you to see the sequence.
So Arlington, for example, could point to five, Dallas could be four, Houston could be three, but what about four with an Austin? Well, just based on the hashing algorithm, it would be that Austin is also at three, so to find Houston now, I’ve got to go to Austin and then to Houston, and Fort Worth would also be a three, which makes this even worse, because to find Houston I got to go to Fort Worth, Austin, then Houston.
The reason why this all makes sense – or the reason why it’s important to bring this out to you is when you design these bucket counts, things matter, and when you look at performance of memory-optimized tables that are using hash indexes, looking at something called a hash chain and collision matter. Any time I’ve got values for a given column that’s part of an index where they are all chained together but different values, that’s called a collision.
Notice that duplicate keys are not collisions; that’s your choice as an application developer. You built an index and you started putting in duplicate values, they are going to have to be chained together, right? But when values that are not the same get chained together, that’s called a collision. But in either case, these are chain counts, so we have DMV data that shows these chain counts for you. So when you start looking at – and it shows you the dense, how full these buckets are, like did you build a hash that was too big or too small, but when we’ve seen people have performance issue with hash indexes, we’ll go find they have massive chain counts.
What does that mean? It means their bucket count was probably too small, or they put a lot of duplicates in there. So these hash indexes again, are not really good for those scenarios. They’re not good for a lot of duplicate values, they’re not – they should not be used in those cases, or they’re not good where you improperly sized the bucket count, because these chains get really long and it’s very slow to then go find the data.
The optimal chain count, of course, is going to be something really low, where when you go hash on a value you go right to the bucket and find the row immediately. That’s a very fast optimized access lookup. Now, versioning, as you just mentioned, adds to the chains too, right? So if you do a lot of deletes and they’ve not been cleaned up, that unfortunately then adds to the complication of using a hash index. The whole point here is that they’re very good for single row type lookup operations and they’re very fast, but by looking at this diagram, you can see the complications that could come in, why it may not be a great choice for you out of the gate.
In fact, I’ll make a very bold statement. If you’re going to try this technology of memory-optimized tables and you want to see how well it can perform, do not go write to hash indexes. It’s going to complicate your evaluation of it. Go build it together and then go decide what kind of application pattern you have in these tables, and if you’ve got one that’s a very single row lookup oriented, then you start actually considering hashes as your way to go. But it’s also a good way to kind of show you internally for this row structure, why do we have these pointers around, why do they exist, why do these timestamps exist, and how do they exist in the form of a has index.
Okay – sorry, I forgot to mention, we have – I think we have some pretty good guidance on sizing these things and how you would monitor them, but again, it adds to the complication of making this migration to using this technology.
So just really quickly about the structure of a natively compiled proc and then we’ll do a demo, and that is just that these were the keywords that are important. When you say create procedure with native compilation, that kicks in the logic for the SQL Server engine to say, I’m not going to go through the normal process of building a query plan for this, I’m actually going to go kick in the Hekaton compiler to go take this text and go turn it into a DLL. Schema binding is required so that’s an absolute requirement that you cannot be dropping columns that are referenced in here while this is bound.
And then we have this keyword called begin atomic. Begin atomic is required and tells us that everything inside this block is a transaction, and you have to specify this requirement. You have to tell us the isolation level and the language in order for us to build the DLL correctly. Now, you should know here though that even if you specify a different isolation level than snapshot, internally, we still use versioning. So we kind of use snapshot technology behind the scenes, even though you pick a different one like serializable. And I’ve got a visual to show you the ramifications of the fact that if you pick a different iso-level for your application needs than snapshot, why? Because we use versioning that could cause some issues for your application.
And then here’s your code. Here’s your queries. Your select statements, your inserts, your updates, deletes are all inside there. There is restrictions on this, and we’ve done a better job over the releases all the way up to 17 to relax what can you put in these native compiled procs. I will tell you out of the gate though, for example, don’t go throw a scan in here. Don’t go through a SELECT*, we can’t use *, but don’t go throw a SELECT of a huge table inside a native proc. It could perform even slower than a standard procedure. So use the principles and what I’ve mentioned before about accessing a certain set of data, updating a set of rows, deleting insert updates as part of these procs to get maximum performance.
And here’s some other – so there’s some restrictions and best practices, but we keep trying to expand the surface area. But here’s another gotcha. Because we’re trying to do just maximum performance, we do not compile recompile automatically based on statistic changes. So we support statistic changes, but it requires manual recompilation once enough changes have been made to get the proper plan.
Okay, let’s do some debugging. Let’s throw the Windows debugger out there, we got to try something, right? Now, I’m a big believer when I do demos – I think I’ve got a different window for this – yes, when I do demos, I put read me files in there because I hate memorizing demos, and even though I use the debugger a lot, I don’t remember all debugger commands like the top of my head exactly. So I mentioned about the schema DLL, and I’ve got a database called letsgomavs – let me get rid of these files. So from this demo – so this tells me the name of that database ID and I want to know that because I want to go into explorer and find that directory.
Now, I’m running SQL 17 in this case, so in SQL 17 out of the data directory there’s an XTP folder, and I’ll back up so you can see it. This is every database ID that must have some sort of memory-optimized table. We wouldn’t build this folder if it didn’t. Now, we knew nine is our Mavericks folder, so this represents the C source for that schema DLL. Notice the T syntax and the C file, and I’ve already got this running under Visual Studio.
So this is what that looks like. Do you recognize this structure? This is what I showed you guys earlier in the diagram. So I used the same table format as I put in the diagram so you can see it. So this is actually in the C file itself. Now, to make this a little more complicated, down here in the C file is something called an offset array, and this is something that the Hekaton engine knows how to go look at the DLL, this is a symbol that’s actually out there in the DLL to know how to access this, and each one of the members of that structure is a specific mapping adhere to that.
So let’s see an example of how the schema DLL is used by SQL Server. So what we’re going to do is we’re going to load SQL up onto the debugger, I’m going to use the winddbg command here. This is something I didn’t put in the read me – I might have put it in the read me. I’m going to use my symbol path, which is I’ve got a symbol cache on my server, and I’m going to attach the SQL Server process. So at this point I’m attached and I’ve frozen SQL Server at this point, but what I – my goal is to be able to show you how the SQL engine will use that DLL and use that structure called the offset structure.
Now, a schema DLL is not code that SQL Server is going to step into. There’s really no code. Actually, that’s not totally true. For accessing indexes, there are some functions, but to actually just map the row, we don’t need to step into code, we just need that structure so we know how to map it. So I want to set a breakpoint, but I got to get clever on this. Like, how do I set a breakpoint on that structure? I will use a command called break on access. This BA command, which says – this is what we’re going to run right here. We’re going to say, hey debugger, any time somebody accesses this structure, I want you to break into the code and show me who’s doing it, right? I need the same of the DLL to do that, and I’m going to use a command like this to do it.
So LM means list modules, and I know my database ID was nine, so if I did that, I’m going to see this is a DLL that’s loaded in the process space of SQL in the debugger, and I need to use that DLL to go find the symbol location of that array. So I’m going to do something like this, I need a fully qualified DLL name, but in order to copy this it’s going to throw something else out there so I’m going to erase that. So now I’ve got this break on access, R8 means eight bytes, so this is a pointer of eight bytes, and I need the name of that structure, which is back here. It’s right here. So I’m going to put this name in here, which is a known symbol, because I think I told you, there were symbols with these.
So the debugger is going to go, okay, great, I’m going to break when anybody touches that. So if I hit go with this point, I’m running SQL again. Now, how would I go access that breakpoint? Well, all I need to do is really just query this table. I don’t have a native compiled proc here, I just need to run any kind of select statement against the table and so I’ve got – actually right here, this is a select statement. So when I run this, I expect the debugger to kick in, and sure enough it does. And if you go then look at the call stack, it may not show you what you expect, so when I first did this, I thought, “Okay, I’m going to see in here somewhere a Hekaton DLL”, right? Well, it doesn’t show up. Well, what does this actually look like?
So these call stacks, if you ever look anything from a query perspective, this is kind of common for any query. Something called process request and C source execute, that’s like running a select statement, right? And in fact, I can see here, here’s a symbol for running a select. This is where it gets interesting. I’m scanning this table, which like query scan, and then I know that this table is a HK table. That’s a code in the SQL engine that says, I know this is a SQL table, you’re doing one of those interop things, okay? While you’re dealing with those interop things, I need to go to the DLL and find out that offset structure to go get to the columns, and sure enough, right here, this is the engine going into that offset structure and saying, I need to figure out what that row looks like for each column, so I’m just going to access your structure and it’s a mapping – think of like a map – to tell me, here’s where column one is, column two, column three, because the structure in the DLL tells me how to do that.
So this is an example where it’s an interop query where we’re not accessing the Hekaton engine code, we’re just using the schema DLL as a map to get to the data. But what about one of these native procs that I was talking about before? Let me BC*, which clears all breakpoints, so I’ve got another database called cowboystothesuperbowl, which absolutely is going to happen this year, I’m very confident it’s going to happen. So – and of course after Sunday night when we lose, I’m going to get nailed by somebody on Twitter saying, Bob, you said they’re going to the Superbowl, good luck.
Okay, so I’m going to go here and find the Cowboys – here it is right here, cowboystothesuperbowl database ID. Now, that’s an eight. So back to our directory structure, if you go under eight, you’re going to notice here that there is a T as well for eight. I should see a P as well, maybe that’s the wrong one. I don’t know why there’s not a P there, let me go see under the debugger what’s going on. We have to recreate this database here. Let’s go back and run this command I did before – yes, there’s only a T, I don’t know why that is. That shouldn’t be the case. So let’s go here.
Okay, we’ll create our proc. Now if I go back here, I expect – because this got recompiled now, I expect to see a P and sure enough, there is a P. Notice that there’s – now, there’s also a T because I may access this table that I’ve built through schema DLLs but in this case, I have a native compiled procedure. So I’m going to run this command again that I just showed you guys, the debugger – that’s because I was using T. If I do a P, there we go.
Now, looking at these procs is a little difference because this time, the SQL Server engine knows it needs to run the code in the DLL to run the proc. It doesn’t just need to map the structure, it needs what’s called a function pointer to run the code. Well, how does it find that? Well, every DLL for native proc has interesting variable, a known symbol called g_Bindings, I’ve got that listed here. g_Bindings, and every DLL has the same structure for this g_Bindings. So if I would go dump out what that looks like, I can do something called a dump type command with the debugger, and if I just grab this, every native proc that you grab has this format. You can do exactly what I’m doing. Grab the name of the DLL, put in g_Bindings, and the symbols tell us that it looks something like this. It’s always this structure, and if I – and the debugger’s really cool in this way. When you’ve got these known symbols, you can kind of drill into what they look like.
So the one called ProcInfo is key. That’s the one that’s going to have a pointer to the code. So ProcInfo at the very top, there’s something called ProcFn, ProcFN, this address is the function pointer to run my native procedure.
So I should be able to do something like this. Breakpoint on that address, go in, and run it. And I think I’ve got one here. Yes, run this stored procedure. So if I run that native procedure, I did hit a breakpoint. Now, what you’re seeing on the left-hand side over here is not the source code of SQL Server. Nobody should faint from Microsoft on the phone that I’m showing the source code. This is the source code of the actual native DLL in the C format, and if you go look at a call stack here, you can kind of see what this looks like. You can see here that we ran a stored procedure, that’s like normal for a stored procedure, but then notice here, we noticed in the engine it’s a Hekaton stored procedure. So we got to call this code, which this one is smart enough, this calls the Hekaton runtime, which based on the DLL says, I know that I got to go call that g_Bindings thing, and that ends up being a call into this DLL.
If you did this for a different native proc DLL, everything would be the same except for the top, this would be the name of your DLL differently. Here’s what’s kind of fun is we can actually step through the execution just like we could source code.
Daniel Hutmacher: I think Bob’s audio dropped out.
James Anderson: Yes, we’ve lost sound. Well, your headset come out?
Daniel Hutmacher: Judging from Bob’s look…
Bob Ward: Can you hear me now? I just replugged in my speaker. Sorry about that, where do I need to go back to? How long was I gone?
James Anderson: Like, 30 seconds.
Bob Ward: Okay, so I’m sorry, I don’t remember exactly where I was, but I was showing you how to step through this stored procedure, right? Did you guys see that?
James Anderson: Yes. It was during that.
Bob Ward: During that? Okay. Basically, I was showing you is I was stepping through the code of that native proc, and what I was stepping through was effectively your query plan. That was the query plan to go run that insert statement, right? But at some point, you have to insert the row. And to insert the row, the DLL doesn’t know how to do that, so it just takes that bag of bytes I told you about, the data, and it calls this HKRowAlloc So effectively, HKRowAlloc right here is like an insert command in the engine, but it’s the Hekaton engine. That’s like inserting data, and the Hekaton engine again, is very optimized, this is where the optimization comes into play. It knows how to take your bag of bytes, it knows how to put a header on it with all that timestamp stuff and it knows how to go chain them to the hash index and everything else it needs to do in a very optimized, latch, lock, spinlock fashion.
So the DLL has got your plan, and it knows it needs to do an insert and that call of HKRowAlloc in the DLL code is an insert. That’s what it turns out to be, and if there was a select, it would be a different call and so forth. So that’s how the interaction works, and if you see right here, the interaction is this is the engine, this is the Hekaton runtime engine, and then this is your DLL, so this is an example of the engine going to the runtime, saying go call that guy, go call your native proc, native proc doing some work, going back into Hekaton to insert your row and coming back into the DLL, and that’s – the DLL’s execution will be this back and forth between the DLL and the Hekaton engine.
Now, occasionally, the Hekaton engine has to go back to the SQL engine, and one example of that would be for transaction logging. So if this native compile procedure has to commit when it’s done with that block, the Hekaton engine would then take that sequence and then call to the log itself. So that’s just some fun with the debugger, I showed you that schema DLL, that’s how we map things so that the Hekaton engine doesn’t have to know row formats, and this is an example of native compile procedure execution, your query plan baked into a DLL and interacting with Hekaton. Pretty cool stuff, and then quick debugger tip, if you’ve never run this before, obviously do not run this in production. Please do not go to your company and say Bob Ward told me to do this in production, but if you were to do this in a development environment, don’t run queue, run .detach because .detach says get rid of the debugger, but let SQL Server stay up and run, so anyways. A queue would be bad because then you would crash SQL basically.
Okay, so pretty fun stuff, let’s move on and keep talking about this memory optimized access, and let’s talk about concurrency, let’s talk about transaction logging and why some of the things behind the scenes we’ve done have made this successful. This is a simple example of concurrency, why it matters to look a little bit at your application on things like conflict and error handling. So consider this scenario: remember the timestamp numbers I was talking about? So I’ve got two transactions with a normal disk based table. Now, when I start a transaction in this case, I’ve got one row one over here that’s already in the table, but if T2 inserts another two, but T1 started before that, if you recommitted isolation or serializable, what happens? You get blocked. That’s the normal thing. This is why we added that recommitted snapshot isolation level option in database terminology with SQL, because in that case, we use a snapshot versioning behind the scenes and you only see row one, you don’t see row two, right? So snapshots or recommitted snapshots don’t get blocked, they get to proceed, but they only see row one.
Now, if you commit and you come back later and run the select again in the same transaction, a true snapshot sees only one, but recommitted snapshot gets to see both. That’s the normal behavior of a disk-based system.
Memory-optimized is going to behave differently. We’re going to use an optimistic approach to use behind the scenes a snapshot approach but if you tell us to use a different isolation level at commit time, you might get an error and have to retry the transaction. Here’s an example. Same sequence as before, I run the select – remember I’m memory optimized – and a snapshot always use, so I only get the row one. But if I use serializable for transaction one, remember I told you in the native procedure you might use serializable as an option? If that happened then you commit, you’re going to get an error because the rules of serializable were violated for your transaction. Somebody tried to insert a row that comes within your per view because you queried rows that might match this range and a serializable isolation level should fail. Over here what would happen? Well, I got blocked. We don’t block you over here. We let you keep going, but when you commit we have to go make sure we didn’t violate your isolation level. So we use versioning, that’s the optimistic approach, the only downside is if you need different isolation levels that need to adhere to something other than a snapshot, you could have retry problems.
The most simple example is updating the same row at the same time. And there’s a specific error that says – somebody is going to fail, so when you update a row and somebody else is updating at the same time, you know, two transactions concurrently, you’ll get an error that says, “Hey you can’t do this, somebody else is changing your row, you need to retry.” That’s the most simple one we see. This is a little more complex, but it does show you that one of the downsides or one of the challenges of using this technology is to build your application to be able to retry at a client perspective.
Now, one technique that you can use, interesting technique we’ve had some customers do, and we’ve got this documented in our books online. You can build a non-native procedure as a wrapper. So you build a non-native procedure with try accept logic for these types of errors, and then you call the native proc inside the try block. And then your accept block would then catch the errors and then decide to go back and try it again.
So one technique is that if you don’t want to put this logic in your fronted app, you could use standard stored procedures to have try accept logic and hail these errors accordingly and then redo them or retry them. That’s not different than a deadlock, you may have to retry deadlocks in your app today. It’s just that we don’t have deadlocks that are possible here because we don’t have any locking. But this is an example of having a similar scenario, you may have to retry a transaction.
And then we’ve got some pretty good guidance on this. So we’ve got a list of some of these types of errors and conditions you may run into, and then what is the guidance for you to do. Almost always, the guidance is you have to retry, and then I’ve had this question, “Well, how many times?” Well, how many times it takes. And it could be that some of these errors you try to retry is not what you want, you want to actually fail and then have something happen different in your application logic. So it is the huge benefit of optimistic approach with the challenges and downsides of having to do retry operations because we’re being optimistic.
So is it really lock free? Remember what I said earlier, I just want to make sure I emphasize this. Code executing and transactions in the engine are lock-free, and I mentioned and I showed you some examples of this. Here’s an example of a fix we made and a hotfix for a customer based on SQLCAT where we’re fast during the engine but we notice outside the engine for interop queries, we were going after a spinlock that was really causing problems. So because of this optimization, all of a sudden, some bottlenecks in the rest of our engine can kind of actually show up.
This is some really crazy – you want to talk about internals – go read some of this stuff afterwards. It talks about how do we actually achieve this lock-free approach using this compare and swap process. The reason it’s important to you is that we actually created a T-SQL function called IsXTPSupported, which determines is this instruction in the intel set, does it exist? Because if that instruction doesn’t exist, we can’t do this type of operation, and the reason that’s important is it came up I think – it may not be the case anymore, but there was some virtualization technologies that didn’t support that instruction and therefore Hekaton would not work. So in our installation, in this T-SQL function, you can run this to say, “Does my instruction set support it?”
Today, most modern processors I’ve not seen that be an issue. Now, here’s interesting. As I built this slide, I thought somebody is going to go look at DMOS wait stats and find ones that are called XTP something and they’re going to go, “I’ve got you Bob, there’s wait types called XTP and you said that was Hekaton. It must be that there’s not lock-free.” Well, it turns out I looked at every single one of them and they’re not in the transaction code path. In other words, they are not in the Hekaton engine, they’re for the background tasks, they’re for like log flushes and checkpoint closes and other things I mentioned or garbage collection.
So anytime that we do have any kind of waiting or concurrency models that we need to build in that are not one of these lock free approaches, it’s something outside the scope of this optimized access to a transaction. And then remember, the host may wait. Like, SQL OS or transaction logging, it still may have waiting operations but not our engine.
So I thought you might find this interesting, looks like we got about 15 minutes left guys, is that about right?
James Anderson: Yes.
Bob Ward: Good. I think we’re on track here. Quick, are there any other questions before I…
James Anderson: Yes…
Bob Ward: Got something significant we should go over?
James Anderson: I haven’t seen any in the chat, but I’ve got one. So does it ever get too good, so is it – sometimes have to almost try and peg your procedures that you’re not going to kill your concurrency by just that procedure taking up CPU?
Bob Ward: Yes, I think I showed you an example of too good, right? So I used a schema only insert native compile procedure example, and I could not get it to run faster. In fact, if I pushed that super cowboy procedure, if I push it harder, I get worse results because my CPU is so saturated at that point that I can’t do anything. In fact, if you do that, SQL Server itself could become affected because it can’t do anything else. So the too good part is when you’re so pushing the CPU so hard with this model that you don’t have enough CPU horsepower to run the workload.
Let me give you an example. It’s in one of the slide decks but I didn’t mention this. There’s a new technology called persisted memory that the hardware manufacturers have, and it’s even faster than this NVM card I’m showing you. We’re talking about latencies at memory speeds. I mean, raw latency is nanoseconds, literally, raw latencies. And Windows server 16 allows you to show those as disks. So imagine a disk that’s like 8GB, they’re pretty small to start out with. An 8GB disk where the latencies to the disk is less than a millisecond. I mean, that’s you know – or less than a microsecond, excuse me.
Now, what we do in SQL Server, in SQL 16 and 17, we support you using that persisted memory disk as a backing of log cache. So when you commit your transaction, we will write it into that memory persisted structure and then we’ll let the commit go, and then we’ll lazily write it to the real transaction log. So we were seeing crazy performance numbers from these technologies. Really, really fast. In fact, imagine – this is a funny story. When we were testing this, we were looking at dm.execrequest at write log and seeing the wait times zero. Somebody thought it was a bug. But the reason it’s not a bug is because it’s less than a millisecond, right? And so what was happening in some of these scenarios, we were so fast running out logs for Hekaton that we were too good and we were saturating the CPU so hard because there was no latency anywhere else, and we had to move to a higher powered cored box to get true performance. So that would be the biggest thing I would look at, is that if you optimize this so much, you’re now bound by how fast the CPU can run basically and how many you have. Does that help?
James Anderson: Yes, that’s it. Thanks for that. There is a question I did miss, there is one. We’ve got, why is the second pointer to the end of the variable data field needed? Doesn’t one end just before the next one begins?
Bob Ward: You mean back on the row structure? You also have to have two. You got to know where it starts and it ends, right? If you only have one – because remember, I’ve got fixed structure, and then I’ve got variable structure. So I need one to say here’s where it starts and where one ends, and then if another one goes I need another one to say where that one ends. So I always have to have one to a starting place and an end place because the first number is not the size of the variable length record. It’s just the beginning of all the variable linked data, and then the other pointers just point to each one in the list. Good question though.
Let’s talk about some transactions to give you a feel of how we’ve optimized things because people – I mean, everybody makes assumptions of how good we are in SQL Server, I do it as well because it’s a pretty good system, right? But think about a normal insert like I’m showing you in these examples. In order to do an insert, we have to log an insert record, we got to obtain locks, latch pages, modify pages, and then depending on index pages, you have to actually get more locks and more latches, and then potentially do more logging, right? And then release all of that, and then at commit time we log – we do flushing, right? So that’s just a – like, even a single insert goes through all that code path.
Well, what people don’t realize is that if your insert involves a page split, you’ve got more transactions and something called a system transaction where it requires a log flush in the middle of your transaction and then any time a latch is required in SQL Server, we need spinlocks to protect the latches.
So there really is a lot behind the scenes, I think it’s a really good system. There’s a lot behind the scenes just to do an insert, and we’re pretty fast at it to do all that code. This is what Hekaton looks like, that’s it. You insert a row into memory and you maintain your index in-memory, that’s a key pointer here. We do not store indexes anywhere. Notice how on those checkpoint files and things I talked about, that’s just your data. Indexes are maintained in-memory only and when you restart your database, we will rebuild your indexes based on your data. So that’s part of the whole database startup process, but that gives us just incredible speed. There’s no logging for indexes, and remember, there’s no logging until you commit. So before you commit, this is all in-memory, all optimized very fast, and also, remember that for schema only there’s no logging at all. So you just have these two operations, insert a row and maintain your index, which you may have to maintain multiple indexes, we showed these chains, right? But none of this takes place. None of these types of operations are required to do any of that. That’s just to give you a feel for how we’ve tried to optimize for example, an insert type operation.
And then if you look at logging, if you’ve ever done FNDB log, you can just take a hundred row insert and look at all the stuff that has to take place, right? So for each row, I’ve got – in this particular case I had a non-clustered index. I’ve got to insert the heap, the non-clustered index, I’ve got to allocate pages potentially twice to do a hundred rows, I need PFS latching, I need metadata access, which is what this HOBT_DELTA is all about, I need to do log flushing, and I’ve got to do PFS latching multiple times to update free spaces, and I think it took, in my single one hundred row insert, which is a very small table, abut 33K in the log. But for Hekaton, it’s optimized, so the same type of logging has these three records and that’s it. And notice this is about a third of the size.
Now, we created this new log record type called LOP_HK, and if you want to break that out, there is a DMV, different than FNDB log that you can say show me what’s behind the scenes of this LOP_HK, in this case, there is a begin, commit, and then a hundred inserts. And remember, we only even go into log cache at commit time, and then we also flush it at the same time as well, and there’s no rollback. There’s no log records for rollback purposes because we don’t have to do undue recovery.
Now, we do support rollback as a concept, but we don’t have to log anything because we don’t do undue. We just do roll forward. And the reason we don’t do rollback, people have asked me this question, is that we don’t have write-ahead logging. Remember, we’re only logging at commit time, anything in the transaction log at all. So since we only put things in log cache at commit time, there’s no concept of writing out a dirty page that hasn’t been committed. So since we don’t do that, we don’t have to support undue recovery. It’s one nice optimization of only logging at commit time.
So let’s finish off here and talk a little bit about optimization and I’m going to jump to the learnings discussion. So we might go a little faster in these last few slides here. So basically, why do we even do this checkpoint thing? Well, it’s all about speeding startup. If we didn’t do checkpoint files, all we have to do is read the transaction log and redo everything in the log, but then the log can get really big, right? You wouldn’t be able to truncate it, so that’s why checkpoint files are there, because it’s faster to go read a series of checkpoint files than serially read a log of committed records.
And the I/O is continuous. What we don’t want to do is at a checkpoint event time, we don’t want to saturate your disk where you have your checkpoint files, so we kind of write continuously to these files, but then there’s an actual checkpoint event that’s written into the transaction log, and it’s based on the amount of generated log and that will actually spur off taking – making sure we have a known point in the transaction log to say, “Hey, checkpoint has been done from this point forward”, just like SQL Server, anything pass that would have to be rolled forward.
And then what happens in these files, there’s a whole section that Kalan has in her doc, she does a really good job of this, to show you what these files look like, because I’m going to show you in a second, but it looks like just almost nonsense on the disk and you have to make heads or tails of it. But basically, the data itself is written in pairs. Something called data and delta, and what that means is that the data contains all the inserts, and the delta is a filter for what rows in the data that were deleted. Remember, an update is a delete of an insert. So think of deltas as just filtering what things in the data file we don’t have to apply because they were deleted.
And we had this concept called merge, because if we just kept doing data and deltas over and over and over, and you especially had like a lot of deletes, you would have a lot of files out there. And so in order to conserve the footprint of what we have, we have this merge operation that we do behind the scenes where we can kind of compact these files, where we can actually take pairs of datas and deltas and merge them back into one file.
And then we have fixed sizes, so when you look at the size of these files that are there, even if these are not full, we don’t create variable sizes. It’s much easier to create a known fixed sized footprint from an I/O perspective.
Remember, there’s no index writes; we rebuild those at startup. You’re also going to find out that we recreate files to avoid a bunch of creates. So you’re going to always see 15 or 16 files out in this directory structure I talked about, but not all those files are actually used, and then I’ve had this question, like, “How big could these files get?” Well, they could be greater than the size of your data, and one easy reason why is that they’re a fixed size. Like on large systems, we can make these as big as a gigabyte, but we may not use all of the gigabyte. So you have to understand about the fixed sizes that we have and the fact that because – especially because of the fact that delta, when you do deletes, are stored there as well, the size of these could actually be bigger than the total amount of data you actually have.
And then we don’t have CHECKDB yet for this technology, but we do do check summing. So we use CHECKSUMs and single bit ECT correcting on the files themselves. So if a file got damaged because some I/O problem, when you loaded SQL, you would get an error saying I can’t load this because there was a CHECKSUM error on the file.
So when you look at these files, you’re going to see these GUIDs and you’re going to have these pre-created states, you’re not going to know what this means, so I thought it would be interesting to see real quickly the states, so when you first run a create table, you’re going to get a bunch of these pre-created files out here that look like this, but one root file, which is metadata in a file format to show us what the other files look like. When you insert some rows, here’s where the continuous I/O kicks in. We’re going to start populating what’s called under construction data for data and delta, and then when you do an actual checkpoint event, like if you manually did checkpoint, we would turn that into active, and the reason it’s important to know this – and I’ll just keep running through this sequence here – the reason it’s important to know this if you keep going through the sequence, when you start up SQL Server, anything that’s in green that’s active, that’s what we’re going to use to populate data. The grey will not be used because the transaction log doesn’t show that we’ve done that because we only write in the transaction log at checkpoint event to say, hey anything before this, use the checkpoint files, anything after this, go redo the log. But the grey represents the concept that we’re doing continuous checkpoint to avoid flooding your log, and this DMV here is very handy to see what the state of this looks like.
So let’s go do that just real quick. If you go and look at one of the directories, and I’ll show you here I’ve got one. I think it’s called gobayler, yes, gobayler_inmem. This is a file group directory structure, and under this $HKV2, these are your checkpoint files, and you can see the fixed sizes. But look at these names, these GUID names. Who would know which one is active or which one is data or delta? So if you go under this DMV, I’ll show you real quick, you get an easy mapping of this. So in this database, and I’ve got these scripts for you to run. If you look at this, this is going to give you a list of all those files, but what kind of type they are. Notice a bunch of these are pre-created, and if you look over here, you actually get to map in which file goes with which type.
So that’s interesting. What I generally do is show me all the ones that are not pre-created [to so of ones] even matter, because pre-create are just ones we’re keeping around to use later. So I only have one root file because I just created this database. But if you go run some transactions, I’ll just go run some inserts against one of the tables, you’ll see how this takes place here real quick.
So we run some transactions, and then go back and then look at these files, you’ll see how I’ve got a delta and data that are under construction in a root, but if I run a checkpoint and I go back and run it again, now you can see the sequence of what I showed you visually. You’ve got a delta and data that are active, a root that’s active, and every time we do a checkpoint we create a new root file, but this is kind of showing you that sequence. So if you’re trying to figure out what all these files are and what we’re doing, you can use this DMV to actually go and do that.
And again, you really shouldn’t care too much about what this looks like, except for how big it’s going to be, and for the fact of that you might want to put it on a very fast disk for startup speed purposes. And you can stripe it to even support that even faster.
So let’s finish off real quick and talk about a couple things. There is resource governor capabilities here to – you don’t need a classifier function. If you want to control the amount of memory we’re using you can use it for that purpose, and then also diagnostics is very important. All the diagnostics work for Hekaton except for one problem. Natively compiled procedures by default do not show up with diagnostics. That’s because it would slow them down. So if you want to do that, there are stored procedures to turn on query diagnostics like query run time stats and then if you recompile it you’ll get that. So by default we’re not going to do that to make sure we maximize speed, but if you need to do diagnostics, like what statements are running slower in my procedure, you would need to actually turn that on.
Couple things in 17, not a lot of major changes here, but it was a little bit more about opening up the surface area for these procedures to add more T-SQL syntax, and then to speed up database startup. That was one big complaint we got, is that database startup was not fast enough, so we made some specific enhancements to 17 to support that.
And then here are just some learnings. I think I mentioned some of this already, like please don’t go and do a single user insert against Hekaton in a disk-based table and then say, “Why isn’t it faster?” I mean, that’s just not going to – I mean, use the demos I showed you for concurrent access of what we’re doing here. And again, remember we’ve optimized. If your transaction log is really on a slow disk, you’re going to really slow down a durable memory-optimized table and you’re not going to see the gain that you’re expecting out of it, right?
I think I mentioned some of these already, that – you know, start with non-clustered indexes, move to hash as you learn. Please bind a resource pool, like when you’re testing this out, yes, SQL – we’re going to go use I think by default, depending on your size, almost up to 90% of your max server memory with this. So consider sizing out what you need and using a resource pool.
And then do you really need durability? I mean, a lot of folks use this technology, they would show me their performance, and I said, “Hey, you know, just based on your table name that looks like some sort of like session state thing or caching thing.” Like, “Oh, you’re right, I don’t need that to be stored on disk”, and then with the schema only and they were just blown away by the speed they were getting.
So anyway, this is an interesting one as well. If you don’t have enough memory for these things, like let’s say this is going to represent like, history data, like you’re going to keep inserting in here, but then you have date data, if it gets historical, we have a really nice feature with 16 to 17 where you can combine this with temporal tables and we’ll automatically shift data historically into the temporal table, which is disk-based.
And then if you’ve got enough memory, like if you’re just one of those users that I got a lot of bucks to spend and a lot of memory, go use these together. You can actually put column store on top of these tables to do something called HTAP, which is hybrid transaction processing.
So and then I got questions about capabilities like, okay, what about availability groups and stuff, and this is a really nice list for you of things that we support, and one of the biggest things we did if you didn’t know, is starting in SQL 16 SP1, you can use this technology both on Standard and Express editions. It is more limited in how much memory you can use, but the technology itself is fully functional in those editions.
And then this is kind of my case, you know, and the biggest thing I want to point out is if you think this is just about in-memory OLTP and stock exchange transactions, it is amazing at that. But think of these other scenarios like ETL, IOT, table types, other things where ingestion of data and high-speed access to data could be something you want to take a look at, and if you’ve got the memory for it, why not? Why not take a look at it?
And then if you’ve looked at 14 before and you said, “This isn’t good enough”, you know, please take a look again at 16 to 17 because I think what we’ve built is actually pretty amazing stuff. And then I always provide people resources, so please go take a look at this, and I think Ned, like last week spoke, I think he’s one of the best guys in the community out there that keeps people in touch with the practical uses of this and where he’s using it and seeing where it’s beneficial. And again, the decks and demos are right there in that URL, so they’re right there right now. If you want to go use them, take a look at them.
One minute over according to my clock. James, Daniel, what do you guys think?
James Anderson: I think it’s an awesome presentation.
Bob Ward: Thank you. Thank you, I don’t know if there are any other questions that we want to try and handle here before we sign off?
Daniel Hutmacher: J. D. Walker says, “Mind blown.”
Bob Ward: What’s that Daniel?
Daniel Hutmacher: J. D. Walter says, “Mind blown” and John says “That was an amazing talk.”
Bob Ward: That’s very kind of you, thank you. Anything else you guys see out there?
Daniel Hutmacher: Can’t see anything, no.
James Anderson: Still got people typing.
Bob Ward: Well, for folks that are still and for you guys that are hosts as well, if you really like these internal talks and you still want these coming, you know, I’d love your feedback. There’s a lot of effort that goes to putting these together. One of the reasons I love doing it, it helps me learn a lot about the technology, I think it also brings a different perspective to people on what we’ve done behind the scenes to get an appreciation of it, so if you want these to keep coming especially in an event like GroupBy, let me know and once again, I want to say thanks to Brent Ozar for putting this together. It’s a great event.
Erik Darling: Hey Bob…
Daniel Hutmacher: I have a question actually from Mark, “What do Hekaton critics say is bad about the product and why are they wrong?”
Bob Ward: What was the question again?
Daniel Hutmacher: The question was, “What do Hekaton critics say is bad about the product and why are they wrong?”
Bob Ward: The number one reason I get besides the fact that I just don’t have memory to support this is the fact that it’s not an easy migration, especially if you’re an existing SQL user. If you saw what we had to put together, you have to build a file group, you have to change your syntax for tables, and then you have some of these retry considerations, and the fact that I think some workloads just don’t lend themselves to this, but that’s what I see. I think sometimes that it’s also a misunderstanding of when you could use it. I think some people think “Oh, it’s just for this high-speed OLTP scenario when it could benefit these other ones”, but the number one reason I hear mostly is about migration. Like, it’s not a very simple thing just to turn a switch on and say, “I used to use disk-based tables, now I can use this.” That’s what I’ve heard.
Anthony Nocentino: One mistake I made when this stuff was all new back in 2014 was I did this batch processing type of thing where I had a limited data set that I would repeatedly update and version over and over and over and so I would run out of memory in a second.
Bob Ward: Because of your versioning, right?
Anthony Nocentino: Yes, and I couldn’t figure out how to flush the stuff away to get rid of the old versions fast enough, so…
Bob Ward: I think we have made some improvements in 16 for garbage collection a little bit, so I think you have an older version. One thing about versions to keep in mind too, is versions have to stay around if transactions are active and need to view them. So if you are doing what you’re doing, you probably won’t run into that, but I’ve seen customers who – their Hekaton database got bloated because they had a lot of versions that we couldn’t keep up because transactions still need to view those versions. Just remember that a version means that if we can’t clean up a version of a row that got changed, it’s because some transaction needs to see it. As long as a transaction is visible in those timestamp ranges, we can’t get rid of the row. So that’s one consideration.
The other thing I’ve seen is that I say optimized access to memory, and again, I’ve seen people try to go run a million row scan against one of these and say, “Hey, how come it’s not faster?” I’m like, “Well, you really want column store. If you’re after scanning a million rows, that’s not what we built this for, right?” I would say to anybody in the phone who’s like, “Look, I just don’t know if this is right for me”, go look at the case studies. There’s a lot of them where people really said this is why I needed it, this is why I used it, and then take a hard look at table types. If you use table types in your environment already, then making them memory optimized could really have huge benefits. There’s no tempdb access anymore. If you got the memory for it – you know, if you don’t use table types, that’s a change for you, right? So moving from tempdb to this if you don’t use table types is not trivial, I mean, I’ll admit that. But – here’s one other thing, I’ve had some people say, “Look, I don’t need this speed. I just don’t need this. This isn’t something that in my environment I have to have that I don’t – I think somebody said to me, “I don’t have latching problems” and I said, “Well, that’s only because you don’t have massive latching problems. This is an – you may not realize the impact latching has to your application until you truly measured it against comparing to this technology.” But those are just some of the things I’ve heard.
Erik Darling: Mr. Bob, I have a couple questions. One is a personal question and then a more general question, I hope you don’t mind. I asked Sunil so I have to ask you too, I don’t want to get you left out. What’s your favorite trace flag?
Bob Ward: Oh my god, my favorite trace flag, wow.
Erik Darling: I’m writing this down, so answer carefully.
Bob Ward: We’ve gotten rid of so many of these things, right? Gosh, well I guess the one I have to use the most is 3605 because without that trace flag, some of the DBCC stuff doesn’t light up. Or 3604 I guess, sorry, 3604. It’s like page and some of those kinds of things, but I’ve got to be frank with you, I don’t use trace flags much anymore.
Erik Darling: You don’t need it, you have the debugger. You’ve like reached the next level.
Bob Ward: No, but I have to answer questions about them all the time but as far as my personal use, the one I would light up the most is 3604 because there may be some internal DBCC commands that don’t dump if I don’t use that.
Erik Darling: Alright. So what’s your favorite secret trace flag then?
Bob Ward: Oh man, you asked me to memorize a secret trace flag? Well, there is one. There is one that I don’t remember the number and I have to admit, I have to go look it up, I still don’t memorize that kind of stuff. But there is one, and – okay, so I’ll tell you what, you can ask Paul Randall, he probably knows what it is. There is one that lights up DBCC help, so if you ever use the DBCC help command, it gives you the syntax of DBCC commands. Well, some of these DBCC commands, the help doesn’t work. It doesn’t show you what it is, but if there is a trace flag you can use where will light up some of that stuff.
Erik Darling: So there’s a trace flag for trace flags?
Bob Ward: There’s a trace flag for trace flags, yes, something like that. That’s a good one. Kind of weak answers, I’m sorry.
Erik Darling: No, no trouble. My other question is around the office we have nicknames for collection of our fondest wait stats, poison waits, like thread pool and the resource semaphore. How much can Hekaton contribute to or be affected by those kinds of poison waits? So I would expect like the resource semaphore ones and memory waits to be like particularly rough on Hekaton, but like a [crosstalk]…
Bob Ward: The resource semaphore shouldn’t have any effect directly on Hekaton because…
Erik Darling: Really?
Bob Ward: Well, we don’t do grants, right? There’s no granting to access a memory-optimized table, but if you did an interop query, that is going to combine querying a disk-based table and a Hekaton table, right, you could do it. You could join those, right? Then maybe resource semaphore slows down your expectations of getting to the data. But there’s no resource semaphore acquired to access a memory-optimized table, it’s not needed.
Erik Darling: So like the core engine hitting memory related waits wouldn’t push on the Hekaton at all?
Bob Ward: Well, maybe though because Hekaton is going to grow its memory as it needs, right, as you’re inserting data, so if somehow you were doing large grants where we actually access a lot of memory, maybe you’d run it out of memory from Hekaton, right? That’s a possibility. It’s possible. The thread pool is an interesting one. Remember, Hekaton uses the standard worker pool for SQL Server.
Erik Darling: Okay, so there’s nothing kind of reserved.
Bob Ward: Now, when you startup the database, we create most of those threads we need in the background as you start the database, so that shouldn’t be a problem. But user transactions are the same thing, right? We’re going to use the worker pool just like anybody else.
Erik Darling: Okay, so like when you were talking about a situation where you had to really up the number of cores on a box to push the insert workload, that could totally hit a weird thread pool?
Bob Ward: I don’t – that’s not a thread pool scenario. What I meant there was let’s say that like, on my laptop there are eight processors – eight logical CPUs. Well, if I go post a hundred users against that, I probably won’t hit thread pool waits, but each of those tasks are going to sit there and do SOS schedule yields back and forth so hard that all you’re doing is switching between each other. I can’t push any more CPU power because that’s all I’m doing at that point. So I would expect to see SOS scheduler yield get really hot in that scenario I’m talking about on a smaller CPU machine because the only thing I’m doing is switching. That’s it.
Erik Darling: Right.
Bob Ward: Everything’s so fast, I mean, the question was what happens if you get too good? Well, you get too good, you just have to switch with each other on eight schedulers for over a hundred users, right?
Erik Darling: Once you get too good you’re going to get better hardware I guess.
Bob Ward: A little bit, yes.
Erik Darling: Cool, thanks for answering, I appreciate it.
Bob Ward: No problem. Any other questions before we drop off?
James Anderson: I was going to mention one thing. As the talk was going, I was – my mind started thinking about like, hot cold workloads. So if I had a partitioned table, I doubt it’s very – I doubt it’s possible to make a partition just in-memory, then I was thinking is it possible to have an in-memory table in a partition view? I guess…
Bob Ward: That’s a great question James. I would say the hot cold scenario is not one that we’ve optimized very well for this. We do have this concept where we integrate memory-optimized table with a system temporal table, you can actually do that, and that works. But then some people have told me there’s issues with temporal that they run into right, but the partitioning questions you’re asking about, no, we don’t support that. So the whole hot cold scenario with this technology is probably not what’s best for it. I think some customers that have had to do that have had to like move – like keep their hot data memory optimized and they move the rest into a partitioned disk-based table for their queries.
James Anderson: Okay. Cool, thanks.
Bob Ward: Hey guys, I really appreciate your time, great questions, thanks, it’s such an amazing audience that comes on here and I look forward to coming back again to this event with another topic.
Erik Darling: Awesome, thanks a lot, Bob, thanks for joining us.