Any DBA or developer who has ever faced the problem of moving data changes across databases.
One of the most common problems a DBA or database developer may face is finding and syncing data changes efficiently between multiple systems.
In this session we will look into different methods focusing on effective use of T-SQL, and explore how these methods can be mixed and matched in your ETL workflows:
- Brute Force
- Log shipping/Read Only standby
- Rowversion datatype
- Change Tracking
- Change Data Capture
- Temporal Tables
Why I Want to Present This Session:
Free up time and energy by knowing all the options available; brute force is not the only path forward.
Brent Ozar: Up next at GroupBy here, we’ve got Martin Perez talking about T-SQL tools to synchronize changes, so take it away, sir.
Martin Perez: Thank you. So, my name is Martin Perez. I’m a SQL Server DBA in Austin, Texas. There’s a lot of text and slides, and towards the end, we will fire up SSMS and kind of look at some of these things. Any questions or comments, I welcome them. Let’s get the conversation going.
So we were talking earlier about what sort of things do you just sort of live with as a DBA? Whether you’re a developer or kind of a hybrid, sometimes you’re the one-stop-shop for everything data-related. Sometimes you have very clear lines between admin and data; you’re a programmer, maybe you’re a BA. But there is never a time when I don’t need to get something from database A, do something with it, and move it to database C.
And whether it’s a reporting solution, whether it’s to be in your data warehouse, whether it’s just to be given to the BA to take that process up – we have this raw information, and now we’re understanding what it means to the business and now it becomes valuable. And you want to be in that value stream. You don’t want to just be the guy who says, no don’t touch that; you want to be the guy that helps get this from, don’t touch it please, to, yes this is yours now, figure it out.
And to that end, it’s to our benefit to understand the different tools and different strategies we have available to us. Sometimes you learn one way of doing things and you repeat that for the rest of your career – ten years in, you don’t have ten years of experience, you have one year times ten. And regardless of how we approach this, we’re going to be doing the same fundamental process. Starting with brute force and understanding the steps that we have to do – we’re going to be doing these same steps whether we’re using some of these other system-managed approaches. But if we understand the core concept, then we’re just spending our brain cycles on, this is a variation on a core concept, instead of every new thing we learn taking up more and more space in our heads.
So something to think about when we’re just doing a brute force approach – it’s brute force because it is brute force. You’re going to scan every row in the table. If it’s a trivial amount of data, that may not be a big thing. Once we get into the millions or tens of millions of rows, then we start to have a problem. Sometimes, this is just the only option we have. This is a very real situation if you’re working in a locked-down environment, or you have a third-party application where, if you touch it, you’re no longer supported. These are very real things we have to deal with. So understanding efficient ways of doing this, what we have to do, and the kinds of things we have to think about is to our benefit.
So if we’re going to be comparing every column, we have to know how to handle comparing nulls. We have to know what to do if your source data is in a certain collation and your reporting database is a different collation. These little bullet points, I learned the hard way over lots of hours – like, why doesn’t this match? Why do I get this? Why is this update statement failing? And data types – if you’re connecting to an old SQL Server 2000 database and they have text or ntext or something, you have to understand what steps to take to get to the modern supported version of that data type. These are just things we have to know. And if we have to deal with deletes, say on our reporting database, what kinds of things do we have to think about? Because deleting 100 rows is a much different proposition from deleting a million rows. Or if you have a very chatty system with inserts and deletes and inserts and deletes, these are things where you have to put your admin hat on. Sometimes we get to live in just the creative, I’m-developing-something mode, but there is a cost to every decision; there’s a benefit to every decision. So the more we focus our attention on the cost of a certain approach, the better off we’re going to be, the better conversations we’re going to have with our peers, and life is just going to be better. You don’t want to always be fighting and pointing fingers.
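The actual comparison code isn’t shown in the transcript, but the null-handling point can be sketched like this. The table and column names (`dbo.Source`, `dbo.Target`, `Id`, `Email`, `DisplayName`) are hypothetical, not from the session demo:

```sql
-- A naive comparison misses rows where one side is NULL:
--   WHERE s.Email <> t.Email   -- NULL <> 'x' evaluates to UNKNOWN, row skipped
-- A null-safe pattern using EXCEPT inside EXISTS catches those cases,
-- because EXCEPT treats NULLs as equal to each other:
UPDATE t
SET    t.Email       = s.Email,
       t.DisplayName = s.DisplayName
FROM   dbo.Target AS t
JOIN   dbo.Source AS s
       ON s.Id = t.Id
WHERE  EXISTS (SELECT s.Email, s.DisplayName
               EXCEPT
               SELECT t.Email, t.DisplayName);
```

The same EXISTS/EXCEPT trick works for any number of columns, which is handy when you’re brute-forcing a wide table.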
So, brute force – if you’re going to brute force it, you’re going to be living with a few of these patterns: the not-in, the not-exists, the outer join, and except. These are all different ways of doing that same core task – different ways of getting the same result – and it’s good to know more than one. If you have a table and you’re fortunate enough to have, hey, a clustered index, maybe an identity column, in both your source and your target, then you’re going to be in a little bit better position, because you can use an indexed column. You’re still doing a seek or a scan, but you can be a little bit more efficient.
In the real world, very expensive applications will be built on heaps. And we have to deal with that, because this is not something I’m allowed to touch. It doesn’t matter if I’m super admin. Sometimes it’s not even in-house – I’ve been in a situation where the server is leased from a different company, so I have zero admin access to it. But there are some clever ways we can kind of get around that. So these are just some examples of the kinds of queries you’re going to be writing. If you’re just comparing and looking for new rows and you have the primary key, the unique index identified, the unique columns identified, then we can just say: give me everything that’s not already where it’s going.
So from my source, select all the rows where my primary key is not in the target database. And this is kind of the trivial example of: just show me the new stuff. If we have a composite key – I didn’t change the query, but if you have a composite key, the not-in is not going to work, so we’re going to do the not-exists. And we have our nice little correlated subquery, and we start to think about things like outer references. This is another thing where – I remember getting really stuck. How do I write this? I know how to do the not-in, and then this not-exists and this correlated subquery comes into play. And again, we do get the benefit of more efficient index use. So we’re not quite in full brute force, and then we’re just inserting our new rows.
And then another variation on that is just to do an outer-join and we’re looking for null columns on, I think, the left side on this example. These are all different ways of doing the same thing. You can run all of these and you look at what the query optimizer has decided to do with it and you get the same performance. You get the same cost in terms of resources on your server and it’s really important to not only think about the cost in server time and storage, but the cognitive load, because we’re creatures of fixed resources at any given time. So if you can take all those concepts and ideas and then you’re able to just compress them into a single concept in your mind of, does it exist in my target table already? And these are just different ways of doing that.
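The slide queries themselves aren’t in the transcript, so here is a sketch of the four patterns against hypothetical `dbo.Source`/`dbo.Target` tables with a single-column key `Id` (none of these names come from the actual demo):

```sql
-- Four equivalent ways to find source rows missing from the target.

-- 1. NOT IN: fine for a single-column key (beware NULLs in the subquery)
SELECT s.Id, s.DisplayName
FROM   dbo.Source AS s
WHERE  s.Id NOT IN (SELECT t.Id FROM dbo.Target AS t);

-- 2. NOT EXISTS: also handles composite keys via the correlated subquery
SELECT s.Id, s.DisplayName
FROM   dbo.Source AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.Target AS t
                   WHERE  t.Id = s.Id);

-- 3. Outer join, looking for NULL on the target side
SELECT s.Id, s.DisplayName
FROM   dbo.Source AS s
LEFT JOIN dbo.Target AS t ON t.Id = s.Id
WHERE  t.Id IS NULL;

-- 4. EXCEPT on the key columns
SELECT s.Id FROM dbo.Source AS s
EXCEPT
SELECT t.Id FROM dbo.Target AS t;
```

As the speaker notes, the optimizer will often produce the same plan for all four, so the choice is mostly about readability and whether your key is composite.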
You take these concepts that are really exploded out in our heads and taking up a lot of space and a lot of our attention, and you just compress it down to a single thing and that frees up more space for us to hopefully tackle more interesting problems, or probably just the next problem that’s in front of us. Okay, so that is a brief overview of the brute force, and later on, we’ll show an example of if we’re doing updates as well and we have to compare all the columns. It’s not just, does it exist? Has it changed?
So in brute force, we’re living in developer-managed, or DBA-managed, workflows. We have to know the business. We have to know the data. Is there an updated-date column? Are we keeping track of the primary keys? There’s a cognitive load on the individual, and if we can take that cognitive load and move it to something that’s really good at keeping track of things – like a database, like a system, like a computer – that’s a good trade.
So we’re going to look at Rowversion, which is really old. It’s built in. When I found this, I couldn’t believe I didn’t know about it. The same thing with Change Tracking – it’s been there a long time. It has some quirks. And Change Data Capture, which honestly is my favorite. I love it. I use it every time I can. There are lots of quirks about it. There’s a lot of complication around it. It’s a skill set you have to cultivate a bit, but the payoff is potentially very valuable to the organization, and it’s a great thing to have in your hip pocket because it can solve a lot of problems. And Temporal Tables, which is sort of Change Data Capture evolved. It’s really a fascinating thing. There are a few things I don’t like about it, and I have not had the chance to use it in a production situation, so unfortunately, the way I see it, it’s just sort of like dreaming: I could do this, I can see how it’s valuable. But for sure, the first three I’ve used in many situations, and honestly abused in many situations, especially Change Data Capture, but they’ve solved very important problems for the business. And I think sometimes we, as developers and as individuals, have a real problem with scale, because you think, oh man, I spent three months figuring out how to use this thing – but if you’re working for an energy company, well, they have 100 rigs out in the field and the bill rate is 25,000 a day, and it’s a company with a trillion dollar market cap, or tens of billions of dollars involved. So your little three-month adventure to put something into place that could, potentially, years later still be solving that same problem – it’s a worthy investment.
That’s just kind of my little aside, because I think we sometimes just don’t understand the value proposition of what we do and any investment we make on what technology we learn or approaches or how to better approach our job or have better tools, mental tools, mental models, they pay off in a very real and tangible way. Not just for yourself career-wise, but for your employer or your client or, you know, maybe a potential future peer because you’re able to bail someone out in a jam.
Brent Ozar: For years, yeah, they just keep paying off for years, ideally.
Martin Perez: Yeah, and that still blows my mind – something you’re very proud of at the time, like, I kind of pieced together this dinky mail merge in Access…, but it was talking to our – you know, it hit our CRM system, and I remember, this feels gross but it did the job. And you find out that that used to be two weeks of someone’s job, or a week of someone’s job. Year after year, this was their responsibility – hey, it’s the third week of the month, don’t talk to so and so because you know what they’re doing… And that’s real value. That freed up that person’s time. They didn’t get rid of that person; that freed up that person’s time to do something else. And I’ll leave the job and I’ll have lunch with some friends, and it’s like, yeah, we finally replaced some crappy Access database. There was a computer in the corner that no one was allowed to touch. I’m like, are you kidding me? That’s been running for five years. And you kind of pat yourself on the back and realize that something that, as you progress and get different tools under your belt, might appear to be a little janky, but it solved a business problem. It provided value, and that’s good. I know I’m kind of on a philosophical tangent here, but that’s also an important lesson in empathy when I go on site or I start a new project. It took far too long to learn that nobody’s going to be impressed with you for looking at someone’s bad code and complaining about it. Nobody cares. And you don’t know the constraints the person was up against when that was put in place. Something perfectly reasonable for a 2GB database – and it could have been 2GB for years – is just not the right answer for a data warehouse project. This wasn’t intended to be a data warehouse, so of course the tools in place and the solutions in place don’t match. So be nice; have empathy for someone else.
Okay, so Rowversion is still very much in the brute force category. We’re going to have to do a lot of comparing things. It doesn’t help us with deletes, but what it can do is dramatically decrease the number of rows that we’re going to have to compare and look at. If I have a table with tens of millions of rows, I want a way of saying: okay, the last time I looked at this, the highest version was this. So I’m still going to brute force a lot of things, but my candidate rows are anything newer than this version.
There are things that are not good about it – and this might be on a later slide – like if you update a row to itself, if you just take a row and update it to its original value, it still increments that Rowversion counter. And I can see that, from a technical perspective, as probably correct. Something happened; the fact that the data didn’t change is irrelevant to that. I’ve issued an update statement and said, please change this value. That may be the correct answer, but it’s not an interesting answer, because I don’t want to do stuff with a row if it’s still itself.
And, of course, there’s I/O overhead. This is going to take up space on your storage system. Sorry, I jumped ahead of myself – not pictured: SQL Server replication, triggers, and third-party tools. These are valid approaches. They are perfectly the right answer in many, many situations – probably not triggers. But since the title is T-SQL tools and we’re just talking about T-SQL things, some very, very good approaches are just not in the purview today. But yes, third-party tools – if you can get them, do it. Save your brain.
Okay, so now we can talk about Rowversion. I love it because it’s system-managed and it’s incremental. It’s not guaranteed to be unique, but we don’t care; that’s not what we care about. This is not a candidate for a primary key. This is not a good candidate for a clustered index, unless you want lots of data pages moving around because you like that. It is just a thing that increments when a DML operation fires on a row. Nice things – you can add it to an existing table. And if it’s a big table, we still have to do common-sense things. Is it something that we can do in a reasonable amount of time? Do we have an outage window? Is this an always-busy, highly transactional database? Do I have spurious updates going on? I learned the hard way that sometimes applications just like to say: I’m alive, I’m alive, updated, I’m alive, still here. Anything happen? Nope, but I’m here. You have to know these things, because that could just invalidate this as an option.
So what you do get is this nice little binary value that you know is going to be bigger than the last time. So a whole slew of questions can leave your brain. You don’t have to worry about, what is this thing? It’s bigger than last time. If it’s different, if it’s changed, if it’s been touched, it’s bigger. So by keeping track of the last version I looked at – and I should probably store that in a table somewhere – I know a row was touched, even though I don’t know exactly what happened to it. I don’t know if it’s been inserted or if it’s been updated, and I definitely don’t know if it’s been deleted, because the row is gone, so it doesn’t help us there. But I don’t have to guess. I can put a nice index on that, and maybe I brute force 100,000 rows or 10,000 rows, or however active your system is, versus the big giant table.
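The high-water-mark pattern described above can be sketched like this; all object names (`dbo.Source`, `dbo.SyncState`, `RowVer`) are hypothetical, not from the session demo:

```sql
-- One-time setup: add the rowversion column and a supporting index.
ALTER TABLE dbo.Source ADD RowVer rowversion;
CREATE NONCLUSTERED INDEX IX_Source_RowVer ON dbo.Source (RowVer);

-- A small tracking table holds the high-water mark between runs.
CREATE TABLE dbo.SyncState
(
    TableName  sysname   NOT NULL PRIMARY KEY,
    LastRowVer binary(8) NOT NULL
);

-- Each run: note the current ceiling first, pull changed rows, then advance.
DECLARE @last binary(8) =
    (SELECT LastRowVer FROM dbo.SyncState WHERE TableName = N'dbo.Source');
DECLARE @now  binary(8) = @@DBTS;   -- highest rowversion used so far in this db

SELECT s.Id, s.DisplayName           -- candidate rows: inserted OR updated;
FROM   dbo.Source AS s               -- deletes are invisible to this query
WHERE  s.RowVer > @last
  AND  s.RowVer <= @now;

UPDATE dbo.SyncState
SET    LastRowVer = @now
WHERE  TableName = N'dbo.Source';
```

Capturing `@@DBTS` before the read, and bounding the query by it, keeps rows modified mid-run from slipping between two syncs.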
So some limitations: you only have one Rowversion column per table, and that makes sense – I can’t even imagine the value of having more than one. The value increments with each insert or update operation, so any DML is going to take that counter a little bit higher. It’s not guaranteed to be unique, but for our purposes, that’s okay. And I think you have to go out of your way to get duplicate values – maybe a batch insert. But even if it is a batch insert and the values are the same, that grouping is still bigger than the last time I checked, so I do want those rows. Those need to come to me so I can look at them and figure out what to do next. Not a good idea for your primary clustered index, because you don’t want to cluster on something that’s likely to change, and this for sure will change. And a no-op update is going to increment the value, so again, technically it’s correct, but for me, trying to get something to the data warehouse or the BAs or report writers or whomever, this is interesting but not valuable. So, any questions so far or any comments?
Brent Ozar: No, you’re good. Let’s see here, one person said – Steve Malcolm said, “I’m going to have to watch this about 15 times.”
Martin Perez: Well, man, I’m watching these all morning like dude, my code samples aren’t cool like that. So many graphics, oh man.
Brent Ozar: Everybody always feels bad compared to – it’s also interesting to see like, three out of the four sessions so far have been on the Stack Overflow database and I’m like, yes.
Martin Perez: Oh yeah, that’s – I did try to like, I’m going to get the full version, it’s like, I don’t have 200GB. It’s like, oh, 15GB that’s pretty good. 15 compressed and no indexes. Great. Yeah, this is great to have a common language to work with and it’s a common set of tools like, oh yeah, badges, know that one, I’ve seen it. But this does bring up an interesting point in my example here. I’ve modified Stack Overflow database. In the real world, I’m probably not going to be able to do that. So I have definitely – when you’re having that conversation of what sort of value can you unlock, you do things like can we get on a call with a vendor, can they [crosstalk] for us.
Brent Ozar: Are we allowed to use this feature?
Martin Perez: Right. I’ve had people come back like, oh, CDC: unsupported. I don’t believe you, but okay. The ntext data type – well, we’re not changing our code, so it has to be in 2000 compatibility mode or we’ll break it. Well, I’m on a newer server. It’s like, nope. Unsupported. You’ll be in an unsupported configuration, and the people writing the checks go, unsupported? No. Hit the brakes, pump the brakes.
Okay, so that’s the end of that. But deletes are still going to be brute force. If you need to find out, alright, in my source system, what has been deleted – and I’m relying on Rowversion for my inserts and updates – I’m going to invert that not-in: what’s in my target that’s not in my source? And here I have a nice little delete CTE. Really learn to batch your deletes. Have a couple of ways of approaching that, because you just don’t want to fire off one giant delete. In these examples, my mindset is that this is probably something that’s going to be set up on a schedule. It’s not going to be 8am and you drag this script file from your desktop and hit the button – and if that’s how you work, that’s a bad place to be and we need to get out of it, because there’s a better future for you than just blindly deleting stuff. Things happen in big numbers on source systems that you don’t know about. You don’t want to fill up the transaction log. You don’t want to kill the reporting server because you’re processing 10 million deleted rows because somebody forgot to kick off the batch delete job in the source system for months. Real stories. It’s like, we got behind, we haven’t deleted for six months, and now we’re good, so we just deleted six months of data in the source system and now that just flows through. So, again, that concept of having more than one way of doing things. You experiment and you learn things. It’s like, oh wow, if I order by the clustered index and I delete batches by that, it’s actually pretty efficient – I can churn through a lot of data very quickly that way, as opposed to just deleting and letting the pages kind of be grabbed as the optimizer gets them, and then you have lots of potential for blocking, and bad things can happen.
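The batched-delete CTE isn’t reproduced in the transcript; here is a minimal sketch of the idea, again with hypothetical `dbo.Source`/`dbo.Target` tables keyed on `Id`:

```sql
-- Delete target rows that no longer exist in the source, a chunk at a time,
-- so one run can't blow out the transaction log or hold locks for minutes.
DECLARE @batch int = 5000, @rows int = 1;

WHILE @rows > 0
BEGIN
    ;WITH doomed AS
    (
        SELECT TOP (@batch) t.Id
        FROM   dbo.Target AS t
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Source AS s WHERE s.Id = t.Id)
        ORDER BY t.Id          -- walk the clustered index in order
    )
    DELETE FROM doomed;

    SET @rows = @@ROWCOUNT;    -- loop ends when a pass deletes nothing
END
```

Ordering by the clustered key, as the speaker mentions, keeps each batch touching contiguous pages instead of scattering locks across the table.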
Alright, so our Rowversion recap. System-maintained, which is a good thing. It is unique per database, mostly – you can do things to get duplicate values there, but for keeping track of what’s been modified, that’s no big deal. It can really help us – we’re still in brute force land, but it can give us clues on, okay, I only have to look at this much instead of the whole table. It doesn’t help at all with deletes, and it can lie to you, because if you’re trying to move data changes and a row got updated to itself, that’s kind of a lie. So that’s it for this section. Any questions or comments?
Brent Ozar: No, looks good.
Martin Perez: Okay, cool. Change tracking. I do like this one quite a bit. So we’re moving up from like – we’re going to be in brute force for a long time but this is a nice step up because this solves more than the Rowversion. So whenever I talk about some of these new options, I do want to make sure that we understand, unless you’re the sole DBA developer, we’re going to have to have a conversation with your admin. We’re going to have to have a conversation with your SAN admin, we’re going to have to have a conversation with other people who help to support your system because every time you get clever and you’re the only one that knows how to do something, that’s not a good place to be. So the more you bring in – if you’re going to bring in something to your organization, if you’re going to bring in something to your environment, understand the pros and cons, understand the administrative concerns, and be able to explain it to your peers or else you’re going to get a call at three in the morning when it’s not your turn to get calls at three in the morning.
Okay, so it has to be enabled at the database level, so as a developer you probably can’t just set this up yourself. You’re going to have to coordinate it, and you additionally have to enable change tracking at the table level. This requires elevated permissions – make sure it works in dev, figure out what needs to happen to get it into production. And we do have an option, which I really do like, to track changes at the column level. What that unlocks is a bitmask that says, okay, at this version – and you have to do some things to pull the details out, but you can say, oh, this update happened and these two columns changed, or this one column changed. That lets us do things like: well, if it’s just my spurious update – I’m alive, I’m alive, everything’s fine – okay, I can ignore that. If it’s something like a customer ID or things that are important – or even, not everything gets moved to the reporting database, not everything’s a candidate for the data warehouse – so if you’re at that level of maturity and you have enough metadata to know this is important and this is not, you can start to winnow down the sorts of things to look at and approach it more efficiently.
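The two enablement steps described above look roughly like this; the database and table names are placeholders, and the retention values just echo the numbers discussed later in the session:

```sql
-- Database level (requires elevated permissions):
ALTER DATABASE MySourceDatabase
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Table level, with per-column tracking so the SYS_CHANGE_COLUMNS
-- bitmask gets populated on updates:
ALTER TABLE dbo.Source
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
```

`CHANGE_RETENTION` is the window the speaker warns about: if a sync job is down longer than this, you fall back to the brute force reset.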
And it is a first-class citizen in SQL Server so you get some nice built-in functions, you don’t have to get too clever. You just have to learn some new syntax. But it’s a very, very good middle ground. It’s one of my favorite approaches because it’s not as intensive as Change Data Capture, it doesn’t give you as much, but it solves enough of this problem that it’s very likely with that 80/20 rule, it’s very likely going to give you a lot of value, a lot of bang for the buck, or brain bucks.
So this is at the database level. If you’re a developer and you can do this on the prod system, that’s probably a bad thing, so you’re going to have to talk to your admin. Let’s talk about that retention period and cleanup behavior. Fortunately, I’ve never run into this, so it goes in the pile of things I might have to look up later if it happens, versus war stories, but I do recall I think Kendra had a blog post about a very active system where the cleanup process wasn’t happening fast enough, so the database had a lot of bloat to it. And there’s no supported, official way to just get the cleanup going the way you need it – there’s some kind of, okay, somebody – I don’t know how you get these secret stored procedures or secret things from Microsoft – it’s like, okay, run these and it’ll sort you out. But that’s something to keep in mind. So this is just part of that: when you’re bringing in something that solves a problem, these are things you have to be aware of, and things you’re going to have to have a strategy in place to deal with. I know with Change Data Capture, if things get crossed up: disable Change Data Capture – boom, all my stuff is gone – re-enable it, rebuild it. So brute force is a great thing to understand, because brute force is always a reset button; that’s your get-out-of-jail card. So it’s really good for you to know things like: when things are great, it takes an hour or two to get all the changes and move them across; if I have to scan the table, we technically have an outage, because we won’t have reports for a day or two. Those are expectations you have to set upfront. People will be upset with bad news, but it’s even worse if you just don’t tell them, or you try to hide it, or you just don’t know. It’s like, how long does it take? I don’t know, good question, let’s find out. You don’t want to be there.
But it’s great to know that regardless of what more refined approach you have, brute force, you have it solved, you can always hit that button and get back up and running.
Retention period, it’s important to keep track of. If you have a job that needs to move data and it’s been down for three days and your cleanup period is two days, well guess what, you’re going to have to hit your reset button. So understand these things, make conscious decisions about it and make conscious decisions that are appropriate to your environment. If it’s a system that has moderate activity and you have plenty of storage space and you have plenty of resources and you just want to be super graceful to yourself, five days, you have lots of opportunities to find a problem and your data’s still there. If you’re really tightly constrained and you’re kind of squeezing things in, maybe that one day period is going to be – or if it’s just a very active database and you don’t want all this metadata just stacking up and stacking up, you make those informed decisions.
And I really do like this option for the column – tracking changes at the column level because that is interesting information. That’s valuable information. Some data is more important than others, some updates are spurious, some updates are interesting but not important. And if you have a way of filtering those out, that’s a good thing. That’s a nice thing to have.
So this is what you get: a new system function when you enable change tracking. You get a nice change table – you pass in the table name and the last version you synced to, and what you get back is all the changes that happened since then. So you’ll see a theme: keeping track of the last time you checked something. That’s what you want to do, whether it’s a table – it’s a table. Spoiler alert. It’s a table. Get your logging in place, have an approach, don’t shoot from the hip, don’t make it up every time. Just pick a model and stick with it, and once you have it rolling, it should be kind of a big deal to change it. That should fall into the bucket of: that’s a solved problem for me. If I need to back up a database and I don’t have LiteSpeed, it’s okay, it’s a solved problem for me. If I have to keep track of what changed, it’s okay, it’s a solved problem for me, because I just have my little tracking table and I’m good.
So this is great. You have SYS_CHANGE_OPERATION in the output from this table function – the change operation, the changed columns, and the version. So I have inserts, updates, deletes. If I care about inserts, maybe that’s one workflow; if I have updates, that’s a different workflow, and you can split out your workflows. You can make informed decisions based on what happened in your source system. SYS_CHANGE_COLUMNS is a varbinary. It’s a bitmask. I don’t believe I have an example on there – I can probably follow up – but it’s one of those things where, yeah, you learn how to do it and then you just put [crosstalk]. That becomes another one of your solved problems. Like, what changed? Cool, this is important, do this; this is not important, let it go.
And very common things: if you’re not used to working with table functions, you might forget to alias one and you get really weird errors. When you bring potentially new things into your workflow, these are new things that have to go in your head. That’s just a reminder, because I’ve definitely done that – why isn’t this working, why am I getting an error? It’s just the dumb stuff. Alias it, because SQL Server doesn’t know what you’re trying to do otherwise.
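A minimal sketch of the CHANGETABLE query being described, with hypothetical object names (`dbo.Source`, `Id`, `DisplayName`) and the alias the speaker warns you not to forget. The column-mask check is an assumed illustration of the bitmask he mentions, not a slide from the talk:

```sql
DECLARE @last_sync bigint = 0;  -- in practice, read this from your tracking table

SELECT ct.Id,
       ct.SYS_CHANGE_OPERATION,   -- 'I', 'U', or 'D'
       ct.SYS_CHANGE_VERSION,
       -- Did DisplayName change in this update? (reads the bitmask)
       CHANGE_TRACKING_IS_COLUMN_IN_MASK(
           COLUMNPROPERTY(OBJECT_ID(N'dbo.Source'), N'DisplayName', 'ColumnId'),
           ct.SYS_CHANGE_COLUMNS) AS DisplayNameChanged,
       s.DisplayName              -- NULL when the row was deleted
FROM   CHANGETABLE(CHANGES dbo.Source, @last_sync) AS ct  -- alias is required
LEFT JOIN dbo.Source AS s ON s.Id = ct.Id;

-- Store this for next time, so the next run only sees newer changes:
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS CurrentVersion;
```

Omitting the `AS ct` alias on the table function is exactly the kind of "really weird error" the speaker means.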
So what we have here is just kind of an example of – we get lots of extra information. I’m looking at something that changed: I see I have a delete, I have an insert, I have an update, and my update has that bitmask, and I can turn that into a column list, I can turn that into a targeted insert or update or delete statement. And I really like being able to target my updates. If I’m in a situation where – and this is real, this is not the best place to be, but this is a real place I’ve been – I’m moving stuff over a linked server, then if I can just send the three columns I care about to that linked server… I don’t have a dual active-active 40GB-per-second connection. Strangely enough, I don’t always have that. It’s not guaranteed to exist for me, so having things like targeted updates matters. If I do have 40GB a second active-active, I’m probably doing things like, oh, just restore the whole database over there. Send it all. That’s not the case, though. I very rarely have that sort of excess of resources available to me.
So change tracking recap. Great. Doesn’t require Enterprise Edition. You already have access to it; you can do it now. You can do it today in dev. You do have additional storage overhead, so anything we do is going to have administrative – not cost, but administrative concern. And there is a known issue: if you had an unexpected restart – well, you have bigger problems, but one consequence is that you can’t trust that change tracking data to have everything. You could lose changes. So your system that tells you what has been modified can lie to you, and you have to know that’s possible, and you have to know that, oh, we had a restart we weren’t expecting – that means I have to hit my reset button and do the brute force approach again. But if you didn’t know that was even a possibility, your reporting system may be lying to you. And with any new thing, it’s like, alright, got to look it up. Are there KB articles? Are there bug reports? Are there blog posts? It’s just an ongoing concern of keeping up to date on the pros and cons of what you’ve brought in. Change Data Capture is next – again, we’re just building on how we look at changes and pick them up. Questions?
Brent Ozar: No, you’re good. Alleged DBA points out – he calls it Expensive Edition – that Enterprise edition is required for CDC on 2016 until you get to Service Pack 1.
Martin Perez: Yes. So up until Service Pack 1, Change Data Capture was always a big meeting. A big presentation. Since Service Pack 1, it’s a medium-sized meeting – maybe the manager instead of the director. I still remember with Change Data Capture it was like, we’ve got to do this, and fortunately it was the sole system that actually did have Enterprise edition. But yeah, that’s a hard sell. That is a real hard sell if you don’t already have it in place. But if you’re on 2016 SP1 or higher, or if your earlier version has Enterprise edition, this is a great option. You can have two capture instances – whenever you enable this feature on a table, it creates what’s called a capture instance. And you can grab all the columns, you can grab a subset of columns, and you can get the before and after values, which is really great for auditing solutions. It can be great for: this is a really important configuration, somebody did something they weren’t supposed to, and I know exactly what it was right before. Because – true story – this was a configuration in our ERP system that changes how we value things in the warehouse, and those are very, very important configurations that have very real and expensive consequences.
So knowing what the value was before, so you can rectify the situation and get your depreciation – or whatever it is – reprocessed quickly, that’s a good thing to have. Then you as the DBA or the developer or the data steward can say, oh yes, it was this, it was changed to this on this date. Unfortunately, without something beyond the out-of-the-box feature, we’re not going to get who made the change. It only tracks the actual DML, the data changes. It does not add any system-level context like your SUSER name, anything like that. I’ve tried to get cute and run a database audit at the same time and correlate timestamps, but now I’d probably just use the sp_WhoIsActive trick: log those and say, based on the evidence between Change Data Capture, sp_WhoIsActive, and the host names this came from, it was probably this person.
So lots of flexibility. I see that as a pro. This is a great thing. And lots of flexibility is also a con because you can do silly things or you have to spend a lot of – potentially a lot of mental energy deciding what you want out of it or how you want to approach it. So this is what we’ve talked about so far, this I feel has the biggest potential benefit but also the biggest cognitive and administrative load so far.
A couple of notes on best practices. A separate filegroup is recommended for the Change Data Capture instance. There is additional I/O overhead, and if you have things like tiered storage – sometimes the SAN admin just hands you a slice of storage for your database; sometimes we have that tier zero, tier one, tier two storage – we definitely want to make sure that the CDC data lives in a place that’s appropriate for the I/O patterns. We’re going to get a lot of append-only inserts, plus the delete jobs that run as the data ages out.
Once we enable it, we get some SQL Server Agent jobs, so those are things we need to keep track of. And the great benefit is you can read from your CDC tables, so you’re not impacting your OLTP system as heavily. Your OLTP system can be happy doing whatever it needs to do while you’re reading data from a separate filegroup – maybe even a separate set of spinning disks or a separate LUN. You can hit that data without dragging down your online system.
And you do want to make sure that you understand what role has access to the tracked data. If you have tables with sensitive information – payroll information, or PHI – don’t put yourself in a situation where you sidestep your own security policies on the source table, because that secure information is being replicated into your CDC tables, and anyone in the CDC gating role can read that data. So that’s a concern you’ll have to keep track of: if you have a gating role on the database, is that person getting elevated permissions to data they otherwise would not have access to?
Brent Ozar: I never thought of that.
Martin Perez: Oh yeah, that’s a big one. And if you’re going to restore your database for testing or development or whatever, you have a nice keyword in your WITH clause: KEEP_CDC. Because that happens, and if you don’t know about it, you’re going to have a real bad morning – people think things are gone when they’re not. So just some things to keep track of with CDC.
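A quick sketch of what that restore looks like – the database name and backup path are placeholders. Note that the Agent jobs are not part of the backup, so they have to be recreated after the restore:

```sql
-- Restoring to a different server: without KEEP_CDC the cdc schema,
-- change tables, and metadata are stripped out during the restore.
RESTORE DATABASE Warehouse
    FROM DISK = N'X:\backup\Warehouse.bak'
    WITH KEEP_CDC;

-- The capture and cleanup SQL Server Agent jobs do not travel with
-- the backup; recreate them on the new server:
EXEC sys.sp_cdc_add_job @job_type = N'capture';
EXEC sys.sp_cdc_add_job @job_type = N'cleanup';
```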
There’s some great reading time ahead of you if you decide to take this on, because you get all sorts of brand-new functions to learn and think about, and to understand which ones matter to your life and which ones don’t. You have lots and lots of options for your cleanup job and for how you read the transaction log, but unless you’re in a very high-throughput environment or a specialized situation, you can fortunately settle on some sensible defaults early on and not have to turn too many knobs. You will have to understand your environment and its I/O patterns, and make sure things like your log reader job don’t fail to start – or get stopped or deleted by somebody who didn’t know what they were. So talk to the people in your environment so they know what these jobs are.
But yes, these are all links that on my computer are a different color. There are functions, and there are stored procedures – these will start and stop the jobs, create your capture instances – and these are all the things you have to think about. And it’s really funny to watch the different solutions that were introduced at different times. Rowversion has a function and its own special way of saying “between this and this”; then you get Change Data Capture and it has its own special way of saying “between this and this”; and we get to temporal tables – another new special way of saying “between this and this.”
So that’s – it’s kind of a mental exercise to always really – really taking the time to pin your mental energy to the core concept and now you’re just kind of learning, oh, this is a variation of a core concept between this and this. So that’s firm in my mind. And then it’s like, oh yeah, this is how they call it – it’s a compression for your mental energy.
So yeah – there’s no really cool GUI here, so you’re going to be writing some SQL and making more decisions. That’s the cost of it. But again, pretty early on, if you understand the environment, you can settle on some defaults, settle on some patterns. This represents kind of my go-to: you pass in the schema and the table, and you give it a role name – cdc_admin, that’s just out of habit. There’s nothing special about it; I think that was just the example in Books Online when I learned it, and I was like, okay, cdc_admin tells me everything I need to know. But whoever’s in that role, whoever has access to that role, has access to every bit of data in any tracked table.
But now that I think about it, this doesn’t have to be one role name for CDC across the board – that’s just how I set it up. If you do have PHI, you could do a cdc_phi_admin, or “Gary,” and only Gary’s allowed to look at this – however you want to roll that. So that’s something to take into consideration: if you do have financial data or PHI or things that are very locked down, you could have different gating roles per domain of data. Just as you might lock down data by schema, you can take a similar approach here in your change tracking or Change Data Capture. But again, we don’t get out of understanding our environment, understanding the data that’s involved, understanding the value of the data – because you do want to be in that value conversation. And being in that value conversation means I don’t just get to look at the bits and the cool tech; I understand what the data means to the business, and I understand what happens if data goes where it ought not.
So come up with a naming scheme for your capture instances. There’s also @supports_net_changes; I typically don’t enable that, because a lot of the time when I’m using this, I do want the intermediate values – I want to know if something changed from this to this to this. If you do support net changes, I think you just get some extra indexes, and you can say, okay, I don’t care how many times it changed, but as of 5pm yesterday, what did it look like?
And the captured column list, this is important. Don’t default to grabbing everything. Make informed decisions about what you want. And I keep going back to the example of the spurious updates because that’s a very real thing where you have I’m alive, I’m alive, I’m alive, something change, no, I’m alive, I’m alive, everything’s okay. And I might need to know that so I would create two capture instances. One that is just the I’m alive column and then the other ones that are the rest is okay, these are important for reporting, the business, or whatever else I’m doing with this.
And do name your filegroup. When you enable Change Data Capture, create a new filegroup – call it CDC – and if you can, put it on a different LUN, different storage. Faster storage is better, that tier one or tier zero, because you’re going to have lots of writes and deletes. It’s going to be append-only: something changed, insert, insert, insert. That’s what happens – it inserts, and keeps going. So knowing that the I/O patterns are potentially very different from your OLTP system, you might make those decisions accordingly. And it’s to your benefit to have a conversation with your SAN admin. Unless you’re just in one of those all-flash-storage places and it doesn’t matter, I guess.
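Pulling those best practices together, a sketch of the enable calls – the table name, capture instance name, column list, and the `CDC` filegroup are hypothetical examples, not fixed names:

```sql
-- Once per database.
EXEC sys.sp_cdc_enable_db;

-- One call per capture instance (a table can have up to two).
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Badges',
    @capture_instance     = N'dbo_Badges_reporting',   -- your naming scheme
    @role_name            = N'cdc_admin',              -- gating role for reads
    @captured_column_list = N'Id, UserId, Name, Date', -- don't default to everything
    @filegroup_name       = N'CDC',                    -- the dedicated filegroup
    @supports_net_changes = 0;                         -- keep intermediate values
```

Scripting this once and keeping it is exactly the “pattern and sensible defaults” point: the next time you need CDC, the naming, role, and filegroup decisions are already made.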
Brent Ozar: Yeah, money’s not an object.
Martin Perez: Right. So, same thing: you have to know what your indexes are. You have to tell the capture instance what uniquely identifies a row – if you don’t know that, you have bigger problems; solve those problems, then look at this stuff. Monitor the rate of change. Again, if you have the spurious updates, you’re going to get a lot of I/O. You could run out of space, and that’s not a good thing.
Brent Ozar: Jodie asks, while you’re on CDC, “Does the captured column list for CDC levy any restrictions on schema changes being made to the table you’re monitoring?”
Martin Perez: Okay – I think if you drop a column, I don’t recall offhand whether the DDL fails or whether it breaks your capture instance. If you add columns, cool, they just don’t get tracked; you have to add them to the capture instance. One of the system tables does keep track of any DDL on a tracked table. In the past, with our ERP system, somebody could go make a change, hit save, and that would add a column or change a column – so without any DBA intervention, I had changes to a table that could happen in the normal course of business. I had very lofty dreams: I was like, yeah man, I’m going to read that DDL, replicate the change on the fly, and maybe send an email that I did it so I just know what’s going on. No. But I can for sure have an alert fire: hey, by the way, this happened. Go do something.
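The system table in question is `cdc.ddl_history`; a sketch of the alert query that could back that “hey, this happened” notification:

```sql
-- Every DDL statement executed against a CDC-tracked table is logged here.
SELECT  OBJECT_NAME(source_object_id) AS table_name,
        ddl_command,            -- the actual ALTER TABLE text that was run
        ddl_time,
        required_column_update  -- 1 if the capture instance needs attention
FROM    cdc.ddl_history
ORDER BY ddl_time DESC;
```

Polling this from an Agent job and firing an email when new rows appear is the “get a human involved” version of the lofty auto-replication dream.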
Brent Ozar: Get a human involved.
Martin Perez: Yes. So if you do write that piece that’s all awesome, post it.
Brent Ozar: Open source it.
Martin Perez: Yeah. It looks like we’re running short on time, so I’ll try to catch up. So this is something to look at. For any capture instance, the system automatically creates these tables for you. You get lots of new metadata: you have your start LSN – your log sequence number – your end LSN, and your seqval. It’s great to know that seqval is unique and the LSN is not; you can spend a lot of time having updates not work the way you think they should because you don’t know that the LSN is not unique but seqval is. You have your operation, which tells you: was this an insert, a delete, an update with the before value, or an update with the after value? You have your update mask, which can say, okay, for this operation, these are the columns that were modified. And then you actually have your copy of the columns.
So the way Change Data Capture works is that if you update a single column, the entire row gets logged into your change table. That’s why I say it’s important to understand the I/O patterns, the frequency of change on the columns, and how big a row we’re talking about every time something changes. A very small value can change, but we’re logging 40 columns every single time – that’s a different conversation than two values changing and logging five columns. And this __$command_id, it’s new; I’ve not needed it, but I think it’s really cool. It lets you know which things happened in a single transaction, so if it’s important to replicate a transaction as a transaction, you have the raw ingredients to do so.
And again, this is just, over and over: what happened between this and this? We have new vocabulary to learn – “all update old.” These are things you’re going to have to learn, but once you have that core concept – here’s my upper boundary, here’s my lower boundary, something happened between them – you just have to avoid silly errors like off-by-one. Is my upper boundary less-than or less-than-or-equal? Is my lower boundary greater-than or greater-than-or-equal? Those things can trip you up, because now I have a new term, “all update old” – what does that mean? That’s not a database thing. Who writes these things? But you have to understand what “all update old” versus “all” means, because those are two different things.
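Concretely, the boundary pattern looks like this – the capture instance name `dbo_Badges_reporting` is a hypothetical carry-over from the earlier example, and the generated function name changes with each capture instance:

```sql
-- Lower and upper boundaries for the window, as LSNs (binary(10)).
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_Badges_reporting');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT  __$start_lsn,
        __$seqval,        -- unique; __$start_lsn alone is not
        __$operation,     -- 1 = delete, 2 = insert, 3 = update (before), 4 = update (after)
        __$update_mask,   -- bit mask of modified columns
        *
FROM    cdc.fn_cdc_get_all_changes_dbo_Badges_reporting(
            @from_lsn, @to_lsn, N'all update old')  -- vs. N'all', which drops the before rows
ORDER BY __$start_lsn, __$seqval;
```

Ordering by `(__$start_lsn, __$seqval)` rather than the LSN alone is what sidesteps the non-unique-LSN trap described above.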
Brent Ozar: Rock, paper, scissors kind of thing.
Martin Perez: Yeah, but that’s again, establishing patterns and sensible defaults, those are your friends. Have a pattern, write it down, and keep it because the learning curve is high but now if I needed Change Data Capture, I have my scripts. I know what it’s going to be named, I have the behavior nailed down. It went from a very difficult thing with high payout to a very easy thing with high payout.
And the really nice thing is we have functions that translate an LSN to an actual date. So it becomes really easy to say, all right: every change before five o’clock, every change during business hours, every change before yesterday at two, whatever you want to do. These are great things. And there’s an example of seeing how that LSN gets translated to a datetime. Because if something’s important enough to move from your OLTP system to a data warehouse, the context of when it happened is probably important too. That’s good extra information you can make smart decisions with.
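A sketch of the mapping in both directions – the dates are arbitrary examples:

```sql
-- Wall-clock boundaries translated into LSNs for a change window.
DECLARE @from_lsn binary(10) = sys.fn_cdc_map_time_to_lsn(
        'smallest greater than or equal', '2017-06-01 09:00');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_map_time_to_lsn(
        'largest less than or equal',     '2017-06-01 17:00');

-- And back again: when did the last change in the window actually commit?
SELECT sys.fn_cdc_map_lsn_to_time(@to_lsn) AS last_change_time;
```

The relational-operator strings ('smallest greater than or equal', 'largest less than or equal', and their strict variants) are how you make the inclusive-vs-exclusive boundary decision explicit instead of falling into an off-by-one.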
And we do get this great level of information where I can say process separately, give me all the primary keys for things that are deleted, show me all the inserts. I want to see the old value before it changed because we’re going to audit, we’re going to log it, we’re going to do something with it, or we need to fix it because they weren’t supposed to do that and you have your new value.
So you have new things to monitor when you enable Change Data Capture on the database – and these don’t get created until you enable it on a table, I believe. You get some new SQL Server Agent jobs. You will want to make sure that they start, that they’re running, and that they don’t stop – because if for whatever reason you’re not monitoring them and you don’t notice, that’s a bad thing. If your capture job is not running, your transaction log is going to grow and grow and grow. If your cleanup job is not running, your CDC tables are going to grow and grow and grow. And a very busy system can be the driving force behind all of this – that’s why I initially had to learn this. We had an incredibly busy system, so I couldn’t query the online tables; very important dashboards that people looked at all the time would stop working. So we set up CDC, and – maybe not in absolute terms with all the compute power we have today, but in relative terms of what we had available to us – we were pushing at maximum. So it was incredibly useful to be able to split the I/O load: however often I wanted, I could grab that data from the CDC tables and ship it down to our BI guys or wherever it was needed.
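A sketch of the monitoring side – these are the built-in procedures and DMVs for watching the capture and cleanup jobs:

```sql
-- Configuration of the capture and cleanup jobs (retention, polling interval, etc.).
EXEC sys.sp_cdc_help_jobs;

-- Recent log scan sessions: is the capture job keeping up with the log?
SELECT start_time, end_time, duration, latency, command_count
FROM   sys.dm_cdc_log_scan_sessions;

-- Any errors the capture process has hit.
SELECT entry_time, error_message
FROM   sys.dm_cdc_errors;
```

Wiring alerts onto the Agent jobs themselves (job failed / job not running) covers the growing-log and growing-CDC-tables failure modes described above.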
So our recap for Change Data Capture. You’re going to have additional I/O overhead. There’s more things to learn, there’s more things to administer, and very flexible, which can be a pro and it’s very flexible, which can be a con. And if you have SQL Server 2016 SP1, standard edition, it’s available to you so, by all means, take the chance to explore and see if it’s going to work for you. How are we on time? Do I need to…
Brent Ozar: Got about 10 minutes left.
Martin Perez: Okay good. Temporal tables. This is the newest thing for me. So from my perspective, this looks like Change Data Capture evolved. This has all the niceness that we have in Change Data Capture but the system does it for you. So when you enable a temporal table, when you enable this for a table, the archive table automatically gets created for you. I don’t have to go through the extra effort of creating new table, grabbing that data, landing it, getting my ETL written. It just is there. It’s beautiful.
How that translates to storage overhead or where the data is written, I’ll be honest, I don’t know. I know when I do CDC I’ll change the default filegroup and then enable the feature so all the system tables are not in my data or where my data lives, my OLTP data lives. And so exploring this, that would probably be the first thing I’d do is like okay, where does this live? Because just because in SSMS it’s like, a secret thing, it’s like no, it’s still a table somewhere. I think sometimes the visual metaphor we see with the little folders and the plus sign and these things separates us from like, oh, this is how it lives in the file system, this is how it’s represented on our servers.
And more new syntax. The great thing with this – and I really do love it – is right here at the bottom: you can specify the names, the ValidFrom and ValidTo. Now, for each row in a tracked table, you not only see that it changed; you can use these columns in your queries: this was the value between this date and that date. So you have AS OF: what did the data look like as of yesterday afternoon? What did it look like at two o’clock, because something went wrong at 2:01? And you get your slowly changing dimensions for free. That’s incredibly powerful. This also has the thing I really don’t like: a no-op will get me a new entry.
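A minimal sketch of that DDL – the table and column names here are hypothetical, chosen to echo the ERP-configuration story from earlier:

```sql
-- System-versioned temporal table: the history table is created and
-- maintained for you once SYSTEM_VERSIONING is ON.
CREATE TABLE dbo.ValuationConfig
(
    ConfigId  int           NOT NULL PRIMARY KEY CLUSTERED,
    Rate      decimal(18,4) NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ValuationConfigHistory));
```

Naming the history table yourself (rather than accepting the auto-generated name) is the temporal-table equivalent of the naming-scheme advice from the CDC section.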
So I consider updating a table to its original value a no-op: nothing actually changed. The server says yes, I processed your DML statement – check – and you look at the data, and it didn’t change. So if you’re using this to feed a data warehouse, or it is your data warehouse, that’s something to keep in mind; you might have to do a distinct, or something. These are the subtle errors that can creep in if you don’t really understand all the systems and you’re not actively thinking about what can go wrong, what can violate my expectations. Is my core assumption – that it only records a change if something actually changed – correct? No. And you have to check that, because you’re bringing in new tools that solve different problems, but they have their own quirks, and those are all things to know.
So back in our mental model of core concepts, I know I have to understand between these two points in time, or points in record, or versions or whatever between these two points, what happened and what sorts of things can tick off that something happened counter. The thing I really love about CDC is from my point of view, it behaves correctly with that no-op. So for the problem that I’m trying to solve, just what actually changed, if I update it to itself, I don’t get a new entry in my CDC tables. Only if something actually changes. So that’s the only reason I kind of still keep that at the top of the stack even though temporal tables just from what I’ve seen and played with so far look pretty amazing and I definitely want to find ways to incorporate them because I think they solve a lot of problems. Especially around your data analytics, data warehouse type workflows.
And you get a whole new way of doing the greater than, less than, before, between. Everyone has their own spin. Someone should have a single Powerpoint slide…
Brent Ozar: MC standard.
Martin Perez: Yeah. But again, the trick is: okay, how do I map this to my core concept? Then I just keep track of variations, little twists on things. I don’t think of these as separate domains. It’s: what happened between these two things? Is it greater-than-or-equal? I have my hard-learned boundary lessons, a long history of off-by-one errors, so by this point it’s like, okay, this is the same as BETWEEN, this is the same as that. It kind of helps the mental map.
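Mapped onto a hypothetical temporal table `dbo.ValuationConfig`, the temporal flavor of “between this and this” looks like this – note where each variant puts its boundaries:

```sql
-- The value at one instant.
SELECT * FROM dbo.ValuationConfig FOR SYSTEM_TIME AS OF '2017-06-01T14:00:00';

-- Rows active at any point in the window; upper boundary EXCLUSIVE.
SELECT * FROM dbo.ValuationConfig FOR SYSTEM_TIME FROM '2017-06-01' TO '2017-06-02';

-- Same, but upper boundary INCLUSIVE -- the classic off-by-one distinction.
SELECT * FROM dbo.ValuationConfig FOR SYSTEM_TIME BETWEEN '2017-06-01' AND '2017-06-02';

-- Only rows that both opened AND closed inside the window.
SELECT * FROM dbo.ValuationConfig FOR SYSTEM_TIME CONTAINED IN ('2017-06-01', '2017-06-02');

-- Current rows plus all history.
SELECT * FROM dbo.ValuationConfig FOR SYSTEM_TIME ALL;
```

FROM…TO versus BETWEEN is exactly the less-than versus less-than-or-equal boundary lesson, just wearing temporal-table syntax.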
Okay, and very quickly, combining these things. For everything except Change Data Capture, I’d say there’s some brute force potential involved – and I even put temporal tables in that, because knowing I can have no-ops that will bite me if I’m doing counts or sums, I have to account for that behavior, so there’s extra overhead. Very real-world story: the source was from a third party, we weren’t allowed to touch it, and it was on a server outside our data center. But the connection to copy files was actually faster than our linked server. So I had better performance getting a database backup, piping it over to our server, restoring it, and then brute-forcing it. You can combine these. If you can get someone to enable change tracking, that’s a fightable one, a winnable one: just enable it at the table level, then get your backups – be smart. On Sunday restore the full, have them ship differentials, restore with STANDBY. It’s great. I’m using this in a project right now: the weekly restore takes three hours, but the daily work is about a gig of changes, and that’s 10 or 15 minutes. So in 15 minutes I can have just that day’s changes, and I can take the brute force hammer and chip away at it.
And if you’re doing the full restore – this does happen – it can be faster to copy and restore an entire database than to go through a linked server or a really crappy connection to a different data center. You can restore it as a read-only database, and then you will never have locks. So fire up as many queries as you want at the same time, and hopefully you’re parallelizing your workload – figuring out a way to run things concurrently, whether that’s being clever with multiple jobs or being smart with SSIS. Don’t just have one T-SQL script that says one, two, three…
Brent Ozar: Select from.
Martin Perez: I’ve seen that. So these are things that you can combine because it’s very common to – the connection to your database is so crappy that it’s faster to back it up, restore it, and do your ETL stuff. And that is it for the demo or the presentation. I do have a couple scripts but I think we’re pretty tight on time.
Brent Ozar: Yeah, perfect, we got enough time for questions here. Steve asks, “Are there any size limitations for temporal tables like dimensional tables versus fact tables?”
Martin Perez: I don’t know. I’ve done some coding samples with this, but I’ve not run it in a production workload, so I’ve never been in a situation where that’s come up. Since it’s a system table, it’s going to be written wherever your default filegroup is, and if it behaves the way other system-managed tables behave, the limit is based on how much storage is available. So I don’t know if there’s an inherent limit on the feature itself, but we still have to keep track of what our system can handle. It occurs to me as I’m talking that this is something to explore specifically: if I can create a new filegroup and then enable the feature, does it write all of this to my separate filegroup? Then I can have my conversation with my SAN admin: okay, maybe this isn’t tier one storage, but can I get a bucket of the “cheaper stuff”?
Brent Ozar: Absolutely. Alright, well good job. Thanks a lot, Martin. That was very cool. And it’s interesting too to see as you go through all the features, I’m like, oh, with every one of these I know there was a real-world story about here’s how I learned this, here’s where I found this useful.
Martin Perez: Yeah, Change Data Capture is my favorite, because for about a year that unlocked a lot of long-term problems the business had had, and it was a learning experience for me professionally: the conversation around the technology went nowhere. Because that’s not the audience – it’s not their job to be impressed with cool stuff. “It reads the transaction log” – nobody cares. What they hear when you say, hey, I want to make a deep, intrinsic change to our most critical system, is: you want to touch our most important and rickety system, and now you want to make a deep change to it. So it was great to learn to pin the change, and the risk, to a mappable dollar amount. Once we were able to communicate – hey, listen, this whole bucket of problems can be solved like this – it worked. The risk was high; we definitely had some late nights, and it was a painful transition, but it got us through about a six-month-to-a-year period where they were trying to retire the old system. They brought in consultants, we had a bunch of people building a new system, but we were able to have a very stable transition using this. And there’s another really cool auditing solution: I did some dynamically generated, metadata-driven queries, so if we enabled change tracking, it would automatically build all the audit tables and do everything. The programming curve was high, but the cognitive load to implement from there on was zero. That was a great trade – hey, I want to be in a cave for about a month because I’m learning this stuff, but now any time we say, oh, I need to audit this, it’s a solved problem. We paid for it once and now it’s free to use. Those are the kinds of things: it comes from the real world, it’s very exciting, and it’s useful for a long time. So that’s my website.
I do lots of things that are not just – and any questions, comments, I’d love to hear from you.
Brent Ozar: Perfect. Well, thank you, Martin, excellent job there. Nicely done. Nicely done.