A DBA new to Availability Groups. Familiar with basic clustering and database mirroring concepts.
This session will cover the pros and cons of four typical configurations of AlwaysOn Availability Groups: stand-alone instances, failover cluster instances, multi-subnet, and a hybrid approach that I call disaster-recovery-on-the-cheap. It will also include their exclusive features such as read-only routing and backup off-loading. In addition, you’ll learn how to use the Availability Group Listener correctly and why you should be using Windows Server 2012 R2 or above, along with some of my own personal lessons learned.
Why I Want to Present This Session:
High-availability and disaster recovery topics excite me. I have been speaking on Availability Groups for a couple of years now and find that a large number of DBAs are interested in learning what I learned by making production mistakes.
Full Session Transcript:
Brent Ozar: So in the next session at GroupBy, Derik Hammer, who has just, you know, talk about cool ass names. Derik Hammer is going to be talking about Always On Availability Groups. So take it away, Derik.
Derik Hammer: Thanks, Brent, I appreciate it. Alright, so hopefully you’re at the right session, you know, architecting Availability Groups. As Brent mentioned, my name is Derik Hammer. I’m a data platform MVP and on the SentryOne product advisory council, friend of Redgate, but what really matters is that my career has been mostly in the database operations or production DBA area, focusing a lot on high availability and disaster recovery. I’ve had a few short stints as a developer, and right now I’m working as a manager of database architecture and design for Subway. So a lot of this session is about my lessons learned, from when I had my first production availability group back when the feature first shipped in SQL Server 2012, and then working up to current versions.
So what are we going to talk about today? I’m going to cover four different architectures of how you can set up your availability group, working from simplistic up to more complicated scenarios. And then I’m going to cover a couple of the highlight features of Availability Groups, specifically read-only routing and multi-subnet failovers. This is mostly going to be a design, concepts, and features talk with some lessons learned mixed in. It’s not going to be a how-to.
Also, Availability Groups is a very large topic. I wouldn’t be able to get to everything within 60 minutes, so there’s a whole list of things that we’re not going to be able to cover today. We’re only going to touch lightly on distributed Availability Groups. We’re not going to be talking about basic Availability Groups at all, which are effectively the replacement for database mirroring. And we’re not going to get into anything with Linux. So Availability Groups can run on Linux, they can run clusterless, they can run without domains with the SQL Server 2017 and Windows Server 2016 features that have come out, but we’re not going to get into all of those.
So we’re going to scope pretty much everything today to running Availability Groups within a Windows failover cluster, using some of the more traditional methods of doing Availability Groups that we’ve gone through version by version since 2012. So let’s talk about the simplest architecture you can possibly have, which is just two servers in the same data center. This is inside a single Windows failover cluster, and what you’re trying to accomplish is duplicating your data. That’s the heart and soul of an availability group: the fact that you’re going to have two redundant and identical copies of your database.
So what we have here is two servers, and those servers can be physical or virtual. It really doesn’t matter, as long as they’re physically separate. That’s kind of the first key tenet of architecting an availability group: making sure that you actually have the redundancy that you believe you have. If you have two virtual machines running two different replicas, but the way the hypervisor’s DR features are set up allows them to live on the same host at any given time, then that defeats the purpose.
So you want to be careful to understand that when you’re duplicating something with an availability group, whatever it is truly lives on separate physical hardware. You would need anti-affinity rules within your VMware clusters, for example, to make sure that your VMs never land on the same host together in the event of a host failure or a load-balancing migration. That extends to your I/O layer as well. You see here I have two disks in the diagram; I’m showing them as local, but whether they’re local or remote doesn’t really matter as much. The key, once again, is to understand that your duplicated data is physically separate. Now, I know a whole bunch of SAN admins will say, “Well, my SAN is redundant, therefore it’s not a single point of failure.”
But I’ve also personally lived through several SAN outages, whether it was a bad firmware upgrade or a massive power outage or something like that. So please don’t treat your shared storage as if it is truly redundant. Putting both of your replicas on the same physical disks is just as bad as putting both of your VMs on the same physical host. You never want a situation where your availability group can go down because you lost any single thing.
So in this architecture, really all we have is HA. We’ve got high availability, which is typically described as fast and highly responsive failovers. There is synchronous commit between replica A and replica B, which means it’s replicating data from the primary to the secondary in a manner in which it can guarantee zero data loss. The way it handles that is the application will send a transaction down to the primary, and as the primary is writing log blocks, those are being sent over to the secondary. At the end of the transaction, when it commits, the secondary has to harden its log blocks, which means it has to make sure that data is durable on the secondary. It will reply back with confirmation that those blocks have been hardened, and the primary can then allow the commit to complete and reply back to the application that the transaction has been completed.
Because of that, you know that at any given time, if replica A, the primary, were to drop out because the server crashed, whatever the case may be, only one of two things could have happened. Either replica B has all the appropriate data for it to finish its redo phase and you haven’t lost anything, or the transaction never committed. It didn’t commit, the application gets an error because of the loss of the server, the connection is going to die anyway, and the application will know that it never successfully completed that transaction. Therefore, it can do whatever appropriate retry logic or error handling is built into the application from there. But at least with synchronous commit, the application is never left in a situation where it was told the transaction was completed and then it lost that data.
Because of this, you can achieve automatic failover. It’s a simple flip of a switch, effectively, whether you want manual failover or automatic, and you can only do automatic failover when you have synchronous commit. And that’s because, effectively, Microsoft’s not going to put a feature in there that allows you to automatically have data loss. That would just be disastrous for all of us. So what we have here is successful high availability at the database level, and I want to be clear about that because these are user databases that you’re grouping together.
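As a rough sketch of the switch Derik is describing, this is what pairing synchronous commit with automatic failover looks like in T-SQL. The availability group and replica names here are placeholders, not anything from the session:

```sql
-- Hypothetical names: [MyAG] is the availability group, SQL01 a replica.
-- Make the replica synchronous first; automatic failover is only
-- permitted on a synchronous-commit replica.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL01'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL01'
    WITH (FAILOVER_MODE = AUTOMATIC);
```

You would run the same pair of statements for the partner replica; SQL Server rejects FAILOVER_MODE = AUTOMATIC on a replica that is still asynchronous, which is the enforcement of the “no automatic data loss” rule he mentions.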
One of the great things about Availability Groups is that you can have multiple Availability Groups on the same instance, because you can put different sets of databases into different groupings for failover. And if your application relies upon three databases, for example, then all three of those can fail over together. It’s one unit that goes back and forth, so your application never has them on separate servers, and you don’t have to move the entire instance; this is just databases, and the rest of your instance is online and active the entire time. What this does not include, however, are system databases. So master, model, MSDB, et cetera, those are not replicated as part of Availability Groups, which means that right off the bat, even with this very simple architecture, you’re going to have to handle synchronizing your server-level objects yourself.
And what I mean by server-level objects is you’re going to have to manage logins, credentials, SSIS packages, SQL Agent jobs, certificates; all of that’s going to be on you as the DBA to handle. Now, as we continue through the presentation, I’ll mention some resources about automating that, but I want to make it very clear because, unless you have a very well-tested failover plan, you tend to find out during a failure that you moved over to your secondary replica and that new login that was added last week doesn’t exist on the secondary, and now your entire application doesn’t work. So don’t let your instance- or server-level objects induce a secondary functional outage when you’re relying upon a high availability feature. There’s a big asterisk there.
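One manual way to keep logins from going missing or orphaning, sketched in T-SQL; the login name is invented for illustration. The key detail is carrying the SID across replicas, so the database user inside the replicated database still maps to the login after a failover:

```sql
-- On the primary: capture the login's SID (login name is hypothetical).
SELECT name, sid
FROM sys.server_principals
WHERE name = N'AppLogin';

-- On each secondary: recreate the login with the SAME SID, so the user
-- in the availability database is not orphaned after a failover.
CREATE LOGIN [AppLogin]
    WITH PASSWORD = N'<use your real password>',
         SID = 0x...;  -- paste the SID value captured on the primary
```

Community tooling automates this; the classic sp_help_revlogin script and dbatools’ Copy-DbaLogin are the usual suspects for the automation he alludes to.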
So as I mentioned, this is the simplest architecture. We’re going to make it a little bit more complicated now. The second one includes a secondary data center in the mix. So now we have quadrupled the number of servers that we needed for the original application, and the same thing with our disks. A big highlight: I’ve had several fights in my past, when using Availability Groups, with storage admins about the amount of space being used by duplicating this data so many times. And everybody says storage is cheap, but as soon as we say storage is cheap, people start talking about solid-state drives and Fusion-io cards and all the new expensive high-performance SAN arrays, everything like that. So whether disks are cheap or not, it is a factor. It’s a capacity cost of your HADR solution, so what that looks like should be considered and acknowledged upfront.
So in this architecture, we now have four copies of our database and four servers. And ideally, all of these servers are going to be identical hardware. It’s not a good idea to ramp down your DR site, because the workload that’s running on your primary will eventually have to run on a secondary, no matter which one it is. On top of that, you need to make sure that you’re keeping all of your replication in sync and you’re not inducing performance problems from that. The way you do that is by making sure that your secondaries don’t have a lower-performing I/O subsystem or, you know, fewer CPUs, because those are going to be needed to process the replicated transactions that are coming in and redo all those log blocks. So do not undersize any of your servers. It’s ideal that all of them are identical in almost every way you can possibly make them.
So now there are a bunch more icons on this diagram than the last one. I’m going to start with the subnets. We have two subnets here because this example has to do with two data centers. So this throws into the mix the concept that you’re going to have two IP addresses as well, and that’s where the ear comes in at the top. The ear is meant to represent the listener. I didn’t put it in any specific spot in this diagram because the listener itself flows with the primary. The listener is actually, in this example, a Windows cluster resource that is effectively a virtual network name with n number of IP addresses, only one of which is online at any given time.
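For reference, a listener with one static IP per subnet looks something like this in T-SQL; the AG name, DNS name, and addresses below are all invented for the sketch:

```sql
-- Hypothetical names and addresses. One IP per subnet; the cluster only
-- brings online the IP that matches the current primary's subnet.
ALTER AVAILABILITY GROUP [MyAG]
    ADD LISTENER N'MyAG-Listen' (
        WITH IP (
            (N'10.10.1.50', N'255.255.255.0'),  -- data center 1 subnet
            (N'10.20.1.50', N'255.255.255.0')   -- data center 2 subnet
        ),
        PORT = 1433);
```

Clients then connect to the listener name instead of a server name; adding MultiSubnetFailover=True to the connection string tells the client driver to try all of the listener’s IPs in parallel rather than waiting for a timeout on the offline one.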
This is important because the listener is going to follow the primary, and whenever you connect to this singular virtual network name, you’ll get a different IP address if you were to fail over from, let’s say, replica A to replica C. Now, we’re going to talk a little bit about the connectivity of that and what the performance of the listener looks like in a few slides. But for now, let’s switch over to how replica A can actually replicate to replica C. We’re using asynchronous commit in this diagram, and asynchronous commit is extremely similar to what I just described on the last slide about synchronous commit. The primary difference is that once the transaction commits successfully on the primary, rather than waiting for all of the secondaries to harden their log blocks, it will immediately reply back to the application that the transaction has been committed, and that transaction is good to go.
This is where potential data loss can come into play if you have to do a DR failover. Now, HA is fast, automated, and tries to be as available as humanly possible. Disaster recovery is typically a much larger event that has a much larger RTO, a recovery time objective, and that means manual intervention is typically expected in those situations. So with asynchronous commit, you’re forced into a manual failover because of the potential for data loss, and you’re going to have to acknowledge that potential in order to do the failover, either via T-SQL or through the wizard in Management Studio. You’re going to have to check a box, or you’re going to have to type in a clause at the end of the T-SQL, that acknowledges the potential data loss, and it forces you into the manual situation there.
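The T-SQL clause he’s referring to is explicit about this. Assuming a placeholder AG name, a planned zero-data-loss failover versus a forced DR failover looks like:

```sql
-- Planned failover to a synchronous replica (no data loss).
-- Run on the replica you are promoting.
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- DR failover to an asynchronous replica: the clause name itself is the
-- acknowledgment that data loss is possible.
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```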
But this is meant to tell you that there could be data loss, not that there actually is. So one thing that I’ve had to explain a lot in the past to people is when they’ve looked at different statuses in the availability group and they’re like, “Well I turned off the application, I waited an hour for everything to synchronize, but it still won’t let me failover with zero data loss.” And it’s like, well, technically speaking, you will have zero data loss in that situation. If there hasn’t been a transaction for over an hour and that has been successfully replicated over to the secondary, then it is a zero data loss failover. However, the feature is going to force you to acknowledge that you are identifying that for yourself. There’s nothing within the availability group that’s telling you there’s zero data loss. You can query the DMVs and understand it, but you’re kind of acknowledging that that’s on you. So it forces you to still go into that situation.
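A sketch of the kind of DMV check he means; the interpretation in the comments is the general rule, so verify it against your own environment before relying on it:

```sql
-- If the secondary's last_hardened_lsn matches the primary's and the
-- send queue is empty, a forced failover would in practice lose nothing,
-- even though the feature still makes you acknowledge potential loss.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.is_primary_replica,
       drs.last_hardened_lsn,
       drs.last_commit_time,
       drs.log_send_queue_size        -- KB of log not yet sent
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = drs.replica_id;
```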
So now that we have our potential for failover and we have all of our servers, I want to highlight again that you now have four different places to synchronize all of your server objects. This is another reason why it’s really important to automate it. You don’t want to get into a situation where, every time somebody puts in a ticket to have a new login created on the database server, you now have to create it in four places. Because inevitably, and probably sooner rather than later, you’re going to have human error. Somebody’s going to miss a server, and you’re going to fail over to that box, you know, six months down the line, and wonder why your application is either only partially functioning or not functioning at all.
So now this is the architecture that I like to recommend. So we’re going to dip a little bit into ones that I find are less favorable just next, so we’re going to remember this one as kind of my ideal HADR solution with Availability Groups.
Edwin Sarmiento: Now Derik, I do have a few questions here from Brian, and he was asking, “Would you ever consider having smaller hardware in the DR location in a public cloud scenario? We are doing a manual failover and could dial up the hardware before the failover.”
Derik Hammer: So the problem that I would see with having lower-end hardware in the public cloud when you’re not actually running a workload over there is the synchronization. I’ve had to troubleshoot a lot of situations where we were wondering why the secondary was always so far lagged behind the primary, and common reasons for that include network throughput, but that’s not applicable to this question here. Another common one is the I/O subsystem. You know, if you’ve rationed it down in the public cloud and all of a sudden you have fewer IOPS to work with, your primary is still going to burn through transactions with its full scale of IOPS. It’s going to do whatever the primary can do. If you have enough data throughput, the secondary is not going to be able to harden those log blocks as quickly and you’re going to get a backlog. In addition to that, if you scale down the CPU too much, then you’re going to have an issue with the redo, when it’s trying to redo those log blocks on the secondary side. Now, if you know that your primary data throughput is small enough that, even with a reduced secondary, you can maintain your appropriate recovery point objective, which is how much data you’re willing to lose in a potential failure, then that’s technically okay. So if you know that you’re always going to be 10 minutes lagged behind because you’ve, you know, undersized your secondary site, and you’re okay with that, maybe you’re okay with an hour and this is only 10 minutes, then technically that works. But I’m still not going to advise it, and the reason is that there are a lot of unexpected things that can occur that can just destroy your recovery point objectives.
For example, let’s say normally you have intelligent index maintenance that occurs on the weekends, and it usually only rebuilds one or two indexes a weekend, so you get a feel for what that data replication throughput is on a weekly basis. But then something happens: you do a deployment that makes massive changes to your tables, and that weekend maintenance means all of your indexes get rebuilt because they got heavily fragmented, or whatever your criteria are for deciding whether those indexes get rebuilt. Well, then you’re going to have a sudden surge of data throughput that you weren’t expecting, you’re going to lag behind far longer than you expected, and if there was a failure at any time within that window, you’ve just blown away your recovery point objective. So be very careful. Technically you can do it; I have been put in situations where I was forced to, where I just couldn’t convince the right people of the right thing to do. But my standing recommendation is to always match hardware, especially if you want to leverage some of the other features, like read-only routing, that we’ll talk about in the second half of the presentation.
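To watch for exactly this kind of lag surge, the replica-state DMV exposes the send and redo queues. A minimal monitoring query, with my own reading of the columns in the comments:

```sql
-- log_send_queue_size approximates data-loss exposure (RPO): log the
-- primary has generated that the secondary has not received yet.
-- redo_queue_size approximates recovery-time exposure (RTO): log that
-- has arrived but has not been redone on the secondary.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_queue_size,   -- KB not yet sent
       drs.log_send_rate,         -- KB/sec being shipped
       drs.redo_queue_size,       -- KB awaiting redo
       drs.redo_rate              -- KB/sec being redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = drs.replica_id
WHERE drs.is_primary_replica = 0;
```

Dividing the send queue by the send rate gives a rough, point-in-time estimate of how far behind a secondary is; trend it over time rather than trusting a single sample.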
Edwin Sarmiento: Emily is asking, “A little confused. I thought the listener would do auto failover. Is it because they are in different subnets that you will need to do the manual failover?” And a follow-up question from her as well: “Can we use a replica to offload certain jobs, to do backups and/or reads of databases?” Now, if you’re going to cover that in a future slide we could leave the answers for that portion of your presentation, but that’s Emily’s question.
Derik Hammer: Okay, so I’ll give you a short answer and then tell you, yes, there’s more coming in the next few slides. The short piece that I’m not going to cover later is that the listener itself is a virtual network name that is important during the failover, but the reason that you can’t do automatic failover has nothing to do with the listener. It has to do with the synchronization setting. It’s the fact that you’re using async, or that you’re using synchronous but have chosen to not allow automatic failover, which is an option as well. In either one of those cases, that’s what’s preventing you. The listener mechanism isn’t really the problem. In theory, you could use synchronous commit between your two data centers, which I’m not advising because of performance, and that would enable you to do automatic failover across data centers; neither the listener nor the subnets are going to stand in your way. The main reason I advise against it is that typically your DR center is hundreds of miles away at a minimum, and the network latency involved can cause performance penalties for your application. Remember, your application has to wait for the secondary to harden all those log blocks before it can say it’s done. So a transaction that might be sub-second on the primary, but takes another 10 seconds to get to the secondary, will make the application wait 10 seconds instead. So be careful about your performance in that regard, but technically it can work the way you want. As far as the read-only routing and backup offloading, we’ll talk about those a little bit later.
Edwin Sarmiento: So I’m guessing you’re also going to cover maintenance and patching in a future slide?
Derik Hammer: So actually I’m not. Maintenance and patching is something that if we have time I’ll definitely talk about, but I feel we’re going to run out of time with the content that I have.
Edwin Sarmiento: Sure.
Derik Hammer: Okay. Alright, so the architecture that we left off on here blends two different Always On features. Some people cringe when I say Always On, but the marketing name for failover cluster instances and Availability Groups actually is Always On, and you can layer them on top of each other. I mentioned a moment ago that the previous architecture is my preferred one, and the reason is that here we’re getting into layers of complexity that I personally don’t appreciate when I’m troubleshooting an outage or a failover of any type.
So in this architecture, the reason somebody would use this is if they get backed into a corner where they cannot have quadruple the size of their database. If you’ve got a 100GB database or something smaller than that, having four copies of it is usually not a huge thing to sell people on. But if you’re talking about multiple terabytes, like one system I have today, where not a single database but the availability group itself holds 13TB of data, trying to tell people that you need four copies of that becomes a much bigger discussion, where the pushback on the dollar values you’re talking about actually is a little bit more reasonable. So if you get forced into a situation where you’re not going to be able to have four copies, then this is a good solution: instead, you are relying upon your SAN, so you’re going back to relying upon the redundancy that the SAN admins are advocating, and you’re going to have only one copy of your data in each data center instead of two.
And what you’re relying upon here is the failover cluster instance feature, which is a service availability feature, meaning that if the SQL service or the server itself has a problem, then a duplicate copy of that service will come online on the other server. But it’s going to utilize the exact same data. So this is a zero data loss situation that has only automatic failover, because both servers are reading the same storage; it’s impossible to have a situation where data wasn’t replicated. So in this diagram, replica A, for example, could move back and forth between the servers in its data center, and replica B could move back and forth within its data center, but they wouldn’t cross data centers. And you’re using the availability group for the database-level replication, to asynchronously commit between the two data centers. Again, I’m recommending async because of the network latency that you would typically have between two different data centers, and that will be a manual failover.
Now, once you layer Availability Groups and FCIs on top of each other, you completely lose the ability to do an automatic failover with AGs anyway. So a moment ago I mentioned you could, in theory, use synchronous commit and do an automatic failover across data centers. You won’t be able to do that as soon as you use FCIs, and the reason is that FCIs are already doing automatic failovers, and you can’t have two different competing features trying to handle the failover situation. So as soon as you layer them on top of each other, the AG knows the replica is hosted on an FCI and it disables the ability to do automatic failover.
So in this situation, you’re saving some money but you’re adding a lot of complexity. As soon as you have a failover, you’ve added a second feature, a second set of logs, a second potential chain of events that could have happened, all of which you have to troubleshoot. So if you’re asking, “Why did I fail over? What’s the problem here?”, it adds a lot of hassle, and if you’re in an emergency situation, you know, like a production failover, you may not have the time to properly diagnose everything. So it’s not something I enjoy, but it has been something I’ve had to do in the past because of limited space on disk.
Another thing I’ll mention is that the failover criteria are different between FCIs and AGs. FCIs are service-level, so you could have your database completely unusable, but if the service is running, you’re not going to get a failover. Now, with different versions of SQL Server, the criteria for what can induce an availability group failover have changed and improved. They’ve gotten better and closer to actual database health with every version. I won’t cover the full spectrum here, but what I will say is that FCIs don’t do anything at the database level. So while you would previously have been able to fail over automatically within your data center because of some database-level fault events, with the previous architecture using just AGs, in this case you won’t. It will have to be something catastrophic to the service, you know, a disk dropped off, the service crashed, something like that, in order for the FCI to fail over. I’ve definitely had situations where I was using FCIs alone and had to troubleshoot why I had a complete outage. And everybody said, “Why have an HA feature? Why didn’t it fail over?” Because it wasn’t checking for those things. So be cautious with complexity here.
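The AG-side health criteria he mentions are tunable. As a sketch, with a placeholder AG name: FAILURE_CONDITION_LEVEL controls how aggressive the health check is, and SQL Server 2016 added database-level failover detection, which is the “closer to database health” improvement:

```sql
-- Level 1 = only react to the service being down; level 5 = the most
-- aggressive, failing over on qualified errors. Level 3 is the default.
ALTER AVAILABILITY GROUP [MyAG]
    SET (FAILURE_CONDITION_LEVEL = 3);

-- Milliseconds before sp_server_diagnostics is considered unresponsive.
ALTER AVAILABILITY GROUP [MyAG]
    SET (HEALTH_CHECK_TIMEOUT = 30000);

-- SQL Server 2016+: also fail over when an availability database
-- leaves the ONLINE state, not just when the instance is unhealthy.
ALTER AVAILABILITY GROUP [MyAG]
    SET (DB_FAILOVER = ON);
```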
This next and final diagram I’ll show you is about distributed Availability Groups. Now, I’m not going to get too technically deep on them, but I wanted to show a really valid use case, which is when you’re using infrastructure-as-a-service in Azure, for example. But before I get to why that matters, let’s talk about the structure here. You see, I’ve got two different Windows failover clusters, and they’re on two different OS versions, 2012 R2 and 2016. I’m showing you that so you understand that this is technically possible. You can span your distributed AGs between two different Windows versions. I don’t recommend it; I’ll go back to my earlier comment that you should make everything match as best as you possibly can, because that just makes everything simpler. However, this is a supported situation, and what you end up having is an Availability Group of Availability Groups. So this is kind of complicated, just like my last one.
You’re layering the feature on top of itself, where on the left side you have replicas A and B as their own independent Availability Group. It has its own AG name, its own listener, its own everything. Same thing on the right side: C and D are their own independent structure. A distributed Availability Group says, “I’m going to have the primary of one AG write data to the primary of the other AG,” and then things cascade down from there.
Now, this is potentially useful for the OS version thing, but more importantly in my case, it’s for reducing egress charges in Azure. In Azure, any time you send data into an Azure region, it’s free; network charges don’t exist. But every time data leaves your region, that costs money, and it’s the region that matters, not just traffic leaving Azure. So if you’re moving data between one Azure region and a different Azure region, even though it’s still within the Microsoft ecosystem, it’s going to cost you some money. What distributed AGs do is limit the number of duplicate streams down to one. With the previous architectures, you’d have A replicating to C, D, and B, because it’s important to understand that these replicas, these secondaries, are not peers. They’re all secondaries, they all receive from the primary, and they don’t talk to each other. In fact, some of their DMVs and the information they have about each other is limited compared to what you’d get querying the primary, for example.
So in this architecture, if we didn’t use distributed AGs, you’d have a stream going from A to C and from A to D. That would mean that for the transactional throughput of your primary system, you’d have to double that volume of data and pay for it, because you have two streams. So one use of distributed Availability Groups is to reduce that down to one stream; you pay for your data egress once between regions as opposed to twice. The failover is a bit more complicated. I’m not going to get into exactly how at the moment, but there’s additional effort, and it’s definitely not one line of T-SQL like failovers are for normal Availability Groups, because you have to manage the fact that you have two different listeners routing traffic to each other with their endpoints, and the failover becomes a much more complicated thing for you to prepare for. It can all be scripted, of course; anything can be automated, scripted, and practiced, but it’s just another layer of complexity that you wouldn’t necessarily have with a single Availability Group by itself.
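For completeness, here’s the shape of a distributed AG definition; every name and URL below is invented for the sketch, and each side’s regular AG and listener must already exist:

```sql
-- Run on the primary of the first (global primary) cluster.
-- AG1/AG2 and the listener URLs are hypothetical.
CREATE AVAILABILITY GROUP [DistAG]
    WITH (DISTRIBUTED)
    AVAILABILITY GROUP ON
        N'AG1' WITH (
            LISTENER_URL = N'tcp://ag1-listener.example.com:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC),
        N'AG2' WITH (
            LISTENER_URL = N'tcp://ag2-listener.example.com:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC);
```

The primary of the second cluster then runs a matching ALTER AVAILABILITY GROUP [DistAG] JOIN AVAILABILITY GROUP ON … statement. Note the LISTENER_URL points at the database mirroring endpoint port (5022 by default), not the listener’s SQL port, which is one of the routing details that makes distributed AG failovers more involved.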
Edwin Sarmiento: Simon has a question in the Slack channel. “As it is supported, like mixing 2012 and 2016 Windows Server operating systems, would it be the right way to upgrade or migrate?”
Derik Hammer: So that’s a good question. It is a way to upgrade or migrate, and I definitely have nothing negative to say about that methodology, but there are other methods as well. There’s a way, with just Availability Groups themselves, where you can have one node on a different operating system temporarily. It’s definitely not supported to run full production like that, but it is supported for rolling upgrades. So I would look at it; the process is very well documented in Books Online, and you do not necessarily need a distributed Availability Group in order to do rolling upgrades. There is a mechanism available for you.
Edwin Sarmiento: I do have personal opinions about Microsoft’s cluster rolling upgrades, but I’ll save my opinions for later.
Derik Hammer: Alright. Any other questions?
Edwin Sarmiento: Not at the moment, I think you’re doing really great.
Derik Hammer: Awesome. I think we’re doing good on time and pacing too. So for the next slide, we’re going to start talking less about architecture and diagrams and more about features and lessons learned. On the screen here, we’ve got this F18 in fine working order. It’s been polished up nice by the detailers, and it’s up here to tell a story. When I first began implementing Availability Groups back with SQL Server 2012, I worked at a company that had a really hard time understanding the need for disaster recovery. So we would talk about it and they would say, “Yeah, DR is really important to us, high availability is really important to us,” but they always meant at the hardware level. They always meant, “Well yeah, if my server drops off or if my VM host drops off or if I have a power failure over here.” They always meant these very localized, individual failures, which HA can handle. But they never actually wanted to discuss the big stuff, like what if my data center was wiped off the face of the earth.
Now, this was in Virginia, we’re far away from the coast where hurricanes aren’t really a problem, there’s kind of mild weather down there so nobody seemed to resonate with any of the real world scenarios that could happen until 2014 when this jet crashed six miles from our primary data center. So we – in Virginia there’s a lot of military down there, we happened to be a number of miles away from the Oceana Naval Base, and this F18 took off and immediately had engine failure as soon as it got up in the air, and ended up crashing down into an apartment building.
And while the event itself was tragic, I ended up using this over and over again to remind people that there’s a million scenarios that can cause disastrous failure for you that you’re not thinking about. None of us were thinking about a potential plane crashing into our data center. So, in this case, it was awesome because after that my life was easy. Every single meeting was like, “Alright, now let’s talk about what happens if a jet crashes into the building.” And then disaster recovery talks became a lot more serious and it resonated with people.
I’ve only actually worked for three different IT companies so far in my career, and two of them have had near misses like this. So one of them was this one, and then at my current organization, we have one of our data centers in Philadelphia, and a few years ago a crane collapsed a number of blocks away from our building. I think it was something like less than 10 blocks away. So these things can happen. If the crane had crashed onto the wrong building, then our entire primary data center would have gone out and we would have had a real disaster to recover from.
So don’t just think small scale. Also think of the big scale, and make sure you cover yourself in multiple layers. And this feeds us into quorum. Now, quorum in this regard is a Windows clustering concept, and it’s a voting mechanism. It basically tells the cluster who is the true cluster, who’s the real Slim Shady. The key point is that the cluster never actually knows that other servers are down. The only thing it can ever know is that it can’t communicate with them or hasn’t received heartbeats for some period of time.
So because it doesn’t know that they’re down, it needs to prevent what is called split-brain. And split-brain is when you have, let’s say, an even number of servers; all of my architectures I showed you had even numbers of servers, by the way. So say you have four servers and something very common happens, for example, the network goes down between the primary and the secondary data center. So not even a real disaster, something that I’ve seen millions of times. If the network drops out, the servers in both the primary and the secondary data center don’t know that the other ones didn’t shut off. Both sides may want to be the cluster, and there might be resources, like IP addresses, that they’re both trying to use; there could be a whole bunch of mess. But the way Windows Server works, it doesn’t allow that to happen. If there’s ever a situation where a split-brain is possible, the cluster services will just shut off.
So quorum is there to try and rectify this. Quorum is trying to say, I know who the primary is, so that the minority nodes can shut off in this situation. And it does this by voting. Typically you have what’s called node majority, which is where you want an odd number of votes, so that if you lost, let’s say, two out of five of your votes, that’s no big deal. The two that can’t talk to the other three will shut their cluster services off. The three that are talking to each other and know that they have the majority stay online. Your cluster stays online. Things will failover properly to the majority half of the cluster, and then when communication comes back up to the other two, everything will just kind of normalize.
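[Editor's note: the node-majority rule described above can be sketched in a few lines. This is a toy illustration of the voting arithmetic only, not actual cluster code; Python is used just to make the numbers concrete.]

```python
# Toy sketch of Windows failover clustering's node-majority voting.
# Illustrative only: the real quorum logic lives inside the clustering
# service and also handles witnesses, weights, and timing.

def has_quorum(online_votes: int, total_votes: int) -> bool:
    """A partition keeps the cluster online only if it holds a strict
    majority of the total votes; an exact tie is not enough."""
    return online_votes > total_votes // 2

# Five voters, and a network split isolates two of them:
# the three that can still see each other hold a majority and stay online,
assert has_quorum(3, 5) is True
# while the isolated pair is a minority and shuts its cluster services off.
assert has_quorum(2, 5) is False
# An even voter count is exactly the danger: a 2-of-4 split leaves
# neither half with a majority, so both halves go down.
assert has_quorum(2, 4) is False
```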
Now, voters aren’t just servers though. Voters can also be file shares, they can be remote shared disks, they can also even be cloud witnesses. I don’t have that on my slide but you can also use Azure for example as an external witness. And you always need to make sure that you weight your votes appropriately. So let’s talk through a couple scenarios of this jet. Let’s say I were to have two servers in my primary and a file share witness. And then two servers in my secondary. Well, if the jet comes in and crashes into the primary data center, we already knew that we’d have to use some form of manual failover to get the secondary, but we now have to add a second step to that process, and the reason for that is because the servers remaining in the secondary site now think that they’re the minority in the cluster because they lost communication to the other three votes, which means they shut off their cluster services. So a cluster you thought would be online for your DR is no longer online, and it requires manual intervention to do what’s called forced quorum, which is where you take the responsibility of declaring these servers are good and can bring them back online.
Now, it’s just another step, and if you’re already doing a manual failover, maybe that’s not a big deal to add into your DR plan. But it’s better to avoid it, so let’s flip the scenario. Let’s say instead your primary data center only has two servers, and your secondary has two servers and a file share witness. In that case, if a jet hits the primary, then you don’t have to force quorum, right? But am I recommending that? No. I’m actually just presenting a whole bunch of different problems at the moment, no solutions yet. Because let’s go back to my network drop scenario. In a situation where your secondary site has more heavily weighted votes and the network blips, your primary data center goes down because of a network outage to your secondary site; you’ve just lost your primary data center because its cluster services will shut off thinking they’re the minority.
So what is the solution? Well, one solution is to use that cloud witness I just mentioned. So if you’re using Azure as your cloud witness, now you have three geographically separate voters, and if anything happens to your primary or secondary site, whichever one is still functional will be able to view the witness in the cloud, and that solves the problems that I mentioned right there.
In addition to that, we have a few kind of slick features in Windows Server that I’ll talk about now. The first one’s called dynamic witness. All of these features, by the way, are only in Windows Server 2012 R2 and above, and you should be on 2012 R2 or above anyway, but I figured I have to say it because I know there are some 2008 servers out there somewhere. With dynamic witness, the system is going to remove votes from voters as it feels appropriate in order to prevent the potential for a split-brain event. So in this case, we’ve got two servers and a file share witness, and we’ve got total votes of three, online votes of three. Now, there’s an intermediate state where server B, for some reason, can’t vote, whether that’s because it couldn’t send its heartbeats fast enough, whether the network actually dropped, or whether the server is actually down; whatever it is, it doesn’t matter. All the cluster knows is that server A and the file share don’t see server B anymore.
So at this point, you now have an even number of votes: a file share and a server. And because even numbers have the potential for dividing in half, the cluster doesn’t like that, and that’s where the dynamic witness feature kicks in and removes the vote from the witness. The witness will always be the default loser because it’s seen as less essential than any server, for example. So here we now have total votes of one and online votes of one, because it not only removes the vote to make it an odd number, it also recalculates the total so that you don’t become the minority. If the total remained at three and this happened, then server A would only be one out of three votes and the cluster service would shut off on server A, and that’s not good. So the recalculation is really important, and the fact that votes can drop offline is equally important.
The next feature we’re going to talk about is called dynamic quorum. It’s extremely similar to dynamic witness. The only difference is that it can continue to cascade further, especially if you don’t have a witness. So in this case, you have two total votes, two online votes. Dynamic quorum is going to immediately step in and take a vote away from server A. Now, this highlights the fact that there’s no problem with server A at this moment. It is fully online, it is a fully functioning part of the cluster, you can failover to it just fine; there are absolutely no problems here. But it’s removing the vote so that if something were to happen, if the network between A and B were to be lost, the cluster knows in advance that B is the winner, that B will end up being the node that stays online. A’s cluster service will shut off and you never have a full outage in that regard.
The last feature I’ll show you is the tiebreaker. Tiebreaker is the ability to survive losing 50% of your cluster, not just 49%. In Windows Server 2008 R2 and below, if you lost more than 49% of your cluster, all cluster services would drop because you’d end up in either a split-brain or a minority situation. With this feature, you have a parameter that you can only view and manipulate through PowerShell called LowerQuorumPriorityNodeID, and this declares which node is the one that dynamic quorum will act upon when the need arises. So when dynamic quorum has to pull a vote away, which node am I going to pull it away from? By default, the value is zero, and zero means the lowest node ID, which is arbitrary. I mean, the order of the ID values in your cluster is based on when you added the nodes, not on whether you actually wanted them to be those numbers.
So in this case, I’ve got server A as ID seven, and server B as ID two. But you’ll notice I mentioned that by default it picks the lower ID, so why did it pull the vote away from A? And the reason for that is because you can explicitly set it. So after you build your cluster, you can look at your ID values and say, I want my DR site to be the default loser, therefore I’m going to set LowerQuorumPriorityNodeID to the ID of a node in my DR site. So now if you ever lose 50%, like if you had eight nodes and you lost four, it stays online. Less than a couple seconds later, it’s going to recalculate the total to four; you could lose another half, so two more, and it’ll recalculate again; you could lose another one. And that’s what allows you to have last man standing, which is the ideal situation for a cluster. It’s still not perfect because you can’t survive a catastrophic failure, you can’t lose five of eight and have it stay online, but at least it’s a lot better than it used to be.
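[Editor's note: the cascading recalculation just described can be modeled with some simple arithmetic. This is a rough sketch under the assumption that the tiebreaker covers an exact 50/50 split; the real algorithm also involves witnesses, vote weighting, and timing.]

```python
# Toy model of dynamic quorum's "last man standing" behavior
# (Windows Server 2012 R2 and above). Illustrative arithmetic only.

def survives_sequential_losses(total_nodes: int, losses: list[int]) -> bool:
    """After each loss event the cluster recalculates its vote total,
    so it can keep shrinking as long as no single event removes more
    than half of the *current* votes. With LowerQuorumPriorityNodeID
    set, an exact 50/50 split has a pre-chosen loser, so losing exactly
    half still keeps quorum (modeled here by allowing survivors == lost).
    """
    votes = total_nodes
    for lost in losses:
        survivors = votes - lost
        if survivors < lost:  # a catastrophic, more-than-half loss is fatal
            return False
        votes = survivors     # dynamic quorum recalculates the total
    return votes >= 1

# Eight nodes: lose 4 (recalculates to 4), lose 2 (recalculates to 2),
# lose 1 more, and you're down to the last man standing, still online.
assert survives_sequential_losses(8, [4, 2, 1]) is True
# But losing 5 of 8 in one single event is still fatal.
assert survives_sequential_losses(8, [5]) is False
```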
Edwin Sarmiento: And that’s probably the answer to what Gordon was asking. His question was, “I’m never entirely sure what happens to the cluster when you have two or even one server online.” That’s probably the answer he was looking for.
Derik Hammer: Yes, exactly.
Edwin Sarmiento: And another question from Sree, “If we use the file share as the quorum type and it’s in the primary data center, how does it work? Will it be able to switch to the secondary?”
Derik Hammer: So witnesses are very easy to move. I mean, technically they’re a drop and create situation, but it’s very easy to reconfigure. So if you were to have your file share in a primary, through the Windows failover cluster manager, it’s very easy to just remove the witness and then add in a new one. It takes, you know, seconds worth of time, and only a couple dozen clicks in order to make it happen. So it’s a pretty easy move but there’s no like, migration concept. It’s just get rid of the witness and then make a new one.
Edwin Sarmiento: Actually, the concept behind the quorum is the most complex, convoluted, complicated, confusing topic ever but a lot of people don’t realize this is the most important concept that makes your failover cluster highly available. In fact, I spent a lot of time on my training class explaining just how the quorum behaves. But…
Derik Hammer: You’re exactly right. It’s really important, and that’s why I devote a bunch of time in this session to it because most people don’t even know what it is. And I at least want to get people enough to go research it, enough to say, “I understand why this is so important, let me go figure out this extremely complicated and very large topic.” Any more questions before we move on?
Edwin Sarmiento: Not at the moment.
Derik Hammer: Okay, thank you. Alright, so now let’s dive into a couple of the more fun features of Availability Groups. The first one I’ll talk about is that you can achieve faster failovers. Now, technically it’s not faster in everything; I know one person in the past has proven to me that database mirroring could actually failover faster than an Availability Group in some cases. However, what I’m referring to when I talk about faster failovers with AGs is everything to do with the client connectivity and DNS. This is specifically for multi-subnet failovers where you have multiple IP addresses. Now, if you failover within the data center, keeping that IP address and staying in the same subnet, then it’s no big deal. Your client applications will have cached the IP address from the DNS servers, their connection will break when the failover happens, and then it will come back up x number of seconds later when the failover completes. And the next time the application tries to connect, it’ll be successful.
However, when you cross subnets and you have a different IP address, this becomes another piece of time that you have to worry about. There’s a thing called TTL, time to live, on your DNS A records, and that’s a caching mechanism so that every connection that ever happens doesn’t, you know, slam your DNS servers like crazy. So most companies I’ve worked at either have it set to like, an hour, or 20 minutes, depending on what their policies are, and what that means is if it’s 20 minutes, for example, you would do a failover to your secondary site, the database might be online in 30 seconds, but the client application will be trying to connect and failing for 20 minutes, waiting for its cache to expire. Which means you have to put in some complicated steps into your DR plan to make sure that you flush DNS cache in all of your clients, and then you always miss some because somebody made a new app that they didn’t register in the DR plan. There’s always a lot of hassle there, and it significantly impacts your recovery time objective, how fast you can do your failovers.
Well, the way Availability Groups work by default is you’ll have a listener and you’ll have multiple IPs, and that listener will register multiple A records for the same virtual network name for each one of the IP addresses. This makes it even more complicated because if you don’t use the parameter I’m about to describe to you, what’ll happen is every time the client tries to pull down an IP from the DNS server, it’ll semi-randomly pick either one of the A records and the one it picks may or may not be the one that is actively in use at the time. Remember, I mentioned earlier the listener is handling which one is online and which IP addresses are offline. If they’re offline you can’t connect to them.
What this causes is an amazing amount of troubleshooting hassle, where everybody sees intermittent connectivity and feels like the system’s unstable. When really, the system is stable; the problem is that the client provider is caching a different IP address each time and causing you problems. And that’s where this parameter comes in. MultiSubnetFailover=True comes from the need for the provider to handle multiple IP addresses being registered for the same name. What it will do is cache all of them. So if there are two, it’ll cache two IPs, and then it’ll try to connect on both of them simultaneously, and the first one that connects wins; it abandons the other connection, because the listener makes sure that only one is ever online at a time. You never get a scenario where you connect to the wrong one. It’s always going to be the correct one, and it’s always going to be whichever one’s online.
So what that means is when you do a cross-subnet failover, the DNS cache can still do its job and not beat up the DNS servers with continuous attempts to get new IP addresses, but you get none of that time-to-live lag affecting your RTOs. Now, the caveat here is that you have to have a fairly up-to-date provider for this. You know, for example, .NET 4.0 and above providers, and JDBC 4.0 I believe as well. There’s a whole list of supported providers, and nowadays they’re pretty common, but I know for a fact legacy systems still use some old versions of providers that don’t have this parameter. So you have to be very careful when implementing Availability Groups to know that everything that connects can use this, or else you have to go back to the old way of doing things. And the old way of doing things would be to set the cluster’s RegisterAllProvidersIP parameter to false. What that means is that the listener will no longer register two A records. Instead, it’ll register just one, and during failovers it will maintain it, meaning it will update the A record to point at a different IP address. But that puts you back into the exact same scenario where you have to figure out how to handle your DNS caches yourself; you have to flush them during DR failovers and things of that nature. So if you’re using up-to-date providers and you use this in your connection strings, then you get faster failovers because you take the DNS caching challenge out of the mix completely and just allow the listener to handle it for you.
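[Editor's note: the connection string change described above is small, so here is a sketch. The server and database names are made-up placeholders; the MultiSubnetFailover keyword itself is the real one supported by .NET 4.0+ and other recent providers. Python is used only to assemble the string, not to connect.]

```python
# Assembling a SQL Server connection string that opts into parallel
# multi-subnet connection attempts. Names are placeholders.

def build_conn_str(listener: str, database: str) -> str:
    parts = {
        "Server": listener,             # always connect via the AG listener
        "Database": database,
        "Integrated Security": "SSPI",
        # Tells the provider to cache ALL the A records registered for the
        # listener name and race connections to every IP simultaneously,
        # so a cross-subnet failover is not stalled by the DNS TTL.
        "MultiSubnetFailover": "True",
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = build_conn_str("aglistener.contoso.local", "SalesDB")
assert "MultiSubnetFailover=True" in conn_str
```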
The next feature here is read-only routing. Now, other than the basic HA/DR of Availability Groups, this is my absolute favorite feature of the entire system because it allows you to offload your reads and your backups to a secondary system. We’re not going to dive too deeply into the backup concepts here, but I will mention that you can offload full backups as copy-only, and you can offload transaction log backups in their normal state to your secondary replicas. You cannot offload differential backups.
In my personal opinion, backup offloading complicates the management of backups significantly enough that I tend not to recommend it. I tend to still prefer running backups on the primary. However, if running backups on your primary is that much of a hassle for the performance of your system, you know, you could do it, or maybe look into why you’re under-sizing your hardware instead. But as far as workload offloading, this makes a lot of sense, partially because when you offload your reads to your secondary replicas, you are also implicitly forcing all of those transactions to run under snapshot isolation, which is an optimistic locking mechanism that’s really great for removing blocking. I’m not going to get into a lot about the isolation levels and their caveats, but it’s definitely something to look into and understand. If you’re using Availability Groups, know that you are changing from read committed over to snapshot whenever you read-only route to your secondaries. And then understand the benefits and potential pitfalls of that.
The way we’ll use read-only routing is you’ll pass in ApplicationIntent=ReadOnly into a connection provider that supports it. So again, you have to have some fairly recent connection providers to do this. And then the listener is going to manage this. So you’re always going to connect to the listener and this has to be through the listener. And then what’s going to happen is it’s going to reply back with the necessary information for routing you. So here’s an example of one of the DMVs I wrote a query for, showing the routing list. You have to configure this yourself. It’s not automatic. So you have a routing list and you’ll say things like, when this is primary, go to this secondary. When this one’s primary, go to the other one that’s secondary. And there’s a routing priority list with the important URLs all stored in here. This is T-SQL, it’s not that hard to implement, but it is something that you do explicitly.
On pre-2016 servers, you can only go to one secondary, meaning you just tell it to go to the next one. If that’s not available, go to the next one. However, in 2016 and above, you can loop in some round-robining. There’s a lot of criticism where this isn’t really load balancing, I don’t think anybody ever said it was, but people wanted it to be some load balancing I guess, and 2016 you got the ability to at least round-robin across your read-only replicas and you know, it still doesn’t assess load in any way, but at least it spreads the load across whatever readable secondaries that you have.
Now, you must also connect to the Availability Group database itself. This is something I haven’t touched on, but it’s really important. The listener is managing this, but just because you connect to the listener doesn’t mean it’s going to be utilizing the Availability Group features. You have to actually connect directly to a database within the appropriate Availability Group, and then the listener will utilize read-only routing. If you don’t do that, you’ll do what I did, which is spend a couple weeks, in the beginning, trying to figure out why read-only routing always sent me to the primary, and that was because I was using the default connection to master.
So just because I went through the listener doesn’t imply I’m going to the secondary replica. It has to know which Availability Group to look up because when you’re connecting in, depending on which database you pick, you could have multiple Availability Groups with different routing tables. So be careful of that pitfall, especially if somebody’s complaining that they’re never on the secondary replicas.
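[Editor's note: the two prerequisites just described can be captured in a connection-string sketch. The names here are placeholders, while ApplicationIntent and Database are the real provider keywords; this only builds the string and does not connect.]

```python
# The two things read-only routing needs from the client side, sketched
# as a connection-string builder. Names are invented placeholders.

def readonly_conn_str(listener: str, database: str) -> str:
    # 1) ApplicationIntent=ReadOnly asks the listener to consult the
    #    read-only routing list instead of keeping you on the primary.
    # 2) An explicit Database= that belongs to the Availability Group is
    #    required: defaulting to master means the listener cannot tell
    #    which AG's routing table applies, and you silently stay on the
    #    primary, which is exactly the pitfall described above.
    return (f"Server={listener};Database={database};"
            f"ApplicationIntent=ReadOnly;Integrated Security=SSPI")

s = readonly_conn_str("aglistener.contoso.local", "SalesDB")
assert "ApplicationIntent=ReadOnly" in s and "Database=SalesDB" in s
```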
So now there’s a little bit of connection overhead that goes into the routing. What’ll happen here is the ApplicationIntent parameter gets passed in from the client, and goes to the primary, to the listener. The listener’s going to reply back with the endpoint URL, and the client is then going to move itself over to the secondary. So it’s important for you to understand your routing tables, because you don’t want to do what I did. This was pre-2016, so there was no round-robining. I had set it up so that it would go first to my secondary within my data center. If that one was down, it would go to my secondary data center. This became a problem the first time I said, “You know, I can do some online maintenance, I don’t actually have to take a big outage window. I’m just going to do maintenance on my reporting replica, and when it comes down, everything is going to just go back to the primary,” because by default, when it exhausts its list, it goes back to the primary.
Well, my list wasn’t exhausted. So instead what happened is it took time to connect to the primary, and then it took time to try to connect to the secondary, and the secondary timed out because it was down. It took more time to go to the DR site because of a high-latency network connection, and by that time the application had timed out. Its connection timeout was really small and it just said, nope, I’m done. So by taking down what I thought was a safe replica to remove, I caused a functional, widespread production outage, because the entire system would no longer function. So be very careful about how long it can take for your secondary route to happen based on latency. And also work through what it means if it fails to connect: what’s the timeout, and how does that add into the connection time for the application itself. So be very cautious about that.
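[Editor's note: the failure mode just described comes down to simple arithmetic: the routing list is walked in order, and each dead entry burns wall-clock time before the client falls through to the next. The numbers below are invented for illustration.]

```python
# Rough worst-case time budget for walking a read-only routing list in
# order. All numbers are invented for illustration.

def total_route_time(attempt_seconds: list[float]) -> float:
    """Worst-case seconds to exhaust the routing list, assuming each
    entry is tried sequentially and each attempt takes its full time."""
    return sum(attempt_seconds)

# Local reporting secondary is down for maintenance (the client waits out
# a 15-second timeout), then a 6-second connect over the high-latency WAN
# to the DR replica:
worst_case = total_route_time([15.0, 6.0])
assert worst_case == 21.0
# An application with, say, a 15-second connection timeout gives up before
# the routing list is exhausted: the "safe" maintenance becomes an outage.
assert worst_case > 15.0
```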
And with this, we’re kind of wrapping up, we only got a few minutes left anyway and I want to try and take a question or two. This is a list of references that I think are really important from this point forward for anybody who, you know, was at the right pace for this session, these are the important next steps for me. The top one, synchronizing the server objects, which you should look into because you need to automate it and this gives some really great community examples of how to automate it. The second one is a PowerShell desired state driven failover test. What it will do is automatically reconfigure your asynchronous routes between your data centers and make them synchronous, wait for everything to catch up, do a failover, then reconfigure it back. And this is really useful for, like, quarterly or annual tests that you’re trying to do.
The next one is a way of kind of hacking SSMS to store the multi-subnet parameter and the ApplicationIntent parameter for you. And another one is kind of like a huge red flag. If you’re going to use filestream in Availability Groups, read this. Paul Randal helped me through a production issue a number of years ago and this article is kind of the brainchild of him and my problem. You can have some disastrous effects with the garbage collection of filestream when combining it with the lazy log truncation that Availability Groups use.
And then finally, there’s a seven-part blog series that gives you the how to that I didn’t give you today. So with that, if you have any questions I’ll take them now. If you also don’t mind, please go to the feedback link I have right there, SQLhammer.com/go/feedback, and let me know what you thought of the presentation.
Edwin Sarmiento: I do have a personal question for anybody who’s in the attendees. How many of you actually used read-only routing in your configuration? Just curious. I mean, you could type, yes I do have that, yes, we implement it, yes, I made the same mistake as Derik’s. I’m just curious. There is a question on the Slack channel. Sree, “Can we do backups in all servers? All servers priority to 50/50/50/50?” I don’t know what that means because that does not add up to 100.
Brent Ozar: I think that’s 25/25/25/25, maybe.
Derik Hammer: So if I’m understanding the question correctly, if you set everything to even weighting, there are still prefer-secondary or prefer-primary settings in there. So you’re going to have to do something like that. You’re either going to have to weight things appropriately so it knows which one to pick, or you’re going to have to have it prefer, you know, whichever side you want: primary, secondary, or things of that nature. I find that backup offloading becomes extremely complicated because all of your backup history is now spread across multiple msdb databases. So if you’re like me, I want to recover things very quickly, which means I’m always going to use some sort of dynamic scripting tool or something like that to drive my recovery scripts. I’m not going to, you know, hand-write a full, a diff, and then 10,000 restore commands in order to try and get the database back online. So I tend to advise against it because of the dispersed nature of your history across all those different msdb databases.
Then another one is the fact that log backups are really the only backups you can do the way you expect on your secondaries. Fulls can only be copy-only, which has some implications, and diffs are just completely out of the picture. So my personal choice is, I tried doing that for a couple weeks the first time and I decided it wasn’t worth it, and all of my Availability Groups to date just run backups off the primary.
Brent Ozar: Emily asked in – she said, “You mentioned that you provide info on some of the server level object items, is that in the links that are shown up on the slide?”
Derik Hammer: Yes, definitely. So if you’re talking about server level objects, as far as replicating them between cluster nodes, then the top link is what’ll get you there. If you’re talking about DMVs and some other information, the very bottom one, the seven-part blog series with the how-to, has a lot of that information.
Brent Ozar: Awesome. Well, thank you very much for hanging out with us today, Derik. Really appreciate the session, and thanks Edwin for monitoring the questions there. Everybody give him a warm round of virtual applause.