Monitoring Availability Groups

Audio podcast:

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS
Leave us a review in iTunes

Target Audience:

DBAs that have implemented AGs and need to know different ways to monitor them or what to monitor.


Once you have successfully configured Availability Groups, what comes next? In this session, we will go beyond setup and look at how to monitor your Availability Groups.  We will define and cover important metrics and alerts you need to manage a database in an Availability Group.

You will walk away from this session with tools you need to monitor your environment and know how to respond to alerts.

Why I Want to Present This Session:

I really like HA/DR technologies.

Additional Resources:

Session Transcript:

Brent Ozar: In this next session at GroupBy, Tracy Boggiano is going to explain to us how to monitor availability groups. And I know, I got so many emails asking about this one, so this will be good. Take it away, Tracy.

Erik Darling: It better be…

Tracy Boggiano: I hope so. Alright, like Brent said, we’ll be talking about monitoring availability groups. I’m Tracy Boggiano, I’ve worked on SQL Server for over 15 years. Normally when I do this presentation in person, I will tell everybody, yes, I know I don’t look that old, thank you. Currently, I’m certified on the MCSE as a data platform. I have other hobbies besides SQL Server; it doesn’t seem like it these days, but I do like to run. I also volunteer with foster children; I’m their advocate in court. If you’re in North Carolina, the link’s there for Volunteer For GAL, or nationwide in the US, Casa for Children. If anybody’s interested in those programs, please hit me up after this session. I’d live to talk someone into volunteering their time for that.

Enough of the non-SQL stuff. I’m currently a DBA at ChannelAdvisor in North Carolina, and there’s my contact information if you need it; that will be at the end as well.

So, what we’re going to do is do a short introduction, then we’ll talk about what we want to monitor, then we’ll talk about all the different things we can – how we can go about monitoring availability groups, including the SSMS dashboard, our regular SQL Server agent alerts, using the DMVs, using perfmon counters, using extended events and using wait stats.

So, what are availability groups? Well, it’s the newest, greatest thing that Microsoft has given us for high availability and DR. It’s really database mirroring on steroids because you can take all your databases and group them together and fail them over. In database mirroring, the databases were one at a time failovers, and they didn’t come over as a group. So, if you had databases that depended on each other, one could failover and the other ones could stay behind. Now they can all come over together and everything’s taken consistent.

They have the same ways to synchronize, synchronous or asynchronous. Most of the time, you asynchronous is used when you’re trying to do DR and your synchronous is used more for your high availability scenarios. You can have multiple replicas; so, you can have up to eight replicas in 2016. This is a feature that is new to 2012 and up. It’s an Enterprise only edition, except when you get to 2016 when they introduce basic availability groups; which is basically database mirroring with clustering. I’m not sure it’s worth it – clustering’s not that much fun. But, coming in 2017, you can have clusterless AGs, so we’ll see how much fun that’s going to help us out with. Also, when you go to Linux, if you’re using 2017 you’ll be using PaceMaker instead of Windows Clustering, of course, because you won’t be on Windows.

So, some acronyms we need to know so we can define what we’re trying to monitor is RPO; so, your recovery point objective describes how much data you can lose. So that’s just, you know, how much GB, MB, KB of data you can lose before the business gets very unhappy; or you as a database person gets unhappy because you’re losing data. Your recovery time objective describes how much time you can lose. So that’s how much – can you lose an hour’s worth of data? 15 minutes’ worth of data? How long can you be down for? And you combine those two together to come up with your service level agreement with your business, so they have an understanding of what the expectation is, and you have that understanding so you know what to monitor.

So, how availability groups work is – this is a picture from Books Online – so you would have a primary server on one side that takes all the data from the log and flushes it to disk. And at the same time, it captures it and sends it across the network and receives it on your secondary and caches it into a log and plays a redo on a redo thread and caches as a disk. And depending on whether it’s async or sync is whether it’s knowledge that’s committed, or not, back to your primary server. So those are the things you’ll be monitoring; when things are received and how long it takes to redo and whether it’s even sending the data.

So, the things that we will want to monitor is, is synchronization taking place? There is the ability for data to be suspended and you not to have done it yourself; the system can suspend it and you not know about it unless you have alerts set up. We’re going to be monitoring your RPO, how far behind is your replica? The RTO, how long will it take to catch up if it’s behind? And whether it’s available for automatic failover or not, because if you have it in synchronous mode, you can set it up for automatic failover, and if your RPO and RTO are far out, then it’s not ready for automatic failover and you could be outside of your SLAs with the business.

So, one of the first things you’re going to be checking out is your send queue. Now this queue keeps track of all the blocks that are being sent to each secondary. Now, each replica maintains the view of the queue. So that replica has to be up and online in order for it to know that there’s blocks that need to be sent over. This data is what is at risk of being lost if your primary goes down.  So, it can grow due to any number of different things, you know; the secondary not being reachable, the data movement being suspended. You can have network latency or just a large amount of data change, such as we were discussing earlier with index rebuilds or ETL processes loading large amounts of data, things like that. they can kill your availability groups in a heartbeat.

Then your redo queue, this queue keeps up with the blocks of data that have been received but has not actually been applied yet. And each replica has its own queue of this. On failover, this is what the queue must be in order to be processed. So, if this was blown up because you did an index rebuild, you’ve got to wait for all of this to be applied before that database is available for anybody to look at it. So, it can grow due to the same things as before and also just because of resources. I’ve been in shops where the secondaries – let’s just say the shops were – they go skimpy on their secondary servers and don’t give them as many resources as they do their primary servers. Such as not giving them enough memory or CPU because hey, that server’s not really being used until I failover. We’ll give it stuff when we use it. And then that server becomes behind, so you’ve got to watch out for those things. I would always split the servers the same so they can keep up with each other, but some companies do not do it that way.

So, the impact of the send queues on your availability groups are as follows. If it exceeds a certain amount, you’re no longer able to automatically failover; you have data loss. If you’re using the features known as readable secondaries, they’re not really readable secondaries because the data hasn’t been sent over yet to read off of. If you’re offloading your backups, you’re not really in a full backup because that data’s not there yet. The same with your log backups, they’re just going to fail. And my favorite part, your transaction logs are just going to fill up on your primary, causing you to get pages about low disk space. Those are fun. And then you have to grow space here and there and then you have to figure out what’s wrong and, you know, it happens at three in the morning most of the time, instead of during broad daylight when you’d be awake.

Erik Darling: Never happens when you want it to.

Tracy Boggiano: No, never happens when you want it to. All failovers happen at 3 am or 2 am, one of the two. Most of mine have, so… Unless it’s in development, that happens during the daytime, most of the time. It’s weird how that works.

But your redo queue, you have most of the same effects. Your redo queue has the same impact on your availability. If it can’t keep up, if the secondary can’t keep up, then when you go to failover, it’s delayed because it’s got to replay that redo queue. So, if it’s got an hour’s worth of index rebuilds to replay because you didn’t give enough memory or CPU to keep up, then you’ve got to wait an hour until that database is available. Again, you can’t really read the data from there because if it was an ETL process instead of an index rebuild, then you’re just reading data that’s an hour old, and again, you’re not getting your full backups and your log backups are going to fail.

So now that we understand the different things that we’re going to look at and the things that play an important part in availability groups, we’ll look at different ways you can actually look and see what’s going on with your availability groups. So, the first, and least favorite, way is the SSMS AG dashboard. It’s my least favorite way because you’ve got to go onto each server to look at it. It’s nice and pretty though, so I give Microsoft credit for that. But it does show the status of your cluster, it shows the stats for each node, which are customizable. You can actually pick different columns for each one. It will list every database and show you different stats for each one. And by default, that view refreshes every 30 seconds, and there’s actually a way to change that. and even in that screen, you can actually initiate a failover. You don’t actually have to know the commands to do it, you can just hit a button and run through the wizard.

So, we’ll take a quick look at what the dashboard looks like and how you can add columns to it. So, I have my Hoosiers here, because I created this during March Madness time and the Hoosiers were number one in NIT and not in a tournament – but that’s okay, they’re still number one. We have AG one and AG two, and this is the default view they give you. It doesn’t really tell you much. So, if someone was – in my situation right now, everything looks good. We’ve got a healthy AG, we’ve got a normal quorum and we’ve got a couple of databases in the AG. Over here on the right-hand side, you can actually add columns, such as your estimated recover time or your synchronization performance or your estimated data loss. And you can see, hey I might be out of RPOs and RTOs.

Erik Darling: Is that ever wrong?

Tracy Boggiano: It can be. The thing that can be the most wrong is your log send rate and your redo rates. They don’t actually refresh except for when it’s sending data. Microsoft does not do that because it would just be constantly talking. So, I’ve actually got a query for that later that gives you a better view of your log send rate and your redo rate.

Erik Darling: very cool. Can I ask you a question real quick from one of the audience members? Jay Griffiths asks, he wants know, “If the add columns link is only in 2016?” Because he doesn’t think he’s seen it in 2012. You can right click and do it, but the link might be new.

Tracy Boggiano: I’m not 100% sure. It’s four years since I used 2012.

Erik Darling: I wish everyone could say that.

Tracy Boggiano: I’m pretty sure you can customize the columns, even going back to 2012 though. They may not have as many columns there, but those couple of columns that we were talking about, the log send queue and redo queue size are here. And of course, they had the rates as well, as you can tell. I’m not doing anything and it has numbers in it, and those are the numbers that Microsoft doesn’t update constantly because it would just constantly be communicating back and forth and causing extra network traffic. So, I’ve actually talked to the programmer on the team for that and they told me they’re not updating those constantly. So, there’s a better way to get those particular numbers. And you’re generally not going to monitor this way because this is just one machine, and how many of us have one AG? Hopefully not many, you know…

Erik Darling: I’m a consultant; I have exactly one.

Tracy Boggiano: Well you can stick to looking at this view then. For us people with 100 and some servers in production, with maybe 100 something AGs, you know, we might want a different way to look at stuff. So, we’ll take a look at looking at multiple servers. There’s another way you can actually customize the dashboard too as well. So, if you use SQL policies at all, you can actually set up your RPO and RTO as policies. And it will change those green arrows on the side to red arrows if you’re outside of your RPOs and your RTOs based on your customized policies. So, let’s go over how to set up policies in SQL Server.

So, the first thing you want to set up is a condition for is it the primary server or not. And you’ll get that from availability groups – and I’ve actually got all this in a script, so when you guys get the – you can download this off my site or from wherever Brent puts the scripts.

Erik Darling: Yeah, we can put it up either on the – with the GroupBy video, if you want to send us a link to wherever you host it. Or we can, I think, host it locally if you want. Whatever works best for you.

Tracy Boggiano: Either way, I have it uploaded on my site as well. I did this presentation a few times.

Erik Darling: I can tell. It’s nice – it looks nice.

Tracy Boggiano: And then you’ll set up one for RPO, and we’re going to set up a really unrealistic one so that we actually get some data back, hopefully, when we do the demo – when we do that on a database replica state. We’re going to look at estimated recovery time and we’re going to set it to less than one second.

Erik Darling: That’s only unrealistic until you see the price tag. Everyone wants it until they see how much it costs.

Tracy Boggiano: Now that our conditions are set up, we can actually set up policies. And they have to be named a certain way in order to work; so, I’ve pre-copied that over to the side so we can find them easier. I’m not seeing the is primary that I just set up… Crap.

Erik Darling: Well it looks like it’s there. You’re wowing me with the policy thing; I’ve never touched it before.

Tracy Boggiano: I’m going to cheat and use the script.

Erik Darling: Go ahead. You know what, use the scripts anyway because everyone’s going to use them.

Tracy Boggiano: Yeah, it’s easier that way anyway; this has got a bunch of XML stuff in it, it’s like…

Erik Darling: I hear you. Whenever I work with extended events I feel the same way about XML.

Tracy Boggiano: That’s the way I feel about extended events too, so… So, it’s re-set up our conditions we already set up, and we need to enable this one and I want it to run as fast as possible; every ten seconds…

Erik Darling: Oh, I wish I had that button for queries.

Tracy Boggiano: That’s my name for it, as fast as possible. I put everybody else’s on 30 minutes so this other thing doesn’t run as fast as possible. Now we have custom RTOs. So, when you go under tools and options in your studio manager, there’s an Aalways On option, and you can enable your Always On policy. And this is also where you can tell the dashboard to update more than every 30 seconds. So, we will need that for our demo, because otherwise, we’re going to not see much of anything happen. So, we set up our policies, we’re going to let the dashboard update every second. We’re going to go back and look at our dashboard and we’re going to run a script that Jonathan Kehayias wrote that adds a lot of data to Adventure Works. See, it just makes it really big. You can download it from his site.

And what I’ve done on my second AG is I only gave it 256MB of memory…

Erik Darling: That’s just cruel.

Tracy Boggiano: So that we can see numbers over here, and I’m hoping this will turn red. Sometimes it turns red, sometimes it doesn’t. But we can see our numbers turning over to the side. It’s processing data. I’m hoping it’s going to fall far enough behind to turn the numbers red. I didn’t give it enough memory to keep up – there we go, we turned red up here, we turned yellow down here. We’re all kinds of behind because our server doesn’t have enough memory.

Erik Darling: the AG bomb.

Tracy Boggiano: the AG bomb. So that’s what happens when you set up your SQL policies and you don’t give your server enough memory. And those are completely customizable. There’s a link at the end that shows you exactly how to set them up, plus you have my scripts if you don’t know how to click through and follow the demonstrations online. Some other ways you can get alerts, of course, are SQL agent alerts. Everybody, I hope, already has the normal alerts for sev 19 through 25. If not, go find Glenn Berry’s scripts that will create those for you. Otherwise, we’ve got these nice five alerts here for when the role changes, for when your data movement gets suspended or resumed. I wouldn’t page when resumed though, because that would be silly to get woke up for. If the AG goes offline or if the AG is not ready for automatic failover, which we would probably have gotten an alert for, given what we just did to our poor little AG.

So, we will go in and failover our server and see if we can see the counts change on our alerts. I think I deleted all my alerts earlier, so we can set up one new. Alerts are pretty easy to set up. So, 1480 is our role change alert, and you can tell it to notify me on the pager…

Erik Darling: The pager?

Tracy Boggiano: The pager, yeah, it fails over. And we’ll notice, on our alert, in our history section, it’s never been sent. So now we’ve got to see if we can recover our Ag, give it some memory so we can failover. It’s usually recovered by this point, even though I’ve given it so little memory.

Erik Darling: You must have done something extra special to it today.

Tracy Boggiano: I did something extra special today, yes. And I just locked out my studio manager.

Erik Darling: I like when you get that little notification. It’s like, SSMS is busy right now, please call back. What?

Tracy Boggiano: Alright, you can have 3GB of memory now.

Erik Darling: You’re so gracious. It costs what, like $4?

Tracy Boggiano: All the alerts that you’d want to set out, I have set up in a script. So, we’ve got one for data movement and if it’s offline. Then you can just go in and customize your actual response time for it. So, those will all be set up for you, and you can just customize whether you want to be paged or not or just emailed or whatever, from your history.

Erik Darling: Send it all to your junk folder.

Tracy Boggiano: Yeah, send it all to your junk folder. I have a junk folder for all the alerts. Everybody has one of those; the best place for them, the junk folder.

It’s already sent out one that’s saying it’s not available for automatic failover, saying we should have set the alerts up first. See now it’s actually processing data.

Erik Darling: It’s finally starting to sober up.

Tracy Boggiano: Well we will come back and failover in a minute – I killed it. We will definitely come back and see that in a minute. Usually, I don’t get it to turn red, so I went lower than 1GB this time and it worked, I just have to do that demo last.

Erik Darling: Unfortunately, I see customer and client setups with almost the same amount of memory and I’m like, why are you here?

Tracy Boggiano: My last company, our biggest machine had like 16GB of memory, total.

Erik Darling: What? Was it your laptop?

Tracy Boggiano: No, my laptop only had like 3GB, it was ridiculous.

Erik Darling: Sounds like they got lost in time a little bit.

Tracy Boggiano: We were, we were in the dark ages.

So, some other ways of monitoring, of course, are DMVs. And one thing you want to keep an eye on is your FileShare witness, to make sure it’s still up. Because you’re using availability groups, it has the ability to fix pages for you automatically. So, there’s a table in your MSDB for automatic page repairs that keeps up with your suspect pages for HA. So, you’ll want to query that table and keep an eye on it, just in case there’s something going on with your I/O subsystem and it’s just constantly fixing pages and you don’t know about it. And, as we discussed before, we want to keep up with our log send and redo queues. And you can look at the DMVs for that, but that’s also where the dashboard’s pulling from. So, we’ll look at a different way of looking at that from the perfmon counter ways. So, I’ll give you an actual view of that until Microsoft comes up with a non-noisy way to do that.

Let’s see if it recovered yet. It recovered, so now we can actually failover. Yes, I typed in a demo…

Erik Darling: You got it right, so I don’t think anyone will hold it against you.

Tracy Boggiano: It failed over, so we’ll notice, when we go over to our alerts, what I was going to show before is that we’ll see now that the number of occurrences has changed from zero to four. So that’s how many roles changes it actually went through just to failover. So it goes from primary to resolving, to normal, to secondary. It didn’t just do one to failover.

Erik Darling: Like the stages of grief.

Tracy Boggiano: Yeah, you can’t just do one… So, we’re going to fail back over to our primary server. We can’t failover yet, it’s not done yet. But that’s okay, we don’t need to, to do our next demo anyway. So, we’ll just open up our DMV and run it on 02. So, there is HADR cluster members that you can look at the type for FileShare witness and actually check on your FileShare witness to make sure it’s up. And mine’s on my domain controller, so it’s up. That’s an extra way to make sure your domain controller’s available.

Again, the auto page repair DMV that I couldn’t remember a minute ago, so you can have a check on this every hour to see if anything’s been automatically fixed for you. Now we need to failover because you can only run this on the secondary to see stuff. Actually, we just need to change servers. Back to one, because that’s where the secondary is – and this is the same basic query that you get out of the dashboard. I just shortened it down to the numbers I cared about, and these are the numbers where it’s not doing anything. And you could add database name and any numbers that you wanted in there…

Erik Darling: I think what a lot of people are into when they start looking at these DMVs and trying to figure out if their AG is healthy or not is what’s normal in these. So, what would you suggest for baseline-ing this kind of information, if anything?

Tracy Boggiano: I would suggest like capturing the information for like a week or so, and then graphing it out and seeing what your normal is and going from there, because everybody’s normal is going to be different. One database is even going to be different than another database in an AG.

Erik Darling: Good point.

Tracy Boggiano: And at 2 am when you’re doing index rebuilds is going to be different than what you do at 2 pm.

Erik Darling: You hope, anyway. Weird things happen.

Tracy Boggiano: Weird things happen, yes. At my company now, a lot of weird things happen sometimes. But there’s actually a better query than this one that gives you a little bit more information, that I’ll be showing you when we get to perfmon counters, that you can use that give you a little bit more information. Like I said, these two counters here, the log send rate and redo rate don’t refresh unless it’s actually sending data. So, they’re not a good metric to go by if you’re wanting to know how fast things are moving.

So now we’re back on our primary server. Those are the basics of the DMVs, and our dashboard’s saying, no you failed back over again, what are you doing? Why are you constantly failing over? So, this is my preferred method of checking things, with perfmon counters. Your log send and redo queues are available there, and your rates are actually accurate. They’re not accurate in the DMVs, as we’ve now mentioned twice. And Microsoft is aware of that and they have no plans to fix it, because like I said, it would be very noisy for them to constantly be pinging the other server to see if it’s doing something when nothing’s happening. So, they only update the DMVs when they’re actually sending data. So, when they’re not sending data, they don’t update the rates. So that’s just how it’s going to be.

So, we’re going to look at a perfmon query, and my preferred way of monitoring stuff is to be able to go across multiple servers. So, I’ve actually got a PowerShell command let that I use to go across multiple servers in our environment to see what’s happening. So, we’ll look at both of those.

Erik Darling: Now, is that PowerShell command let from DBATools, or have you considered sticking it in DBATools?

Tracy Boggiano: I haven’t. I could.

Erik Darling: No pressure, I got nothing to do with…

Tracy Boggiano: No big deal, I wouldn’t mind putting it in DBATools.

Erik Darling: I’m clearly not a PowerShell person, so anything you do is up to you.

Tracy Boggiano: I think they have one that goes against central management server, so it would just be a matter of adding one that specifically checks for always on stuff. So, this is my query that I use for perfmon data. And you have to be on the secondary to actually look at queues. So, it basically captures three different perfmon counters, and it waits a second and catches them again so that you can get a diff. and then it combines the data all together and takes and does the math for you. And I’ve left the DMV numbers in here so that you can – if they ever decide to fix it, you can take out all the temp stuff and have a real query. And then I drop everything, just so that it’s clean.

Erik Darling: Now, one of our attendees has a question. They want to know if the PowerShell in the scripts you’re showing us are all online as part of the presentation?

Tracy Boggiano: Yes.

Erik Darling: Cool, there you go.

Tracy Boggiano: I will upload and update a slide deck to my website after the presentation, but if Brent uploads them somewhere, I will give them to him to upload as well.

Erik Darling: Cool, we might just use yours if they’re already up there.

Tracy Boggiano: I will blow up Adventure Works again so we can see some data. I don’t know how big it is; I’ll probably have to…

Erik Darling: You had me at blowing up adventure works.

Tracy Boggiano: I’m on the wrong server to blow it up. I’m on two right now and I need to be on one. So, while I blow that up I can run this query, then we can see a little bit of data. Our send rate for our DMV versus what’s in perfmon is different; a lot different. But we’ve actually got latencies in here and redo rates – so we’ve got redo rates on a database we haven’t touched, because we haven’t done anything to WorldWideImporters.

Erik Darling: Yet. Just drop it…

Tracy Boggiano: This is a primary query that I use to keep an eye on things. And I have this PowerShell script here. I’m not going to go over the details of this PowerShell script, but one of my co-workers wrote it, Mark Wilkinson. Hi, Mark. I think everybody at work’s watching. And it queries your CMS server based on a pattern, or you can stick in a text file that will query whatever servers are in your text file. So, I have my two servers in a text file, and if I query it right now, it’s just going to pop back with the version numbers of my servers.

Erik Darling: Very cool, it pops up; how do you do that? That is some good old-fashioned witchcraft.

Tracy Boggiano: [crosstalk]

Erik Darling: Get out of here.

Tracy Boggiano: So, we’re going to plug in our DMV one. So, you can basically plug in any query into here and query any server you want, for anything.

Erik Darling: Anything?

Tracy Boggiano: For anything, any query.

Erik Darling: So, if I wanted to run DBCC WRITEPAGE across all my databases?

Tracy Boggiano: You could do that.

Erik Darling: A very dangerous weapon you’ve crafted there.

Tracy Boggiano: But what I like about doing this, rather than doing it using SSMS and stuff, is now you can change your WHERE criteria here, if you wanted to, and you can sort.

Erik Darling: This is the kind of stuff that makes me like PowerShell. Everything else makes me hate it. I see this and I’m like oh cool, useful.

Tracy Boggiano: You can’ do this in studio manager. DMV writes are really high and our perfmon ones aren’t as high, but that’s the view you can look at. And it tells you which ones are primary, so it gives you all the information you need to know about the AG. So, if you knew all your AGs and had a them in a text file, you can just easily query them all and know that they’re healthy or not, and you can easily find out all kinds of information about any [inaudible] that you needed to.

Erik Darling: Does it tell you if they’re lonely?

Tracy Boggiano: Well, you know, if you write the query – see, it’s still blowing up Adventure Works – but if you wrote the query to tell you if it didn’t have a secondary and there was no availability group, yeah, you could write it to say yes, I’m lonely.

Erik Darling: Can you write it to play Now That We’ve Found Love when it comes back online? I would be into that. That would make me happy.

Tracy Boggiano: You could try that. I don’t know how you get it to play music. I might have to ask Mark about that…

Erik Darling: Open up YouTube, I guess.

Tracy Boggiano: He knows PowerShell better than I do, so…

This is my favorite way of replacing the SQL agent alerts. Because we have so many servers, I don’t like to get alerts across that many servers get sent to my email, so we have a monitoring system that we get our alerts into and we check that; then important stuff pages you. But we have a central server that takes care of that, so by default, when you create an Always On Availability Group, it creates an always extended events health session by default. And it tracks all your relevant events for you, and you can create it for the failovers and several other events as well.

So, I want to show you what extended events are actually in there and show you a few queries that are already written for it.

Erik Darling: That’s awesome.

Tracy Boggiano: Yeah, I learned how to shred some XML.

Erik Darling: They always say, you know, one of the biggest things missing from extended events is any real guidance on which events matter for what you want to look at, and availability groups are one of those things where it’s like, there’s so much stuff in there that says availability, I don’t know what I want. You can’t just have a session with all of them.

Tracy Boggiano: Yeah, well they create this one for you by default. There’s one I recommend and this is the one they create by default. So, they’ve got a few events up here, like when a replica state changes, the lease expires is when it can’t connect to the other server, I think, for 30 seconds or ten seconds; one of the two. If DDL happens, so like if someone removes a database, it will record an event for that. and then they have a whole slew of error numbers, and you’ll find it kind of funny that they have these numbers in here; they have nothing to do with availability groups. Everybody knows 823, 824 and…

Erik Darling: But not 825.

Tracy Boggiano: Yeah, they don’t have that one in there. It records automatically for you. One of the things I recommend you do is actually drop it and just go ahead and get all of your severity 19 ones in there. And I didn’t add 825, but go ahead and add your 825 in there. And then query all your servers from your management system from there and get all of your alerts in one place. But that’s by default what’s in the always on health session. It’s got a number of alerts that have nothing to do with always on that you would not normally set up that it tracks behind the scenes.

So, we’re going to go query out some of the ones we did set up as alerts, because we shredded some XML. So, we’re still connected to AG one, yes. This first one is that 1480 message that just tells you the role change from primary to secondary. So, we’re going to say, every time it changed to primary, we want to know it failed over to primary. So, this is going to tell you that one time that I failed back. And all I’m doing up at the top is getting every target for always on and reading at the end, and then I’m reading the previous state, current state, replica name and the group name, and the time for it. And then this is just more fun XML stuff. And this is the particular event we’re looking at. So, I did the XML shredding for you – and I can’t click and highlight very well because it’s big on the screen so everybody can see it.

So, this is all the times I have failed it over, evidently, today. See, I did it earlier today, my time’s not current with the current time. I practiced this a few times to make sure it would work…

Erik Darling: Don’t want demos failing in front of people, that’s the worst.

Tracy Boggiano: Then, if you want to query out your error numbers and your error messages that we customized and put in there, we’ve got the same thing. We’ve got our timestamp. This time we pull out our error number and our severity and what the message actually is. And we’re looking for the object error report instead of looking for that another event. And we can see every error that’s occurred on my system today that’s important. And you can use that to just centrally collect all your error messages from your servers from a third-party app, and just hit every server and pick up everything that’s happened in the last five minutes or last hour, last ten minutes; however often you want to be alerted on the different things happening on your server.

Erik Darling: Can I ask you a quick question, Miss I haven’t touched 2012 in four years? 2016 direct seeding, do you touch any of the direct seeding error issues that go on in there, or have you not messed with that yet?

Tracy Boggiano: I’ve messed with the direct seeding, but I do not touch the errors in this presentation.

Erik Darling: That’s fair because it’s a whole…

Tracy Boggiano: Yeah, because this presentation takes an hour to do usually, with questions, so that would…

Erik Darling: Not a lot of questions so far, which means you’re doing a good job.

Tracy Boggiano: Which means we might finish early, but…

Erik Darling: That’s alright, that means you get to start drinking early.

Tracy Boggiano: That’s okay… So as far as wait stats, the two most common ones I’ve seen, and I put “I’ve seen”, because other people may have seen some other ones, is your HADR sync commit, and that’s it actually committing your transactions to the other server. And write log, because it has to write to the disk, and that’s most of what it’s doing on your secondary anyway, is writing. So, if those two are popping up really high then you may need to take a look at some performance stats on your server and see what’s going on. You may not have enough memory.

Sometimes I’ve seen instances where your secondary needs more memory and more CPU than your primary to keep up.

Erik Darling: That sounds brutal.

Tracy Boggiano: Yeah, that’s just because you do a lot of stuff, so…

Erik Darling: So, what do you do when you failover? Then your primary’s out of luck because it has less stuff than your – you just keep swapping and growing? Oversize everything, right?

Tracy Boggiano: So, based on our wait stats – we’ve got to blow up Adventure Works again… I definitely need to replace it after this demo. This will be like my sixth time blowing it up today. I got this script off of Derek Hammer’s website, that he uses to keep up with HADR syncs. It’s a nice little script, if you don’t highlight just one word.

But he actually just tells you how much the sync lag is in milliseconds based on that wait type at the moment. So, in milliseconds, that wait type is causing us 216 milliseconds of waits. But if we were to put our memory back down…

Erik Darling: Will the query even run?

Tracy Boggiano: Well it’s running on the primary so – we do need to run it on the secondary. Will it even connect? We still might get a sync lag over here – see, it’s going up.

Erik Darling: You are just cruel to this thing. I feel bad – I was like, this is going to be a cool demo, No, you just beat the crap out of it. It’s going to walk away crying.

Tracy Boggiano: It’s just a secondary server, so you know. It will come back eventually. Depending on which server you run it on, it will tell you the lag time on that particular server. It tells you the commit time when you’re on the secondary and it tells you the commit time when you’re on the primary – what it is on the primary when you’re on the primary. But that’s the only thing I have on wait stats there, but those are the two most common ones I’ve seen. And then we are ready for questions.

Erik Darling: Alright, anyone? Questions, comments? Let’s see, there’s one over from Caroline. “When you are forcing your always on to fail due to memory, was there anything in SQL that tells you why it failed?”

Tracy Boggiano: No. That’s where you have to kind of monitor your system and know your system the best. You’re going to hopefully have something in place that’s like monitoring your buffer pools and things like that so that you can see that your buffer pool sank and you no longer had memory.

Erik Darling: Someone found a Baby Ruth in the buffer pool.

Brent Ozar: So wrong, so wrong. We should have taken a poll too, to see how many folks are using AGs these days. We have over 200 people attending, that’s a heck of a sign. AGs have really caught on.

Erik Darling: Let’s see, there is a question. Jay Griffiths asks, “Redo queue is serial processing, right? How can I increase the redo rate? Will more memory help redo rate significantly?”

Tracy Boggiano: Redo rates and the memory, it’s going to depend on how much threading it does on the other side on the machine. I would check on your CPU first because – it might depend on what version of SQL you’re running, because they did a lot of improvements in 2016 on the redo rates and how it works – yeah, they did the parallel redo rates on 2016. So, if you’re running before 2016, I’d recommend upgrading to fix that.

Erik Darling: I would recommend that too, in a hurry. JD wants to know if you can do this stuff when you’re not SA on the instances. Like what sort of privileges or permissions do you need to do the stuff you’re doing?

Tracy Boggiano: You need to view server state to check the DMVs.

Erik Darling: Let’s see, there are some questions from GoToMeeting as well, which is much smaller text, so I have to lean in. Pardon any bald spot or bugs you see on my head. “How do you test that Ag and AG monitoring is actually going to failover properly? Is unplugging the primary the best way?”

Tracy Boggiano: That’s one way you can do it.

Erik Darling: Just walk into the server room with a shotgun.

Tracy Boggiano: I’ve actually tested it that way before on a system at my last company, but I don’t like the idea of unplugging. I think if you can run the command and it works, that’s a valid test.

Erik Darling: Probably a bit safer on the hardware too.

Tracy Boggiano: It’s a little bit safer on the hardware, but you’ve just got to make sure everything’s redundant; that’s the main thing.

Erik Darling: I’m with you on that. Manuel wants to know if there is still a two-node limit with the basic availability groups for 2016.

Tracy Boggiano: Yes.

Erik Darling: Two nodes, no readable replica, one database…

Tracy Boggiano: One database, yes. It’s not even worth the hassle of setting up a cluster, if you ask me.

Erik Darling: Just use mirroring.

Tracy Boggiano: Just wait until 2017 and use clusterless AGs, that’s my opinion.

Erik Darling: No clusters, that’s nice.

Tracy Boggiano: Wait a few months, one or two.

Erik Darling: J.K. asks, do you see many unexplained failovers?

Brent Ozar: He couldn’t even keep a straight face.

Erik Darling: So, do you see many unexplained failovers?

Tracy Boggiano: Yes, I might have seen some in the last week. Even Brent had to leave after that question.

Erik Darling: It’s funny because it’s like a 500 level DBA joke where you’re like, yes, most are unexplained. JD wants to know how clusterless AGs are going to work. Hopefully well.

Tracy Boggiano: Hopefully well, that’s what I’m hoping for. I haven’t set one up yet, it’s something I’ve been meaning to make time for. Hopefully before PASS because I will be giving this presentation in a longer format at PASS.

Brent Ozar: Congratulations, that’s very cool.

Tracy Boggiano: Thank you.

Brent Ozar: He’s saying, “No no, I was asking a specific question. Not how well, just how are they going to work.”

Erik Darling: That’s not very specific. Being fair here.

Brent Ozar: There’s a replica over here and there’s another replica over here and then the data goes across…

Erik Darling: It’s like a long-distance relationship.

Tracy Boggiano: I assume it’s just going to work just like database mirroring works. I mean, you set up two end points and you send it across.

Brent Ozar: Is it the Einstein joke where he was asked how do you explain radio, and he’s like, well it’s like a cat, a very long cat. You pull the tail in one city and the cat meows in another city, only there’s no cat.

Erik Darling: JD, if you want more technical information, I would just check Books Online or wherever else people put fancy diagrams and lots of technical mumbo-jumbo about that; if you’re overwhelmingly interested.

Brent Ozar: Jamie has a really good question. Jamie says his server team is doing snapshots, probably VM snapshots for VSS snapshots. It causes high availability to go offline for a few seconds, this creates issues between his nodes and quorums that cause a failover; “How can I get around this?”

Tracy Boggiano: Tell them to stop.

Erik Darling: Yeah, so what I would do is look in the error log and do a search for frozen…

Tracy Boggiano: Yeah, for the messages and stuff too…

Erik Darling: So, if you see long gaps between I/O being frozen to take the snapshot and I/O being thawed, to let the snapshot finish, then you could possibly see why and you could maybe talk to your server team about maybe doing their snapshots a little bit smarter, rather than making your server fail every time they run. That’s a thumbs down too.

Brent Ozar: Brian says, “You could turn off snapshotting on the volume level in VMware.” Yeah, but they’re probably doing it on purpose though, they’re probably trying to back up that thing. Jamie asks, “Would extended events capture this?” You don’t need extended events, no, it’s in the error log. It’s right in just the regular SQL Server error log. It will say I/O is frozen. Total piece of cake.

Wow, now we’re starting to get the fun AG troubleshooting questions.

Erik Darling: This is not free tech support.

Brent Ozar: Free tech support… Anup says, “I have a few hundred databases in an AG, and in case one node goes down and comes up in seconds, the Ag goes to sort of a brain split where a few databases are recovered on one node and others are in a recover state. It happened two or three times, Microsoft cannot confirm that they have verified an AG with 100 DBs…” So, the test matrix for 2012 was like 100 databases, one AG, one replica. That was like the test cases for 2012, and then they used bigger test cases in 2014 and 2016.

Erik Darling: And then at DBA Days, Kendra did like endless databases in an AG, and that did okay until it sort of hit a thread pool wall.

Brent Ozar: Yeah, the more activity that you have inside there, the uglier that it gets. I don’t know that I would put a max number of – like I get nervous at around 100 or 200 databases, just because I’m like, how many of these are actually active at the same time? And how many replicas you have. But you can go higher than that, but when you say a few hundred databases, I’m already getting nervous.

Erik Darling: What I would want to do is start checking my server for any poison waits, especially thread pool. Because if you’re keeping a whole bunch of databases synchronized, you know, you could be running into some awkwardness there.

Brent Ozar: Anup says thread pool. Yes, you got it. Yes, that’s the one. He said he had exhausted his worker threads; yes, you win. No availability group for you. Well, you’ll want to pay much more close attention to the metrics that Tracy is monitoring, because your AG will go south fast.

Well thanks for the great session today, Tracy. Really appreciate it. It’s certainly a hot topic these days with everybody doing always on availability groups. Nice job and good luck at PASS too.

Tracy Boggiano: Alright, thank you.

The following two tabs change content below.
I am a Database Administrator for Broadvine. She has spent over 20 years in IT and has been using SQL Server since 1999 and is currently certified as a MCSE Data Platform. I cover all aspects of administration and deal heavily with performance tuning and disaster recovery. I a co-organizer of a Special Interest Group (SIG) dedicated to advanced database administration in our local user group TriPass. Before I worked full-time as a DBA I was formally a developer and network administrator. I also tinkered with databases in middle/high school to keep her sports card collection organized. I volunteer as a Guardian ad Litem in North Carolina advocating for abused and neglected foster children in court. This is my passion outside of SQL Server. The national organization for this program if you are interested is National CASA Organization.

Latest posts by Tracy Boggiano (see all)

, ,
Previous Post
Hacking SQL Server
Next Post
Networking Internals for the SQL Server Professional

1 Comment. Leave new

Solid abstract. I might consider breaking it up into 2-3 paragraphs to make it a little bit easy to scan, but it’s not really a big deal.


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.