DBAs and systems administrators – oh, and even developers who were “volun-told” – responsible for managing SQL Server FCIs and AGs.
SQL Server Failover Clustered Instances (FCI) and Availability Groups (AG) depend a lot on Windows Server Failover Clustering (WSFC). But when mission-critical databases go offline, bringing them back online could be a nerve-wracking experience.
This session will look at understanding how the underlying WSFC platform works to help us maintain availability of our databases. This is for DBAs and systems administrators who are responsible for administering SQL Server FCIs and AGs but are not so comfortable with figuring out why the databases went offline and what they can do about it. We’ll walk thru the troubleshooting process that I follow to quickly identify and resolve SQL Server FCI and AG availability issues.
Enjoy the Podcast?
Why I Want to Present This Session:
Far too often, I see DBAs and system administrators scrambling to figure out what caused a SQL Server Failover Clustered Instance (FCI) or Availability Group (AG) to go offline and how they can quickly bring it back online without violating service level agreements. This session highlights the simplified process of identifying and quickly resolving availability issues with SQL Server FCIs or AGs.
Session Transcript: Introductions
Brent Ozar: Alright, in this session at GroupBy, Edwin Sarmiento will be talking about the smart way to troubleshoot clusters and availability groups. So take it away, Edwin.
Edwin Sarmiento: Good afternoon, morning, evening, I don’t know where you guys are from. We’ve got 176 people on the session. 168 plus seven staff, good enough. Good afternoon. I’d like to say afternoon because I’m on the East coast, and it is my afternoon, after lunch. Welcome to this session, smart way of troubleshooting SQL Server outages, or – let me take that back. SQL Server clusters and availability group outages. We spent quite a while restructuring and modifying the title because we want to make it appealing and something that people would sign up for and register if they are a SQL Server DBA who have to work with failover clusters or availability groups, and they worry about outages.
My name is Edwin, I’m a Microsoft Certified Master, a Microsoft MVP. I’ve been doing failover clustering and availability groups ever since it started. Well, kind of in the failover cluster in the world, been working with Wolfpack since the good old days of NT 4.0, but I kind of hated it so I bailed out in 2000 and 2003 days, came back after 2008 was released because there were a lot of improvements in the clustering world. But I’ve been doing this for quite a while, so if you have any questions, I don’t have access to the Slack channel at the moment, I’m mainly working on my Mac here, so if you have any questions, you can either use the – of course, Anthony will be helping me out here on the questions. Use the chat, use the Slack channel, ask questions. This session is for you. We want to make this as valuable as possible, it is for you. So that’s my email, my blog, on Twitter, on LinkedIn as well, and again, feel free to provide feedback after the whole session. I’m not sure if I can take the whole two hours or an hour and a half, from my understanding, I know I’m supposed to take about an hour and a half, but I will do my best to keep it as short as I possibly can and make it as exciting as well, as I possibly can.
A Few Questions to Get Ready
Let me be perfectly honest. I am realistic. I am a realistic optimist, and I like using Murphy’s law when I deal with high availability solutions that anything can go wrong, anything that can go wrong will go wrong. And this is – I did say I’m an optimist, I’m just realistic about being an optimist in that in reality though, systems break, systems will go offline, systems will break because they’re made by humans who are not perfect. It doesn’t really matter if it’s AI or any type of system, but again, these are prone to errors and prone to breaks somewhere somehow, some time in the future. And so let me ask a few of you a series of questions as part of the preparation for this session.
Have you ever been called in to troubleshoot an online – whether it’s a failover clustered instance or availability groups, but felt clueless because you didn’t know what was going on, or you didn’t know what to do, or you didn’t know where to start? Have you ever been in this situation? Or have you ever wasted hours trying to bring a SQL Server failover clustered instance or availability groups online, only to realize that it has nothing to do with SQL Server? And when you think about it, when you – Harry was saying, unfortunately, yes, I was brought in, we’re trying to figure out what went wrong when a failover clustered instance or an availability group went offline.
Another question though. Have you ever confusingly deciphered the cluster error log to identify a one new server failover clustering issue because that’s what you were told to do? Think of these three questions before we move further. Have you ever been called in to bring a – again, whether it’s a FCI or AG, I’ll just use FCI and AG moving forward. A failover clustered instance or an availability group online. Wasted hours, trying to figure out what the problem is, or maybe have tried to even decipher the cluster error log because that’s what everybody told you to do.
Now, as I did mention earlier, like I said, I’m kind of a realistic optimist, and the way I approach this is I prepare, I overly prepare as much as I can. In fact, one of the reasons why I constantly try out new versions of whether it’s a Windows server operating system or a SQL Server version because I want to be ready in case I get called in to solve an outage or fix an availability issue for again, either a FCI or an AG, and here’s the thing though. Preparation is key, and I remembered a time – and you might be wondering what the heck this picture is all about, so [Diero] was saying yes, yes, and yes, unfortunately, yes, yes, and yes. This picture is one that I took when I went to SQLBits in the UK. Kind of crazy, it was my first time to drive on the other side of the road, and I felt like, no big deal, I’ve been driving ever since I was 16 years old, I know how to drive standard or stick shift, no big deal. But then again, you drive on the other side of the road and you feel like you’re a newbie driver. So the way I dealt with this was to really prepare mentally, physically, psychologically even, because I don’t know if the signs were the same, and whether or not how I would respond if I see a car on the other side of the road, and again, the real key here is preparation. And this is what this session is all about, is getting you ready because when that time comes, when – it’s the correct side of the road, as James was saying – when that time comes, when you are brought in to bring an FCI or a AG online, there is only one goal, right? And that goal is to bring it online as quickly as you possibly can.
Let me throw that out there as early as now. The main goal whenever you’re called in or you’re a DBA that got called in at three o clock in the morning, or you’re a consultant who got that phone call, and a potential customer or even an existing customer wants you to bring their FCI or AG online, focus on the goal, and that is to bring it online as quickly as you possibly can. And I kind of like to highlight the fact that there’s a reason why you’re running on top of a failover cluster, right? Because you want high availability, and every second that that database is offline, every second that FCI or AG is offline is technically pushing your service level agreement and recovery objectives out of the window. So again, we want to focus on the main goal. And in order for us to really understand this thing and you know, I recently just came back from a customer engagement, and we did some sort of a whiteboarding session and a design conversation around their failover cluster, and the way I approach this and the way I explain this is failover clustering is really complex. Very complex, very confusing to a point where when you know, a DBA or a systems admin would go and fix something on failover cluster, they get really confused sometimes just being you know, like a deer in a headlight where you’re frozen trying to think of what to do. And I wanted to simplify these concepts because understanding how failover clustering works, because understand that this is the foundation. Think of a building. Failover clustered instances; FCIs or AGs will run on top of a failover cluster. So we need to make sure that you understand the foundation, you need to make sure that you lay the groundwork and have a solid, reliable, highly available foundation, which is a failover cluster to Windows server failover cluster. And in order for us to do that, we need to understand how it works. So let’s understand a failover cluster in the context of a SQL Server workload. But you know, a little bit of a secret there, if you understand how failover clustering works, you could deal with Microsoft exchange, Hyper-V, a clustered file share, any workload runs on top of a Windows server failover cluster you can manage if you understand these concepts. And I like to use a simple analogy. The traffic light analogy. Now, of course, I did mention that I tried driving in the UK, on the right side, and I know somebody mentioned it is the right side of the road, and if you’ve been driving for quite a while, you know what a traffic light symbol stands for. And I am going to use these symbols, there are three symbols in a traffic light and there are three meanings, and I will use these to correlate how a Windows server failover cluster works. The red light, if you’re driving, you know for a fact that if you see a red light you got to stop. In the world of a Windows server failover cluster, this means a service is stopped. Now, keep in mind a failover cluster’s goal is to provide high availability. So when you think about a two node, a three, or four node cluster, the one node that is currently running the service suddenly becomes unavailable. Maybe it’s a power failure, a hardware failure, blue screen, whatever. The cluster will immediately stop the service from running on that node. So that’s the first part.
The second part, the yellow light there means wait. Now, of course, depending on where you come from, where you live, yellow might mean something else. From where I come from, yellow means go faster. Now, understand that in the context of a failover cluster, the cluster, because it managed to stop the service from one of the nodes that is currently running the workload has to wait, well, wait for what? Well, wait for any of the available nodes in the cluster can take over the workload. Wait until every other dependency is available so that the cluster can then proceed to start the service, and that is move the workload from the failed node to any of the standby nodes, and again, the waiting process is the one that we’re going to be talking about later in more detail. But again, understanding these three phases, the start, the stop, and the wait. I’m going to illustrate this further later on with a traditional two-node failover cluster because that’s what a lot of people understand, that’s what a lot of people are familiar with. So let’s go ahead and illustrate that with a traditional two node SQL Server failover cluster instance.
Now, the reason – like I said, the reason I’m using a traditional two node cluster is because there are a lot of two node cluster deployments out there. It’s what has been introduced even as early as Windows NT 4, so we have – like I said, I am going to use the traffic light analogy for this. I’ve got two nodes. The node on the left is currently running the workload. In our case, because we’re SQL Server professionals, we’ll use SQL Server here. But like I said, if you understand these concepts, it could be – the workload could be a clustered file share, it could be a scalar file server, it could be Hyper-V, it could be Microsoft exchange, it could be anything. The key here is understanding the concept. Like I said, in this case, we will be using SQL Server.
Now, at this point, SQL Server is happily serving client application requests, transactions are going through the database, and everyone is happy, the database is highly available. Something happens, so for instance, the server on the left becomes unavailable. You rebooted the machine because it’s part of your scheduled maintenance. You powered down the machine or maybe power failure. UPS did not get in, or maybe, I don’t know, maybe the server encountered a blue screen event. This initiate the stop service that we talked about earlier in the traffic light analogy. The service was stopped on the node on the left, and if you noticed, the application – let me track back, step back a bit, the application got disconnected as a side effect of that. Why is that? Because the service has been stopped. Well, think about that for a second. It’s just like rebooting SQL Server. Client applications will have to be disconnected until the SQL Server’s service is back online.
Now, the wait process happens, the wait process happens in between the time that the cluster has to decide, hey, which node and available cluster is capable enough to run this workload. Is it the node on the right, or if there’s more than two nodes, maybe the other nodes. And once it’s ready, the failover cluster will then move that workload off of the failed node, the one on the left, to the one of the right, and then start the service. And like I said, there goes your stop, wait and start service as we compare it with, you know, a traffic light. Now, when the service is back online on the node on the right, this is the time when the application can start reconnecting back to, in this case, the database, which kind of explains you know, a lot of questions I’ve been answering for the last couple of years about whether it’s failover cluster instances or availability groups. This explains why an application will get disconnected upon failover, and I use the work failover, but then I will again, talk about that in more detail, but that is one of the reasons – well, technically it’s the main reason why client applications will get disconnected when you perform a failover, or when the cluster performs an automatic failover. And like I said, regardless of whether you’re using the traditional SQL Server FCI with shared storage, or in this case, an availability group, the concept still applies.
But before we dive into dealing with failover cluster instances and availability group outages, we need to understand, like I said, the internals and what’s going on with the failover cluster. Well, let’s start with definitions. First is a resource. Now, a resource is any hardware service or an entity that is hosted on a cluster. I mention SQL Server. So from the point of you – of the failover cluster, SQL Server, the service, is considered to be a resource. Now, understanding this concept forms the basic building block of how a cluster works. You add a resource on top of a failover cluster, the failover cluster now starts managing that resource. We talk about shared drives for an FCI, we talk about network adaptors that you add into the failover cluster. It becomes a resource that the cluster manages. You add a network name, which is … or virtual IP address. That becomes a resource that the cluster manages.
Once you put that resource, or once you provide that resource to the failover cluster, the cluster automatically takes over the management and administration of that resource. Not you, which is also one of the reasons why in a failover cluster instance, the SQL Server’s service is configured to not automatically start on reboot, or on system startup. That’s because – and again, when you compare that with standalone instances and failover cluster instance, in a failover cluster instance, it is the failover cluster that decides whether or not to start or stop the SQL Server service. So we talk about resources and we talked about – we started talking about resource groups, and I like calling it resource groups because technically it’s a group of resources. It’s a combination of resources that are managed as a group or as a unit. Now, when you think about it, I like thinking about resource groups like services, utilities. Say, I am moving from one city to another. I need to call my cellphone service provider, I need to call my internet service provider and tell them the new address where I’ll be moving to. Why? Because I want the service to be moved as well, I want the bills to be forwarded or delivered to my new home address. Now, what happens if you don’t do that? What happens of course in a cell phone signal or cellphone service, I can go from, let’s say from Chicago to New York, of course, I’m not in the US, I can go from Chicago to New York. The service might work for a month until of course, the service will get disconnected because I’m not paying my bills. So technically, the service has to come or rather, the bills have to go with the service, and they also have to go with me. So I move, the cellphone service would move, the bills associated with those services have to move as well. And basically the concept behind how resource groups work, when you hand over a resource to the failover cluster, and you group those resources together, moving a resource that is part of a resource group requires moving all of those resources, kind of like moving the entire resource group.
Now, like I said, I still like calling it as a resource group, although Microsoft, I don’t know what they were thinking back in Windows server 2008, they started calling it services in applications. In Windows Server 2012, they started calling it roles, and that’s mainly because they focused so much on the development for Hyper-V roles running on top of a failover cluster. So they got a storage services running on a failover cluster, Hyper-V services, but instead of calling them services, Microsoft started calling them roles. I still like calling them as resource groups because technically, they’re a group of resources, and when we start looking at clustered resources in a failover cluster, let me just draw some parallels between a standalone instance and a failover cluster instance.
You see on the right, rather, on the left, is your typical standalone instance where you have a hostname, which of course, if it’s a default instance, ends up becoming the instance name for the SQL Server instance. The physical IP address, in order for me to find that instance in the network, I need to find a way to – I need to have a means to find that resource on the network, and that is done by being able to find that physical IP address. And then you have the service in itself. sqlservr.exe running as a service, and you have storage. Your database files, your log files, your system databases, your binaries, all of these sit on top of the underlying storage [inaudible].
Now, let’s draw some parallels between a standalone instance in this case, and a failover clustered instance. That logical boundary within the standalone instance is what I can consider as a group. What do you mean by that? Well, I cannot separate the hostname from the physical IP address. Well, I still don’t get it. Which means, if you assigned a hostname to a machine and you’ve assigned that physical IP address, should you decide to change the physical IP address, it will no longer be the old one, which may end up getting client applications redirected to a different machine. So say for instance, the physical IP address for this machine is let’s say, 192.168.1, change it to 192.168.0.2, if the client application did not know that there was an IP address change, then they might get redirected to some place else, or they may not be able to connect to the old IP address. Same thing. If I detach the underlying storage, my databases will not go online, my SQL Server service will not come online, and therefore, again, all of these form a part of what makes SQL Server work. In terms of the failover clustered instance, when you look at this, and again, you look at the logical boundaries for a failover cluster instance. You have your storage, you have your SQL Server service, you have your virtual IP address and your virtual server name. Now, why virtual, not physical? Well, for one, the hostnames of each individual machines are totally different. In order for the client applications to be redirected to where the service is currently running on top of, there has to be an abstraction layer, so for instance, I can call the database server, server one, but my hostname could be called server A, and the node on the right, server B. The client application doesn’t really need to know if the SQL Server service is running on server A or running on server B. All it needs to know is that it could connect to the database using the virtual server name. It’s a floating name that could move between any of the nodes within the failover cluster. Now, like I said, the logical boundary that spans all of these resources, the minute that I add a shared disk resource on top of a failover cluster, the cluster will now take ownership and management of that storage. Now, once I use that storage for a SQL Server failover cluster instance, I can then use that storage as a dependency for the SQL Server service, and we’ll talk about dependencies in a few minutes here, but understand that the concept of resource groups or roles is fundamental to how you would move resources around within the nodes of the failover cluster. So if I say for instance, I am running SQL Server on the left node, in this case, the node on the left, if I decide to move SQL Server from that node to the node on the right, I would have to move the entire resource group with it, which means the virtual server name has to come with it, the virtual IP address has to come with it, the storage ownership has to come with it, together with the SQL Server service.
I mentioned the term failover, and failover is simply moving a clustered resource from one node to another. We’ve highlighted this earlier by showing an example where the server might go offline and maybe because of power failure, and the failover is the cluster automatically moving the workload from the failed node to any of the nodes available. Now, there can be two types of failover. One is automatic, where I don’t have to do anything, the cluster will take care of moving the workload from the failed node to any of the nodes available within the cluster. Now, there can also be a time where you would initial a manual failover. Say you’re doing a maintenance task, maybe like installing security patches or updates. You could then initiate a manual failover from the node that you’re about to reboot to any of the available nodes within the failover cluster. And like I said, you could do manual, which means the administrator will be responsible for moving that workload from any of the nodes to any of the available nodes, and an automatic failover, where the cluster is the one making the decision when and how to failover the workload from the current active to any of the standby.
I mentioned the term dependency. This is when one service or let me use the term that’s appropriate for failover cluster, when one resource depends on another resource. And like I said, I’d like to use the analogy and concur a standalone instance versus a failover clustered instance, where you have, for instance, a SQL Server agent. Now, I started off with SQL Server agent because keep in mind, you cannot start the SQL Server agent until the SQL Server service is brought online. What I mean by that in terms of a standalone instance, I can’t start the SQL Server agent service before the SQL Server service is started. So there is a dependency. SQL Server agent is dependent on the SQL Server service, and this is no different from a failover cluster instance, where the SQL Server agent service has to wait until the SQL Server service is brought online before it can go online.
Similarly though, in terms of storage and virtual server name, the SQL Server service in this case has to wait for the virtual server name to come online, and – and I’m using the term and here – and, the storage, the shared disk where the master database, the user databases are stored. So both the virtual server name and the storage have to be online in order for SQL Server to come online. Now I did say, and I did mention the term and. There is an and logic dependency here, where if only the storage is online, or the virtual server name is online, SQL Server will not come online. They have to both be online, again, due to the and logic dependency, in order for the SQL Server service to come online.
The last item in the list here is the virtual IP address, which means the virtual server name will not come online until the virtual IP address first comes online. And like I said, when you look at this from the context of a resource group, you now start to look at, hey wait a second here, when I’m looking at an outage, maybe, maybe my SQL Server went offline, we can start looking at the different dependencies to see whether or not the outage was caused by something outside of SQL Server.
Now, I’m going to show you this later because looking at this diagram, I want you to really look at this diagram because this forms the foundation – again, I’m throwing in availability groups there because both failover clustered instances and availability groups rely heavily on the underlying Windows server failover cluster. Now, I’m also throwing a bit of a caveat here, it applies at up to SQL Server 2016. I’m not going to mention anything about SQL Server 2017 because the behavior is a bit different. We’re talking about outages here, I’m pretty sure that there are not a lot of SQL Server 2017 failover clusters and availability groups out there, and if there are, they would still be following the traditional way of deploying FCIs and AGs, which are highly dependent on a Windows server failover cluster.
Now, like I said, take a look at this diagram. Take five seconds, look at it carefully. And the reason I want you to have a look at this and like, picture this in your mind, is because this forms the foundation for how I do my troubleshooting. It forms the foundation of how I do my troubleshooting. Remember, this session is about being smart when you want to troubleshoot outages for either a failover clustered instance or an availability group. Understanding the dependencies, understanding resource groups is key to how you recover or bring an availability group or failover clustered instance online, which is kind of like a segway to how I do my troubleshooting process, because effective troubleshooting comes from knowing how it works.
Now, I’ve shown you how the whole failover cluster stack works. I only use SQL Server as an example because we’re SQL Server professionals, but when you think about it, I can manage a Microsoft exchange cluster, I can manage a Hyper-V cluster, knowing how the resource group, how the resource dependencies, how failover works within the context of failover cluster, I can bring a resource online and again, the goal is to bring those resources as quickly as you possibly can online.
Let me share with you my workload process. This is kind of like you know when you learn something and you learn it from experience, unfortunately you learn it the hard way, this is how I learned it because I’ve gone through several experiences dealing with outages and sometimes, I bungled on trying to figure out what the problem is, how do I start, where do I look at first, and it’s one of the reasons I asked you those three questions earlier. And this is me again, sharing my troubleshooting workflow, or my workflow process. Everything that I’m going to share with you today focuses on availability or outages. This is not about performance. If you’re thinking about, “How do I troubleshoot performance for an availability group?” That’s not part of this, although performance may impact availability. Same thing with security. We’re not trying to troubleshoot any security issue, I’m a bit paranoid as far as security is concerned, but again, security might affect availability, and so this is again, my kind of like my workflow process for troubleshooting outages.
I want to drive this as hard as I possibly can. I want to communicate it as hard as I – I want to convince people as hard as I can to focus on the goal. And again, the reason I share this is because I remember an experience, this was almost about 12 years ago, my immediate supervisor and I were called in as standby engineers. We were supposed to be just standby engineers for a storage migration. This is a large government finance institution in Singapore when I was living there, and the storage engineers were supposed to be taking care of everything. Like, everything. Eight hours on a Sunday morning, we started off – well, not we, but them, we were just standing by in case – and I like to emphasize the word in case something happens. The upgrade went well up to the point where they were trying to bring the SQL Server service online. A bit of context, this is Windows server 2000, SQL Server 2000, failover cluster. They’ve decided that they couldn’t fix the problem and it’s time to rollback. We’re running out of time and the maintenance window is about to close. They couldn’t bring the SQL Server failover cluster online and so they asked us for help, which is kind of ironic, the moment they handed over to us, they went away and start taking coffee breaks, which is kind of interesting because they were supposed to be the ones taking care of this. Turns out that when they restored the backup, there were invalid file paths, you know when you restore from backup and snapshots in the backup, you would see the squiggly line instead of C program files, Microsoft SQL Server, you see C program and then the squiggly and then the number one. We ended up hacking the registry to point the SQL Server service to the correct path of the SQL Server service .exc. The system databases, the log files, the tempdb files, but the reason I’m telling this story is because when we got called in to fix the issue, of course, in that case, it wasn’t really our problem. We were just there on standby, it was the responsibility of the storage spender. Our goal is to ignore everything else and to bring the system as quickly – online as quickly as I possibly can. And I kind of joke around at the fact that one of the storage engineers commented – made a comment about us just Googling for some answers, and I’m kind of like, “Okay, we kind of brought the service online in less than 20 minutes” using Google by the way, because we know exactly what to do, but these guys spent like three or four hours trying to fix the problem and they couldn’t fix it, which is why again, I wanted to hammer this and I wanted to communicate this as much as I possibly can to focus on the goal. Focus on the goal and the goal here is to bring the cluster, whether it’s a failover cluster instance or an availability group, as quickly as you possibly can. Forget about trying to find the root cause. RCAs should be done after everything is back to normal. And Justin was saying in the comments here, it says, “Just recently we had a power outage. After power outage was over, two nodes were up but not the cluster. Unable to remote desktop this up to either node. It’s a testing environment so it’s not a problem rebuilding it. How should I start troubleshooting if it’s production?” That is why you’re in this session, and I am going to walk you through that process now. Like I said, before I move further, I want to emphasize, do not do a root cause analysis. Don’t. Only do a RCA once everything is back in business. If you’re back to normal, business as usual, that’s the time you do a root cause analysis.
Now, when you encounter a problem or an issue, ask yourself this question. Is the failover cluster accessible? That’s the first step. Ask yourself this question, is it accessible. What I mean by that is if you do a ping or a telnet on the virtual network name for the failover cluster, for the hostname, or you know, connect using remote desktop, that’s the first thing that you need to do. And again, going back to – and I’m opening up the chat window … and in the question, John was saying, “How do we start troubleshooting if it’s a production issue?” This is the first question that you need to ask. Is it available?
Now, again, there’s a big difference between the hostnames or the nodes in the failover cluster, like in your case, all of the nodes are available, right? All of the nodes are available, but the cluster is not online. Well, we’ll answer that question after this, but the first question you need to ask is is the cluster available. So basic network connectivity doesn’t, Anthony, I’m pretty sure he can agree with me on this, basic ping and telnet test, ping the IP address. Of course, in a more restrictive environment where ICNP responses are not allowed, then telnet is the way to go. But start with this first. Always start with this first. Do a ping, do a telnet test, and then try logging into the failover cluster. Remote desktop, can you remote desktop into the server? Can you run PowerShell remotely? If you can’t do remote desktop. Can you log in using your KDM? Basic tests, don’t do anything else. Start with this.
Once you can guarantee that the cluster is accessible, then we start looking at what I call my five-step process in the workflow of how I do my troubleshooting. The first one is reviewing the dependency report, and this is huge. The reason I say this is huge is because it has saved me a ton of headaches, a ton of wasted hours trying to figure out what went wrong and how I can bring it online, just by looking at the dependency report. Again, there’s a reason why I went through the dependency hierarchy, the resource groups, and the dependencies. Look at which resource is offline and check if there are dependencies. For instance, if you’re dealing with a failover cluster instance, ask yourself this question when you look at the dependency report. Is the failover cluster instance network name online? If it is online, you’re good. But if it is not online, let’s say it’s offline, don’t spend your time trying to fix issues in terms of SQL Server. There’s nothing wrong with SQL Server. Start with the resource that everything else depends on. So for instance, if my disk is online, remember that dependency hierarchy? Remember that resource group that we talked about earlier? If both my network name and my disks are offline, there is no way that SQL Server will come online. That is the first step. In an availability group, of course, this only applies if you have a listener name. Keep in mind you don’t really need a listener name for an availability group. You can survive, it’s not a requirement. But in case you have a listener name for your availability group, look at the dependency report and see, is the virtual network name, in this case, is the listener name online. If it is not online, don’t expect your availability group to come online. Plain and simple. In fact, just by asking those questions, I don’t even have to dig into you know, the cluster error log or the SQL Server error log, and the way I deal with this is if it’s a virtual network name problem, I pick up the phone. I call the network guys, I call the systems guys, I call the DNS guys, the active directory guys. Why is that? In large organizations where teams are siloed and I can’t do much, I don’t want to spend time fixing a problem that has nothing to do with me. Again, my goal is to keep this troubleshooting process as short as I possibly can so we can bring the resources as quickly as we possibly can.
So again, start with the dependency report, and when you look at this, and again, this is just a review of what we’ve already talked about earlier. Have a look at this. If my virtual IP address becomes unavailable, immediately the virtual server name will come offline, will go offline regardless of what the issue is. Same thing, and this is a cascading effect. The SQL Server service will go offline and the SQL Server agent will also go offline. I call this the domino effect. You start from the top, and the top here refers to the virtual IP address. If that goes offline, good luck. Everything else will be taken offline. And this is something that you really need to pay attention to because again, even if there’s nothing wrong with SQL Server, even if the database engine is healthy, no CPU, no memory, no network, nothing, no problems at all, it is humming fine, it’s working as expected, but because of the concept behind resource groups, the virtual IP address becomes unavailable, it will take everything down with it.
And like I said, this forms the foundation of my entire troubleshooting methodology. I always start off with the dependency report, and I’m going to show you that in a few minutes. This is an example dependency report for a failover clustered instance. Graphically, visually, you would see that I need to get my virtual IP address online first before I can bring my network name online. Now, looking at that and icon there, it only means that in order for me to bring the database engine online, I need to have my network name and my shared storage or shared disk online. And we have another question, the listener Rahul is saying, “The listener in clustered resources can be pinged from the primary node but not from the secondary nodes, is that okay?” I’ll answer that later when we move into the availability groups. Here I am talking about failover cluster instances. Now, again, the clustered resources without any dependencies will be the first ones to come online. In this case, the only one that does not have any dependency is the virtual IP address. So imagine this. The cluster went offline, taking the failover cluster instance offline, even if you force start the SQL Server agent service, it has to wait for the virtual IP address to come online first. Then the virtual network name, then the SQL Server service, before the SQL Server agent can come online. Again, it’s like the domino effect. The SQL Server agent has to wait for all of those resources in the hierarchy before it can come online.
Now, I will be answering Rahul’s question in this case. In an availability group, like I said, you don’t really need a listener name for an availability group. You could survive, you could have an availability group without a listener name. It’s not mandatory. This is not like a failover cluster instance where you definitely need a virtual network name. Here, I can have an availability group without the listener name. Now, again, suppose you decide to have a listener name, you got to be careful because the minute that the listener name goes offline, even if the database’s and instance in the availability group are healthy, it will be taken offline. Now, side comment here about some time about 2013, I was brought into deploy an availability group for a SharePoint farm. Now, I worked with SharePoint a lot in the past, I don’t tell people about it, not that I’m ashamed of it, but I’ve worked with SharePoint in the past and this was the SharePoint 2013 time frame, where Microsoft has been pushing availability groups for SharePoint, which is kind of ironic because you would see more documentation about availability groups from the SharePoint product team, but not from the SQL Server product team, which is kind of like saying SharePoint actually financed a partial development of an availability group. But moving forward, they deployed – well, in my case, I deployed a two replica availability group, and because it’s a two replica availability group, it doesn’t make sense to use a shared disk as a witness, so we deployed a file share witness. Now, this is Windows server 2012, not R2. Not 2008, but 2012. While I was building the availability group, I say in the Windows event log that the file share witness kept disappearing. You see, one of those events in the Windows event log where a load of file share got disconnected, the file share became available, and I told them about it. I said, “Are you aware that your file share witness is unstable?” Like, “Yes, it’s fine, it’s hosted in a clustered file share, we’re taking care of it, we’re monitoring it.” Fine, it’s probably their way of getting me to shut up. And so a couple weeks after I go line, one of their engineers applied service packs on the standby node. This is the secondary replica. Now, keep in mind this is an availability group, which means there are two instances that are running the SQL Server service independent of each other. Only the availability group is hosted on a failover cluster.
Now, because they had to reboot the machine as part of the patch management process – I’m trying to slow down here a bit, I want to build the intensity.
Anthony Nocentino: Dramatic pause.
Edwin Sarmiento: Dramatic pause. They reboot the machine, rebooted the machine, all of a sudden the SharePoint farm went offline. The entire SharePoint farm went offline. Keep in mind, this is a high available SharePoint farm with a couple of servers for web front and a couple of servers for the middle tier and two replica availability group for their databases. Called me up, and started complaining. They blamed me for what happened to their availability group, and I told them, “I’m sorry to disappoint you, this is by design”, and they hated me for that. And I explained the reason why I said it is by design. I already told them about the file share. They lost the file share and I know that they’re going to be losing the file share if they don’t do something about it. I have two nodes in the failover cluster, plus the file share, I have three votes, and the concept of the quorum is – I still believe it is the most complex, most convoluted, most confusing, yet the most important concept in a failover cluster. They lose one voting member in the form of the file share, well, not really lost, but you know, kept disappearing and it kept coming back online, so technically you lose one vote. So you’re not down to two out of three votes. They rebooted the machine so you’re down to one out of three votes in quorum, regardless of whether the SQL Server databases are healthy, the cluster will take itself offline. And I explained it exactly like that, and ended the statement with, “It is by design”, and they hated me for it. And like I said – yes, go ahead.
Anthony Nocentino: This is in the 2012 days, right? So you’re on Windows 2012, right?
Edwin Sarmiento: Yes, it is in the days of Windows 2012, and it still happens up until Windows server 2016 because even with dynamic quorum and dynamic witness, you need to understand how it behaves because it doesn’t work like we would expect it to work in a traditional two-node failover cluster, and like I said, they hated me for it. And they tried to reason out that, “Hey, the SQL Server service was all running on both replicas.” Of course, you can do a select star from sys databases, you could do a select star from msdb.suspectpages, you will get responses and you will get results because only the availability group is running on top of the failover cluster, and because you no longer have majority of votes, the cluster has no other option but to take itself offline, and that’s one of the reasons why you really have to pay attention to this. And I am reading J.D.’s comment here, “SharePoint wants availability groups for redundancy of context.” Yes, it does, and there are some caveats on putting availability group – rather SharePoint databases on the availability group because there are things that are supported and there are things that are not supported, but moving on to…
Anthony Nocentino: I got a question at you from Slack. So there’s some commentary about using C names for listeners, and also C names for multi-subnet configurations. Do you want to – just basically, can you use it and is it available to use in multi-subnet configurations?
Edwin Sarmiento: Great, great question because you know, what’s really interesting, when I was working as a data center engineer, of course I worked for a large Japanese company that’s very well known in the IT space, we have standardized on naming conventions that we use DNS aliases, and the reason why we use DNS aliases is because we’re still humans. We make mistakes. So for instance, and I’ll use the SharePoint example here, we name our environments with certain formatting so that a user logging in, or using a client application to connect to it will know exactly what he is doing. So for instance, if it’s a SharePoint database in the dev environment, our DNS alias would be SharePoint-DB.dev.domain.com. What that means is that if I need to manage my databases, I have no excuse to tell people who are responsible for it, “Hey, I accidentally truncated the entire table in production”, because by doing so, you’re explicitly specifying in your connection string, SharePoint-DB.production.domain.com. So we’ve segmented our network to a point where traffic is different for management, for monitoring, for DR, for production, and again, it’s not just from a technology standpoint but from a human psychology standpoint. There will be no excuse for somebody to say, “I’ve deployed the wrong – the code in the wrong environment.” No. You explicitly type production in the connection string or in the remote desktop.
So going back to the question about C names, when you look at a listener name, and this is in conjunction with active directory integrated DNS zones, a listener name is nothing more than a DNS entry. A virtual server name is nothing more than a DNS entry. Before today’s availability groups, in database mirroring, what you ended up doing was using a failover partner connection string parameter in order for the failover to be automatic. Of course, your client application has to support that connection string parameter. What we did, we just used C names, or in our case we used DNS aliases. If a failover happens to DR, whether it’s within a single subnet or a multi-subnet cases, we just change the virtual IP address of the DNS alias so that the client application can connect to where that service is running on top of. So in a production setting, we’ll probably use .01 as the virtual IP address, or in a DR setting, we’ll just change it from 01 to 10. But the point is again, listener names are no different from C names. Of course, from a networking standpoint, I don’t want to dive deeper into this, Anthony had a really good presentation for GroupBy back in September where he talks about the different stack, the networking stack, but when you think of it from an abstraction point of view, as long as the client application can connect to the service, in this case, the listener name, whether you’re using a C name, a DNS alias, or just a listener name, you’ll be good. The way I look at it, I would rather have a standard naming convention and use that as a DNS alias to get redirected to the listener name. Why? Again, we’re still humans, we need some sort of structure so that when you know, troubleshooting, administration, auditing, or trying to do an upgrade or migration, it becomes easier for us to identify what this machine is for, what environment this machine sits on, and how we can access it. In fact, we never access our machines using hostnames. We have a policy in place that if somebody starts using the hostname, they get audited, and we track that – well, I actually wrote some scripts in order to track that information, but again, it’s all about us standardizing that information.
So going back to – let me open up the questions, J.D. was saying, “I can see the slides but not see you. Is it me?” [crosstalk]. We’ll check it out. And Rahul was saying, “So the listener and clustered resources can be pinged from the primary node but not from the secondary node?” So, this is a question that I want to answer back. Don’t try to ping the listener name from the hosts. Even if you can, technically your client applications will not be sitting on those nodes. It will be sitting outside. Of course, I’ve seen cases where customers will have the application within the same node, or within the same failover cluster, so you got like, four nodes in a failover cluster that client applications are also running on those four nodes. Not something that I would recommend, but it happens. Try to ping and telnet the listener and the cluster network name from outside of the failover cluster. Now, I’m guessing, I’m guessing this could be running either on a Microsoft Azure data center, an AWS data center, or a Google Cloud data center, which is why you could ping the listener from the primary, but not from the secondary.
Now, like I said, if you’re on the Cloud provider, the networking stack that I’m talking about does not necessarily work like they were on the Cloud. You need to understand how the Cloud deals with that. In fact, one of the reasons why on Microsoft Azure data centers, or even on Amazon, you cannot just simply use a virtual network name without using, for instance, the Azure internal load balancer, is because the network does not support gratuitous ARP requests, which is something that you know, you could do within your network, but not on the Cloud, and it’s not a limitation, it’s just how the Cloud is designed and built. So how you would deal with issues within your internal or your on-premises data center when you move to the Cloud may not exactly work as expected. Like I said, on-premises, you could ping the virtual network name or the listener name from any of the nodes and that’s fine, and the only reason, again, like I said, I assumed that you’re either on Microsoft Azure, Google Cloud, even Rackspace, or Amazon’s data center is because that is how it works. You can ping the virtual network name or in your case, Rahul, the listener name, from the primary node or the node currently running the availability group, but not on the other node, that’s how it works.
Anthony Nocentino: Yes, they confirmed that, it was coming from Azure.
Edwin Sarmiento: Yes, and like I said, that’s one of the reasons why I kind of waited until this portion before I answered that question because it just works differently. In fact, my biggest challenge when I first started working with Azure back in about three or four years ago was networking. I cannot unplug network adaptors, I cannot connect to network switches, I cannot define routes the way I would want to within on-premise data center, and again, totally different behavior.
So moving forward, in terms of the clustered dependency report, which is something that I will show you later, is that in an availability group, if you have a listener name and that listener name is unavailable, it doesn’t matter if your SQL Server service is online, if the databases are healthy, it doesn’t matter. Every single database in the availability group will be taken offline, hence why the SharePoint farm went offline because well, the SharePoint farm can log into the SQL Server instance, they just couldn’t connect to the databases. And again, looking at the dependency hierarchy, you have to wait until the virtual IP address and the virtual server name in this case, because you know, one is dependent on the other, before the availability group can come online.
Now, let me show you real quick how the dependency report looks like. Let’s first look at the dependency report for a failover clustered instance. So I have a SQL Server 2016 failover clustered instance and I’ve got like three shared drives, and I’ve got the virtual IP address, the virtual network name. So let’s look at the dependency report for the SQL Server failover cluster. Right-click, more actions, and show dependency report. I’m not going to do that, I’ve already opened up the dependency report, so here it is. Look at the dependency report real quick, and you would see that my SQL Server service which includes my databases, my system databases, everything, not just the service, depends so much on one, two, three, four, and what I mean by that is I won’t be able to start my SQL Server service if my physical drives G, U and F are offline, as well as my network name, my TDPRDSQLLS01 network name, which also means the amount of time it takes for me to bring my SQL Server service online is dependent on all of these four components, unless of course the virtual IP address, which is the fifth one, and you would see that, and I want you to pay attention to this. I am going to move and I wish I could slow down the video, but this is not a recording. I would move this and currently, it’s running on TDPRD022. Let me try failing that over, this is an example of a manual failover. I will move that to 23, and I want you to pay attention to how the SQL Server agent in this case, I want you to look at how the SQL Server agent behaves. I’m not going to speak faster, I’m just going to see how this thing would work and how the failover would work between 22 and 23. So I’m going to move that over to 23, click okay, the first thing and watch closely, the first thing that will go offline is the SQL Server agent. Why? Because it is on the lowest level of the dependency hierarchy. Well, not SQL Server agent, but rather the virtual IP address, so it is a sequence of going offline and going back online. So I’ll click on okay, did not complete. Interesting. Let’s see if I can move this to 25.
First, again, this is where I said I couldn’t really speak that fast. SQL Server agent first has to go offline and then the SQL Server service, then once everything is brought online, like the network name, the IP address is brought online, is only the time when the SQL Server service can be brought online, and the last one to be brought online is the SQL Server agent service. Of course, the last one, the SQL Server CIP role is the last one because it doesn’t really depend on anything but it’s just part of the resource group. And right now, it’s running on 025, and when you look at the dependency report again, when I moved – remember that traffic light analogy, you know, the stop, wait, and start. I stopped the service by moving it off of 22 and moving it to 25. So the stop process kicks in, took the SQL Server agent offline first, and then the SQL Server service, again, the agent first and then the service, and then any of the drives, F, U and G, and then the network name. And once all of these were brought offline, the virtual IP address was brought offline, and then when it had to bring the SQL Server service online on 25, the IP address first came online, before the drives, or in this case, before the network name goes online, and then the drives have to go online before the SQL Server service went online.
It’s interesting watching how this thing moves between the nodes, and let me do that one more time, let’s see if we can move it back to 22. Again, SQL Server agent went offline first, ignore the CIP here. See how that thing works, all of the drives are now online. The IP is the first thing that has to come online, then the network name is now coming online. Only when those four are online can SQL Server go back online. And again, this is how you would deal with looking at the dependency report because understanding that, you know that if I have a missing drive or if I have a virtual network name that’s not going online, I’m not going to spend time looking at a SQL Server issue, because it’s not going to be a SQL Server issue.
Anthony Nocentino: What’s something that can keep your network name from coming online? Like disks and IPs and things like that make sense. What would keep your computer name from coming online?
Edwin Sarmiento: Great question, and I have a story to tell about that. It’s interesting because this is actually part of the three-day training class that I’m doing with Brent in a couple weeks. Learned that the hard way when I figured, “Oh, it’s the network name.” We were in charge of a two node Windows server 2004 failover cluster for a sytrics VDI environment, this is Windows 2003, SQL 2000. Patch Tuesday – I don’t know what it is with patch Tuesdays. We had to patch all of our servers, this is a weekend, we asked for a maintenance window. And again, best practice, I start off with a standby node. Patch the standby node and wait until I can proceed with installing the patches on the current active node and then reboot. It’s a failover cluster, right? You reboot the current active, you see exactly what happened when I moved the workload, in this case the failover clustered instance from 22 to 25 and back. So I am just waiting for a maintenance window for approval that I can go ahead and reboot my current active node. I got the go signal, rebooted at node, why? Expecting everything to work. Five minutes later, I get an email alert from our monitoring system, the cluster is down. The cluster is down, isn’t that supposed to come online? Isn’t that how it’s supposed to work? Upon further investigation, one of our operations guys accidentally rebooted all of the domain controllers at the same time. We got three in production, one in DR, they all got rebooted at the same time. Now, this goes back to 2005 days, IBM machines where you got like, you know, multi – I forgot how the architecture was like for IBM service back in the days but I timed it, it usually takes an average of about 25 to 35 minutes to reboot those massively large machines. And I couldn’t get my failover cluster online. I don’t have any problems with quorum, all my nodes are back up and running. It’s just that the failover didn’t happen.
And the reason why it didn’t happen is because the virtual computer object – I talk about virtual server name, whether it’s the listener name or the virtual server name, it appears as a computer object in active directory. Now, this is no different from you trying to log into your computer and there are no available domain controllers to authenticate your account. You know, you see this, there’s no available domain controller or even look at it this way, if somebody accidentally deleted your account off the back of directory and you’re trying to log in, you won’t be able to log in. That computer account is an account. It has to be authenticated. Unfortunately, all of the domain controllers that need to authenticate that computer account were not available until after 45 minutes. It couldn’t failover.
Anthony Nocentino: Why have three domain controllers? Right?
Edwin Sarmiento: Exactly, and that’s also one of the reasons why you have multiple domain controllers and you have like, primary and secondary servers, is because for standby, but nobody in the operations team was – nobody was informed that you’re not supposed to be doing that. Again, lessons learned the hard way.
Anthony Nocentino: So we do have a question or comment in Slack, so a few seconds ago you said that you patched the active node first and that Windows server failed a cluster, is their argument that there’s more or less risk in doing the passive node first.
Edwin Sarmiento: I did say I started off with the standby or the secondary.
Anthony Nocentino: Okay.
Edwin Sarmiento: So best practice, you always start with the standby, or the secondary. It does become a bit more of a challenge if you have multiple actives at any given point in time, or multiple primary. So for instance, you have a multi-instance failover cluster, three nodes, three failover cluster instance, you have to decide how to move those resources around so that you can start – you can have a standby. So again, pardon me if I came off wrongly there, but the best practice is to start off with a standby always. So if you have a DR node in your failover cluster, start with that first. Of course, there is no sense for you to failover to the DR if it’s just for patching. You certainly don’t want your client applications to be in production and your databases to be in DR. Again, but part of the patching process, always start with the standby.
And same thing, let’s look at an availability group. In this case, I’m only showing you failover clustered instance with stick, let’s look at an availability group. Let’s look at an availability group where I have a – I’ll come back to that later, I can’t remember the exact server name, I’ve got tons of them running on my environment. So that is the first thing that I look at. Remember, you might be wondering, “How come this guy spent so much time on the dependency report?” Because again, I don’t want to be fixing an issue that has nothing to do with SQL Server, it is something else. I don’t want to fall into the trap of trying to fix a network problem when it’s solely out of my scope. Now, just by looking at the dependency report and let me just show you that real quick, if I take this thing – take this guy offline, take offline, notice what happens. I simply took the network name offline, but if you look closely, everything else that depends on it becomes or has automatically been taken offline.
Now, in this case, it doesn’t make sense for me to start looking at the SQL Server error logs, start connecting – I can’t do anything. The database engine is not available. So this is again, the reason why my first stop is the clustered dependency report. I know that the network name is unavailable, and there is not way for me to bring the SQL Server service online until this guy comes online. If that’s the case and I know it’s not a SQL Server problem, I will stop. Took me less than a minute to figure out that it isn’t a SQL Server problem, I’ll call the network guys, I’ll call the systems guys, I’ll call the DNS guys and have them figure it out. That’s why I always…
Anthony Nocentino: They’re just going to tell you they can ping it.
Edwin Sarmiento: Of course. It’s not a network problem, it’s a database problem. Again, that’s one of the reasons why my first stop has always been the clustered dependency report. Now, in the past, this wasn’t possible because it was only made available in Windows server 2008, where I could generate the clustered dependency report and look at the dependencies. Windows server 2003 and earlier did not have the clustered dependency report. At least now, I can look at the clustered dependency report and say, “Okay, there are the dependencies and then I will look at the failover cluster manager and just look at which one of those resources is offline”, and if it’s not the SQL Server engine, I will pick up the phone. Again, took me less than five minutes to figure that out. Log in, check, and if I’ve got my documentation in place, probably less than five minutes to figure that out. And now the system is back online, on 22.
So the next thing that I will do I if I know that it could be a SQL Server problem is I look at the Windows event log first. Now, there’s a question here, Breck was saying, “Do you ever set up a drive letter as a dependency?” Not a drive letter, but the drive. When you configure SQL Server, when you install the SQL Server binaries, in this case, for a failover clustered instance, and we’re talking about failover clustered instances, not availability group. In a failover clustered instance, when you go through the setup, it will mark the drives, the shared drives that you add to the failover cluster as dependencies, and let’s quickly go back to that environment. If you look at the SQL Server service and you look at the properties here, the dependencies, and this is exactly what you show visually on the dependency report. It’s dependent, the SQL Server service is dependent on the U, the F, and the G drives, plus the SQL Server virtual network name. That’s why you see this and logic dependency here. So I don’t define dependencies on a drive letter because I can have shared drives that are mounted as mount points, so I technically don’t have drive letters for those, but I do assign – or not me initially, but the SQL Server installation process will assign dependencies on the shared drives, not the drive letters though.
So for step two, this is where I look at the Windows event log. I look at in this case, where source equal to failover clustering and level equal to error. And the reason I look at this next after the dependency report is if I know it’s a storage issue, if I know I can fix it, because I’m also responsible for the underlying storage, I need to know what caused the problem. So I start looking at failover clustering, the source equal to failover clustering, and level equal to error. Now, me being the lazy guy I am, I use Windows PowerShell to parse through the Windows event log, and even the cluster error log. If you see an error pertaining to storage, check for any storage-related errors. Or if you see error pertaining to network, look for anything related to network.
Now, if I’ve already figured out that it’s not a network problem, it’s not a storage problem, it’s not an active directory or DNS problem, if I know it’s a SQL Server problem and I can’t bring it online – keep in mind, in a failover cluster instance, the cluster will not come online, unless all of the dependencies have been brought online. In an availability group, I can access the instances because the instances are totally independent of a failover cluster. In a standalone instance, only the databases that participate in the availability group will be in the failover cluster. In that case, I will open up the SQL Server error log, Textpad, Notepad, whatever text editor you want to use, specifically useful in a failover cluster instance where you can’t bring the SQL Server resource online. Key words, for failover cluster instances, I kind of treat it like a standalone instance. What do I mean by that? Remember that story that I mentioned earlier about trying to upgrade the SAN environment and then the restore process didn’t work. Because the SQL Server service could not find the correct path to the system databases, the user databases, the SQL Server error log, I have to, or we – I don’t want to take credit for everything. We had to manually hack the registry, not something that I would usually recommend, we needed to hack the registry to point the SQL Server service to the correct path, where the files are, where the database files are, the SQL Server error logs are, and so on and so forth. And again, that’s how I treat failover cluster instances, and because I can’t bring it online, I could not use SQL Server Management Studio to read the SQL Server error log, and that’s why I mentioned earlier, you look at reading the SQL Server error log using you know, your favorite text editor. Notepad, Textpad, command prompt PowerShell, name it. You look for again, missing file path, incorrect path definitions for the service, and so on and so on.
Now, if it is a corruption on the system databases, well, again, this is no different from treating this like a standalone instance. You restore from the master database a backup, you force start the SQL Server instance in single user mode, again, no different from a standalone instance. Now, if it is an availability group, I look for these two key words. Resolving, and failover. Both of these would probably look familiar to you. Error 983 for instance, unable to access the database because it’s [replica] role is resolving. The way I deal with this is again, because they are – SQL Server instance is independent of the failover cluster. I can easily fire up SQL Server Management Studio, do sp read error log, with a specific key word and look for resolving. Look for the time stamp and then either read up or down to figure out what caused this issue. Same thing with error 19406. I see a failover, like the replica state changed because of either a startup, a failover or a communication issue, or a cluster error. Again, I want to zoom in to the real problem while not getting dragged into the rabbit hole and trying to figure out what the root of the problem is. The goal while I’m doing this is to bring the cluster resources both the FCI and the availability group as quickly as I possibly can.
Now, didn’t I mention earlier that most people would start looking for the cluster error log? I don’t. Don’t get me wrong, I love reading the cluster error logs, especially if I’m in a meeting and the meeting is too boring, or if I’m in a presentation and I’m not getting out of it, I would crack open the cluster error log from one of my customers and start reading it. I love – I don’t want to say I love it. This is not my first option. In fact, it is certainly have to be the last one. The only time that I would crack open the cluster error log is when I’ve already looked at the dependency report, if I’ve already looked at the Windows event log, and I’ve looked at the SQL Server error log. This is my last option. Not for the first one. In fact, I tell my customers, the minute you see me open the cluster error log, you’re in big trouble because, and again, when you look at the failover, when you look at a resource group going offline, you would immediately see that it is either one of those resources not coming online or a much deeper problem. And that’s why I tell my customers, “Look, the minute that you see me open the cluster error log, there’s a big problem that we need to address here” and like I said, because it is not the first thing that I would look into, I don’t recommend you reading the cluster error log unless of course, you’re a geek like me or you can’t sleep at night or you know, you got nothing more important in your life to do so you have to read the cluster error log.
And so look for errors before and after events that lead to it. So when you would – there’s a couple ways to read cluster error log, of course, the very popular one is by using the Windows event log. Look for the diagnostics branch under failover clustering, or like I said, I’m a bit lazy, I use the cluster – get cluster log PowerShell command. Now, all of the events that are recorded in the log are in GMT. It becomes a bit more challenging when you’re dealing with a multi-subnet cluster where some of your nodes are let’s say, in the West coast, and you’ve got like, DR nodes in let’s say, South America, or even in the East coast. So you have to use either system time, or in this case local time, to display the output of the cluster error log in the local server time. So this example for instance, I use the cluster error log to display events that happened 40 minutes from the time I generate the log, and usually the way it works is that you know, my customer would encounter a problem, they call me up two days after, and I tell them, “Look, if you’re calling me two days after the cluster went down, you don’t need to call me. It means you don’t really care about high availability.”
But the point is if I want to narrow down what I want to look at in a cluster error log, I specifically generate only a fraction of what’s in there, to again, zoom into what the problem is. And again, I use the local time parameter so that I could use the information basing it off of the local server time. Now, since we’re already diving deep into the cluster error log, let me show you how you read the cluster error log. Again, be careful when you read the cluster error log because it might cause you to become so nauseous when you’re reading this.
The first few characters there pertains to the process ID. Now, if I’m – like I said, if this is a troubleshooting exercise and I need to fix the problem real quick, I fire up process monitor to find exactly what this process is. Now, of course, in our case, we have SQL Server, I’m pretty sure that this is the SQL Server database engine, with a corresponding process ID. And then I look for the thread ID, and the reason I look for the thread ID is because I want to see if this is a chain of events that lead to this. And if it’s a chain of events, the thread ID value would be the same across multiple events. The next one would be the date and time, again, depending on whether it’s the default UTC or it’s the system time. I usually just output the cluster error log using the local system time, and then the third – the fourth one will tell you whether or not it’s an info, a warning, an error, or a DBG. A DBG is a debug event output, which is not available by default unless you turn on verbose logging. The last, the next one will be the resource type. Is it a physical disk? Is it the virtual network name? Is it the SQL Server engine? The next one is the resource name; how you name that resource in a failover cluster. The last one, of course, is the description.
And let me show you real quick how I would – I do have a copy of the SQL Server error log, and again, not something that I would recommend to be the first stop when you are trying to troubleshoot issues in a failover cluster. Sample cluster log. The names of the innocent, the names have been changed to protect the innocent, is what I wanted to show you here. Process ID, thread ID, date and time, whether it’s an info, a resource, an error, or whatever, the network name, the name of the resource, and so on, and so forth. Now, the way I deal with this is I open this up in my favorite text editor, hit find, ERR is what I look for. Find next. See, I have an error here, it’s an error caused by the resource monitor in a failover cluster. Not something that I would pay attention to at this moment because I’m more concerned about SQL Server.
Now, you might be wondering, “Hey, this looks like a SQL Server availability group for SharePoint.” So I see SharePoint here, so I would look for ERR – I would just navigate through all of the error messages there until I find something that points me to a SQL Server resource offline and is reported as an error. And if you notice, I’m just going through every single one of these and looking for ERR. I don’t see anything in here that pertains to SQL Server resources, and that’s why I can go ahead and say, “There’s nothing wrong here, I could skip this out and move on and look for something else.” Like I said, this should be your last resort. Not the first one, but the last resort.
When you encounter a problem or an issue, remember the first question that we had to ask ourselves was, “Was the cluster available? Is it accessible? Can I ping? Can I telnet the cluster? Can I add a login to the cluster?” If the cluster is not available, your only option, and you don’t even have to think about this, which is one of the reasons why I said you need to start with your checklist, your processes. If the cluster is available, I can ping, I can telnet, then you can follow the four-step process that I just described. Now, if the cluster isn’t available, which means I can’t ping the virtual server name of the failover cluster, again, not the nodes, but the failover cluster, because the nodes could be online, but the cluster can be offline. Your only option is to force start the cluster without a quorum, and this happened exactly – the thing that I had to do for that customer, where I had to force start their cluster first before I can bring the availability group online. Keep in mind, the availability group depends so much on the failover cluster. Now, if I can’t bring the cluster online, there is nothing that I could do that would force me to bring the availability group online. So the first step that I had to do apart from checking if the virtual server name is responding, or I can log in; if the cluster is down and you could figure that out in less than five minutes, the first thing that I do after figuring out that the cluster is offline is to force start the cluster without quorum. In that case, because I lost two out of three votes in the quorum, the cluster went offline. Even though the nodes are available, I don’t have any choice but to force the cluster to start online. And I’ve written an article in SQL Tips, I think this is back in 2011 or 2012, where you could force start a failover cluster without quorum. You can do it using PowerShell, you can do it – I believe the failover cluster manager is now capable of doing this. I still don’t rely so much on the GUI, but this is how you would go about starting the cluster because you can’t bring the failover cluster instance or the availability group online until the cluster is first brought online. That’s the whole dependency hierarchy there.
Anthony Nocentino: So forcing the cluster to come online and forcing the quorum sounds pretty nasty. Are there any side effects to doing that? And on our production systems is there anything to worry about?
Edwin Sarmiento: Great question. This is a very easy thing to do. Unfortunately, if you are not familiar with how to do it, it’s just like, you know, trying to fix a car on your own when you barely driven a car. In fact, if you go to the link on this side, it’s a single liner. It’s a single line of cluster, .exe and PowerShell code. Now, there are some side effects though. There are some side effects because keep in mind, if the cluster is offline, there may be transaction log records on this that have not been rolled forward or have not been rolled back, depending on the state of the database when the cluster took it offline. So you have to be aware of the side effects, and the side effect of this is number one, potential data loss. I’m saying potential data loss because even if it’s a synchronous database availability group configuration, you would still be in a situation where there is potential data loss like this. Transactions are going on, primary setting transaction log records to secondary, but then all of a sudden even before the secondary could commit those records on this, the cluster went offline. When you bring it back online, the availability group’s replicas do not know the state of either the primary or the secondary. So you won’t know if the database has been synchronized, and that is, again, a side effects of this. My goal when I get called in to fix an availability group or failover cluster instance outage is to bring it as quickly as I possibly can and then if I can now take the availability group online, the first thing I do is take a backup. Take a backup real quick, and synchronize the database. Now, because I already have a backup, the cluster’s online, everyone will be happy because everything’s back to normal. But I have yet to do stuff like how much data is in the backup that we’re not committed to the database, so this is where disaster recovery or data recovery process that I have to work with the business users, identifying the records that were in the secondary but not in primary, or in primary but not in secondary, or maybe I can just resume the data synchronization process and let the secondary replica get caught up. But again, the side effect of this is number one, potential data loss. The second side effect of this is called partition in time. I wish I can dive deep into this, but I’ll give you some details on how you can register for my training class with Brent in a couple of weeks, but the idea behind partition in time is you made some change on one node and I’ll give you an example. Re-IP. Your network guys decided to restructure your whole network infrastructure and decided to – I know Anthony’s like, “Why did they do that?” Improper planning. You’re changing your virtual IP address for your listener name. It happens, and this is what I call the partition in time situation, where I am on the primary, I am modifying the listener name, and as I hit okay, boom, the cluster went off on it. Maybe for whatever reason, maybe I lost quorum, maybe the witness and the other node went offline. Now, the change is persisted on the primary, and we’re talking about the cluster here, not availability groups, where the change happened on the primary but not secondary. Now, when you force the cluster to go online, the question is, what’s the new value? Is it the old one or is it the new one? The cluster will then dictate what the value would be. If you bring the cluster online and your force start the cluster using the node that was running the primary, then you’re good. But that’s not always the case. In some cases you may have to force start the node or the cluster using the node that was acting as a secondary. Now, if you do that, the change that you just did will disappear. That is the second side effect of it, and again, I could dive deep into this and I could explain this in more detail, but we’re moving – we’re getting sidetracked from the troubleshooting aspect of this, but that is a really good question by the way. So that is when you don’t have any choice, the cluster is offline, and the only option you have is to bring the cluster and force start it without a quorum.
People laugh at me when I give them the fifth item in my five-step workflow process. If you can’t fix the problem, call Microsoft. Call Microsoft, and the way I walk my customers through this is that if you’re giving me four hours to fix this, and on the second hour I couldn’t, I’ll pick up the phone and call Microsoft because the last thing I want is to be the cause of blowing up the recovery objecting and service level agreement. Remember what I mentioned earlier. When there is an outage, if your FCI or your AG becomes unavailable, forget about your ego, forget about becoming a rock star, forget about being the superhero. My goal is to bring this online while meeting my recovery objectives and my service level agreements. If I can’t, I will call Microsoft. I’m passing the ball to them.
Now, the clock is still ticking at that point. Now, I’ve seen cases where customers are offline for hours and they could have called Microsoft earlier in the troubleshooting process and have still met the recovery objectives and service level agreements. I will walk you through the process, I don’t know how much time I still have, but it’s 4:11 eastern, do I have enough – do I continue, do I…
Anthony Nocentino: I wouldn’t mind if you kept going, I mean, everybody else is still here. But yes, Brent says it’s totally up to you, sir.
Edwin Sarmiento: Okay, so that’s about it so far, and I know I was assuming an hour and a half would be the time that I have, but here’s the thing though. Here’s – since everyone is here and I am going to open up my Slack channel.
Anthony Nocentino: Yes, you’re the last session today so you have – you can keep us for as long as you want to.
Edwin Sarmiento: So I will keep you as long as I want to, but here is what I am going to do. I will ask you – and by the way, before I go any further, like I said, in a couple of weeks from now, I will be doing my three-day training class with Brent, where I will dive deep into the conversation of the partition in time that I mentioned, the complexity, any of the confusing items in the quorum, how the failover clustering works, how all of these tie into managing, designing and implementing an availability group typology with a local high availability stretch clusters, Windows server 2016, all added new features, name it. There’s – in fact, this is actually the, I think Brent’s most popular training class so far. It sold out in the first week. In fact, the 21 in stock that you see there is for the second batch. It’s still on the 20th to the 22nd, but I think now we’re about 30 or 40 – reaching 40 for this particular training class. There is still some slots so if you want to really see this in action when I show you demos, go sign up, register. It’s on the 20th to the 22nd of December, so it’s a couple of weeks from now.
Anthony Nocentino: You still have time.
Edwin Sarmiento: Yes, you have time to convince your boss, you still have time to think about this, in fact, I would just tell people, if you’re deploying an availability group next month or next week, don’t even think about it, sign up for it because I kind of use the analogy with what happened to my leg. I certainly would not be comfortable being on the operating table with a surgeon that happens to be an intern. I certainly don’t want to be being cut open by somebody who just fresh out of medical school. I want somebody who had experience working with orthopedic surgery, and the irony behind that is we are letting – again, this is not to belittle DBAs, I know how hard you guys work, but this is mainly for management, where we are letting highly skilled database administrators to work with mission-critical databases but are not familiar with how the technology works.
So again, I strongly encourage you to sign up for this training class, we still have a couple more days to go before the class starts. But what you saw in this presentation is a compilation of bits and pieces of information that I share in the three-day boot camp, where like I said, I walk you through my troubleshooting process, but before I do that I walk you through the whole stack, how Windows server failover clustering works, how availability groups work, and eventually, when you go into troubleshooting outages you know exactly what you’re looking for. You’re not wasting time trying to figure out how do I bring this online, or what seems to be the problem. You know exactly what you need to look for. So again, sign up for that, and while I am – well, I can take about the next ten minutes, here is a challenge for everybody, and I am going to – I would open the floor for anybody to participate in this because I am going to show you an availability group. So I have a three replica availability group here, currently, the availability group is running off of TTPRD021. Everything is working fine, everything is – Josh is saying keep going, thanks Josh. Keep on preaching. And I will respond to some of these questions later, but let me show you this real quick, and I want everybody to participate, alright?
I have a working, healthy – as far as the dashboard is concerned, healthy availability group. I’ve got two applications going on, I will start my applications. Both of these applications are, you can call them read or reporting applications. They’re not in starting records, they’re merely reading records off of my database. And what I’ll do is I will run a batch file that would insert records into the database, and if you look closely at the client applications, notice that my transactions are now coming into the database, you would see changes happening, these three batch files are actually inserting records into the Northwind database. I like using the Northwind database because it’s old, and I’m old school. I’m kidding. It’s available anywhere. So notice that I am now inserting records in my database, and if you look at the ordered date column here, they’re all getting inserted today, 1:17pm Pacific, and so on and so on. So they’re all – everything is working fine. Everything is working fine. My reporting workload is working fine, my batch files are working fine, the ones that are inserting records in my databases, in my Northwind database are working fine. And here’s what I want you to do. I want you to participate in how I would troubleshoot potential issues in this availability group. And what I will do is – I fully support pubs, yes Erik. As the transactions are going, as queries are running on my availability group, I wanted to look at what could potentially be the problem. Let’s open up my dashboard, all of my batch files are still working fine, all working fine, orders are getting inserted, my reports are working fine, but I wanted to look at how – let me just close this real quick and reopen the dashboard, and see – wow, everything is still working fine, that wasn’t supposed to happen. Let me introduce so issues. Cause issues in there and see how this thing would behave, and as everyone is on either Slack or on the chat on GoToWebinar, watch how the availability group would respond as the issues are happening. Of course, you won’t see this immediately, it will take some time for the issues to happen, but let me see if I can do that one more time and reopen the dashboard and see if – nothing. There we go, now it’s working. Now, I want everyone to participate on this. Let me just restore – let me resize this a bit so that I can see. There we are. I want everyone to participate on this, everything is working fine. My reporting workloads are working fine, which means my primary replica is working fine. However, if I look at the dashboard, I see that PRD022 and 023 are unavailable. So 21 is my primary, 22 and 23 are my secondaries, failover mode are all synchronous commit, but only the primary is getting synchronized. Of course, it’s the only one available. If you look at my batch files are still working fine, I still am inserting records on the database using the batch files, otherwise these would get kicked out and this would get disconnected. My reporting workloads are working fine, and keep in mind, I’m using the listener name here, which means I am getting redirected to primary, which is generally TDPRD021. Same thing as this, everything is working as expected as far as the primary is concerned. Help me troubleshoot this issue.
And of course, two third nodes out, network maybe, reboot, I like the fact that Brent said reboot, because that’s the first thing that you do. You reboot. So tell me, what – upgrade to Linux. I already have Linux running. So Anthony was saying, can I ping it? Let’s try. Let’s try if we can ping this. I did share the five-step process to highlight what I would do in this case, so let’s try and ping those machines. By the way, I am on a client machine here, .200 is my client machine so let’s try this. Of course, I’m not going to ping TDPRD021 because it’s working fine. This is where my primary is sitting right now. So let’s do 21, let’s do 22, and I get a response, 23, and I also get a response. So it’s working fine. What else do I need to do? Net stat AN, well played, supporting pubs – so I can ping, I am getting responses from my ping and telnet. Again, watch this, I can do a ping on 22, I get a response. I get a ping on 23, I get a response on 23, so I know both nodes are up and running. Besides, as somebody would probably tell me, look the cluster’s down. Well, it’s not down. Why? Because my availability group is up and running, it’s working. So I am – break to Oracle, come on Erik, that is not something that I would do even though I used to be a Oracle DBA. So let’s try ping the listener. Now, keep in mind, Joe, so Joe suggested that I ping the listener name. My client applications won’t be able to connect if the listener name is offline, but I will take your advice. I will do a ping. Glad to be the last person in today’s GroupBy session. Ping TDPRDSQLCLS01. Not CLS, SQLAG01. TDPRDSQLAG01. And of course, like I said, this would get some responses because I can see my client applications can’t connect to the database, they’re inserting records, and because the listener name is – will always run on the primary, we’ve already figured out that the primary is working fine, it’s you know, getting responses. My primary is TDPRD021. So it’s not a primary problem, it’s not a primary replica problem because I am getting a response from the primary. In fact, because I’m using the listener name, I know my primary is working fine, I know my listener is working fine.
At this point, I’d see if I can connect to the servers that are down. I would then check cluster service on them. I would take that comment about the cluster service because number one, I’ve got three nodes. One, two, three. If the cluster service is down for 22 and 23, I am pretty sure that my availability group will not go online, but I would try that out. But remember when I mentioned earlier about the problem is that you’re down to a single point of failure and not – if you’re not copying, yes. So remember when I mentioned earlier about the first thing that I would usually do, and the first thing that I would usually do is network connectivity and Anthony already gave me some hints on what could possibly be wrong. Ping works fine. Let’s try 22 and this time, let’s just simply try telnet on – well, 1433 is the default port number that my client application’s used to connect to the database engine, but it’s port 5022 that my availability group uses to connect to the replica. So let’s use 5022. So that is the problem. And just by looking at this, I know it’s a networking problem. Well, not necessarily a networking problem, but it has something to do with the network because I am not getting a response from TDPRD022 on port 5022. Now, you might be wondering, why did I use 5022? Well, if I look at this, 022 and 023 are not synchronizing, so if they’re not synchronizing, there’s probably some issues with the network. Cannot connect. Let’s try 23. It’s also not connecting. So here’s the gist of what I did for this.
What I did for this was simply disabled or rather enabled the Windows firewall on these two machines so that 21 could not communicate to 22 on port 5022. And that’s one of the reasons why I started off by saying, is the cluster available? If the cluster is available, or of course, in that case, I would start of with ping and telnet, and see if I can log into those machines using either remote desktop or SQL Server Management Studio, or anything that I would use to verify if the cluster is online. Now, if I would go back and take a look at this, let me close the dashboard and reopen it back. Hopefully, the replicas have already caught up at this point, and it’s not responding. But the point that now I’ve tested port 5022 on both 22 and 23 are not available, let’s try that again. Let’s try with 22, and it’s still not working, so I’m just waiting until the firewall is disabled on both 22 and 23. Hopefully, that will not take longer than usual. Or let me just rerun the script, and the script is simply, again, enabling or in this case disabling, to disable in this case, disable the Windows firewall on 22 and 23.
So let’s try that one more time and see now I can connect to PRD22 and PRD 23 on port 5022 and if I still get my Management Studio working – why don’t I open a new instance of SQL Server Management Studio, and if I look at SQL Server Management Studio and look at the dashboard one more time, I should be able to see 22 and 23 synchronized after I have disabled the Windows firewall, and hopefully again my Management Studio would not screw me up on this. This one’s not responding. Disconnect to 21, and show the dashboard. By now, 22 and 23 have finally caught up, my transactions are still running, they’re still going on, my reports are still working fine, as expected, and my availability group is back in business. Give me a ping. Questions so far, and I do see a couple questions on the GoToWebinar panel, let me quickly flip back to this and just a quick promotion there.
A couple of questions on GoToWebinar chat. Do I ever – no I’ve already answered that question, Okay, if you force a cluster, and this is Jason’s question, if you force the cluster online, does that mess up the metadata information, which I’ve already explained, that you can end up in a partition in time situation where you may, you know, for an instance change the virtual IP address on the current primary and then all of a sudden the cluster went offline. When you bring it back online, or force it to go back online, you may end up with what is known as a partition in time where the changes only happen on one but not on the other. Now, if you force a cluster to go online on the secondary, because it is now the new authoritative source for the cluster configuration, which is the change in the virtual IP address for the listener name, you might lose that change. You kind of think of it as a rollback in your database. Couple more questions here, great presentation, thanks. What’s the connection error? I’m not quite sure what that means. I’m pretty sure this has something to do with the troubleshooting that we did earlier with the connection and the disk, is that the only nick on those servers to check the connection error from the dashboard services. It’s interesting that you’ve – it’s interesting to show how people would normally respond to issues like this, because it took me quite a while to come up with a very simple scenario that would introduce the concepts of how to properly do the troubleshooting and get the database. At that point, the databases are online, so it’s not an outage. Technically, it’s not an outage. The SQL Server resources are online, the availability groups is online, the databases are online because I can run my client applications, my transactions are working fine, so it is not an outage. It only becomes an issue when for instance, because the replicas are not synchronizing with the primary, my primary databases, my LDF files will start growing because the VLFs could not be marked inactive and therefore could not be truncated if a transaction log backup occurs. Couple more questions here, that’s about the questions so far on the GoToWebinar chat. Do we have any questions on the Slack channel?
Anthony Nocentino: I think we’re pretty good there. Mostly just folks joking around about your testing and stuff like that, your troubleshoot methodology.
Edwin Sarmiento: And again, I would strongly recommend if you would be deploying an availability group, if you have a project within the next couple of weeks or even next month, or even early next year, sign up for this. It will be in two or three weeks from now, three weeks I guess, and we still have slots for you to join in, and the last one that we did back in September was amazing. In fact, this is the reason why we’re doing it again. It’s highly in demand, very, very useful, you would see me exactly what you saw me do earlier, plus going deep into the concept of quorum. Erik was there last September on the session that we did and he could speak for himself on the quality and the value.
Erik Darling: I can. Buy his stuff. It’s a great course. All this stuff that you saw today and more is incredible. Edwin is a great speaker, even when he has a broken leg and he has to like lay down and take pain medicine. So next time around will be even better because he won’t be on like Xanax and Percocet.
Edwin Sarmiento: Sounds like one of those TV infomercials.
Erik Darling: I promise I’ll be sober.
Anthony Nocentino: Thank you, Edwin, this has been a fantastic presentation, and looking forward to your course in December, so it’s right around the corner, right? Not too far away.
Edwin Sarmiento: Yes, coming up. Yes, Absolutely.
Edwin M Sarmiento
Latest posts by Edwin M Sarmiento (see all)
- Leveraging Microsoft PowerShell for Managing SQL Server VMs on Amazon AWS - June 9, 2018
- Deploying SQL Server 2017 Always On Availability Groups on Linux - June 9, 2018
- Getting Started with Linux for the SQL Server DBA - June 8, 2018