DBAs who care about the quality of their work, with the world on their shoulders and no budget, who want to keep their jobs but also keep their jobs interesting.
You should know extremely basic PowerShell, such as what variables and loops are. But we’re also going to boost everything to the next level with templates that even world-class experts will appreciate.
Are you accountable for databases or database servers? Do you fix problems only to see them pop up again months or years later in slightly different scenarios? Have you been called at 2am and asked to “health check” hundreds of servers for an issue that nobody in the world has ever documented before and wished there was an easy mode to do that or even avoid the problem in the first place?
If so then Operational Validation can help you. In this session you’re going to get:
- A mindset on how to handle your day to day duties in a way that keeps you efficient, honest to yourself and to others, while covering your posterior.
- Free tools to fix things and make sure they stay fixed over time. Really free. As in you couldn’t pay for them even if you wanted to.
If you work with one instance you’re going to learn how to manage five hundred instances. And if you work with five hundred instances you’re going to learn how to have fun doing it.
We’re going to learn:
- Jenkins, a cool web application you can install on your workstation to provide orchestration and reports.
- How to integrate PowerShell with Jenkins and multi-thread efficiently in the background with PoshRSJob using ready-to-go templates so you don’t need to worry about how it works.
- Then we’re going to write some tests. They’re not going to be beautiful tests. We’re going to be testing some of the scummiest, most crapulent edge cases SQL Server has to offer. Please bring tissues because there will be tears. If you’ve ever said, “I don’t know what to test”, I’ll make you regret it.
- I’ll also discuss a few other PowerShell modules like CimSession and DbData and why you might want to use them for reliable and scalable tests.
By the end of the session you should be bouncing off the wall and reconsidering your life. I want you to start screaming about Operational Validation from the rooftops, and have the tools and know-how to quickly start on your own.
You’ll also get pretty cool web-based graphs like this:
Why I Want to Present This Session:
I’ve seen things no DBA should see, watched C-beams glitter in the dark near the Tannhauser gate, and hundreds of SQL Server patches die simultaneously. I work at scale. Microsoft tools work great one server at a time in a clean little demo, but I’m going to show you real world guerrilla warfare tools for use in the jungles of unreliable and unpredictable environments.
But don’t get me wrong – I’ve never felt the warm embrace of an MVP award. I don’t work for Microsoft or a well-known vendor. I’ve written hundreds of in-depth technical articles and have a good half decade of community contribution. Maybe getting my name out there will change that?
Cody Konior: Well, thanks everyone for coming, especially if it’s early in the morning for you, and that’s probably good because you should go to work and try some of this stuff after you’ve seen it, and get paid for it. If you’ve never heard of me before, that’s because I’m not very important. So let’s get on with the show.
What prompted this whole session was a post on Stack Exchange from last year where someone asked, hey, I’m not a DBA but we have 350 servers, how can we monitor all of those? And that kind of caught my eye because I look after 500 database servers and I have some pretty strong opinions on how you should monitor them all. To start off with, just hire 500 DBAs and you’ll be done for the day, but what’s more likely if you work at scale is you’re on a very small team and your boss will say just work 500 times harder, and if that happens you should probably kiss your family goodbye because you won’t see them for a little while.
But what’s it like to work at scale? Well, we do pretty much the same things that you do, but we try really hard to standardize it. So we use Ola Hallengren’s scripts, for example, for backups and maintenance, and we look after agent jobs and we look after security like dropping other sysadmins, and we apply a lot of patches. If you look after 500 modern SQL Servers and it’s about one CU a month, that’s about 4000 patches in the past year.
So we have to standardize these so that we can automate them more easily, and we automate so we can work with fewer people because we have guns to our heads and there’s mortgages to pay. But also, most of the stuff we automate is hell boring anyway. But you know, standardizing is kind of good even if you’re not working at scale. Even if you had one server, you should probably standardize it according to whatever best practices you’re comfortable with and change the server and stick to those standards. That’s how you become familiar with what’s good, what’s bad, what’s safe and unsafe, by doing it.
Standardizing helps avoid stupid problems in the first place. When incidents occur, you have fewer and fewer possible causes to worry about, which means wasting less troubleshooting time, getting things back online faster and as my friend, [inaudible], would say, it leads you towards known problems. Known problems are good because they often have known solutions.
Now, this all sounds fun, hey, I’m going to go standardize one server or ten servers, and you’re thinking of building a golden image. But the truth of being a DBA at scale, or not at scale, is that most of us don’t build a whole lot of servers for golden images. More likely, we’re inheriting legacy servers, almost always built by vendors or non-DBAs and full of random garbage settings. Choosing to standardize your environment isn’t about golden images or writing documents or build scripts. It means going out there onto servers which already work and changing things so that they’re the same as your other servers and meet your standards, which probably scares the hell out of you, or it might even sound a little mental.
Well, we all need to blame somebody so I’m going to blame this guy, stock photo consultant, who says every setting that you have depends on something else, usually, something that only they know and will sell to you, so you better be careful with it. Also, you should test everything, and if you don’t know everything a setting depends on and if you don’t know how to test everything about it, then you should do nothing because that’s being a safe and professional DBA, and DBAs are always safe.
Brent Ozar: I do just want to point out that he has a bald shaved head just like you do and a beard just like you, it’s terrifying, he’s like a younger version maybe or older version.
Cody Konior: No relation as far as I know. But personally I think DBAs have developed a bit of a fear culture, and it’s always an easy sell because something is always going wrong and so being cautious seems justified. Also, who wants to speak out against perfection culture? Because we all want to be seen as perfect DBAs, with perfect five-nines uptime. But at scale and not at scale, changing nothing, standardizing nothing, it’s a choice, which prevents you from improving anything. What exactly do I mean by that? Well, let’s imagine you start tomorrow at a job where you’re looking after 500 typical legacy inherited servers. First off, welcome to 500 low memory alerts, because nobody sets max server memory. I mean, I do, you do, everyone else on the call does, but some other people don’t. It’s one of those best practices that depends on a whole bunch of fiddly things, and so the people who set those servers up didn’t bother to do it and here you are with your alerts.
Some of those alerts will be important, but it’s impossible to work out which ones are. So you add it to your to-do list, 500 max server memory settings to investigate and fix. Probably something that requires you to think of a standard and then apply it, even though your servers are technically working completely fine. Next, you might go checking for backups because we all care about backups. Maybe you find 200 of them out of the 500 servers you have, and those 200 are going to local disk, which isn’t very good if an emergency happens. Fixing this kind of thing probably requires some standards and changes, and of course, they will all be encoded in our lovely maintenance plans, doing whatever random tick boxes someone felt like filling out that day in a UI, which is different on every server and probably not meaningful.
Now you know I’m going to say you should probably pick some standards and apply them out, but what if you didn’t? What if you looked at those maintenance plans and said hey, this is a little dangerous, I don’t want to standardize anything, I’m just going to fix them one by one. You are in for so much pain. Do you know there are bugs in SQL Server Management Studio, where if you edit a maintenance plan owned by someone who no longer exists on the server, it will disable all of your maintenance jobs without telling you? And that takes a long time to find out. Also, all the settings for the maintenance plans are buried in XML and how good are you at XML? Probably best to replace them with a script.
Finally, if you go looking for problems, you are going to find a wasteland. Failed clusters, failed database mirrors, failed log shipping, failed replication, failed jobs, failed CDC, every possible thing you can think of that is high availability, disaster recovery, will all be failed because people set them up and they never look at them again. It will make you cry every single time.
Now the reason I mention all of that, if you take notice on your first day at scale, you now have 2000 things on your list to standardize and nothing has even gone wrong yet. These are all extremely common issues, well known, well understood, with best practices anyone could fix. But things also tend to kind of tick along and work even when they’re broken, and you can break them worse by standardizing, but on the other hand, they don’t really have complex things to think about. I think you should probably pick a standard, deploy it in small segments, have a rollback plan and standardize them.
Now, hopefully you’re all rational and thinking these are all serious things, of course, I’d go change them eventually. You know, I’d go through change and release and everything else, but maybe there are some other settings and standards you probably don’t feel so comfortable with changing, and you know what? Those other settings you feel are so precious and complex and important are probably set incorrectly and changing all the time and you’ve never noticed because you’ve never really looked at your servers. What does that mean?
Let’s pick a simple best practice. Do you believe in MAXDOP? I mean, important, understood, documented, and if I were to ask you do you set MAXDOP properly on your servers, you’re probably going to say sure you do. Now, I want you to think about this. When did you last check MAXDOP on your servers? Because if you checked two years ago, that’s a long time for things to change. We’ve seen new servers built since then, did you use build scripts to build them? Because your build scripts could have broken and now your MAXDOP isn’t set correctly anymore on new servers. Or someone else built a server and they didn’t run your scripts to do it. How many legacy servers did you acquire? Because those almost certainly won’t have MAXDOP set properly. Could any person or software have changed MAXDOP without telling you? Because that’s never happened to a DBA before. Could someone have changed how many CPUs are defined in the VM config? Could they have even rolled back the VM to an earlier snapshot before you set MAXDOP? Those have never happened either. And which formula were you using, you know, when you started your job two years ago and the one now, was it a step by step calculation or was it something you were doing on the back of a napkin? Because it could have been different. Was MAXDOP reverted by a SQL patch or bug since you checked it?
Now, you would probably say a SQL patch will never touch MAXDOP, and I haven’t seen it, but I will tell you that an add remove repair on SQL Server will set some of your SP configure settings back to the default. Also, things like Management Studio have bugs. There’s a bug in the previous version of Management Studio, where if you edit an availability group, it drops your read-only routing list. So if we were talking about read-only routing and you were sure it was set correctly on your servers yesterday, if someone’s edited it, that will be gone. Now, let’s say you say yes to all of these questions. Your MAXDOP is really secure.
What if there’s a problem at 2 am and you need to tell someone else how to check MAXDOP is set correctly on your servers because something has happened, something that’s not your fault. Can you validate MAXDOP across your entire server fleet right now? Now, if you didn’t check these things today, how can you be so sure MAXDOP is set correctly and what’s your excuse if you turned out to be wrong? Because when you do check this kind of thing, it’s an oops situation, and I’ve gone through this many, many, many times. So many people, so many servers, so many settings. Every time something happens that nobody in the world could ever predict and it turns out that they’re wrong.
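A fleet-wide MAXDOP check like the one described above can be sketched in a few lines of PowerShell. This is a minimal sketch, not the speaker’s actual tooling: the server names and expected value are placeholders, and it assumes the SqlServer module’s `Invoke-Sqlcmd` is available.

```powershell
# Hypothetical sketch: validate MAXDOP across a fleet.
# Server names and the expected value are illustrative placeholders.
$servers  = 'SQL01', 'SQL02', 'SQL03'
$expected = 8

foreach ($server in $servers) {
    $row = Invoke-Sqlcmd -ServerInstance $server -Query "
        SELECT value_in_use
        FROM sys.configurations
        WHERE name = 'max degree of parallelism';"

    if ($row.value_in_use -ne $expected) {
        # Flag the drift; in a real workflow this would fail a test
        Write-Warning "$server MAXDOP is $($row.value_in_use), expected $expected"
    }
}
```

The point isn’t the query, which is trivial; it’s that the check runs on demand across every server instead of living in someone’s memory.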
So let’s talk about this. You should be setting standards, but most people don’t. You should be implementing those standards on your legacy servers but most people won’t. They’re too afraid to, and you should be verifying those standards are true at all times and almost nobody in the entirety of IT, your stack from SQL Server to Windows to infrastructure networks to SAN, none of them are actually validating anything. It’s all just scout’s honor. So if you don’t set standards, if you don’t implement standards, you don’t validate your standards and no one else is doing it either, do you have any idea what’s going on on your servers?
Now, I’m not judging you. When I started with my first 100 database servers, I had a very clear plan about how I was going to manage it. I was going to consolidate all of my information, analyze patterns, do really cool data science and start standardizing settings because I really believed in standardization. Now, it went well for a year, but the curse of success is that if you work for a company with 100 database servers, soon enough you’ll find yourself looking after 200 database servers. Again, not golden images, but random, undocumented trash that’s now your responsibility. Now, I had to go back to work and think everything that I had standardized over the last 12 months, what was it and how do I do it again? And before I could even finish, the curse hits again. If you look after 200 database servers, soon enough you’re looking after 300 database servers because you’re doing such a good job, right? And then 400 database servers, 500 database servers, and by now, no standard in the world matters, there’s no time left for standards, you’re not creating standards, you’re not implementing standards, the sheer volume of incidents will literally murder you and dance on your grave.
And that’s when this kind of workflow sets in, and this is probably something that’s familiar to everyone watching, where you spend 100% of your time keeping crap going and to some people, especially managers, this looks productive. But to me, it’s soul crushing because there’s nothing here which improves anything. Nothing that adds to your career value, nothing that makes you happy, nothing that reduces the rate of incidents so that you can finally have that bathroom break you’ve been promised. And this is the kind of cycle I was in when I read that Stack Exchange post last year, and I thought hey, what the hell am I doing, maybe there’s a better way of doing things.
And then blessing in disguise, an entire data center went down at 2 am on a weekend while I was on call and the monitoring system also died with it. I had 500 database servers to check and all I was doing was looking for anything out of the ordinary, and I say it like that because the fault was that on any of these 500 servers, C drive disappeared for a few minutes. What is the impact of that? No one knows, and there’s no way to tell whether that was enough to break SQL Server on them or not, but you couldn’t just reboot everything randomly because some of it is critical. So, I worked until early in the morning, writing out and testing ideas to determine whether each server was working properly or not, I did it in PowerShell and I survived, and I kept most of my sanity as well. But after the madness was over, it kind of tweaked to me that this kind of testing is useful, that maybe I could keep those tests and run them every day, just in case this ever happens again. And it did happen again.
I also thought maybe I should test other things, maybe every time an incident occurs and I know what caused it, I should write a test to pick that up and then add that to the list to run across all of my servers so that I’m the first person who knows about it and can fix it quickly. And maybe I shouldn’t just be checking for incidents; I have all of those standards that I want to implement as well, and I should put those into tests too. So in the weeks that followed, I had a fresh plan because I had my build standards, I had tons of incident knowledge and I had some basic tests that were working in PowerShell. But what I really wanted was a framework so that I could write all of that really quickly and to a standard, so that I could write it, and other people could write it, read it, edit it and understand it. And also, I wanted this stuff to be fast, because when you’re checking 500 servers and you have a hundred different things you want to check, that takes a little while, and I want it to be quick. No one’s going to be happy sitting there at 2 am waiting for my response to say whether the servers are working or not. But I also wanted a web UI to run it all.
I mean, this is getting a really big wish list right? I want a web based UI, so that I don’t have to type hundreds of PowerShell commands in the morning because that’s a risky business to be in. Also, maybe someone else could use the web UI to run this stuff, that would be great. That way, I don’t even need to get out of bed. And then I want to turn this into a day to day workflow where I’m not just going through that cycle of keeping crap going, but that I’m actually identifying incidents and picking them up and protecting against them in the future, as well as rolling in my standardization over time.
That’s when I came up with my version of operational validation, which looks a lot like what developers do with unit testing and continuous integration. The funny thing about best practices, build standards, and incident knowledge is that all three of them end in the kind of expectation that things look or act in a certain way, and so that’s what we’re going to use for the core of our workflow. When you come across an expectation you have about one of your servers, where you are absolutely 100% sure it’s true and that you don’t even need to check it, that is the time to write up a test for it. Not in a document, not in a database, but a test. Something that will reach out, touch that server and validate it as true or false. For example, when that ransomware worm was spreading a couple weeks ago and you expected for sure all of your servers were patched, that was the time to write a test to actually confirm it.
So you get your expectation, hey my servers are patched, you code up a test that will confirm the patch that you’re expecting is on each server in your environment, then you run it across all of your servers, validating that they’re all correct. However, surprise, every time you do this, you will discover you’re wrong. You are always wrong. The test will fail in places you never expected, and that’s a good thing because now you know your test is working. Now you can go out and standardize the servers, in this case, perhaps by applying the patches they need, and the test will start to pass. Here is where most people would just forget about it forever, but what you do is add it into Jenkins and it continues to test every server in your environment for the rest of your life. If any new servers are added, those also get tested, and that way we get coverage of our entire environment.
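The session doesn’t prescribe a test framework at this point, but one common way to express an expectation like “all my servers have this patch” is a Pester test. The KB number and server list below are placeholders for illustration.

```powershell
# Hedged sketch: an expectation written as a Pester (v4+) test.
# The KB number and server names are placeholders, not from the session.
$servers = 'SQL01', 'SQL02', 'SQL03'

Describe 'Security patch coverage' {
    foreach ($server in $servers) {
        It "$server has KB4012212 installed" {
            $hotfix = Get-HotFix -Id 'KB4012212' -ComputerName $server `
                -ErrorAction SilentlyContinue
            # A missing hotfix returns nothing, which fails the assertion
            $hotfix | Should -Not -BeNullOrEmpty
        }
    }
}
```

Because it’s a test rather than a document, a failure names the exact server that drifted, and rerunning it after remediation proves the fix.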
And it sounds almost childish and simple, but this turns out to be like having a superpower, because before, when I asked you hey is your MAXDOP set properly, you casually said yes without really thinking about it. Instead, this is the answer I want as a fellow DBA: yes, I know it’s correct because it was checked today, because I have a test for it. The cool thing about this is you can prove it. It’s not up for debate, you can press a button and have your proof immediately. If it changes, you press a button again and you can reprove it on the spot. Also, it’s something that you can really safely feel accountable for. You don’t need to worry that hey, maybe my MAXDOP isn’t set properly, because you have tests running all the time to confirm it for you, and whether you’re sick or you’re just feeling lazy that day and want to watch cat videos, the tests have your back. They’re running all the time.
So that’s almost enough talk, now we need to start putting it into action. Here’s the demo environment I’ve set up, it’s very simple. I’ve got a domain controller, I’ve got my workstation and a three-node availability group. What you need to do is pick a workstation for yourself. Download PowerShell 5.1, which is the current version supported on most operating systems. Even though this will run on, say, 2008 R2, you can still reach out and touch servers that are much older, for example, Windows Server 2003, if you have any. That’s why you should just set it up on one workstation. Don’t go installing PowerShell on those. You should install the latest Management Studio, which is cool because that one is going to get updates through WSUS and it installs a lot of useful DLLs that you can use in your tests.
The cool thing about PowerShell 5 compared to previous PowerShells is that this one lets you download modules directly from Microsoft’s PowerShell Gallery, and don’t worry if you’ve got firewalls and such, you can also save them offline and move them to your servers. So here I am, installing a SQL Server module which we need, plus three modules I’ve created, and another common module that contains a lot of functions that are useful for DBAs using PowerShell. There are also PowerShell modules that are installed as Windows features, almost like through Add/Remove Programs. So if you want to look at Active Directory or do failover cluster testing, you should run these commands too and install those features, which makes the modules available.

So now we can finally get to the cool stuff, which is talking about Jenkins. Jenkins is what we’re going to use to orchestrate and execute all of our PowerShell testing, but the first thing you need to know about Jenkins is it’s like any other software where it has security issues. So you should go and sign up to the security mailing list, and what happens is they give you about a week’s notice before any security issue hits so that you can plan your change and release process for updating Jenkins. Now, that does happen from time to time, but because you’re running it on your workstation and hopefully your workstation’s firewalled off, or you pick a server somewhere else that’s out of the way and you firewall that off, you should be pretty safe. Also, Jenkins makes security updates very easy. They’re through the web UI, so it shows you a button when you log in, you press it and it updates everything.
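For reference, the module installation described a moment ago might look something like this. The SqlServer and PoshRSJob names are real Gallery modules; the feature names assume a server OS with `Install-WindowsFeature` available.

```powershell
# Install modules from the PowerShell Gallery (PowerShell 5+).
Install-Module -Name SqlServer -Scope CurrentUser
Install-Module -Name PoshRSJob -Scope CurrentUser

# Offline alternative for firewalled environments: save locally,
# then copy the folder to the target machine's module path.
Save-Module -Name PoshRSJob -Path C:\Temp\Modules

# Feature-based modules (Active Directory, Failover Clustering)
# come from Windows features rather than the Gallery; on a server OS:
Install-WindowsFeature RSAT-AD-PowerShell, RSAT-Clustering-PowerShell
```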
This is how you download it. You go to the website, and there’s a little dropdown button that can be hard to see, that will give you a Windows installer. There are two versions here, you can probably safely pick the LTS release, the weekly release is for bleeding edge people. To install it it’s next, next, next, next, next, next, and now Jenkins is running on port 8080 of your server. Now, Jenkins is composed of the core web interface and also installs a bunch of plugins by default that are useful for developers. Jenkins is very widely used by developers, but we need to add some extra plugins as well to get our job done.
So this is the list of six plugins I want you to install, there’s one to run PowerShell, there’s one to let you log in with your active directory credentials, and then some other bits and pieces that are useful down the line. Now, I’m going to show you how to do this. So here is what you will see when you first install Jenkins, which is a kind of blank screen. You go to manage Jenkins and down here to manage plugins. This is where you get to see all of your plugins that need to be updated, although you can do that later. When you install this for the first time, you go to the available tab and you start typing in the name of the plugin you’re looking for, which will filter down the list, you select it and you install it. It’s that simple. One of those plugins that we installed is an active directory plugin, so the first time you use Jenkins, it’s going to log you in with its own job system – sorry, user system.
So if you go into manage Jenkins and configure global security, after you’ve installed the active directory plugin, you get some options here to use active directory. By default, it’s this – the Jenkins own user database. Just select active directory, you add in your active directory account and this bit is case sensitive so be careful about that. You click add, you select all of the options, and you save it. Then it forces you to log back in using your active directory credentials and you’re all ready to go. This is just securing Jenkins so that it’s safe for us to operate on it and other people aren’t going to jump in and start doing things.
So back to the slides. We’ve got our additional plugins installed already. Now, we’re going to want to do some actual work in Jenkins. Let’s create a job. Jobs are the basic unit of work for Jenkins, and we’re going to run some PowerShell in it as well. So here it says create new jobs. You could also use this new item button in the top left corner. We’re going to call this one sample job – I want it to look pretty. Okay, and freestyle project is the one you would always pick. These other ones are used for development. And that’s it, we’re going to save the job and run it. Pretty simple. I mean, the job’s not doing anything, but it will give us a green dot down here. So going back to my main screen, we’ll get a list of jobs here that we’ve set up, you can build them, which runs the job, and it populates a kind of history down here. The green dot shows you the output of whatever your job did. The hash number tells you some details about how long the job was waiting, how long it took to run, and things like that. Now admittedly, we’re not in useful territory yet, so let’s add some PowerShell to it.
So I’ve gone back to configure the job, and down here in build, I’m going to add a PowerShell step, and here I can fill in any PowerShell I want. So I have some stuff here that I’m going to paste in. You can ignore this top stuff because that’s just best practices, but here I’m just going to write some output, save it, build it, and now when I build the job it starts that PowerShell task, which then runs. And here I’ll be able to see the output of what happened.
Now, here’s the output that I ran from PowerShell, but I’ve got a red dot here because the job failed. It failed because I didn’t leave a variable behind to say whether the job was successful or not. So let’s go fill that out. Back to configure, and here I have my last exit code. If it’s zero, everything worked properly, if it’s something else then it failed. So when I build this now, now that I’ve populated that variable, it should give me a green dot, which it does. And here you can see I’ve just got my output messages. Now, what happens if we want an error to occur? Well, we can write an error like we normally would in PowerShell, and because of the way I have set this up with the headers, this error is like a terminating error, and when the terminating error happens, Jenkins knows that the job has failed. So I’m going to build it again, and this time I’ve got a proper failure.
Back to my slides. So when you see me fill this out later, I’m going to use this script. All of my scripts begin with set strict mode. What this does is tell PowerShell if I mistype a variable name or I reference a variable that doesn’t exist, that’s an error and I should be notified. The second step is setting error action preference to stop. That means when an error happens, stop processing, don’t go any further. Then I import the module I’ve created to hold my PowerShell functions. Now, I’m going to include a script in the script pack that will show you how to create a module if you’ve never done one before, but it’s very simple. It’s just a couple lines, and it just creates a little unit to store my things in.
Then I run everything in this try-catch block. What is this doing? Well, between the try and the catch, I’m going to do whatever testing I’m going to do. If an error happens, because of the way I’ve set up strict mode, because of the way I’ve set up error action preference, the error gets caught. It sets that last exit code to one so that Jenkins knows the job has failed, and then I’m going to dump out the details of the error. You’ve probably never seen Resolve-Error before because it’s a custom function that’s in one of the modules I installed earlier, and it just creates a kind of stack trace, a very detailed voluminous stack trace that helps you find problems in your code.
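Putting those pieces together, the job template just described might look something like this. `Resolve-Error` and the module name stand in for the custom pieces the speaker mentions; the test body is a placeholder.

```powershell
# Sketch of the Jenkins job template described above.
# MyDbaModule and Resolve-Error are stand-ins for the custom module
# and function mentioned in the talk.
Set-StrictMode -Version Latest        # mistyped variables become errors
$ErrorActionPreference = 'Stop'       # any error stops processing
Import-Module MyDbaModule

try {
    # ... whatever testing you want to do goes here ...
    Write-Output 'All checks passed'
    $LASTEXITCODE = 0                 # green dot in Jenkins
} catch {
    $LASTEXITCODE = 1                 # red dot: tell Jenkins we failed
    Resolve-Error $_                  # dump a detailed stack trace
}
```

The exit code is the contract with Jenkins: zero means pass, anything else marks the build as failed.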
Now, we need to talk about what Jenkins did when we installed it, and I know, I know, we all want to get to the good part, but when Jenkins installed, it installed a 32-bit version of the Java runtime in its directory. The problem with this is that a 32-bit Java will call 32-bit PowerShell, and 32-bit PowerShell doesn’t support many of the modules that are important these days. To fix it, you download Java 1.8 64-bit, you install it on the server, you edit the XML file that I’ve indicated here, and you restart the service. It’s really simple to do and you only have to do it once. The second thing is when Jenkins installed, it installed as the local system account, which isn’t great for security, and it’s really easy to fix this too. Create a domain user, grant them control of the Jenkins directory, add the service permissions with this crazy string I’ve got below here, which just says whoever owns the service can restart the service, which is important for patching, and then you modify the service to use your active directory account and you restart it. Easy, again, you only have to do it once. Also, now that we have Jenkins running under a service account that we control, that’s the one that you can start granting access to your servers for.
To start with for your testing purposes, you’ll probably grant local admin or sysadmin. Of course, if the tests you have are simpler, then you can have much lower permissions granted to that user, and there are ways in Jenkins to segregate permissions as well, but we’re going to keep things simple for this time. Before we go back into Jenkins though, we need to think about what we’re doing. Because we have 500 servers, we can’t just run 500 tests one after the other in Jenkins and expect to get a response this year. The way you would deal with this in normal PowerShell is through the jobs system. If you type Start-Job, it will spin up another PowerShell process, run it and then return the results to you. The problem is if you were to start 20 jobs, it literally starts 20 PowerShell processes and runs them all at the same time, which is a massive resource hog.
There’s an alternative, which is a very common module called PoshRSJob, which runs almost exactly the same way – you just add two little letters – and it will run things more efficiently. What does this mean in real life though? Well, here I’ve got a command that will do 20 things one after the other, each taking a second, and it takes 20 seconds to run. The second one is where I use Start-Job to sleep for a second. So I’ve got 20 jobs, each sleeping for one second, and somehow it takes 48 seconds to run. That’s because PowerShell, through Start-Job, is starting up another PowerShell process, loading everything, running it, reading the results, and waiting for it to close, so it’s super inefficient. If we use Start-RSJob, we can tell it how many things to run at once – in this case I’ve throttled it to four jobs at a time, because that’s how many cores I’ve got on this server – and it takes six seconds to run, basically, and those numbers get much better the bigger your numbers are. So you can see we don’t want to run things sequentially, we don’t want to use Start-Job, we want to use Start-RSJob.
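The timing comparison above can be sketched like this, assuming the PoshRSJob module is installed (Install-Module PoshRSJob); your exact numbers will vary with your core count.

```powershell
# Sequential: 20 one-second sleeps, roughly 20 seconds.
Measure-Command {
    1..20 | ForEach-Object { Start-Sleep -Seconds 1 }
}

# Start-Job: one whole PowerShell process per job, even slower here
# because of process startup and teardown overhead.
Measure-Command {
    $jobs = 1..20 | ForEach-Object { Start-Job { Start-Sleep -Seconds 1 } }
    $jobs | Wait-Job | Receive-Job
    $jobs | Remove-Job
}

# Start-RSJob: runspaces inside the same process, throttled to 4 at a
# time, so 20 one-second sleeps finish in roughly 5-6 seconds.
Measure-Command {
    $jobs = 1..20 | ForEach-Object { Start-RSJob -Throttle 4 { Start-Sleep -Seconds 1 } }
    $jobs | Wait-RSJob | Receive-RSJob
    $jobs | Remove-RSJob
}
```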
Now, when I started doing this, I looked at these command lines and I thought, hey, these all follow the same kind of format, and maybe instead of having to type this out and put my tests in here in place of the sleep, I should build it into the tests somehow, and so I wrote a blog post about it. My idea was to create a function template where you would just copy and paste this every time you create a PowerShell function and fill out whatever you want to do here. If you passed in an array of 20 different computers, all of the other complex code handles the job scheduling, running, waiting for results, and collating them together. All you need to worry about is what you’re going to do on each individual one.
Now, this was a very simplistic way of looking at things, but it will give you some idea about where we’re going, which is some place bad, because as soon as you start doing this in the real world, not just a demo, things become complicated and I really want to show you why.
The perils of multithreading. In my first example here, what I’m doing is generating ten GUIDs. I’m going to start a job for each one that returns whatever GUID it is, but one or more of them will occasionally fail. Let’s see what happens. Imagine these GUIDs were ten different server names. What I’ve got back are the nine server names where everything was A-okay, but the server which had a problem is just an “oops”. Which server was it? That is something that I would really want to know, but it’s not available to me by doing this kind of thing. You can hack around it – you can tell the job system, hey, name each job after whatever server or GUID I’m running against, and then when I get the results I try to munge it together with the output and return it in some kind of format – which leads to this kind of disaster where, yes, I know all of the servers and I know this one has some kind of issue, but the exceptions and the other weird things are returned out of order, and now this has screwed up my entire format. This is a real problem.
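One way around losing the server name is to never let an error leak out raw: wrap each unit of work in try/catch and always emit an object that carries the server name alongside either the result or the error. This is a minimal local sketch using plain foreach for clarity; the same shape works inside Start-RSJob script blocks, and the server names are made up.

```powershell
$servers = 'SERVER1', 'SERVER2', 'SERVER3'

$results = foreach ($server in $servers) {
    try {
        if ($server -eq 'SERVER2') { throw 'oops' }   # simulate one failure
        # On success, the output still tells us which server it was.
        [pscustomobject]@{ ComputerName = $server; Result = 'Pass'; Error = $null }
    } catch {
        # On failure, the error is captured in context instead of leaking out.
        [pscustomobject]@{ ComputerName = $server; Result = 'Fail'; Error = $_.Exception.Message }
    }
}

$results | Format-Table -AutoSize
```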
What we want to do is give Jenkins some really good information that it can parse, and we don’t want it littered with junk like this. What Jenkins wants is some XML, and that’s what we’re going to give it. JUnit XML is a way of describing tests – well, tests as in literal tests, functional tests – what they’re testing and what kind of suite of tests they’re in. So here in the first example, we have a suite called Ops. Ops is the name of the PowerShell module that I created. There’s a class here called Test-OpsComputerPing, which is going to be the name of the function I’m going to write next, and here I’ve got a name, which is the server that I ran the test against – the first node of my availability group. This piece of JUnit XML here is something that Jenkins can interpret to say, when I run a job, I want you to give me some output that tells me this server passed this test. What if something went wrong though?
JUnit wants something different – it wants a bit more XML inside, saying that a failure happened and what the message is, and it gives you a spot to store some stack trace data or other complex information. You could also skip a test. Why would you skip a test? Let’s say you were going to test the availability groups, but you run it against a server that doesn’t have any availability group. It’s not a pass and it’s not a fail. You would probably want it to say skipped. If you were doing this by hand, which you won’t, this is what it would kind of look like in PowerShell, where we create a hash table, and we have some flags to say this is the module we’re running, this is the function we’re running, this is the server I’m testing, this is the result, and these are the messages and stuff I want to return. I’ve got a function here that can convert the hash table into the JUnit XML format, and it will output stuff at the bottom here.
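A minimal sketch of what that hash-table-to-XML conversion looks like. The element names (testsuite, testcase, failure, skipped) follow the common JUnit format; the ConvertTo-JUnitXml function name and the field names are just for illustration and are not the real internals.

```powershell
function ConvertTo-JUnitXml {
    param([hashtable] $Test)

    $xml = New-Object System.Xml.XmlDocument
    $suites = $xml.AppendChild($xml.CreateElement('testsuites'))
    $suite = $suites.AppendChild($xml.CreateElement('testsuite'))
    $suite.SetAttribute('name', $Test.Suite)

    $case = $suite.AppendChild($xml.CreateElement('testcase'))
    $case.SetAttribute('classname', $Test.ClassName)
    $case.SetAttribute('name', $Test.Name)

    switch ($Test.Result) {
        'Fail' {
            $failure = $case.AppendChild($xml.CreateElement('failure'))
            $failure.SetAttribute('message', $Test.Message)
            $failure.InnerText = $Test.Data   # stack trace or other detail
        }
        'Skip' {
            [void] $case.AppendChild($xml.CreateElement('skipped'))
        }
        # 'Pass' needs no child element at all
    }

    $xml.OuterXml
}

$result = ConvertTo-JUnitXml @{
    Suite     = 'Ops'
    ClassName = 'Test-OpsComputerPing'
    Name      = 'SEC1N1'
    Result    = 'Fail'
    Message   = 'Ping failed'
    Data      = 'Request timed out'
}
$result
```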
This is how some of the internals work, but again, you’re never going to write this. This is too complex. So I’ve bundled it all up for you. I’ve created a module called Jojoba, named after that nice stuff they put in shampoo that smells good. It’s based off a couple of ideas. We’re going to have a whole bunch of functions, and each function is a test. All the tests should automatically run across all the servers we pass in, and they should run in parallel so it’s efficient. Every test function will return that standard format that we saw, that JUnit format, and that has to encapsulate all of the error handling that’s going on so that we don’t get stuff leaking out of the sides. That test format it returns should be readable on screen, because the power of PowerShell is that you’re running it interactively on a command line. But also, it needs to be translated to XML so that Jenkins can read it and populate all of the cool complex details for the GUI.
How does it work? This is the entirety of what you will see as a user of Jojoba. You cut and paste this template, you name it after whatever test you want, and you write your code here in the bottom section that’s highlighted. If something happens that you want your test to fail on, you say hey, Write-JojobaFail and some detail. If you want to skip the test, Write-JojobaSkip, and that’s basically it. All of the other complexity – parallelization, formatting the output, getting it to test output, translating it to XML – is handled behind the scenes.
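The template looks roughly like this, based on Jojoba's public pattern; treat the exact parameter names as assumptions and check the module's own documentation. Your test logic goes inside the Start-Jojoba block.

```powershell
function Test-OpsSomething {
    [CmdletBinding()]
    param (
        # One server name per pipeline item; each is tested in parallel.
        [Parameter(ValueFromPipeline = $true, Position = 0)]
        [string] $InputObject,
        [switch] $JojobaQuiet
    )

    begin {
    }
    process {
        Start-Jojoba {
            # Your test code goes here.
            # Write-JojobaFail "some detail"   # to fail this test
            # Write-JojobaSkip "why"           # to skip this test
        }
    }
    end {
        Publish-Jojoba
    }
}
```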
So let’s pick an actual test. One of the expectations we might have is, hey, can I ping all of my computers? If I run SQL Server, I surely should be able to ping the computers underneath. So I’ve called this Test-OpsComputerPing. I said, hey, there’s a default variable here, $InputObject, but I’m going to give it a cooler name, ComputerName, so I can remember what kind of thing I meant to pass in. Every time I pass in a computer name, it’s going to split out onto another thread in the background, so I don’t have to worry about it. Here is my actual testing code, and this is a basic PowerShell built-in function that tests a connection to a server and tells you whether the ping worked or not. If the ping didn’t work, I tell Jojoba, hey, the ping failed, and then magic happens within Jenkins.
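A sketch of that ping test, assuming the Jojoba template from before; Test-Connection is the built-in cmdlet that sends ICMP pings, and the alias is what lets me call the parameter ComputerName while keeping the template's $InputObject plumbing.

```powershell
function Test-OpsComputerPing {
    [CmdletBinding()]
    param (
        [Parameter(ValueFromPipeline = $true, Position = 0)]
        [Alias('ComputerName')]
        [string] $InputObject,
        [switch] $JojobaQuiet
    )

    begin {
    }
    process {
        Start-Jojoba {
            # -Quiet returns $true/$false instead of the echo reply objects.
            if (-not (Test-Connection -ComputerName $InputObject -Count 2 -Quiet)) {
                Write-JojobaFail "Ping failed for $InputObject"
            }
        }
    }
    end {
        Publish-Jojoba
    }
}
```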
So let’s see how this actually looks when we use it. Here’s my test ping – I’m copying it into my module directory, and if I open this, you’ll see it’s exactly what I showed you beforehand. Now I’m going to force-load my module, now that I’ve changed it, and run it against something. This is the first node of my availability group, and you can see, because I ran it through the command line, it’s given me a format I can really understand. It says when I ran it, who ran it, what server I was testing, and the results. Of course, I can pass in more here, and they’ll run in parallel in the background. Now, it doesn’t seem very fast because it’s just a virtual machine here, but in the real world, you can test 500 servers in about 30 seconds.
So here are the two results, one for each server. Of course we don’t want to be running this in PowerShell forever; we want to put it into Jenkins. So let’s see how we do that. Back in Jenkins, I’m going to create a new job. What I like to do is name the jobs after whatever the test is. So I’m going to Test-OpsComputerPing, it’s going to be a freestyle project, and okay. Now I add my PowerShell build step down here, and I’ve got the template that I’ve shown you before, cut and paste, which does my best-practice PowerShell stuff, imports my module, and this is the only important bit, where I’m running my test. Now, what is this? This is a function that’s built into the dbatools module that I installed earlier on. If I was to run it interactively, what it does is connect to the server I tell it and pull back a list of all my other servers stored in CMS, the Central Management Server, if you remember that from 2008.
So I’ve entered all of my servers into CMS, and now I can pull them back with a PowerShell function. The cool thing about this is that when I get a new server, I just add it to my registered server list, and now it’s automatically going to be picked up every time this function is run from inside Jenkins, and my test will run against it. So I’ve got this filled out; now I need to add something else to tell Jenkins, hey, I need you to interpret some of that JUnit XML. You add a publish xUnit test result step, you add a JUnit entry, it will be called Jojoba.xml, and you put a zero here. What does this do? Well, it gives us three different results from Jenkins. There’s a green, which says I’ve run the test, the test worked perfectly, all of your servers are A-okay. There’s a red, which means there’s something wrong with your script – your script failed, it threw a horrible error, and I don’t know what’s going on, you need to investigate. And the zero tells it, hey, if I get some results and one of the servers fails the test, the test still worked – it functioned – but turn the indicator yellow so I know there’s something I need to investigate. Now I save it and build it.
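The important bit of that Jenkins PowerShell build step looks roughly like this. Get-DbaRegisteredServer is the dbatools cmdlet for pulling registered servers out of CMS; the instance name SEC1N1, the module name Ops, and the idea that Jojoba writes Jojoba.xml automatically when run under Jenkins are assumptions from the demo environment.

```powershell
$ErrorActionPreference = 'Stop'
Import-Module Ops

# Ask the CMS instance for every registered server, then pipe the names
# into the test; Jojoba parallelizes and emits Jojoba.xml for Jenkins.
Get-DbaRegisteredServer -SqlInstance SEC1N1 |
    Select-Object -ExpandProperty ServerName |
    Test-OpsComputerPing
```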
So what is this job doing? It’s asking the first node of my cluster, give me a list of all the servers you know about; it’s passing that into the function we’ve created, which runs in parallel across all the servers to confirm that ping is working; and then it’s giving Jenkins results that it can understand. Now, in the console output, we still see what we would see if we ran this interactively. But what’s powerful is that we now get this extra option here of showing us test results, which you can drill down into, which collects your suites and your classes and tells you, hey look, these are the servers I ran against and they all passed. If I build this again, I’m going to get something even cooler, because Jenkins starts correlating all of that information and showing you over time which servers passed and failed and how many there were, and you can generate really – well, it generates really complex graphs and charts for you.
So here it is, I’ve run it twice so it has some history, and it’s showing us here I’m running against three servers and all three were green. One of the plugins I installed earlier also shows you the test results in a different format, where you can kind of drill down and see greens and reds and oranges based on what’s happening and some cool pie charts and stuff, which is always fun for management.
So you can see this is the very beginning of how we would use Jenkins for testing, but we’ve got a long way to go, because the things you do in PowerShell now – assuming you do use PowerShell – aren’t really fit for scale. You need to change things a little. The problem with this kind of testing is that if you were to run it across 500 servers in parallel, but one of the servers chooses to never respond for some reason and whatever PowerShell you’re running doesn’t have a built-in way of timing out, then you will never get any results, and that’s a bad thing.
One of the things that is notorious for this in PowerShell is WMI. You might use WMI to query volume information, processor information, even SQL information, but the problem with WMI is that it just does not time out if there is a severe error on the server. Also, it’s very chatty across the network – this is a simple query of what disks are on a server, and it sent 72 packets across the network. There’s an alternative, and it’s almost exactly the same. Instead of typing Get-WmiObject, you run this command, which builds a CIM session to whatever computer you’re going to query, and that only takes 12 packets. This will work on Windows Server 2003 and above, so it’s really backwards compatible. Then instead of using Get-WmiObject, you use Get-CimInstance, along with the session that we just created.
The cool thing about this is that it takes a timeout. Also, it only takes 9 packets to send the query, so it’s really important, if you’re going to do any WMI queries in your PowerShell tests, that you switch to a system like this. This New-CimSessionDown is part of one of the modules that I asked you to install earlier, called CimSession, which is one that I wrote. But CimSession has another cool feature, because when you’re doing testing against SQL Servers, there are usually three ways you’ll be doing it. You’ll either be touching WMI or CIM to query the underlying computer, you’ll be touching SQL itself, or – the third thing – you’ll often want to look at things in the registry.
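Side by side, the switch looks like this; it assumes the CimSession module providing New-CimSessionDown is installed, and SEC1N1 is the demo server name.

```powershell
# Old way: chatty across the network, and can hang indefinitely on a
# sick server because WMI has no real timeout.
Get-WmiObject -ComputerName SEC1N1 -Class Win32_Volume

# New way: build the session once (New-CimSessionDown negotiates the
# protocol so it also works on older servers), then query it with a
# real timeout via -OperationTimeoutSec.
$session = New-CimSessionDown -ComputerName SEC1N1
Get-CimInstance -CimSession $session -ClassName Win32_Volume -OperationTimeoutSec 30
```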
So I wrote some things that can query the registry over those CIM connections. It’s very efficient, easy to use, and it’s something that native PowerShell doesn’t do. Native PowerShell only lets you look at the registry of your local computer, and that’s going to become useful later.
Now, there’s one other thing that people use that they shouldn’t use at scale, and that’s Invoke-SqlCmd – or, if you use the command line, sqlcmd. Invoke-SqlCmd has some severe issues that make it not fit for running at scale, and I’m going to go over some of them now, because otherwise you won’t believe me. Here’s me running Invoke-SqlCmd a hundred times against my server to get the version. Everything is A-okay. However, if I was to run these at the same time – a hundred different jobs at the same time – I get a whole bunch of errors. That’s because Invoke-SqlCmd is not designed to run concurrently. It’s not designed to run at scale, and so we can’t use it, because we don’t want our tests randomly failing for no reason.
Invoke-SqlCmd has also got another horrible thing where you can pass it variables, but the variables it uses are a kind of string substitution. In this case, I’m asking to select something … databases where it matches a variable I pass in – in this case, master – and I get the results I want. But it’s so easy for someone to commit a SQL injection attack where, instead of querying for master, it’s also retrieving all of your server principals as well. These are problems built into Invoke-SqlCmd, and that’s why you shouldn’t use it.
But what’s the alternative? Well, of course there’s an alternative, because I wrote one. One of the modules that you installed in the beginning is called DbData. It doesn’t have any of these issues. It’s safe to run in runspaces, it can’t be injected, and it doesn’t hang, which is another issue that Invoke-SqlCmd has when it touches a faulty server. You can capture the detail from print statements, and you can do retries on timeouts and deadlocks. It’s even got a really simple way of doing CRUD, upserts, and bulk inserts on any data that you retrieve. You can also wrap your stuff in SQL transactions from within PowerShell, and it’s got some other cool features that I won’t go into. I do want to show you how it works though, because it’s kind of important that you use it, or something like it.
Here I’m going to run the same test as before, where I’m doing it a hundred times, asking my server for its version information, and it just works. How does this work? Well, previously you would say Invoke-SqlCmd, the server name, and what query you want to run. In this case, we build a connection to the server, we tell it what command we’re going to run, and then we say, hey, get me the data, and that’s it. It’s pretty simple once you’ve used it a few times. Also, it doesn’t have the problems with injection attacks. Here it was querying stuff for master, but if I try to pass it something awful, it simply doesn’t work. Why? Because it’s using proper SQL variables here, and we’re passing parameters in a hash table. It’s not just the straight string substitution that Invoke-SqlCmd does. But then there are the other features I talked about, where if you had a print statement, it’s very easy to capture the details of it. This could also be an error, and you want to be able to easily extract out the state, the class, and things like that – or you could just wrap your stuff in a simple retry.
Here I’m waiting for ten seconds, but I’m only going to wait one second for the data, and so I’m going to get a timeout. What if I want to retry on the timeout? Well, now it’s as easy as wrapping it in Use-DbRetry. Once you start building things at scale, these kinds of things become very important. But even if you don’t run at scale, if you run across more than one server at a time, you really do need to use it.
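The DbData pattern described above looks roughly like this; the cmdlet names are from the DbData module but treat the exact parameter shapes as assumptions and check the module's README, and SEC1N1 is the demo server.

```powershell
# Build a connection, define a parameterised command, then get the data.
# @DatabaseName is a real SQL variable bound from the hash table, so it
# can't be injected the way sqlcmd-style string substitution can.
New-DbConnection -ServerInstance SEC1N1 |
    New-DbCommand 'SELECT name FROM sys.databases WHERE name = @DatabaseName' -Parameters @{
        DatabaseName = 'master'
    } |
    Get-DbData

# Timeouts and deadlocks: a short command timeout against a slow query
# will fail, and Use-DbRetry wraps the attempt in retry logic.
$command = New-DbConnection -ServerInstance SEC1N1 -CommandTimeout 1 |
    New-DbCommand "WAITFOR DELAY '00:00:10'"
Use-DbRetry { $command | Get-DbData } -Verbose
```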
And that’s all of the beginner stuff. Now we need to get into what’s actually important, which is using it all. Finally. We’re going to assume that you’ve started work at a company that has 500 servers, and you’re going to experience your first day there and all of the horrible things that will happen. When each of these things happens, you have to ask yourself: what’s the expectation? If this is running, or it’s not running, what would I expect the server to look like? And that’s going to be the basis of what you write your test around. Then we’re going to add the test to Jenkins and show that the test picks up servers that fail. Then we’re going to fix the problem and confirm the test passes, and this is going to be our day-to-day job. So it’s 9am and there’s an application outage. What could possibly cause that? It’s sitting on a highly available availability group. Well, it turns out that someone added the database to the first node but never added it to the availability group itself, so when the listener failed over, the application went down.
There’s an expectation here that if you have your database on a highly available availability group, it should probably be in the availability group. So that’s what we’re going to test. Now, we can use something special for this, which is called SMO. SMO is the framework that Management Studio uses to talk to SQL Server, and it’s very easy to use in PowerShell to get a lot of work done quickly, and if you’ve never touched SMO before, there’s a really cool walkthrough by [inaudible] Junior that shows you how to use it. But I’m also going to show you how I did it.
So let’s get started with the first test, which will pick up this potential application outage. Here is my test function; I’m going to put it into my function directory. I’ve called this one Test-OpsAvailabilityGroupDatabase, and here is the stuff that I’ve inserted into the template. What I’m doing here is getting an SMO object for whatever server I’ve passed in. I’m going to test whether that HADR enabled checkbox is ticked – the one in the Configuration Manager application when you set up availability groups. If it is, do we have any availability groups set up? If so, then check each database, except for the system databases, because those can’t be part of a group, and if it doesn’t have an availability group name attached to the database, that’s a fail. The reason I’m checking for these two things here is that if you start trying to query availability group information on a 2008 server, it will all throw an error. We don’t want to throw an error for that, but if we check things in this order, it will go through to the skip section and say, hey, there are no availability groups here, carry on.
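The body of that test can be sketched like this, assuming the Jojoba template. IsHadrEnabled, AvailabilityGroups, IsSystemObject, and AvailabilityGroupName are real SMO properties; the overall shape follows the description above rather than the exact demo code.

```powershell
Start-Jojoba {
    $smo = New-Object Microsoft.SqlServer.Management.Smo.Server($InputObject)

    if (-not $smo.IsHadrEnabled) {
        # 2008 servers and non-HADR instances land here instead of erroring.
        Write-JojobaSkip 'HADR is not enabled on this instance'
    } elseif ($smo.AvailabilityGroups.Count -eq 0) {
        Write-JojobaSkip 'No availability groups are set up'
    } else {
        # System databases can't join a group, so skip those.
        foreach ($database in ($smo.Databases | Where-Object { -not $_.IsSystemObject })) {
            if (-not $database.AvailabilityGroupName) {
                Write-JojobaFail "$($database.Name) is not in an availability group"
            }
        }
    }
}
```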
So let’s see what happens if I run this manually. I’m running it against one of my servers, and you can see immediately that it’s failed, because I have these three databases that are on an AG instance but not in an AG. Let’s put this into Jenkins, where it’s going to become useful. I’m going to go New Item, name it after my function, and this time, instead of typing everything out, I’m just going to copy it from one of my previous jobs. That way everything is already set up. I’m going to replace this here – now, remember, this is getting a list of all of my servers from my Central Management Server and then running the testing in parallel – and I’ve already filled out all the information here, so all I need to do is run it, and we’ll see what the state of my small demo server environment is.
Now you see this time we have a yellow. The yellow means the script ran perfectly, but some of the servers aren’t so healthy. This time I’m going to go into my test results and see, hey, the first two nodes of my availability group have an issue – which is funny, because it means those databases must be missing from the third node as well, otherwise it would have errored too. When I expand it, this is Jenkins’ translation of the XML that you would see if you looked in the console. What’s cool about this is that you can also output some stuff in your PowerShell job that says, hey, when this problem happens, this is how to fix it. I’m going to build this again. Since I haven’t changed anything, I should get exactly the same results.
What’s also cool is that you can look at the console output over the web while the tests are running, and here you can see what we would see if we ran it manually in PowerShell, where our third server passed and the other servers are wrong. Now, I’m not actually going to fix these this time, because some of these databases are important for the other problems that are going to happen today, because this was just the beginning of our day, the first hour. It’s 10am – hey, data corruption, that’s great. Well, what are our expectations surrounding data corruption? Probably that there shouldn’t be any I/O errors, and how do we pick those up? Well, your databases should have a page verify option of CHECKSUM, and there should be nothing in your msdb suspect_pages table. Easy enough. This is a very simple test to write. I’ve got a function here, and all we’re doing is saying, hey, this incident has bitten us in the ass before, and now we want to pick it up next time it happens – on the day it happens, not six months later when something really bad happens. Here I’m creating a connection to the server – this is my alternative to Invoke-SqlCmd – and I’m querying: is there anything in the suspect_pages table?
If there is anything, then I’m going to output it and fail the job: hey, suspect pages detected. The other thing I’m going to do is get one of those SMO objects again and go through each database where the page verify isn’t set to CHECKSUM. I’m also checking what version of SQL it is – I think this is for 2008 – and checking for tempdb, because that can only do torn page detection, something like that, and it will fail it for that or for some other reason. So I’m going to create a Jenkins job straight off the bat for this one: New Item, named after my function, and I’m going to copy it from one of the existing jobs and paste this in.
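The core of that corruption check can be sketched like this: suspect pages via a DbData-style query, page verify via SMO. The table and property names are real; the DbData cmdlet shapes and the Jojoba wrapper are assumptions carried over from earlier.

```powershell
Start-Jojoba {
    # Anything at all in msdb.dbo.suspect_pages means an I/O error was
    # detected at some point and needs investigating.
    $suspectPages = New-DbConnection -ServerInstance $InputObject -Database msdb |
        New-DbCommand 'SELECT * FROM dbo.suspect_pages' |
        Get-DbData
    if ($suspectPages) {
        $suspectPages | Out-String | Write-Output
        Write-JojobaFail 'Suspect pages detected'
    }

    # Every database should have PAGE_VERIFY CHECKSUM.
    $smo = New-Object Microsoft.SqlServer.Management.Smo.Server($InputObject)
    foreach ($database in ($smo.Databases | Where-Object { $_.PageVerify -ne 'Checksum' })) {
        Write-JojobaFail "$($database.Name) has page verify $($database.PageVerify)"
    }
}
```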
And let’s see how healthy my servers are – because, of course, they’re all healthy; I’m sure after I fixed one, all the others magically fixed themselves. Nope, one of them is wrong. It’s the first node of my availability group. I’ve got suspect pages, and I’ve also got a database called NOCHECKSUM, which isn’t set to CHECKSUM for page verify. I’m going to actually go and fix this one though. I showed you the test already; here’s the fix. In this case, I’m just deleting all those records out of the suspect_pages table, which of course is not a real fix, but after you’ve investigated the cause, this is the cleanup. Then I set my page verify to CHECKSUM. When you set it to CHECKSUM, of course, that doesn’t automatically add a checksum to everything in the database – that only happens as pages are rewritten – but still, setting it is better than never having it.
And now we see we’ve got a green check mark, because everything worked properly. Nothing failed my test, I’ve got some good stats here, and everything is A-okay. Now, the cool thing is, say tomorrow that I/O error happens again. There may be nothing else in your environment that’s going to tell you, hey, this error has recurred, but when this job runs every day, it’s going to tell you, hey, you need to look at this. How would we run it every day? Well, if we go back into Configure, there’s a build triggers section with Build periodically, and you can give it something as complex as you want, because it’s a kind of cron format saying, hey, on these minutes, hours, days, run the job. You’ve also got plugins and stuff that will send you emails when this happens, although what you’ll find is that your day-to-day process will be opening the Jenkins UI and seeing what is yellow and what is red. Hopefully nothing will be red.
But our day is still only getting started, and things are going to get worse, because now an application was put onto an availability group but all the reports are running slowly. In fact, maybe they’re even causing the application to time out and fail. Why is this? Well, when people add things to an availability group, they don’t always think about those important read-only routing lists, which are needed to do reporting properly on the secondaries.
So let’s make a new standard to say: if we have an availability group, it should have read-only routing enabled, and it should work. We can actually test that it works. This is where PowerShell starts to get exciting. So I’ve got another function here. How would we test whether it’s working or not? Well, the way I decided to do it is I check whether availability groups are enabled and whether they’re configured, and then for each availability group, for each database, I work out what the listener is. Then I create a connection to it and say, hey, I’m a read-write connection; I create a second connection to say, hey, I’m a read-only connection; and then I compare the server names that get returned from both. If they’re different, then I know that the read-only routing is set properly, or at least it’s set in some way.
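The comparison step can be sketched like this: open one read-write and one read-only connection to the listener and compare which server actually answered. ApplicationIntent=ReadOnly is the real connection string keyword that triggers the routing list (it needs a database in the connection string); the listener and database names are assumptions.

```powershell
$listener = 'AG1-LISTENER'     # assumption: your listener name
$database = 'SomeAgDatabase'   # assumption: a database in the group
$query = "SELECT CAST(SERVERPROPERTY('ServerName') AS nvarchar(128))"

$readWrite = New-Object System.Data.SqlClient.SqlConnection(
    "Server=$listener;Database=$database;Integrated Security=SSPI;ApplicationIntent=ReadWrite")
$readOnly = New-Object System.Data.SqlClient.SqlConnection(
    "Server=$listener;Database=$database;Integrated Security=SSPI;ApplicationIntent=ReadOnly")

try {
    foreach ($connection in $readWrite, $readOnly) { $connection.Open() }

    $rwServer = (New-Object System.Data.SqlClient.SqlCommand($query, $readWrite)).ExecuteScalar()
    $roServer = (New-Object System.Data.SqlClient.SqlCommand($query, $readOnly)).ExecuteScalar()

    if ($rwServer -eq $roServer) {
        # Both intents landed on the same node, so routing isn't redirecting.
        Write-JojobaFail "Read-only routing is not redirecting ($rwServer = $roServer)"
    }
} finally {
    foreach ($connection in $readWrite, $readOnly) { $connection.Close() }
}
```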
So let’s put this into Jenkins and see whether my environment is set up properly, and this is something that people will always forget to do. Or you have that problem with Management Studio where it’s deleted them for you – how would you ever pick that up unless you went looking for it? So I’m copying it from one of my previous tests, and I’m just going to fill it out here, and you see, as time goes on, as we build these, I get faster and faster at it, and I never have to touch these jobs again. They’re just going to keep running over time, doing my job for me. So we’ve got three failures, of course, because there’s one availability group shared between the three nodes, and the read-only routing lists – I never set them up, or actually, I dropped them for the demo. I’ve got some code here that will fix them.
You know what’s cool? You could actually add a parameter to this test that says repair, and then when it finds the problem, it could generate this script and run it for you, so you don’t have to do it manually. Wouldn’t that be cool? Okay, now I need to find which one is the primary. I fixed it. Now, it’s always important, when you fix one of these problems you find in your environment, that you test it again to make sure it really is fixed and that your test is able to work out that it’s fixed, and this time I’ve got a green light, because the read-only routing lists are configured and working properly. It’s going to get worse, because it’s only one o’clock now.
Did you notice that there was something that no one else has documented? There’s something wrong with my servers, and it’s very visible in Management Studio once you know to look for it. Here are my three nodes: version 13.0.1601; 13.0.4001; 13.0.4001. What is going on? Well, it turns out I patched one of my nodes but didn’t patch the others – well, actually, I patched two but missed one out. That could never happen in real life, but we should probably test for it anyway, and this is what the test would look like: we test whether availability groups are enabled and whether we have them set up, then for each availability group we go to each replica, connect to that server, and start comparing versions between them. If there’s any mismatch, we flag it. But that’s not enough. When this kind of problem happens, you need to think: where else could versions not match? And the answer is that not everyone is using availability groups. Some of us are still running failover clusters. How would you tell whether there’s a version mismatch between the nodes of your failover cluster, considering only one is actually going to be running at any one time? This is where it gets really complicated, but it’s achievable. It might drive you crazy, but you only have to write it once, and once you write this kind of thing once, it’s easy to reproduce it and use it as a kind of template.
So what I’ve done is connect to the first server – well, the virtual network name of my server – and work out what the service is called. Then, using the built-in FailoverClusters module, I get a list of the computer nodes, and I get each service running on them that matches the name of the SQL Server I’m checking, and then I reach out to get the version of the executable that the service is pointing to, and I start comparing that between the nodes. This way I can tell, hey, my availability groups are all patched at the same level, and my failover clusters are all patched at the same level as well. Now, these two I’m not going to put into Jenkins, because I think you’ve got the picture already, and we have much more important problems to deal with, because it’s still the middle of the day.
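The cluster half of that check can be sketched like this: find the SQL engine service on each cluster node and compare the file version of the binary it points to. Get-ClusterNode, Win32_Service, and CIM_DataFile are real; the CIM session helper and the overall shape follow the earlier sketches and are assumptions, not the exact demo code.

```powershell
Start-Jojoba {
    $versions = foreach ($node in (Get-ClusterNode -Cluster $InputObject)) {
        $session = New-CimSessionDown -ComputerName $node.Name

        # Find the SQL Server engine service(s) and their executable paths.
        $services = Get-CimInstance -CimSession $session -ClassName Win32_Service -Filter "Name LIKE 'MSSQL%'"
        foreach ($service in $services) {
            # PathName is quoted and has arguments; extract just the exe.
            $exePath = $service.PathName -replace '^"([^"]+)".*$', '$1'
            $file = Get-CimInstance -CimSession $session -ClassName CIM_DataFile `
                -Filter "Name = '$($exePath.Replace('\', '\\'))'"
            [pscustomobject]@{ Node = $node.Name; Service = $service.Name; Version = $file.Version }
        }
    }

    if (($versions | Select-Object -ExpandProperty Version -Unique).Count -gt 1) {
        $versions | Out-String | Write-Output
        Write-JojobaFail 'SQL Server versions do not match across cluster nodes'
    }
}
```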
This one’s also not documented anywhere, and you may never notice it, but have you ever looked at running a patch and seen this, where it shows as incompletely installed when you run it again? It turns out that if you’re installing a SQL patch and happen to, I don’t know, reboot the server partway through – which I’m sure no one has ever done before – this will happen. And what happens is SQL will start up afterwards and everything seems okay, but some of the components are running at different version levels, and I don’t really know what that does, and I don’t want to know what it does; I would rather just identify the problem and fix it by rerunning the patches. First, we need a test to identify when this has gone on, and thankfully I have a way of doing that. Here I have a patching test. This is where registry keys become important. It turns out SQL Server stores in its registry these keys, called configuration state and upgrade complete state, all entirely undocumented. However, through analyzing failures across 500 servers, I could work out that some of these should be zero and some of these should be four, and if they’re not, then your patching is broken.
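Reading those keys remotely can be sketched with StdRegProv over a CIM session, like this; the ConfigurationState key path is what SQL Server 2016 uses for a default instance, but the instance key name and the expected values are assumptions you would adapt to your environment, and the CimSession module's registry helpers wrap this same plumbing.

```powershell
$session = New-CimSessionDown -ComputerName 'SEC1N1'

$hklm = [uint32] 2147483650   # HKEY_LOCAL_MACHINE
$key = 'SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL13.MSSQLSERVER\ConfigurationState'

# Enumerate the value names under the key, then read each DWORD.
$values = Invoke-CimMethod -CimSession $session -Namespace root\default `
    -ClassName StdRegProv -MethodName EnumValues -Arguments @{
        hDefKey = $hklm; sSubKeyName = $key
    }
foreach ($name in $values.sNames) {
    $value = Invoke-CimMethod -CimSession $session -Namespace root\default `
        -ClassName StdRegProv -MethodName GetDWORDValue -Arguments @{
            hDefKey = $hklm; sSubKeyName = $key; sValueName = $name
        }
    # Compare $value.uValue against what you've established as healthy.
    [pscustomobject]@{ Name = $name; Value = $value.uValue }
}
```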
This one’s kind of interesting, so I definitely want to show you what the output is. I’m going to add it into Jenkins. Luckily, we’re approaching the end of our day, though there is a great big disaster I have lined up for the end.
Brent Ozar: You’re teaching people that no one wants your job, is what you’re basically teaching.
Cody Konior: It’s one of the problems at scale. The problems that irritate you irritate me 500 times a day, and on top of that, all of the rare problems no one has ever heard of, I also deal with every day. It’s very lonely. So here is the output of this test. What I’ve done is, when something fails, flag which registry keys I found that look funny, and it seems to be a reliable way of finding when these patches fail. I also write out the details of which ones look okay. So here you can see that one of the servers has the incomplete patch, and the way I reproduced this was by restarting the VM during patching. Now, I’m not going to fix this by rerunning the patch for you, because you get the idea.
It’s 5 pm, time to go home, right? No, it never ends, because now you’re going to deal with a problem that I had to deal with, where a drive disappears from your server, and what the hell does that do to SQL Server? Well, let’s start with our expectations. Every instance should be accessible; you should be able to connect to SQL Server, because if you can’t, that’s a real big problem. But we also want to be able to connect to every database, because every database could have different files sitting on different disks. So let’s start testing this and see where we go, and this is also something that you will never find anywhere else.
I created some databases earlier; ignore all of the fluff. There are two databases: one’s got its data file on E and its log on C, and the other’s got its data on C and its log on E. So they’re reversed, and that’s important for our testing, because these two databases are going to behave differently when I disconnect a disk. So I’m going to add this to my test. I’ll go through how it works in a minute; let’s create this test item first.
Daniel Hutmacher: Just the idea of disconnecting a disk, my brain is screaming at me, everyone brace for impact. Don’t do this.
Cody Konior: It kills monitoring servers.
Brent Ozar: People in chat are going you’re really chipper for someone with a really bad job.
Cody Konior: Yes, okay, currently it’s green. All of my servers are working perfectly fine, so that’s when I do something awful. Here’s my E drive, and I’m just going to set it offline. Now, does that affect SQL Server? No, it doesn’t. So if I were just to do this manually, it’s not going to show anything. I could probably even connect to these databases and I wouldn’t get any problem at all. In fact, you can even query some things in these databases and nothing bad will happen, and yet the underlying data and log files for them are gone. So this is something I had to think about when I was writing this test, and this is why it’s important to write the test while the incident is happening, because afterwards it’s extremely hard to reproduce this kind of thing.
So how do we discover it? Well, first I’m going to get a list of the database names and just make sure that they’re online; if they’re not online, that’s probably a problem. You can add some extra statuses in here if you have database mirroring or something funny like that. Then I’m going to connect to the server once and try switching to each database, and as we just saw, that’s probably not going to cause a problem; I can access the databases fine. However, when you start accessing some table metadata, that is when SQL will just boot out your connection, and what we do is test whether our connection dropped, and if so, that’s a really good indicator that something horrible has happened. It will also throw an error which tells us hey, your data file is inaccessible. This seems to be a fairly reliable way of picking up the problem, but it’s only one of the problems. This is if your data file goes missing.
There’s a second problem if your log file goes missing, and for that you can just have a look at fn_dblog. This is the safe, documented one, not the unsafe, undocumented one, and it will do the same thing, basically just booting out your connection; we capture it and write a failure. So now that we’ve disconnected the disk, I should be able to run the test and prove that those databases, although they look perfectly fine in Management Studio, are actually broken.
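The two checks just described can be sketched as one function: touch table metadata to force a read of the data file, then call fn_dblog to force a read of the transaction log. A missing file drops the connection, which we catch as a failure. Connection handling is simplified here, and the queries are illustrative rather than the speaker's actual code.

```powershell
function Test-DatabaseFileAccess {
    # $Connection is expected to be an open System.Data SqlConnection.
    param($Connection, [string] $Database)
    try {
        $Connection.ChangeDatabase($Database)
        $cmd = $Connection.CreateCommand()
        # Reading table metadata forces SQL Server to touch the data file.
        $cmd.CommandText = 'SELECT TOP (1) name FROM sys.tables'
        [void] $cmd.ExecuteScalar()
        # fn_dblog reads the transaction log, exposing a missing log file.
        $cmd.CommandText = 'SELECT TOP (1) [Current LSN] FROM fn_dblog(NULL, NULL)'
        [void] $cmd.ExecuteScalar()
        $true
    } catch {
        # The connection was dropped or the query errored: flag the database.
        $false
    }
}
```

Run it once per online database over a single server connection; any $false result is a database whose files have quietly gone missing.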
Daniel Hutmacher: Your databases are crashing and you’re saying, like, yes, I’ll get them up in a second, I’ll just write this PowerShell script to test it first, right?
Cody Konior: You say I’m investigating.
Daniel Hutmacher: No pressure.
Cody Konior: You don’t say what you’re doing. In this case, I haven’t got quite the results I was expecting for the demo... here we go. No: the disaster log database’s log file is not accessible. Well, that’s one of the two. I do have the test for the data file; for some reason it’s not triggering this time, so maybe this test isn’t at 100%, but it’s probably way better than anything that you have. So that is officially the end of our day. Now, what other things could you test? Well, there’s a massive list. Two that I blogged about recently: one was a bug found in a certain version of the VMware Tools driver where, if you have receive side scaling turned on, it will really slow down SQL Server. So I documented on my blog how to pick this up through a lot of registry key queries. Another thing I found was that full text indexing is a bit of a strange beast, because if someone has disabled the full text indexing service when SQL starts for the first time, it says okay, full text indexing is broken for the rest of my lifetime, and just leaves things that way, and that’s bad. There are a couple of things you can look at to determine whether your full text indexing is working or not, and also some steps you can use to fix it without restarting your instance. So I have a test for that.
Other things that I haven’t put onto my blog: well, remember when, I think it was 2012 or 2014, there was a bug in the first service pack where it kept writing registry entries until the hive ballooned and killed the server? You could write a test for that; just check whether that registry hive is above 1GB or so. You can test whether your encryption patches are there. Kerberos authentication is something that’s very rarely set up on servers, and is something that you could make part of your standard; to check that the SPNs are correct, all you have to do is connect to the server and check what kind of authentication your session has. You should also check any listeners on your availability groups, because that’s somewhere people often mess up.
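Two of those checks as minimal sketches. The auth_scheme column of sys.dm_exec_connections is documented, and 'NTLM' on a remote connection usually points at broken or missing SPNs; the SOFTWARE hive file path and the 1GB threshold are assumptions based on the talk, not hard limits.

```powershell
function Test-KerberosAuth {
    # 'NTLM' for a remote connection usually means the SPNs are broken or missing.
    param([string] $AuthScheme)
    $AuthScheme -eq 'KERBEROS'
}

function Test-HiveSize {
    # Fail once the registry hive file grows past the limit (1GB by default).
    param([long] $SizeBytes, [long] $LimitBytes = 1GB)
    $SizeBytes -lt $LimitBytes
}

# Live usage (illustrative server name and paths):
# Invoke-Sqlcmd -ServerInstance 'MyServer' `
#     -Query 'SELECT auth_scheme FROM sys.dm_exec_connections WHERE session_id = @@SPID'
# Test-HiveSize (Get-Item '\\MyServer\C$\Windows\System32\config\SOFTWARE').Length
```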
If you have failover clusters, the failover cluster functions let you iterate through the resources on your cluster and tell you whether any have failed. Because they’re buried deep in the UI and can have complex AND/OR conditions, you often find clusters have failed resources that no one knows about until it’s a bit too late, and when you have a problem in your cluster, you want to know that it was all green this morning, not that something is broken and you don’t know whether it’s having an impact or not. Pending reboots are also important. There’s a pending-reboot test function that someone wrote on the internet somewhere which will pick this condition up, and there are five or six different things built in that it will test. The reason this is important is that if your .NET Framework got patched and you haven’t restarted your server, it will almost always break CLR within SQL Server. It can also break things like DQS that rely on CLR. Check whether your servers are set to a balanced power mode instead of high performance, and check whether they can go to sleep or not. You would be surprised how many production servers are set to sleep after a couple of hours. It never happens because the network is busy, until one day something happens, the server goes to sleep, and no one works out why, except the application is dead.
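The failed-resource sweep can be sketched like this. The filter only needs objects with Name and State properties, so it works against Get-ClusterResource output or against test doubles; the cluster name in the usage note is an assumption.

```powershell
function Get-FailedResource {
    # Return any cluster resources whose State is 'Failed'.
    param($Resources)
    $Resources | Where-Object { $_.State -eq 'Failed' }
}

# Live usage (illustrative; requires the FailoverClusters module):
# Get-FailedResource (Get-ClusterResource -Cluster 'MyCluster')
```

Running this every morning gives you the "it was all green" baseline the talk describes, instead of discovering stale failures mid-incident.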
Time is another big problem. I’m talking about your time zone and your UTC settings, because if these drift by more than five minutes, then you’ll fail to connect to servers, or some applications will fail to connect to SQL, and that can be difficult to find. So I write tests that ask each server what time it has and work out the time zone settings, which are super complicated, to determine that we’re within five minutes of sync with each other.

What you end up with once you start doing this for a while (this was after about two months) is, I think, about 50 tests here. These are the kinds of tests that I do; I have a test for everything, and they often fall into some categories: my underlying infrastructure might have an issue, SQL Server might have an issue, or there might be a standard that I’m not meeting. You often find that some of the tests will fail everywhere, like when you have a standard in mind but you haven’t implemented it yet, and having a test there reminds you: hey, all of my servers don’t meet this standard and I’m going to have to go fix it one day. On the other hand, you’ll get somewhere where maybe 399 of your servers are green but one fails the test, and that’s the stuff you’re going to focus on day to day when you come into the office: hey, why is this server failing this test? And my list of tests has gotten much longer since taking this screenshot.
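The drift check reduces to comparing each server's clock, converted to UTC, against a reference and failing past a five-minute tolerance. The CIM query in the usage note is one common way to read a remote clock; the server name is illustrative.

```powershell
function Test-ClockDrift {
    # Pass when the remote clock is within the tolerance of the reference clock.
    param(
        [datetime] $RemoteUtc,
        [datetime] $ReferenceUtc,
        [int] $ToleranceMinutes = 5
    )
    [math]::Abs(($RemoteUtc - $ReferenceUtc).TotalMinutes) -le $ToleranceMinutes
}

# Live usage (illustrative):
# $os = Get-CimInstance Win32_OperatingSystem -ComputerName 'MyServer'
# Test-ClockDrift -RemoteUtc $os.LocalDateTime.ToUniversalTime() `
#                 -ReferenceUtc (Get-Date).ToUniversalTime()
```

Comparing in UTC sidesteps the time zone arithmetic the talk calls "super complicated"; only the conversion to UTC has to be right on each side.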
The downside of all this is that it takes a little bit of effort to work out how the functions work and get PowerShell running, and you get Jenkins, you put in all the effort, and you find out how often you were wrong. It’s not just that you’re wrong about your SQL Server stuff; everyone else you have to deal with is wrong too, because there will be Windows settings that are wrong and VM settings that are wrong, and when you can test all of these, you then have to explain to people: yes, I know you think your VMware power settings are correct, but they’re not. That makes for some pretty tough conversations. Not everyone has the ability to spend time doing this kind of programming, but I think if you’re a SQL Server DBA, and especially if you work in a large environment, it’s the kind of stuff you have to do to keep your job these days.
And that’s it. I have a blog, and I stole the domain name OperationalValidation.com, which just redirects to my blog, and I blog lots of stuff. I’ve done about 150 articles over the past couple of years, and all of them are really technical. I mean, they’re really technical issues that I come across, not beginner 101 stuff. So if you’ve never seen it, then maybe take a look. And that’s it; I am pretty much ready for questions.
Brent Ozar: Good. So really nice job, lots of compliments between Twitter and Slack here. Shane asks, “Your PowerShell IDE looks like it’s on steroids. What are you using?”
Cody Konior: Yes, I am using PowerShell steroids. It’s literally a plugin called ISESteroids, and I think it’s got a 30-day trial, but anyway, I bought it because I just loved having things in colors. It has other cool functions as well, but I like the colors.
Daniel Hutmacher: I would agree with Eugene, who said, “Expected: blah blah blah, stability, blah blah, unit testing. Received: 80s horror movie, don’t go through that door, no,” which kind of sums up your session. I think it was a great, interesting session, except you have a shitty job. But you appear happy anyway.
Brent Ozar: How often do you run these tests? Like how often do they – do you fire them off every day? Every hour?
Cody Konior: Yes, once a day. So I have, I think it’s 73 tests now, across 500 servers, so not every test runs against every server, and it all finishes within an hour.
Brent Ozar: Okay, so is there anything that stops you from doing it more often? Are there reasons you would do it more often or less often?
Cody Konior: Well, you know, professional monitoring tools care a lot about their impact on applications, and the impact of this stuff could be a bit hard to determine, and I just don’t need to run it constantly, because it’s not a monitoring tool. It’s more of a workflow tool, a work organisation tool, an incident-avoidance tool, a bathroom-break tool. So I run it once every morning, and that generates the list of things I need to be aware of and focus on that day, and once I fix those problems, if I do get a chance to fix any of them that day, I can just click the button next to the test to rerun it and verify that everything’s green.
Brent Ozar: Slick. Are you going to open source your entire suite of tests, or is it too private?
Cody Konior: No, I couldn’t. I own some parts, like the Jojoba framework, and that’s completely open source. The tests I write, I write on work time, so I can’t really give them away. Some of them I remember how I did, so I can share a rough guideline of how I would do it, and I often talk about how I would do it, but I can’t actually show the code.
Brent Ozar: Cool, very cool. Em McDonald said, “What was the name of your site again? OperationalValidation.com?”
Cody Konior: Yes. I’ll show the slide again. I just stole it. Someone else coined the phrase at the end of last year, and they have their own operational validation framework based on PowerShell’s Pester, but I needed to develop my own because I really wanted the parallelization, and I really wanted that standard output into the XML that Jenkins could understand.
Brent Ozar: Perfect. Alright, well, thanks a lot for presenting this morning, very awesome. I think this one’s going to be huge on the recordings as well, so thanks a lot, sir, and have a great morning.
Cody Konior: Thanks, you too.