SAN Primer for the DBA


Audio podcast:

Enjoy the Podcast?

Don’t miss an episode, subscribe via iTunes, Stitcher or RSS
Leave us a review in iTunes

Target Audience:

DBAs, sysadmins, managers, or anyone who faces challenges with their database infrastructure on SAN.

Abstract:

When storage doesn’t perform and scale it causes issues for SQL Server and can make DBA life extremely challenging.  In many cases SQL Server storage is provisioned on a SAN.  However, very few DBAs even get to see a SAN or understand how it works.  Even though the SAN is someone else’s responsibility, as a DBA you feel the pain and you’re typically a first responder when there is a problem.  Understanding the “magic” behind it will make you a better DBA and help you have productive conversations with your storage administrator.  After all, pointing fingers has a negative impact on your superhero status!

In this session you will learn what a SAN is , how it works, and why it’s typically used vs. direct attached storage.  These ‘black boxes” have a lot of capabilities, so you will also learn about some of the common features today that they offer and how they can help you.  You will also learn about what can go wrong, some basic troubleshooting when you’re not getting what you expected, and how to communicate with your SAN administrator if you’re not getting what you need.

Why I Want to Present This Session:

Spending almost 20 years as a DBA, I went through several performance challenges and many conversations with storage and infrastructure people.  Countless times I got the blank stare when I told them my storage performance wasn’t as I expected and the conversations were less than productive.  Over the past 5 years I’ve gotten an in-depth look at this technology as well as the challenges that both DBAs and SAN administrators face. I would like to pass that information on to you.

Additional Resources:

Session Transcript:

Brent Ozar: So today, this morning, the next session here at GroupBy is going to be Doug Bernhardt talking about the SAN Primer for the DBA. So take it away Doug.

Doug Bernhardt: Alright, thank you Brent. So let’s get this party started shall we? Welcome to my talk on SAN Primer for the DBA. My name is Doug Bernhardt, I’m a storage engineer and product technologist at Dell and I’m focused on SQL Server storage integration. So if you have any questions after the presentation today, feel free to reach out to me at my email address, doug.bernhardt@dell.com.

So a little bit more about me, I have been working with SQL Server for 25 years in various capacities. Started out doing development, went in to do some DBA work, done some architecture over the years, done some consulting for a few years, so dabbled in various facets. And I actually started working with SQL Server back when it was running on OS2, so for those of you who are familiar with that time frame, that’s kind of a point of reference.

But for the past five years, I’ve been focused on SQL Server storage and integration with arrays and that type of thing. And full disclosure, I do work for Dell, but this work is largely based on my experience as a DBA as well as working with storage and storage customers and the problems that they face. So while it’s a vendor agnostic talk, the examples will be based on Dell EMC hardware because that’s what I have hands on experience with. So the talk is not sponsored by Dell, so the opinions in the talk are solely by own. Really what this means is I don’t have to comply with corporate media standards and I got to pick my own slide theme and graphics, which I’m really excited about.

Brent Ozar: Which if you’ve never worked for a big company, is a really big deal.

Doug Bernhardt: Yes, our graphics library is like three things that we can show, so legal doesn’t let us show anything else. So – and just so you know, everyone in attendance today will get their very own SAN, so it doesn’t get any better than that, right?

So I ran across someone who was blogging about effective meetings and one of the key points that I loved was this concept of wait, why am I talking? So one of the reasons that I’m talking today is that storage is key to SQL Server. Anyone who has been working with SQL for any period of time, this is painfully obvious. So for those of you who are brand new, you will one day discover that if you don’t have performance reliable storage, you’re going to have issues. So to complicate matters, you typically have to rely on the SAN admin person for your storage needs, who speaks a completely different language.

The next reason is that SAN is a black box to most people, so before I worked at Dell, other than for a tour, I was never in the data center. I didn’t have access, I had no idea what the gear was that my SQL Servers were running on, what that even looked like. I couldn’t find my servers in the data center, in a rack anywhere if my life depended on it. So first of all, it was cool to understand what all this stuff was and get some hands on experience with it, but more importantly, when I started working with it, I discovered all these things that would have a huge impact of my life as a DBA that I never knew existed.

Now, for DBAs typically, developers are public enemy number one, and SAN admins are number two. However, whether you like it or not, you depend on these people for success. So having a basic understanding of their world and how to communicate effectively with them is key to your success. So my goal in this presentation is that you learn enough to have an intelligent conversation with your SAN admin and he doesn’t look at you like this dog in the picture here.

So as DBAs, we like to think that we are the alpha and the omega, the beginning and the end, the center of the universe, which everything revolves around. But in reality, there are likely many other servers in your organization and in the data center besides yours. So there’s lots of other storage needs and it’s all managed by a team who has a charter to do more with less, much like you probably do.

So now I kind of get into the sales pitch, which likely lead your organization to purchase one of these things in the first place. So by centralizing, they’re able to create a number of efficiencies. The first one of these is administration. It’s a lot easier to have all the storage centralized in the data center so admins don’t have to run around looking for failed drives, updating firmware, multiple servers, upgrading capacity on individual servers, worrying about different drive types when they have to replace or expand your storage, if you have the available drives based in your server and that type of thing. So also the pooling and virtualization provides flexibility when it comes to capacity and performance. So if you think about how your cellular voice and data service work, you don’t have dedicated infrastructure or dedicated time on a network. You’re sharing a pool of resources and the majority of the time it works out quite nicely, and we’ll talk about what happens when it doesn’t later on.

In terms of data protection and recovery, if everything is centralized, then they don’t have to worry about protecting every server. The storage is centralized, you can protect the SAN, where all the data is located and it makes it much simpler. And then from a performance standpoint, if I have a handful of SSDs, I don’t have to put SSDs in every server, I don’t have to choose which ones go where, I can put a handful of SSD in SAN and boost performance to dozens of servers instead of just a couple. So – and by the way, I’m not endorsing this book here, this is just a bad attempt at humor.

Okay, so now before we get into this too deep, let’s talk about the top five things which cause a SAN administrator to flip the bozo bit on you. So when someone hits the bozo bit on you, it basically means you’ve been dismissed for something stupid that you said. Now, the funny thing is about this list is that you can find guidance regarding SQL Server storage that will advocate for these things. The problem is that most of it is based on older technology, and before SSDs were a twinkle in your father’s eye. So the top five things that will cause the admin to flip the bozo bit on you are ask for direct attached storage, tell them you want your own SAN, tell them you need dedicated disk, demand raid ten, and or explain that your disk queue lengths are too high.

So the problem with this list is you’re basically telling them how to do their job, and nobody really likes that. So unless you want them telling you what indexes to create, let them focus on – let’s focus on what you should do.

Okay, so you said all these things and – that I just told you not – I just told you all these things of what not to ask for, right? But you have this consultant telling you that you need these things. Now, just to clarify, I’m not talking about quality consultants, like Brent and his colleagues, because surely they would never say any of these things.

Brent Ozar: Never, absolutely never.

Doug Bernhardt: Yes, hopefully, you’re not editing materials at this point.

Brent Ozar: Note to self, bozo bit.

Doug Bernhardt: Okay, so this consultant says if I can’t get 500 SSD drives direct attached and raid ten, my disk queue lengths will be greater than zero and this thing will never work. So in addition, the storage admin says that I have to use this SAN thing, so before we get all worked up, let’s just keep calm and find out what this SAN thing is, because after all, education is the key to understanding. And by the way, I have no idea who these people are, I just Googled consultants and these guys showed up, so there you go.

Brent Ozar: So then I’m a little disappointed that you Googled consultants – no go ahead.

Doug Bernhardt: So what is a SAN at a high level? A SAN allows storage pools to be created from multiple physical storage devices or simply disks and your host of servers can act over a network, typically fiber channel iSCSI protocol. And then the SAN admin creates your volume from a storage pool that he’s created – he or she has created, I should say, and maps it to a host on a server. And so then a volume shows up in your operating system ready to be formatted and assigned a dry letter or mal point or whatever you want to do with it.

So if we take a look at it from a hardware perspective, traditionally it’s one or more hardware nodes operating in a cluster, and the hardware is similar to server hardware but it’s purpose built for storage. So when I say purpose built, it may include things like cache cards, a communication network between the nodes, specifically selected CPU depending on the core combination and speed that works best for their software and their features. Maybe they have additional ports and slots to provide redundancy and hot slot capabilities. So now let’s talk about a couple of SAN guy terms here. So the first one is what they refer to as the front end of the unit.

So each unit or node has front end network ports and these network ports are typically attached to a switch. There are various protocols and speeds that are supported depending on the vendor, and then the hosts or your servers attach to that same network through HPA or network cards and they attach to the same switch or networks structure. These networks can be physically separated so a lot of organizations will have their own storage unit. They can be physically the same but logically separated into V-LANs and such, or they can by physically and logically the same.

Now, on the other side of the unit is what they refer to as the back end, and when I say side, it’s kind of a virtual term, it’s not referring to the physical side of the box per say. But what they call the back end is the side of the SAN where it connects to the back end disks or disk enclosures. So it used to be that these disks were in separate enclosures, but now in the newer units, they can be in the unit – the disks can be in the SAN unit themselves, and these are typically connected via SAS connections.

So hardware is built to be highly redundant, as you’ll see here in a minute, so there’s a lot of duo cabling, duo ports, multiple – basically, everything that they can.

So this is a diagram of what front end components look like and how they’re connected. Now, don’t panic, we’re not going to have a test on how to cable one of these, but the takeaway here is that there are multiple redundant connections here for both performance and availability. So walking through the diagram, we have our server at the top that has in this case, fiber channel HPAs in it and it has redundant connections to a fiber channel switch, which is then connected to a storage controller at the bottom. So this diagram was actually taken from an 80 terabyte data warehouse fast track reference architecture that we published and if you want to look at that paper you can go out to dell.com/storageresources and find it there.

So the back end diagram, we have in this case – this is SAS connected on the back end, so this is the same reference architecture and you can see that we’ve got a couple things going on here, so we have our SAS cables but we also have a network connection there in the blue line, that’s used for the heartbeat between the controllers, so this is much like if you had a Windows cluster and you had a heartbeat running on a private network, and then we also have a couple of interesting purple lines here. So in this case, there’s cache being nearer between these two controllers over their own separate network connections.

Okay, so in addition to the virtualized disk we just saw, now the virtualized storage is going virtual as well. So things are really starting to get crazy now. Some of these SANs can run a software only, so there is no purpose built hardware. One example of that is the Dell EMC UnityVSA, so this is a virtual appliance that runs as a VM ware OVA. You present virtual disks to the VM, so you create virtual disks in VM ware, you present them to your VM, those of you who are familiar with VM ware, you’ll know exactly what I’m talking about here. These, up here, there’s drives in UnityVSA so every virtual disk that you create in VM ware appears as a drive and then you can create your disk pool and set up all your SAN features from there.

It’s fairly full featured, and it’s actually available as a free community download. So if you just search for Dell EMC UnityVSA, you’ll find a community edition that you can download and run for free. Another example of this is using Windows server 2012 and storage spaces feature. So this allows you to basically roll your own SAN, you can present both file and block storage to servers in your network and has some of the traditional SAN features built in. So there you go, there’s your free SAN as promised, just some assembly required.

Brent Ozar: Are there any requirements around like how beefy of a VM it needs or any particular versions of VM ware that are supported?

Doug Bernhardt: I believe most versions of VM are supported. In terms of how beefy the VM is, actually the – so the memory is fixed in that OVA so you can’t change the amount of memory that it uses, and the community edition, I believe it’s pretty restricted because I think the maximum storage you can provision is like 2 terabytes or something like that. So it’s not – I forget exactly what the exact amount is, but it’s not a tremendous amount.

Brent Ozar: Got you.

Doug Bernhardt: Okay, so here’s just a list of some common SAN features that you’ll find among the various vendors and some of them have all of these, some of them just have a subset of this. But if you start to look down the list, we’ve got – let’s see, we’ve got like compression here, we’ve got some encryption, replication, data protection features, quality of service, some caching, snapshots, wait a minute, some of these look a little bit familiar, right? So I think I’ve seen some of these somewhere before.

So what is this feature list mean to me as a DBA? Well, let’s ask ourselves for a minute, what happens if both the storage feature and the SQL Server feature are turned on? What happens if you both think you’re in charge of data protection and then for the features in this list that you haven’t seen before, how can we take advantage of some of these? So generally, SAN features are enabled and disabled at the volume level and you can meet with your storage admin to discuss these features.

So this is the point in the presentation where you need to stop surfing the web and start paying attention because we’re going to hold a DBA storage admin meeting. So this is where the two great titans collide, right? Mano a mano. So you both know what you can each make life difficult for each other, so this is really your opportunity to create a good working relationship and the purpose for this meeting is we’re going to discuss common areas and features that coordinate the best solution for the business by optimizing storage and database operations. Now how is that for some management speak? That’s pretty good. You can feel free to rip that off and use that as your – for you real meeting.

So the agenda for our meeting is we’re going to review the availability and response – recovery responsibilities and then we’re going to discuss some product features in terms of what features can you leverage that you don’t have today, where we have similar features or the same feature what makes sense, and the affordability of these features. So when I talk about affordability, what I’m referring to here are things like the licensing cost. So some of these features, the licensing may be included with SQL Server or it may not be included with the version that you’re running. Same goes with the SANs. So depending on the vendor, depending on what they purchase, some of these features, they may already have licensed or included in their package and some of them they may have to pay additional for. The other thing in terms of affordability is it’s good to discuss who can afford to run these features. So there’s a CPU, an I/O cost to turning these things on. There’s no free lunch, and so somebody’s going to have to pay to run these features. So depending on who has the most headroom in their environment and what the benefit is, that’s something that you’re going to want to discuss.

So some discussion points that you want to have around data protection, first of all, make sure you’re coordinating your efforts. So have a discussion about who’s doing the backups, are you both doing the backups? Is your SAN administrator just backing everything up because they think that’s their job and what’s the recovery strategy? So just because they’re backing them up doesn’t mean that they necessarily know how to recover the system or how to recover databases. You might have your own recovery strategy, be doing your own backups, and so what they’re doing may not be beneficial, or it may not even be needed. So that might be a process or responsibility that if you’re doing that and that’s something that’s on your plate, if you let them know, that can free up some of their resources and take some responsibility off of their plate.

So the other thing to talk about is some efficiency thing, so can you leverage storage-based snapshots to speed up your recovery or use it to mount up databases in different environments? If your storage admin is taking storage based snapshots, do they understand that both data and log files need to be protected and at the same time? They may or may not know when they provision storage and they present it to you, what you’re using this storage for. Unless they explicitly tell you, usually they cannot see the files on the volumes. They just see volumes that are being used and they see I/O activity, so they don’t really know what you have on there or necessarily what you’re using it for. Also do they know that they don’t need to protect tempdb?

So while it needs to be available, it doesn’t need to be protected in terms of being able to recover from a backup. So they might be able to save some resources if they’re doing replication or they’re backing up things on their side. So basically, see if you can create some credibility by being able to help them out and save them some resources. So as my kids would say, this is doing someone a solid.

Okay, data replication or availability groups may or may not be part of your data protection strategy but it’s good to talk about from a data replication standpoint if you are doing it, who’s doing it and once again, who’s responsible for this? And which method works best? So the implementation is very different depending on if data is being replicated in SQL Server versus it being replicated via SAN. Now, depending on what you’re doing, what your application, how your application is designed, there may be something that you can leverage here or vice versa. So if storage replication is being used, some things that you’ll want to keep in mind – typically a replication here is at the volume level. So as we mentioned earlier, things like keeping data and log files together, it’s probably not going to be real useful if they’re just replicating one of those volumes and not the other if you have them separated. Also, let them understand that you don’t need to replicate tempdb and potentially other system databases. So depending on what your recovery strategy is for recovering your servers, they may not need to do that or they may not need to do it at as high of a frequency as they do. And then another important thing here that a lot of people overlook is if you’re using storage replication, make sure your database file layout is optimized for replication.

So what I mean by this is that if you have databases that you want replicated and you also have databases that you don’t want replicated, you’re going to want to make sure that those are put on separate volumes because if you have databases that you don’t necessarily care about or you don’t want them replicated, and those are residing on SAN volumes that are involved in replication, you’re basically creating a bunch of noise or unnecessary replication traffic because it’s replicating everything that’s on that volume, regardless if you intent it to be replicated or not.

So it’s also good opportunity here to take advantage of some functionality that you don’t have in SQL Server. So one of the top ones that was my favorite I guess, when I first found out about it were the storage-based snapshots. So what they are is they are fast, efficient copies of storage and they’re pointer based. So at a high level, it’s similar to how SQL Server database snapshots work and that you use these and it uses – we’ll call it copy on write for a lack of a better term. It’s not exactly done the same way but it’s similar, so that when you’re accessing these volumes after you’ve taken a snapshot on them, then it’s pointer based operation and there’s no copy of the data involved.

So it’s basically a single copy of the data, you can mount them up really quickly, use them for various purposes, you can use them for recovery, a lot of vendors also allow you to integrate log backups that you may be doing in your environment today, to be able actually do point in time recovery off these, but they’re also really great for creating these dev and test copies of the databases, copies for research, that type of thing. So instead of having to go beg for SAN administrator for yet more space so you can back up a copy of these databases and then go ask him again for more space when you need to restore them somewhere else, if you can simply take advantage of a storage based snapshot, then you can do all of this and not use any additional space until you actually start writing to that database.

Almost all the vendors also support some form of VSS integration, so when you take the snapshot, you can get a truly database consistent copy. Another thing that’s kind of cool, and we’ll take a look at this a little bit in a minute, is QoS, or quality of service. So you can set various limits in terms of bandwidth and IOPS and some arrays also support bursting as well, for dev test and mixed workload environments. So what this allows you to do is from your standpoint, you can discover I/O intensive code before it hits production. So this is a good opportunity to – if you work with your SAN administrator to get this enabled and get this turned on, you can help point out some performance issues before they actually hit production.

And then finally, one of the things that was completely new to me is how this networking and how the volume mapping actually works. So I had a previous employer, when we did releases, we would also do hardware refresh of our servers. So when our application new version went into production, we would actually put it on a totally different physical server. So what we did and we thought we were being smart about it, was we would set up and we would log ship these databases over and we would get them all in sync ahead of time because they were several terabytes and this process would take a while, so we get it all synced up and then we take a maintenance window and we would do the final cut over.

So once I got to Dell and I started working with SAN technology and I started understanding this, I found out about this concept and it’s basic to how SANs work at volume mapping. So you can take the volumes that are presented to your server and through a simple mapping operation, you can just detach those similar to how you detach a database to SQL Server and you can attach them to another server. So when I found out about this, this would have saved us a tremendous amount of time and space and effort trying to get all these databases moved over when we need to do a hardware refresh or get them moved over to another server. Alright, so let’s…

Brent Ozar: Kyle asks – Kyle says,  “I’ve been concerned with doing snapshots or removing drives between one server to another because I’m worried about like a hard detach or it seems like it’s going to cause a blue – the storage that I’m going to attach is going to look like a blue screen of death attachment. It seems like backups using SQL are more reliable. Is that still true or snapshots can be as reliable as SQL Server plain backups?”

Doug Bernhardt: So they can be as reliable, you just want to take care in how you’re taking those snapshots. So with SQL Server, preferably, you’re going to want to use a product that supports the VSS integration. That will give you your best chance I guess – or not best chance, but that will give you a reliable recovery. Now, you can also use just snapshots from the SAN provider and here again, if your SAN admin knows exactly what you’re doing and what – if you update a log file separated for example, if they know what volumes those are on and those need to be treated as a set, they have a way to create a grouping of those. Here at Dell EMC, it’s called a consistency group is what it’s called, but you group these volumes and then they’re treated as set. So when you go to snapshot them, they’ll both snapshot at the exact same point in time. And then when you do that, typically when you attach these snapshots, then SQL Server is smart enough to figure out the crash recovery – okay, I can roll forward, roll back when I need to and recover the database that way.

So if you have – I don’t know with the question if it was – they were literally giving some type of server crash when they were doing this or…

Brent Ozar: No, he was worried that they weren’t as reliable as native backups and he says thank you. That answers his question.

Doug Bernhardt: Okay, perfect. Okay, so now let’s take a look at how we can use this quality of service feature to find out who’s writing some bad queries.

Brent Ozar: Generally me.

Doug Bernhardt: Yes, before they get thrown over the wall. This is a novel concept, right? Catch things before they hit production, so once again, crazy talk here, right? Alright, so I have a simple script here that first of all what I do is I clear the buffer cache and then I’m going to capture the number reads I have from sys.dm_io_virtualfilestats. Obviously, this is only going to work if you’re the only one on the database and your system is very quiet. And then turn the statistics time on so I can get some time stats on this and I have a couple of queries here.

So first I’m going to do just a simple seek and this database is one of the TPCE benchmark databases and then I’m going to record the number reads from that query and then I’m going to run a scan query and then I’m going to record the number reads. Okay, so when I run this query, you can see that the first query returned in 31 milliseconds, it did three reads, and then the second query returned in 1700 milliseconds, it did 1726 reads. Alright, so I’m looking at this and I’m thinking to myself, you know, that’s a little bit excessive and partially the reason why I think that’s excessive is because I’ve seen this query and I know that it’s doing a scan. So I want to make mister developer a little bit more aware of the impact that he’s having by doing all these reads.

So I talk to my SAN admin man and I say, hey, can you help me out and can we set up – once again, help me help you. So if we turn on QoS, not only will it save some resources on your SAN, but it will help me figure out who’s writing these I/O intensive queries.

So here I have – this is a picture of the SAN that I’m using, and I’ve got a list of my volumes here that are presented to my server, there’s DB SQL 01, SQL data is the volume where my data files reside. So if I edit the settings on here, there’s a bunch of stuff in here but the one that we’re going to focus on for right now is there’s a quality of service profile that I can apply to this volume. And you can predefine these to be whatever you want but I’ve created one where instead of giving this guy access – this person rather, I don’t want to imply that guys are the only ones that write bad queries – so I’m going to put the screws to this person and I’m going to give them the bandwidth of one spinning disk drive. A whole, whopping 120 IOPS. Now, you may not want to be this cruel but – okay, so I set that on at volume so now just for reference here, so we’ve got 1726 reads on the second one, the first one is doing three reads. First one took 31 milliseconds and the second one took 1700 milliseconds.

Now, let’s run this thing again, and we’ll wait here for a few seconds and a few more seconds. Okay, we’re finally done. So if we got up to the top here, we’ll see that this one did the same reads, the execution time is almost completely the same. So this is efficient query, they’re not pounding the storage, so therefore query time is just fine. But if we go down and take a look at this one, now we’re up to 13 seconds, same number of reads as before. So now we’re getting to a point where somebody was running this, they might stop and think, hey this query I wrote is taking a while, what can I do to optimize this? Okay, so let’s move on and talk about compression and encryption.

Brent Ozar: Also by the way too, you’ve got a bunch of you are asking questions that are unrelated, which is phenomenal. All I’m doing is I’m typing them up into a whole list that I’ll hit Doug with at the end. Everybody wants to ask things like should I put my data and logs on separate volumes, how do I tell my SAN admin which volumes are relates, all kinds of stuff, so just go ahead. You definitely you keep asking those questions, you can ask them in Q&A or in Slack, but I’m just piling those up for Doug at the end.

Doug Bernhardt: Yes, no, please continue to ask those questions. We should have plenty of time at the end to cover some questions and we can talk about – we’ll cover those questions. Also if there’s anything that anyone would like to see in the SAN tools or whatever that they’re curious about, I can give a tour of that too if anyone’s interested.

Brent Ozar: Perfect.

Doug Bernhardt: Okay, so getting back to compression and encryption. Now, there’s a CPU cost to encryption or compression, so where does it make sense to turn this on? In addition, the compression is implemented differently, so on SQL Server we know we have a couple of options in terms of row in page, we have backup compression. But then also on the SAN level, depending on the vendor and how it’s implemented, some of them do in-line compression, some of them do post-process compression, and the algorithms are different and the efficiency ratios are different. So it’s good to talk about this with your SAN admin and see what makes the most sense.

The other thing to point out here is that if you can afford it, you can actually reduce your IO workload by turning on compression in SQL Server. So because all the pages are compressed and decompressed in memory, the data that’s read and written from disk is in a compressed format. So therefore, it reduced the actual I/O that the SAN sees. So it’s important to understand here again, who’s responsible for this. You can turn it on in both places in some cases if you’re really, really concerned about the size of your databases. You might see additional benefit by turning it on, on the storage, as well as turning it on in SQL Server, in some cases not, so it’s really going to depend on what your data looks like as well as how the vendor has implemented it on their side.

So storage is rapidly evolving in recent years, mainly due to the flash technology in various forms. SANs are going through a major revolution actually right now, and therefore a lot of the assumptions that were previously made based on number spindles and that type of things are no longer true. Since a lot of devices now have flash technology built into them, and so they’re no longer bound by the mechanical limits of spinning drives. Storage admins, as we talked about before, so they can configure these drives into pools, and these pools can potentially contain multiple drive types. So you could have both flash storage and spinning drive storage in the same pool. The grade levels could also be mixed in a pool, they have the ability to also assign data to a specific tier. So when they create – when they take SSD drives, they take spinning drives, they can break them out into separate tiers, they can create one tier out of them and how this tiering works is really the special sauce that the vendors provide.

So they all do it differently because it’s all proprietary code that they’ve developed to do this. A lot of it is based on usage stats, but there are also other things that they can tweak to move this stuff around and that type of thing. Now, the point I want to make here is that because all this is in play, so you want to pay attention to your database maintenance and you don’t necessarily want to be sloppy with your maintenance.

So what I mean by doing sloppy database maintenance is just simply running a re-index wizard that sets up a re-indexing job and blindly re-indexes all your indexes in your database every night, regardless of whether they’re ever used, accessed or fragmented because when you do this, you rebuild all your indexes in your database, you’re rewriting all the data in the database. Hopefully, if you have – if you’re following best practices and you’re using clustered index like you should and that type of thing, so if you’re not following best practices and you don’t have any clustered indexes, that’s a completely different talk.

But another possibility here is that the performance boost that you get from your re-indexing, it may be due to efficiencies within the database or it could also be due to the fact that you’ve now moved that data to a different tier because you’ve rewritten data that otherwise doesn’t get written to. So this performance boost you thought you were getting because you freshened up your indexes is quite possibly could be due to the fact that because you’ve now written that data, that signaled the SAN that hey, this data is moving from a cold state to a hot state and so I need to move it up to some faster storage. Okay, so how do you – hold on…

Brent Ozar: How ironic is that? How do you communicate?

Doug Bernhardt: Yes, that’s kind of ironic isn’t it? Alright, let’s try this again. How to communicate performance requirements.Storage admins’ world is really centered around a few key metrics.

Now, if you really want to talk like SAN guy, they call this speeds and feeds, but really what it comes down to is IOPS, so the number of trips your data takes to the storage. The storage is limited in terms of the number of IOPS it can do but that limit is based on the block size. So your block size is how large each I/O is and as you do more work, then the max number of IOPS is going to decrease because the payload side is larger, and you’ll see drives and that type of thing published and with performance stats and some crazy numbers. The problem with some of those numbers is a lot of them are published with say, a 4K block size, which is smaller than what you’ll ever be able to utilize in SQL Server.

So the next one they talk about is bandwidth. So bandwidth is essentially how much data can you fit in the pipe, and this is typically measured in megabytes per second and it’s a simple calculation of your IOPS times your block size, so that gives you your bandwidth and then finally, one of the ones that’s critical for SQL Server is latency, so basically how long the trip takes. So discuss what volumes need to be high performance versus low performance. So do your backups really need to have high-performance storage? Do they really need to live on this SSD storage or can you get away with a lower performing storage?

Now, the other – one interesting thing to note here is that the spinning drives, a lot of them are actually still pretty good at doing sequential light workloads, so if you are doing things like backups, you can use a pool of spinning drives for those and a lot of times that works just fine, whereas the newer technology, SSD technology is actually tuned, or geared, more towards random I/O activity because that’s typically been the weak spot or the restriction with spinning drives. So also take a look and see, does your application hammer tempdb or is it largely unused? Tempdb can kind of be a wild card in a lot of situations, for some applications it’s critical and for some applications, it’s hardly ever used, so based on that, you can make the determination of where it should live. And then once again, don’t try to dictate what grade level or drive type that you actually need. So they probably don’t tell you what indexes that you need to create your database, or maybe they do, I don’t know.

Okay, so I would suggest that you speak their language with perfmon. The reason why I say perfmon is because it’s common so it’s on every Windows server. It’s a credible tool to use, it’s one of the main tools that’s suggested by Microsoft. If you work with them or you get into troubleshooting scenarios with them or read their papers, the main performance tool that they’re going to talk about and they’re going to reference is perfmon.

So everybody that knows anything about Windows is familiar with perfmon, and then it’s easy to use. It’s intuitive, it has a lot of options, logging support, et cetera, etcetera. Most of you are probably very familiar with perfmon, but even if you have your own monitoring tools, your own DMVs, scripts you’ve written et cetera, I would encourage you to verify your results with perfmon and then communicate in those terms, because that kind of takes – sometimes if you talk about these other tools then the data can come into question because they’ve never heard of it or they don’t understand what you’re looking at and those types of things.

So when you capture your data, also make sure you’re collecting stats from both your peak and average times. They’re going to want to know what times of day these things are, when you’re recording these metrics, because you are using a shared environment and so timing is somewhat important. So the perfmon counters that you’re going to want to capture and these are the key ones that pretty much directly relate to the data that they’re watching and the data that they’re monitoring, but there are certainly other ones that are of interest. So from – I have the ones listed here for both block and file if anybody actually uses that. I don’t know if anybody joining SQL on SMV shared, but I’m guessing it’s a very small minority if you are, but I’ve included them there for you free of charge.

So on the block side, for IOPS, this is just simply the number of disk reads and writes per second under the physical disk counter. For block size, it’s your average disk bytes bandwidth as your read and write bytes and then latency is your average disk seconds per read and write. Okay, so you have a new BFF, you’re taking long walks on the beach, you guys are hanging out after hours, you had this wonderful meeting, life is good. So your database operations are now running smoothly, you have a bunch of new toys to play with, your SAN admin is happy because you’re not crushing his system anymore and life is pretty good for now. Until the next major event in your organization, or maybe it’s not so good and you still have performance issues.

So how do we deal with performance issues? So I can’t emphasize this enough, and people talk about it all the time, there’s always talks, presentations done on it but database tuning is by far the biggest bang for the buck in terms of performance and scale. I was at a previous employer, when I was hired on there, our main database server was almost on fire. So this thing was running at about 90% average CPU, all day long, and they were convinced that SQL Server was to blame for running this toy database engine and moving to Oracle is the only answer.

Stop me if you’ve heard this one before, okay? So what we did was we started a database tuning effort and there’s various ways to approach this. Our way was just start producing a top ten offenders list in terms of queries every day and prioritize those and start to tackle those and tune those. So within a period of about six months, we were able to whittle this down from an average of 90% CPU to 30% CPU and sustain that for two to three years after that even though the transaction rate was almost double and the database site we were managing was almost double. So paying some careful attention to tuning can really pay some huge dividends.

You can throw more hardware at the issue but that tends to be expensive and it also doesn’t scale nearly as well as tuning your workload in the first place.

Next, once you’ve done that, so if you’re still having issues, then do your homework. So, first of all, verify that you actually have a storage issue. There are various ways to do this, various statistics to look at, DMVs. Once again, there’s talks and information published on these various tools and methods of how to do this, but make sure you verify that it’s a storage issue before you actually go to the storage admin complaining about the SAN. The other SAN check you can do is to just see if you’re out of bandwidth. So if you don’t know how to look and see what types of HPAs or what type of bandwidth you have in your server, you can just simply ask them, they can tell you how many storage ports you’re using and what the speed of those ports is and in perfmon just look and see if the – when you’re having issues, that if the megabytes per second that are being transferred, if you’re capping out on your bandwidth. So, if your pipe is full, then that’s an issue that needs to be addressed, but it’s something that is very simple to check.

And then next, if you’re going to go talk to your SAN admin about your storage performance problem, make sure you can explain why you’re having a problem. So don’t just go to them and say 20 milliseconds late is too high for SQL Server. Some people can get away with that just fine and for other applications, three milliseconds is way too high. So you’re going to have to explain to them why SQL Server is different than other applications.

Now, just like you probably don’t understand the difference between a fiber channel and SAS cable, they don’t understand exactly how SQL Server works, right? That’s your job. Applications like exchange BDI, these types of things have more specific workloads, so they have more predictable I/O patterns that they’re used to dealing with. They’re not used to dealing with the SQL Server thing that can be quiet one day and then the next day be totally lighting up their world. So you’re going to have to walk through your issue, basically walk through it from the standpoint of okay, when we’re updating this customer and when we update this customer it runs this set of queries and these queries do this amount of I/O. So have the statistics to back it up, walk them through this operation and illustrate your math, and like your math teachers said, make sure you show your work.

The good news is that most SANs today are capable of massive performance and scale. So chances are that it could be just a configuration thing on their side, what tier or pool that they have allocated and put your volumes on, or that they can make a call to a vendor and say hey, send me another six SSDs to throw in my SAN and you can get a huge boost on performance that way.

Okay, create a troubleshooting plan. So work with your SAN admins to agree on a toolset and a process. The tool really doesn’t matter, there’s nothing magic about the I/O that SQL produces versus anything else. Just make sure that the I/O pattern that you’re working with matches your workload. Another big thing here that a lot of people don’t realize or they overlook, is make sure your test tool doesn’t write pages of zeros. So some test tools out there, when you use them to create a test file that it’s going to use for the test, it just simply writes pages of zeros to that test file. The problem is that SAN vendors have gotten smart and some of them when they see a page of zeros come in to disk, that page of zeros doesn’t get saved to disk. They make a note that you wrote a page of zeros to disk, and so then when you request that page back from the storage, they just rehydrate it with zeros again and hand it back to you. So it allows them to save bandwidth on the disks and it also allows them to save storage space this way.

But the problem is that when you run a test with your test tool, the test never hits the disks because all it’s doing is hitting the SAN controller and the SAN controller is either throwing it way or just creating a page of zeros and handing it back to you. And then also be aware of the cache in the system and the role that it plays. So if you’re trying to actually test the storage itself, meaning you’re trying to test how the disks are actually performing, you’re going to have to use a large test file that’s larger than the size of the cache in the SAN.

Test the storage components and then try to reproduce the issue, so I will emphasize that this will have a performance impact on the SAN while you’re doing this, so make sure you coordinate with your SAN administrator and don’t just start kicking off all your workload tests on their SAN. Maybe I need to go back and update top six ways to get bozo bit, so note to self, I’ll update that for the next talk.

Brent Ozar: That’s one of the questions is people are like, wait a minute, can you go back to the bozo – don’t do this now, but can you go back to the bozo bit slide because I still believe some of those things and I need you to explain them to me.

Doug Bernhardt: Okay. Yes, that’s – well and you can find – unfortunately, you can find links on the internet to justify those things as well, so not sure what that means, you can find links on the internet to justify pretty much anything I guess.

Okay, so you talk to your SAN administrator and he says what problem, I don’t see anything, so you’ve made your case to your SAN admin and your SAN admin, he sees nothing. So there’s some possible reasons for this. So it could just be a simple matter of averages in sample rate. The tools that they run and the metrics that they collect typically, the sample rate is not very granular. So the sample rate may be only every five minutes or something like that, because they’re no trying – they want to preserve performance on the storage, and they don’t want to put a performance burden on storage by collecting too many performance statistics or collecting them too often.

The other thing is that they may be just simply looking at graphs and looking at the SAN from a high level. So there are several different levels that they can look at it from, most vendors allow you to look at it at entire system view and then drill down into individual servers and individual volumes but if you’re looking at – if they’re not looking at it in the right context, then your averages are not going to match up. So the blips that you’re seeing on your SQL Server are going to get averaged out by the other 250 servers that they have in their environment. Ask them to make sure there’s no throttling or QoS restrictions in place like we saw earlier. I’ve actually worked with customers before where there was a network admin in the company who’s no longer there and unbeknownst to anyone. They set up QoS restrictions on one of the switches, so they were expecting 10 megabits or 100 megabytes of bandwidth and they were getting one. So one of those kinds of is it plugged in type of things to check, but it certainly doesn’t hurt to check.

So the other thing to realize is that your scope is different, so your scope when you’re looking at it from a SQL Server perspective includes many more components in the I/O path. The scope that they see when they look at it from the SAN is from the time that the I/O reaches the SAN. So even when they’re looking at front end metrics, most SANs don’t know if it’s a Windows host or if it’s a Linux host or what kind of host it is. They do not measure it at that level. They’re just measuring the response time from the time it hits the front door of the SAN until it does whatever it needs to do. So it could be that the SAN administrator is not lying to you, like you maybe think he is, but the problem is actually somewhere in between.

So we looked at those diagrams before, those are greatly simplified, there’s a lot of different components in between your SQL Server and the SAN. So it could be in one of your HPAs, in one of your storage network interfaces, it could be in one of the many cables that go between your servers and the switch and the SAN, or there could be an issue inside the switch itself.

Okay, so narrow it down. Remove SQL Server from the mix and use an I/O tool to reproduce the issue. When you do this, you can test a couple different ways. You can test the I/O path itself and simply see if you can fill the pipe and you can get the bandwidth that you’re looking for. So to do this, you would run a big block sequential read test, say 5 to 12K or greater, and use a small test file because here we don’t care about what the disks are doing, we just want to see if we can fill the pipe to the SAN or from the SAN in this matter. And then – so that test file can be fairly small, say one megabyte or something like that, so once you’ve verified that and you’re getting all the bandwidth that you should be getting, now you can do it in the end test and you can test the disks.

So here you’re going to want to use a test file that’s larger than the cache and run the appropriate workload in terms of the block size, the number of IOPS and the read-write ratio that you’re running in your actual SQL Server environment. So the second thing you can do and unless you dig too deep, a lot of people don’t know about this, so there’s a thing you can run called the Storport trace, and this is actually done through the perfmon tool and there’s articles online if you look this us, that tell you how to configure it and you got to watch carefully because it’s going to talk about little endian and big endian numbers and that kind of thing, but don’t be scared, the documentation is good. And so you can set off the Storport trace and what it allows you to do is it allows you to look below a lot of the components in the Windows stack. So it gets you much closer to the actual layer that is communicating with the SAN.

There’s also tools available to monitor your HPAs and take a closer look at what’s going on with your storage drivers. The vendors have those and you can acquire those and take a look and see is my HPA having some type of errors. You may not know what those mean but once you’ve identified that you’re seeing some errors there than you can either work with the vendor or work with your SAN admin to help figure that out. And then finally, your SAN admin can be a huge help here. So they can work with support, and most SANs come with support, or it’s required or it’s very highly encouraged that they have it. They have made a big investment in these things and so the vendors a lot of times are there with support contracts to help them out.

These SAN vendors have remote monitoring capability in a lot of cases, so typically they are getting on a regular basis, they’re getting statistics on these things in terms of performance, any errors that they may have. They also have a command line tools and a command line interface so they have a lot – the SAN operating system actually has its own proprietary language, it has its own command line interface and there’s a lot of information that’s available through there that even your SAN admin probably doesn’t have access to. So he’ll have to request some help to troubleshoot these things, but the good news is that there’s a wealth of information in there that the vendor can pull. So when the vendor takes a look at what’s going on with your SAN, a lot of times they can tell based on what they’re seeing that you may have a bad cable in your environment, you may have a bad interface in one of your switchboards, and switchboard is going up and down and is causing you performance issues, so they can really be helpful here to understand what’s going on in that whole path between your SAN and your server, so make sure you encourage them to leverage that resource.

Okay, so what did we learn here today? Well, some basic knowledge of this thing we call the SAN helps you understand your SAN administrator’s perspective. This relationship is a partnership, so it’s a help me help you type of thing, so if you can build some credibility with them, if you can do them a solid so to speak, by giving some resources back to them then I’m sure they’ll be a lot more willing to work with you and tend to all of your SQL databases to SSD, just like you want. And also speak their language, so the metrics and things that they’re looking at, anything you can do to communicate your needs in the lingo that they speak and the things that they understand will certainly help.

So when issues occur, try to partner with them, don’t point fingers, your boss doesn’t like it when you point fingers, that’s not cool. And then finally, encourage them to leverage their vendor’s support. So once again, they’ve made a large investment, chances are they have it and it’s available to them. So now we’re at the point to talk about questions.

Brent Ozar: There are so many, it’s going to be fantastic. So Kevin says, “Our SAN admin standard is a 32K block size and he refuses to change it. Is there a difference? Like should I go to the mat fighting with him trying to get 64K?”

Doug Bernhardt: So this may be a lingo thing, so what he may be talking about is how that data is actually written on the SAN. I’m assuming in this case that he’s not – that they’re not dictating to you that you need to format your volumes with the 32K allocation unit, so I think you’re talking about two different things. The block size that’s used on the SAN is used – typically vendors have a certain block size that they use and it’s based on how their architecture is built and how they do raid striping and that type of thing. It really doesn’t have a performance impact or performance implications like the allocation unit size does when you’re formatting at volume.

Brent Ozar: Yes, totally different. Now let’s see here. Jamie says, “I just buy my SAN administrator some chocolate to keep her happy.” Yes, that will work. SQL Sorcerer says, “Should I put my data in log files on separate volumes?”

Doug Bernhardt: Well, so this is going to depend on what you’re designing for to some respect. Most – so if you’re trying to get max performance and I don’t know about all the vendors but a lot of the vendors when they have a dual controller SAN, one of the controllers owns the volume and so all the I/O for that specific volume goes to either one controller or the other. So if you want to get the full bandwidth of your SAN, then you want to have at least two volumes and then distribute your I/O as evenly as you can across those two volumes.

Now, this is – so for example, when we run benchmark tests and that type of thing, publish performance numbers, that’s what we do but keep in mind that when we do that we pretty much have the entire SAN at our disposal and that’s the only workload running on it. So if you’re doing it for max performance, that’s how you would want to design it.

Now, if you’re using other features, say like if you’re going to use storage based replication and performance is not as critical, you may want to put those files on the same volume to make sure that they don’t get separated because they’re going to replicate at the volume level. So if you keep them together and you’re getting the performance that you need, then that also will ensure that if they were to replicate those volumes or back it up or whatever they’re doing from the storage side, that that set is going to be kept together.

Brent Ozar: Let’s see here, Brian and Packfan asked – two different folks asked, “So I’m dealing with VMs, should I use VMDK or raw device mapping?”

Doug Bernhardt: So I know as a SQL person you’ve never heard this one before but it depends. So the thing is that if you’re going to use and here again, this depends on your vendor, you’re going to want to check – if you’re going to use, for us one of the big caveats here is that if you’re going to use the storage based snapshots and you’re going to use VSS integration, so you’re going to take those storage based snapshots using one of their tools and you’re going to take it actually from the host, then that communicates a lot of time back through iSCSI. So what – and some of our products, we require that you have the raw device mappings so there’s that communication channel and again “talk to” the storage.

When I send VMDKs, you can still do it but a lot of times the snapshot isn’t taken at the host level on the entire host, of all the VMDKs that are in the host. So it kind of depends on how you want to work with it, if you want to work with it at the granular database level, at the volume level or if you want to work within the host. Now, I don’t know about all vendors, if that’s true with all of them, but that would be one thing to look into.

Brent Ozar: Oh, while I’m at it too, jump back to the first slide, whichever one has your contact info just because I know I’m going to end up pointing to people and say go email Doug because he knows everything. Ronnie asks, “When I turn on perfmon, my average disk seconds per write and read are super fast. I mean they’re all within say, 30 milliseconds, but my disk queue length is up in the thousands” and lots of people were like, hey how come Doug didn’t mention to monitor disk queue length, and while you ask that one I’m going to go get a fresh cup of coffee because you got lucky on that, and this will only take you a minute so go ahead.

Doug Bernhardt: No worries, okay, so the problem with disk queue length is it depends on the number of resources that are actually behind it. So you know, if you look at the analogy – the queue length is you know, similar to if you go to the store or you go to the airport or whatever to check in and you got to go up to the counter and there’s a lot of people up there at the same time, okay, so that’s the queue length. But if you get through there, if you still get through there in a rather fast fashion, then it really doesn’t matter. If there’s a lot of say, you know, you go to check in at the airport and there’s 50 people in the queue, but there’s 50 agents willing to help you, then you’re going to get through rather quickly. So it doesn’t really matter – what I’m trying to say is how many people are in line if it there is enough resources to help them.

The other thing is that just through natural activity, SQL Server and any other application for that matter, but it will offer requests in a burst. So you will get a huge burst and you’ll see these queue lengths go up and down and sometimes you’ll see a dramatic number out of them, but really what you’re concerned about and this also kind of ties into why one of the suggested tuning methods is to tune based on weight stats is that that’s really what matters. So how many resources are available to service your request or how many requests are in the queue really isn’t relevant so long as your response times are what you expect it to be.

Brent Ozar: Let’s see here, Kenneth says, “Is there a way to see where my data has moved on the SAN? Like if we have automated tiering from slow to fast storage or fast to slow storage” because he says, “I’m thinking about end of year time is coming, let’s say for reports and I want to pre-warm up my data. Is there a way for me to like do some kind of activity on the SQL Server side and then see that my data moved up to the fast tier?” I like that, it’s creative.

Doug Bernhardt: It is creative. So yes, that information is visible, probably just not to you because you don’t have the SAN tools. But I can – here let me do this, let me switch over here real quick. Okay, so this volume here that I was working with setting on – setting QoS statistics on, so if I look at the statistics for this volume and this is Dell storage manager. So this is the management tool that your SAN admin would use if they’re working with Dell storage. Here you can see in the statistics tab that I have my tier one storage up here and there’s hardly anything in here, and so you’re thinking to yourself, okay, wait a minute there Mr. Bernhardt, you just showed us this operation on this database that you said was pretty big and there’s no storage. Well, if we scroll down here, you will see down here at the bottom, tier three, so my data on this volume has aged and it’s moved off of tier one, in this case I have faster disks in tier one, and it’s down here in tier three.

Now, I can also see – as you can see here on the screen, so I also have statistics on what the estimated compression ratios would be and that type of thing, so these statistics are available in the SAN, they’re available at the volume level and they’re also available at system-wide level.The other thing that you might have some luck trying and this is actually a good question because it gets into something that we didn’t talk about, is that – so Dell storage, we have full PowerShell integration, so you could actually write a PowerShell script. Now here again, if you have access to the tool or if you could buy your SAN administrator chocolate or beer or whatever it takes to get the job done, but you can basically do anything with the storage, set up, tear down, view, administer all through PowerShell. Other vendors have PowerShell APIs, some of them have APIs through rest and that type of thing to view these statistics, but that’s another option if there are certain things that say, maybe you want to keep an eye on but your SAN admin isn’t game for just throwing you the keys to the SAN.

Brent Ozar: Let’s see here, several people asked, “Is there a performance difference between VMDKs and raw LUNs?”

Doug Bernhardt: Not really, is the short answer. The virtualization technologies and layers have gotten so good now where whether it’s VM or whether it’s Hyper-V, there’s not a lot of overhead in there, assuming that it’s properly configured. So if you try hard enough, you could screw up just about anything, right. So you know, what is possible, I know we have VM ware experts here, I’m certainly not one of them, but there are different things that you can do within VM ware and the settings and things like that. A lot of those are documented by VM ware themselves in terms of best practices for setting these things up. But I think, by and large, you’re not going to see much of a performance difference. The overhead that is there, you’re talking like a, you know, percentage difference of single digits for that layer.

Brent Ozar: And then so the last one we’ll do is Lee asked, he said, “Can you go back to the things not to say to your SAN admin slide and elaborate just a few sentences on each of those?” There we go.

Doug Bernhardt: The bozo bit items. Okay, so asking for direct attach storage, so if you ask for direct attach storage, if you look at it from a SAN admin’s perspective, so they just advocated for an probably spent let’s just say a large sum of money on the SAN device. So that’s their strategy going forward, that’s their architecture. If you ask them that you want direct attach storage, you better have a pretty good reason for explaining why you’re so special.

The other thing is that then you know, it’s kind of like as DBAs, it’s much easier from an administration standpoint of you can think keep things common and have standards, right? So if you could have standard processes for managing your databases, if you could have standard processes for setting things up, one-offs create issues. So if you’re asking them for essentially a one off in their environment, that’s going to create challenges for them, it’s going to introduce gear in their environment that they’re not familiar with, that type of thing. So asking for direct attach storage is one of those.

Telling them you want your own SAN, so here again, I think in most cases and certainly you know, there may be some of you out there that have just a very large, very high performing SQL Server environment that would mandate just a crazy amount of storage and storage performance. But typically speaking, telling them you want your own SAN is just being kind of naïve about what the performance capabilities and the capacity capabilities of these things are. So chances are if you just communicate your requirements, they can satisfy your requirements somehow by either expanding the current capabilities that they have or promoting you from the crappy you know, tier 57 storage or whatever that they put you on by default to something a little better, but chances are they can accommodate with what they already have.

The dedicated disk thing, that goes along kind of with the direct attach storage item as well. So that presents issues for them because it’s different, the environment, when you start doing dedicated disk and direct attach and that type of thing. Like I said earlier, that presents issues like okay, if I set this server up and now I need to go change it later or the databases grow, can I match the drive types that are in there, is there actually enough physical bays in that server to support more drive types and that type of thing.

Demand rate ten, so that item is – and this goes back to kind of the days of spinning disks and when you only had a certain number of spindles, the advice was to use rate ten because it was the fastest, so it basically gave you the maximum number of spindles, say versus rate five or that type of thing, and therefore would give you the most performance. But now that we have SSD drives and we’ve got other flash memory technology, typically as a vendor, we don’t – when we size systems, we don’t have to size them for performance with crazy number of disks because the flash technology and the SSD drives are capable of such a crazy number of IOPS, in and of themselves, a lot of it now has come down to capacity because the drives are that powerful. So demanding rate ten, that just kind of displays your ignorance, if you will, of what’s going on with the technology today and how the systems are designed and actually work. So the disk queue length one we already kind of talked about that in another question.

Brent Ozar: Yes, and it’s amazing how I see so many people still using that today as a vital performance metric, saying my disk queue lengths are high, what should I do about that? I really tell folks, think of it in terms of if you had a huge to-do list on your desk and you were executing every to do within an hour or so, your boss would be ecstatic, he wouldn’t care how many things were in your to-do list as long as you can knock them all off in an hour. Things are great.

Doug Bernhardt: Yes, I kind of like to think of the analogy of you know, when you go to the airport and you have to go to the counter to check in for a flight, right, if there’s one agent, you’re in trouble, right? There’s a lot of people in line. But if you got 20 people up there, you can have 40, 50 people in line and it really doesn’t matter because you’re going to move through fairly quickly.

Brent Ozar: And that tells us that obviously, you’ve been taking a lot of flights lately as well.

Doug Bernhardt: I’m just damaged by the few that I have taken.

Brent Ozar: It’s bad these days. Alright, well thanks a lot for volunteering to speak to the community this morning. Hopefully everybody gives Doug a warm round of virtual applause, thank you so much for volunteering to spend your morning with us. Great presentation especially, I can tell by the huge number of questions and answers that came out of that, so thank you very much sir.

Doug Bernhardt: Absolutely, thank you, everyone.

The following two tabs change content below.

Doug Bernhardt

I currently work for Dell EMC as a Storage Engineer focused on SQL Server integration and performance. I have over 24 years of experience working with SQL Server in various roles.

Latest posts by Doug Bernhardt (see all)

, ,
Previous Post
SQL Server 2016 Features for Performance Tuning Lovers
Next Post
Bringing DevOps to the Database

4 Comments. Leave new

Doug, great presentation idea. I would love to see this talk since SAN’s are a mystery to me. As someone who isn’t a hardware guy, understanding these aspects are always a struggle.

Some constructive criticism about the abstract:

You start with “Storage is arguably the most important component in your SQL Server infrastructure”. You start with a big statement that is true, but many people may not know why it’s true. The thing you should be answering from the very first sentence is “why do I care?”.

As a DBA, I don’t worry about the electricity, I don’t worry about heating and cooling, why should I care about storage? Isn’t that someone else’s job? And the answer is I have to care, because if the storage fails, I don’t have a job. If the storage is slow, my databases are crippled. It’s important to sell why this talk is important.

Another thing to consider is clarifying is how the listener will grow by watching this. What will they be able to do moving forward? How will this improve their lives? Again the key question is “Why do I care?”. Optional, but definitely helpful.

Finally, I would consider breaking this up into two paragraphs. The first paragraph would be the why and the second would be the what. Sell the talk, then tell them what they are actually getting.

I’m excited to see where this goes!

Reply

    Thanks for taking my feedback into consideration, I love the changes. I think this does a much better job of explaining why this presentation is valuable and what watchers will get out of it.

    Reply

This would definitely be worth watching!

Reply
Michellea David
February 1, 2017 8:25 pm

I love this topic. Recently I have work with several SQL Servers that were affected by the SAN slowness. We had issues because the SAN was running out of storage space and that slowed down the tier movement.

I would love to hear why this occurred. Why the different tiers and how do I prove the SAN is causing my DB slowness.

Maybe a next step. Explain how to talk to the SAN admin to get faster io.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu