Enjoy the Podcast?
Intermediate to advanced DBAs/Developers interested in how a “top 30” web property uses SQL Server
Stack Overflow is the largest online community of developers, and the performance and reliability of our SQL Server databases are crucial to serving over 1.3 Billion page views each month with each rendered in ~10-20 milliseconds. The Site Reliability Engineering (SRE) team is responsible for designing and maintaining our infrastructure (SQL Server, Redis, IIS, Haproxy, ElasticSearch, Fastly CDN) and finding ways of getting the most performance from a minimal amount of physical hardware. We try and design all of our solutions to be as simple as possible, and we believe that troubleshooting should be a first class feature of any critical system.
This high level session will show how we use Availability Groups (AGs) to scale out the SQL workload and meet our current HA/DR needs, as well as future plans to use Distributed AGs to solve a pesky issue called “the speed of light”. It also will show the monitoring systems we use, namely OpServer and Bosun (both Open Source!), to quickly identify production SQL issues and ensure that performance is a top-priority feature.
I will also outline some of our key development principles, such as collecting exceptions in a central location, the benefits of using a Micro-ORM like Dapper, and the pros and cons of using the new JSON T-SQL functions. I will describe our experience finding the best tool for the job (like database migrations), where single purpose tools often beat all-encompassing frameworks. Keep in mind this information is heavily weighted toward the needs of a large .NET based web property, so YMMV. But my hope is you will leave with a few new tricks and a better understanding of how far you can go with a simple SQL Server architecture.
Why I Want to Present This Session:
At Stack Overflow we are very active in the developer community and as employees we are encouraged to talk about our work publicly. This past year I gave an Ignite talk at DevOpsDays in SLC and co-presented with SQLCAT at SQLPASS, but I have always wanted to develop this talk into more of a overview of the SQL strategy and development practice at Stack Overflow. People are often surprised to learn that we use SQL and .NET and are able to run on such little hardware (basically 4 SQL servers and 9 Web servers). Group By feels like a great place to share this kind of information and I am excited at the opportunity to be a part of it.
The 2-minute infrastructure description of Stack Overflow shown at Microsoft Connect:
Brent Ozar: In this session at GroupBy, Greg is going to be talking about how Stack Overflow uses SQL Server. So Greg, take it away.
Greg Bray: Cool. My name’s Greg. I’m joining from Salt Lake City, Utah today. I do want to say thanks to Brent for organizing this, as well as the community for up voting this session. I’m excited to be able to give this talk. We’re going to talk a little bit about how SQL Server is used to keep Stack Overflow and the rest of our network online. You can see the agenda here.
Do need to give a disclaimer: this is very heavily biased towards the needs of a very large .NET-based website, so please don’t take anything I say as hard or fast truths. They definitely are not. But I hope you’ll find some interesting tidbits in there.
Also, working for Stack Overflow is like drinking from a firehose every day so there’s a lot of information packed into these slides. I do not expect to cover everything but happy to follow up offline. If you see something you’re interested in, ping me, whatever, we’ll start a good conversation hopefully.
Hopefully a lot of you are familiar with Stack Overflow. I definitely know I use it every day and we believe most of the world’s developers do as well, at least on a monthly basis. We actually run a network of about 162 different Q&A communities, from very wide-ranging topics, from bicycles to DBA, there’s a DBA-specific, dba.stackexchange.com, kind of a really cool network of sites to be able to be a part of.
Last summer, we released a new content type we’re calling Documentation which is focused on code examples. Still pretty early in the beta process there but there’s already 100 topics on the SQL Server tag. Highly recommend you go and check it out if you haven’t looked at that yet. We do of course also have a job board, it’s one of the ways we pay for servers and salaries. Then recently, released what we’re calling Developer Stories, it’s kind of an alternative to a resume. Great place to go and show off some of maybe the cool things you’ve been able to do and accomplish.
A few numbers: site rankings of course vary quite drastically based off of who’s measuring and what they measure, but safe to say we’re in the top 50 sites for the U.S., probably even closer towards the top 25. A fairly decent amount of traffic nowadays. 108 million global uniques a month, 1.3 million page views. Those are probably already even out of date from last time I updated the slides. The sites were launched in 2008, a good history link there from I believe a Channel 9 video that was recently done by Joel Spolsky.
We were always built using SQL Server IIS and ASP.NET which is a technology stack you don’t often see at that size and scale. We do also have a fair amount of other technologies in the mix and then we use Fastly CDN as our edge cache provider. A few more numbers: 17 million questions, 27 million answers, over about 44000 tags or topics. One of the cool ones is about 20 billion SQL Server queries each month. So we definitely get a lot of value out of those licenses we pay for. Some more interesting stuff: There’s about 10 top level domains and a lot of subdomains.
We’re currently in the process of trying to move everything over to [inaudible] and finding all the quirks and concerns around doing that correctly. We do have 360 databases that all use the exact same schema and data tables and we have to keep those in sync as they change frequently. It’s actually two per each site, one for the primary Q&A and one for the meta for that community. So kind of an interesting problem. We’ll show you a bit about how we do data migrations.
Myself, I am on the SRE team. Basically our job to keep the sites online, make sure they’re running fast, secure. We have an on-call rotation, rotates through about 10 different people now I think. We maintain all of our internal hardware and services which is a mix of Windows and Linux systems, almost half and half. Pretty much all of our production traffic is served off of physical hardware. We’re not a very big virtualized or—don’t have a lot of cloud services, pretty much all we use is Glacier storage for offsite encrypted backups, then Route 53, GCE, for DNS, those are kind of the two main cloud things we use. Everything else handled through a data center in New York for the most part.
We very much enjoy talking about this stuff publicly. It’s actually part of our annual review process, we kind of get graded on it. We have a blog off of the Server Fault domain. You can see here this is from about two years ago when we did an upgrade of a live data center, lots of hardware porn if you’re interested in seeing those kind of pictures or just seeing some of the behind-the-scenes things, how those work. Also contribute to the company’s engineering section of the Stack Overflow blog.
We maintain the StackStatus account for Twitter and off of Tumblr, if we ever have outages or do postmortems. There’s a cool one that got recognized on Hacker News about a regex that took down our sites for a little while. It’s kind of fun to see some of those horror stories sometimes. Lots of people on our teams and development team have personal blogs, Twitter accounts. We’ve done a few AMAs. Definitely like to be involved as much as we can in the community.
The SRE team works very very closely with the developers. So we’re not in charge of the core code. We do do a fair amount of development for like automating tasks that we have to work on, but we make sure that development teams they can do streamline deployments, any sort of troubleshooting security performance, that kind of stuff. We definitely work with them very very closely.
So to get into our SQL Server strategy, software engineering products can sometimes devolve into Rube Goldberg style machines. We try to avoid those as much as possible. We try usually the simplest solution we can find and eliminate any excess complexity to hopefully end up with something that will scale. That’s a problem that we’ve had in our history as we’ve seen our traffic grow significantly since that initial launch. We found simple systems just generally scale much, much easier and of course are much easier to troubleshoot when things go wrong, and they always go wrong.
We all try to make things as fast as possible. We try to render pages in 10 to 20 milliseconds, which is sometimes difficult but it’s usually achievable especially if you have access to optimize the full stack all the way down to the hardware. At our scale, fast rendering can actually increase productivity of the world’s developers. So when you accumulate it, it can save countless days or weeks of developers’ lives. It’s something we enjoy, whenever we’re looking for a problem, getting that solved as fast as we can.
We do of course need to make sure we’re highly available and have disaster recovery strategy in mind. We do live maintenance in the primary data center if we ever have to go down it does disrupt global developers. You can see it on Twitter, it’s one of our best monitoring systems. If we have an outage, my TweetDeck column will just blow up usually 10 to 20 seconds before any alerts. It’s just crazy to watch. We try to avoid that if at all possible.
We of course have DR plans for everything. We monitor everything we can, often from multiple locations. We’ll show you some of the systems we use. Then when we’re designing say our data access layer, we try to centralize things into ORM features if it’s something we think the public might be able to use. I’ll talk about Dapper a little bit which is our Micro-ORM. Then definitely lots of caching, and making troubleshooting a first class features is pretty important for us.
Primary workloads: We do have two main clusters and five availability groups across those, about 369 databases total. Like I said, two for each site and then various databases for other services. Those two clusters are composed of two Windows Server Failover Clusters with three nodes each. There’s a primary in New York for writes, a replica in New York for reads, and then a replica in Colorado for disaster recovery. You can see here we’re actually currently using Windows Server 2012, not even 2012 R2. We’re kind of stuck on that until we can figure out the most efficient way of moving to a new cluster without tearing down the whole thing. We’ll talk about that a little bit.
Not crazy hardware, definitely some big boxes but just a Dell R720, a Dell R730. We try to operate on about a four-year hardware lifespan, so the Stack Overflow cluster is actually up for renewal. We’re spec’ing that out right now. Hopefully we’ll have it replaced in the next few months here with the latest and greatest. In total, Stack Overflow cluster has about three terabytes of data and the Stack Exchange cluster has about four terabytes of data all running off of NVMe SSDs for the most part, with some additional either SSDs or hard drives backing it for some maybe slower less IOPS needed.
We do have some additional SQL workloads, not going to dive into these too much but we have a decent data warehouse now where we send all of our traffic logs. There’s actually two separate standalone instances of that, we’ll replicate the logs to both, we’ll send copies to both New York and Colorado. 80 terabytes uncompressed, 20 gigs raw daily that it’s growing by, great for troubleshooting, machine learning, various things. Our ad server and data science teams are doing great work with columnstores in 2016. There’s been some major performance gains in storage, won’t be talking about it now but hopefully there’ll be a session on columnstore in the future for GroupBy because there are some amazing things going on there.
Brent Ozar: Raul has already—and a couple of people—they said, “Did I read RAID 0 on that last slide?”
Greg Bray: Well, we’ll get into that. I’s one of the tradeoffs we’ll do. We’ll maybe explain why and, like you said, not best practice but it is a RAID 0. We have a lot of things that will lean toward performance and we’ll show some of those reasons why.
Brent Ozar: Erik wanted to throw in and ask, “What do you do with your old hardware? And is there any way I can come pick it up in New York City for free?”
Greg Bray: We have, the original data center was in Oregon, so we actually have three data centers. We’ll send a lot of the old hardware to Oregon and make a great test lab if we need to spin up some new systems, play with some decent gear we’ll do it there. Sometimes occasionally it will get sent out to employees. I don’t think we’ve ever done it to the public. There’s a long line waiting for some of those I think.
Another place that you can access our data is through data.stackexchange.com. It’s actually a public-facing site built off of the Stack Exchange.DataExplorer open source project. Great place to go and explore SQL without having to use SSMS. All the site databases are there, with just some columns and tables removed so it’s not displaying private information. You can see on the right dashboard from Opserver, which I’ll show a bunch of screenshots of today and various miscellaneous servers. We do try to keep CPU low because that’s almost always the first resource we’ll run out of.
Then for data.stackexchange, an example of our kind of simple process, instead of using SSIS for ETL, we actually just use stored procedures and SQL jobs to kind of shuffle data around. We found it to be a simpler process, we didn’t need the full stack. That toolset was kind of too big for our needs. We just went with a more simple approach.
Some HADR concerns, why we’re using RAID 0? Well, we do have redundancy at a hardware level. Back in 2012, which is way before my time at Stack, the team actually worked with Brent to move over from 2008 to 2012, so we could get access to AGs. They’ve been a key part of our strategy since. There’s a great blog post if you want to see some of that history of how we moved and what we’re using those for. We’re not only RAID 0, we’re also asynchronous-commit mode. So with our data, we value the lower transaction latency over the slightly larger window of potential data loss. Then of course we can switch between those modes if we need to. So when we’re doing a manual failover, we’ll go from asynchronous commit to synchronous and be able to do that without any data loss.
Then it is, like I mentioned, split between the two clusters. AGs are great because they support things like rolling upgrades with minimal downtime. Most of that time is actually spent spinning up the listener IP and waiting for our .NET connection pools to stabilize. When we’re doing a failover we’ll actually shunt the traffic at the load balancer, do the flip, let things kind of stabilize, come back up, then re-allow traffic to go through. That’s currently the best solution we found. May be some ways to improve that, but works for our needs and is a fairly straightforward simple approach.
Again, the two different clusters, the primary one only has one AG, but the Stack Exchange network cluster has four because that is the failover unit. Then you can do different characteristics on them. So we’ll have one that’s a higher SLA for say our ad server or machine learning kind of money making operations. Maybe one that doesn’t go out to Colorado, so we have New York only databases that don’t need to be replicated.
Chat is an important one because that is a very critical system during an outage. We might want to fail that over independently to the DR site so we can do that by giving it its own AG. Then SE Network just holds all the rest. We can see here the Opserver view of all of the replication traffic, how big the logs are backing up on servers, if there were any issues, waits, and whatnot. We’ll be going through a few more of those screens turn during the monitoring section.
A few more of our experiences with AGs. We do like them because they let us use simple and fast local storage. We’re using Intel NVMe [PD3608s], I believe and they are in RAID 0. We have data redundancy at the server level. We’ve yet to have one of these go out on us—knock on wood—and we like that we can get the better performance, we can tradeoff that performance versus slightly higher risk and just hasn’t been an issue for us. We actually end up replacing them every 18 to 24 months anyways, as we need to grow into additional storage. So definitely don’t recommend RAID 0 unless you’ve thought through all the different concerns but we have a lot of other ways of making sure that we have redundancy at either a server level or through backups and whatnot.
It’s kind of interesting, those drives are actually four terabyte drives but they’re presented as two by two terabyte. In the current system we had, we couldn’t even really do various RAID systems we couldn’t do a RAID 10 because it wasn’t available through the hardware controller. We only really have what Windows software RAID could provide for us and that was going to be a RAID 0 or RAID 1. Now on the newer server, we have been able to get Intel’s latest RSTe drivers to do basically hardware RAID. So on our new dev servers we’re spec’ing out some different things, maybe won’t be RAID 0 but it’s worked fairly well for us so far.
We can also use the replicas to scale out the read traffic by sending ApplicationIntent read only in our connection string. That will allow us to do read operations on that replica. It’s great at scaling out our load across the two servers. If we ever needed to, we could split that out into additional servers through the AG if needed. If you’ve never worked with AGs or failover clustering, if you are using multiple sites where you have to split, span across multiple subnets those clusters can be kind of rough to deal with.
We actually had fairly frequent issues when we were using VPN tunnels for replication, adding a dedicated MPLS connection helped significantly, so you do want to make sure you have as much stable connectivity between the two as possible. But it’s still somewhat fragile, especially with our 2012 server OS. Been a lot of improvements, so if you’re able to, definitely recommend using 2016. They’ve done vast improvements in the clustering but our long term plan would be to move to a new feature called distributed AGs in 2016 because that will mitigate the multiple subnet failover issues. You’ll actually have an AG of AGs and don’t have to do these kind of cross site boundaries that can sometimes be difficult.
Currently our number one performance blocker is the speed of light. Rendering pages in less than 10 milliseconds doesn’t really help if somebody is in Australia 200 millisecond roundtrip time. The Fastly CDN can help somewhat with caching static content or dynamic requests, but our question pages have a fairly long tail, like 80 percent of them are viewed every month.
One way we could get around that, we’ve been planning and would love to try to implement if we had the time or money, would be creating a global Point of Presence where we can put Stack Overflow in a 2U server and put one on each continent. Then use distributed AGs to keep the local replica in sync with the primary data center. That’ll handle all of the read workload and any write requests we could just forward to the main data center. Great technology and we love that again things that can scale all the way down to a simple local VM setup. We’ll scale all the way up to attacking the speed of light. So it’s a great technology for us to kind of base our business needs on.
Distributed AGs are also a great way to do data consolidation. We have those two clusters and we’ve always wanted to create recording instances but because they’re on separate clusters, we would usually have to actually create two reporting instances because we couldn’t find easy ways of getting the data to go into one instance, but we could do that with AGs. We could have distributed AGs that kind of collapse that information down into one database, offloading a lot of the batch requests, preventing a rogue query from being a harm production.
This is my favorite topic, is monitoring. We’re somewhat fanatical about our monitoring systems. We actually usually end up building our own so we get exactly what we want. Then we also do run multiple instance of it, so we’ll have a copy of Opserver and a copy of Bosun in each data center so they’re accessible even in the case of major failures.
This is the main view of Opserver, which is our single plane of glass dashboard, is the status of all of our various different systems. It’s an open source project written by Nick Craver. Kind of has a very heavy SQL focus so you’ll see a lot of these examples of screenshots here in a bit. It was originally backed by the Orion monitoring system and they’re monitoring agents but the latest version now uses our Bosun monitoring system instead. It will let us do automation of various things like Redis failovers or putting a server in maintenance mode for HAProxy overwriting on call. If I’m doing some risky operation and my colleague in New York or Alaska—I don’t want to wake them up—I can just take on call to make sure it doesn’t call them. It will call me instead if I screw something up.
I get all the high level characteristics, you can see our SQL Servers usually run fairly low. In this case, we did have one going up to 21 percent CPU which for us is kind of high—it’s a warning level of CPU usage. We try to keep those low so that if there is some sort of influx, if we get attacked by a DOS attack, or various things, if we screw up, hopefully we have enough overhead that we can survive that without having to—we can’t spin up new instances like some cloud providers can. We have to make sure we bake in with a reasonable amount of overhead if possible.
A similar view for Redis, hopefully any information I need to know about how to manage or monitor Redis is going to be available directly here through this view. The AGs that you saw screenshots of before, if there were things that were going wrong, of course they’re going to try to red. It will show us which databases aren’t in sync and can we can kind of watch those catch up as we hopefully address or resolve whatever issue was causing the problem there.
This is the main SQL instance view. You can see things like trace flags and performance metrics and various other operational things you might want to look into. If there is a problem, again, we’ll try to turn it red. In fact, we try to turn the entire tab red if there’s a problem with an underlying component so you can even just see in the top-right corner what’s going on at a high level. So if I were to click on that job that is red, it will show us an outline of all the different jobs we have on the system.
We have various jobs for replicating SQL logins across the cluster, T log, full backups, optimizations, and checks. On another server, actually on our utility SQL instance, we’ll restore those nightly backups. Because you know backups don’t fail, it’s the restore that fails. So make sure you’re testing those. After it’s done restoring that, it’ll even go through and do what we call a sanitized minimized process. There’s a stored procedure written by the dev teams, it will run that process after the full backup. It creates a smaller version of the database that we then save off to a share. They can use that for development and testing. We do that every night. We’re testing not only our backup process but also getting sanitized minimized versions for developers to use.
This is a nice wall of numbers that you get if you ever go and look at the top tab. It shows all the top queries that are running and lets you sort by various things. By default, we’re sorting here on the average execution of CPU time per minute, but you can then also change it to say total. So you want to see the cumulated CPU used by each query, this is how we can track down if there’s a problem with one of our SQL queries very quickly. When you hover over a query, you’re going to get the text for that query. You’ll see there’s a little special comment that we’ve added in and we’ll talk about in the developer section a bit.
You could also change it to do total executions or executions per minute filtered down by database, or some other execution counts. Then when you select it, you’re going to get the full query plan so that we don’t even have to go into SSMS to find that if we’re deep in checking out what’s causing our SQL Server to become slow. So this is a great tab. It can be a little intimidating when you open it up and just hit with a wall of numbers, but once you get used to it, it’s a great tool to have access to readily right from a web browser, be able to share this with other developers. A lot of our teams are remote so you can’t have somebody looking over your shoulder. You’ve got to be able to share that with them in a quick and easy fashion.
We also do include Adam Machanic’s sp_WhoIsActive. That’s a great way of seeing if there’s something that is having lock issues or causing other resource waits. We can hopefully find that quickly and easily directly from this tab. Then a view of the open connections, so the logins, what host it’s from, also what isolation mode they’re running in. If a developer is connected and trying to run a query in the wrong isolation mode, we can track them down and prevent them from causing any locks through their ad hoc queries.
Brent Ozar: Lots of people are saying that they’re trying to open Opserver now, just keep in mind that GitHub is experiencing a major system outage. This is not anything with like a bad link. GitHub is having a huge outage right now so it has nothing to do with your link.
Greg Bray: It’ll still be there and it might take some time getting it set up. So I’ll show you all the screenshots and then you can hopefully have some fun trying to get it set up later today once GitHub is working again. Also a reason why we don’t use GitHub for our product. We use an internal GitLab instance because if this kind of thing were happening and we couldn’t do deploys—that’s a scary thing to be in as a sysadmin or SRE.
Here is the database tab. So this shows foreign instance, the databases that are running on that server, the state that they’re in, if it’s synchronized or having some other sort of issue, sizes, various other details. When you select it, you’ll be able to dig down into tables, columns, various information about all those. Some features we haven’t added in yet towards the bottom of the left-hand side there. But definitely a great way to be able to dive in deeply to these kinds of information. Again, through a browser, not having to open up SSMS, being able to deep link colleagues and whatnot to information when we need to share it across team members.
Bosun is a monitoring system, we’re kind of crazy enough to write our own. Don’t recommend it, but try ours, it’s open source. So it has an Scollector agent that uses WMI to collect information from IS and SQL, Windows, various other systems. We don’t currently access any DMV values directly from Scollector but in the future we’ll probably scrape that information out of Opserver. It already has all of it. We don’t need to double dip, we can just kind of grab the information Opserver has and then store that into a time series database. Makes it great being able to look back historically.
Right now we’re using OpenTSDB, but it also supports Grafite and InfluxDB, is kind of the major time series databases. Being able to retain the history of operational metrics has been invaluable in our experience, especially when things go wrong and we might not even notice it for six months or nine months and to be able to go back and identify when that occurred.
You can see here at the top there’s an active alert that hasn’t even been acknowledged yet, so it’s kind of hopefully something new that has occurred. When you expand these out, they’d actually even show the template of the notification that would have been sent. There’s things that have been silenced, so if we’re doing maintenance on a host or it’s an understood issue we can silence it to prevent it from bubbling up an alert. Then once it goes inactive, it would still retain on our dashboard because we kind of want to do root cause analysis, figure out why it happened. Maybe we need to tweak an alert, if it was kind of a false positive. So this is one of the views that we use to track down that information.
The graph tab on Bosun is what we use to explore the time series metrics. It will generate expressions as you’re going through and filling in the various fields. Then you can take those expressions and turn them into alerts combined with the notification templates so that you kind of progressively go through and go from ad hoc query searching through to generating hopefully finely-tuned alerts that will give you all the information you need when that error occurs.
The boxes at the top are a feature that was added last year. It is system-generated or human-generated annotations. If we have a server reboot, we have it send itself an annotation so we can see when that occurs. If we have an outage, we can mark that with a box so that we can identify even across various different metrics. Outages, maintenance windows, those kind of things, we found it very very helpful, we actually just store them off in Elastic. It’s a fairly simple thing to set up. The graphs will support scrubbing so during postmortems we can get detailed time windows of how long something occurred. It’s just been a great tool. I’ve spent a lot of time in this kind of thing just researching, trying to find high single metrics and turning them into alerts.
Brent Ozar: Nice.
Greg Bray: Yeah. So, Redis. This is actually even about a year old now. Last February, we posted this on Twitter from Nick and we did about 160 billion Redis operations. In all the time, the top of that graph is 2.2 percent CPU usage. So this thing is really good, it’s a key value store, any time we do an expensive operation we’ll usually cache it in Redis. Our internal data access library uses the public Stackexchange.redis NuGet package to create what we refer to as an L2 global cache.
So Redis is accessible from all of our servers. We can cache anything that we want at kind of a global level there. But then also cache it in a local ASP.NET cache called like an L1 local cache. So when we go out to SQL, we will take that information, put it in the L1 cache so that server can reuse it if it needs it again in a short window. Then also into the L2 cache so that if any other servers in the web farm need it, they should also have access to it pretty readily.
Here on the left, we can see a template from a Bosun notification. So if I was to get an alert at 3:00 in the morning, I want as much information in that so I can hopefully analyze it with as few brain cells as maybe I have at that point in time. In this case, it’s saying there’s high CPU usage on a system and that the process is called Crushinator. Well, I happen to know that is our ad server simulator, so I’m not too concerned about it. It’s on a machine learning box, not going to be affecting production. I can ignore it and deal with it at a later time if need be.
Then on the right-hand side, you can see some of the various different kind of chat ops-y things we’re able to do. If we have a spike in exceptions, we have a fairly complex alert that will actually post that into various chat rooms. It will tell us not only which data center is having the spike—if it’s New York or Colorado—but then it will also break down the application counts. Here it’s saying stackexchange.com, which is actually a different code base than stackoverflow.com or API.
It’ll break it down by host and give us as much information as we can cram into a little short text message. For us, it’s actually the developers that will go through and look at that. We’ll see that in a bit when we talk about some of the developer principles. Then others will have different flavors. So here’s Malfunctioning Eddie in our ad server team’s room, they have their own quirks on their chatbots sometimes.
Then the ability to create those annotations, so those blocks on then graph. If I’m doing an upgrade on a SQL Server, I can just type that out into a chat message and the Bosun bot will come back say, “I created this annotation for you.” I can go in and edit it more if I need to.
The next one I want to talk about briefly is Grafana. The main thing I can say: If you’re not already using it, you probably should be. It’s by far the best thing we’ve seen to be able to create dashboards, pull data from a number of different sources, including Bosun, which we use primarily, but also Elastic and directly OpenTSDB is where most of our time series data lives right now.
This is what a SQL Server AG dashboard looks like showing an issue with our windows failover clustering, kind of just failing and not coming back. Our dev teams also use this, so they’ll go in and create their own dashboards for their application. It’s been a very valuable tool. If you’re looking to do any sort of dashboarding, we didn’t even do any real dashboards in Bosun because we didn’t feel the need. We could just rely on Grafana to create those for us. We don’t have to duplicate their efforts. It’s a great open source project to start.
Brent Ozar: Matan Yungman asks a question. He says, “Do you have any suggestions on the best or quickest way to learn and start using the Grafana stack?”
Greg Bray: For us, it’s mainly just going in and see what works for you. They did recently add the ability to share dashboards. So there should now be, I think they originally called them data sources, which was like the ability to just get the data out. Now they’ve come up with what they call applications that’s kind of a higher level construct. It might include a fully fleshed out dashboard for you.
So if you look into, instead of just the data source plugins, see what applications are out there, they’re starting to show up more and more. I believe they even have kind of like a gallery view where if you want to see what somebody else did for SQL Server or some other product, you can definitely find what they did and hopefully just copy it. I’ve seen it float around on a few various places too, some of the subreddits that I follow somebody will come up and say, “Here’s the Grafana dashboard I created.” So it really does tie in to where your data lives, sometimes that means you might need to start with more of like the Influx TICK stack to get up and running and get data into a system.
If you already have an existing monitoring system, hopefully Grafana supports it. If it doesn’t, then go from there. For us, it’s just been an evolution as we’ve continued working on this over the last three or four years. But it’s pretty easy to get started with your first dashboard. Hopefully, you’ll find at least one place you can get information. They even have, I believe they call it Raintank is the company, and worldPing is like their external ping-all-of-your-stuff service. They will integrate that directly into Grafana. So if you’re just looking for something to ping your website and make sure it’s working, that’s a fairly easy way to get started.
Here’s another view of that same dashboard. This is what one bad single query making its way into production looks like at Stack Overflow. Things kind of just explode. When we have an issue like this, what we’ll usually do is we’ll start on Opserver and try to figure out or isolate which component is having the issue. Then dive down into a specific dashboard for that system. During a post-mortem, we’ll look back through the series of metrics and see what we were overlooking. We’ll add that to an existing dashboard or create a new one if we found that we don’t really have one that’s sufficient. Then go through and find if there’s a high signal metric that we could create an alert out of or something that predicted a warning or failure characteristics.
A lot of times, you won’t know exactly what you need until you have some sort of anomalous event. But once you do, by having that time series data, you can go back, look through it, and hopefully make it so that next time that won’t happen or you get much faster resolution when it does.
The last section that we’re going to talk about is the development principles. It’s kind of a lot of information so we’ll see how much of it we can get through. Looks like we’re doing pretty well on time. So I’ll go through as much as I can. A lot of these are, I just kind of wanted to throw it in because it has to be a full spectrum. You can’t just work on SQL problems. You’ve got to make sure it’s also the development teams and that everybody is on the same boat. We’ll show you some of the ways that we’ve been able to achieve that at Stack Overflow.
Definitely one of the first ones is just communication. You can’t have silos of ops versus dev. You’ve got to work as teams. In our structure, we actually have two people that are on the SRE team that are also actually fulltime developers as well, so they are kind of bridges between various different product teams. It’s worked out really well but we’ll also do things like when we’re going through and planning a SQL upgrade, we’ll make sure that they’re aware of how that upgrade is going to impact them.
When we do the upgrade on our production systems, we do need to bring the developers and the build pipeline with us because they’re using backups, as we talked those minimal backups, as soon as we flip the switch to 2016, now their entire local setup also has to be 2016. So I know Bob talked this morning about some of the cool things in 2016. Developers, we wanted to make sure they were aware of what they were getting. There are some cool improvements in SSMS that’s not based on an archaic shell of 2010 anymore. It’s at least upgraded to the Visual Studio 2015, which does add a lot of improvements.
The live query statistics are an amazing feature, if you haven’t seen them yet. They’re even available in downstream versions like 2014 SP1. So it’s a great way of being able, if you’re troubleshooting a query you don’t want to wait for the whole thing to finish, you can kind of watch it fill in the statistics as you’re going.
They do get some new features from the language perspective. STRING_SPLIT was an interesting one. We actually found that it was the fastest way of getting a large list of information embedded into a query. Between JSON and string splits and various other methods of splitting an integer into an in array, so where, exist, in or whatnot, our data access layer needs to do that fairly frequently. If we can find ways of improving that in Dapper, everybody is going to get the benefit for it. So there’s some cool information there.
There was a great blog post on the SQL Performance blog about, “STRING_SPLIT Follow-Up #2” I think was the name of it and just some amazing stuff being able to go through and benchmark it. Definitely not going to be a win in all scenarios. You need to see what’s working best for your specific situation. But kind of a cool thing to be able to pass on to developers and show how they’re getting improvements just by when we upgrade to 2016. Then simple things like IF EXISTS and DROP and ALTER, developers love being able to have shorthand on those kinds of things.
JSON is a big feature in T-SQL. I know a lot of people are very excited about it. We do definitely make use of it in very particular places. We use for FOR JSON when we’re doing ad hoc data migration. So we have a bunch of columns that we kind of are wanting to just turn into a blob because maybe they’re changing or it’s not really a static schema and we just want to make it a more fungible JSON object, we can do that through the FOR JSON T-SQL statements now.
We also do use OPENJSON and JSON QUERY to then extract information out of those columns in some points but we’re not going to do it in our hot path. SQL Server CPU is definitely the most expensive thing in our data center. We have two or four SQL Servers. We have nine web servers and they’re much more capable of doing that kind of work.
So in our hot path, we’re probably not going to use SQL to manipulate JSON. But if we need to, if we’re doing data migrations or various other things, previously we would have had to do like maybe a developer route in our application that only they can access or use a third-party tool like [Link Path]. It’s a great tool, we’ve used for abusing in many different ways. But now, we can do some of those things in SSMS. So very good to have it available but it shouldn’t be something that you’re always looking at just using directly. In our case, we definitely have more computation power on our Web tier than we would in SQL and definitely also costs less to do it there.
Here’s the exception log that I mentioned earlier. It’s been very beneficial when we have all of our applications, make sure that they collect all of their exceptions from all different parts, be it a SQL exception that has occurred or various other things and centralize that into one location. Because when we have something break, it might be that it manifests itself through various different systems and we need to track down what the actual root cause was. It’s also just one of the highest signals that we’ve ever had, been able to grasp. This is the primary thing that developers will look at when they’re doing new deploys. We do not have a dedicated DBA team. All of the developers are responsible for writing their own SQL and then monitoring it during those deployments.
The deployments are staged so they would start by of course working on their local system. Then once that’s ready, they’ll move to a private shared developer environment and do some additional testing there. There’s a dev exception log that they can look at to try to see if they’re creating any problems there. We would then move to what we refer to as the public meta sites. So it’s a portion of our public-facing properties that we can kind of abuse. They’re actually our most engaged users so they’ll always find the problems before anybody else for us. We would much prefer to have something break there before it rolls out to Stack Overflow and the rest of the network. So finding a way of doing those staged deployments is another key component of our deployment process.
We’ll find countless things on meta that never cause issues on the wider network and hopefully we’re finding them on local or dev the majority of the time. Sometimes they’ll slip through and we’ll still be able to find them there. By having those applications use Stack Exchange Exceptional and putting everything into a centralized place, we can then manage them in bulk. We’ll do duplicate detection so that you can just see a count over there on the right-hand side. That number two usually is more like 2,000 if we have a real issue where it’s going to be the same thing occurring. We can quickly identify what the problem is and go from there.
Since there are multiple teams, this is a fairly recent addition to Opserver where you can then filter down by not only team, but then also product within the team and kind of see that break down. Then on the left-hand side you can see the Xs and lock icons, which actually show if you want to clear out information. You can do that by selecting similar, or all of this category, or just clear the whole thing, because sometimes you just got to commit exception bankruptcy and start over.
If you’re going to do that but there’s a few you want to save, you can just lock them. That way you can say I really need such-and-such to look at this but they’re not awake yet, so I’ll save this for later and then be able to share it to them with a URL. Again, a web-based product, being able to do link sharing is amazing versus SSMS or event logs, trying to get that information into somebody else’s hands can sometimes be pretty difficult.
We mentioned database migration so there are a lot of products that do this. I’ve used DB Ghost and various Redgate products before to handle database change management. Definitely want to find the right one for your specific scenario. For our scenario, it was hundreds of sites that run the same schema and need to be updated. We found using multiple databases is more advantageous than trying to consolidate it all into one because we’ll have Stack Overflow that’s a three terabyte or two terabyte database. Then we’ll have bicycle’s or whatever the newest beta site is that’s a few megs.
We don’t know what’s going to happen to those sites. One of them might explode. If it gets incredibly popular and we need to give it additional resources using a database to isolate that means we can move it to a different server, kind of manage it as needed. Not all of the sites survive. Sometimes they get closed. So we need to be able to clear those out and being able to export the data to a data dump and then just delete the database we found to be a very reasonable process. It’s not uncommon for things like a WordPress hosting site, they’re going to have a very similar situation and they might just have a bunch of copies of databases. SQL Server can handle that just fine.
So needing to solve the migration problem, we found just including SQL’s migration scripts in a folder in the git repo has been the simplest approach. Then a very small tool that will run that across all sites during one of the stages of the build process.
You can see at the top there that TeamCity is what we use for managing various different builds in our systems. It found 342 sites that it was going to go through and update. We do design those scripts to be idempotent. So if it’s an especially large one and we can’t run it within the time limits of our build process, we might actually run that on Stack Overflow beforehand. We could run that through SSMS, when it then goes through to check the script again it’s is going to just go through and see it doesn’t need to do. Worked out pretty well.
In fact, one of our developers, Bret Copeland, from our ad server team liked it so much he wanted to be able to use it on public projects. So he created an open source version, it’s called Mayflower.NET. It’s not the same one we use in our production but it’s kind of a simplified version of it. Definitely something you can look at, see if it works for what you’re doing. If it doesn’t, modify it to your own needs. It’s another simple approach to a problem and has worked out reasonably well for us.
This is actually an interesting snippet of SQL here. It’s for what we’re calling the Census badge. It’s awarded, it’s another one of the virtual badges I talked about earlier to people that take the developer survey, which I believe launched yesterday. So if you want some virtual swag, you can take the 2017 Developer Survey that was just recently announced on stackoverflow.blog and you will get what we’re calling the Census badge if you’re willing to add in your Stack Overflow account number into the survey results that they’ll go through and use.
Dapper is a micro-ORM. Many of you may be more familiar with LINQ to SQL or Entity Framework. They all have pluses and minuses. You should never go out and just try to use something because somebody says it’s the best. We’ve found Dapper is great at being a minimal and thin layer that does very simple things but can be built up to be as complex as you need it to become. Its main goal is to be a wrapper around the IDbConnection object and then take string texts, that is just embedded into your code base, and turn that into a query that gets executed and then returns just Plain Old CLR Objects. Very simple approach. It lets us do things, optimize at kind of a lower level, so like that parameterized list thing we talked about using string splits.
In our code base, our SQLMapperExtensions will do sufficiently more complex things like building up the SQL query. When we’re building up the query, we know the line number that was executing that. We can embed inside of the query something like Controller\Method@LineNumber. That is by far the most useful thing I’ve ever seen in any database is when you look at the query it tells you where it was being executed from. It means when it has a problem I can go and find it immediately. I don’t even have to really understand the code base. I can just kind of see where it’s coming from. So that’s been the approach that we’ve seen.
Definitely the benefits of it is having that full control over the query text. If you’ve ever had to look at LINQ to SQL or Entity Framework queries, they can be quite cryptic. They’re often poor performing because you don’t always have full control over what the framework is trying to do for you. We’re taking a simplified approach. You do end up having to write more code, but you get more control over it as well. Being simple, it should be fairly easy to get started with it.
If you’re just writing SQL queries right now, you should be able to take your existing queries and just wrap them into this thin layer of ORM to get a few niceties out of it that you can then keep building on top of. But because it is just strings, they’re not strongly typed so if you’re needing to have some more of the compile time checks in place, then Dapper might not be the best place to do it. Or, you might need to combine it with—maybe you’d only use stored procedures to pull information into Dapper and you can still get those compile time check through other SSDT tools maybe.
Brent Ozar: Riley asks a question. He says he knows from looking at the pages that it says the number of views that each question has, things like that. He said, “Well since you log recounts, is anything really read only or do you separate the logging of views from the alteration of content?”
Greg Bray: We would definitely separate that out so those paths are highly optimized and can be done separately. We don’t need to increment that view count immediately. We could definitely batch that up if need be. We could, in fact, you know if we were a cloud provider we might just send that off into a queue, let it accumulate for 5 or 10 minutes, and then update it at another point in time. By being able to separate that out, we can render even if you’re logged in, we could render still the bulk of that page in a fairly cacheable format and just send some one-off operations.
One place that we definitely have difficulty using Dapper is in those complex relationships. If you do have a lot of child objects or dependencies based off of things, multiple tables, you’re going to have to do a lot of work to get that to operate in Dapper. In fact, we’ve from a legacy standpoint had a lot of LINQ to SQL. We’re looking at Entity Framework, especially the vnext that has a lot more capable development models as ways of managing that complexity for our limited, lower volume write operations.
We get the high performance of Dapper for the read-only page views, then a simpler approach to updating data through Entity Framework or LINQ to SQL. If we can keep the maintenance costs down of having those two systems, that is probably our preferred approach. We don’t want to turn Dapper into the full framework ORM. We want to let Entity Framework to do what it does best, which is managing those kinds of relationships and some of the more complex operations. Then let Dapper handle things that are easier to manage.
This is one of the last slides that we have for our developer process, which is, again, the full pipeline, being able to make sure we’re measuring not only things that are occurring within our data center, but how is it actually affecting our end users. One of the main ways that we do that is through something called MiniProfiler. When one of our developers visits the website, which we do on a monthly basis, just like everyone else, trying to solve our own problems. They’re actually going to get a drop-down on the top right corner. That’s going to say how long that page took to render and breaks down the entire rendering pipeline from the various MVC stages to third-party things like calls out to SQL or Redis or ElasticSearch.
It’s going to break that all down into hopefully easy to understand and look at information. Like here, it’s saying a certain section took 149 seconds. You can see it even turned red because the total was 192 whereas we’re expecting that to be much closer towards 20 milliseconds. It’s a great tool. It does require you to instrumentalize your views. You’re not going to get any of this really for free, except the lower level ASP.NET stuff. Sometimes we can kind of tap in and get for free, but then you’re going to have to go and say, okay, I’m going to do this section of works. Let me wrap it in something that that will time it and send that information out to it.
We can then again pass this information around. There’s a sharable link that we can use if we see a problem. We can say, “Hey, looks like this is taking longer than others.” It’s one of the main ways we really see performance degradation on our sites is when we’re using it if we see it’s being slow, we just can in one click check to see why it’s running slow and then start tracking down that problem. That’s great for us. But again, it’s only available for developers on the site. That’s a much smaller sample than the world at large.
We also do something which is the client-side timings of the browsers that are visiting the page. They’ve been in browsers for quite a while now, you can just ask it for information about how long a DNS look up took, how long the HTTP negotiation, or DOM manipulation took. There’s a vast amount of information that’s available.
If you go to test.stackoverflow.com, it will actually break that down and show you the information that we’re collecting. We’ll just collect that for a small sample of users. We only need 1 percent or 2 percent of our page views to show that information to us. That will give us information from across the world so we’ll understand, is there some sort of regional issue maybe? Which kind of indicates maybe your problem with our CDN. That’s another piece of information that we’ll throw into our data warehouse so that we can go through and crunch the numbers.
In fact, when we were changing CDN providers from Cloudflare to Fastly we made extensive use of that. Also doing things like putting a hit in iFrame so we can do testing of DNS providers on an alternate domain and just get access to huge volumes of very valuable information. I believe the latest blog post on Server Fault goes through talks about that, how we measured different DNS providers, chose the best ones, and then also showed how we could combine two of them so that we don’t have an issue if Route 53 goes down like DynDNS had recently. That was a major critical failure in some infrastructure that we hopefully think we can work around by just using two of them. But are we going to have any performance degradation? Well, we can measure that beforehand using some of these tools.
So definitely if you’re doing web development, these are the kinds of things you need to try to work on to get access to that data. You’re always going to hopefully have more data than you can consume, but just having access to it will help you make much better, more informed decisions.
That’s kind of wrapping up for the material that we had. There are a lot of open source projects that we contribute to. We do have a list of them there. The two big ones that I was talking about today was Opserver, single pane of glass dashboard, and then Bosun, our monitoring system. If you’re interested in some more of maybe the development aspects than just the SQL, one of our developers on our core team, Marco, did recently do a presentation JSConf in Asia. That’s something I would definitely recommend looking at if you’re interested.
Nick Craver has an amazing series of blog posts if you want to know about our infrastructure. His blog is one of the great resources to see about the things that we do. Then he also had a recent podcast on MS Dev Show. Just today, I saw my boss Kyle had a presentation on Bosun from GrafanaCon last December. So if you’re also interested in learning more about Bosun, there was a [inaudible] video from a few years ago and it sounds like just another brand new one recently for GrafanaCon. That’s all I have. Happy to take any additional questions and go from there.
Brent Ozar: First, before we go into the questions, I want to thank you. Thank Greg for volunteering to speak here at GroupBy and sharing what Stack Overflow does and what lessons they’ve learned. I know in the Slack channel we’ve been already talking about thanks for being so open with everything. Slava asks, “How many questions have you posted on Stack Overflow along the way?” I’m guessing that’s more of a rhetorical question.
Greg Bray: I’ve posted quite a lot. The one I would actually be interested in: How many have I read? There’s an interesting feature—so our machine learning is called Providence. If you look through the blog posts about when we released Providence, if you are a signed-in user or even if you’re not, it does track those and will at least tell you the tags that you visited and it will try to predict what kind of a person you are.
For myself, it’s seen that I’m visiting a lot of say SQL tags and maybe some things that go towards dev ops. So it might label me as a sysadmin or a dev ops user. It’s going to use that information to maybe change what kind of ad that it shows me. But you can go and download that information. If you go into your profile and say, “Send me my Providence profile,” it’ll tell you all that information. You can go see exactly how many questions you’ve looked at and what tags that were on it, sometimes kind of interesting.
Even more so, you can see how you’ve changed over time and if you’re at a different job using different technologies. It’s really cool to be able to see how that goes. So there is a place on our site I could go look up that information and I’ve looked through it a few times. Recently it’s been the DBA site. If I find a complex problem that I can’t solve, I’ll usually post it there. Then if I later do solve it, I’ll fill in the information because half the time, four years later, I’m going to forget what I did and I want to have that public artifact to go back and find it.
Brent Ozar: It’s the same reason I got started blogging too. I’m like, I got to write this stuff down because it’s like my documentation. Andrea says, “With your availability groups, how do you guys keep logins, jobs, and linked servers synced between all of the nodes?”
Greg Bray: That is one of the SQL jobs that we have, does login replication. It’s kind of unfortunate it’s not provided in a more standard format by Microsoft. But I can kind of see that maybe there’s some differentiation there that’s hard to make generic. So for us, we just have a SQL job. We’ll be able to go through and get the information of what logins should be on that server and it will create those.
As soon as we create it on the primary, within usually 10 or 15 minutes that job run will. It will run on all the servers and let it replicate out. Linked servers, we don’t make extensive use of. Actually, I guess we do have it in a fair number of places, but they’re fairly static. We haven’t had them change a lot. We just have a big spreadsheet of how do we set up a new SQL Server and just gets added in there. Things like installing various stored procedures too for sp_Blitz and all of the goodies that you can get onto a server nowadays.
Brent Ozar: Jay West asks a question. He came late, he saw the slides on AGs. “Is this multiple instances on the same server with AGs on each instance? Is that possible?” No, SQL Server won’t let you do AGs between the same physical box.
Mark V. asks, he says, “You mentioned you’re spec’ing out new hardware. What are some items on your wish list? Are you going to post the specs and when is it going to be finalized?”
Greg Bray: We’re mainly looking at the latest version of the R730 chassis again. Primarily—we actually don’t need to do the upgrade. It’s more we either need to upgrade the warranty or the server. Sometimes there’s not a lot of difference, so we might as well get a faster server. We’re fairly particular on the processors that we use including—this is indicating the cache size. Depending on what the server is used for, we might make a differentiation of a faster processor or more cache or whatnot.
We just barely did quotes I believe this week. I’m sure once we have things finalized we will be posting on it, but it’s more or less going to be fairly similar to this. This was 2015. There have been some changes. Processer, we’re probably going to get a few more cores. I know we were looking at doubling the memory. In terms of the storage, we’re reusing the same storage or waiting for Intel to come out with a six or eight terabyte drive instead of a four. Other than that, it’s going to be fairly similar specs just with a slight bump in some of the components.
Brent Ozar: Tim asks, “What are you guys using for load balancing?”
Greg Bray: Load balancing, we have used HAProxy I believe through the entirety of the company. No, that’s true. They used to use NginX for SSL termination. But once that was rolled into HAProxy, we currently use that for everything. Waiting to see how HDB2 progresses in HAProxy. It is not currently available except in I think some developer builds, but it’s been an incredibly stable system for us. We actually run—on two servers we’ll run the same HAProxy with the same configuration and then a keepalived for a virtual IP address that can flip between the two. So those two servers do a VIP flip. When you flip over to the other server you can do maintenance on one. If there was one that crashed, it will flip to the other.
We also have some unique parts where we’ll actually split our infrastructure. We have two different frontend load balancers, each with two instances. Whereas one is dedicated for our CDN path and the other is dedicated for our public path. So we have in total four load balancers, split between four ISPs, two different functions, and some other tricks such that if somebody was to attack our public-facing network we can have ways of hiding behind the CDN and not allowing them to reach us directly. That’s kind of the biggest concern we have is by running our own hardware, we are very much limited to what we can do on our uplinks. We only run one gig uplinks. One gig is not a very large pipe nowadays. Most denial of services are approaching terabyte size. We could never really do that internally so we have to rely on a third-party CDN to absorb some of that influx if we are ever under attack. Which, unfortunately, does happen sometimes. Often, it’s just somebody in college having a bad script crawling our site, that’s more or less the main denial of service that we get, but occasionally we do get quite large ones as well.
Brent Ozar: I’m glad Mark asked this because if Mark hadn’t asked this I would have asked it. He says, “Wait, so you guys aren’t upgrading to SQL Server on Linux?”
Greg Bray: We have evaluated it. In fact, when we got access to initial private builds of it, I believe it only took four hours for us to get Stack Overflow working—not the public side but just a copy of the database on SQL on Linux. It’s amazing what they were able to do. It is just SQL. It’s running on Linux but it is just SQL. We’ve evaluated it for some different places.
There’s an interesting place that it may be used which is on the Stack Overflow enterprise instances, which is basically a private copy of the software that we will let people run on premise, if they’re a very large organization, usually about 500 developers. We’ll allow them to license the software from us and use it there. We have talked to people that would be hesitant to purchasing it because they don’t want to run Windows servers.
So SQL on Linux is a great path that we could support that kind of a scenario if we wanted to. But in terms of our actual infrastructure, the high availability solution there is very much in its infancy. We’re excited to see where it goes. It’s more useful for things like our open source projects. We would love to have a time series database backed by columnstores. They have been an amazing feature that we’ve been able to do some cool things with and those being available in a Linux environment through SQL on Linux, kind of opens the door to those open source projects that previously if we were to tie it to a SQL Server instance, it does limit the scope of that project versus it’s awesome to see it going into so many new places.
Brent Ozar: Perfect. Well Carnegie says something. He says, “Great presentation. Thank you for sharing your mast for knowledge and experience. Best regards and keep up the great work. Stack Overflow is a great resource.” I would agree. I would have everybody give you a warm virtual round of applause. They can follow Greg @GBrayUT on Twitter. Thanks for hanging out with us today and sharing over at GroupBy. Thanks, man.
Latest posts by Greg Bray (see all)
- Keep It Simple, Make It Fast: The SQL Server Strategy at Stack Overflow - January 13, 2017