Today’s applications are moving away from the monolithic approach and evolving toward a microservices architecture. Monitoring tools are evolving too, and the microservices approach has stepped into this area as well.
What should you do if you have several technologies under your responsibility? SQL Server and MySQL? Maybe Hadoop or PostgreSQL? Should you use a separate tool for each product? Should you use the same tool for monitoring and alerts, incident management and notifications?
There are many cloud SaaS products in addition to traditional on-premises monitoring products, and in this session we will talk about their advantages and disadvantages.
Application and infrastructure DBAs who are challenged by supporting more than one database technology, or those simply interested in cloud monitoring tools and their advantages and disadvantages.
Brent Ozar: Alright, so next up at GroupBy is Maria’s going to talk about how to monitor everything. So take it away, Maria.
Maria Zakourdaev: So hello everyone, and welcome to my session, how to monitor everything.
My name is Maria Zakourdaev, and I’m from Israel. And me and my team are managing all data infrastructure in one awesome company called Perion, which provides data driven ad tech solutions for brands and publishers. I’m a SQL Saturday organizer in Israel, and I frequently talk here in Israel in various data events, and I also blog at sqlblog.com.
If you have any questions for me after this talk, you can reach me on Twitter as Maria_SQL or you can contact me on LinkedIn. What I have today on my table is a – in this age of cloud and big data, when infrastructures are very complex and application response times are more and more critical, and all the application performance depends on data layer performance and I believe that workable monitoring strategy is a very important factor that impacts profitability, customer loyalty and business service health.
So if your team – today’s session was inspired by three different aspects, and this is what I have on my table in addition to some medicine for my flu. So sorry for my voice. Of course, I had to get it two days before the session. But what I’m going to talk about today, the first aspect, is how the microservices trend and polyglot persistence have impacted my team and have brought a lot of various data tools under our responsibility. We have SQL Server of course, and we have MySQL and Postgres and Elasticsearch and Couchbase and Redshift. The question of whether we could use SQL Server for the same purposes is out of scope for this talk, but I can say that some of them we’ve inherited, some we’ve tried and liked, and some of them were cheap solutions. So as a proper SQL Server DBA, I only knew the monitoring tools that can monitor SQL Server. And I went and searched for tools that would give me a view, a dashboard over all the systems that we are supporting, at least at a high level.
But if you don’t have all those technologies in your responsibility, please don’t leave this session because I have a lot of more going up because you know, I’m going to talk also about my top list of monitoring strategies, which is, you know, can be applied to any data monitoring and I will show you some demos of the tools that I’m using.
And another aspect of my talk is SaaS tools. When I went and searched for monitoring tools, I realized that there’s a huge number of software as a service monitoring tools. I really like them, and I’m going to talk about why I like them. And another aspect is that when I started to learn all those tools, I came to understand that the microservices trend has also impacted the monitoring area, and instead of having one monolithic database monitoring tool, we have different categories, and I am going to talk about those categories in a minute.
So if you and your teams are responsible for company infrastructure, you need to make sure that – I believe, that everything is under control and any malfunction will be taken care of as fast and as efficiently as possible. And do you know how fast is fast for a web application? Under 100 milliseconds is perceived as instantaneous; from 100 milliseconds to 300 milliseconds, the delay is perceptible; users expect a site to load in only two seconds, and 40% of visitors will abandon your company’s site after three seconds. And 10 seconds is about the limit for keeping user attention. So this is why I think – why our data layers play an important part in your web applications and mobile applications, any type of application. This is why I think this topic is very important.
Polyglot persistence is a term for storing your data in different data stores based on how the data is going to be used. So today we have tons of databases. We have relational databases, and key-value databases and document databases, graph databases. And each one of them knows how to deal well with only one data model; the other data models can be modeled as well, but you will not necessarily get the best possible performance. As a [cause] of this trend, you know, you get fast response times, and of course perfect scale-out if you’re using a different technology for each data model. But on the other hand, you need to learn and figure out how to use all those databases, and this is what we had to go through.
And microservices is another technological trend of recent years that correlates with polyglot persistence. Many companies begin with one huge database or one huge code base, and when I joined Perion, we had one database and we had very – everything was there. Every component that any business unit in the company was developing, everything was in the same database. And then there was a very painful process of splitting the stored procedures into different databases and, at some point, onto different servers, because application owners didn’t want other applications to impact theirs.
A one-database design offers advantages like less operational overhead for small startups with a few people. The disadvantages start when the startup grows and the whole code base or many database components are being changed all the time and, you know, whole thing is revealed or changes. And once, the idea of having a database server for every application was a difficult and expensive task, and only big organizations could do that. But today we have services like Azure and Amazon Web Services, and we have many options for deploying tiny databases without too much overhead.
So again, if you need to change some stored procedure – our database at least looked like this plate of pasta where everything was mixed together. And with the microservices trend, you know, where all the components are decoupled, tearing down a monolithic database server into a microservices architecture brings a lot of benefits. Changes to components can be applied without being afraid that you might impact other services, and the smaller databases, you see on this side, can be scaled separately, or you can change technology if needed. And one service’s performance does not impact the others.
But you know, as someone who is managing all those tools and monitoring, it’s incredibly frustrating the more database types you get under your responsibility. Visibility is critical and you really want to have one dashboard showing all your infrastructure and not jump from one product to another, at least for a high level of course. There are tools that know exactly how to monitor each database technology, but for a high level you really want one tool.
And I like SaaS tools. You see, I’m even using Prezi for this presentation instead of – it’s not PowerPoint. It’s a SaaS presentation tool and I think it’s really cool. Why SaaS? Because, you know, it’s very easy to use. When you try database as a service, you understand that that’s really, really easy-to-use stuff. With monitoring tools as a service, you can usually install an agent with one command, and the agent automatically, with very minimal configuration, detects all the components that you have on your server, and you can see everything. Users and applications and any database technology, any operating system. And all in one place. No IT headache, accessible from any device, instantly on, automatic upgrades. And many SaaS tools have free plans, and in many cases there is a low investment cost. Of course, if you are a huge organization, if you have thousands of servers, it might get expensive, but you know, those tools have packages for big companies with their discounts, and it’s still so easy to use and very comfortable.
Also what we can do sometimes, instead of using agent, we can write a small code and usually those SaaS tools are very customizable. And you can write a small code, you know, Python or some – any languages you’re used to, and connect and send metrics to the monitoring tools, and then from one – Datadog for instance, costs $15 a month and if you have one code that’s monitoring all your infrastructure, the main counters, you just pay $15 per month, and you know, you can have all your infrastructure monitored in one place. I will talk about it in a second.
So here are the monitoring tool types or as I see it, microservices. There are probably more types, but those are really important for me. I identify them as really important and we use them here. And as I said, you know, several other monitoring can cover everything, but there are areas that you just need to leave them to the specific – like graph databases, you know, they know to serve some specific data model. Like here, there are monitoring tools that can do something very specific and you don’t have to mix everything in one place.
So first we will talk about server level monitoring tools. This is the main category of course, and those tools monitor your servers, any operating system, any database technology, including networks and everything. We’re using Datadog for this purpose and you can see here how many systems it can monitor. You can see SQL Server here and all the services on your cloud somewhere over here and Elasticsearch and Couchbase and MySQL somewhere over here and Postgres and Redshift; everything that you can think of, you can monitor from this tool.
In addition, a lot of other applications that you might use, and what Datadog does for each component that you see here, it has its specific counters that it knows that are important to be monitored. In addition, as I said, the Datadog and a lot of other tools from the same category are really customizable. So what we do for instance – just a second, I will show you. This is the dashboard from the Datadog, and you – it is customizable. They put here any counter from any database that you have. It can be number, it can be graph, it could be list like this and here you don’t see but here in my office, I have a huge monitor and I have a place for Couchbase and for Elasticsearch and SQL Servers and everything is in one place.
And in addition to servers, I have – I can monitor my processes. For instance, I have this Python, which is taking data from SQL Server to Elasticsearch. And what I can do, I can import a few libraries from Datadog. Again, I don’t need any agent, to install agent for this. I include libraries, I defined keys, which is sort of using password, and then I have only one line here, API metric send, and what I’m sending the date time and the amount of rows I have transferred at this exact moment.
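A minimal sketch of the kind of reporting script described here. It does not reproduce the script shown in the session and does not use the Datadog client library; instead it builds the JSON body for Datadog's v1 series HTTP endpoint with the standard library only. The endpoint URL, the `DD-API-KEY` header, the metric name `etl.rows_transferred`, and the tag are assumptions to verify against the current Datadog documentation.

```python
import json
import time
import urllib.request

# Assumed endpoint for submitting metrics; verify against the Datadog docs.
SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_series_payload(metric, value, timestamp=None, tags=None):
    """Build the JSON-serializable body for one gauge data point."""
    if timestamp is None:
        timestamp = int(time.time())
    return {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[int(timestamp), value]],
                "tags": list(tags or []),
            }
        ]
    }


def send_metric(api_key, payload, url=SERIES_URL):
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical metric name and tag for the SQL Server to Elasticsearch job.
payload = build_series_payload(
    "etl.rows_transferred", 1250, timestamp=1_500_000_000, tags=["job:sql_to_es"]
)
print(json.dumps(payload))
```

The job would call `send_metric(api_key, payload)` at the end of each run, reporting the timestamp and the number of rows it moved; the call is left out of the example so that it runs without credentials or network access.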
So you can understand that this job is running every now and then, and every time it’s running it just reports to Datadog amount of data it sent. What it does, it does sort of visualization about how your job is doing and if the job suddenly – maybe it continues working but not moving any data for some reason. Maybe there’s something’s wrong, some bug. The Datadog is going to notify you, no data for this process. I will show it to you here. Brent, do you see my screen?
Brent Ozar: Yes, absolutely.
Maria Zakourdaev: Alright, so this is this counter that I sent, and I can, you know, visualize it for a few days or for the past week. And you know, you can just understand what’s going on, if amounts of data growing or you know, less and less data is being transferred. This is amazing way – you know, another part of monitoring that you can do in addition to monitor CPU and memory of course and the disk space and everything.
What else we do, for instance, we’re using Redshift for one of the data models in our company, and Redshift – Amazon UI is really limited for its ability to show us what’s going on and the queries and what we do is another Python that I’m streaming the tables and the query IDs and the performance of all queries. Redshift is based on Postgres, so inside you can find any query. And here I can – in this dashboard, I can pick up any table and see all the queries that took a lot of time and the visualization when it happened and I can just, you know, usually the performance is very good, but I can go to those points of time, find those queries inside Redshift and make sure I tune them.
So what you can do, you can stream in here, there’s such counters so many, many database technologies and you will have it all in one place. So instead of having one filter, it can have filter here of database sort of database, database technology, and you know, you can put dashboard or anything that you want.
Another thing that we monitor is Hadoop. We have a lot of tables that we’re updating and processing there, and there is another piece of code that goes and checks when each table was last updated, and we have a dashboard of data delays in Hadoop on the same dashboard. Alright…
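The data delay check can be sketched roughly as follows. This is an illustration under assumptions, not the code used at Perion: the table names and timestamps are made up, and how the last update time is read from the cluster is not shown.

```python
from datetime import datetime, timedelta


def find_delayed_tables(last_updated, now, max_delay):
    """Return {table: delay} for tables not updated within max_delay."""
    delays = {}
    for table, updated_at in last_updated.items():
        delay = now - updated_at
        if delay > max_delay:
            delays[table] = delay
    return delays


# Hypothetical table names and timestamps; in practice these would come
# from the cluster's metadata, which is not shown here.
now = datetime(2018, 1, 10, 12, 0)
last_updated = {
    "clicks": datetime(2018, 1, 10, 11, 45),
    "impressions": datetime(2018, 1, 10, 8, 0),
}
delayed = find_delayed_tables(last_updated, now, timedelta(hours=1))
for table, delay in delayed.items():
    print(table, "delayed by", delay)
```

Each delay could then be reported as a custom metric, so that the delays show up on the same dashboard as the server counters.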
Brent Ozar: And all of that is just $15 a month?
Maria Zakourdaev: Yeah, this, yes. We also have agents on our servers and we’re paying $15 per server, but those custom monitors, if you’re running them from the same machine – so there are different Python processes in our company, but you’re paying only for – because Datadog sees this as only one machine that is sending the metrics and, you know, sending metrics of different types of databases and different types of activities.
Alright, so let me continue. Another thing that I wanted to show you – there are many fun projects that people do with Datadog. For instance, in the Datadog headquarters, they have dashboard of restrooms availability, which is apparently a big building and they have a problem. And then engineering have a keg of beer and they’re getting alert when the beer temperature is getting warm and then giving recovery when the beer gets cooled in the kitchen.
Brent Ozar: Nice.
Maria Zakourdaev: Yes, and then there is another person who took the monitoring everything, he took the whole monitoring to a completely different level using their smart home devices and he has a dashboard of the lights on and electricity that he uses and lights in every room. This is just an example of, you know, that you can monitor a lot of things with Datadog.
Brent Ozar: My grandpa would have loved that because he was always telling us to close the doors, turn the lights off, you know, yeah, he would have been all over that.
Maria Zakourdaev: Right? So in addition to server level monitoring, there are more groups of monitoring tools. For instance, application performance monitoring. That’s a set of tools that learn your code; they usually know all the industry programming languages, they go and learn everything that is happening in your web application or mobile application and correlate the usage of classes and objects, which database objects they trigger, what the query performance was, and sometimes the wait stats. This is just a whole pipeline of performance, which is sometimes very interesting to see.
This is one of the – this screenshot was taken from Redshift – from New Relic, sorry, which is one of the leaders in this category. It supports .NET and Python and Ruby and you can know what’s happening in the application environment and it combines metric from a mobile browser applications. It can even benchmark every deployment against the previous ones and code changes, dependencies, configurations, everything.
AppDynamics is a – this is a screenshot from AppDynamics, and is another example of application performance SaaS tool. In a similar way it helps you to monitor every single line of code and every user or every database transaction that comes from the code across all this tech in real time. If you go to SQL Server level applications, the tools like IDERA and Redgate, and the only one tool that I have found which is completely SaaS tool is Israeli startup AimBetter, and they applicate the SQL Server – application level of SQL Server, and they provide capabilities to identify bottlenecks in your database environment and also they provide database support, no additional cost if anyone’s interested. But they do a very good job and it’s completely SaaS product.
Here’s Dynatrace, another giant in the application monitoring world. This is powered by artificial intelligence and people say it has very great way to analyze a million dependencies, identify problems, and has robust integrations and API. SolarWinds we all know, leading provider of powerful IT management software, and this is a screenshot of the database performance analyzer. It supports Oracle and SQL Server, MySQL, but unfortunately it’s an in-house product. It can monitor in-house, virtual machines, everything. But a few weeks ago they have acquired Pingdom and launched it as a SolarwindsPingdom server monitor, and this is completely SaaS offering.
Brent Ozar: I didn’t know that.
Maria Zakourdaev: So everyone who likes Solarwinds, just go and check it. Pingdom can perform different deep server monitoring but you see, this is the amount of database that it can monitor, and SQL Server is not one of them, but I hope it’s getting there.
So another tool that I really like is IDERA Diagnostic Manager. I have used it for several years, it’s an awesome in-house product, and it appears that they also have an uptime cloud monitoring SaaS tool, which can monitor at the server level and also collects data on databases like SQL Server, MongoDB, MySQL, Postgres, …, Redis and more.
Redgate has a great set of tools but I haven’t found any SaaS offering in each product. This is solely for specific database. SQL Server, MySQL and so on. Are there any questions?
Brent Ozar: No, you’re doing really good. Everybody was really glowing about the cupcakes. Everybody really wanted – making everybody hungry, especially this time in the morning. Yeah, absolutely.
Maria Zakourdaev: Alright, so we continue. SentryOne have a hybrid cloud platform that allows you to access your performance data from browser from everywhere, and you can control which information gets into the cloud and which is not. You also have an option to alias the names of computers, servers and databases before they’re synced and they have interfaces for mobile and phones and tablets. This is very cool offering.
Another group of tools that I’m going to talk about is end user experience tools, or as they are sometimes called, synthetic transaction tools. They give you the ability to simulate user transactions and test their performance. By using synthetic transactions, we are able to monitor multistep web-based transactions, measure performance and availability before they impact end users, and predict performance from any location in the world. When I worked in [unintelligible] years ago, I think I have built the first set of tools to script web transactions and simulate them. It’s called LoadRunner, and you know, when you’re trying to test some website, the first site that comes to mind is Google.com, and we got a lot of calls from Google asking our QA to stop stressing their servers because, you know, they can put 50 transactions at once or 500 transactions doing the same search and it’s just been – the infrastructure wasn’t built for it.
Exoprise is one of the SaaS offerings that can do synthetic transactions for you. They can find problems before they impact real users and can run real stress tests over your databases.
Brent Ozar: And they do this live, I’m guessing? They’re actually injecting live.
Maria Zakourdaev: Yeah. Okay, anomaly detection is another set of tools that I’m going to talk about and why do we need anomaly detection software? Because those tools use complicated machine learning algorithms, they learn your data, normal behavior. You see here this is Anodot, we’re using Anodot for this purpose. You just stream your data, for instance, the job that I showed you earlier, the job that takes data from server to Elasticsearch, it could stream it – instead of Datadog, we could stream it here. And then I would avoid alerts on some days when there’s not a lot of data, and then during the rest of the days there are a lot of data being streamed. And you just can’t put a regular threshold on such thing. And for those tools, they learn your data normal behavior, they identify abnormal behaviors. Here these grey areas, that’s what they learned about your data, your metrics. They score each incident, correlate. You see here different metrics and they can alert you. It has many applications in business like identifying strange network patterns and traffic or that could signal a hack or drops in clickstream that can identify problems with site accessibility or bugs or after application deployment.
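To see why a single fixed threshold does not work for a metric with quiet days and busy days, here is a deliberately simple sketch that learns a separate baseline per weekday. It is not Anodot's algorithm, which is not described in the session; it only illustrates the idea of comparing a value against learned normal behavior. The row counts and the three-standard-deviation tolerance are assumptions.

```python
from statistics import mean, stdev


def build_baseline(history):
    """history: list of (weekday, value). Returns {weekday: (mean, stdev)}."""
    by_day = {}
    for weekday, value in history:
        by_day.setdefault(weekday, []).append(value)
    return {day: (mean(vals), stdev(vals)) for day, vals in by_day.items()}


def is_anomaly(baseline, weekday, value, tolerance=3.0):
    """Flag values more than `tolerance` standard deviations from that day's mean."""
    avg, spread = baseline[weekday]
    return abs(value - avg) > tolerance * spread


# Made-up row counts: busy Mondays (weekday 0), quiet Saturdays (weekday 5).
history = [(0, 1000), (0, 1100), (0, 950), (0, 1050),
           (5, 100), (5, 120), (5, 90), (5, 110)]
baseline = build_baseline(history)

print(is_anomaly(baseline, 5, 105))  # a typical Saturday volume
print(is_anomaly(baseline, 0, 105))  # the same volume on a Monday
```

The same value of 105 rows is normal on a Saturday and anomalous on a Monday, which a single static threshold cannot express.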
I will show you the Anodot. Okay, I have here some metric that I’m streaming – this is a test environment that I’m streaming into Datadog and it’s really normal. It’s clicks, and I go to anomalies, and I can see here that Anodot has identified a few anomalies. And it shows me here in Anomap which KPIs were impacted, and I can see that three countries were impacted by the same anomaly, and different ad sizes. This is an example from the ad tech world. Only for DoubleClick exchange, ad type display, but all of them were noticed on some specific version of the application. And, as I said, microservices talk to each other, and you can stream into Anodot not only your metrics, but also, for instance, if you have Jenkins that you use every time the application code gets changed, it can stream events here saying that you have a new version. And then it can correlate some anomalies with the version. That’s Anodot. This is an awesome tool.
Alright, another group is log analysis tools. These days, data is recorded and logged for just about everything. It can be web server uptime, database uptime, database queries, database event audits, user clickstream, .NET errors, everything. You know, I also have huge application logs. You can log just about everything, and it’s very important to have a log management service in your monitoring tool belt. Those tools have very efficient log analysis capabilities and for instance – that’s what they do, they collect log data, they clean the data and index it, and you can – I will show you the screenshot. You can do analysis, visual analysis of your logs. For instance, this is Logz.io, which is a cloud offering, a SaaS tool, which is Elasticsearch and Kibana, ELK. We’re using ELK in-house but I think we’re going to check this one and move over there instead of managing the Elastic cluster just for ELK.
But you can see here like, some web server access log from which country is the people came in and which browser and which web and you also can have here which – if you log which stored procedure was accessed or query performance, you can see everything here. Everything that you have on the log you can find here and you have – usually you have free text search over the log data, which is much more useful than just going through the huge text files.
And Datadog have acquired Logmatic.io and now on the dashboard, in addition to CPU and system load and memory, you can also put queries logs if you keep some – some stored procedures even doing, you know, sometimes you’re logging into the table, sometimes you can log into files. So that’s the Datadog. And LogRhythm is another log analysis tool which has awesome dashboard, and Splunk is very well-known monitoring tool. They have now cloud offering and the customer said that the real-time dashboard and visualizations are awesome. And it, you know, can monitor everything that you log.
Okay, but the main highlight of all this monitoring tool discovery was the discovery of incident management tools. This was awesome. I think as soon as we streamed all of our alerts into such a tool, it completely changed our perception of incident management. And that’s what those tools are for, to translate alerts into action, to – you know, to – especially if you’re leveraging multiple tools, each tool for maybe a different database. And aggregating alerts into one incident management tool can pay huge benefits. Like, do you – we used to send emails from SQL Server alerts and I don’t know, do you have a filter on email that takes all SQL Server alerts and puts them into some separate subfolder? Because I did.
Brent Ozar: Everybody does this.
Maria Zakourdaev: It makes sense. I didn’t want to miss business emails, and it looks nice, you know, it’s a way of sorting the email, but that’s an easy way to miss an alert. And now what we do, we’re sending all those emails into the incident management tool. And what these tools can also do: they can prioritize your incidents, they can assign incidents to a specific person, trigger escalation, and they allow collaboration. I will show you the demo in a second. They can manage the on-duty schedule, wow, that was – we were managing on-duty in Outlook, I had a calendar for each person who is on duty and it always got mixed up and was just not comfortable, and the tool solved this problem. It can combine multiple alerts. You know, if you get 100 SMS from different systems, it can tell you about one problem, not 100. It can use different notification methods, for instance it can call you at night. And alert analysis, wow, I will show you. This is amazing. As a manager, it’s really important for me to look at a period and understand whether we got more alerts than last month or fewer, and what my team’s SLA is. And usually, the regular incident management flow is like this: when the incident is opened, when some system sends an alert, only the person on duty gets the alert. We used some mail-to-SMS service before, and the whole team was woken up at night by this SMS. All kids, all families, all people in the team woke up at night. You know, there was an on-duty person who was supposed to take care of it, but still everyone got the SMS. So this tool takes care of that, and if no one acknowledges the event, if the on-duty person does not acknowledge the event, the incident gets escalated after some period of time. The tool can notify the customers if something is wrong with their infrastructure, and when the problem is fixed, it can document the root cause and, you know, the rectifications that you did. The incident is closed and the tool notifies everyone, the customers or the team, that everything is good.
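The idea of combining many alerts into one incident can be sketched as a simple grouping step. This is not how PagerDuty or any other product implements it; the grouping key of source plus check, and the sample alerts, are assumptions made for illustration.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts sharing (source, check) into one incident each."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = (alert["source"], alert["check"])
        incidents[key].append(alert["message"])
    return dict(incidents)


# Hypothetical alerts arriving from two monitored systems.
alerts = [
    {"source": "sqlserver01", "check": "disk_space", "message": "disk 91% full"},
    {"source": "sqlserver01", "check": "disk_space", "message": "disk 92% full"},
    {"source": "sqlserver01", "check": "disk_space", "message": "disk 93% full"},
    {"source": "es-cluster", "check": "node_down", "message": "node 3 unreachable"},
]
incidents = group_alerts(alerts)
for (source, check), messages in incidents.items():
    print(f"{source}/{check}: {len(messages)} alert(s)")
```

Four alerts become two incidents, so the person on duty is notified about two problems rather than receiving four messages.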
And this is PagerDuty. We’re using this tool in Perion and let me show you – instead of this, I will show you the real tool. Okay, that’s PagerDuty, and this is all events that were triggered and if we go for some specific event, okay, I can see which tool have triggered it and I can see timeline when the event was triggered and who was notified and if anyone acknowledge the alert and when the event was closed. Sometimes in the morning I just go through closed alerts to understand how long – which alerts – just sometimes monitoring tool opens an alert and it closes it. So we don’t even get notification. But from here you can see at least how long was the outage, if there was outage, or maybe I need to tune the threshold if the alert was opened for a few minutes. And I can search here for similar incidents, maybe to talk with the people who already – who dealt with the same problems, or maybe there is some – you know, maybe something in the alert collaboration part in the notes, something that can help me to solve the problem.
This is what – they have schedules, as I said, on duty; this has totally changed our work because you put all the people here and, you know, the system knows to whom to send the SMS or who gets a phone call in the middle of the night. And you define escalation policies, meaning after an incident is triggered, you notify this person because he’s on call, and after 30 minutes someone else is notified, or you can send email – I send email to the whole team, hoping that – if on duty doesn’t take care of it for, like, an hour, we just alert everyone. I don’t care about sending the mail to the manager, I just want someone to take care of the alert. But still, at the beginning, it gives the other team members a chance to sleep.
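An escalation policy like the one described can be thought of as a list of delays and targets. The sketch below is a hypothetical model, not PagerDuty's configuration format; the 30 and 60 minute steps follow the description above, and the target names are placeholders.

```python
def who_to_notify(minutes_unacknowledged, policy):
    """Return every target whose escalation delay has already elapsed."""
    return [target for delay, target in policy if minutes_unacknowledged >= delay]


# (minutes after trigger, who gets notified); the names are placeholders.
policy = [
    (0, "on-call DBA"),
    (30, "secondary on-call"),
    (60, "whole team"),
]

print(who_to_notify(0, policy))
print(who_to_notify(45, policy))
print(who_to_notify(90, policy))
```

As long as the on-call person acknowledges the incident within the first 30 minutes, nobody else is disturbed.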
Brent Ozar: How long did it take you to set up all the – between the services and the teams and the rules, like, how long does it take to set up this kind of thing?
Maria Zakourdaev: It doesn’t take long. You know, you do it gradually, but this is a very easy system. You know, just takes you two days to learn, this is very easy system and you know, it’s not so long. Each tool, you get used to learning it.
Brent Ozar: Cool.
Maria Zakourdaev: The analytics thing is very interesting. This is – you know, especially for a team manager, but even for – you know, just for a team member, you can see how you’re doing. I wanted to see this. Something nice here and I will do, like, three months. I don’t have data for three months. Let me see one month. This is a testing environment. It takes a lot of time so I won’t spend time on it. I will show you this screenshot. If we’re talking about SaaS tools’ pros and cons, you know, sometimes there are not only positives, there are cons as well. Internet performance. But you can see every day the amount of alerts that you’ve got, and you can compare whether you have the same amount or something weird is going on. You know, for the last days, and if you need to measure SLA, you can see the mean time to acknowledge, how long it takes for people to acknowledge the event after getting the alert, and how long it takes them to resolve an alert. And, you know, a comparison to the last period of time. This is very nice.
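The two SLA figures mentioned here, mean time to acknowledge and mean time to resolve, are plain averages over incident timestamps. A small sketch with made-up incidents follows; the field names are hypothetical and do not correspond to any tool's export format.

```python
from datetime import datetime


def mean_minutes(incidents, start_field, end_field):
    """Average gap in minutes between two timestamps across incidents."""
    gaps = [
        (inc[end_field] - inc[start_field]).total_seconds() / 60
        for inc in incidents
    ]
    return sum(gaps) / len(gaps)


# Made-up incidents for illustration only.
incidents = [
    {"triggered": datetime(2018, 1, 1, 2, 0),
     "acknowledged": datetime(2018, 1, 1, 2, 5),
     "resolved": datetime(2018, 1, 1, 2, 35)},
    {"triggered": datetime(2018, 1, 2, 14, 0),
     "acknowledged": datetime(2018, 1, 2, 14, 15),
     "resolved": datetime(2018, 1, 2, 15, 25)},
]
mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(mtta, mttr)
```

Computing the same two numbers for the previous period gives the comparison described above.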
And another tool that is great is OpsGenie. OpsGenie is another player in this category, they have awesome dashboard and it’s a similar solution, and they have playground. If you want, you can just go to the site, enter the playground, just over here, and just see some environment and play with it to understand if it’s suitable for your company or not. You see alerts and you know, which tool have sent the alert into OpsGenie. And you can see who’s on call from the different teams. And what else? You can see teams here. It is very similar, but all of them – you’ll just – I think when we started to stream data into the incident management tool, everything started to be manageable and understandable. And I was sure that I’m not going to miss any alert.
Alright, now we also have – I also have this category of notification/collaboration. Not sure it is – I think it is part of a monitoring strategy. Many companies these days use Slack for collaboration between teams. It’s a very nice messaging service, and many tools from those categories have collaboration features already embedded. But I believe there is a place for another category, because Slack’s main idea is to send the alert notifications to a channel, which can contain many members. You can create channels for different servers, for applications, for projects, and you can have conversations over there among those who received the alerts. What we usually do, we’re also streaming our alerts to other teams’ – other applications’ – channels to notify them instead of sending email. And they can see who’s taking care of the alert and they can collaborate about the alert, and PagerDuty can also notify them when the alert gets resolved.
After having all those tools, the most important thing is to have integration between them, of course. And just as with different database solutions, where at some point you will have the same data replicated across different solutions, it’s the same here. But I think that’s still good. Again, take my job example, the job which moves data from technology to technology: I can send it to Datadog and I can send it to Anodot and just get different points of view on this job’s performance.
So alright, we’re getting to the last part. We have time. In order to make your monitoring strategy work, I’ve written here a few insights that I think are very important. It’s my to-do list for making your monitoring strategy work. First of all, all alerts should be actionable. False positives are the major cause of stress for whoever is on call. Nothing is worse than having your phone wake you up in the middle of the night several times to tell you something that either doesn’t matter or can wait until the morning anyway. Before creating an alert, you should ask yourself: does it matter? Can you do anything about this? Or is it informational only, something you just want to know? Then just send a report, and do not call it an alert. Informational things don’t go into the same systems.
And you need to prioritize alerts. Can this wait until later? Then this alert’s priority is low. Don’t use the same notification rules: for a low-priority alert, you’re not getting called in the middle of the night. High-priority alerts should be taken care of as fast as possible. Low-priority alerts usually can wait until you get time to fix the issue. And alert tuning: this is constant work on tuning thresholds. This work never ends. We always review our thresholds and the alerts that we have received over some period of time to make sure that the thresholds are correct, and I always find something that is missing or requires changes. This is very important to decrease noise and give important alerts high visibility.
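The idea of separate notification rules per priority can be sketched in a few lines. This is only an illustration: the rule table, the channel names, and the night-hours window are assumptions, not settings from PagerDuty, OpsGenie, or any other tool named in the talk.

```python
from dataclasses import dataclass

# Hypothetical notification rules, one per priority level.
NOTIFICATION_RULES = {
    "high": {"channels": ["phone", "chat"], "notify_at_night": True},
    "low": {"channels": ["chat"], "notify_at_night": False},
}


@dataclass
class Alert:
    name: str
    priority: str  # "high" or "low"


def channels_for(alert, hour):
    """Return the channels to notify for this alert at the given hour (0-23)."""
    rule = NOTIFICATION_RULES[alert.priority]
    is_night = hour < 7 or hour >= 22  # assumed quiet hours
    if is_night and not rule["notify_at_night"]:
        return []  # low priority: hold until morning instead of waking anyone up
    return rule["channels"]


print(channels_for(Alert("replication lag", "high"), hour=3))
print(channels_for(Alert("disk 70% full", "low"), hour=3))
```

Keeping the rules in one table makes the periodic review of priorities a matter of editing data rather than code.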
Incident analytics we already talked about. I think this is a very important part of understanding whether you have the right number of alerts or a lot of things that are not really important. And you should document everything: types of alerts, thresholds, when you set them, and why you set them to that value. A few weeks ago we found one CPU alert which was supposed to fire only after CPU was above 99% four times in a row, that is, after four hours. So we changed this from four hours to one hour, and then we started to get tons of alerts, because it turned out that the data warehouse team had some processes that were utilizing Redshift for long stretches of time at almost 100%, and because it always happens, they didn’t want to get that alert. Of course that job should have been taken care of, but on the other hand, if someone puts something weird in a threshold, it should be documented somewhere.
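A rule like the CPU alert in this story, which fires only after several consecutive samples above a threshold, can carry its own documentation. A minimal sketch, assuming hourly samples; the class, field names, and sample values are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str
    threshold: float
    consecutive: int  # how many samples in a row must breach the threshold
    reason: str       # why the rule is set this way, so nobody has to guess later
    set_by: str


def should_alert(samples, rule):
    """True only if the most recent `rule.consecutive` samples all exceed the threshold."""
    if rule.consecutive < 1 or len(samples) < rule.consecutive:
        return False
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


# Hypothetical rule mirroring the story: four hourly breaches in a row.
strict = ThresholdRule(
    metric="redshift_cpu_percent",
    threshold=99.0,
    consecutive=4,
    reason="Long-running warehouse jobs keep CPU near 100% for hours",
    set_by="data warehouse team (assumed)",
)
# The same rule after shortening the window to a single hourly sample.
relaxed = ThresholdRule("redshift_cpu_percent", 99.0, 1, "window shortened", "DBA team (assumed)")

hourly_cpu = [72.0, 99.5, 99.7, 99.9]
print(should_alert(hourly_cpu, strict))   # only three breaches in a row
print(should_alert(hourly_cpu, relaxed))  # the latest sample alone is enough
```

With the `reason` field filled in, an unusual window stops looking like a mistake to whoever reviews the thresholds next.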
Another thing is who is monitoring the monitoring system. Do you know whether someone disabled that important alert, or whether some server is muted or put in maintenance mode? In some systems, like Datadog, you can sometimes configure it so that if someone changes or modifies an alert, you get an alert on that. In another system, we had a list in the database, a list of alerts and systems, and once in a while some process went and made sure every alert was in there, not muted, and running. And if not, guess what? Another alert. Manual or automatic, you just need to keep that in mind.
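The periodic check described here, comparing a list of expected alerts with what is actually configured, might look like the sketch below. The monitor dictionaries are an assumed shape for whatever the monitoring system returns; this is not the Datadog API or the process the team actually ran.

```python
def audit_alerts(expected_names, monitors):
    """Return (alert name, problem) pairs for alerts that are missing, muted, or disabled."""
    by_name = {m["name"]: m for m in monitors}
    problems = []
    for name in expected_names:
        monitor = by_name.get(name)
        if monitor is None:
            problems.append((name, "missing"))
        elif monitor.get("muted", False):
            problems.append((name, "muted"))
        elif not monitor.get("enabled", True):
            problems.append((name, "disabled"))
    return problems


# Hypothetical inventory (the "list in the database") and current monitor state.
expected = ["sql-server-cpu", "mysql-replication-lag", "redshift-disk"]
current = [
    {"name": "sql-server-cpu", "enabled": True, "muted": False},
    {"name": "mysql-replication-lag", "enabled": True, "muted": True},
]

for name, problem in audit_alerts(expected, current):
    print(f"ALERT: monitor '{name}' is {problem}")
```

Each reported problem would itself be raised as an alert, which is the "guess what? Another alert" step.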
And another thing: if you find yourself doing the same things over and over, maybe you should automate your response. The incident management tools can run your responses even from your phone. If you are, again, woken up at night and you know what you need to do, record it, and they can just run the automatic response. And monitoring tools are only as good as you make them, and I will emphasize this again. If you and your team are responsible for the company’s database infrastructure, you need to make sure that everything is under control, and again, any malfunction should be taken care of as fast and as efficiently as possible. Remember, application response times are very important, because otherwise the company is going to lose clients. Sometimes this means widening your area of responsibility and reaching out to the application team to say, “Hi, you need an anomaly detection tool for the application data flow,” or, “By the way, synthetic transactions are missing from your QA plan, your quality assurance plan.” Sometimes you just need to take ownership of that as well. And thank you for listening.
Brent Ozar: Nice. Now, if people want to go learn more about these different categories, the tools, and monitoring in general, where would you suggest that they go?
Maria Zakourdaev: This presentation has a lot of names, and I will probably write a blog post summarizing all this, but you can just go to Google, type in the type of tool, and learn from there.
Brent Ozar: Are there any blogs or anything that you subscribe to? Any newsletters that focus more on monitoring?
Maria Zakourdaev: Not something specific.
Brent Ozar: Cool. Well, thank you very much, nice talk. I was like – the whole time I’m watching I’m waiting for Richie to go, “We need this tool. We need that tool.”
Maria Zakourdaev: Thank you.
Brent Ozar: The angle of how to monitor everything, there really are so many categories of tools out there. It’s pretty impressive.
Maria Zakourdaev: Yeah.
Brent Ozar: Alright, well thanks for talking to us today Maria.