Target Audience: Anyone looking to understand basic statistics or get a refresher.
We can be better at our jobs if we have a good grasp of basic statistics.
It doesn’t matter if you’re a DBA looking to understand query plan performance, a data warehouse person needing to come up with ETL load time estimates, or an analyst needing to report figures to managers. Statistics can help you all.
If only maths classes hadn’t been so darn boring!
Instead of going all mathsy, we’ll be doing some real-time data capture and taking an intuitive and visual approach through summary statistics right up to understanding how to produce simple predictive models.
By the end of the session, you’ll understand concepts like sampling, error, regression, and outliers – important day-to-day stuff and a great base upon which to build. By the end, you’ll wonder how people could have made it seem so hard for so many years.
Why I Want to Present This Session:
Thinking critically about numbers is important.
Brent Ozar: In the next session at GroupBy, Steph Locke will be talking about Stats 101. Take it away, Steph.
Steph Locke: Hi everybody, thanks for joining me today. If you haven’t already, whilst I’m going through this introduction, please take the time to fill out our live survey. So that’s bit.ly/stats101survey, and the data is currently populating this in real time. So, every 30 seconds, this is going to update. And we’ll be using it to explore some statistical concepts, and I’ll make the data and the dashboard available for you to use later so that you can play with the concepts.
What we’re going to do today is look at some of the fundamentals; that when you’re doing statistics, machine learning, data science, data mining, it’s all very much the same thing. So, these fundamental principles and concepts that we’re going to start talking about, they’re a gentle introduction to all of those things. Data science is just the cool new word. Data mining was the cool new word previously, but it all boils down to a lot of the same things. So, we’re going to go through those basics.
Before I carry on, whilst we’re waiting for you all to finish filling in the survey, a little bit about me; I’m Steph, I run my own data science consultancy in the UK. I focus on helping organizations and people get started. So, I’m not going to take over everybody’s jobs with a crazy AI, Skynet box. I’m going to help you implement something that saves your customers money, ups their sales, helps you reduce energy in your data centers. Things like that, that are first projects. Help you get the skill and do things like this for you.
I’m a Data Platform MVP. I’m still, two years in, quite surprised at that, but I guess I get it for a lot of my community stuff. So, I write a blog, ItsALocke.com/blog. I Tweet, I present all over the place. Coming up I’ve got London, Cambridge, Katowice, Utrecht and Copenhagen; and remote to the US. So, there is a lot of nice jet setting about and chatting to people around the world. So, if you want to follow up with me, you can check out my blog and my website, ItsALocke.com. Puns are mandatory in my world, so lots of Locke puns. And Tweet me – so my personal account, StephLocke, my company account LockeData, and I’m doing a giveaway every month now. So, every month on LockeData, there is a book giveaway to one follower. So, if you want a data science book, it’s a good way, potentially, of getting one for free.
Okay, let’s see how many people have filled this in… Woohoo, 88. So we have some decent amounts of data. What our data looks like – for people who haven’t filled in this thing, we have a timestamp as our unique identifier. I didn’t want to capture any personal data of yours. We have your focus; so, we have BI, a bunch of others – people supplied different values, but as you can see, there aren’t a lot of them, and we’ll have a look at why I’ve grouped those into Other. We have ages, so we have everybody from 22 to 65. We’ve got people on a variety of continents. We have one person, though, who is not from North America, Europe or Asia. So, thank you very much for showing up and helping me keep the Other category going. We asked if people were happy with their jobs, and this will be used to have a look and see if there’s any difference in whether younger people are unhappier. And social, so what kind of things are people using? You know, people, email is not social. Email is anti-social.
Okay, so that is our overall data. We’ve got 89 records. 89 records is a bit small, for doing statistics in a robust way, but it is a good enough data set to be showing you some of the principles. So, once you’ve captured your data, the first job is to have a look at how the data is distributed; see what sort of summary statistics you have. So, there’s going to be a lot of working with age here, so we’re going to look at all our different factors in relation to how age is impacted.
Our first bit of statistics – so, hopefully, you all know what the mean is. The mean is where you take the sum of all values and divide them by the number of values. So, this creates a single value that represents the average of a set of values. Now, standard deviation you might not know. The standard deviation is, on either side of the mean, how far out do you have to go to get the majority of the data? So where do about 68% of all people show up? So, let’s see if this works. I might have to change this around a bit. I’m just going to do some sketching… Okay, so imagine you had everybody’s ages. We’ve got – can everybody see that? We’ve got everybody from age 22 to 65. Now, if we had lots and lots of data, we might assume that most people are going to be around a certain age. Based on when I originally looked at our data, I think that’s probably going to be around 40, suggesting that DBAs are a fairly old breed; because we have lots of DBAs today.
So, 40 will be our mean. And then, if our data follows this kind of bell curve approach, we’re going to find that probably most of our people are going to be between 30 and 50. So this is what would happen if our data were perfectly normally distributed. What that means is that we have as many people on the young end as we do on the old end. Now, that’s quite unlikely for us, because how young could people possibly be when they start as a DBA, or a BI person, or a dev?
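The mean and standard deviation Steph sketches here can be written out in a few lines. This is a Python sketch (the session itself uses R), with made-up ages standing in for the live survey data:

```python
import statistics

# Hypothetical ages, standing in for the live survey responses.
ages = [22, 28, 31, 35, 38, 40, 41, 43, 47, 52, 58, 65]

mean_age = statistics.mean(ages)  # sum of all values / number of values
sd_age = statistics.stdev(ages)   # sample standard deviation

# For roughly normal data, about 68% of values fall within one
# standard deviation either side of the mean.
lower, upper = mean_age - sd_age, mean_age + sd_age
within_one_sd = [a for a in ages if lower <= a <= upper]

print(round(mean_age, 2), round(sd_age, 2))
print(len(within_one_sd) / len(ages))
```

With this particular made-up list, 8 of the 12 ages (about 67%) land within one standard deviation of the mean, which is close to the 68% rule of thumb for normal data.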
Brent Ozar: Okay, we’ve got a bunch of us wondering in Slack, we just couldn’t hold back any longer. How are you doing that ruler on the screen? That’s amazing.
Steph Locke: So, this is Windows Creator update; so, if you go to your taskbar around the system tray, you’ll probably see a pen with a squiggle icon.
Brent Ozar: Keep talking, I’m on a Mac, I won’t be able to say yes. I’ll smile and nod because they want to know.
Steph Locke: Hit the squiggle icon and then it’ll open up a notification page, and you have a sketch option, which is fantastic, and this screenshot capability, which is also pretty nifty. And then it has a ruler and a protractor and it’s really great for making annotations and playing around with things.
Brent Ozar: That’s amazing. Yeah, everybody’s impressed with that. Alright.
Steph Locke: So long as people take something away from my session.
Brent Ozar: Yeah, right, data science is incredible. It puts rulers on the screens.
Steph Locke: Okay, so let’s say that we think most people are probably going to be 16, at the youngest. So, in the – oh god, I’m now so old in the UK that it’s not even accurate anymore that 16 is when you leave high school. Now you leave high school when you’re 18 in the UK. So, we’re not going to get any data going along here. It’s going to stop by here. But, potentially, people could carry on working past retirement.
So, our data could carry on because people are living longer, they’re working longer. So, we might actually expect, instead of it to look like this really nice curve, it might look something a bit more weighted towards the older side. And that older – that’s called skew. That’s what happens when data isn’t distributed symmetrically around the mean. And we’ll see if that’s the case with our data going forward. So, I’m just going to save this… Okay, so that was normal data, normal distributions. Any questions on this?
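The skew Steph sketches can be quantified. One common measure is Pearson’s moment coefficient of skewness: positive when the right tail is longer (weighted towards older ages), negative when the left tail is. A Python sketch with hypothetical, right-skewed ages:

```python
import statistics

# Hypothetical ages with a long right tail (a few older workers).
ages = [25, 28, 30, 31, 33, 34, 35, 36, 38, 40, 45, 52, 60, 68]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)

# Pearson's moment coefficient of skewness: the average cubed
# deviation from the mean, scaled by the standard deviation cubed.
n = len(ages)
skewness = sum((a - mean_age) ** 3 for a in ages) / (n * sd_age ** 3)

print(round(mean_age, 2), median_age, round(skewness, 2))
```

A handy rule of thumb visible here: with right skew, the mean sits above the median, because the long tail drags the mean up.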
Brent Ozar: No, nothing so far. Everybody’s still just dumbstruck by the ruler thing. They’re all going all over…
Steph Locke: Well, to keep you dumbstruck and not asking me hard questions, this is something, I think, Chrissy LeMaire showed me. I do a lot of stuff with Git source control and things, and I’m constantly bouncing around lots of nested structures, and I often need to get into the command line in one of these directories. Changing directories from, like, System32 or User Documents is really hard. If you go to the address bar here, select that, and just type CMD and hit enter, you get a command prompt at that directory. There we go, another pro tip to keep you shocked.
Brent Ozar: And also, JD Walker says he’s, “Not dumbstruck by the ruler but he is dumbstruck by 40 being old,” By which we can read into JD Walker is at least 40.
Steph Locke: It’s old in the context of IT. In terms of average age in IT departments, or at least in my experience, it’s usually been between 30 and 35. So when it’s older than that, that suggests probably that you don’t have many developers in your group. DBAs tend to be older, I think because as you get older you get wiser; so, you get more care and consideration for saving your data and keeping it safe. That’s my hypothesis anyway.
Okay, so let’s have a look at where our data sits in terms of averages and standard deviations for some of our variables. Okay, so we had 20 people saying they’re not happy with their jobs, and you can see that the averages here are around 41 and 43. So what that is saying is that this is 43, and in the no category, to capture the majority of our people – so, there are only 20 people, and about 70% of them is about 14 people – we’ve got ages ranging from 28 to 58. So, we have a fairly broad range there to capture our data. But 20 people is not a big sample, so the range has to be quite inclusive. We would expect, hopefully, as we get more people submitting data, that this range will decrease; hopefully, we’d have more people clustered around that mean.
Pinal Dave: So just a quick question – this is a graph of what?
Steph Locke: So, if we’ve got all the age values and we put them along the X axis – so we had 20 down here going across to 65 – then it would be one height for every person; so, the higher the peak, the more people at that age. So, this is a smoothed version of a histogram. A histogram is where you take lots of slices – so, you might do every five years or every two years – and you would do a count. This is actually how SQL Server statistics work: if people are thinking about histograms there, they’re thinking of a single column, where its distribution is going to produce these red square lines, basically; SQL Server slices up the data based on some values and counts the frequency, the number of things in each slice, and then…
Pinal Dave: When you said it this way, it’s so familiar now, I suddenly understood everything, and I’m sure a lot of database professionals, somehow, immediately related to how important statistics are for SQL Server and querying performance, so thank you; please continue.
Steph Locke: Not a problem. And I really hope I explained how SQL Server statistics actually work reasonably okay, because it’s not something I ever touch aside from, like, UPDATE STATISTICS. So hopefully that was okay.
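The slicing-and-counting Steph describes is easy to sketch. This Python example (again standing in for the R used in the session, with made-up ages) bins ages into five-year buckets and prints a crude text histogram:

```python
from collections import Counter

# Hypothetical ages; bin them into five-year buckets, the way a
# histogram (or a SQL Server statistics histogram step) slices data.
ages = [22, 24, 27, 31, 33, 34, 38, 41, 42, 45, 49, 55, 58, 65]

bin_width = 5
# Map each age to the start of its bucket, e.g. 33 -> 30.
buckets = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(buckets):
    print(f"{start}-{start + bin_width - 1}: {'#' * buckets[start]}")
```

This also shows the information loss Steph mentions later: once binned, you can no longer tell whether the three people in the 30–34 bucket were all 31 or spread across 31, 33, and 34.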
So, coming back to our data, we can see we have a lot more people who are happy with their jobs, and they are a lot closer together. But now you can see why we might have had to go a bit further out in our data, because somebody has very kindly been age 99. So that person – and it’s always good that somebody does this, because otherwise I would never be able to raise the concept – brings up the concept of outliers. An outlier is a value that doesn’t look the part. It might be too high or too low. So, when I was working for a loans company, most people’s salary – this is in the UK, so times by two for dollars; well, it used to be times by two, pre-Brexit – would be 15,000 pounds to 75,000 pounds annually; that was our most common range. We had data quality tests to check for these things, and then one day we got a footballer, or soccer player. If I do fail to translate some of my Briticisms, please call me on it. So, we had a soccer player who was taking a loan with us but had an annual income of a few million pounds. So, a small loan, very large income. And he became an outlier and an exception to our data. So, most people would be on this curve from a small value to 75,000, and we basically had to extend our curve to be able to allow him to be inside our system.
When we’re doing statistics, we need to look for those people and decide how we want to treat them. So, the person could be genuinely 99 years old, but when we start looking at this, they’re going to be making it more difficult to predict inside our normal range of values, because whenever we’re doing the math behind the scenes, it’s going to go yes, but you have to take into account this person all the way over here. So that dilutes our ability to predict people in the main set of data.
So common ways of dealing with such people are excluding them or capping values. So, you might say, since the retirement age is 65, used to be, any value over 65, we’re going to replace with that high-end value, which has its own implications when we do things. But the plus side, our min and max are kind of in the same ballpark for yes and no. Looking at our locations – now when we’re building models, we’ve got most of our data in North America, reasonable proportion in Europe and not a lot in Asia and other. So, we have very small samples for those cases, and we wouldn’t be able to predict for future, people from Asia, their age very effectively. Because we’d be going, okay we’ve got Pinal over there at 38, this guy over there for 31, this guy over there for 35 and they could be what’s called unrepresentative.
They’re just three people; that’s not a lot of information to go on… We’d probably say, in those cases, I don’t want to predict somebody’s age, because you might just as easily get somebody over 40 coming along while we’d only ever predict inside the 30s.
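The two treatments Steph names for values like the age-99 response – excluding them or capping them – look like this in a Python sketch (hypothetical ages; the cap of 65 is just the old retirement age she mentions, not a recommendation):

```python
# Hypothetical ages, including one outlier at 99.
ages = [28, 33, 35, 41, 44, 50, 58, 99]

cap = 65
# Option 1: cap (winsorise) - values over 65 are replaced with 65.
capped = [min(age, cap) for age in ages]
# Option 2: exclude - values over 65 are dropped entirely.
excluded = [age for age in ages if age <= cap]

print(capped)    # the 99 becomes 65
print(excluded)  # the 99 is removed
```

Both choices have the implications Steph alludes to: capping keeps the record but distorts its value, while excluding shrinks an already small sample.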
Pinal Dave: I have one question. When I look at the screen – so will you please scroll down all the way, to the first line too. It’s so interesting when I see this thing, people say statistics lie, and I think … I have a question for you. So, people who are happy are between 22 and 65, correct?
Steph Locke: Yes.
Pinal Dave: And people who are unhappy are between 26 and 99. Does it mean if one person crosses 65 – like if I’m 66, does it make me unhappy?
Brent Ozar: Oh yeah, old people are grumpy.
Steph Locke: Well, my posit would be, older people are possibly happier with their jobs because they’ve had more time to find a role they’re comfortable with.
So, what we’re doing here is the people who’ve answered the survey are a sample. So, we have the population of DBAs, BI people, devs, project managers and anybody else in IT who might conceivably watch GroupBy. And of all those people, a very small fraction have submitted this survey. So, what we are seeing here is a subset of people and they don’t necessarily reflect the whole population. The bigger the sample, the more likely they are to be representative. So, think about this, we would expect, if our population of DBAs had an average age of 40, that we would see that average age of 40 and the same sort of spread in our sample, as is the case in the real world. But we’re not going to get that, we’re going to get an imperfect understanding, because the perfect sample would be the whole population, and as soon as you get fewer people, you get less information. It gets less accurate.
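The point about bigger samples being more representative can be demonstrated with a quick simulation. A Python sketch, using a made-up population of 10,000 ages with a true mean near 40, shows how far sample means of different sizes tend to wander from the truth:

```python
import random

random.seed(42)

# Hypothetical population of 10,000 ages, true mean ~40, SD ~10.
population = [random.gauss(40, 10) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Draw samples of increasing size and compare each sample mean to
# the population mean; larger samples generally land closer.
for n in (10, 100, 1000):
    sample = random.sample(population, n)
    sample_mean = sum(sample) / n
    print(n, round(sample_mean, 2), round(abs(sample_mean - true_mean), 2))
```

Any single run can buck the trend (a small sample can get lucky), which is exactly the uncertainty Steph is describing; on average, though, the error shrinks roughly with the square root of the sample size.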
Pinal Dave: That is very fantastic. So that is why sometimes – this is very similar to what happens with parameter sniffing. We think that if there is a value which is 66, it’s going to be a sad person, but that’s not true. Maybe at the age of 66 he’s still happy, but we incorrectly assume he’s the sad guy. And that’s why we try to make him happy, and he comes back to us and says, “I’m already happy, don’t make me any more happy.” It’s a classic case of parameter sniffing, so this is a very good example which you have demonstrated to us. So, if anybody ever asks about this question, as a DBA, now I can go back and explain it to them. That’s how I came to let statistics right into my life, as well as how, as a DBA, I can use this information, so thank you very much.
Steph Locke: I’m really glad to increase my amount of DBA anecdotes on this. It’s pretty good. So, then we have our focus – yes, we have a lot of DBAs, and everybody aside from the others looks to have a similar mean. Now what’s interesting with this is you can see our max age of 99 is in our BI people. So, what that means is that if we took that person out, that average could drop very significantly, and the BI people could look much younger than our DBAs and our devs. And there would be a question then, based on our imperfect samples: is that something we feel comfortable saying? Because we’ve only got 14 BI people, and we’d have only 13 BI people if we removed that outlier, versus 70 DBAs.
So, we have two samples of groups that we’ve sampled in different levels, and it would make it potentially very difficult to say with certainty that BI people are younger than DBAs. So, we’ll have a look at how that works in the question of, are happy people older? Once you start looking at the summary data, it’s also pretty good to start looking at the data visually. So here, I’ve done what are called bee swarm plots with a box plot behind it. So, bee swarm is like a scatter plot where you put all the values – so you make an X axis and a Y axis and you put points where they intersect, but it helps you see the distribution a bit better. So, with our – my computer is lagging quite badly…
And whenever I’m clicking faster, it reminds me of Adam Machanic’s SQLBits promo video. I don’t know if you’ve ever seen that – you should totally dig it out and I’ll try and find it and share the link on Twitter afterwards. It’s click, click, click makes it go faster on SQL performance. it’s a brilliant video. There we go, okay.
So, we can see our outlier. This person is really changing things in terms of our mean, because we don’t have any BI people in our sample older than 60. The oldest person looks to be about 59. And we can see, in the box plot, it shows where the median is. So here we have, in the background, this line. So that is our summary statistic that we were looking at earlier. But then what it does, where our standard deviation would give us an actual value and say 15 years below that and 15 years above that is where the majority of our data is – well that’s kind of accurate, but this box says, okay most of our values are clustered around here, and then above it you have to go much further out to capture that data. So, our BI people are very, very close to the average age in most of these cases, except this outlier and some older people. So, we have a high degree of centrality. So, all the data is close to the center in there. Whereas, sorry I’ve been saying BI, our unhappy people are quite close to a certain age; whereas our happy people are much more evenly spread across the age range.
Our mean is a little bit higher than our unhappy people’s, but based on the box plot – which tells us how far above and below the mean you have to go to capture most of the data – it’s a bit more dispersed, a bit more symmetrical. There’s less clustering around the mean. Now, you could do this with a histogram. Instead of two charts like this, you could have made a chart with age going across the bottom and made your, sort of, histograms. But the problem with the histogram approach is that it loses some data. Because you’re providing ranges, you’re not getting the full truth. If I’d have done this every five years, lots of my people – all of this section here – could have come from 31, and I might not have any data for 32, 33, 34.
A histogram doesn’t tell us the full distribution. If we were looking at this data, we wouldn’t necessarily see the gaps in it. Having a view that shows every point – a scatter plot view – is very handy. It doesn’t necessarily work when you have lots and lots of data, though.
Okay, looking at our location data; so again, our samples for Asia and other are too low. I would really hate to draw conclusions about DBAs and BI people from Asia when I only have three points. And for the rest of the world, I’m also going to have a lot of trouble. Now, looking at the other two, we have – I should maybe lower the refresh rate – again, we have a lower average in Europe than North America, but it’s pretty close. And because we have less data in Europe, given that we expect everybody to be along this sort of distribution of ages, the fewer values you get, the less likely you are to see a clear spread of data. So, we have to be careful drawing conclusions when we have only a small amount of information. For instance, it’s quite interesting that we have some quite young people in America, relative to the rest of the world. That could be an apprenticeship scheme, getting them whilst they’re young, or it could just be sampling error on the part of Europe, and we just happen to not have anybody that young among the people who’ve filled out the survey.
And then similarly with the focus, we’ve got DBAs spread across a range. Our sample for devs is kind of showing my dev-as-a-young-man’s-game sort of hypothesis, but we still see a lot more spread than our BI people.
We can see that, by comparing, all of these have pretty similar averages, right. So, our BI people are there, our devs appear to be the oldest on average, and they’re all the way across here. So, if we were looking at their averages at a high-level view of the data, we’d be saying that there isn’t much difference. But we start looking at the sample sizes and we start looking at the spread of information, and we start seeing a bit of a different picture. So, it’s really handy to visualize your data so that you can have a look at how groups actually behave.
So, on that note, if we wanted to ask the question of whether being happy with your job indicates a difference in age, we can investigate that. So, we have these two samples – no, I’m not happy with my job, I’d be happier elsewhere; and yes, I’m happy with my job. And we can see whether we can be comfortable saying their average ages are different. So, the mean age of happy people is 41.6 and the mean age of unhappy people is 42.9. Now that’s pretty close. And we can do something called a t-test. A t-test helps us compare two samples and their average values, and gives us a score of, basically, how probable it is that they have the same average in the real world; not in our sample, but in the real world. So, the lower this score, the better. If it was less than 0.05, that would mean that in fewer than one in 20 cases of taking samples like ours would we see a difference this big if the two groups really were the same.
So right now, we can’t say that happy people are any different age-wise than unhappy people. I didn’t bother with the formula because I’d rather show a little bit of R – and we can dig into this a bit more later. So, the thing that did that test was this line of code. It is just a function called t.test. So, you don’t necessarily need to know how to calculate it by hand, just that when you’re looking at two averages, you can run this t.test function to see if they’re different. And I’m wracking my brain – I think you can do a t-test in Excel. There’s a bunch of online calculators, and you can do it in R, Python, Matlab, Stata, SAS. And the reason why you would do this is, for instance, if you implemented an A/B test. An A/B test is where you try out two variants on your website, or when you’re sending out newsletters or testing how people react to reports. You would get a figure like a conversion rate, and a conversion rate is an average, it’s a mean; because it’s the number of people who clicked divided by the number of people.
So, you get this value, and you get it for, say, your original version of your site and your new version of your site, and you run this T-test to say whether the new system, the new conversion rate, is actually better than the old one. And things that impact this behind the scenes are how many people you use, so what your sample size is; because it’s a lot harder with small data to be certain about things.
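Under the hood, R’s t.test (in its default form) computes Welch’s t-statistic and then turns it into a p-value using the t-distribution. A Python sketch of just the statistic, with made-up ages for the two groups (the p-value step is what libraries like R’s t.test or scipy.stats.ttest_ind add on top):

```python
import statistics

# Hypothetical ages for the happy / unhappy survey groups.
happy = [25, 29, 33, 36, 40, 42, 44, 47, 51, 55, 60, 64]
unhappy = [30, 34, 37, 39, 41, 42, 43, 45, 46, 48]

def welch_t(a, b):
    """Welch's t-statistic for two independent samples:
    difference of means over the combined standard error."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (mean_a - mean_b) / standard_error

print(round(welch_t(happy, unhappy), 3))
```

The larger the statistic (in absolute value), the smaller the p-value; identical groups give a statistic of exactly zero. Small samples need a bigger difference to reach the same confidence, which is the sample-size effect Steph mentions.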
Brent Ozar: You’ll like that the Slack channel is now asking all kinds of interesting questions, like maybe the old people don’t want to work and maybe they’re unhappy about that. And other people are going, well, what is happy? What does happy really mean?
Steph Locke: So that is an aspect of survey design. And the last time I did this talk, I introduced another ambiguity on purpose which was years of experience in the field. So, people go through a lot of career changes. You might have gone from dev to BI to DBA or any way through those. I’ve known people who were hairdressers before becoming infrastructure people. There’s a lot of career change and it was great to see people thinking about that and answering that question differently based on whether they counted how long they’d been a BI person, or how long have they been an IT person, or how long have they been in insurance. So, when we ask these questions, we actually have to be really careful in how we ask them to be able to get an answer that conforms to what we expect. So, the question of are we happy, as people are showing, they have different interpretations and questions about it. So, the art of a good survey is to make it unambiguous what you mean, without being so long winded that people just go, “I’m out of here.”
So, a T-test is a simple way of testing whether two averages are significantly different. And the lower the score, the more likely they are to be different because this score is basically how likely is it that they’re the same group?
Okay, so this is actually – this is a great question. I love it when questions are answered by the next slide. So, if we were trying to predict people’s age based on the information provided, we would have to go through, basically, kind of an experiment process. Or if you’ve seen the CRISP-DM model, which is the something, something, something data mining. I have a page on this… It’s always good when you have a blog post on it. Cross-Industry Standard Process for Data Mining; that just trips off the tongue.
So, when we’re doing any of this, we should be, first of all, getting a business understanding. What is the thing that you are trying to solve in the real world? So usually for me, this boils down to making customers spend more, getting new customers or reducing costs; so, profits from customers goes up. And these are like the three key business challenges most people face. So, when we do data science or statistics, it’s going from whatever this company goal is to something that you can stick an analytical lever in and push down on and change a behavior. So, going from a business understanding to quantifying your challenge is finding the behavior that you can change. So, in our case, if we were able to predict a person’s age based on this information and it was my company, I could be doing this to give them targeted adverts, say. So, from LinkedIn, I know this information. I can give them training specific adverts, but I could also start, for instance, if I know they’re a DBA and I know they’re in North America and I know they’re unhappy with their job, I could predict that I need to start selling them an annuity financial services product, or start recommending a move to a more safe pension plan. Or if they’re a dev in Asia and they’re happy with their jobs, I could be, because I think they’re under 25, I could be selling them hoodies.
So, we can use this – there’s no point in predicting things for the purpose of just going, oh yeah that’s interesting. It should always have some sort of practical lever that you’re aiming to use this information with. So, when we’re doing that, we go through this process. So, we get that lever, and once we get that lever, we get data. A common challenge is predicting which customers are going to stop buying your stuff. So, the kind of data that you might need to help you work out what customers are going to stop buying your stuff so that you can treat them nicer, give them discounts or get rid of them early, because they might be unprofitable, you need to look at who’s left previously. And then look at information about them. So, you might look at how much they used your service before they left. You might look at geography, their internet usage. You can get all sorts of information and you would try and get that data in a form where you can get that information at the customer level. And that would be your data understanding.
Once you believe you have enough data that you might be able to build a model, you go into your data preparation. So, actually with this Stats 101 thing, where I have grouped data into other here, and other here, if we look at the actual data, before I did some data cleansing we can actually see some of the varied responses. So, for instance, we have the ERP admin, a data architect, a consultant, and all these other values make it very difficult to say something concrete about these people. So, whilst I was doing my data preparation step, I changed all of those values to other, and I did that in R with a line of code.
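The recoding Steph describes – keeping the most common categories and lumping the long tail into "Other" – can be sketched as follows. This is Python for illustration (her actual cleanup was a line of R), with hypothetical raw responses:

```python
from collections import Counter

# Hypothetical raw survey responses for "focus".
focus = ["DBA", "BI", "Dev", "DBA", "ERP admin", "DBA", "BI",
         "Data architect", "Dev", "Consultant", "DBA", "BI"]

# Keep the three most common categories; recode everything else
# as "Other" so rare values don't fragment the analysis.
top3 = {value for value, _ in Counter(focus).most_common(3)}
cleaned = [value if value in top3 else "Other" for value in focus]

print(Counter(cleaned))
```

The trade-off is the one visible in the dashboard: "Other" becomes big enough to chart, but you can no longer say anything specific about the ERP admins or data architects inside it.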
Brent Ozar: A couple have asked; will it be possible to get the R script for this presentation?
Steph Locke: Yes. So here is the bit where I change people’s values – any value that wasn’t in the top three – into an Other category. So, we do that, and once we have some data, we can then start thinking about this step. And this step is modeling, which is how can we make a prediction that can be used to make some change. Some of the ways that we can do that – we could just use the averages. We can say the average DBA is 40, so when we produce some targeted ads, we’re going to target that 35 to 45 market; so red Porsches, Brent?
Brent Ozar: Yes, both Steve Jones and I say yes, red Porsches make perfect sense. Hall and Oates’ music, that’s also good.
Steph Locke: Yeah, I saw that earlier and I didn’t get it…
Brent Ozar: It’s a very American thing, very American, yeah.
Steph Locke: So that would be another reason why you would use the geographical location.
So, you could use a mean or a median. A median – so if we have outliers, if we have stuff that is quite extreme, then it has a really big impact on – so here is a histogram of values. This is people’s age going across the bottom and the count of values in that age bracket. And you can see that we have lots of people on this younger side, below 40, and then it slopes off and we have an extreme…
Brent Ozar: A couple of people have asked, Chuck asks, “Before the end of the talk, can you ask Steph what she’s using to turn that R into a live website is; is it Shiny?”
Steph Locke: This is Shiny plus a package called Flexdashboard, which just makes it a lot easier to get these things going. But yes, it’s all R, it’s all open source and the code that I’m writing to do these things is just some Markdown and then some R code. It’s super simple. It does all the web magic for me; it’s phenomenal. I’m very happy with it.
So, because we have this outlier of 99, what that’s doing is – the mean is where we sum up all the ages and then divide them by how many people there are – it’s having a really big impact on that top bit; which is, if you want to be technical, the numerator. So, the bit at the top is the numerator and the bit at the bottom is the denominator. So, when we’re doing this, that 99 is pushing this red line quite far up. Because you would probably say, based on all this data, the average is going to be a bit lower, except for that 99. The median is, if I got everybody who answered my survey and I lined them up from youngest to oldest and I split them 50/50, so whoever was in the middle, they’re the median person, or the middle person. And for us, our median person is 39 years old. Median is less sensitive to outliers, because it doesn’t matter what age people are on the far ends, because we’re just going for the middle.
So, you might see average salary measures; and when people are doing average salary right, they should usually be doing the median. And they do the median because you’ve got lots and lots of people earning nothing, who probably aren’t in your sample of earners as DBAs, unless there are people managing to do full-time DBA work pro bono. But you might have a few people working for a ton of money, and those people who earn a lot will make the average amount look higher than it is. So then, when you say to somebody, “Oh yeah, the average DBA salary is $90,000.” They go, “Damn, I’m only on $75,000. I’m being underpaid. I’m not even on average pay.” And when that’s happening, that’s because there are like 30 or 40 guys at Google being paid a million dollars a year to be DBAs; and they’re artificially increasing it. Whereas if you were saying the median income is $80,000, well then, I’m not being underpaid quite so much. So, these two different measures are useful in different circumstances, but they’re never very useful for predicting values. So, this is one of the areas where BI people used to make a lot of summary charts with mean, median, min, max; those can’t readily be used for forecasting or predicting what things are going to be like. Because a summary is only a good guess for a very small range of values, and it’s not accurate for the rest of the range.
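The mean-versus-median effect described here is easy to demonstrate. The session works in R, but the same idea can be sketched in Python, with made-up salary figures (the $1,000,000 outlier stands in for those hypothetical Google DBAs):

```python
# Made-up salaries: most people cluster together, one extreme outlier.
salaries = [70_000, 75_000, 78_000, 80_000, 82_000, 85_000, 1_000_000]

# Mean: sum everything (numerator) and divide by the count (denominator).
mean_salary = sum(salaries) / len(salaries)

def median(values):
    # Line everyone up from smallest to largest and take the middle one
    # (or the average of the middle two for an even count).
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

# The single million-dollar outlier drags the mean far above what a
# "typical" person earns; the median barely notices it.
print(round(mean_salary))  # 210000
print(median(salaries))    # 80000
```

Dropping the outlier barely moves the median but pulls the mean all the way down to around $78,000, which is exactly the sensitivity Steph describes.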
If we predicted 39 as the – if I said a DBA is 39, I’m going to be wrong in nine in ten cases. Ideally, I want to be able to go, “What’s your job? Are you happy? Where are you working…” Ask a lot of questions, get a lot of data points and say, “Aha, based on that, I reckon you’re 65.” Or, “Oh, you’re working in those startup types, so you’re 25…” Kind of thing.
So, the more information we can get, the more we can go away from just a simple high-level mean or median. One of the ways we can do that is using a line of best fit. So, in line of best fit, we can – That was me sketching a concept for something in log data, and it was terrible.
Let’s say we know your income; which I didn’t collect. I would probably have to use – Brent, you’ve got a survey with income on, right?
Brent Ozar: We do, yeah. I can’t remember if we have age. I know we have years of experience, but yeah.
Steph Locke: So, I could use income to try and predict age, because probably the older you are, the more money you earn. You might get a few people up here, you know, hotshots who are just managing to earn lots. And you’ve got some people who might be transferring – I’ve got this wrong. So, you might have some people over here with income high and age low, and you might have some people who have a low income and a high age. But then most people are probably going to be kind of along this sort of track. So, if we had lots of our data with income, what we’d probably do then is say that – if we were trying to put a line through it, how would we put that line through it so that as many of our points were close to that line as possible?
So, we’re always going to get some error, because if all the values were on an exactly straight line, we wouldn’t need to do anything. So, our predictions are always going to be a bit wrong; and the question is how do we make this line minimize those errors? And those errors – so if I place this line here because these values were really impacting it…
Brent Ozar: I’m totally going to get a machine with Windows 10; just to sit around and draw all day.
Steph Locke: So, if I put a line by here – and if you, like me, had to do a load of this in school by hand, it completely convinced you not to do statistics; I did pure mathematics because if you’re going to write stuff on paper, you may as well just work with formulas. So, the error for a given value is how far away from the predicted value it was. So, when we build a line of best fit, we basically get all of these red lines, which is how far away they are, and we actually square them. So, if that was X, we would take the square of it. So really far out values have more impact than really close values, because these ones here are brilliant, right; they’re pretty close. Most people aren’t going to tell you off if you predict them just a few years younger or older than they are.
So, you would take a line – and this is how all the systems work behind the scenes; they generate a line, calculate all these errors and get a value for it. Then they’ll go okay, I’m done with that one, let me try another line. And then they go through the whole process again; so, there are lots and lots of iterations behind the scenes working out all of these errors. And once it’s done that, it will say, okay, so I’ve found this line only has the sum of the red lines of, say, 500. That is the best of all the thousands of lines I considered. That is my line of best fit. And what this line of best fit has is some properties. So, in your – you might remember Y equals MX plus C. This is C, so this is the intercept.
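Steph describes the search for the best line as iterative, which is how it works for more complex models; for a simple straight line there’s actually a closed-form least-squares answer, though the thing being minimized – the sum of the squared red lines – is the same. A sketch in Python with invented income/age pairs (the session does this in R):

```python
# Invented (income, age) pairs for illustration.
incomes = [30_000, 45_000, 60_000, 80_000, 100_000]
ages = [25, 30, 34, 41, 45]

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(ages) / n

# Closed-form least-squares slope (m) and intercept (c): the line
# y = m*x + c that minimizes the sum of squared errors.
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, ages))
     / sum((x - mean_x) ** 2 for x in incomes))
c = mean_y - m * mean_x

# Sum of squared errors: the "red lines", squared and added up.
sse = sum((y - (m * x + c)) ** 2 for x, y in zip(incomes, ages))

predicted_age = m * 70_000 + c  # predict the age of a $70k earner
```

With these invented numbers the intercept comes out around 16.7 – close to the “nobody should be less than, say, 15” starting point mentioned below – and a $70k earner is predicted to be about 37.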
So, this is – if I then know any other information, this is where I think all the points would start at. So, we’re going to say that nobody should be less than, say, 15. As soon as they start earning more money than that, we’re going to start saying that their age goes up. And we say that their age goes up by a certain amount per unit. So, our income, let’s say that goes up 1000, we might say that for every 1000 in income, their age goes up five years. Obviously, five years is too big, but we would have an easy model that says – okay, so we start off saying you’re 15 or you’re 20, and then okay, you’re earning $100,000. That says to me that you’re 45. If you’re earning $35,000, I’d say you were 25. And then you add on more and more features, trying to make this line work in multiple dimensions to be able to add more pieces of information in.
We could say that as your income goes up, your age goes up. But then we could, if we had the extra bit of information about location, we could say that when we’re working in – when we know somebody’s from Asia, for instance, they could have – to earn the same amount, they’re able to be younger. So, when we have these factors such as location, those are called a categorical variable. So, income and age are variables which have lots and lots of values; and they can have decimal points. Our categorical values hold only a few values. So even zip codes, postcodes, those are categorical values that can even have a hierarchy of locations. So, when we have a categorical variable, what we do is we don’t adjust the slope. So, this bit here, we adjust where the slope starts. Does that roughly make sense?
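The usual way to shift the intercept for a categorical variable is a dummy (0/1) indicator with its own coefficient: the slope on income stays shared, and each category just moves the line up or down. A hypothetical sketch – the coefficients here are invented, not fitted to the survey data:

```python
# Hypothetical fitted coefficients (invented for illustration):
intercept = 16.0      # baseline starting age
slope = 0.0003        # years of age added per dollar of income
region_offset = {"Europe": 0.0, "USA": 1.5, "Asia": -4.0}

def predict_age(income, region):
    # Same slope everywhere; the categorical variable only shifts
    # where the line starts (the intercept), not how steep it is.
    return intercept + slope * income + region_offset[region]

# Same income, different region: parallel lines, different intercepts.
print(predict_age(60_000, "Europe"))  # about 34
print(predict_age(60_000, "Asia"))    # about 30 - younger for the same income
```

This matches the point in the transcript: for a categorical variable like location we don’t adjust the slope, only where the slope starts.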
So that is a line of best fit. And if we were making predictions, we can then go and visualize them. These all look pretty close because our ages are quite packed. So, if we put all of our people on a map to look at their fit, then if we’re only using one factor – so we’re doing our line of best fit, just with where should they start and… What should we do if somebody says they’re not happy, what value should we predict? And if we say they’re yes, what value should we predict?
Now basically, when we do it with just one single value that is continuous, we’re basically using our mean, because we’re just trying to work out where we’ve minimized the error. And the error becomes the sum of all the bits up and down from the value. So, we need to make it more complicated to be able to get this right.
Pinal Dave: I was just going on Twitter and reading some random questions about statistics. One question which caught my eye, and I thought I’d just throw in right now is, at what time-, or how do we decide, that this statistic is not important or it just absolutely does not have good information? Like when you just say, you know what, let’s start again, this is not what I want; what kind of test would you run? Or what kind of two or three questions would make you feel, you know what, I’m going to stop instead of investing more?
Steph Locke: So, error is a really good thing to include. So, if we were making predictions based on job role, for instance, we could see that we don’t get it too wrong for our others, but for our DBAs, our errors are huge – our model is just terrible. It doesn’t work for that category of people. And if it doesn’t work for your biggest category of people, you need to go back and change your model.
So, one key metric that you look at when evaluating a model is error. And you look at that error overall and within key groups. So, it’s a really good idea if you’re using a model to try and predict customer churn, that you might have a segmentation model already where you say these are my high-value customers, these are my low-value customers and these are my middle ones. And if your model doesn’t work very well for your high-value customers, then you really want to go back to the drawing board, because those high-value customers are the ones you want to retain the most.
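Checking error overall and within key groups is straightforward once you have predictions. A sketch with invented (group, actual, predicted) triples, using mean absolute error; the numbers are made up to show how a decent overall error can hide one badly served group:

```python
from collections import defaultdict

# Invented (group, actual age, predicted age) rows for illustration.
rows = [
    ("DBA",   35, 39), ("DBA",   51, 39), ("DBA",   31, 39),
    ("Other", 40, 41), ("Other", 38, 37),
]

# Overall mean absolute error across everybody.
overall_err = sum(abs(a - p) for _, a, p in rows) / len(rows)

# The same error, broken out per group.
by_group = defaultdict(list)
for group, actual, pred in rows:
    by_group[group].append(abs(actual - pred))
group_err = {g: sum(e) / len(e) for g, e in by_group.items()}

# Overall error looks tolerable, but the DBA group - the biggest,
# most important one - is badly served: {"DBA": 8.0, "Other": 1.0}.
```

If the high-error group is the one you care about most (your DBAs, your high-value customers), that per-group breakdown is what sends you back to the drawing board.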
So, error is a key factor as to whether you go forward with a model. The other aspect is, do you get it more right than just randomly picking? So, if we said one in ten people are going to leave and we just, for every one person … nine times out of ten we said they’re going to stay, then we’d probably get quite a few of those one in ten people just by doing that.
So that is called a lift chart. I don’t have one on here unfortunately, but it’s basically how much better do you do at predicting versus a random assignment? And usually, you want to be much better than random because if you’re worse than random you’re actually reducing performance, and you need to be able to make a profit. So, when we’re doing it, and especially in relation to those customer segment type things, you look at how much better you’re performing; how accurate you are at doing it.
Oh, that’s mean; somebody just put an age of 1000. I think that’s somebody just trying to see if the model actually is updating real time or if I’ve just plugged in a random grey screen process. Yeah, so you look at…
Pinal Dave: Excuse me, if you see this kind of thing when it is happening, how do you take care of identifying outliers like this? So, the question is, if there is a lot of people, we know that somebody cannot be the age of 1000, but let’s assume there is 20 people who will go and put age of 1000 and suddenly that number becomes realistic – in the case of age, we know, but there are certain things we would not know, like salary or experience or things. Is there any way to figure it out, the data which looks right but is not right?
Steph Locke: So, what you can do, obviously, as you say, with age we know, so we could add a line in here. So, I want to add a change step, and what we’re going to do in here is say age equals, then use the function pmin, which takes the smallest of two values. So, I would say age and 70. This is one of the easiest known data quality things that you can do. So, I’m now rebuilding my dashboard and it should completely remove our 99 and 1000 from this.
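R’s `pmin(age, 70)` takes the element-wise minimum, so every age above the cap becomes the cap. The same idea sketched in Python, using the outlier ages from the live survey:

```python
# Ages from the survey, including the joke entries.
ages = [27, 34, 39, 45, 99, 1000]

# Element-wise minimum against a cap, like R's pmin(age, 70):
capped = [min(a, 70) for a in ages]
# -> [27, 34, 39, 45, 70, 70]
```

Note this caps rather than removes: the 99 and 1000 entries become 70 instead of disappearing from the data set, which is why they stop distorting the mean.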
Pinal Dave: You are scaring a lot of people. I’m going to keep quiet actually with your age limit.
Steph Locke: Of 75? So, it pulls the data real-time from – so now our data only goes to 75. We can see that in our tables…
Brent Ozar: We’re talking in Slack too. Chuck says, “Pinal raises a good point; you don’t always know what data is valid.” We totally had that problem on the salary survey. You just have no idea what to take for people, and what a top range should be.
Steph Locke: Yeah, so when we’re doing things like that, there are various things that you can do. So, what we could do is instead of making this a fixed range, we could simply say any of the top 1% of values, get rid of it, do something with it. And in R we actually have a load of different capabilities for dealing with this. So, it’s quite nifty for data quality because we have these various data assertion packages, and we can write a check here that says the data has to meet these rules, and we can say if it doesn’t meet them, stop. So, this is quite a handy one that we’d be looking at. If you’ve heard of Six Sigma project management, what that is is a quality control system that originally said: if you take all the final measures of a washer that gets produced by your system, most of them should be around the mean, and only very few errors should happen far out, six standard deviations away from that mean. So, you want very little appearing in those areas. And those are the tails of our distribution; our far ends. So, with salary, it’s always bounded by zero. Hopefully, people don’t pay to work at places, but you could have a very high salary.
So, what we could do is say that we want to insist that salary must be within four or five standard deviations and set that boundary. And you could use that to provide a filter. So, you would say calculate the salary range but excluding stuff at the very far ends of the tails.
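That standard-deviation boundary is a simple z-score style filter: keep only values within k standard deviations of the mean. A Python sketch with invented salaries (the session uses R’s assertr for the same job):

```python
from statistics import mean, stdev

def within_k_sd(values, k=3):
    """Keep only values within k standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Invented salaries with one absurd entry.
salaries = [60_000, 70_000, 75_000, 80_000, 90_000, 5_000_000]
filtered = within_k_sd(salaries, k=2)  # drops the $5M entry
```

One caveat worth knowing: a single huge outlier inflates the standard deviation itself, so with a small sample a loose boundary (say k=3 here) can fail to catch it; robust alternatives use the median and interquartile range instead.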
Pinal Dave: One question, just a follow-up. I will not take time – but salary is a great example which varies from country to country. So, I’m sure there is also maybe a group-by or windowing function available, so we can say, hey, if it’s in Asia then this range is good for five standard deviations, but if it is the United States of America, then they have a different range – so we can just create further windows into the ranges also, right?
Steph Locke: Yeah, so the nifty thing is – let me just show you this – with assert, what we’re doing here in our workflow is we’re taking this data set we’re doing this filter that says it’s got to be within three standard deviations within the whole data set, then we’re doing a GroupBy and then we’re doing the average. You can do a GroupBy first to say do this within standard deviations within the group before you do the mean. So, it’s literally as trivial as changing the order of the lines. And that is – so you would just do that.
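The order-of-operations point – group first, then filter within each group, then average – can be sketched in Python too (the transcript does it with dplyr and assertr in R; these regions and salaries are invented):

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented (region, salary) rows, each region with its own outlier.
rows = [("Asia", 30_000), ("Asia", 32_000), ("Asia", 35_000),
        ("Asia", 38_000), ("Asia", 400_000),
        ("USA", 90_000), ("USA", 95_000), ("USA", 100_000),
        ("USA", 110_000), ("USA", 950_000)]

def sd_filter(values, k=1.5):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Group FIRST, so each region gets its own outlier boundary...
groups = defaultdict(list)
for region, salary in rows:
    groups[region].append(salary)

# ...then filter within each group, then take the mean.
avg_by_region = {r: mean(sd_filter(v)) for r, v in groups.items()}
# Each region's outlier is judged against its own distribution.
```

Filtering globally first would use one boundary for everybody, so a salary that’s an outlier in Asia but ordinary in the US would be mishandled; swapping the group-by ahead of the filter is the whole fix, just as Steph says.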
Brent Ozar: Alright, well let’s see the graphs again now that we threw out all those people. Oh, there’s still that 99 guy…
Steph Locke: So that probably means I coded it wrong.
Brent Ozar: Because the 1000-year-old guy’s gone.
Steph Locke: Yeah, the 1000-year-old guy’s gone but it should be better.
Pinal Dave: While you search for that I will just say thank you very much. You answered a couple of my questions, and I’m sure some other people are also wondering the same, so thank you; that was amazing help. I’m more confident about R than yesterday.
Steph Locke: Brilliant. So, I need to change this because I’m not sure I’ve done it right. Oh, I know what the problem is. This opened up on a new port. Sorry, it randomly generates port numbers. So, this is on – did I press play? I did. So why is this… Now you’ve got me wondering about my code. There we go, I just hadn’t updated. Go on, Brent…
Brent Ozar: No, there’s a bigger difference in between happy and unhappy people. Still not a lot, still not a lot but…
Steph Locke: Because we removed those values that made the mean for “no” artificially higher than it should be; they were outliers which distorted that central view. And we could even add in the median for that very quickly and see what that brings us.
Brent Ozar: And then to see if there’s a difference then on the guessing of can we determine is there a difference between them happy or not.
Steph Locke: So, the median is a fair bit younger – oh, and of course, now we have somebody who’s added a zero. This is why I love doing these live surveys because you get to see what people really do to surveys. It’s really useful because outliers do happen and handling them is very necessary. So, whilst we’re – I forgot one step.
One thing that we should do when we’re building our models is not use our entire sample. If you use your entire sample, firstly you have no way of knowing whether you’ve done it right or effectively. Secondly, you do this thing called overfitting: you draw the perfect line for all of your data, and that perfect line for your data is not necessarily the perfect line for future information. So, you leave some data out so that you can ensure that you don’t build something too perfect.
So, when we build a model, we take some training data and some testing data. And the most common thing to do is to take a 70/30 or an 80/20 split. So, you can build test samples really easily in R, and perform a split. So, you would do the split, then do the model – take your line of best fit, apply it to new data and check how it performs on that new data.
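A minimal train/test split sketched in Python (the session does this in R): 70% of the fake survey fits the line, and the held-out 30% gives an honest estimate of the error on data the model never saw.

```python
import random

random.seed(42)  # reproducible fake data

# Fake survey rows: income drives age, plus some random noise.
data = [(x, 0.0003 * x + 18 + random.uniform(-3, 3))
        for x in range(20_000, 120_000, 5_000)]

# Hold some data back: 70% to train on, 30% to test on.
random.shuffle(data)
cut = int(len(data) * 0.7)
train, test = data[:cut], data[cut:]

# Fit the line of best fit on the training data only...
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
m = (sum((x - mx) * (y - my) for x, y in train)
     / sum((x - mx) ** 2 for x, _ in train))
c = my - m * mx

# ...then judge it on data it has never seen (mean absolute error).
test_error = sum(abs(y - (m * x + c)) for x, y in test) / len(test)
```

If you instead fit and scored on the full data set, the error would always look flattering, which is exactly the overfitting trap described above.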
So that is the very beginnings of data science and statistics and data mining. It’s: how do you get your data? Is it sensible? Have you sorted out any outliers? Do you know what you’re using it for, so that you can impact the right thing? And then take some of the data, build some sort of model, like a line of best fit or an average or something more complicated like a neural net. And then take that model, check how it performs – whether that’s its overall ability to predict things or its performance within a certain category that is most important to you. Check how well it performs against other types of models and then you go through the wonderful process of putting it into production and checking it’s doing the right thing in live.
Brent Ozar: When you do a project like this, you know, obviously this is pretty quick in here, but when you go through and do stuff like this [inaudible] on a project like this? Say that you had 100,000 responses out of here and you wanted to get a better idea of building that model, what kind of work is involved in it?
Steph Locke: So, there wouldn’t be that much difference in terms of going through what we’ve done here and doing it for extra respondents. The [inaudible], because we would just sort the data quality for those 100,000, and just run it through the same process. It would become a bit more complicated when we have lots of data. So, if we asked 30 questions and we wanted to build – so if we had income as the thing we were trying to predict, we had asked somebody 30 questions and we had their email address, so we got their Microsoft qualifications. We were able to scrape LinkedIn and their Twitter and all sorts, so we’d have tons and tons of information. And you wouldn’t want to put all that into one big model. So, you would have an intermediate step which is called feature reduction, which takes lots and lots of data and works out what the most salient aspects are, and then you use those for your model.
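Feature reduction can be as simple as scoring each candidate feature against the target and keeping only the strongest few; real projects often use fancier methods like PCA, but the idea is the same. A hypothetical Python sketch using absolute correlation, with invented features:

```python
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Invented data: income (in $k) is the target we want to predict.
income = [40, 55, 60, 75, 90]
features = {
    "years_experience": [2, 6, 8, 12, 18],  # strongly related
    "certifications":   [1, 2, 2, 3, 4],    # also related
    "shoe_size":        [9, 7, 10, 8, 9],   # pure noise
}

# Rank features by |correlation| with the target; keep the top 2.
ranked = sorted(features, key=lambda f: -abs(correlation(features[f], income)))
selected = ranked[:2]  # shoe_size is dropped
```

With 30 questions plus scraped LinkedIn and Twitter data you’d have far more candidates, but the shape of the step is the same: score, rank, keep the salient few, then feed only those into the model.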
Brent Ozar: and when people now want to take this and go, “Alright, I found this part interesting.” Where do you suggest the folks go next?
Steph Locke: I’ve got a page called Getting Started with Data Science. And this has some book recommendations and some online courses. So, I just gave away a copy of this book. This one, Data Science for Business, is not about the code but about the process and what types of things you need to think about when you’re building a model. That’s a very handy non-technical but non-math-y way of starting to learn these things.
Brent Ozar: Does somebody have to have a degree in computer science or mathematics or anything like that in order to get started? Or what do you think are the most useful assets, personally, to get started down a career like this?
Steph Locke: I think you don’t have to have any qualifications around it. You do have to be, because the literature is often not very accessible, you have to at least be willing to Google and learn. Fundamentally though, I think a lot of the things that most people do day to day in data science is problem-solving your way through how people will screw things up for you. It’s like all of IT. If you get garbage in, you get garbage out. So, you spend a lot of time worrying about data quality, understanding how information was collected, checking for issues. Then it’s a lot of stakeholder management – like how do you get from a business goal – first of all, trying to get somebody to give you a business goal is hard, and going from that business goal to something you can actually impact with data that is decent – so that’s just pure process and people management skills there.
Then there are a lot of – a big checklist, really, of things that you should do before you start building a model. Like, what types of models do I want to use? Do I have enough data? Am I building samples? Have I done things like scaling variables where required? Do I have too many fields? Do I have not enough fields? And then going through, building a model is actually relatively simple. Then it’s working out whether that model is doing the right thing for you and if it’s good enough. And that is, again, fairly checklist-y. It certainly helps to have some programming experience, or SQL Server 2016, to be able to implement a model in production. That’s where it starts becoming, again, more general and technical than data focused.
Brent Ozar: Eugene says, “There’s a great deal on data science books right now. The Humble Bundle at HumbleBundle.com.”
Steph Locke: Yeah, so that one – I think the Humble Bundle’s pretty good for people who are general techies already. It focuses a fair amount on the tech. So there’s a Hadoop and a Python book and an R coding book and things. For $15, get it.
Brent Ozar: And then Alvaro asks, he says, “Where does Hadoop and Spark fit in the world of data science?”
Steph Locke: Hadoop, nowhere. So, Hadoop is just a data storage and query system for me. Spark is a much – so R has the limitation of working in memory. Spark allows you to take some R code and Python code and some Scala and do some of that feature engineering. So, cleaning up the age and removing outliers, working out what to do with missing values, producing lagged data, like lagged measures – how many times have they visited my site in the past 12 months – then condense it all back down again and build a model. And Spark allows you to do that on much bigger data sets than is often easy to do on your desktop.
Brent Ozar: Cool. Well thanks a lot for talking to us today, Steph. That was awesome; fantastic.