For developers, DBAs, BI/BA professionals interested in integrating statistical techniques into SQL.
One of the most exciting new features of SQL Server 2016 is the integration of the R statistical programming language into the platform with the brand-new SQL Server R Services. In this introductory session we’ll briefly introduce the R language and then look at how to install and administer SQL Server R Services. We’ll start to demonstrate the power of R by showing you how to create and run simple R scripts from within T-SQL. We’ll then discuss some of the powerful open source packages that are available to aid your data analyses. Finally, we’ll go through some examples of how to use R to integrate data analytical methods in real-world situations, and wow your customers with the power of statistics!
Enjoy the Podcast?
Why I Want to Present This Session:
Lots of people are talking about how to use R and analytics along with SQL. I’ll show you the why as well, using some use cases from some of my past clients.
Session Transcript: Introductions
Brent Ozar: So in this session at GroupBy, Chris Hyde is going to be talking about SQL Server R services in SQL Server 2016; take it away, Chris.
Chris Hyde: Alright, thank you very much for having me, Brent. Thank you, everyone, who’s joined. I’ll start with a little bit about me – I’m Chris Hyde from Albuquerque, New Mexico. I run the local user group there. I’ve been dealing with SQL stuff for about 16 years or so. If you like this presentation, please mention me on Twitter. Let everyone know they should come see it, particularly people in DC who should come see this presentation next Saturday. If you don’t like the presentation, if there are things I could better, well that’s what the email is for. So shoot me a message there, shoot me a message on the GroupBy channel on Slack; let me know what I could do better.
And I did just want to start off, before I get into things, with a little note about PASS. So I ran for the PASS board of directors this year, and I wasn’t successful, but that’s okay. But what really disappointed me was that we only had about 800 people vote out of 250,000 people who are official members of PASS. So I find that pretty disappointing – so what I’d really like to know is if there are people on this webinar who didn’t vote, I’d love to hear why you didn’t vote, if you couldn’t vote, if you had technical issues I’d love to hear about that. Please send me that info by email so I can collect it and I can help them improve the process for next year’s elections. So that’s my piece of the soapbox done.
Erik Darling: Democracy is hard.
Chris Hyde. It is. I won’t go off on a tangent there, as much as I’d like to.
Warning: There Will Be Math
But I do want to start off with a warning of a different kind – I assume everyone is expecting this. You’re coming to an R session, but there is going to be some math involved. For me it’s a little bit early; it’s 8:30 in the morning. So if I can handle the math at 8:30 in the morning, hopefully, you people on the East Coast, Central time or in other parts of the world… I might be the first GroupBy casualty here; I apologize.
Erik Darling: If you drop dead, I can’t …, I’m sorry.
Chris Hyde: Yeah, I should have sent this out in advance so you guys would be ready in case that happens. We’re going to be talking about math because R is good for, you know, statistical computing. That’s what we’re going to be talking about today. So in order to show you some examples of where we can use it, they’re going to involve math. And in order that you understand it and you follow along with me, I’m going to actually explain the math… Boo, hiss, right? So let’s go ahead and begin.
R is a language and environment for statistical computing. It’s got really good graphical capabilities; there are some good machine learning capabilities. We’re not going to go into those too much today. And it’s an open source environment that was added into the SQL Server ecosystem in 2016. Now, open source R does a couple of things not so well. It doesn’t handle large data sets that exceed the RAM space very well, and it doesn’t handle multi-threading very well.
Introducing R in SQL Server 2016 & 2017
There was a company called Revolution Analytics – they did a distribution of R with some extra packages that does handle multi-threading. It does handle spilling large data sets to disk. So when Microsoft started to look into incorporating some of the data science stuff into the ecosystem, revolution was kind of the natural company for them to acquire and to bake their version of R into the platform.
So it’s been available since 2016 in CTP3, it’s still there in the 2017 version and we’ll see, when you guys do the install, how that looks a little bit different today. So I’m expecting, because this is GroupBy, that a large number, possibly the majority of people attending the webinar are DBAs. So you might be thinking, what can I do for me? I’m not in that predictive analytics space, I’m not in the BI space – well I want to show you guys a little bit what you guys can do with R; so I’m going to pander to the DBAs a little bit. Hopefully the BI people, the analytics people – you’re already on board and I’m not going to lose you by pandering to the DBAs. So let’s go ahead and kind of do an intro to R.
So I’m going to jump over here into R studio, and this is an open source IDE for using R. You have other options, such as RTVS, which is R for visual studio. But I’m going to start just in this R studio, and we’ve got a simple script here.
Our Analytics Problem: When Will We Run Out of Drive Space?
So the problem we’re trying to solve with this script is when are we going to run out of space on our data drive?
It’s not usually a big issue, right, because you need more space…
Erik Darling: It’s going to be 3 am on Saturday, when you always run out of drive space.
Chris Hyde: Yeah, and then you call up your SAN admin, right, and you ask her for a new [LAN], and she gives it to you right away, right?
Erik Darling: It works perfectly; it’s seamless.
Chris Hyde: Yeah, or you know, you call up your manager and ask her to start ordering some new Spinning Metal for you, if you’re still using Spinning Metal. By the time you go through that process and all the requisitions and stuff, it takes weeks. So I know, when I was a DBA, I’d kind of ballpark it, but let’s actually use some statistical techniques here. Let’s go to our manager, go to our SAN admin with some kind of ammo; some kind of real data.
Our Data: Capture Disk Metrics to a Table
So I’ve been capturing the size of all my databases on disk every night, you know, schedule jobs, saving that to a table for example. So I’m going to use R, use the statistical capabilities there to predict when exactly are we going to run out of space. So here’s my script; let me blow that up a little bit so it’s more readable. We’re going to use a library called RODBC, that just allows us to go from R to a SQL Server engine using an ODBC connection. We’ve got a standard ODBC connection there, and then a query, just to sum up the total size of all our data files every day. And we’ll bring them into what we call a data frame, that’s DF total size in line 20.
So I’m going to prefix every data frame with DF. And when I say data frame in R, that’s kind of the poor man’s implementation of a table. There are other table-like objects you can use in R. This is kind of the bare bones base one that you’ll probably start with if you’re new to R. Then, you know, we’re good data professionals, so 21, we’re going to close our connection. So let me go ahead and run up to that point and pull or data into the R environment.
Now the data is there. One little wrinkle, everything came in as a string, so we’re going to have to do some coercion here. We’re going to turn, in line 25, our date from a string into an actual date. And we’ll just name the column Measured Date. And the size of our databases we’ll call Total Size. And you can see the dollar sign in there in those two lines – that’s kind of like the dot syntax in SQL. So instead of table.column, it’ll be dataframe$columnname. Hopefully, that makes sense to everyone.
Plotting Disk Space Over Time
We’ll plot that. I mentioned R’s graphical capabilities earlier; it’s very easy. Line 30 and 31, we’re just setting the size of our plot, the range for our data values. 33 we create a blank plot. Line 34, we just graph the line of our database growth on there. So let me go ahead and run down to that point. And if I can figure out how to – there we go… I almost couldn’t pull the graph up because of the webinar.
So you can see how we’re trending over time. Let’s say we’ve got a terabyte of space on disk, our databases are running around 800GB right now. So we know that sometime late next year probably we’ll be out of space. But we can do better than that. So what we’re going to do is we’re going to kind of make a trend line through that graph using a linear aggression model. It’s very simple in R – it’s one line, line 38 here, to create our model, which is the size over the date. And line 39, that’s one line to fill that onto the existing graph.
Using Linear Regression to Predict Outage Times
Erik Darling: Can I ask the never made it past algebra one question? What’s a linear regression?
Chris Hyde: It’s basically a way of kind of figuring out the line that best fits that jagged curve. I don’t remember the exact – it’s something to do with squaring the variants and finding the line that takes the least total [crosstalk] where those variances – I’m sure we’ll have somebody who will chime in and tell everybody exactly what that is. But for now, let’s think of it as this best fit line here, this red line. So, that line shows us the best trend of where we’re going. So I said there’d be math – hopefully everyone remembers the equation for the line is Y=MX+B. This is the part where when I present this live, I ask that question and everyone looks around to see who actually remembers it.
So we have our Y intercept over here, we know the slope of the line. We could actually work out exactly when we’re going to hit the 1TB mark, but there’s an easier way for us to do that. What we’re going to do in invert that model. Right now, this graph has the time along the X axis, and the size along the Y axis. So as we go out to the right, think of looking at the time, we can see where our database grows size-wise. Well if we invert that, put the size along the X axis and think of as we grow out in size, on the Y axis, what time will it be where we hit that?
So I’m going to go and invert the model, that is line 44. So you can see in line 38 we had size over date. Like 44, instead, we’re going to have data oversize, so we have an inverse of that model. And then we’re going to use the predict function to find out exactly when it’s predicted that we’ll hit that 1TB mark. Now that’s going to give us a date – it’s actually going to give us a number since January the 1st 1970. So we just have to turn that into a date. And if we run that we can actually see here on the bottom left that it predicts we’re going to run out of space on March 28th next year.
Now, obviously, there’s some variance in that. It’s not an exact science. It’s not saying exactly on March 28th we’re going to run out, but we have a statistical analysis that we can take to our manager or SAN admin and say, “This is when we’re going to run out of data.” So it’s a lot better than just kind of ballpark-ing it. So hopefully you DBAs on the call can use something like this already, very simply. And hopefully the BI folks, I haven’t lost you in speaking to the DBAs.
Erik Darling: I appreciate that you used 1024. A lot of people try to square off megs and gigs by a thousand, and I’m shaking my head at them.
Question: So, Why Did Microsoft Use R?
Chris Hyde: (Reading an attendee question): So, why R? Why did Microsoft start with that? We’ve seen this trend over the last couple of years. Microsoft wants to go beyond just the traditional BMS, to expand their platform into the wider data world, including data science. So business analytics is huge, and there’s several parts to business analytics. Those of us who are the more traditional BI professionals, we’ve been doing what’s called descriptive analytics for years – that’s the who, the what, the why, the when – maybe not the why, but the when, the where et cetera.
But we really want to go toward predictive analytics, what’s going to happen in the future based on what’s happened in the past; statistically, what can we expect? And this is – it’s been called data mining before, it’s kind of the same stuff; nothing new under the sun. and then the next data science buzzword will probably be prescriptive analytics, making changes to the models, hypotheticals to suggest what might happen if you make some changes in your business processes.
As an analytics tool, R is right up there. One good reason to use R, if we look at this chart from O’Reilly a couple of years ago, the number one analytics tool is just good old plain SQL. The number two analytics tool is everybody’s favorite – user’s favorite tool, Excel. And then we have Python and R neck and neck for the next spot.
Erik Darling: What do they call R in Boston?
Chris Hyde: That’s a good question.
Erik Darling: I don’t think it exists. It just floats out in the air. That’s why everyone uses Python.
Chris Hyde: Nice. I’m going to steal that one, I think.
Erik Darling: All yours, my friend.
Chris Hyde: Now, if we look at people who call themselves analytics professionals, not just database professionals; our usage has been up in the 75%-80% range. And for people who are using it as predictive analytics tools, it far outshines – at least it did a couple of years ago, python is catching up pretty quickly – it far outshines everything else. it’s very easy to use. It’s open source, so it’s free. Look at those numbers compared to SQL Server data mining, you know, down there at 20% Well, anybody who’s done that in SSAS, you know, kind of knows why it’s that far down.
Erik Darling: I’ve heard nightmares…
Chris Hyde: I’ve had nightmares.
Erik Darling: Sorry about that.
Chris Hyde: Of course, because R is open source, a lot of people are writing packages, contributing things. I said earlier there’s nothing new under the sun; if you’re trying to do something, the chances are somebody’s tried to do it before. Somebody’s implemented that particular heuristic model. Somebody’s probably done it. We’ve done exponential growth of packages sitting in what’s known as the CRAN repository; which is the central location for our packages. That growth is just increasing.
I do want to touch very briefly for installation, before we dive into the next demo. So one thing you’ll see, if you’re using 2016 or 2017 is an option that may be new to you. It started out in the CTP being called Advanced Analytics Extentions, and then in 2016 RTM it was R services database. And now in 2017, now that we’ve added Python as an option, it’s Machine Learning Services in database. That’s the piece we’re going to use here to communicate from SQL Server to our back and forth.
Question: How is R Installed in 2017 & Linux?
Erik Darling: There’s a question in Slack. “Will you explain how to install R packages and dependencies in SQL Server 2017? This has changed a lot since 2016.”
Chris Hyde: I can talk a little bit about that. I haven’t seen any changes in 2017 for R. I can still do the install.packages, and then the package name and that will go out to the CRAN repository just fine. So the person who asked that question, if you could maybe give more details about what’s changed for you. But I updated a couple of packages this morning in my 2017 version and it was exactly the same.
Erik Darling: And is the process different on Linux versus Windows? That’s a full metal pocket protector question.
Chris Hyde: I’m not actually sure if this integration exists in the Linux version. I don’t believe it does yet. I know definitely that the Python version isn’t, and they’re working on that. I’m thinking that that means that the whole machine learning services piece is not in the Linux version, but I’m not a Linux expert.
Erik Darling: Me either…
Chris Hyde: Luckily we have people for that. But in theory, if you’re doing that, it should be pretty much the same. As long as you have a connection to the internet, you can do the install packages and connect to that CRAN repository and download new packages. As far as if you don’t have an internet connection, I know Microsoft has a couple of articles out there about how to do that. So I’d refer people to those.
Machine Learning Server & Services
Chris Hyde: One other thing I wanted to mention on this slide before we shoot off here is you’ll see under the shared features, we have machine learning server standalone. That’s a different piece. That is more for if you have a team of data scientists that are already using R, but maybe they’re struggling with the multi-threading or the spilling to disk issues that I mentioned earlier. And you want to install for them on our server with the Revolution Analytics flavor Var installed, that they can then use those extra packages and functions. But that’s not going to go into what we’ll talk about today – I just wanted to point that out for those people that are seeing both of those sections there.
Now, when we do this, whether you’re using R or Python, we’re going to add a service that may be new to you; it’s called the SQL Server LaunchPad. This is the piece that when we call our magical stored procedure, that I’ll introduce in a second, it instantiates the R engine and it facilitates data being passed back and forth between SQL and R. So it’s called the Launch Pad, a new service that you may notice.
How We’ll Run It: sp_ExecuteExternalScript
When you’ve got it installed, for the piece we’re going to use today with the integration back and forth, we’re going to use a new stored procedure called sp_executeexternalscript.
Now, we have to turn that stored procedure on to be able to use it, so it’s a simple reconfiguration, but you’ll have to remember to do that, and it’s something that I’ve forgotten to do before demos and had some panic involved. I have the exact script in the slide deck so I can copy and paste for the next time that happens. And another thing to remember is to configure your memory appropriately. The data that we’re going to use in R, it’s going to be a duplication of the data that we’re sending from SQL Server. So you’ve probably heard the term in database analytics – but then a lot of places will put it “in database”, with air quotes… We’re really actually making a copy of our data. We’re doing it very efficiently behind the scenes, but that data is going to go into RAM.
So I know, back in the dark days when I was a DBA, if I had a new server, I might just take 85% or 90% of the server memory, give that to SQL Server and let the operating system have the rest – or some other system like that. But now, in the R world, if we’re going to be using 100GB worth of data, we’re going to be processing that in R, we want to have that 100GB worth of data available for us to use in RAM.
Should R Services Be Run Somewhere Else?
Erik Darling: Can I just ask a quick question? Because I know there’s a lot of guidance round, like if you’re going to use SSAS, IS, RS or whatever, you go and install that on a separate server so it can do all its magic over there. Is there a way to put R on a separate server and give it its own set of RAM and functionality, or does it have to be along with the core engine stuff?
Chris Hyde: You could, but then you’re going to lose the memory bonus that you get from the integration. There’s a service called the Binary Exchange Language Service that transfers the data back and forth between SQL and R – or python if you’re using Python – and it’s much more efficient than say an ODBC connection. I’m not sure exactly what they’ve done to make it that efficient, but it’s incredibly quick. So you’ll probably want this on the same database that you’re running the data against.
Now, I say 100GB, but in reality, whatever data you have in your data warehouse that you’re going to do analysis on, you’re probably doing some kind of pre-processing in SQL before you send it to R. So you may not need all that data, unless you have a very specialized use case. Maybe if you were doing some machine learning stuff on billions of rows, you’ll need that kind of data; but for most of us, probably not.
Erik Darling: We don’t want the machines to learn too much; that’s how Skynet happens.
Chris Hyde: Maybe it wouldn’t be a bad thing; that’s all I’m saying.
Erik Darling: Someone asks, “Can you use a read-only secondary for R services?” So if you had an AG and you had R installed on both nodes, I assume, could you do the analysis stuff on the secondary?
Chris Hyde: I don’t see any reason why you couldn’t. [crosstalk] I’m not an AG expert, so I’m just a poor BI guy here, folks.
Two-Minute Introduction to R
Chris Hyde: Alright, so I’m going to give you guys the two-minute intro to R, so that you’ll actually understand the scripts that I’ll show you later. So the assignment operator is conventionally the left arrow. You can also, if you’re a contrarian, you can use the right arrow. So you’d say one, right arrow, X, and it’s like saying X equals one. You can use the equals sign, but there is a wrinkle when you do that in a function call that pops up, so nobody does it. So use that left arrow, follow along.
So we’ve got the colon operator here that gives us the sequence. So 1:10 gives us 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I think most of the people attending this webinar are smart enough to know what A:Z gives us. But what those do give us is something called a vector. So you can think of a vector maybe as a one-dimensional ray or an ordered list, or even as a column in a table; because just like a column in a table, everything in a vector has to be the same data type. So we can say N equals one through ten, and we have a vector with ten elements; the numbers one through ten. And you can have a vector with just one element, like X equals one in the last example. That’s perfectly okay. So even your standard, maybe a looping variable, is still a vector; it’s just a vector of one item. You can, of course, have a vector of no items at all.
We can use the combiner, or the concatenater or the collector – I’m not 100% sure what the C really stood for when it first came up, but that will take a bunch of numbers – in this case, the first seven values of the Fibonacci sequence and actually turn it into a vector. I’ve talked a little bit about data frames already.
Now, vector is the operations – for the most part they’re pretty obvious. If we have a vector that’s six elements, one through six, and we add three to that vector, we’re actually adding three to every element of the vector, so 1+3, 2+3, so on and so forth. And if we have two vectors, X and Y here, they both have six elements. If we add them together, we’re going to add the first element of each vector and then add the second element of each vector; so 1+7, 2+8, 3+9, so on and so forth.
But if our vectors are different lengths, that’s not an issue. It’s designed to work this way; that’s how vector mathematics works. So you won’t get any kind of error message. What you will get is you’ll start recycling the shorter vector. So in the last example, X has six elements, Y has three – so you have 1+7, 2+8, 3+9, then we start recycling Y; so 4+7, 5+8 and 6+9. Again, that’s how it’s supposed to work. It may seem counter-intuitive and you won’t get the error message; so just something to watch for when you’re doing vector math.
Using SSMS to Get Data In & Out of R with sp_ExecuteExternalScript
Alright, lesson portion over. Let’s jump into Management Studio and show how we’re going to get data into and out of R from there.
Erik Darling: Home again, home again.
Chris Hyde: So we’re going to do something very simple to begin with. We’re using this sp_executeexternalscript procedure, which is your new best friend. And the first parameter is pretty obvious; that’s the language that we’re going to use. If we were using Python, you know; three guesses as to what our argument would be here. And then the next argument we’re going to use is our script. Now this is the R script that we’re going to run. So in this case, we take our input data set and set our output data set equal to it, and that’s going to get passed back to us.
Now, this input_data_1, this is the data set that we’re going to send in for R to operate upon. Now I’ve got a tally table here in my admin database, so I’m just going to take the top ten numbers from that, send it into R, get it back out. So super simple; let’s make sure it all works…
Erik Darling: I thought this was going to be fast?
Chris Hyde: I did too. In fact, it was quite fast at 8:21 the last time I ran it…
Erik Darling: So I just have an off topic scripting question: do you need for parenthesis for that or would one do? I was looking at it and I was curious – when you scroll down a bit to the new column name…
Chris Hyde: I believe you do actually need both sets. We’ll see why a little later I think.
Erik Darling: They just want to make it hard for you.
Chris Hyde: You know what, this is kind of more informal than some other settings, so let’s just… There you go, it isn’t correct syntax.
Improved Version of the Query
Alright, let’s jump onto the next version of this query. Up here we’re using output data set and input data set, well those are the defaults. We can call those – we can call those whatever we want to call them. So let’s call the input data set My Query. We just have to specify that using the input data name parameter, and then our script changes to output data set equals My Query. Pretty simple. Let’s make sure we still run… Perfect.
Erik Darling: That was a lot faster that time.
Chris Hyde: It was. I think my R server must have finished my coffee for me. So guess what, we can do it with input data, we can also do it with output data. So I’m going to call our output data Results, which changes our query to Results. So that’s pretty obvious. I won’t even bother to run that, but we can start looking at variable-ising things like our input query…
Erik Darling: I love that word; variable-ising. That’s fun. I’m going to steal that one. You can have the Boston joke; I’m taking variable-ising.
Chris Hyde: Perfect; good trade. So it’s a simple query here, but let’s have a – you know, maybe a better format. If we have a long query, like we’ll see in some of my real-world examples, it could be 100 or more lines. So let’s put that into a variable; let’s keep our nice procedure call. So I’ve declared my input query variable and I’m passing that in as my input data. It still works.
And we can do the same thing with our r script. Again, this is a super simple R script; it’s one line. But if we have a more complicated script, let’s separate it out. So then we’ve got all our logic here, business logic for the query, R logic here – it’s separated out so our procedure call here is very, very clean. We can debug our query by copying and pasting into another query window. We can debug our R script by copying and pasting into R studio from here. It’s nice and clean.
Now, I haven’t seen, in the blogs that are out there, I haven’t seen a lot of standardization on this stuff. So this is kind of my forum to push for this as a standard for all of us that are using this sp_executeexternalscript. I just think…
Erik Darling: [Is there] parameter sniffing with this yet?
Chris Hyde: No parameter sniffing; nice try. So let’s make sure this all runs here… Perfect. Apparently, my sacrifices to the demo gods were good today; I knew that goat would come in handy.
Erik Darling: They always do.
Getting the Standard Deviations
Chris Hyde: Alright, so that’s pretty basic so far. Let’s start manipulating the data a little bit. So instead of just getting the ten numbers back, we’re going to take the standard deviation of those numbers. So instead of Results equal My Query, Results equals the standard definition function of My Query. Now, those of you with good eyesight or paying exceptionally good attention will know what’s about to happen; and that is a failure.
This is my favorite kind of failure, because the error message actually has a scroll bar, so you know it’s good. Now, the R error messages can be a little cryptic. What really is going on here is that this standard deviation function operates against a vector. This My Query, our data set that we passed in, is actually a data frame. So, it can’t do a standard deviation on a data frame – even though this data frame is just one column, this could be an entire table. So we have to tell it which column, using that dollar syntax, which column are we going to do our standard deviation on. We’ll do it on the sequence number, here. And this one fails too, with another lovely long error message with a scrollbar.
Erik Darling: [crosstalk] for CHECKDB errors.
Chris Hyde: Yeah – calls, source, with visible, eval, eval, call. It sounds like how you start a football play. But the important bit is that the output data set must be of type data frame. So now the type issue – the result of a standard deviation, it’s a vector result that operates on a vector and it returns a vector. Results though, you know, this sp_executeexternalscript is expecting a data frame to be passed to us. So it’s a data frame in that we had to pull just one vector out of that data frame, and it’s data frame out for sp_executeexternalscript to understand. So data frame in, data frame out. So instead, we’ve still got our standard deviation, but let’s coerce that back into a data frame so that it can come back out. And, now that works, that’s our standard deviation of the numbers one through ten.
Erik Darling: Here we go. I was hoping that would get fixed. Like, if I had to do this on my own, forget it.
Chris Hyde: As I was putting the script together the first time I did this demo, trying to figure out what all these errors meant, this took me like four hours; this simple script just to send data back and forth. So one of my hopes in going through all of this is that if I can take the four hours, and with 200 people on the call, it saves all of them that four hours, that’s 800 person hours. That’s 20 week’s worth of work; that’s pretty awesome.
Erik Darling: That’s a lot of drinking…
Chris Hyde: Oh, there was a lot of drinking involved to get through that four hours.
Changing Data Sets: Next Up, Irises
So let’s look at a more interesting set of data here. We’re going to look at the famous Iris data set, which you’ve probably all seen a million times, because it’s the dataset that everyone starts out with when they’re looking at predictive analytics. So it’s got the lengths and widths of petals and sepals of different species of iris, and then a species indicator. So based on those ratios, you can actually predict pretty well what species of iris it is from the measurements. Now, this dataset comes baked into R already, so there’s no installation needed. It’s just right there to use because it’s so popular in the larger data science world.
And there’s some other more fun data sets. I like the Titanic one particularly well, because you can take all the demographics of the passengers on the Titanic and you an use that to predict who’s going ot live, who’s going to die, who should have lived but they got thrown off a doorframe that was clearly big enough to hold two people… Maybe that’s not in there.
Erik Darling: We already know who’s going to die; it’s anyone with an Irish last name.
Chris Hyde: That’s right.
Erik Darling: They were all on the bottom of the ship; they were doomed.
Chris Hyde: One of the factors in there is the class that they were in. and you can see, look at the class and look at the gender and look at the age, and those three things are hugely predictive of who lived or who died.
Erik Darling: Bottom of the boat, man; no good.
Chris Hyde: You know, that’s a little bit morbid for this kind of presentation, so we’re just going to stick with flowers.
Erik Darling: But it’s still less morbid that Adventure Works.
Chris Hyde: Yeah, you need to give me some kind of warning that you were going to bring that up, actually.
So let’s pull this data from R over into SQL. And we have another failure, but at least this time the red piece is only two lines. And that’s just because – we haven’t talked about this with Results clause much yet, but so far we’ve been just been pulling back one column’s worth of data, and the Iris data set has five columns worth of data. So we’re going to need to specify those here in our Results set’s clause. And note that we’re using SQL data types, not R data types here.
Erik Darling: The double quotes are the standard identifier for those columns? Like those square braqckets are single ticks?
Chris Hyde: That’s right. So now we can run this and we’ve got our data set.
Executing Into a Temp Table
Chris Hyde: Now, let’s say we want to bring our data set, you know, and maybe store it in SQL server while this sp_executeexternalscript – you know the procedure, it’s not determinate; it could return one column, it could return five columns, they could be different names every time. Because it’s like that, we can’t just do an execute into an existing table. So we’ve got a couple of options; we’ve got our execute into a temp table. Because we knew that the columns were going to be, we can then select from our temp table, insert into our real table, for example. Or you can use the open row set to put the data into a table. So in this case, we have our open row set command and we have to put everything in our procedure call in here; we can’t build it dynamically, unfortunately.
So on our complicated scripts, all the nice formatting that we’ve got by variable-ising the … script and the SQL script, we’re going to lose that. But at least that does allow us to do our direct SELECT into our table – or at least it does if ad hoc distributed queries are turned on. You know, they’re not turned on by default, and there’s a good reason for that. I’m not going to go into that today. What I usually say is buy your DBA a cup of coffee and have her explain it to you, and then buy her a beer and get her to actually turn it on for you…
Erik Darling: It will take more than beer for that.
Chris Hyde: Like I said, you probably don’t even need to use this. You could do the select into s temp table and just select from there, just showing everybody a couple of different options. But yeah, this is my laptop, so I’m okay with opening up hole here. And then we can do the open row set and select it straight from our table.
Erik Darling: Well if it works on your machine then it’s good enough for me.
Chris Hyde: Yeah, nice. So yeah, hopefully everybody here who needs to use this; your DBAs are much more forgiving. I kid…
Erik Darling: Andrea says she, “Accepts cookie arrangements.”
Chris Hyde: That works.
Erik Darling: Much cheaper than Josh Groban tickets, I guess.
Chris Hyde: Much less sugary and sappy too.
Erik Darling: Wow, bringing the heat. You’ll get the [crosstalk] for that one.
Chris Hyde: Yeah, I know I’m taking a chance by alienating all the hardcore Josh Groban fans…
Erik Darling: There’s billions of them in the room.
Chris Hyde: Oh, I see I have lost nine or 10 Ds over the last couple of minutes; so maybe they were all Josh Groban fans.
Erik Darling: Or Titanic fans; you don’t know.
Chris Hyde: True. Damn you, Rose; why did you have to do that?
Next Data Set: Doctors
So this is kind of the how we do everything. Now I’m going to talk a little bit about the why. So I’m not going to turn anybody into a full-blown data scientist in 90 minutes. I’m a certain age; I’m never going to be a full-blown data scientist myself. But what I can do is I can use some of these data science-y statistical techniques and create value for my end customers.
Who are my end customers? A lot of them – a lot of my consulting clients are in healthcare. I like working in healthcare a lot, but I’m not going to talk the specific people. I’m not going to name the clients that I could name; I’m not going to shame the ones that I could shame. Instead, I’m going to talk about my fictional hospital where all of the doctors look like famous actors.
Erik Darling: No Clooney?
Chris Hyde: No Clooney; I’m trying to be a little more representative here. So we’ve got Bones, we’ve got Dr Cuddy, we’ve got a couple of people from Grey’s Anatomy, which I don’t even watch…
Erik Darling: Look at Doctor Quinn; that’s good.
Chris Hyde: We’ll hear more about Doctor Quinn later.
Erik Darling: Oh, we have a good question in Slack. “Where’s Doctor Crusher?”
Chris Hyde: No Doctor Crusher and no Doogie Howser; and seeing as Neil Patrick-Harris is from Albuquerque, I really should have a Doogie Howser in here, but the slide’s been working for me so far. What hasn’t been working at my fictional hospital is we have a report that caused some panic.
So in a live presentation if this, this is where I ask everybody, who’s been a report writer? Who’s a report writer now; who’s has to write reports in the past? Pretty much everyone raises their hands. Then I say, raise your hands if one of those reports that you’ve written has caused panic. And it’s usually 100% the same people.
So here’s the report that caused some panic at my fictional hospital. So we’re going to pop over here to reporting services – and that’s not good. Maybe the demo gods didn’t like my goat after all. Let’s do a refresh here. Well this is not.
Erik Darling: Maybe you need to go with maybe just a sheep and they were unimpressed.
Chris Hyde: Maybe the sheep didn’t like my Titanic… Let’s make sure everything is still running. I can’t connect to database. My databases look like they’re up…
Erik Darling: Let’s learn more… We’re all going to be SSRS pros after this too.
Chris Hyde: We might have one less SSRS pro after this. Well that is not good; let’s try and run these reports in visual studio instead then.
Building a Charge Summary Report
So here is the charge summary report. Maybe it is a charge summary report; let’s get to preview here. Okay, so we can run this report for all of our organization, or maybe just primary care or just specialty care. You know, we could run it for just cardiology, just the [peeds]. But let’s run it for all departments. And we could run it just for Bones, just for Dr Quinn Medicine Woman – let’s run it for all of our providers. And how do I get to the run button? There we go. Okay, I guess we will stick with our visual studio for this.
Erik Darling: All this is good enough for me anyway; graphics and all that.
Chris Hyde: This is – what are we doing here? Because this is not the up to date data… Ah, we’ll soldier ahead because I gave us nice good light of day earlier, and it’s not appearing here. But let’s imagine this is yesterday’s date here. So these are the total charges for a whole organization. And you can see that between Wednesday’s charges and Thursday’s charges, we jumped down just a ton; so some administrator is running this and they panic.
And they put in a ticket – I have to drop what I’ve been working on for a couple of month, which of course is due tomorrow, and figure out what happened to our charges. What’s wrong with this report?
Erik Darling: I see in the solution explorer there is another charge summary, right above the one you’re in, you may be in the one down?
Chris Hyde: Yes, we will look at that next. That is the solution to our panic problem.
Erik Darling: Okay, sorry about that. I’ve been in the wrong database context for a billion live demos now.
Chris Hyde: I’ve never had reporting services decide to fail on me like this, but it’s good that we have another option. But I know that this – I know that this is probably just part of normal variation. I mean, I came up with a sample data set, so I kind of cheated. But in the real world, what I did is pulled this data into R and I pulled the last 180 days worth of charges in and I saw, look, this kind of follows that traditional bell curve..
So that tells me a couple of things: it tells me that we’re going to see some normal day to day variation, but it also tells me that I can kind of quantify that normal day to day variation using these… I said there’d be math. You might remember these from high school. So if we look at the one in the bottom left here, that 95%, that tells me that if our data is normally distributed, you know, fits the bell curve, but 95% of our data is going to fall within plus or minus two standard deviations. It’s symmetrical, so that means the 5% that falls out is 2.5% on each side. So maybe one out of every 40 days, we’ll see data that’s above two standard deviations. One out of every 40 days, we’ll see data that falls below two standard deviations.
So we could get into super detail here, but I think just the one out of every 40 is going to be enough for us. We could program this standard deviation formula into T-SQL, or we could just use R because R is really good at statistics. So what we’re going to do is we’re going to calculate those mean plus two/minus two standard deviations, and put some lines on the graph to show people when they should start worrying. You know, it’s the old control line technique from my old manufacturing days.
How to Publish This Data for Consumption
Erik Darling: So there was one question, What’s the best way to publish data like this so that management can look at it without having to touch underlying data? My vote was PowerPoint, but you might know something better.
Chris Hyde: You know, there’s all sorts of ways. You can use R to bake it into – Power BI is nice for that. There’s even, you know, kind of a hacky way to use graphs in SSRS, to use the R graphs. Really what I’m going to get at here, and we’ll see when we see the final report, is just a way to show the statistical techniques in a way that our end users can use. They don’t want to know about bell curves and distributions. They just want know, tell me when I should panic; tell me when I shouldn’t panic.
So it’s nice to have a PowerPoint like this to explain to management, but they will start glazing over almost immediately and just accept that you know what you’re talking about; at least in my experience.
Erik Darling: I guess you could just jingle your car keys in front of them.
Chris Hyde: Something shiny and jangly.
Erik Darling: Hey buddy, hey buddy…
Building a Report Stored Procedure
Chris Hyde: Let’s give a little context here. Let us show the report procedure that I use to generate the report, and then we’ll show the procedure that I use to generate the statistics for it. So here, we’ve got our four variables that we’re passing in as parameters. And it’s fairly simple; we’ve got our fact table and we’re going to sum up or charge amount measure from our fact table. And then we’re going to join in our dimension tables here and we will just filter the members of those dimension tables based on the parameters that are passed in. so it’s a pretty simple path that anyone who’s been writing reports has probably seen 100 times.
So here’s the procedure that gets our returned statistics. So we’ll see right away, one difference, we only have three parameters instead of four. Well, if they run the report just for the last seven days, we don’t want to base our statistical assumptions on that; let’s base it on a wider set of data. So let’s use the 180 days worth of data that I used to pull into my chart; let’s hard code that.
But this part looks similar; we’re summing up the measure from our fact table, joining in our dimension tables, filtering the members of our dimension tables based on the parameters. So you have, just running it for [peeds] for example, we just get the statistics for [peeds]. Now we’re going to take this, we’re going to sum up our total charges by each day, and we’re going to throw it into a temp table.
Now, I would not be throwing this into a temp table in production. What I’m doing this for is just to – when I show you guys the query that creates the data set to pass into R, I don’t want to have this 60, 70 line query; I just want a nice select from a temp table, just to make it easy visually as I’m going through the demo. So again, I wouldn’t do this in production, just doing it to show you guys, make that part fit on one screen, for example.
So here’s the input query – so again, we’re just selecting from our temp table. Now we’ve got a couple of wrinkles here. So one wrinkle here is we’re only going to look at Monday through Friday, because I actually lied to you guys; it’s not really my fictional hospital, it’s really my fictional ambulatory clinic. But that doesn’t flow quite as well and I couldn’t make it fit on the slide with all the pictures. But because it’s an ambulatory clinic, we’re not open on Sundays. And on Saturdays, the only thing we have open is urgent care.
So if we’re making statistical assumptions based on all seven days, our man’s going to be somewhere around here, 300K a day. It’s amazing; we’re above average 100% of the time… No, it doesn’t quite work. So that would pop up pretty easily. The other thing, if we look back six months now here in the US, July 4th was a holiday. It was on a Tuesday…
Erik Darling: Lots of missing fingers.
Chris Hyde: We wouldn’t have been open on Tuesday. So it will throw the data off not a huge amount; 1% or 2%, we wouldn’t see it visually. So when we’re doing these things, we kind of have to have like the statistical knowledge to know what’s a valid sample set, et cetera, what’s a good statistical technique to use, but also some domain knowledge, some knowledge about the business to know that, “Hey we need to exclude days like July 4th.” So it’s where those two things really meet that’s the difficult part. It’s where it turns from a science into an art.
Now that we’ve taken care of those issues, that’s the data set. We’re going to run that query, send the results to R and we’re going to get our standard deviation. We’ve already seen how to do that in our earlier demo. We’ll get the mean, and we’ll get the mean plus two and the mean minus two. And then we’re going to use that combiner that we saw to turn those four different values into a vector. We know we have to return as a data frame for this to work, so we turn it into a data frame; except at this point, this is four rows with one – one column with two different values. What we want to return to SSRS is one row with four different columns.
So we need to transpose it, and that’s what this T does here. So a fairly simple R script, and because we’ve variable-ised everything, we have a really nice clean procedure call. And so if we look at the version of our report here, we’ve got a second data set; charge summary and charge statistics. And we will take this black line; we’ll set that equal to the mean. The red line will be our minus two sigma and the green line will be the plus two sigma. So we run the report, and let’s run it for the same parameters here. We can see that yesterday, although we were lower than normal, we’re still well within the expected range.
So if we add these control lines to the report, it makes it easy for our users to kind of use these statistical techniques without even knowing that they’re using them. So one thing I’d like to point out here – this green line – you might think, “Wouldn’t it be great to have more charges? Isn’t that a good thing if we’re above the green line?” Well, it really isn’t. I mean, the statistics, the normal distribution, you know, it cuts both ways. You can’t have your cake and eat it too; and I’m sure there’s more clichés I could be using here. But the variation, it works both ways. So if we have our charges way up here, that’s the point that I would panic. The business might not, because they’re like, “We had a great day yesterday.” But if it’s way above here, or it’s up above here more than one out of every 40 days, that’s the point, “Okay maybe we have an issue. Is there a problem in the ETL? Is my report doing a duplication somewhere. So for me, that’s the panic point rather than it being too good; so that’s something to point out there.
So that’s one way that we can take our simple statistical techniques and add that extra value by using R. Now yes, we could do all this within T-SQL and just use the STDEV function. But I wanted to start out with kind of a simple statistical use case before we get into the more complicated one; which is this overbooking without over overbooking.
So we have a problem at my fictional hospital ambulatory clinic, and that problem is our patients. Well, I guess it sounds kind of bad when I say it that way…
Erik Darling: Users are always the problem; trust me – the standard DBA boilerplate; the server was fine until the users come on it.
Chris Hyde: Exactly. So the problem with our patients is sometimes they don’t show up. Now, this isn’t so much a problem for Bones, because he’s a cardiologist, and most people don’t miss their cardiology appointments too much. Only 10% of his patients don’t show up.
Erik Darling: If they do, they’re dead anyway.
Chris Hyde: Okay, maybe I could have use the Titanic – this is definitely a morbid crowd, or at least a morbid moderator.
Predicting Overbooking for Doctors
Chris Hyde: Anyway, Doctor Quinn here is somewhere around 26%. She’s a GP, so that’s pretty common, right. GPs might see 32 patients a day for 15 minutes at a time. If a quarter of the patients don’t show up, she’s sitting around idle for two hours per day. She’s not making money for herself, she’s not making money for us, so there’s a simple solution; you just overbook an extra eight patients every day.
Well, that doesn’t work out so well, because what happens is you can’t predict that eight of those patients are going to not show every single day. Some days, you’re going to get 30 out of the 32 patients show up. And you’ve overbooked an extra eight, so you have six extra patients. At 15 minutes an appointment, by the end of the day, those people have been sitting around; the last one might have been sitting around for 90 minutes. And if you make somebody sit around for an extra 90 minutes, they don’t come back to your organization. So we tried to maximize our revenue – we end up minimizing our revenue in the long run because we ended up upsetting our patients by over overbooking. It’s the same in the hotel industry and obviously the same in the airline industry. Unlike United Airlines, we’re not going to take our patients that have been sitting around and beat them up and throw them out of the waiting room, for those of you who remember that from last year.
So this is a problem that they have too, and they have some very sophisticated models. Here at my fictional hospital, we have a fairly simple model. We found that when we looked at all the different factors, the only thing that really predicted patients no-showing in the future was the rate for that provider in the past. So that’s what we’re going to use here.
Modeling with a Galton Board
And I’m going to jump back to the slide deck here. We’re actually going to model this using something called a Galton Board. You remember this from visiting the science museum as a kid, or something from math class… [crosstalk]… Yes, that’s right; I think there’s even a game show about this now, right?
Erik Darling: Well there was – you’re right, there is. But I just remember it form Let’s Make A Deal, or whatever, – no, the morning game show that old people watch, where they do Plinko and you’ve got ot get the chip in the right thing to win the trailer? Come on…
Chris Hyde: The Price Is Right?
Erik Darling: That’s the one, perfect.
Chris Hyde: Excellent, I’ve just outed myself as an old person..
So anyway, how this works is we drop the ball here, and 50/50 it goes left, 50/50 it goes right at each one of these pegs, and as it goes down, it builds up a model of a distribution. And that’s called a binomial distribution, because at each step we have two options; left or right. It’s like flipping a coin; you can have heads or tails. Or it’s like is your patient going to show up? Yes or no. Now, Pascal figured out the mathematics of all of this. we have this lovely Pascal’s Triangle that he did back in, I think, the 16th century. But, I’ll actually show you guys what this looks like visually. And this is a good opportunity for me to go get some more water. Erik, let me know if this isn’t coming across nicely.
Erik Darling: Looks okay so far – a little blurry.
Chris Hyde: Is that having an effect?
Erik Darling: It’s a little screechy on the audio.
Chris Hyde: I’ll go ahead and stop it here then. So I kind of like that music because it adds a certain gravitas to the presentation that maybe is a departure from the previous tone.
What I did want to point out here – maybe there’s a screenshot here where it’s more full – and you can see that that distribution, it looks a lot like the bell curve that we looked at here. But there’s a couple of differences, it’s taller in the middle, it falls off more steeply, more quickly to the sides. It’s still symmetrical about the mean, but because it does have a different profile, the mathematics are a little bit different.
The Math Behind the Model
Now, I’m going to jump over into our studio, and let’s talk a little bit about some of the mathematics behind this, so we’ve got a context. First off, we’re going to use these binomial functions here; so let’s get some help on that. Let’s get help on the one called rbinom. And it shows up here on the right, and we can see that we’ve actually got four different functions here. We’ve got dbinom, pbinom, qbinom and rbinom.
Dbinom will give us an exact percentage chance; like what’s the chance of getting two heads out of throwing five coins, for example. The rbinom lets us generate that random data, so it lets us throw those five coins. The P is a little like the D except it gives us range probabilities…
Erik Darling: You’ve got the GoToWebinar window popped up in front of some of your stuff there.
Chris Hyde: I’m not having fun with that. is that still there?
Erik Darling: Yes. The best thing to do is dock it off to the side, then it at least disappears for all of us.
Chris Hyde: Okay, there we go, much better.
So the pbinom, instead of saying like what’s the chance of two out of five heads, it’s what’s the chance of more than two heads? Or what’s the chance of two heads or less? The qbinom flips that around, like what’s the 90th percentile chance of if you throw five coins, what’s the 90th percentile, what’s the 75th percentile of how many heads do you get?
Now, you’ll see the arguments for these look pretty similar. Size is there, probability is there, and we’ll see that was we go through, looking at the examples here. We re-dock and go back. So when you use that rbinom, the example I mentioned – take one fair coin, so size equals one. It’s fair, so probability if 50/50. And there’s five of them, so toss five coins, and we can see down the bottom here we get four heads. So it’s more than we’d expect, but it’s going to happen sometimes. Let’s take those ten coins, they’re still fair coins – well, it’s ten coins instead of five, let’s toss them – sorry, ten coins instead of one; toss them five times and count up the number of heads.
And we see five, sixes and sevens for the most part. Now, we could do this a whole bunch of times, 1000 times, or we can do it 10,000 times and do a quick graph of what that looks like. There we go – so that looks like our binomial distribution, but this is coins, how does that apply to our patients?
Well, just as in summing up, we summed up the number of heads – heads was one, tails was zero – we’re going to do the same thing with our patients. If they show up, it’s a one. If they don’t show up, it’s a zero. Now, this is a 50/50 coin, if it’s a fair coin. If our patients only show up at a 50/50 rate, we’re not going to be in business very long.
So it’s a little more fair, let’s see, we’ve changed the probability here to 70% instead of 50%. Let’s see how that effects the graph. And we see it starts skewing to the right, but the peak is, that we’d expect, seven out of ten.
Applying the Models to Overbooking Dr. McCoy
Now let’s really apply this to our data and let’s look at Dr McCoy’s. so in the data that I ran earlier, in the report that we should have seen in SSRS, we would have seen that 10.5% of his patients don’t show up. Now let’s see – that’s like a coin that comes up heads 89.5% of the time. He sees 16 patients a day. Let’s see what 10,000 days in his career would look like. And we see it’s very heavily skewed. Maybe one out of every six days, everyone shows up. And nearly one out of every three days, only one person doesn’t show up for him.
Now, here in our business, we’ve decided we’re going to pick a threshold of 90%. And what we’re going to say is we’re going to overbook. But even with overbooking, we don’t want extra patients sitting around 90% of the time. It’s okay to have extra patients, one or two sit around 10% of the time, but that 90% of the time, even though we’ve overbooked, we don’t want extra patients sitting around.
Now, we can see graphically here, 90%, or the top 10%, comes in right about here, a third of the way down this 16 bar. So we can kind of see graphically, we can’t overbook for Doctor McCoy, unfortunately; without breaking our business rule. Let’s look at some of the percentages and let’s see that numerically. I’m going to turn off scientific notation, and we’re going to use the dbinom to see what percentage of the time – 89.5%, what’s the percentage chance that 15 out of the 16 show up? That’s 31.7%, so it fits in with what we saw on the graph. Now let’s use our sequence operator here, the 16 to zero, and see complete percentages for everything from 16, 15 all the way down to zero patients showing up; which would be a really bad day.
Erik Darling: Morbid day, even.
Chris Hyde: That’s right. Maybe they all took the Titanic to their appointment that day.
We can see this kind of fits with what we saw on the graph; about one out of every six times, everyone shows up here. That’s the 16.8% there in the top left, so on down to – it’s a vanishingly small chance that nobody shows up. So that’s kind of telling us numerically, “Hey we can’t overbook for Doctor McCoy as well.
Bottom Line: You Still Gotta Know Math
Now, one thing that, when I was doing this, like is this real? It didn’t fit with what I thought I knew; that everyone shows up that often. Luckily, I know some basic probability; because of a misspent youth playing Dungeons and Dragons. But because of that, I know that the probability for all these discreet events coming together – what I mean by discreet is whether or not Mister Smith comes for his ten o’clock appointment has no bearing on whether Missus Torres comes in for her 10:30. So because these are discreet, you can just multiply that percentage chance together 16 times to come up with what’s that value. And it really is the same value.
So I know that when R came out, it was – when it was first put into SQL Server, people were saying it was a magic bullet. You didn’t even need to know statistics, you don’t need to know any math – and I don’t think that’s true. But I don’t think you need to know a lot of math. I think you just need to know enough math to be able to check your work when you come up with a number that doesn’t seem like it’s quite right to you. You want to have that little bit of statistical probabilistic background to be able to do some of those checks.
So, for those of you who are listening, if you only get one thing out of the presentation, I really hope it was that script of how to use sp_executeexternalscript and save you four hours. But if you get a second thing, it’s really that you should be playing more D&D, because it’s really going to help you out with this stuff.
Using Range Percentages
Alright, on to our range percentages. So what’s the probability that more than 14 of his 16 patients show up? Well, we can use the pbinom version with a queue of 14, and setting the lower tail to false means that we’re looking for more than our 14. Whereas the second example, 14 or fewer, we set the tail to true. So if we run those two together, we should see 48% at more than 14, which matches the 16.8 plus our 31.7. And then what’s the probability of 14? Well it’s 15.5%.
Let’s say we wanted to say 14 or more – we can’t actually do that with pbinom, we can only say more than 14. So we have to know that – we have to find out that the opposite of 14 or more is 13 or fewer, take the 13 or fewer here in line 51 and subtract it from one. And that should give us 76%.
Now, the real magic is going to come in our quantile probabilities here. We said that we have that 90% threshold of not having extra patients sit around. So what’s our 90th percentile of the distribution for Doctor McCoy? So let’s see that. And we think we should know what this is from the graphs. And yes, it is 16 out of 16. So it’s another way of saying that we can’t possibly overbook for Doctor McCoy, because 90% of the time, that percentile’s going to hit right on the 16 out of 16 showing up.
But things are pretty different for Family Medicine and Doctor Quinn Medicine Woman. So if we look at her complete percentages – now a little over three-quarters of her patients show up. Size is 32, she sees 32 patients a day. So let’s look at every possibility from 32 down to zero. And we see that’s barely a 100th of 1% of the time that everyone shows up. It’s about a tenth of a percent that only 31 out of 32 show up. So as we’re looking down to where our top percent is, we’re going to be somewhere in this range. But even with the lower percent, we can see down here that it’s pretty unlikely that everyone’s not going to show up for her.
Displaying the Data in a Histogram
Let’s see what that looks like graphically; use our same histogram. And you see it’s still a bit skewed but a lot closer to the classical binomial distribution. If we look for our top 10% here, we’re probably somewhere around this far right here. Let’s use the qbinom to find exactly the point… 27. So what we’re saying is that we can safely overbook five patients for Doctor Quinn every day and still only have that one or two patients sitting around for extra time on 10% of the days.
So now that we have that, let’s put this into a report that our end users can use. Because I like talking about math, I’m hoping that you guys on the webinar like learning about math. Our business users, they couldn’t give a toss about math. They just want, “How can I use that?” So let’s turn it into a report.
So we’ve got our procedure here, and we’ve got our first three parameters that should look familiar to you guys. Again, we don’t have a number of days. We want to use that 180 days to make sure that we’ve got enough data to make statistical comparisons from – or statistical assumptions; sorry. Now we’ll let them run this report for a range of appointment dates to see how many people they can overbook on each date. And we’ll even let them customize that threshold. So maybe they don’t want to use 90%; maybe they want to use 95%. They want to be really conservative and have as few people sitting around as possible. Maybe they want to be more aggressive – they don’t mind a couple of extra people sitting around; do 75% and really try and maximize revenue in the short term. We’ll give them that as the option.
Now, this part looks similar. We’re going to look at a different fact table, but still our appointment fact. We’re going to sum up our measure. We’ve joined in our dimensions and we’ve filtered the members of the dimensions based on the parameters; so they’re as we’ve seen before.
Now we’re going to figure out our no show rate, which is the number of now shows divided by the total number of appointments. But we’re still passing into our query that we’re actually going to use the show-up rate; the 90% chance that they show up for Doctor McCoy, rather than the 10% chance that they don’t. So we’ll subtract the no-show rate from one to get our show up rate, and I’m going to put this into its temp table. Again, I wouldn’t do this in production, but I just want to save us from passing in this 150 line query here into our procedure.
So we’ve got our temp table. That makes our input query very, very simple. It’s just selecting these values. Now, our script is pretty simple as well. We’re going to set a data frame equal to my query, and we’re going to use the qbinom function that we just looked at to figure out how many appointments do we expect based on our threshold. So again, this is all customizable based on that threshold that they pick. So if we have Doctor Quinn and she has 30 patients scheduled, we might end up expecting 26 of the 30 to show up. So we could actually overbook 30 minus 26, or four slots. And we’ll send those results back. Now I use the open row set here, just because I could, but we have one more thing to calculate, and that’s the total number of free slots. So the example I just gave, where 26 out of 30 patients are expected to show up for Doctor Quinn, we know we can overbook for – we still have to book the two people that we’ve never booked in the first place. So that’s six total free slots. And then we have a simple data set that we can then pass into our SSRS report here. And again, we’re going to run it for all our departments, all our providers, and this is the tricky part. I’m not sure if this is going to use the data set that I ran earlier; so we may get no data here, unfortunately.
Okay, no, great, we do get data here. So if we look at Doctor McCoy here for Monday, we can see he’s got 14 patients booked; we expect 14 to show up. So we can’t overbook for him, as we’d expect. But we still have the two total free slots; let’s get two patients in there. If we go down to Doctor Quinn for Monday, she’s got 28 patients. We expect 24 of them to show up, so that’s four we can overbook plus four we’ve never booked in the first place. So our clinic managers can get on the phone and try to get eight more patients in there for Monday. So we’ll overbook by the four, but we won’t over overbook to where we have people sitting around.
So, do our clinic managers need to know about binomial distributions and Galton Boards and such? No, of course not. But, just by us knowing those simple statistical techniques, we can add that extra bit of value to our customers through the reports that we already know how to deliver without having us be full-blown data scientists.
Using Multiple Queries in a Report
Erik Darling: So Chris, there is one question here. [El Noder] wants to ask, “If you want to use the results of multiple SQL queries in your report, how do you do that? Or is that even possible?”
Chris Hyde: So in the report it’s possible. We’d have multiple data sets, for example, here. If we wanted to do kind of R work against multiple data sets, we’d have to call sp_executeexternalscript individually multiple times…
Erik Darling: That’s not in parallel?
Chris Hyde: Then use T-SQL to combine our data that way.
Erik Darling: J.D. Walker wants to know, “For using multiple queries, would we use CTEs for that?” I think it kind of depends on what you’re trying to do though. Because you can’t have multiple results sets with CTEs, you can just have one final result set. So I suppose you could use them…
Chris Hyde: In my examples, I’m using CTEs just to kind of break up the logic a bit.
Erik Darling: J.D. was not paying attention. Shame on you, J.D.
Chris Hyde: In order to get the different result sets, like I say, you’d have to be calling sp_executeexternalscript multiple times.
Downloading the Resources
Chris Hyde: So, this is the last slide. I wanted to leave everyone with where they can get this stuff:
- Download the Power Point & demo scripts from Github
- Microsoft Data Exploration and Predictive Modeling Tutorial
- Coursera Data Science Specialization
- Email Chris: email@example.com or firstname.lastname@example.org
It’s up on my GitHub; all the slides, the demos, the scripts to create everything, it’s there. So just follow and look for the SQL Server R services presentations. There’s also a version of this using Python up there, if you’re more interested in using Python, because it’s the new 2017 thing. Microsoft has a good tutorial using R but going more in the predictive modeling space, predicting revenue for a taxi cab company. So I’d recommend, if you really want to get into this, start there. That’s a good point for that.
Then a lot of the MOOCs, the massive open online courses, have been getting into this. And Johns Hopkins did one in collaboration with Coursera that I went through, and I actually found the math professor there not to be the best. I think the better one here – and I need to update this slide deck – is the one that Microsoft is doing in conjunction with edX, and that’s if you go to edX.org [crosstalk]… Yeah. So if you go to edX.org and search for Microsoft data science certification, that’s kind of modeled the same as this Coursera one. It’s eight to ten courses, things about R programming, although you can do it in Python with the Microsoft one, things about statistics – and then it brings it together in kind of a final capstone, as they call it; kind of a real-world kind of example that you work through as a course project.
So any other questions before I finish up?
Erik Darling: I’m not seeing anything come in from Slack. No, I’m not seeing anything.
Chris Hyde: I see that Richie loves D&D, so that’s good. So if you have questions – if anybody has questions that come up later, please shoot me an email. Thank you guys so much for having me. Thanks, Brent. Thanks Erik for moderating me and laughing at my jokes so I didn’t feel so alone.