Enjoy the Podcast?
Developers & DBAs who spend a lot of their time writing TSQL queries.
TSQL was voted best programming language in 2013, let’s explore why!
In this session, we will look at what makes SQL such a fascinating language. We will do this by working through a few real-life development problems and common design challenges based on real questions asked on Ask.SQLServerCentral.com.
What in particular will we look at? Well… Tally tables, CURSORS, hidden CURSORS, CTEs, Recursive CTE, Manual Pivot, TSQL Pivot, CROSS JOIN, CROSS APPLY, CORRELATED QUERIES, STUFF(), RANK(),GO, DENSE_RANK(),LAG(), LEAD(), ROW_NUMBER(), XML PATH, MOD %, STRING_AGG(), DBCC DROPCLEANBUFFERS, DBCC FREEPROCCACHE, STATISTICS IO & TIME. (A bit of TSQL)
Why I Want to Present This Session:
You do not often get to see a session where someone talks exclusively about TSQL and more importantly why TSQL is great. This session is designed to improve the way you tackle problems in TSQL. I would like to share these tips to make you a better TSQL developer.
Brent Ozar: Alright, in this next session Terry McCann is going to talk about why tSQL was the language of the year. Take it away Terry.
Terry McCann: Cool, okay. Well, thanks very much everybody for first of all selecting this session. So, this was a personal favorite session of mine. I don’t generally do a lot of tSQL anymore, but wherever I get the chance to do some tSQL, you know, I’m right in there.
So, this session is about why tSQL was voted the language of the year. So back in 2013, there is a community index that rates programming languages, and they voted Transact-SQL the language of the year. And that’s up against you know, things like c#, C++, all these object-orientated languages, these huge languages, and tSQL came out number one.
So, we will understand that by doing a couple of things. So, in this session, what I want to do is look at a couple of tricky tSQL problems and then how you might solve them. By doing that, we’ll demonstrate a bit of tSQL along the way and why tSQL is this incredibly versatile language that made it that language of the year. So, we’ll start with a problem and we’ll look at how we could solve it and we’ll look at a couple of use cases on how I help somebody else solve it, probably on Ask SQLServerCentral, which was – when I was talking before about the website that I would spend most of my time answering questions on, it was Ask.
I don’t get the greatest amount of time at the moment to go and answer questions, I chip in every now and again when I see one. I never went down the route of things like Stack Overflow because the ferociousness of people answering questions on there, by the time I had written my answer there were five answers, whereas Ask was a little bit more – little bit slower paced so you could mull on to some of these questions.
So most of these questions are real and they are – the questions are actually taken from what somebody asked, or at least they are based on those real questions. Okay. So, what are we going to look at in this session? So, we’re going to look at a couple of things. We are going to look at tally tables, we’re going to look at cursors, we’re going to look at hidden cursors, CTEs, recursive CTEs, all this stuff that you see here; so, pivoting and a whole load of tSQL. So, by the end of this session, you may not be a master in all of these things, but what I hope is that you start thinking about, okay, well how can I take that bit of tSQL, how can I take and cross apply and how can I take a Windows function and combine those two things together to solve some kind of tricky problem that you might not have been able to do before. And the idea behind all of this is that you know, we try and do it in the most performing way. I will like, raise my hands up and say that I am not really a performance tuner. I know enough to get me by and to make my queries run that bit better. So, we’ll look at a couple of little bits around that.
So, in this session, we will look at a little bit of tSQL. So, it’s going to be pretty much all demos and the slides are mainly just to sort of aid the understanding of the question. So, if you have a better way to do it, by all means, please shout it out, put a question, stick a question up on Slack. Say another way that you might do it. So, one of the beauties of tSQL and of rational databases in general, or – so SQL is a declarative language, so rather than you know, like a procedural language where you say do this stuff, then do this stuff, then do this stuff, what you’re doing when you’re writing tSQL is you’re writing your query, which is then interpreted by the engine and that engine works you know – scripts that down and does what it wants to do. So, you could have a look at a query and solve it one way, and someone else do the completely different thing, but get the exact same result. And that could happen a hundred, a thousand times, of all these different sort of combinations of techniques. So, if you do have a better way of doing something, then please share it. Please share it on Slack, please stick it on Twitter, any way you really want to.
So, for testing performance, we will do a few little things, so as we’re going through I will do stuff like DBCC DROPCLEANBUFFERS, so to clear out of the buffer, similar to doing a restart of SQL Server, and FREEPROCCACHE. Both of these come with the standard warnings that I’m just doing this with my laptop, I’m not doing this on a production server. If you do it on a production server and it drops everything and then everybody screams at you because everything is having to be rewritten into the plan cache, that’s your fault, and you know, I’m hours ahead of you so I’ll probably be asleep while you’re panicking, so you can’t blame me. So yes, use with caution.
We’ll do things like reset statistics I/O and statistics time on so that we can see you know, how many pages is being read and here you go, you’ve got a bit of an explanation of what statistics I/O and statistics time will do. We’ll pretty much be looking at the scan count and logical reads to do a bit of comparison.
Okay, so our first question, and this question came – this one actually I think came from an internal customer that I was working with, maybe around about – oh god, I don’t know, maybe about ten years ago or something. One of the first jobs that I was working in. And so, they said you know, I work for a HR company, and I have data with periods of absence, and that goes from date A to date B. So that could be, I was ill on the 1st of January and I came back to work on the 9th of January. And for a reporting purpose, what they actually need to see is rather than from day A to day B, they need to see day one, day two, day three, day four, and you know, for every single one across multiple rows, as opposed to just that single row. And so, this query really is, how you would go about solving that. So, let’s dig in. Demo one.
So, all of this stuff I try and point to it. There’s a – on the whiteboard behind me, you can go to HyperBI.co.uk, if I can remember my own thing, and you can get all the slides from there, and all of the demos and everything else. I’ll also just stick a Bitly link and I’ll tweet it or stick it in Slack afterwards so you guys can go away and download all of these scripts. Everything is designed, maybe with the exception of one, which requires Adventure Works, is designed just to run out the back. So, you can just run this stuff.
Okay, so the first thing that we’re going to do is I’m just going to build out a database and I’m just going to populate it with a couple of tables, which I’ll come back to later and talk about. So, for now, I’m just going to build them and I’m just going to let it go. Okay, so that problem one. I’ve got these dates from date A to date B, and I need them across multiple rows. So, if you’re every asking questions on the internet, and you really want that question answered, the best way to do it is to make it really simple for somebody to answer it. If you just write, “I’ve got this problem, I’ve got these dates A to B, how do I do it?” Somebody’s probably going to answer it, but if you actually go away and you prepare some sample data and you create a temporary table and you populate that temporary table with values, someone will just copy that, paste it, and they can start working on your query. Someone’s going to be motivated to actually try and answer your query, so if you’re going to ask a question, make it obvious, and make it easy for somebody to do.
So, this person’s been really nice. They’ve made us a nice little statement, they’ve got drop tables around it, create table, and an insert. So, they’re inserting some value for us and data. And just going to create a teeny little index on there as well, which isn’t really going to do a great deal, but this is the data that they’ve generated for us. So, we’ve got these three records for employee number one, and these two down here for employee number two. And what they want to do is they want to say, Okay, there are say, 13 days between this, so what I want is I want for that particular employee, I want – they were absent on the 7th, 8th, 9th, 10th, and so forth. So that’s what we’re trying to get to.
So, as I said, when I started doing this, this was a long time ago, and when I first started my career, I started like probably some of you did, in the bad old days. Apologies if some of you are still in those bad old days, but I started off working as an access developer. So good old MS access. So how would I go about solving this problem in MS Access and I did try and solve this problem in MS Access, and my go to was, okay, I’m going to do a little bit of EBA, I’m going to write the lid on macro. That just does this process and it just loops over the data and it spits out a row for every single date.
Before everybody starts judging me and saying, “What’s this guy doing? He’s talking about row by row operations.” Just bear with me, just bear with me. We’ll get there. So that’s how I might have done it.
So, one of the ways that you can potentially do this is with a cursor. Bear with me. So, some of you guys, you may probably have written cursors yourself to do things. For those of you who haven’t, we’ll cover off the basics quickly of what a cursor is. So, a cursor is a procedural way of working with SQL Server. So instead of working on a set of your data, you’re working on a row by row. And what we have here is we have – we’re declaring our cursor, we’re inserting some data, so we’re saying this is what our cursor’s going to be, it’s going to have this data, I’m going to open that cursor, I’m going to fetch the first row, load it in to a couple of variables, and then I’m going to do that – I’m going to loop over each one of those. And then I’m going to de-allocate it. So, you have that process, row one comes in, row two comes in, row three comes in, and it all processes through. So, let’s have a look at how you could solve this problem using a cursor.
So, I have got my DROPCLEANBUFFERS, I’m declaring a couple of variables. This bottom one here is table variable, so I’ve got my declare result as a table and I’m specifying what that table looks like. Then I’m going to build out my cursor, and I’m going to say loop over each one of those rows, so these ones down here, loop over that row, then go to this one, then go to this one, then to this one, then to this one. In order to get the dates that are in between each one of these, I’m going to use a different type of cursor. I’m going to use a WHILE loop in there. So, I’m going to say while this date is not equal to that date, give me another row, and then just going to add one to this. I’m just saying add one to that, so it’s 8th, 9th, 10th, until they are equal.
So, what I’m doing is I’ve got this outer cursor, and I’ve got this inner WHILE loop and a WHILE loop, people will often argue on the internet it’s not a cursor, it is a cursor. It’s just a cursor with a different syntax. So, let’s run that. Okay, execute. And I mean, okay, so we have our results. So, for our first period of absence, we had from the 7th until the 20th. We now have the 7th, the 8th, the 9th, the 10th, the 11th, the 12th, 13th, et cetera. That’s what they wanted, and that works. And I mean, you know, looking at time wise down here, I mean you know, that didn’t take too long. So, you know, surely, I can use a cursor. A cursor works. So that sounds good, okay, I’ll settle here, I’m just going to use a cursor. End of the talk, we’re all done. Everyone use cursors for everything.
What you’ll quickly start to find is cursors are inherently evil, and should be avoided like the plague. So, when you’re used to that procedural way of writing something and you think, Okay, well, I can just use a cursor, I know how cursors work, I’m used to looping constructs, how can I do that same thing but not using a cursor? Using a set based way that is going to work with SQL and work with how SQL was designed, how the engine was designed? And so, one of those alternative methods is tally tables, and tally tables are one of those where if I’m doing this session on a conference, you know, I say, “Raise your hand if you’ve used tally tables before” and in a room of say maybe 30 people, you maybe get a handful of them. You get five or six. You know, I’ve used tally tables before, and the rest of them haven’t. So maybe that’s a similar thing. Maybe you guys haven’t used tally tables, maybe you have. And what a tally table is, is a numbers table. You might know it as a numbers table. It’s a table that you fill with a large sequence of numbers. And what I have in this example is a large sequence of numbers and then just some basic dates. And with that table, you can start using it to solve loads of really complicated tSQL problems. So, let’s have a look at how we could go about solving this, but using a tally table.
So, as I said, the cursor works. It’s doable. Bit of a mess, it’s known as a rebar operation, a row by agonizing row. That came from Jeff Modern on SQLServerCentral. So, what is that alternative method? Tally tables. I’m just going to build out a tally table. I’m going to create my table, say it’s just going to contain an integer and a date, so we’ve just got ints and dates. And going to stick a non-cluster on there as well. And then I’m going to use this funky little bit of tSQL to generate the numbers to load my tally table. And if tally tables are new to you, this bit of syntax is probably also new to you. So, this is known as the Ben-Gan style tally table generator. So, bit of tSQL written by a guy called Itzik Ben-Gan. If you start looking at you know, how to write tSQL queries in the best way possible, you will inevitably come across Itzik. His stuff, his Sessions, if you’re at SQL Bits at PASS and he’s talking, I thoroughly recommend you go and see him speak.
So, what this is doing, it’s a serious of common table expressions, so we’re starting off with his WITH statement. With level zero as select zero as G, and I’m going to union it to another select zero. So, if you run this part, all you’re going to get is you’re going to get two rows with zeros. So down here I’ve got two zeros, and what the rest of this statement is doing is it’s saying, okay, well take those two rows and then cross join them against themselves. So cross join is where you’ve got two rows over here, and then you’ve got two rows over here. What it’s going to do is there’s no way – there’s no join predicate on here. We don’t know how to join these two together, so what you’re going to end up with is that row and that row join, and that row and that row join. So, we end up with four rows. So that’s what we’re seeing here. So that one says, okay, cross join that one with itself, which gives us four rows. Then do it again, which gives us 16. Two, five, six, and you can see that all we’re doing is we’re squaring our number every time until we’re hitting here, where we’re ending up with four billion rows. And then once you’ve got those four billion rows, that’s a four billion of rows. That’s a meaningless table, but we’re going to wrap this window function across it. If this is the first time you’ve seen a window function, don’t worry, we’re going to come back to them and talk about them a little bit. But all this is going is it’s adding a row number, so it’s converting those zeros into a sequential list of numbers. One, two, three, four, five, six, seven, all the way up to 4.2 billion. And I don’t need 4.2 billion, I just need 15 thousand. I could have got rid of most of this statement, but I’m just going to load 15 thousand into my table.
While that is loading – it loaded pretty quick. There are alternative tally table generators. If after this session you think, be a year from now, I need a tally table, and you don’t want to persist it, you don’t want to create a table, you just want to use a statement like this so that it just generates a tally table on the fly, there’s a couple different methods you’ll see. The alternative one that you’ll commonly see is something like this. So, what this is saying is it’s a recursive common table expression. So, we’re saying select one as number, and then union that back onto itself where number plus one. So, what it’s doing is it’s saying okay, select one, starting at one, until 15 thousand, create me a loop. And people will say you know, this is a recursive CTE, it’s not a cursor; it is. As much as you want to claim it’s not, it’s doing a cursor type operation. It will be doing a row by row operation. You will quickly find, if you want to use this version compared to Itzik’s version, that while this will work, if I run this and I produce 15 thousand rows, it runs you know, very, very quickly. If I scaled up to those four billion, Itzik’s will run. This will crawl. I think I load tested it up to something like 100 million, and common table expression was about three thousand times slower. So, you want to be using the Ben-Gan method, and the beauty of this method is – where’s my query that shows this – the beauty of the Ben-Gan method is it is a zero I/O operation. So, if we kick this off with set statistics I/O one, what I’d be expecting to see down here is that data maybe being written to tempdb, then being read back, but that just doing exist. That data isn’t being – no pages are being read. Whereas if we do a similar sort of thing looking at the recursive CTE, what we’ve got is our workable table, and stuff is being written into there and being read from again.
So, the Itzik way is incredibly fast, zero I/O, spot on for what we want. So, using that to generate a tally table. So now we’ve got that tally table, now we need to think, okay, what can I do with it. By default, doesn’t really help us a great deal. Just having that table is cool, it can solve some problems, not many. You need to then think, okay, well how can I combine this construct of a tally table with some of my – some other functions, to actually do something interesting.
So, one of those is cross apply. So, I want to take a quick divert, explain the basics of what a cross apply statement is. Okay, so to do this, we are going to use our example of our dates again. Actually, do it on the right database, there we are. The amount of times you do this session, what’s going wrong, nothing’s working, nothing’s working, I’m against master; every time.
Okay, so similar thing. We’re trying to get from this date to this date. We’re trying to do it using a set based approach instead of a cursor. So, let’s say what I actually wanted to do initially was work out how many days there are between these two dates. So, I could say, okay, that’s fairly easy to do, I know I’ve got a date difference function. I can say, date diff day between those two dates, and you know, it will return me there’s 13 days between those two dates. But if someone comes to me and says, “Okay, what I actually want to know is how many of those days were weekends.” That’s a bit more complicated. And then they say, “Actually, what I want to know is how many of those days are company business days?” Well, that includes – it’s not a weekend, it’s not a company business day which happens at random times through the year, it’s not a bank holiday, and it’s not a bank holiday in a different country that our parent company is in. That’s pretty complicated to do.
So, one of the ways we can do that is with say something like a date dimension. So, if you’re familiar to working with data warehouses, the concept of a date dimension will be pretty common to you. It’s a table full of dates with a whole load of attributes that describe that date. That could include, is it a weekend? A binary flag, true or false. Yes, it’s a weekend; no, it’s not a weekend. Is it a bank holiday? Yes, no. Is it a business day? Yes, no. And based on that, we can then say, okay, well between these two dates, what’s the sum of all the business days, what’s the sum of all the weekends?
And we can do that by using what’s known as a correlated query. So, what we’re doing here is we’re taking our select statement, which is working out the sum of IsWeekend, so IsWeekend is a one or zero, and I’m saying, okay, between those two dates, if I was to write this as a query, give me – select the sum of these IsWeekend between these two dates. So, where this date is less than my first date, and this date is greater than my other date. You just heard me say five days of those are weekends. But what I’m doing here is I’m taking this date here, and I’m taking this date here, and then I’m going to push them in to my where statement over here and over here. So, what’s going to happen is when this outer query is evaluated, these two values are going to be pushed to this inner query. So, for every row in here, it’s going to evaluate this sum and pass me back the value.
So, if we look at that, we are getting holiday days between those two days that are four weekends. Between those two, there are no weekends, and et cetera. And so that’s an interesting way to tackle that problem, but still, it’s doing that for every single – it’s evaluating that for every single row.
If I then wanted to know, okay, well how many of those are business days? How many of those are non-working days? I could do something like sum of this and chain them together, and I’d quickly hit an error message, because this is saying you can only return one value in here. I can’t return multiple columns, I can’t return multiple rows, I just have this space for a value, and that’s not going to work. So, what I end up having to do is something like this, and just start duplicating data. So now every time there’s a row, I’m having to do three queries, and that, I don’t want to be doing.
So, an alternative method is to use a cross apply statement. So rather than having that statement up in my select line, I’m pushing that down into my where – into my from clause using a cross apply. So, I’m saying give me my normal columns from that date, and what I’m going to do is I’m going to cross apply this particular query and I’m specifying how this should operate on that outer query in the same way as what I was doing with my correlated query. So, take me that date and that date, push those into my cross applied query, and then return me the results. And so, we’re doing it this way, I can have multiple columns. I can be returning – as long as it’s just returning one row at this point, we can return multiples, and we’ll look at that. But at this point, as long as it’s returning one row, that’s what I want. So, if I run this, we then get that weekend count, and we also get a day count. So, we can see, okay, there were four days between here that were weekends, but there were a total of 14 days, which is inclusive upper and lower bound of those two dates.
So now that we understand a little bit about cross apply, we can start working with our tally table to do some interesting things. So, this is only returning us one row for every row that gets evaluated, but we can use a cross apply to return multiple rows and just duplicate our data. So, if we want to do that, I’ve got another example to sort of show that a little bit better. And that example is a case I had with a customer recently, who – their way of working was they sold insurance policies, and you could take out an insurance policy, say against a car, and in your first year, you would have a start date and an end date. Then if you renewed that insurance, rather than giving you a new row in that table, they would just change the end date and increment it by a year, and then increment it by a year, and a year, and a year, and a year. So, when they came on doing year on year reporting, it become really difficult to say, quickly see, we had all of these policies in this year, all of these policies in this year, that was tricky. So, had to think again about how can I work out how many years there are between these two dates, and then separate that across multiple rows so that when that gets pushed into their reporting system, into their data warehouse, into their semantic layer, however it’s going, it ends up as multiple rows.
So, with all of these sort of things, what we’re looking to do is we’re looking to take something we – maybe we already know quite well, like a date diff, and combine these little functions together so that we can produce something that works quite succinctly to solve our query. So, what I’m doing here is I’m saying, okay, work out how many years there are between each one of these two dates. So, for every year, I want to separate a new row. Okay, so I’ve worked out there are five years, 15 years, and four years. So again, for this one, I want that across five, that one on 15, that one on four.
So how can I go about doing this? And the answer is our tally table. So, if I look up my tally table, where that value is less than my tally integer – now, remember, that goes one, two, three, four, it’s a sequential number all the way up to 15 thousand – if I say where that number is less than or equal to the tally count, it’s going to return me five numbers, one, two, three, four five. Next time I do it, it’s going to return me those 15. One, two, three, four, five, up to 15. And then with four. And in this one, the difference to what I was doing before, I was doing an aggregate, which only returned me one row. Now I’m not doing that aggregate, so I can – I end up with a whole load of duplication. So, if we execute this, what we’re saying here is – so this is returning me the number five, and the number four, 14, or 15, and then number four, and then taking that number, I’m going to pass that into my cross apply where the tally int number is less than or equal to that function. And what that’s done is it’s exploded my data across all these multiple rows.
So, if we look at the first example, where we had five years, we’ve got our data here duplicated five times, and then I’ve just appended my tally integer number on the end. So, what I can do now with this tally integer is start translating these number values into something that’s a bit more meaningful. So, this policy started on this date and it ran for a year, it didn’t run for five years, then the next one should be well, that date plus a day, and then running for another year, et cetera. So, you can start building this query up a bit more. If there are any questions, please interrupt me.
Brent Ozar: Yes, you’re good. Couple folks were like, “I’m not sure I would want to tally table” and then when you start running it they’re like, “I see why I want to tally table.”
Terry McCann: Okay, good. So, I hope at the end of this talk, if you take one thing away, it’s maybe I need a tally table, and you can add one. So, what we can do is we can do that logic now. So, we’ve got this tally integer, this one, two, three, four, five. I can then use that to subtract years, to add dates, to work out this start and year-end periods, and now what I’ve got is rather than that one value, I’ve got my year indicator, it’s the fifth year of that policy, which started in 2011, ran to the end of 2011, et cetera. You can see how that works.
So that’s one way of doing it. So now if we apply that back to our original problem, crawl your mind back to our, we’ve got our dates and we’re trying to work out individual days, we can then just use our tally table in the same sort of way to generate that. So rather than doing that cursor now, we can just use our tally table to generate those additional rows.
Okay, same thing. So, we have 52 rows in our cursor, we have 52 now. Same query, this time it’s doing it in a set based approach as opposed to doing it in a row-by-row approach. So, to prove this version is better than the other version, I mean, you know, they both run in sub seconds, they are both very, very fast; at this point, what I’m going to do is I’m going to try and just inflate my data a little bit. So, I’m going to insert a couple more rows into the table. And I’m going to do that using this little bit of handy SQL. GO100. If you stick a number at the end of your batch separator, your batch separator is still GO, you haven’t been evil and changed it to something else. If you put GO100, it’s going to execute everything in that batch 100 times. So, you can see, 34 rows been inserted, that would be inserted 100 times.
So, we’ve got a bit more data in there, I’m now just going to select from that table again and this time I’m going to insert more and more. So, every time I’m going to duplicate my data in there. So, we end up with somewhere just over 100,000 rows. I’ve got – what’s that? 108,000 rows in that table now.
So, we do our quick comparison, this is our cursor. Again, imagine it’s doing a loop, so it’s going to do a loop for every single row in that table, 108,000 loops. It’s going to do a loop for every single date in between those values. So that’s an awful lot of looping.
So, we’ll run that, and what we’ll do is we’ll wait. And we’ll wait, and we’ll wait. And so, I’m not going to waste all of your time, I’m just going to terminate this query, and you know, we’re just going to have to wait an awful long time for it to go through and complete that looping operation. It’s looping through all those rows, row by agonizing row. Whereas if we do the same thing using our set based approach, that’s going to run. It’s still doing quite a substantial operation, so it’s still going to take a little bit of time, but what you’ll see is that that completed in seven seconds, as opposed to the query that will take an awful long time to run.
When I did this session before, someone asked me you know, “Can you show me the execution plan? Can you show me the plan of your tally table, and can you show me the plan of your cursor?” And like well, not really, because my cursor is going to produce me 108,000 plans. My tally table is going to produce me one plan, so no I can’t, because we’ll be waiting even longer whilst Management Studio tries to build up all those plans. So, cursors, you know, are evil, avoid them like the plague. Tally tables are quite nice, and if you take anything from this session, it’s add a tally table.
Okay, so the next query – now, this one is a bit of a big, long-winded one, but this is one of those queries that somebody asked, and I was like, “This is an awesome question.” So, what the guy was trying to do was he had a problem where he had – he was trying to equally distribute customer records to employees at the same branch. So, he had different stores with a maximum assignment per employee. So, what he had was a load of customers, two branches, and at each of those branches was a couple of employees. And imagine the new the new leads, you need to distribute those to your employees so that they can work those leads. And he needed to evenly distribute those leads. He didn’t want to add any bias, he just wanted to say, okay, just distribute them equally.
And I don’t think that’s a very easy way to explain that, so I tried to do a bit of a diagram to better demonstrate what I mean. So, we’ve got those 11 customers. And he wants to take so many of them to branch one, and so many of them to branch two. And then based on the amount of employees at either branch one or branch two, evenly distribute that data between them. So, three go to employee one, which is that green block, three to employee two at branch two, two, two and one. We don’t care who is the unfortunate one who only gets one at this point. It’s just sort of proof of concept.
So how would you go about doing that? I sort of came across this problem, like, how would you go about solving this? And again, like a really good question asker on the internet, what he did was build out a load of temporary tables, build out all of the inputs, and insert so what we ended up with was this. So, we’ve got our 11 customers that are already associated to a branch, so those six are in that branch, those five are in that branch, this one below, and our employees are associated to a branch as well. So, you’ve got these two guys here, and these three guys here. So, what you’re sort of saying is I want to take that one and associate it to you, and then that one to you and so forth indiscriminately and in an automated way. So as these two values increase, so that table ends up being 100,000 customers that need to be associated to the employees, how to go about doing that in a non-discriminate way.
So how could you solve it? Of course, with a cursor. No, not with a cursor. You could do with a cursor but we already know cursors are evil and they should be destroyed and if you use a cursor you should be fired from your job. I’ll caveat that because cursors are really good for basic DBA operations. If you’ve got a load of databases and you need to loop over all those databases, all those tables, and get all your rows out, or something like that, or do backup all your databases, cursors used to be a really good way to do that. Now, you should probably just do that with PowerShell and not with cursors. But you could use the cursor if you don’t know PowerShell. So, there are uses for it, just not for querying. If you find yourself writing a cursor that’s going to end up with somebody having to run that on demand, just take a step back and think, maybe I’m not doing the right thing.
Okay, so how could we solve this not using a cursor? So, my first go-to was okay, what I need to do is I need to start associating numbers to each of these to that I can start joining them together. And you could do that using window functions. So, before this session, we were talking about – or you guys were talking about things that Oracle did really well, and that SQL Server, it was always playing catch up on. One of those is window functions. I remember – again, I’ve working with lots of database platforms, way back, before I really started working with SQL. I was doing a lot of Oracle dev, and you know, I’d be using window functions all the time. And then I came over to start working with SQL Server, and I just started writing my window functions, it’d be like, nope. These don’t exist until what, 2012, when they really started kicking in? So, prior to that, maybe 2008 R2, I can’t remember off the top of my head, before this stuff started coming in, you just couldn’t do that stuff. So, you know, SQL Server is a little bit on the back foot. But yes, we now have window functions, so if you’re using something before 2012, there’s a good chance that some of this stuff won’t work for you.
So, I’ve got an example dataset, so what it is, is if I run this, again, our topical discussion seem to just be you know, flowing really nicely into my session, like we prepared it. But we didn’t. So, members of Game of Thrones have been taking SQL Server exams, stuff just seems to be marrying up so well. So, we’ve got our poor Eddard, Jon and poor Rob. They’ve been taking exams, and this is the history of their last exams. And what we have is a couple of different examples of window functions. So, if we have a quick look at how you actually go about writing a window function and what it’s actually doing, we have got – so the syntax is this. You’ve got a function – probably better to use a different color than blue because everything else is blue. We’ve got a function here and we’re saying I want you to apply this function over a window of my data. And I’m specifying that window by passing in this partition and order by logic. So, I’m saying partition by score date, so every time there’s a different score, ordered by that score, so every time there’s a different score – date, sorry – that’s going to be a new window in my data, and then I’m going to apply a row number to it. So, I’m going to say that one, two, three, four, five, et cetera.
Then we’ve also got a couple of different functions. We’ve got rank and we’ve got dense rank. I can never remember which one does which off the top of my head. Basically, one will give you ties, and one will give you gaps. So if you’ve got somebody who scores 500, the next person scores 500, the next one scores 800, 800 should be number one, but with those next two, they’re the same. Should they be number two and number three? Should they both be number two? Should they both be number one, then number two, number two, and the next person be number three? Or should that next person be number four? So rank and dense rank will do one of those. I’ve got the actual explanation in a slide.
So, there we go, we’ve got that syntax again. We’re saying I am partitioning it by my employee in this example, which creates me these two partitions. That blue partition, that red partition. Then I’m ordering it by something in my data, which is my sales date, which then gets me to apply my row number. So, I’ve got window one and window two. Yes, couple different functions, we’ve already explained those. So yes, let’s have a look at them.
So, what that’s actually doing is our row number here is ordered by – partitioned by score date and then ordered my score. So, on our – we’re ending up with what was the best score that they got on any particular date that they took is. If they all took them at the same time, who got the best value? So, on that date there, that date there, and that date there, it’s ordered least, so who got the worst score. So that was the worst one, and then second and then third. Because it’s using a row number, you’re getting one, two and three. But they’re both the same value. 900 and 900. If we did a rank, what we end up with is number one, then number two and number two, so we have ties.
When we do that with a dense rank, we get the exact same thing. Number one, number two and number two. However, one will lead you to ending up with number three, and one will end up with number four. So, if you have the example here, which is score date, so if we’re taking the 22nd, 22nd, and the 22nd there, so number one is that score there, which is the same as that score there, and then we have this one. So, we’ve got number one, number one, and number three. So, we don’t have a number two in that scenario, whereas in this one, we have a number one, a number one, and a number two.
I would implore you to download this stuff and have a play with it if you’ve not worked with windows functions before. Then the other functions that are really handy are lag and lead. And what lag and lead will allow you to do is say within my current window, go outside of it, or go the other way and actually take me my next value. So, where we have here, we take Eddard as an example. The first time he scored, he scored 850, and the next time he scored 699. So, when I’m on this row, I can say give me the score that he next got, which is 699. Then when I’m on the next row in that window, I can say, well give me the score that he previously got and the score that he got next.
So, I could start working out, okay, well you know, what’s the difference between that score and what he got before, and then what he got now, is he improving, are we seeing some kind of trend, we can do some interesting stuff with lag and lead. How we would go and do this before we had lag and lead was really painful. You’d have to give everything an ID, you’d then have to do self-joins and say okay, well it’s this one plus this number, and it was very messy. Lag and lead does it very well.
Couple of more examples, just because window functions are really interesting. What I’ve got here is some example sales data, so employees, similar to something like AdventureWorks, who are making some sales. On that particular day, made £100, $100, $20, et cetera. And what I want to do is I want to know what is the first date that somebody made some money, what was the first day they made a sale?
Now, I could have done that with something like just give me the min of the sales date, grouped by the employee, and that would work. But this way, what I have now, by doing a row number, by doing this guy, I’m getting this sequential list down here, which is giving me the order. So, I know that was the first time this guy made a sale. That was the first time this guy made a sale.
Now, if I wanted to, what I want is I just want to return this row and this row. And this is where you might be thinking, “Okay, you could do that, but that’s probably not the best way to do it” and I would agree. You could filter it down if you wanted to, and intuitively, you might want to write something like this. So, with my window function in my select line and then push my window function into my where clause, and say okay, cool, that works up there, so I’m just going to change this to say where that’s equal to one. That’s going to be the first one. And I mean, that will work. Won’t it? No, it won’t because window functions can only appear in the select or the order by clause. That pretty much comes down to the way that you write a tSQL statement is different to the way that it gets operated on.
So, you have this thing in the background called the logical query processing model, so you may write select these columns from this table, you know, where it’s this, grouped by this, having this, and that’s the way you may write it, and then order by. But what it’s actually going to do is it’s going to go away to your tables first, get your data, it’s then going to join that stuff together, it’s going to do all this stuff to it, and then your select line is the last thing that gets – or the second to last that actually gets operated on. If I remember, I was on slide 20 and then scroll down, we can just quickly see that.
So, you might write that select from where grouped by having order by, and that actually gets operated as from where GroupBy having, select, then order by. So, we’re taking our data, we’re essentially mashing it all together into this big table. We’re then going to limit our rows down, we’re then going to group our data, and then we’re going to limit those groups. So, by using our having, then we do our select, so we say we want these columns, and then we order by. So that’s why if you alias something in a select statement, say customer ID is cust ID, and then you try to say where cust ID is equal to one, it’s like, I have no idea what you’re on about. I don’t know where this alias is, and that’s because that’s happened up here before it’s got down here.
But I can say order by cust ID, and it’s like yes cool, I know what that is because it’s been operated afterwards. And the same sort of thing happens with window functions. Cool, okay, so that won’t work. One way it could work is we could use a CTE, we’ve seen our CTE, so we’ve got this statement, with pushing in a sub-query and then using the result of that sub-query to pass into another query. So, I could say, give me this query, then with the result, filter that down to one. That’s one way to do it. There are many ways to get to that exact same result, that’s probably not the best, most efficient way to do it. But just to demonstrate the point.
Okay, one last thing that you can do. So, you can use window functions with different types of aggregates. You don’t have to do them with row number, with rank. You can do them with things like sum, average and count, and get some really interesting results. So, looking at our sales data again, for this employee – so we’re partitioning by the employee, so we’ve got these two different partitions. I could say, okay, well what’s the total that that employee made. So, all of those numbers added up together on every single line. Now, you definitely don’t want to do this where you’ve got naïve consumers of your data because what they’re going to do is they’re going to look at that and go, total per employee and just sum up all of that and then they’ve got some hideous number that’s massively inflated.
But what you could start to do with it is start applying things like, okay, well if that sale was 100 and my total was 2135, what’s that as a percentage of that value? What’s this as a percentage? And start seeing for sales made, what’s the total – what is that sale as a percentage of the total sales made in that period? How much is that sale more than their average? And you can start doing some really interesting things that would be quite tricky otherwise.
I think I’ve got an example of that. So, if we run that same thing again, what you can see down here is £100 or $100 was 4.68% of the total, whereas this line here was 92% of their total. That’s the one that we want to be praising that person about. So, by using – what we’ve got up here is a sum with a window function, and we can then start doing some really interesting stuff with it.
Okay, again, so that’s window functions. Window functions are awesome. If you take away two things from the session, it’s tally tables, it’s window functions; they are great for solving many, many problems. The next little piece that we need in here is the modular function. Again, something that you may use in a programming sense, but probably might not think, well why would I want to do that in SQL. And what the modular function does is say okay, give me a number. How many times does a number fit into that other number, give me the remainder. So, if we say ten mod one, and to do a mod in SQL Server, we use the percentage sign, what that’s saying is how many times does one fit into ten, and just give me the remainder. So, one fits into ten evenly, there is no remainder; we’ve got a zero. If we do the same thing with two, two goes in five times with no remainder, three goes in three times with one remainder.
Now, if we flip that logic and we say one mod seven, what we end up with is a way of separating our data into batches of that number. So, if we look at this example here, one, two, three, to eight, mod seven, what I actually end up with is one mod seven is one, two mod seven is two, three – seven mod seven is zero, eight mod seven is one. So now I’ve got this like sequence number being generated, being able to evenly distribute my data by some number.
So, you think back to my example, I’ve got two employees and I’ve got three employees. What I want to do is for each branch, is I want to evenly distribute by two and I want to evenly distribute by three. So, by using that mod, I can now take values and evenly distribute them across multiple rows. Okay, so the solution, this one’s a bit of a beast. What are we doing here? We are taking those 11 branches, this is our thing again. We’ve got branch one. We’ve got branch two, then I am working out okay, so this is the customer sequence, I want to be evenly distributing these guys. That should say number of employees, not branches. Number of employees, I’ve got two and I’ve got three. And so now I can make this pseudo surrogate employee index. So, employee one gets that one, employee one gets that and that one, and then two gets that, two gets that, and the same down here.
So, I can then apply a similar bit of logic to say okay, well, generate me random employees in the similar fashion. So now based on that branch, I’ve got branches, employee one, employee two, branch two, employee one, two, three. And so, what I need to do is join on your employee index, and your employee branch, and associate that value.
So again, this feels like a bit you know, I would say download the code and have a look and run this. So, what we’re doing is we’re doing some mods, we’re selecting, we’ve got here, we’ve got a correlated query, which is modding over a row number to work out how to evenly distribute our employees. We’ve then got additional row numbers calculating what’s our employee index using CTEs and just mashing it all together, which ends up with those employees being – those customers being evenly distributed to those employees and those ones being distributed to those employees.
Brent Ozar: Nice.
Terry McCann: So that can be done in multiple different ways, and on Ask SQLServerCentral, it’s funny that you mention Guy Luca is – there’s a couple of people that always used to anyway, jump on the tSQL questions, jump on trying to answer tricky stuff like this, Guy Luca, Kev Riley and so they all descend on this thing and come up with very similar answers. When you see a problem like this, it’s like, okay, I know I need window functions, I need to some way that I can start doing this, and you know, these things, these techniques just pop up time and time again. So, you could do it in a different way, you could do it with a cursor. It’d be much easier with a cursor, but it’s definitely not going to scale.
Okay, back to the slides. If you want to read more about tally tables, I thoroughly recommend having a look at Itzik’s book, so if you remember not tally tables, window functions, so if you remember that section that generated a tally table, that was written by Itzik, this book is written by Itzik. There’s a theme that Itzik is very good at this stuff.
Brent Ozar: That’s impressive.
Terry McCann: That’s a very good book. So, there are additional questions in this deck, but what I don’t want to do is go through another one. What I prefer you guys to probably do is download the code and step through this example. And what I probably prefer to do is to show you another example of why I personally think tSQL is amazing, and that came about from a question that related to tracking Friday the 13th.
So Tomaz – I apologize if I’m butchering your surname Tomaz, Tomaz Kastrun. He is an MVP from Slovenia, and he posted this question on his blog about a year ago now, which was – it was on Friday the 13th, and he wanted to find the shortest way in tSQL to generate a list of Friday the 13ths. The end goal being that you should be able to tweet it. So, it should be less than 250 characters. So, I’ve like – this is great. You know, this is the sort of stuff – again, this is when I was working in a job where I didn’t really have a great deal of work to do. I could answer questions on the internet and do silly things like this. So, I’m like great, I am going to try and at least beat Tomaz at this.
So, Friday the 13th, okay, so this was Tomaz’s example. So, he’s doing a WHILE loop, so I would argue that you know, that’s probably not the best route to do this anyway because it is doing that WHILE loop, that rebar operation row by agonizing row, but it’s a way to do it. And he came out at 276 characters. So too big to fit in a tweet. But how would I go about solving this? How would I go about generating Friday the 13ths in a set based way?
The answer is with a tally table. Good old tally tables. But this needs to be tweeted, so I can’t guarantee that everybody is going to have a tally table with the exact same tally table syntax that I do, not do I want to be pushing over that huge lot of characters that generates a tally table. So, my first example was to think, okay, what’s a big system table that I can use to generate myself a lot of random numbers? A lot of numbers. So, the DMV, sys.dm.os_performance_counters. If I cross apply that, I end up with quite a large subset of data. Then I can apply a window function over the top to get me that sequence of values like a tally table. And then I can say, okay, well give me the Friday the 13th where the day is the 13th and the day in the week is five, which is a Friday. So, where it’s Friday the 13th which was one way to do it. And in a set based approach, it worked, but it was 329 characters. It was even bigger than Tomaz’s. I mean, that’s not going to get in my tweet.
So, I was like, right, I can get this down, and I can get this smaller. So, I was like, well, I need some other way to generate some data, so again, I’m going to use that same sys performance counters, but this time I’m going to use something a bit different. So, somebody bought me a mug for like a secret Santa, and that mug had facts about the world. And one of those facts was, if a month starts with a Sunday, you will always have a Friday the 13th. So, I was like okay. So, if I work out that the Friday the 13th will always start with a Sunday, I can just say, is the first day of that month equal to a Sunday, and instead of doing a day, I’m adding months.
Brent Ozar: Smart.
Terry McCann: Which again, worked. And I was like, 254. Okay, I’m getting there, that’s better. It fits, but you know, I can get this better. I can get this down a bit more. So, to get that down a bit more, I used sys.syscolumns, because rather than doing days now, I’m doing months. So, my numbers can be a lot fewer. So, what I’m doing here is a – just a statement with an inner statement, so a sub-query, using that same logic where the Sunday is equal to the first day of the month, generate me a whole load of Friday the 13ths.
So just to prove it, the 13th of October, which is next month, it’s going to be unlucky for some, it is Friday the 13th. That logic stands for everything. So, I was like great, you know, that’s it, I’ve done it. 126 characters. That’s definitely going to be the winner, right? And then I was like, great. So, I posted that thinking, yes, definitely the winner. And then a lot of you will probably have come across Mladen Prajdic, also Slovenian MVP who in the great way of Mladen, just posts this in his great accent, read it and weep.
So, what he did is a cursor declaring a variable with this, and that generates Friday the 13ths. There is no spacing in that, but it works. I have made it slightly – made it with spacing so that you can see how it works, but that was 84 characters and does the exact same thing as what Tomaz’s original query does of generating Friday the 13ths.
Brent Ozar: Damn, that’s good.
Terry McCann: Yes, in 84 characters, so an awful lot better than mine. So, what I would say is that tSQL is an amazingly versatile language. There’s a reason why at least I think that it was the language of the year in 2013, and we’ll come on to why it actually was. But with a declarative language, because you have so much freedom and it’s going off to the engine to calculate it, you know, just be aware that there is always more than one way to write a query, but just remember that some of those are better than others.
So that’s it for me. There is more in that deck that you guys can go through in your free time if you want to. If you do want to ask me any questions; if you have any tricky tSQL problems, I do still like answering those sorts of questions, so you can get me at SQL Shark. I work for a consultancy in the UK called Adatis, that’s my Adatis email address, our website, and our Twitter.
I have a session at PASS, not on anything to do with tSQL, which is on the session Brent, that you had to remove this morning because – which is the graph processing in 2017. So, if you want to know about graph processing in 2017, please come along to my session at PASS.
Brent Ozar: Dang, you were winning in the voting too early on, but…
Terry McCann: Next time.
Brent Ozar: Exactly, it just means you’re going to get in on the next one so, very nice.
Terry McCann: Last thing. Why was the language the language of the year? So, this year, Transact-SQL is 30th. And the only real reason why it ever ended up being number one is because they changed the model in which they calculated how they scored it, and they added in a couple of key words. So, it’s all based on web search, and by adding in a couple of key words as well as programming language, tSQL got asked the most questions pretty much on the internet, with people having problems, which shot it to number one, and then after that they changed it and it went back down to normal. So – but regardless, 2013 tSQL was language of the year, and it is an incredibly, versatile language, very good. So…
Brent Ozar: Wow. Well nice job, very nice job sir. Let’s see here. We’ve already got a couple of questions. Let me pop this round and see right. Faig says, he would love to see a challenge where you generate a calendar matrix in a query so that where you would – just like you see in the Windows pop up for you know, the date calendars, he wants to see that in rows and columns. That has to have been done.
Terry McCann: Yes, definitely. How you would do it off the top of my head, no idea, but you know, it would definitely be a combination of those things. Or you just do it in the great dirty cursor.
Brent Ozar: And when you did the mod thing, I thought for sure you were going to go down the road of building dynamic SQL. I thought for sure you were going to build dynamic SQL to go crank out the mods and I was like, you know, that’s how I would do it, and then when you did it all, that’s amazing.
Erik Darling: I want to say that there were a couple articles on SQLServerCentral that do the exact same thing with the generating, like an actual like, viewable calendar where it’s like the days of the week. Pretty crazy.
Terry McCann: Yes, so yes, I would always try to avoid dynamic SQL purely for the hairy pitfalls. I’m sure you know, when you sanitize your inputs and stuff like that it always works well, and you know, everybody always sanitizes their inputs with dynamic SQL, so no one’s ever been fallen fail of injection attacks and you know, those sorts of things.
Brent Ozar: Now, I mean, even in 2017, we run into it all the time. It’s amazing how many people haven’t heard their bobby table story.
Erik Darling: Well, what amazes me is the number of – the sheer number of people who are just like, I’ll just use exec.
Terry McCann: Hey it works, woohoo. Add password equal to.
Brent Ozar: J.D. Walker says, “Yes, everyone sanitize their input and also the check is in the mail.” Exactly. Well, thank you, sir.
Latest posts by Terry McCann (see all)
- Enhancing relational models with graph processing In SQL Server 2017. - July 17, 2017
- Why TSQL was the language of the year - April 19, 2017