Enjoy the Podcast?
Database developers, DBAs, and .Net developers who care about database performance
User-defined functions in SQL Server are very much like custom methods and properties in .Net languages. At first sight, they seem to be the perfect tool to introduce code encapsulation and reuse in T-SQL. So why is this feature mostly avoided by all T-SQL gurus?
The reason is performance. In this session, you will learn how user-defined functions feed the optimizer with misleading and insufficient information, how the optimizer fails to use even what little information it has, and how this can lead to shocking query performance.
However, you will also see that there is a way to avoid the problems. One type of user-defined function, the inline table-valued type, may not be the easiest to use, but it is the only type of user-defined function that doesn’t impact performance. I will show how, with just a little extra effort, you can reap the benefits of code encapsulation and reuse without sacrificing performance.
Why I Want to Present This Session:
It is very tempting to use T-SQL user-defined function. And so easy to shoot yourself in the foot while using them. Every time I see examples of this, I wish I could have prevented that. And that is exactly what this session is about: help people avoid the mistake of “bad” usage of T-SQL user-defined functions.
It also helps that presenting this session is always a lot of fun for me, and hopefully for the attendees.
Slide deck and demo code: http://tinyurl.com/HugoUDF
Brent Ozar: In this session at GroupBy, Hugo Kornelis will be talking about something that I used a lot as a developer, functions, and why they’re not such a good answer. Take it away, Hugo.
Hugo Kornelis: Thank you, Brent. For this session, we will start with an experiment. I will make you run a quiz, I’m going to make everyone run a quiz. We’ll see how far the Kahoot plus one can scale. Please go to Kahoot.it on your devices everyone.
What we’re going to do, is we’re going to run through a few multiple-choice questions. When the questions appear, you will have four answer options on your device. Just quickly touch the one that is correct, or that you think is correct. The faster you answer, the more points you will get if your answer is correct. If you answer is not correct, then you will get zero points.
Obviously, the goal is to win. If you win, I’m going to kick you out, because you already know everything about user-defined functions. If you don’t win, you stay in. Since the video goes up on YouTube later, you will definitely win eternal glory if you get it right.
Brent Ozar: Wow, this is really cool.
Hugo Kornelis: I thought so. Are you participating, Brent?
Brent Ozar: Yeah, yeah. I am in there.
Hugo Kornelis: Oh, I see, Aaron as well. I’ll give it some more time for everyone to join. Some people are still joining in, but I want to go ahead, so I’ll start the quiz. Good luck, everyone. Here’s the first question. “What is a benefit of using user-defined functions?” Is it performance? Code reuse? Security? Or they have no benefits at all? There is a 20-second maximum answer time. If everyone answers, then we will skip the rest of the waiting time. The benefit is code reuse. I’m glad to see that the majority got that right. Oh, wow, Brent. You’re doing well. You’re on the top of the leaderboard.
Brent Ozar: I’ll give you people the trick. Watch the webcast, and you see the question. Then look down at your phone. Don’t look down at your phone.
Hugo Kornelis: I’m not sure which one is faster, depends of a lot of factors. There can be delays in the webcast as well. Let’s go to the next question. Which of these is not supported in SQL Server? Watch the word “not.” Multi-statements table valued user-defined function, inline table valued user-defined function, scalar user-defined function, or inline scalar user-defined function? The actual trick is not to listen to me talking, but read, because most people read faster than I speak.
Brent Ozar: Somebody’s going to beat me on this one. I wasn’t fast enough.
Hugo Kornelis: Inline scalar user-defined function is the only type of the four listed that is not supported. Wow. A lot of more diverse answers here. Brent, you’re still on top of things. You’re good. You’re good. I’m going to have to kick you out of the session after this.
Let’s see how you do on the third question. “What must you do in order to pass a column value into a scalar user-defined function? Do you need to use APPLY, a CTE, a derived table, or you can you just pass in the column name? The answers are coming in fast.
Brent Ozar: I’m not sure I got this one right. I went fast, but I’m not sure I got it right.
Hugo Kornelis: Let’s see.
Brent Ozar: It’s probably a good thing it doesn’t let me answer again.
Hugo Kornelis: Did you get it right, Brent?
Brent Ozar: Yes, I did. I’m in first place now.
Hugo Kornelis: Yes. You are. You are far ahead. Over 100 points ahead of Rich.
Brent Ozar: Dang, Andy Mallon is in there.
Hugo Kornelis: I have no idea who these people are, but they are also good in numbers two to five.
Brent Ozar: 17 players have three in a row right. That’s cool.
Hugo Kornelis: Yeah, that’s a lot. They get bonus points for an answer streak.
Brent Ozar: Oh, nice.
Hugo Kornelis: So, fourth question. It’s almost the same, it’s certainly different. “What must you do in order to pass a column value into a table-valued user-defined function?” The answer options are the same as before: APPLY, CTE, derived table, or just pass in the column name? I see a thinking face on Brent.
Brent Ozar: Yeah, I got the wrong one. I’m pretty sure I got the wrong one.
Hugo Kornelis: Good. We’ll see another leader, then. Actually, there are three correct answers here. APPLY, CTE, and derived table are all correct, but you cannot just pass in the column name.
Brent Ozar: Thank god.
Hugo Kornelis: And Brent, happy again, and still on top of things. Wow.
Brent Ozar: Unbelievable.
Hugo Kornelis: You are totally running with this, Brent.
Brent Ozar: And they thought I was just a pretty face.
Hugo Kornelis: Okay, [inaudible 00:07:10], but I think you have, too, Brent.
Brent Ozar: Yes, yes.
Hugo Kornelis: So we’re going to the next. The next question will be the all-deciding last question, so take care. “If you use a user-defined…” oh wait. [Question on screen reads, “Is Brent Ozar the most awesome person on earth?”]
Brent Ozar: I know the answer to this one.
Hugo Kornelis: Why did you change my presentation? You shouldn’t. Let’s just run this now. Let’s answer. I’m not happy with this, Brent.
Brent Ozar: This is life. I’m going for “Hugo Kornelis is better looking.”
Hugo Kornelis: Thank you, Brent. Thank you. No, the correct answer is, “Who cares, just start the presentation.” We’re not here about people. It’s wrong, Brent. And 20 people answered, “Brent… who?” Okay, guys, he’s got your IP. If you’re suddenly disconnected, it’s not my fault. It’s his fault.
Brent Ozar: Good answer.
Hugo Kornelis: He knows. He has the power.
Brent Ozar: And I still won.
Hugo Kornelis: You still won? Even with an incorrect answer. Well done, Brent. Well done. Very good. And this was not even faking anything. This was totally fair. By the way, I accused Brent of changing my questions. He didn’t. Trust me. I password protected it. The password is “Brent sucks.” He would never have guessed it. It must have been Jeremiah.
Anyway, so far the fun stuff. Let’s move on. Presentation. This is me. There’s always an ego slide in every presentation. You don’t care about any of this except my email and my Twitter. If you have any questions after the session, pose them to me. I’ll try to respond. If you have questions during the session, there’s a chat window somewhere on your screen. Type them. If Brent sees any good questions, he’ll just interrupt me and ask them. If Brent sees any questions that we can use to have a good laugh at your expense, then he will also do that. He will change the name to accuse someone else. If there’s any questions unanswered at the end, like I said, you have my email and I’ll try to respond. I also try to be more active on Twitter nowadays.
So, what are we going to do? We already started with a quiz. We are going to talk about user-defined functions. So let’s first see what is a user-defined function. After the theoretical part, it’s time to jump in for the demo. And because I love demos, I will follow the demo with even more demos. And that brings it to the end of the session, so we’ll have a short conclusion, and that’s it.
What is a user-defined function? This is where I’m going to address you, Brent. Can you give me a good definition of a user-defined function?
Brent Ozar: Like when someone wants to run T-SQL repeatedly, but they don’t want to copy/paste it into every piece of their code. They just want to define it once and then slide it into any query just to save themselves time. And as a programmer, I was always taught to not retype my code. Just slide a reference into it.
Hugo Kornelis: I tried to count how many words you used, but I’ve—I’m losing count. There is a much shorter definition.
Brent Ozar: Crappy?
Hugo Kornelis: This is user-defined function. It’s a black box. The benefit of a black box is, as you already said, code reuse. If you have complex code, and you have to use it in multiple places, write it once. Test the hell out of it. Make sure it’s absolutely correct. Then, close it up. Never touch it again. And reuse it everywhere. So you don’t have to reinvent it. You don’t have to retest it. It’s simply there. It works. If ever the logic changes, you have the logic in one place. You go to that place. You change it. You once more code the hell out of it. Then everywhere where you use it will magically start using the new logic.
So, from a development perspective, user-defined functions are absolutely great. But that same black box phenomenon tends to be hard on SQL Server. That’s what we’ll see in the demos. Every good demo has a scenario. The scenario for this demo is that you are a database developer for AdventureWorks Cycling Company. Have you ever seen this code before, Brent?
Brent Ozar: Oh, I’m a big believer in their stuff. I have a lot of their products.
Hugo Kornelis: So, you have not seen this code before?
Brent Ozar: No.
Hugo Kornelis: You have never done any certification exams. Certification exams are full with these kinds of words, and I like to put this code in my slides so that people taking certification exams get a good laugh, and that’s the only benefit they will ever get from the certification. At least they have that.
Brent Ozar: That’s sad but true.
Hugo Kornelis: In reality, you are probably working for another company. You probably have another job title, who cares? You have a job. That’s what counts. You have a job. You have a job, you’re sitting in your cubicle, you’re doing your work, and then your boss steps up to you. He says, “Hey, it’s that time of year again.” I’m going to switch to first person because I like to tell it in first person. So, boss steps up to me. Says, “Hey Hugo, it’s that time of year again. The higher-ups are talking about the wages for next year. What are we going to do? Boardroom is discussing stuff and here are a few key points we are considering, but it’s hush-hush now. Don’t tell anyone. It’s completely NDA, but we need you for this. We need to share this with you.
“We want to drive our sales, and we feel that the sales represents difficult work. [Inaudible 00:13:09] if we give them more incentive. So, cut their base salary by 25% and increase their commission percentage by the same 25%. We expect that will drive sales up by at least 50% for next year. For the salespeople, those in unions, the unions are strong. They have already fought some battles and we don’t want to get into a strike situation, so let’s just give them the 3% they are demanding. Then the rest of the staff who are not organized in the union, well, screw them. They can be happy that they just keep their salary.
“Based on these numbers, and assuming an average of 1800 hours worked per year, we need you to run a query over the database to see how much this will cost because it’s all nice and fine to write up these numbers, but we need to present the plan to the shareholders and make sure we don’t go bankrupt with this plan. Can you do this?” Sure, I can do that. Remember, I have a job. So I can do that.
Let’s jump into Management Studio and the tab you see now is the tab I used to create the database. I’m starting it. It will take some time. What this tab does is it creates an empty database called UDF demo. Then it creates several tables that are basically copies of tables in the AdventureWorks sample database. Almost exact copies. Then I run a loop with 50 iterations where I copy data from the AdventureWorks sample database to my demo database. I have, in fact, 50 copies of AdventureWorks combined in a single database. The result? 30,000 rows in the person table and, rounding up, 15,000 employees, with about 15,000 rows in the employee pay history. And from the department history of those employees, 850 are salesperson. So that’s how I set this up.
Now I’m going to talk about code reuse. First, imagine that I have never heard of code reuse. I’m a cowboy coder. I just do as I see fit. I’m not going to do any code reuse. I would obviously never do something like this in real life, but let’s just run with it.
This is the query I came up with based on the requirements. It takes some explanation. This looks quite complex, but this logic is the whole formula of what you saw on the spreadsheet. There are some weird things going on here. For instance, what is this business entity ID? Well, I will not accept blame for that. Like I said, this is a copy of AdventureWorks. Apparently, someone from Microsoft thought that business entity ID would be a great name as the primary key for the person table, because why would you ever use person ID? So whenever you see business entity ID in this demo, just think person ID, and we’re all fine. Similarly, the salaried flag is also a weird column name.
Here’s something that’s very little known. I’m not sure that even you know, Brent. If you go to the AdventureWorks sample database and go to the extend properties, do you know that there’s extensive documentation on all columns and tables in there?
Brent Ozar: Get out of here. No, really?
Hugo Kornelis: Yes. Really. In the salaried flag, there’s a weird description that clarifies that this flag doesn’t, in fact, describe whether someone gets a salary or not, as you would assume by the name, but whether they are organized or not. So, yeah, I translate it as being a union member.
Brent Ozar: Yeah, yeah. That’s exactly.
Hugo Kornelis: This query, let’s look at the from clause first. We’re going to grab all employees, join to the person table to get the data of the individual person, and then join to employee pay ranked, which is defined as a CTE here. I’ll get back to that. This joins to their pay rate.
Then I’m doing an outer join to the salesperson table. And outer join because not every employee is a salesperson, but those who are, I need to check that the business entity ID from the salesperson table is not null. If the outer join doesn’t find a match, then this will be null. If it is not null, then we can use the formula for salespersons for the new pay rate for next year. Otherwise, for the other people, we will check whether they are organized. If no, so salary flag equals one means they are not organized and not a union member. Then, they will get their current rate. Otherwise, they will get the three percent payment increase.
The employee pay ranked, like I said, it’s a CTE. That’s because if you look at the employee pay history by itself, you will see that there can be multiple rows for an employee, and those have a start date. It’s called the rate change date. So what is your current pay rate? Well, that’s the rate since the last change.
The only way to find that, because there is no end date in there, only the rate change date, and no “is current” column or anything like that, so the best way to find the current rate is to find all rows, then add a ranking, partitioning by business entity ID. So we start counting at one for each new person, then order by their descending rate change date. This gives the number one to the most recent rate change, the number two to the one before that, and then in the join, I’m only joining two to one with number one. This is the most recent.
Brent Ozar: That’s a lot of work.
Hugo Kornelis: Both explanations, but I don’t want people looking at this query and thinking, “What the hey is this guy doing?”
Brent Ozar: That’s what I think of every time I look at AdventureWorks.
Hugo Kornelis: Well, I can’t change AdventureWorks, but I can at least explain the logic of this query. So, let’s look. As you see, I also have set statistics “IOm”. I know it says IO, but I’ve never seen output in the results of this, so I always call it set statistics “IOm.” Execute this. And if I look at the set statistics IO results, you will see that the number of logical reads, 259 for the person table, 77 for employee pay history, 338 for employee and 11 for salesperson. So every table that participates in the query is hit once. The number of logical reads—trust me on that, I’m going to show it, is the number of pages used by the table. That makes sense because we’re reading all the rows from the table. There’s no WHERE clause in this query.
Take a look at the execution plan. I see a confirmation of this. I clustered index scans on every leaf and I see some logic here for the rank numbers. And if I look at how all the data is combined, I see hedge joins, merge joins. So, all in all, this runs pretty decently. This runs effectively. For a query like this that combines tables without any WHERE clause, this is the type of plan you really want to see. So I’m happy with the plan. I’m happy with the statistics I see here. I’m happy with the performance.
Obviously, even more important than that, I’m also happy with the results. I took some random names from the total list, did the computation, made sure that everything is correct. After verifying that my results are indeed correct, I present this to my manager. He looks at it, turns two shades more pale than he normally already is, and he says, “Hugo, you may be presenting to mostly Americans now, but you are still in the EU. We have data protection laws here. We have privacy laws here. You cannot just show the first name and last name like this. You need to remove it.” I turn even more shades paler and I quickly command out this line and run the same query again. And now I no longer violate any laws. The rest of the results are still the same.
If you paid attention, you would have seen that I still have this join to the person table here. I’m not using any columns from the person table anymore. I was scrambling to get this fixed, very hastily. I didn’t pay attention myself, but I’m wasting performance. I’m joining to a table I don’t use. So let’s see where we find that in the execution plan. If you remember the previous execution plan, you will that this plan is simpler.
If you look at the set statistics IO, you will not see the person table anymore. Even though I failed to remove the person table from the query, SQL Server has optimized and said, “Aha, you say that you want to join to the person table, but you’re not doing anything with the results. So let me run some checks.” Can this join ever cause a row to be removed? No. There is a trust on foreign key constraints, so if you have an employee then I know that there will be a person on the other end.
Can this join cause extra rows to appear? No, because it has a primary key in the person table, so every employee that you join to a person will be joined to exactly one person. So what does this join do? Nothing. So why should I waste any cycles on it? Here the optimizer is helping me. I was sloppy, but the optimizer saved my bacon and gave me the better performance without me doing anything for it.
Now, I’m not saying that you should just join every able in every query to be safe and let the optimize figure it out. That’s probably not a good idea. But in cases like this, where you were in haste, it’s actually a nice feature that SQL Server does this for you.
Now, these results are fine. I present them to my manager. The next thing he says is, “Yeah, do you really expect me to print this out? Waste the rainforest? And then take a stack of paper to the boardroom and tell them, here’s the results? Do you think our board wants to do the addition themselves?” No. They don’t. “Right. So, give me a single number.”
So I go back to my cubicle. Now my life gets easy because I already have the query that is correct. What I do is copy, paste, and make some small changes. This expression is still the same expression. We had used it before. The only thing I did was add a sum around it and I remove the other columns. It’s only a sum and no other columns. I don’t need a GROUPBY. It will simply get a single aggregate of everything. I run this. I did not check the math on this myself. I used a much smaller subset from my test database. It’s correct. Trust me. We’re not talking about correctness here.
I look at the execution plan, and again, I see merge, joins, clustered index scans. Obviously because I have more time now, I did remove the person table myself. But even if I had left it in, it wouldn’t have been that bad. IOs are again as I want them to be. So query performance is correct. Results are correct. This can go to the boardroom and I can continue on my actual work, watching cat videos.
Brent Ozar: Man, I tell you what, AdventureWorks [inaudible 00:25:35]. They spend a lot of money people.
Hugo Kornelis: If you hire them 50 times each, it’s a lot of money. Remember, I multiplied the database by 50 at the start. Anyway, after some time, I hear some muffled shouts from the boardroom and there’s all this isolation. Normally, you don’t hear anything from there. So it must really be hell in there. Ten minutes later, my manager is at my desk. I quickly close my YouTube channel and pretend I’m actually doing real work. He tells me, “Yeah, the blame game has started. The number is too high, and they all think it’s the other department. So we need you to break this down by department.”
I can do that. Again, I use the same query. The same expression with the sum, but now I need to break it down by query. So I add department ID column, and to add to the department, I need to join to another table, the employee department history. Employee department history is like employee pay history, in that it tracks multiple departments as an employee moves to another department with a start date.
Luckily, we also have an end date here. So for the employee department history, finding the current department is a lot easier. I can simply join to the one with end date null. Rest of the query is the same. Let’s execute it. Let’s quickly check performance, clustered index scan. Hatch match. Hatch match. Merge join. More scans. Let’s go to the IO. Yeah. It looks great. I am happy with this. This is like I want. With these numbers, the managers have something to bash each other with.
Before my manager goes back to the boardroom with these figures, he says, “There’s one more thing I need to ask you, Hugo. We have a new ID. We want to inform our staff and make sure that they are up to date on what is going to happen. So we will have a newsletter coming out as soon as the decisions are final and we have already talked to the .NET guys. They will set up a web portal. Then employees can log in and go to the web portal, enter their employee number, and they will see their projected wage for next year. So we need you to prepare the stored procedure.”
It took me some time, but I actually got it through the heads of the head guys that they need to let me write stored procedures, and they call them. I did that on this line. It’s great. Huge. Everyone should have such clients. Anyway, downside of that is actually I need to stop watching cat videos and write stored procedures. But you know what? I already have this query, the query I started with. Remember? Without aggregation.
So this query? Copy it. I paste it and then I simply add WHERE person ID equals at person ID. In this case, I didn’t enclose it in procedure yet, I always test a query first using a local variable. Just set to a value. When that’s done, I will write create proc before it. That’s it. But I’m now not focusing on the procedure itself. This is a performance and user-defined function, so let’s just see how this runs. I get the results. As you see, I have put the first name and the last name back in. You may wonder why. Well, you don’t know my colleagues. They need to be reminded occasionally. Seriously. Don’t ask.
Anyway, looking at the execution plan because this query will be executed about 15,000 times because every employee will see it after the newsletter is out. So now performance really is important. So let’s look at the execution plan. And I see a completely different plan. I see seeks. I see nested loop joins. That’s great, because with a WHERE clause that is so restrictive to return only a single row, you actually want that.
So I look at the rest of the plan. Most of it looks good. The only thing I don’t like is that I have a clustered index scan here which is still for this CTE. We’re still scanning the index. We also see that in the statistics IO, there’s 77 logical reads. If I were doing this for a real company and not for a demo like now, I would try to find a way to optimize this a bit further. But for now, I consider it good enough for this demo.
Now after this, I have done my day’s work. I’ve done a lot of work. I’ve been very productive. I’ve created the queries, and it was fast because I could use copy and paste. But what happens next? After more shouting in the boardroom, my manager comes back to me and says, “Okay. The numbers were wrong. We need to start playing with the numbers. So we will not cut the rate for the salesmen by 25%, but by 30%, and their commission will go up by 20%. We want to negotiate with the union so they will get 2.5 instead of 3%. And giving nothing to the non-union members may be a bit too harsh. There are some very key players there. We don’t want them to quit. So we should give them at least 1%.”
You see where this is going. I need to change the numbers in these queries, but I need to change them here, and here, and here, and here. I need to make the same change four times. What’s the risk of making not the same change each of those four times? It’s maybe not extremely high, but it’s definitely non-zero. There is an actual risk of me messing up there. And just changing the numbers may be a minor change. It can be worse.
Perhaps my manager tells me, “You know, it’s 50% sales increase that we’re predicting. Overall, that’s great, but we need to refine that.” We started in Brazil. We’re doing great in Brazil. We can do 100% sales increase there. America, there 50% sounds about right. But then Europe, remember what happened there? We had Brexit. And look at the polls in the Netherlands. They have elections coming up soon. Wilders is very popular. They might even trump the American election results. We’ll be happy to just have a zero percent rate in Europe.
Now I need to change my query again, factor in the region where the sales people are working. But that might be a different table. So certainly, I have to add an extra join here and here and here and here. I need to make sure they are added in the exact same way without making any mistake. This is where accidents happen. This is where software development goes wrong. That’s why the cowboy coders who work like this often leave a trail of bad things. That’s why I would never have done it this way.
Let’s rewind to the first time my manager stepped into my office and asked me to do this math. Instead of just writing a query, I want to make it reusable. So I hit Books Online and I find out how do I write reusable code. Ah, I can use a user-defined function. I want it to be really reusable in this case, so I’m going to first write a user-defined function that computes the projected commission for a salesperson. I code it with a person ID. Remember, business entity ID is person ID. Then the factor that the commission is going to go up, the 25%, I can change it by simply changing the goal, and expected sale increase, which is now 50% but can also be changed. Then I’m going to read the salesperson table to find the current commission percentage and the current sales. Those numbers go into the formula and I return the projected commission.
Let’s create the function. Creating a function is never a performance issue. It’s using them. Next, user-defined functions get projected wage, which is used for every employee, so we call it the business entity ID, the person ID, at 1800 hours per year. The factor of which the salesperson will see their salary cut. The factor of with which the commission will go up. The factor of which the sales will go up—I’m sorry. The factor of which salaried people or non-salaried people will go up or down.
Check whether the person exists in the salesperson table, yes or no. Remember, that in a bits variable. If you’re not a salesperson, then, and only then, I’m going to hit the employee table to see whether you are a union member or not. So I save a read if you are a salesperson because then I don’t need to do that.
Get the most recent pay rate. Now very easy, which simply you top one. Then either when you’re a salesperson, I do some math including a call to the first user-defined function. Otherwise, I’m going to do one of these two formulas. Let’s create this user-defined function as well. And let’s test this.
Here’s my first query again. I should have enabled the execution plan. As you see, the results appear immediately, so the formula looks very good. And if I look at the execution plan, you will also see that I saved a lot of work. I’m only accessing two tables now. I’m not entirely sure how SQL Server does it. I’ve already seen how SQL Server does magic by eliminating the person table, so I’m going to fly with this. Look at the IO. The person table is read and employee table is read. This looks good. It’s performing even better than before.
So let’s look at the query for the boardroom with the grand total. Where’re my results? Something’s going wrong, this is taking too long. And you see? I hate to admit it but, Brent, the rest of the audience, I just lied to you. When I ran this query, I said the results were fast, but look carefully. The first results appear fast, but in the corner here, point with the mouse, you will see the clock ticking away for, in this case, five seconds. It wasn’t running fast. It was returning the first results fast, but it took four to five seconds to actually return all results. When we go back to the cowboy coder that I never was and never shall be, and do the same query there, you will also see the first results coming fast, but you will see that all results are returned in zero seconds. So something is weird here.
Looking at the grand total, we see that it takes time. A lot of time. Even though when we look at the set statistics, I always see only 48 logical reads. When I look at the execution plan, we only see a single index scan, and aggregate, and that’s it. How can this take so much time? I’ll let you in on a small, little secret, if you promise not to tell anyone. SQL Server is lying to you. SQL Server is lying through its teeth. So let’s see if you can catch it red-handed.
I’m going to start an extended events session and I’m going watch the live data. I’ll also show what I do in this extended events session. This session tracks only a single event, the SQL statement completed event. This is a very lightweight extended event session. If I now rerun the same query, it will again finish in approximately four to five seconds. Again, tells me that there were 48 logical reads. But now, after I stop the event session to make it [inaudible 00:38:46], I can see, and I hope the zoom works in this format. I can see that we actually did 87,048 logical reads. This is not a lie. This is the real number. We didn’t do this in 48 logical reads.
So where do those additional 87,000 logical reads come from? Well, to see that, I’m going to start another extended events session. This session is similar to the one before, but I add one more event, sp statement completed. If I can give all of you an advice, never, ever, ever run this extended events session on a production server, at least not without very strict filtering. Because people say extended events is light overhead, but you can make it big overhead. I think I’m going to prove that now.
We’re going to run the same query again, and I will see more details in the extended events session. You will also see that it now takes six to seven seconds to run. So there is already a 20% overhead just because of the extended events session. I’m going to stop it immediately again and go to the results. And now, and I’m not going to analyze this in detail, at the very end, you will see the same high number again. If you look above that, you will see all those statements. Every statement you see here is a statement from one of the two user-defined functions.
If you’re going to count it—please don’t. It’s a lot of data. If you are going to count it, you will see that there were 15,000 executions of the outer most user-defined function, and 850 of the innermost user-defined function, which is nested inside the outermost. And every time they’re run, there is two logical reads. Two logical reads. Two logical reads. Two logical reads. Two. And there’s multiple of those two logical reads for each execution.
So 15,000 times, two plus two plus two plus two or whatever, adds up to this extreme high number of logical reads that we do not see here. SQL Server does not tell us this. And why doesn’t it tell us? Because we used a black box. And what is problem with a black box? You can’t see what’s happening inside because it’s black and it’s closed.
That’s why you don’t see the actual performance hit you take when you use a user-defined function. Now there is one way, and not many people know this. There is one way that you can actually see what is going on in the user-defined function in the execution plan. Remember, this execution plan shows nothing. But you can actually see what’s going on just by not running a query, but requesting an estimated execution plan. If you do that, then you will see at the top the same query, exactly the same query. Below that, you will see the execution plan for the GET projected wage user-defined function. Below that, for the GET projected commission.
Now here’s the interesting thing. I’m not going to into the details of costing but costing is relevant in execution plans. There is something weird going on here. In the outermost query, the plan that actually runs the query, you may think that this is where the execution plan is invoked because it’s a compute scalar, but you would be wrong. The user-defined function is actually invoked in the stream aggregate. The compute scalar is just there for null handling. And here, I’m going to zoom this again, you will see that the GET projected wage is actually called as part of the stream aggregate.
Now here’s the interesting thing. Because we see the estimated execution plan, we can look at the estimated cost. Estimated cost for the total execution of a single GET projected wage is, well, let’s round it, 0.01. About a hundredth of the unit cost. If you look at the stream aggregate, you can see how many rows it processes. That’s about 15,000. So there will be 15,000 executions of GET projected wage.
15,000 executions costing 0.01 cost units each, makes for a total estimated cost of 0.008. Obvious, right? SQL Server in this case isn’t just lying to us, it’s lying to itself. Even though here you can see that SQL Server is aware that there is a non-zero cost of executing this user-defined function, and this is actually one using index seek. I’m not scanning anything, it’s just seeking, so it could have been worse.
Even though SQL Server is aware there is non-zero cost, SQL Server will for the stream aggregate or the compute scalar, or anything else where it computes it, still assume a cost of 0.0000001. Forgive me if I have one zero too much or too little. But if you look at the costing of any computation, no matter if it’s a simple addition of numbers or a complex geometrical equation or a user-defined function, it’s always the same number, 0.000000001 times the number of rows. Always and every time.
In the case of the stream aggregate, there’s some overhead, but in plans where you actually get it inside a compute scalar, you will see that exact number. In fact, I can show that by running the estimated plan for the first query, where there is no aggregation. So here, the compute scalar actually does call the user-defined function, as you can see here. So this does the computation. Estimated operator cost here is 0.00129. Where does this come from? Well, from the expectation of, rounded, 12,900 rows.
So, SQL Server is lying to itself. It estimates the cost of the user-defined function as being almost zero. That’s why you will sometimes also see that the same user-defined function with the same parameters is called twice instead of just once. Why? Because optimizer thinks, okay, I can start that and then direct it along through the other operators until I can reuse it, or I can recompute it. Hmm. Recomputing it is very cheap. Let me save some storage. It makes the decision based on this lie to itself that this cheap.
Okay, in the interest of time, I’m not going to run the rest of the demos tables. Simply, every time show the same picture of incomplete data in the execution plan, incomplete set statistics IO, bad performance. If you actually track how long it takes, with one exception, if you run this with the filter on just one specific row, then the user-defined function will only be executed once in this case. But even here there’s no guarantee. For this query, SQL Server could come up with an execution plan where it first does the user-defined function for every row and then filters. It doesn’t in this case, but there is no guarantee that it will never do that. So, be very careful with those scalar user-defined functions.
I’m not happy with the performance anymore. It looked good, but it was actually terrible and the reason it was terrible was, as we saw, all the data access going on in a user-defined function it is called once, and then again, and again, over and over again. I need to change that. I need to fix that. Obviously, if data access is the problem, one fix is to remove the data access. So I’m going to create new user-defined functions, get projected commission 2, and just use the math. So instead reading the tables in the user-defined function, I’m going to pass in not the person ID, but the current commission and the current sales. So now I don’t have to read them from the table anymore. And I also have the same factors. And this works.
Brent Ozar: I never thought about that.
Hugo Kornelis: You never thought about that for good reason. I’ll show you why. So here’s the parameter list for the actual GET projected wage. See, there’s a lot of parameters that need to go in. A lot of parameters. But at least we could eliminate all data access. Now we have the user-defined functions, but if you want to use them in a query, you need to supply all those parameters.
Where do you get them? Well, you join a bunch of tables. So, yeah, we did use a user-defined function here. We eliminated this formula, this case expression that I had before. We simplified this a bit. We could put some of the logic in a black box and this one actually does run fast. It’s not me cheating you guys. This is actually running fast. Every auto-query you see in this demo runs fast.
There is still some overhead. You are evoking user-defined functions so SQL Server has to pass control, pass it back again. There is some overhead involved here, but it’s little. But you didn’t just lose the performance issues, you also lost the benefits.
I have this join here and I have a copy of it here, and another copy of it here. And a fourth copy here. So, yeah, I could put some of the logic in the black box, but the computation wasn’t even the hardest part. The hardest part was getting all the data complete. Remember with this weird CTE that I had to explain and this outer join that I had to explain? All that project, which is way more complicated than the mathematical formula, is still duplicated all over the place. If I need to change it, I need to change it everywhere. So, yeah, I did use a user-defined function, but I didn’t get any benefit. There’s no help here.
I’m not the guy who gives up easily, so I opened Books Online again and leafed to the next page, and hey, there’s another type of user-defined function. There’s not just the scalar user-defined function, no, there’s also something called the multi-statements table valued user-defined function. Please don’t make me say that again. So let’s just call it a table valued function.
The long name is actually relevant. It can contain multiple statements. It can be very handy. It’s like a stored procedure. You can put as many statements in there as you want, you can control flow, you can do a lot of things. There are some limitations, but you can do a lot of things. In the end, you return a table of data instead of just a single value.
What I’m going to do here is create a new function. And because I want to return a table, I say returns a table. I have to give it a name and I have to specify the column list. So this is the table I’m going to return. Then I have one or more queries. Somewhere in that logic, I’m inserting into a table variable. This table variable is not explicitly declared. It’s implicitly declared as part of the returns clause. So when this executes, it will run the query, insert data in a table variable, and return that to the client.
Now my query gets very simple. The only thing I need to do in the query is just use a different function name in the FROM clause. Because it’s table valued, you can use it the FROM clause. And now, you can just select data from it and results are returned instantly. Zero seconds here. Obviously not exactly zero, but we’re rounding it down to zero. Let’s include the execution plan to see what’s happening. You see that very little is happening.
If you look at the set statistics IO, the only thing you see is reading from a temporary table which is actually the table variable. What we see in the execution plan is SQL’s operator says do this first, then do this. So if we’re doing a table valued function, this is the black box. Here the magic happens, but since the box is black, we don’t see what’s happening. Again, execution plan isn’t showing us everything. Set statistics IO wasn’t showing us everything. We only see the clustered index scan.
So in this black box, this query that you see up on the screen, I keep pointing at my screen. Everyone makes that mistake only once. I already repeat it. I did it twice already. So, in this black box, this query you see above is executing, stuffing data in a table variable, and then rescanning the table variable and reading from it again. Using the estimated execution plan trick, we can actually see that happening. So we see the same plan here. Then scrolling down, we see in here, this is the plan that we saw before with the merge join and the index scan. The plan that we saw when I was working as a cowboy coder and just running the query.
The only thing that’s added is an insert, so we’re inserting into the table there. Then at the end, we’re just rejoining the control. There are some problems here. The problem is, it’s a black box. User-defined functions are a black box. It’s not just that we can’t see what’s going in from inside the black box, you can’t see what’s going on outside. So if I [inaudible for several words 00:53:45]. If you look at the SELECT, you will see that two columns less are returned, if you look at the properties.
This plan also is still the same. We’re still joining, and I need to scroll a bit find it, we’re still joining to the person table here. Why? Because while this function is running inside the black box, SQL Server does not know what the outside will do with the results. So the user-defined function when running [inaudible 00:54:23].
What I now did is I scrolled ahead to the query where I needed to add the department and break it down by department. We’ll see a new issue, one that SQL Server doesn’t even try to hide from me. When I execute this with execution plan enabled, you will see the results.
Let’s look at the set statistics IO. There is a weird issue here. I see 14,500 scans for a total of almost 30,000 logical reads on the employee department history. That was a small table. It had only like 80 pages or so. How are we reading so many pages here? The answer is in the execution plan. And, as I said before, but I think that got lost because of the connection issues, this is a problem that these are actually not being hidden from us. So what we see here is again a sequence, do the black box thing first. It fills a table variable. Then, we’re going to read from that table variable in here. Then we’re going to join to the department history.
Here is an interesting issue. Because the table valued user-defined function runs as a black box, SQL Server has no way of knowing how many rows will be returned. It’s a complete black box. It doesn’t know what’s going on inside. So it can’t use any of the statistics it has available. It still has to come up with a number because otherwise you can’t estimate anything. Here’s the interesting thing. What Microsoft considered the most valuable number for something that returns a table valued structure was one row. So the estimate is one row. Always one.
So why did they choose to return a table if I wanted one row? I don’t know. Microsoft got a bit smarter in SQL Server 2014 when they changed the default to 100. But of course I can find ways to poke fun of that number. Any number you put here is just a fixed number. It’s almost always going to be wrong. In this case, the wrongness is hurting us tremendously.
Because there’s an estimate of one row, the optimizer says, okay, for this one row, I’m not going to scan the employee department history and do a merge join. No, I’m going to do a nested loops join because it’s only once. Yeah. But in reality, it’s going to do it 14,500 times. Now we get 14,500 repetitions of this nest loops join. 14,500 executions of the clustered index seek. And that is why we suddenly get 29,000 logical reads.
If you think this is bad, you haven’t seen everything yet. It gets even worse when we look at the final query where I’m preparing the stored procedure for the [inaudible 00:57:32] that will be hit by every single one of our employees within probably an hour after that newsletter comes out.
I’m not going to bother with actually executing it. I’m going to look at the estimated execution plan again. Sure enough, there’s assignment variable, there’s the plan, invoke the table valued function. Then do a clustered seek, another scan. So it’s effective. It’s not scanning the enter table, it’s just seeking. That’s good.
But look at what’s happening inside the black box. In the estimated plan, you can see it. It’s still scanning all the tables. Merge joining. Inserting all that. So when 15,000 employees hit our web server to execute this query, 15,500 executions will each call the user-defined function, which create a table variable with 15,000 rows, with just one row from that and throw away the rest, times 15,000. Yeah, that’s not the kind of performance I had in mind when creating this. So I’ll almost ready to give up.
But remember what I said. I don’t give up easily. There has to be a better way, and, if fact, there is one. There is another type of user-defined function that actually even has a slightly shorter name: an inline table valued user-defined function. The name may be shorter, but the usage is more complex. The inline table valued user-defined function also returns a table, so it’s very similar to the previous one, but the syntax is different. It still says returns table, but there’s no name here, and there is no specification here. We don’t see how the table looks. And then there’s also no begin and end in the body. There’s only a return statement. That is by design.
If you use an inline table valued user-defined function, you have to type returns table as return. This is fixed. Then you can do one single statement. So all your logic has to go in one single statement. If you currently have a multi-statement table valued function that uses tons of logic and if all and you want to replace this with a single statement, you will probably come up with a very complex query. It might still be worth it because we’re going to see how much better this version performs.
But let’s first create it. In my case, the single statement wasn’t that hard to come up with because I already showed it in my first demo, but I didn’t use code reuse. It’s exactly the same statement. Usage of this function is the same as for a multi-statement table valued user-defined function for the user. Both types of function are table valued. From the user’s perspective, you don’t see the difference in the query, but you see differences elsewhere.
Let’s run it. The results look fine. They return fast. And now the set statistics IO suddenly gives me the full picture again. I see all the tables that are accessed. I see everything that’s happening. I can even look at the execution plan and see everything that this query is doing, including any potential issues.
I may not be able to fix them in this query. I need to fix the user-defined function, but I see everything that’s going on. From a coding perspective, I have this nice black box that I can simply reuse in other queries wherever I want. It’s still a black box for the developer. But for the optimizer, we have given the black box a glass lid. The optimizer is able to peek inside the black box and see what’s going on. Remember, when I wanted to abide by the law and not show personally identifiable information? Well, I can command out first name and last name here, run the query again, and now the person table is no longer accessed, just when I remove those columns from the original query. It’s no longer in the plan.
In this case, it’s not me being sloppy. There is no way I could have removed the person table myself here, but the optimizer still does this for me because this glass lid allows it to look inside and outside, and the code in the user-defined function knows, okay, first and last name are not executed. I’m not going to use it. I’m not going to bother fetching them.
If I join to the departments to get a breakdown per department, you will also see that all the information I need is here. Because of this glass lid, we no longer have a weird estimate of one row. All the estimates are correct. You will see that SQL Server even puts stuff right in the middle somewhere, the department history is right here. Right in the middle. So it actually mixes parts of the user-defined function with parts of the outer query to get one final query.
This is the great benefit of using inline user-defined functions. Obviously because of this [inaudible 01:02:41], when I add a where clause, I immediately get a plan with looped joins and seeks and we don’t get the bad performance on a web form when 15,000 people are hitting this query. Problem solved, my day is good. I was able to find a way to have code reuse without sacrificing performance. Of course, it will always be some overhead, but it’s minimal in this case. And this is really the silver bullet. So, that concludes the demo.
After the first demo, I promised more demos. Switch gears. Completely different scenario. You are a database developer for—oh, never mind. You have a job. You have a job, and this company does something that Gianluca would probably condemn and put me for in hell, but this company does it anyway. What this company does is they have various tables with locations. So they have a table of countries, a table of states, a table of regions, etc. And each of those tables looks the same. It has an ID and a name column. Country ID, country name. State ID, state name, etc.
Then you have other tables where you refer to locations, but those references are a bit weird sometimes. When we track sales in our company, we prefer to do this on the state level. If you want to do any analysis on your sales and you only track that this sale was made somewhere in the USA, yeah, that’s not very useful. You want to know whether it was in Wisconsin, or in Arizona, or in whatever state. If you sell something in the Netherlands, yeah, well, we’re a small country. We don’t even have states. So sales in the Netherlands are tracked at the country level. So how are you going to solve this? How do you solve this with a sales table that has a reference to either a country or a state?
There are multiple ways you can do this. The way I’m going to show in a minute is probably not the best way, except when you are demoing user-defined functions. So let’s immediately jump into the demo and remember this is not a recommended best practice for storing this type of data.
I’ll just a start the query. What this query does is create a new empty database, very intelligently called UDFDemo2 because I’m very creative. Then we have a table of continents. We have data lines. Continent one is Africa, continent three is Australia, etc. We have a table of countries. Country two is the United States of America, etc. We have some states. And those are all definitions you can find on Wikipedia. There’s some stuff that’s very proprietary to our company, so we have defined areas like the Americas or EMEA or the Pacific. Then we have regions like the European Union, southwest of USA, etc. All those tables exist, have been created.
Here’s the actual data that refers to these tables. So like I already said, sales are either to a state or two a country. How are we going to track that? The location type is at char 1 or char 1 or whatever column. There is either—and I have check constraint here—it’s either a C or an S. So here we store whether the location is a country or a state. Now we have the ID here. So this location ID can either refer to the countries table or to the states table, depending on the value for its representatives. They can work at the country level or at the state level, but some sales representatives share an entire area or a region in the [inaudible 01:07:00]
Our directors do no work at the country or state level but they can work for an area or a region or even for a continent. Yeah, I know, always a bit weird with continents, but I blame the English language for making country and continent starting with the same starting letter. This is all [inaudible 01:07:18] here. It’s all realistic data.
I also wanted to do some performance testing with larger tables. I don’t like realistic data. I like random data. So I add a few more regions and areas so that I have six rows in every table. And then I use basically casting a die a lot of times, 10,000 casts of the die to come up with random references, giving me a big table with 10,000 references to randomly create locations.
Now, I already know that scalar user-defined functions are not a good idea, but sometimes it helps to get the mental juices flowing to start by something that you know is not fast, but is easy to wrap your head around. So in this case, I decided to start with a scalar user-defined function anyway. I’m even going to bother checking its performance because I know it will be bad, but I want to make sure that I understand the logic.
So I’m going to create a function called get location, just called with a type parameter and then ID parameter. It returns a vchar 30. In this format, as a scalar user-defined function, it’s actually quite simple how you code it. If the type passed in is a C, then you simply get the country name from the countries table based on the location ID passed in. You put that in a variable.
If the location type is not C but S, you do the same, but now for the states table, or for the areas table, the regions table, etc. Then you return to the location and life is good. You have your data. Let’s quickly check this. Yeah. The results are correct. Like I said, I’m not going to look at performance. I already know it’s going to be bad. This works on every single table I have, even on the big table which has all the random references. The results are simply there.
So this is the logic I need to write, but I need to write it as an inline table valued user-defined function. And inline means it has one query. So the next step in my logical process is to try to create a single query for this.
Brent Ozar: Ouch.
Hugo Kornelis: To do this, I use a trick. I’m going to test it on the big table. I’m going to select rows from the big table, because it has every possible reference. What I’m going to do is go into an outer join on the countries table. This join with only produce a row if this predicate is true because this predicate references not only the countries table but also a simple reference on the big table. There is weird thing. If I am processing a row from the big table that references not a country but something else, then it doesn’t matter which row from the countries table I try to join it to. The t.locationtype will never become C because that doesn’t change when I try to join it to another country.
So when I’m processing a row that has an S in the location type, this outer join will never produce a row. It will simple produce a null value. However, that same row which has an S in its location type will find a match in the states table because the states table, for one state ID, this predicate will be correct. So we do an outer join to every table with the correct type in here. For each row in the big table, exactly one of those outer joins will produce a row. The others will not. Then I use a COALESCE function, which is like IS NULL on steroids because you can use multiple parameters to return the first, in this case, single non-null value.
If I run this query, you will see the correct results and trust me, I verified them. They are correct. And now I’m looking at performance. Performance looks very okay. 25 logical reads on the big table. Just two on every other one. It’s all scans and hash matches. Again, because there’s no WHERE clause in here, this is the kind of plan I want. If I don’t want to do this for the big table, but, for instance, the sales table, I can copy the query, paste it, and change the table name. I don’t need to change anything else. Well, yeah, okay, the sales ID. I don’t need to change anything else. I’m lazy. I didn’t even remove the areas, regions, and continents, even though they will never be used in this query.
Again, the optimizer does its magic for me. The optimizer looks and says check constraint. Says, hey, there’s a check constraint here that tells me there can only location type C or S. So this location type will never happen in this table. That means that this outer join will never do anything, which means I don’t need it. That’s why in set statistics IO you only see the states, the countries, and the sales table.
In the execution plan you will see, in this case, nested loops joins. That’s because the number of rows is extremely low here. It’s just two rows. For real sales tables, with a few thousand rows, and without a WHERE clause, these would also have been hash matches. Good. Problem solved. I have everything in a single query and when you have everything in a single query, you can create an inline user-defined function by pasting in that single query.
So here’s the function. Get all locations, big table. We can simple select from it. Because we still have this glass lid, we see everything that’s going on. We get good performance and life is good. But how is this for code reuse? How can I reuse this user-defined function to query the sales table? Well, I can’t. So I have to create a new copy for specifically, the sales table. After that, I have to create copies for the directors table. So instead of removing the code duplication, I hid it in multiple black boxes all over the place. This [inaudible 01:13:54]. I thought I had the silver bullet, but it’s not working anymore. Hmm.
Then is struck me. We found that SQL Server was lying to us all the time. Why not return the favor? We not lie to SQL Server? Why do I want to lie to SQL Server? Because I actually had functionally a good situation here. I had a good function, get location, that independent of table, gets where the data comes from, gets me the location. It’s a scalar function. I want it to be an inline table valued function. So I’m going to lie. I’m going to cheat and say to SQL Server, I’m going to return a table. That table will have one column. It will have one row. Guess what a table with one column and one row is? It’s a scalar function. That’s the lie I’m going to implement here.
So on here we see all the elements come together. So, I use this query. And the only thing I change is, instead of using the big table, I’m picking a table based on the parameters that were passed into the function. So from the parameters in this function, I’m going to create a single row table. That single row table will then be outer joined to every other table in the same way we had it before in the single query on the big table, to get me the correct location.
Now, because I cheated SQL Server, I cannot simply say select dbo.get location inline. Remember, I created a function that effectively returns a single value, but it technically is still table valued function. So this will fail. I have to pretend I’m selecting from a table. If I do that, then I actually get one single value back, but SQL Server still thinks it’s table.
If you want to use this in a real query, you cannot simply join to it. This has to do with how the join is defined. If you join two tables, SQL Server has to be able to move these around. Simple ANSI standards, this is how joins are defined. Every part of the join has to be independent. Some database platforms still allow you this kind of query, which is a violation of the ANSI standard. SQL Server doesn’t, but the need is still there. So SQL Server instead has the cross apply. Cross apply is effectively similar to a join, but it forces the order. It tells logically speaking, SQL Server…
Brent Ozar: Hugo, your audio’s gone again.
Hugo Kornelis: Oh boy. I’m sorry for the audio issues. I hope it’s coming back soon.
Brent Ozar: Looks like we’re back…
Hugo Kornelis: I hope someone will find the time to run the last demos and you will see that you still have this glass lid experience. You will see that you get good experience, but you need some trickery to get it working.
To conclude, scalar user-defined functions are a real performance killer. They do this data access once per execution. There’s no way for the optimizer to do its magic here. And also, I didn’t touch on this in the presentation itself, but scalar user-defined functions do not go well with parallelism. So with all your plans, completely serial. Multi-statement table valued user-defined functions kill performance in a completely different way. It has to do with the fixed estimates and it has to do with completely filling the entire table and then selecting from it. Again, no optimization.
The only type of user-defined function that is okay is inline table valued user-defined function. This is the black box with a glass lid. You get your benefit of code reuse, but the optimizer can still look inside and do all its magic. There are problems with this.
Brent Ozar: Hugo’s audio has disappeared again. So we’ll go ahead and wrap up here. I want to thank everybody for coming out today. Want to especially thank Hugo for volunteering to present to everybody.
Latest posts by Hugo Kornelis (see all)
- T-SQL User-Defined Functions, or: How to Kill Performance in One Easy Step - November 22, 2016