Any DBA that wants to start using/looking after Azure data services.
There seems to be a common misconception that once you move from on-premises SQL Server to Azure PaaS offerings a DBA is no longer required. This perception is wrong, and in this session I’ll show you why. As a business intelligence consultant I develop data platform solutions in Azure that, once productionised, need administration. As the title suggests, be my Azure DBA. Maybe not DB for database. Maybe in Azure I need a DSA, a Data Services Administrator. Specifically we’ll cover a real business intelligence solution in Azure that uses Data Factory, Data Lake, Batch Service, Blob Storage and Azure AD. Help me administer this next generation data solution.
Why I Want to Present This Session:
Mainly to deal with this misconception that we don’t need DBAs in Azure. We do. I want them and I want to shout about what could become the DBA of the future. A DSA!
Brent Ozar: Howdy, everybody. For the next episode at GroupBy, Paul Andrew is going to be talking about how to be his Azure DBA, or DSA, as he calls it. So take it away, Paul.
Paul Andrew: Thank you. So what I like to do with a lot of my sessions is start by playing the Azure icon game. This is a primer to the main session. And it tends to work quite well at events when I’m face to face with people because they’re sat in the audience and they can’t cheat. I think, because we’re doing it on a webinar, people could perhaps just go into the Azure portal and find out the answers, but I think we should play anyway. And, Brent, I’m expecting full marks here.
Brent Ozar: Oh boy, this is going to be bad.
Paul Andrew: So, I don’t know, maybe we should do fastest finger first in Slack or something like that?
Brent Ozar: Sure.
Paul Andrew: I’ll minimize the presenter view and so hopefully I can keep an eye on Slack as we go through. So the Azure icon game – so a lot of my slides have got a lot of these icons in, so it did start off as a bit of fun, but I think it does actually help perhaps people learn what I’m talking about with some of my other content. So we’ll start off with an easy one. This is an easy one, right?
Brent Ozar: Oh boy, I think I know what that one is. We’ll see if people in the Slack channel know it. Or if you’re only in the GoToWebinar Q&A, you can put your answer in the questions as well. No one’s even typing. I don’t think they even know – okay, Melanie in Q&A got it. It’s the Azure platform overall, she says.
Paul Andrew: Yeah, absolutely, so that’s an easy one. Good start. Hopefully another easy one. We’ve already talked about it, or we’ve certainly mentioned it…
Brent Ozar: Let’s see – I still don’t even see people typing, that’s how stumped they are…
Paul Andrew: Oh dear. Again, I think this is perhaps why this is the perfect audience for this talk.
Brent Ozar: Brandon says a logic app?
Paul Andrew: No, this is a virtual machine. Okay, this is going to be an easy one. Microsoft have done some really great marketing with this one.
Brent Ozar: I didn’t know it had little bands on it. Yes, Melanie says Cosmos DB.
Paul Andrew: Yes, Melanie. Good work.
Brent Ozar: Melanie is two for three.
Paul Andrew: How about this one – we’re going to get a little bit harder now. So if we’ve got an Azure tenant, we will certainly have this.
Brent Ozar: Todd says an active directory.
Paul Andrew: Yes, Azure Active Directory.
Brent Ozar: We have a lot of wrong answers in the Q&A, but yes…
Paul Andrew: Okay, this is an easier one. Brent, what’s that one?
Brent Ozar: I want to say that’s Oracle. I believe that’s the Microsoft hosted Oracle.
Paul Andrew: Yes. Okay, a little bit harder…
Brent Ozar: The first placed cloud, that’s what that is; whatever one is in the first place.
Paul Andrew: That one is the Azure advisor…
Brent Ozar: Ooh, Tran got it in Slack.
Paul Andrew: Nice one. Okay, this is a nice intuitive one I think. Again, we’ve all got one of these if you’ve got some Azure workspace.
Brent Ozar: Rubik’s Cube… Tran says event grid…
Paul Andrew: Not event grid. This is actually just our resources.
Brent Ozar: Todd at least got it. Todd said all resources over in GoToWebinar Q&A.
Paul Andrew: I should open that window as well. Okay, excellent. Okay, moving on, what’s this one?
Brent Ozar: I think I know this, but I’m not sure. Tran says virtual network, Daniel says network.
Paul Andrew: Yeah, virtual network. Tran’s taking it home here. So we had the Azure SQL DB, what about this one?
Brent Ozar: I think I know this one. We’ll see if other people get it. Daniel says elastic pool, Robert says data warehouse. Engineering SQL says containers…
Paul Andrew: Robert, yes, SQL DW or SQL Data Warehouse. Okay, moving on, how about this one?
Brent Ozar: Oh, I know this one. Is this the Internet of Things?
Paul Andrew: No. I think everyone’s stumped on this one.
Brent Ozar: Yeah, everyone went dead quiet. Cory says, “My kid’s attention span?”
Paul Andrew: This one is cognitive services.
Brent Ozar: Oh, so the cloud is supposed to be a brain in that one. That’s what that was.
Paul Andrew: I think so, yeah. How about this one? Sorry, the resolution on this one is not very good.
Brent Ozar: See, we’ve already talked about virtual machines and I would have guessed that was virtual machines because it looks like pizza boxes, but that’s not it. Let’s see how the attendees do. Tran is typing. Tran says a node.
Paul Andrew: Not a node.
Brent Ozar: Crash Dan says cache…
Paul Andrew: Databricks, yeah – it does look like a Redis cache… How about this one? A little bit more intuitive, I think, this one…
Brent Ozar: Breaking Bad? Gene says Azure Machine Learning.
Paul Andrew: ML, yes. They’re waking up now, aren’t they? Either they’re looking at the portal or we’re getting them engaged.
Brent Ozar: You’ve hit the parts that Daniel uses, or that Gene uses, yes.
Paul Andrew: How about this one?
Brent Ozar: What the – Tran says storage.
Paul Andrew: Yes, storage. Good. Sticking with the storage theme, a folder with a lightning bolt.
Brent Ozar: This is the file service thing, but I forget what it’s called? No?
Paul Andrew: You can swim in it. That’s probably a big hint.
Brent Ozar: So Tracy says cache; Thomas says Data Lake.
Paul Andrew: Data Lake, yes, Data Lake storage.
Brent Ozar: What sense does that make? Do you want to swim where there’s electricity in the air?
Paul Andrew: Who knows. To accompany our Data Lake store, we might have this…
Brent Ozar: Let’s see what people come up with. Crash Dan says pool, but that’s not it. Let’s see – Gene is typing again. Gene might have it. Data Lake analytics?
Paul Andrew: Yes, correct. Good. How about this one? Quite intuitive again, I think. Data Factory…
Brent Ozar: Data Factory…
Paul Andrew: Good work, yeah. If we have a custom activity in our Data Factory, we might need this – is the best hint that I can give you.
Brent Ozar: Web jobs, Gene says.
Paul Andrew: Not a web job. Any other takers?
Brent Ozar: Logging, Engineering SQL says…
Paul Andrew: This is the Azure batch service. So we will talk about this one in a bit as well. Okay, moving swiftly on, serverless compute, people. We’ve got some more lightning bolts going off here. Automation, yes, on the money. Okay, that was an easy one. Similarly with the lightning bolts…
Brent Ozar: A couple of people typing. Function app?
Paul Andrew: It is a function, yes; Azure functions. Good work. I think this is quite a nice one. This one makes sense to me.
Brent Ozar: Azure baby-making?
Paul Andrew: Logic apps, yes. Somebody in Q&A. How about this one? This is the platform as a service version of our SQL Server friend.
Brent Ozar: The crowd goes – well, Robert Rice is typing, Tom’s typing. Flash isn’t good, someone says. Yes, I agree, Flash is dead. The iPhone killed that ten years ago. Availability Groups?
Paul Andrew: It’s analysis services, that one. We’re nearly there.
Brent Ozar: I was going to say, I don’t know what we’re learning. It’s not good. We’re not smart. Oh, this is bowling.
Paul Andrew: This is real-time data; if we want an endpoint for real-time data, we might send it to one of these.
Brent Ozar: The crowd is silent on this one. There’s not even any typing going on.
Paul Andrew: This is the Azure event hub. Similarly, for our events – this is quite a hard one. I don’t know if anyone’s using this one yet. This is the event grid. I’ll move on from our real time data and we might put it through this service.
Brent Ozar: That’s kind of a nice icon. I like that. It’s like a cog with a hairdo.
Paul Andrew: Stream analytics, yes. Good work.
Brent Ozar: Robert says cog comb-over.
Paul Andrew: We could rename it to that, I think. This one – I think this is a tricky one that catches people out. That one is the data catalogue. It’s quite a nice tool that one. This one, this is certainly more important for us, I think, these days.
Brent Ozar: Key store?
Paul Andrew: Yeah, key store, key vaults is its actual name. We’ll take that. Okay, last one, the one that underpins all Azure services.
Brent Ozar: Meat bags.
Paul Andrew: We should all know this.
Brent Ozar: Our survey says – we’ve got a couple of people typing. Oh no, they backspaced out. Coffee, Daniel says.
Paul Andrew: Okay, caffeine, yeah. Good stuff. Thank you for playing the Azure icon game. You win absolutely nothing. If you’d like to play along at home, I do recommend getting the stencils for Visio, which, as I say, is where I spend a lot of my time when we’re architecting solutions, scrolling through trying to find the right icon. So yeah, do go and get that and you too can play along at home. Good, so hopefully we’re now all warmed up. Brent, are you feeling warmed up?
Brent Ozar: I am, I am, yes. I now feel completely incompetent.
Paul Andrew: Oh, this is just a thing to help as a primer. Okay, so my main session then, Be My Azure DBA, or DSA. So this was, I think, an abstract and a talk that I put together about a year ago now and it was very much just a cry for help. So as I’ve already said, I spend a lot of time working with the Azure services and doing things, and my role has very much changed now to be a BI dev, if you like, but in Azure. And there are lots of problems that we face with some of these services and I basically just needed some help with them. And when you’re living on this ragged edge of development, all too often, I will search for an answer using our favorite tool Google and I’ll get no results. So it was really just a cry for help about a year ago now when I had certain issues that I wanted – oh, there must be somebody out there that is now going to be moving into this Azure space, a bit like I am, and they could probably help me out.
So I was thinking along the lines of a DBA type function, but as I say, for me, my job has changed very much as a BI person from the SQL Server stack to now this Azure platform of services. And for me, this DBA role, this on-premises DBA that’s been working with SQL Server, for me, I think there’s a gap here now where I need somebody’s support. So this was where this talk came about. And hopefully I’m not going to offend anybody here by making bold statements about what people’s jobs are or what people’s jobs should be, but really, just a bit of thought-provoking. So we’ll go through some of these scenarios and we’ll see how we feel about it.
So, if you want the slide deck that I’m about to show you, please feel free to go and grab it. I’ve got a community event repository on my GitHub, MrPaulAndrew, which is the same as my Twitter handle, so please go and grab it from the GroupBy folder on there. Just a quick disclaimer – it’s something that I tend to put up when I’ve got an international audience. You’ve got a blunt Englishman here with a dry sense of humor, so again, please don’t take things too literally, that’s my quick disclaimer.
So our agenda in how I want to break this down and hopefully how I can relay some of these concerns and issues to you, well I’ll hopefully just set the scene with a bit of context, and Brent’s given me some great primer content for leading into this. And then I’ll actually talk about some real-world scenarios and where I actually face issues that I really want somebody to help me with. So we’ll look at that and then we’ll perhaps just draw a few conclusions.
So first up then, let’s just take a step back in time. For me, I think we need to look at our IT industry and look at the people that are actually working in it and particularly those people’s roles. So once upon a time, when I was at school, there was a single person in the IT department, this IT manager. It was one guy and that one guy knew everything that he needed to do to manage the network, and this was all good. And then, as technology advanced and things became more involved, we sort of broke that out and this person kind of got diluted. Forgive the stereotype, these avatars. They do also come from the Microsoft stencil pack as well, so you can get them and have some fun with them. So then things kind of progressed and these roles diluted a bit more and people got a little bit more specialized in what they were doing. And this was great because this stuff gets more and more advanced. And that’s certainly true of the tech that we’re using now.
And maybe if we skip forward a little bit, for me, this is kind of like current day on the right-hand side here. So these roles have now been diluted and specialized to the point where we’ve got teams of people working on some of this stuff, and that’s absolutely the case. We need this because we’ve got some very advanced technology out there. And as we know, nobody can be an expert in everything. So for me, present day over on the right-hand side, and as we’re present day, in here, we’ve got me. I didn’t want to put myself at the front of this team. I’m going to be the second one back because I’m not that brave.
So here’s me in this current world, this current situation. And let’s kind of do something a little bit scary and let’s look inside my head, which says – I promise you, this is a family-friendly show. But if we look inside my head once upon a time when we talk about an on-premises solution as a data analytics person, we pretty much had SQL Server. You know, this was our friend that we know and love. It’s very much now, for me, a mature technology. We’ve had ten years or so now of this product just getting better and better. And within it, we’ve got a whole bunch of services which, obviously, this is certainly growing now. I’ve perhaps just put the main ones up for me.
But inside my head, as I say, we’ve got these services and how I need to apply them in my day to day life. And I’m going to bravely say that on the vertical here, we’ve got my knowledge depth in some of these services. And here’s my crude line of my knowledge depth, I think, on these services, once upon a time. And that’s where we were and that was great. So sticking with that idea, there’s obviously some gaps there. I don’t have ultimate knowledge depth on all of these services, so I need somebody’s help. So along comes my admin team, if you like, or my DBA. And don’t get me wrong, I love those DBAs, so let’s do something even scarier now and let’s perhaps look inside the head of a DBA.
So I’ve picked a beardy DBA, a good friend of mine. So if we’re inside that DBA’s head and maybe we’re kind of overlaying the DBA’s knowledge now with my own, maybe something like this squiffy line – I haven’t actually clarified this with my subject matter expert, but hopefully he wouldn’t disagree with me. And between us, with our SQL Server product, we can pretty much cover most of the challenges that we’re going to face. So, where I’ve got gaps in my knowledge, that will be perhaps picked up by the DBA and vice versa. We can help each other out here and that’s great. So that’s very much an on-prem situation. Now, if we kind of move forward in time and we start talking about cloud services, and it may be a BI platform as a service solution that we’re going to deliver for a customer, we’re going to do that with Azure.
Now, we’ve all played the icon game here, so let’s bring a handful of those up. Now, if I bring on the same sort of squiffy line, it’s going to be something a bit like this. Now, for me, this is a massive problem because where we could have a certain depth of knowledge in a particular service, now we can’t just have a good depth of knowledge in, say, one or two of these services. That knowledge, we’ve now had to massively expand it where we have to have a huge breadth of knowledge on a whole bunch of other stuff. And this really makes my head hurt. I did play around when I was creating this slide with how modest to be with how far I bring this line down or not with the depth of knowledge, but I’ll comfortably go with that for some of these.
And if I want backup with some of this, if I get stuck, if I’ve got a problem, if I want somebody’s help, do I call to this administration team? Are they going to be there for me when I’ve got issues here? I think at the moment, the answer is perhaps no, but that’s not to say this can’t change. Again, just to sort of frame it and just to summarize, it’s no longer about the depth of knowledge which you hold with a particular product. We’ve got to have this breadth of knowledge on all of these services. Particularly if we’re architecting and designing solutions here, we need to know what they’re capable of. So yeah, this, for me, kind of frames the problem.
Just to maybe perhaps elaborate on that as well, if we’ve got a hybrid solution that’s using both Azure and SQL Server, the problem becomes twofold. Maybe we’ve even got a SQL Server workload that’s kind of running in Azure as VMs, so maybe we’ve got a bit of both. So it’s hard. And ultimately we end up a bit like this, or certainly this is how I feel; the brain exploding emoji.
So, I think it’s time, perhaps, for a change. And as I say, this is where this session and this talk kind of came from. It was very much a call for help. So if we bring back our timeline and our IT manager and the way that things have progressed with roles in our industry, what I think we need to do now is take present day and actually roll that forward. Let’s see where this is actually going to go. And this is pure speculation – please feel free to put in Slack if you think Paul’s just talking rubbish; I don’t mind. So what we actually need to do is let’s dilute the analytics team maybe that I’m working in now, and maybe there’s more depth here. We need somebody to actually specialize maybe in distributed compute or a Lambda architecture or data processing that’s going to involve real-time data as well.
So as I say, it’s just speculation, but I think this is where we need to start heading and where there is – there’s certain verticals here where we actually need people to specialize more. Similarly, if we want to pick on the admin team – pure speculation, perhaps we need a cloud services admin that’s going to help me out there, who knows, a federated user admin. I’m speculating, but hopefully this frames the problem. Are we all kind of comfortable with that before we move on and start looking at anything more real world?
Brent Ozar: It looks good. Lots of people saying that now all of a sudden the icon game starts to sink in as you see how much your knowledge suffers on different icons.
Paul Andrew: Absolutely, absolutely – it’s only going to get worse, I promise. So for me, that was very much the context in where this talk came from. Hopefully, you agree with me. Hopefully, I’ve not offended anybody with some of those bold assumptions there. So if we now actually put that into context of a real-world scenario, this is what we like to think of as the modern data warehouse. So when we’re talking to our customers, this is very much a generic architecture of things that we might need to use to deliver what it is we’re creating for them in the data analytics space.
Now, immediately there’s two icons missing from here, which are, of course, our Azure Active Directory and our Key Vault, two very important icons. But for me, as a business intelligence person, you know, worrying about managing Active Directory and keys isn’t something that I’m naturally going to want to have to deal with.
So this is hopefully where we’re heading now. So what I want to do is go through this architecture, just pick on a few services that are in here and talk a little bit more about some of the challenges that I face as a BI person and ultimately where I feel my remit as a BI person ends. I ultimately want the help of more of an admin type person to have my back. So the first one we’re going to start with is my favorite Data Factory. So I like Data Factory and what we’ll do is, maybe for people that are less familiar with some of this cloud tech, I’ll spend perhaps a slide talking about what these services actually are, just for people that don’t know.
So Data Factory is our orchestration service. We could loosely compare it to our SQL Server agent in the cloud if you like, but it does do a bit more. It will also perform copying of our data from A to B if we need it to, that’s one function of it. The other function is that it’s going to go away and call some other service to do the work for it. So it’s our cloud orchestrator. And we say, you know, go and do this, go and do this, go and do this, and it will receive those instructions as JSON. So hopefully – we could spend a lot more time talking about Data Factory, but that will do for the context of this talk. And maybe to help a little bit more as well, we’ve got some components in Data Factory that maybe we need to be aware of.
So the first one is a linked service. So this is very much how Data Factory is going to connect to something that we want to talk to. And this, if you like, we can perhaps compare to our SSIS connection manager, maybe. We’re going to have a linked service that stores our connection string and our credentials. So bear that one in mind. We’ve got datasets that we have to define, which is kind of a metadata representation of where we’re going to pull things to and from, whether it be a folder or a database table. Within Data Factory, we have something called an activity. So this is the piece of instructions that allows Data Factory to talk to another service and it says, hey, I want you to go and do this for me, in essence. And lastly, we have this idea in Data Factory of pipelines. So we can group our activities together into logical things where we can stop and start them, control them, you know, we can schedule those pipelines and we can trigger them. So just four components for Data Factory hopefully just to give you a little bit of background if you’ve not used the service yet.
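The four components above all end up as JSON definitions that Data Factory consumes. As a rough sketch of how they reference each other (all names here are hypothetical, and a real linked service would pull its credentials from Key Vault rather than inlining them):

```python
import json

# Hypothetical names throughout; this only sketches the shape of the
# JSON that Data Factory receives for each component.
linked_service = {
    "name": "MyBlobLinkedService",  # like an SSIS connection manager
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # in practice the connection string would come from Key Vault
            "connectionString": "<stored securely>"
        },
    },
}

dataset = {
    "name": "MyInputDataset",  # metadata pointer to a folder or table
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobLinkedService",
            "type": "LinkedServiceReference",
        },
    },
}

pipeline = {
    "name": "MyCopyPipeline",
    "properties": {
        "activities": [  # activities grouped into a pipeline
            {
                "name": "CopyRawFiles",
                "type": "Copy",  # one of the built-in activity types
                "inputs": [
                    {"referenceName": "MyInputDataset",
                     "type": "DatasetReference"}
                ],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The point is just the chain of references: the pipeline’s activity points at a dataset, and the dataset points at a linked service, which holds the connection details.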
The other thing to note about Data Factory is it has these things called integration runtimes. So there’s three versions of these, and these are, if you like, the bits of compute that allow data factory to do the work that you’ve asked of it. And the first one of these is an Azure integration runtime. So this handles things like our data movements and if we’re asking some other service to do the work for us. So we use this Azure IR to carry out that instruction that we’ve given it. Next up is the SSIS integration runtime. So if we want to bring our SSIS packages into Azure, we’ll need one of these to actually run that SSIS package on. I’ll talk a little bit more about that in a moment. And lastly, we have this self-hosted integration runtime. So this is very much like a VPN gateway client that we can install on some Windows box and it will allow Data Factory to go and get content from SQL Server tables or file systems. We can even use it with ODBC drivers and things like that if we need to. So, three integration runtimes for Data Factory and four other components that we talked about.
So bearing in mind those things and what Data Factory does, what I then want to put to you is this idea of what I’ve come to see as administration considerations for this service. And this is certainly stuff which, again, I think goes perhaps beyond my remit and certainly goes beyond my comfort zone for some of this stuff that we have to encounter day to day. So of course, the first one which we’ve already touched on is active directory. We’ve got our service principals. When we deploy Data Factory, it will create a service principal for us that it is going to use to run under. We’ll have maybe a key vault service because we need this to look after our linked service credentials, our connection information.
Next up, we might have this idea of an Azure batch service. Now, I’ve talked about this; if we need Data Factory to run some custom code for us, if we want to use a custom activity, and if that’s the case, we will use the Azure batch service to do that. That Azure batch service is ultimately perhaps going to talk to our Azure key vault. And the other thing that this Azure batch service will do, or what happens under the hood, is that we have a scale set of virtual machines. So within this batch service, we will define a compute pool, and that compute pool is just a set of VMs that we need to run our code. So straight away in this scenario, we’ve gone from working with what we thought was a contained platform as a service product to having to extend it, maybe because we need a custom activity. And straight away, we’re into some sort of area of infrastructure, IaaS-type services, that maybe we no longer have the skills to manage ourselves.
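To make “defining a compute pool” concrete, here is a minimal sketch of what such a pool definition looks like. The field names follow the general shape of a Batch pool definition, but the pool ID, VM size, node count, and image are all hypothetical choices:

```python
# A minimal sketch of a Batch compute pool definition. Field names
# follow the general Batch REST shape; all values are hypothetical.
pool = {
    "id": "adf-custom-activity-pool",
    "vmSize": "STANDARD_D2_V2",     # size of each VM in the pool
    "targetDedicatedNodes": 2,      # how many VMs to keep provisioned
    "virtualMachineConfiguration": {
        "imageReference": {
            "publisher": "MicrosoftWindowsServer",
            "offer": "WindowsServer",
            "sku": "2016-Datacenter",
        },
    },
}

# The custom activity then names this pool, and Batch schedules the
# activity's code onto one of the pool's VMs.
print(pool["id"], pool["vmSize"], pool["targetDedicatedNodes"])
```

Even in this stripped-down form you can see the IaaS decisions leaking through: someone has to pick VM sizes, node counts, and OS images, which is exactly the administration burden being described.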
This batch service, with its scale set of VMs, will perhaps have a VNET. Now, typically when somebody mentions a VNET to me in Azure, I pretty much run for the hills because, of course, with a VNET comes a whole bunch of stuff relating to, you know, routing tables and network security groups, subnets, firewalls, address spaces, DNS services, peering, service endpoints and maybe an ExpressRoute.
Now, I just want to work with Data Factory. I’m a BI person. I just want to develop something that’s going to support my ETL pipeline. I don’t want to have to get into the realms of worrying about a VNET because I need a custom activity to clean my data. I didn’t sign up for that. But straight away, I’m kind of thrust into this area of IaaS where I absolutely accept that I’m very much not comfortable, and this problem kind of continues. So with our SSIS IR, our integration runtime, if we want to run our SSIS packages, we will need one of these. And similarly, under the hood, that will have a scale set of virtual machines, which again will also perhaps require a VNET. If we want our SSIS package to talk to some on-premises SQL instances maybe, that’s going to have to have a VNET and ExpressRoute; so outside my comfort zone again.
Maybe we want to stop and start that SSIS integration runtime because it’s expensive. So maybe we need to break out our Azure automation runbooks to handle that for us. And that’s going to have its own credentials and a run-as account and, you know, we’re executing PowerShell in the cloud, so that’s got some implications there.
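A back-of-envelope sketch of why the stop/start effort is worth it. The hourly rate below is invented purely for illustration; real SSIS IR pricing depends on node size, node count, and region:

```python
# Hypothetical cost comparison: SSIS IR left running 24/7 versus
# started only for a nightly ETL window. The rate is made up.
hourly_rate = 1.50       # assumed cost of the SSIS IR per hour
hours_per_month = 730    # average hours in a month

always_on = hourly_rate * hours_per_month
# e.g. run only for a 4-hour nightly ETL window, 30 nights a month
nightly_window = hourly_rate * 4 * 30

print(f"Always on:    ${always_on:,.2f}/month")
print(f"4h per night: ${nightly_window:,.2f}/month")
print(f"Saving:       ${always_on - nightly_window:,.2f}/month")
```

Under these assumed numbers the difference is over $900 a month, which is why an automation runbook (with all its credential baggage) tends to appear in these solutions anyway.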
Brent Ozar: I was just going to say, there’s an awful lot of white space on the left-hand side and I have a bad feeling about this.
Paul Andrew: Absolutely. So next up on my hit list of considerations, we’ve got the self-hosted integration runtime. So as I mentioned, this is the gateway for Data Factory to talk to maybe an on-premises resource, or we can use it in Azure as well. And that will perhaps require a VM. If it’s in Azure, it’s going to need a VNET again. So there’s problems here. I’m outside of my comfort zone again straight away. Maybe I want Data Factory to send some email alerts for me – you know, some email alerts that I’m not going to get out of the box. I want something a little more detailed. So I can do this with an Azure function. But with that function, if I’m going to put some C# in there that’s using the SMTP client, I’m going to have to have maybe an Office 365 mailbox. Who’s going to create that? Who’s going to look after that for me?
Moving round the clock, activity logs, monitoring – you know, we obviously all want to monitor our services and we’ve got some stuff that Data Factory gives us out of the box in its monitoring portal, but maybe that’s not enough detail. Maybe we need to create something more there, and maybe we want some operational dashboards that somebody needs to create and look after for us. Again, you know, considerations here that, as a BI person, I don’t want to have to worry about. Lastly, you’ll be glad to know, for Data Factory, the Azure integration runtime. Now, out of the box and by default, this will say to us that it’s going to auto-resolve the Azure region that it’s going to run in.
Now, that’s okay, but frankly, I don’t trust it, because this has an implication for when we’re actually moving data around. If there’s an Azure region we don’t allow our data to be in, but maybe Data Factory isn’t available in the region we do allow, we can have the orchestration service in a US region but tell it to use the integration runtime in a European region or a German region or something like that. And this has an implication because we then know that if the source of our data is in the region we specify for the integration runtime, it’s only going to move it around there, even if the orchestration service is in a different region. So again, out of the box this will just happen and it will say it’s resolved it, but maybe we want to be a little bit more explicit. And again, this is something that maybe somebody else would like to help me with. So those are a few considerations there with Data Factory. Everybody happy with those?
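Being “a little bit more explicit” about the region generally means defining a named Azure integration runtime pinned to a location, and referencing that instead of the auto-resolving default. The JSON shape below follows the Data Factory v2 convention, with a hypothetical name and region:

```python
import json

# Sketch of an Azure IR pinned to an explicit region, instead of
# relying on "Auto Resolve". Name and region are hypothetical.
pinned_ir = {
    "name": "WestEuropeIR",
    "properties": {
        "type": "Managed",  # an Azure (managed) integration runtime
        "typeProperties": {
            "computeProperties": {
                # data movement compute stays in this region, even if
                # the Data Factory instance itself lives elsewhere
                "location": "West Europe"
            }
        },
    },
}

print(json.dumps(pinned_ir, indent=2))
```

Linked services and activities can then reference `WestEuropeIR` by name, so the data movement compute is guaranteed to stay in the region chosen here.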
Brent Ozar: Yeah, Lee says, “Sysadmin considerations portion is why I quickly hit the eject button during my first playing around with cloud services. There was just so much work to do in order to get something usable going.”
Paul Andrew: Yeah, absolutely, and so many, I think, security considerations here as well, which is always a big one for me.
Brent Ozar: Cory also says, just to point out, he says, “Aren’t most companies thinking that when you go to the cloud, it’s going to require fewer employees? It sounds like you’re saying that you’re going to need more knowledge and more employees when you go to the cloud, and that doesn’t line up with what I imagined from the cloud sales pitches.”
Paul Andrew: I think, for me, I think the same amount of employees, it’s just a change in role. You know, my role has evolved from on-prem BI to Azure BI and I’m still one person. I’ve gained a few pounds, but I’m still one person; my role and the knowledge that I had has just altered. I think the same is perhaps true for some of the considerations here. We’ll move on then. So we’ll come back to my architecture picture. And I’ve talked about Data Factory, so now let’s look at the next one on my list, which is Data Lake, particularly the Data Lake storage. And again, we’ll do a quick what is Data Lake storage.
So for us, this is our landing area for all data. It’s this bucket which can take petabytes of data out of the box and we will put everything in there, whether it be structured, semi-structured, or unstructured. We’ll very rarely produce a BI solution in Azure that doesn’t have a Data Lake. We just need that landing area for all of our stuff to go into, and then we’ll start manipulating it, transforming it once it’s in the data lake. As the picture suggests, we can do that in a couple of different flavors, and there’s some performance considerations there depending on if you’re putting structured data into it, how it’s actually going to break it apart on that Hadoop file system underneath. I’ve put a link in there for performance tuning if you want to understand how it breaks apart your structured data. But ultimately, it’s this area where we land our data. And as you can tell from this slide, we will land it pretty much in the CSV form; CSV or Parquet just for performance.
So that Data Lake then, what we’ll do is we’ll break it apart. So we’ll have a little bit of JSON and a little bit of PowerShell, and the first thing we’ll do is we’ll run that against our Data Lake store and create this subset of folders. So very much at the root level, we’ll break it apart into raw, base, enriched, and curated, and then subfolders for business area and source system. And all that gives us a very nicely structured Data Lake Store that can then pretty much handle 90% of the stuff that we’re going to do with it when we put our customer data into it.
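As a rough illustration of that folder convention – the talk drives the real thing with JSON config and PowerShell against the Data Lake Store, so this is just a Python sketch, and the business areas and source systems are made-up examples – generating the layout is essentially a cross product:

```python
from itertools import product

# Sketch only: the real deployment uses JSON plus PowerShell against the
# Data Lake Store. Zone names follow the raw/base/enriched/curated convention
# described above; the areas and systems below are hypothetical examples.
ZONES = ["raw", "base", "enriched", "curated"]

def build_lake_paths(business_areas, source_systems, zones=ZONES):
    """Return every folder path in the zone/area/system layout."""
    return [f"/{zone}/{area}/{system}"
            for zone, area, system in product(zones, business_areas, source_systems)]

paths = build_lake_paths(["sales", "people"], ["crm", "erp"])
# 4 zones x 2 areas x 2 systems = 16 folders, e.g. /raw/sales/crm
```

Because the layout is deterministic, the same config can later drive the directory permissions per zone, which is exactly the admin concern that comes next.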
You know, we can think of this very much like a SQL Server database schema. We want a schema for people, a schema for sales, et cetera. The same idea, just in a folder structure in this Data Lake Store. There’s a great blog post from my colleague … on creating that structure if you need to. So hopefully that frames a little bit about Data Lake Store.
Now, let’s go back to the admin considerations. What do we need to do to work with this service? So the first one is directory access. All of those subfolders that we created – we’re going to have a whole bunch of users and services that want to read and write from them, so we need to take the service principals of those services and grant them read/write. And do we want them to inherit permissions? The same sort of thing that we would have with a Windows file system, but now with this Data Lake Store. And of course, that’s going to involve Azure Active Directory again; so stuff that I don’t particularly want to worry about.
Next up is our encryption. So Data Lake Store will be encrypted by default at rest, but maybe we want to have those encryption keys managed in our key vault. Maybe we don’t want to let Data Lake Store manage them; we want to take care of that ourselves, so we might need the Key Vault service. Our Data Lake Store has its own firewall, which is actually turned off by default. So you might want to turn that on and have a whitelist of external IPs that can actually talk to this service. So a consideration there.
Pricing is obviously a big one when we’re thinking about some of these Azure services now. And with Data Lake storage, we can actually pre-buy the storage at a monthly cost, which makes it a lot cheaper if you’re putting lots of data in there. By default, it will be on a pay as you go tier, but at some point, I want some friendly admin to look at my Data Lake Store and say, hey you’ve got 10TB in there, let’s start pre-buying this storage rather than having it pay as you go. It’s pretty cheap anyway, but just a consideration.
With generation two of this service, we are hopefully soon going to have geo-redundancy available as well. At the moment, there is no geo-redundancy with Data Lake Store generation one. If you want that, you’re going to have to hand-crank something that copies it to another Azure region. But with generation two, this is going to become an option. So if that’s something you need, that’s a consideration for which flavor of the storage you want to take.
And then lastly, no surprises here, we perhaps want to monitor our service, and maybe we need some Power BI dashboard or something to do that. So that’s, for me, the admin considerations around Data Lake Store. Obviously, a few similarities with the permissions and things – same problems, and there’s nobody to help me with this.
Brent Ozar: Wow, there’s a lot of work. Hiram asks a slightly unrelated question. He says, “Is Cosmos DB a part of Data Lake?”
Paul Andrew: No, Cosmos DB is very much a separate service. I would perhaps even go as far as to make the argument that, for me, Cosmos DB is very much for your OLTP workloads and Data Lake Store is very much for your OLAP workloads; so two distinct services for different purposes is how I would perhaps answer that question.
So let’s move on then. Let’s come back to our architecture. So we’ve talked about Data Factory and Data Lake. Let’s have a look now at Databricks. So I’d be very interested to know if anybody’s using Databricks in production. If you don’t know what Databricks is, it’s this analytics platform. It’s not a new platform. It was created by this chap, Matei Zaharia, in 2013, and it lives on top of the Apache Spark analytics ecosystem. So people have been using this for a while, and now it’s been made available to us in Azure by Microsoft, which is all good. And it’s much faster, using in-memory transformations rather than perhaps the slow I/O transformations that we maybe had with Data Lake Analytics.
And in the context of my modern data warehouse, we’re very much talking about the batch processing that we can do with Databricks here. So I’m not really touching on the streaming side of it or the machine learning elements of Databricks; very much just the batch processing. So hopefully that’s enough on what it is. I think Databricks has perhaps been around long enough for people to know it; it’s just obviously new to Azure.
So admin considerations then – we’ve got our user management, pretty much like we had before, which is going to need our Azure Active Directory. This time, with Databricks, we perhaps want to think about what’s actually going to be mounted in our Databricks file system. So if we’ve got our data in Data Lake Store, you know, are we going to make Data Lake Store available to Databricks, and which parts of Data Lake Store are we going to make available to it so we can start doing those transformations?
Our jobs, you know, we’ve got job execution there that we perhaps need to monitor and manage, and maybe we need to actually look at how those jobs are going to be called; whether they’re going to be orchestrated with Data Factory. Maybe they’re going to be done with a notebook that somebody’s created, which is good. Or maybe we’ve got somebody using Java and we’ve got some extra Java libraries that are going to go into that. So again, just some management considerations – once we get beyond transforming the data, once this stuff is operationalized, we need to think about that. I’ve put the dreaded VNet on here again because that is a consideration if you want Databricks joined to your VNet.
The cluster management – so Databricks has to have a Spark cluster running for you to actually execute your code. And it’s quite nice, you know, we can have a cluster which is running all the time, we can stop and start it, or we can just say, you know, I want you to create a cluster for me whenever I submit a job. We can define that cluster in terms of its control nodes and its worker nodes, and that’s all good, but what actually happens behind the scenes in Azure is that it will create a separate resource group. So you’ll have a very oddly named resource group that perhaps appears in your Azure subscription when you’ve got a Databricks cluster. And because we’ve kind of abstracted away from this service a little bit, that cluster will have a series of compute nodes that you define, but behind the scenes, Microsoft just spins up virtual machines. And those virtual machines become the nodes of your Databricks cluster.
Now, as I think Daniel said, when you do this, you will have a quota of cores that you can have for a virtual machine in Azure. Now, I’m way off over to the right in my Databricks service and I say, yeah, I want a cluster that’s this big, I want you to auto-scale to ten nodes, each of those nodes maybe has four cores. If you do that and that cluster auto-scales, straight away, you’re perhaps going to hit the default CPU quota limit that you have in your Azure subscription. It comes down to the CPU type and the region that you’re running in, but I think it’s fair to say it’s an admin consideration there.
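To make that quota arithmetic concrete, here is a hedged little sketch – the 20-core quota, node sizes, and driver size are invented for illustration; real limits vary by VM family, region, and subscription:

```python
def cluster_peak_cores(max_worker_nodes, cores_per_node, driver_cores=0):
    """Worst-case vCPU demand when the cluster autoscales to its ceiling."""
    return max_worker_nodes * cores_per_node + driver_cores

def exceeds_quota(max_worker_nodes, cores_per_node, regional_quota, driver_cores=0):
    """True if autoscaling to the ceiling would breach the subscription's
    regional vCPU quota, meaning the cluster will fail to acquire nodes."""
    return cluster_peak_cores(max_worker_nodes, cores_per_node, driver_cores) > regional_quota

# Ten 4-core workers plus an assumed 4-core driver against an assumed 20-core quota:
peak = cluster_peak_cores(10, 4, driver_cores=4)                 # 44 cores at peak
over = exceeds_quota(10, 4, regional_quota=20, driver_cores=4)   # True - request an increase first
```

The point of doing this sum up front is that a quota increase is a support request, not a toggle, so an admin wants to spot the breach before the autoscale does.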
I think the last one there is how we actually want to control access, because we’ve got secrets and we’ve got secret scope management in Databricks. I did actually just include an extra slide there because maybe that’s not immediately obvious to some people, but with Databricks, we’ve got dev, test, and live environments. We can have one service to take care of all of that, but we can have credentials that are scoped maybe just for the dev stuff, or a credential scope just for the test environment. And we can, say, just have this one service with different workspaces that are scoped for different things. So more considerations there, I think, and again, I don’t want to look after this. So that’s my admin considerations for Databricks.
Brent Ozar: This feels like you’re making a sales pitch really for someone to come in and be your Azure services administrator.
Paul Andrew: Absolutely. So what do we have next? What should we look at next? I think my next one on the hit list is our Azure SQL Data Warehouse. So quick one, what it is – for me, the best way to talk about what Azure SQL DW is, is if we start by thinking about our Azure SQL DB. With our SQL DB, we’ve got a single compute node and we’ve got a single block of storage. Now, we can scale-up these things, so explicitly saying scale-up there. We can make the CPU more powerful. We can make the storage bigger. So that’s our SQL DB. With our SQL DW, it’s now distributed. So instead of a single block of storage, we have 60 distributions, or 60 storage nodes, and we spread our data across those storage nodes.
Similarly, we have not just one compute node, but in this diagram, we have four. We can scale out that compute. We can add more compute nodes to the service. And we’ll have a control node that comes in and distributes that query, and those nodes then have an allocation of those storage distributions. They will then do the work for us before bringing it back together. So hopefully, that sets the scene for what our SQL DW is and why we might use it for certain workloads. And the nice thing is as well that the storage, very much like with our SQL DB, is disconnected from the compute, so it means we can scale it up without affecting the storage or we can scale out our SQL DW to more compute nodes if we need to.
So then, admin considerations, what do we actually need to do here? We know it’s a SQL database of sorts, so we’re probably a little bit more familiar with things that we might need to do in there with tables and credentials. But because now we’re in this sort of platform as a service environment, what else might we need to do with our SQL DW? Well, the first one, and perhaps a big one that relates quite heavily to the cost is, we might want to stop and start the compute; so disconnect that compute from the storage so when we’re not processing our data, we don’t want to be paying for it. So that’s one consideration.
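As a very rough sketch of what that stop/start is worth – the hourly rate here is a made-up placeholder, not a real SQL DW price, and storage is billed either way:

```python
def pause_saving(hourly_rate, running_hours_per_week, weeks=4):
    """Saving from pausing SQL DW compute versus leaving it on 24x7.
    Only the compute portion pauses; storage keeps billing regardless.
    hourly_rate is an illustrative placeholder, not a real price."""
    always_on = hourly_rate * 24 * 7 * weeks
    while_needed = hourly_rate * running_hours_per_week * weeks
    return always_on - while_needed

# Running only 12 hours a day on weekdays (60 h/week) at an assumed 1.0/hour:
saving = pause_saving(1.0, 60)   # 672 - 240 = 432 units per 4-week month
```

Even with invented numbers, the shape of the result is the argument: an unpaused warehouse bills for the two-thirds of the week when nobody is loading data.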
Backups – at the moment for SQL DW, we just have the option to go in there and configure geo-redundant backups. And for me, it’s a bit of a tick-box exercise doing that one. But again, I think it’s a consideration for some data services administrator that they, you know, can go in there, they can look after that, perhaps.
A good one is auditing and threat detection. So I think Microsoft, if you want to turn this on per server, last time I checked it was about £11 a month. And they will do the auditing for you and they will do the threat detection with some of their fancy machine learning type services, and AI will look after that for you. It’s all well and good that AI can do that, but we still need somebody to go and configure it. We need to define a storage area. We need to work with that.
I was just looking at the Slack channel there – yes, we can pause SQL DW, not a SQL database, Tran, to hopefully answer your question.
Moving round my spoke of considerations, I think the next one I’ve got is our firewall. So our SQL DW is going to sit in a logical SQL Server instance in Azure, and that’s where we’re going to have to go in and define the external IP addresses that can actually be used to get to it. So that’s something we’ll have to sort out – a whitelist of external addresses that we want to allow access to the server.
Next up is our friend, encryption. Of course, we’ve talked about encryption with our Data Lake Store, but similarly, with our SQL type service, we do have transparent data encryption as well and we do have the option to manage our keys, which is something that is going to perhaps be done by key vault. So something that might need to be done there.
Moving around, we have the scaling of the service. So if we have a particularly high workload or something that’s going to take a lot of crunching, we might want to scale out our SQL DW, do that processing, and then scale it back down. So it’s something that somebody who’s monitoring this service might spot, you know. They might see a CPU spike. They might say, okay, next time we run that load, let’s scale out the service and bring it back down. So somebody could do with looking after that. And, of course, our friend, monitoring – we always need to monitor our SQL databases. So yeah, absolutely, the sales pitch continues for where I need this Azure data services type person to help me.
Next up then, in our architecture, I think we have our regular storage accounts. Now, this is, again, a tier zero service. We all perhaps have one of these, or we all need one of these. I know certainly I’ve got a lot of blob storage accounts that I will use where I’ll dump class libraries and all sorts of stuff. But I’m an ignorant BI developer here. I just say, get me a blob storage account, put my stuff in it, and that’s it. But what other things do we need to consider? I’m kind of really telling myself at this point a little bit as well.
So our access keys are our first one, and maybe we want to recycle these access keys on a periodic basis. And if we recycle them, are we going to use our Azure key vault? Are we going to recycle that key, put it in our key vault, and then everything else that talks to key vault is happy? If we’ve got some app services or websites or something, we’ve got this idea of shared access keys as well. Again, I’m very much out of my comfort zone there, but I know these things exist and maybe we need to consider using them.
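To make the rotation idea concrete, here is a hedged sketch of the policy check an admin might script – the 90-day window and the Key Vault hand-off are assumptions for illustration, not Azure defaults:

```python
from datetime import date, timedelta

def rotation_due(last_rotated, today, max_age_days=90):
    """True once a storage access key is older than the rotation policy allows.
    The 90-day window is an assumed example policy. On rotation you'd
    regenerate the key and push the new value into Key Vault, so everything
    that reads the key from Key Vault picks it up without redeployment."""
    return today - last_rotated >= timedelta(days=max_age_days)

# Rotated on 1 Jan, checked on 1 Jun: well past an assumed 90-day policy.
due = rotation_due(date(2018, 1, 1), date(2018, 6, 1))   # True
```

The useful bit is that the check is trivially schedulable, which is exactly the sort of recurring chore the hypothetical DSA would own.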
Our VNet, again, it creeps in. We have a lot more of these platform as a service offerings that do actually have VNets and firewalls that we might need to consider and think about setting up. It came as a surprise to me that Azure storage now actually has the ability to use Azure Search, so it can start applying indexes and things to our storage – a new one to me when I looked at this, but something there that perhaps needs setting up and configuring.
Our friend, encryption, creeps in again here. Azure storage – I forget if this is enabled out of the box or not, but we can encrypt it and we can manage those keys. We might want to set up some custom domain if we know we’re using our storage for something external. We’ve maybe got some website that’s talking to it. We might want to actually have a custom domain put in here, rather than the usual sort of Microsoft storage URL that we get when we access it.
Redundancy is a factor. So something that is currently available with our storage accounts – we can say whether we want it zone-redundant or geo-redundant. Those things can be turned on with a little bit of configuration. I’m not saying these things that we’re talking about are hard to do; they’re just considerations for when we use this service. Alerting is a factor as well. I think of it more as if we’ve got something that’s subscribing to the storage account, maybe with Event Grid – something that’s going to be triggered or something that’s going to happen if something changes in our storage; so alerting is in there as well.
And I think perhaps my favorite one, which has now recently become available for our storage account, is this idea of soft deletes. So we now ultimately have a recycle bin for our storage account and we can turn this on and it will have a fixed period of days that you can retain the copy of your data that you’ve soft deleted before it becomes a hard delete. So all great features. It makes it a very versatile service, but lots of things to consider there if we’re looking at an …
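As a small sketch of that retention window – the dates and retention periods below are example values; the actual period is whatever you configure on the storage account:

```python
from datetime import date, timedelta

def still_recoverable(deleted_on, today, retention_days):
    """A soft-deleted blob can be undeleted until the configured retention
    window lapses; after that the delete becomes permanent. retention_days
    mirrors the fixed period set on the storage account."""
    return today < deleted_on + timedelta(days=retention_days)

# Deleted on 1 Sep, checked on 5 Sep: recoverable with a 7-day retention,
# already permanent with a 3-day window.
recoverable = still_recoverable(date(2018, 9, 1), date(2018, 9, 5), retention_days=7)
```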
Moving on, hopefully, you’re getting the feel for my slides here, getting a feel for the pattern that we’re going to go through. I think perhaps next, and finally, on my list here, we have analysis services – so our SQL Server on-prem service now moved up into Azure in tabular form as an Azure platform as a service end product. But, you know, is it still the same? Well, let’s see, shall we? Hopefully, we know what it is because it’s an established service, but when we come to consider what we might need to do with it, there’s a few things that we’ve got on here.
So we’ve got the stop and start functionality again, you know, do we want to have this service running all the time? Maybe during the night, we want to turn it off, or outside of office hours. So something that needs to happen to stop and start it – typically, we’ll do this with Azure automation runbooks again if we’re perhaps orchestrating, we’ll have a runbook with a webhook and some PowerShell. It will stop and start the analysis services model when we don’t need it.
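As a tiny sketch of the decision such a runbook makes – the talk does this with PowerShell in an Azure Automation runbook; this Python version and its office hours are just illustrative assumptions:

```python
def should_be_running(hour, weekday, start_hour=7, stop_hour=19):
    """Schedule gate a stop/start runbook might apply: keep Analysis Services
    up on weekdays between start_hour and stop_hour only. The hours are
    assumed examples, not defaults from any Azure service.
    weekday: 0 = Monday ... 6 = Sunday."""
    return weekday < 5 and start_hour <= hour < stop_hour

running = should_be_running(hour=9, weekday=1)    # Tuesday 09:00 -> keep it up
stopped = should_be_running(hour=22, weekday=1)   # Tuesday 22:00 -> pause it
```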
Backups for analysis services – now, this is really where, I think, things have evolved in Azure. Once upon a time, if we had the on-premises equivalent of analysis services, we would have to break out some XMLA or a SQL Agent job to do our backups and do something with it. Now in Azure, we go into it, we tick a box and we say, yes, do some backups, put them in this storage account, and that’s it, we’re done. Now again, I still want somebody to help me do this. I still want somebody to take care of it, and somebody to say, maybe if we’ve got lots of Azure Analysis Services instances, we have a single storage account for all of them and somebody just goes and configures my backups all in that one place. Maybe we’ll make that geo-redundant as well, who knows; whatever your requirements are.
Like with our SQL databases, this analysis services approach has its own firewall. And again, we’ll maybe need to white-list some external IP addresses to access it. I think, if memory serves again, it’s turned off by default, so you might want to turn that on if you haven’t already. Replicas are a good one. So something that was very painful to configure if we had the on-premises cubes, we can now get pretty much fairly easily in Azure, we can have read-only replicas; I think up to seven of them, last time I checked. So, seven read-only replicas for our cube database. But if we’ve got a database that is dedicated for doing the processing and the loading, we then need somebody to go and say, okay cube processing has finished on the primary, now go and sync that to all my read-only replicas. It’s just something that needs to happen.
Activity monitoring for SSAS. Now again, this is something that’s made very easy in Azure, but it still needs to happen. I think there’s a great demo of this on Channel 9. Christian Wade, I think, showed this using log analytics within Azure; something that was quite painful with Extended Events in an on-premises world, it’s now made a bit easier in Azure. But we still need somebody to go and look at that and somebody to enable it.
I think the last one here with analysis services, like with our other services, is the scaling. Do we want this service to be scaled up when we’re processing and scale back down? Or somebody that’s just going to monitor it and say, okay, you know, we’ve deployed some new models to our service, maybe we want to consider scaling it up now. So something there to think about.
So I think that’s it for the modern data warehouse and that block of knowledge that I wanted to pass on. And hopefully, after setting the context and talking about that modern data warehouse in its generic form, we now perhaps have a better understanding of where I, as a BI person, want some help with this stuff.
Brent Ozar: Thomas asks an excellent related question. He says, “Looking at this chart, at this point, what’s my monthly Azure charge? This looks awesome but expensive.”
Paul Andrew: Yeah, for any particular service, Thomas?
Brent Ozar: I think he meant the whole diagram, the whole modern data warehouse.
Paul Andrew: Oh right, okay. So sorry, yeah, this one. Somewhere between zero and 9,000. The consultant’s answer, of course: it depends. This is very much just our generic modern data warehouse that we use in some of our marketing material and that sort of thing, and it’s very much just a generic representation of all the moving parts that you might need in Azure to do business intelligence, shall we say. The “it depends” is obviously going to come down to what your data is – where is it coming from? What transformations do we need to do to it? What presentation layers do you want from it?
There’s even considerations here – if there are Azure services that aren’t actually available in every region, we might need a multi-region solution, and if we’re moving data from region to region, you’ve got egress charges to consider as well. So the costing of something like this is, I think, for me, very much a dark art, but of course, it’s something that customers will always ask: how much is this going to cost? And I’ve got some monolithic spreadsheets that I will break out if I’m trying to estimate how much it’s going to cost a customer; even things down to our Data Factory activities. There will be a set price, I think, for the first 50,000 Data Factory activities at one cost, and then beyond 50,000, it’s going to be this, you know. So when am I going to hit 50,000 activities? Is somebody going to tell me? I’ve got to work out, you know, some sort of timeline so I can say to my customer, yes, in four days’ time at nine o’clock, based on our current workload, you’re going to go into the next pricing tier for your Data Factory activities. You know, it really just takes some head scratching with the costing, certainly. So I think I’ve dodged your question quite well there. I’ve not answered it at all.
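That “when do we hit 50,000 activities” estimate is simple arithmetic once you know the daily run rate. A sketch, with the threshold taken from the example above and everything else assumed – real tier boundaries and prices belong to the pricing page, not this code:

```python
import math

def days_until_tier_change(activities_so_far, activities_per_day, tier_threshold=50_000):
    """Estimate how many days until the next Data Factory pricing tier kicks
    in at the current run rate. The 50,000-activity threshold is the figure
    from the example above; activities_so_far and activities_per_day are
    things an admin would pull from monitoring."""
    remaining = tier_threshold - activities_so_far
    if remaining <= 0:
        return 0  # already in the next tier
    return math.ceil(remaining / activities_per_day)

# 42,000 activities so far at 2,000 a day: the tier change lands in four days.
days = days_until_tier_change(42_000, 2_000)   # 4
```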
Brent Ozar: Professional consulting…
Paul Andrew: Yeah, well, just trying to be realistic. The short answer is, how much is that going to cost? Who knows? I think there’s a great analogy as well: if somebody asks about the cost of an on-premises solution versus an Azure solution, you say, well, when you buy a car, you know how much that car is going to cost you up front. You pay for the complete thing. Very much like a SQL Server license, you pay for it up front. But if someone says, okay, I’m not going to buy a car, I’m just going to get taxis. I’m going to get Ubers everywhere – how many journeys am I going to make? How much is it going to cost? You don’t know, because the car is being taken care of by somebody else. You’ve just got to pay for the use. It’s a little bit like that with some of this stuff, I think.
So I think that’s bringing me to the conclusion. So really just wrapping up then – I think the conclusion here, as you’ve probably gathered, is that there is very much a role that needs to evolve for, you know, some great people out there to start supporting some of this cloud tech. As I say, all of my customers now are using Azure for their BI solutions, and I really, really want a DBA type person to have my back when it comes to looking after some of that other stuff. Maybe not the infrastructure side of it, but certainly some of the backups, the keys, all that sort of thing. And maybe, to elaborate on that with some very loose comparisons, if you like: the task that was once an on-premises task has just evolved now into an Azure task. Maybe we need to do something a little bit different, but things have evolved, my role has evolved, and I think there’s some room here for other roles to evolve as well.
So I think, last up then, Paul thinks that we need a DSA, a data services administrator, do you agree?
Brent Ozar: Yeah, there’s a lot of good discussion in Slack. Is that your last slide or do you have more stuff?
Paul Andrew: My last slide is thank you for listening.
Brent Ozar: In that case, I’ll fire away with the questions. Lee asks a really interesting one. He says, “How does the breadth of knowledge needed for the cloud mesh with people who spent years developing specialized skills as a DBA? Say for example you’ve got a lot of skills in high availability – it feels like I’ve spent years trying to become good with SQL Server, only now being forced into being a generalist.”
Paul Andrew: Yeah, great question. I don’t know what the answer is. I guess, ultimately, all I can really do is extend my empathy there; I felt the same way. We specialized in SSIS packages, performance tuning them and making sure our buffers are the right size and all the rest of it, and now, yeah, I’m forced into using all this cloud tech because that’s what’s asked of me. It’s a hard one. It’s a learning curve which, in some cases, feels pretty much vertical. So I don’t have a good answer for you; all I can really do is extend my empathy.
Brent Ozar: If you wanted to find people who would become your DSA – like when you’re talking to people who are on your team, who are coming in from client staff – what do you look for in people with skills that would make them a good DSA?
Paul Andrew: Good question. I don’t know. It’s somebody that’s got maybe a proven track record in administering something, somebody that recognizes the day-to-day requirements of services that are already out in the wild and knows how to look after them. You know, somebody that’s got that mindset, that is willing to ask, you know, what does this do, what does this do, or what happens there if we change that? Somebody like that, that’s got that mindset, that’s not afraid, I guess, just to get stuck in and try it out, is perhaps all I’d look for initially.
Brent Ozar: Nice, well thanks for the talk today, Paul, appreciate it. Very good. There’s a lively discussion going on now over in Slack. Nice job.
- Be My Azure DBA (DSA) - September 14, 2017