Target Audience: Managers, Developers and Data Scientists who are contemplating the future of Azure Data Analytics
Technology changes quickly – patterns and approaches less so. As people move towards the cloud, there are clear benefits of adopting a distributed cloud architecture employing a range of disparate tools. As this evolves, the patterns that were designed for single box solutions should be challenged.
This session will take you through the pattern known as the Lambda architecture, a reference pattern for building data analytics systems that can handle any combination of data velocity, variety and volume. The session will outline the set of tools and integration points that can underpin the approach. Do you need to build out real-time reporting systems? Or crunch through petabytes of data? Or perhaps you are adopting a cloud architecture and simply want to make sure it can handle anything the future throws at it? Whether you’re dealing with Gigabytes or Petabytes, Daily Loads or Data Streaming – this session is for you.
We will follow the movement of data through batch and speed layers via Azure Data Lake Store & Analytics, Data Factory, SQL Data Warehouse and Streaming Analytics, before looking briefly at Azure Analysis Services with Power BI. This is a largely theory-based session to prime you for the future.
Why I Want to Present This Session:
It’s an architecture session – which isn’t really the norm for these conferences – but the way Azure is going, there are hundreds of options, different ways of achieving the same goal. For those who are confused by the myriad options around, I want to cut through the hype and clearly lay out a simple approach that works for loads of different scenarios!
Brent Ozar: So in this session at GroupBy, Simon Whiteley is going to be talking about modern data warehousing; so take it away, Simon.
Simon Whiteley: Alright, cool, thank you. Cool, okay so modern data warehousing – it’s kind of a new phrase that people are throwing around. We’re actually so new, we’re changing it to modern analytics platform because it’s not necessarily about warehousing. But essentially we’re talking about how you do things differently in Azure. So we’re talking about the people who are traditionally doing BI; the people who have got that one box SQL Server solution. And then suddenly you’ve got a whole load of new options, you’ve got the cloud; you’ve got everything going on. How does that change, do you have to do things differently? And the answer is yes.
So my name’s Simon Whiteley, I am a consultant with Adatis Consulting. It’s traditionally a data warehousing consultancy, again doing lots of SSIS, lots of data warehousing, building Kimball style marts. But now it’s kind of exploded into a plethora of technology, you’ve got Data Lake, you’ve got Data Factory, you’ve got Azure SQL Data Warehouse. You’ve got a load of data science things being spun up on HDInsight. There’s a whole load of different stuff going on.
So in terms of what do we do, it’s anything to do with data in the Microsoft realm, whether it’s on-prem or Azure. So, me particularly, I am the Chief Cloud Architect, something like that. Essentially, I nip between lots of different projects making sure that people are using a standard approach to tech, people are taking their learnings, feeding them back. If we build a really neat utility in one place, making sure we share that with the other clients; essentially trying to make sure we’re all going in the right direction. So I’ve been involved in a lot of this different tech, a lot of doing different things and trying to keep an eye on where we’re going and making sure there’s a path there. That’s, kind of, me.
So, we’re going to talk a little bit about cloud BI. Essentially where we are now – so where we came from, why things are different and how we got to this place. We’re going to talk about Lambda Architecture; that is a traditional big data approach to architecture. And we’re going to talk about how it actually can affect us and how we can learn from the way that they do things. We’re then going to talk about the native Azure approach; so how you take those learnings, the Lambda Architecture, and how you do it with the new set of tools we’ve got, how those learnings apply. Then, some alternative models and how it kind of fits together.
So, the first part of it is a load of theory; essentially talking through these approaches, what they mean and how they work. But then I’ve got a working end to end demo that I can just step through, show you how these different pieces talk to each other, how it connects and how the whole thing works.
So, without further ado, a history… If you are a traditional warehousing person, this should look familiar. You’ve got, kind of, the one box solution. So using a tool like SSIS, an ETL tool, to get data from somewhere and land it in a staging database, usually in the same format as it originally came in. You then pick it up again and you do some transformations; maybe some data typing, maybe add a column, maybe aggregate. Essentially you’re doing your transformation, and then you put it down again in a nice clean format. You might pick it up again, transform it into dimensions, facts; maybe junk dimensions, derived dimensions, that kind of thing.
And that makes a lot of sense, even in a one box solution, because SSIS is an in-memory pipeline. It’s a fast way to do things because you can pull 10,000 rows into memory, do your add column, and when that’s done, take those 10,000 rows and start converting data types, and then bring the next 10,000 into memory and start doing that. And you have that conveyor belt of stuff going through memory and it’s super fast and super efficient. And that’s kind of the ingrained thinking of most developers, that that’s the way we do stuff; that makes sense. But does it always make sense when we’re changing our architectures around?
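The buffer-by-buffer conveyor belt described above can be sketched roughly in Python. The names, the row shape and the 10,000-row buffer size are illustrative assumptions, not SSIS internals – the point is just that each transform works on one buffer while the next is still upstream:

```python
from itertools import islice

def chunks(rows, size=10_000):
    """Yield fixed-size buffers, like an in-memory pipeline's row buffers."""
    it = iter(rows)
    while buf := list(islice(it, size)):
        yield buf

def convert_types(buf):
    # First transform on one buffer: data typing (string -> int)
    return [{**r, "amount": int(r["amount"])} for r in buf]

def add_column(buf):
    # Second transform: derive a new column from the typed value
    return [{**r, "amount_x2": r["amount"] * 2} for r in buf]

def pipeline(rows):
    # Each buffer flows through both transforms before the next is loaded
    for buf in chunks(rows):
        yield from add_column(convert_types(buf))

rows = [{"id": i, "amount": str(i)} for i in range(25_000)]
out = list(pipeline(rows))
```

In a real engine the buffers are processed concurrently, which is what makes the conveyor belt fast; the sketch only shows the buffer-at-a-time shape of the work.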
So I’m going to put that up in the corner, so we’ve got that on-prem SQL Server. Over ten years ago, 2005, we had Google and co coming out with all the Hadoop papers, basically saying there are data problems that that approach can’t really fix. Now, you get to a point when you are scaling your server up and up and up, you’re throwing more CPU, you’re throwing more memory, you’re doing really complicated disk management strategies and you’re still not quite fixing the problems. There’s a size of data, at which point, that becomes really hard to do in a one box solution. Maybe your data’s coming in too fast, maybe the data’s too big, maybe you’re not dealing with data that can be easily read; that’s the whole three Vs of big data and all that kind of stuff – your velocity, volume, and variety.
So that’s kind of where we spun up and had big data solutions. I say solutions because there’s a whole world of them. It started off with just things like MapReduce – the idea of having lots of small boxes doing a little bit of work, a little bit of aggregation, then you take the result of all those boxes and you add that together and you have that pyramid of aggregation. But then that kind of expanded, and every man and his dog came up with their own big data solution, to the point where you can play that game of “is it big data or Pokémon” because there are so many ridiculously named big data solutions everywhere.
Eugene Meidinger: Simon, I think we should probably let everyone know that that’s an actual website.
Simon Whiteley: It is, yeah, you can share the link on Slack.
Eugene Meidinger: Yeah, I’ll put the link in Slack so everybody can play as we go along.
Simon Whiteley: Yeah, so there’s that big technical divide. You’ve got the guys who are SQL Server guys, dealing with T-SQL with a little bit of C#, some expression language. And then you’ve got the guys on the big data side with a plethora of languages, either hand cranking Java themselves or working in Spark, R, Python, Scala; all these different ways of doing things. And there’s that big divide – so it’s seen as, “This is how we, in the SQL Server, Microsoft world, do things.” And then, “That’s how those big data guys do things.”
Fast forward, we’ve got the cloud coming along. And all of us on the good ship SQL Server jumped on it and went, “Oh that’s great. I can have a VM with SQL Server on it and do things in exactly the same way.” So the majority of people jumped on it, and just used it as a giant VM host. So, you’re spinning up a SQL Server – sure, you can turn it off in the evenings and weekends and save a bit of cash. If you’re going really advanced, you might have SSIS on its own VM and just turn that on, run your job and turn it off again and treat it like a cloud service. But you have to hand crank the automation around that, you have to manage it yourself. So we all use the cloud, but we’re not making the most out of it. We’re not getting the real benefit of cloud computing; whereas, on the other side of the fence, those big data guys, it’s perfect for what they do. If you’re running queries that need 100 separate small VMs doing a little bit of cheap work and then passing that back, you can spin those 100 VMs up for the lifetime of a query and then clean them up again. And you don’t have to manage the server; there aren’t other people trying to use those 100 VMs at the same time. You can say, “I’d like ten for that query,” throw them away when you’re done, or, “I’d like 100 for that query and another 100 for another query at the same time.”
The whole idea of flexible, scalable compute makes a whole load of sense if you’re talking about that kind of architecture. But, once again, we’ve got technical barriers. The same is true of all those traditional ways of working with big data – because it’s now almost traditional, it’s been around for so long. They’re still third-party languages: you have to write Python, you have to write Scala, you have to go and play with that kind of stuff. Sure, the [inaudible] wants to try and look like SQL – you’ve got Spark SQL, you’ve got Hive, which looks like a SQL table – but they all require configuration and setup, and you have to manage all those different clusters, you know, where the data’s being stored.
So, what we’re trying to do, the Holy Grail, the thing we are after is that bit in the middle, that modern analytics platform; modern data warehouse. Taking all those really good things that we’ve learned from big data, taking that approach, the, “I want to do lots of small things at once, aggregate it. I want to be able to scale really, really effectively.” But I would like people from the SQL Server world to be able to use it without having to retrain, without having to learn new languages. I’d like things that are SQL friendly. So, that’s the wish list.
One of the most important things for me is I don’t want to manage another server. I spent my life doing server setup, installing SQL Server, managing everything on it, managing the OS, working out patching windows. And with a lot of the modern Azure approaches, so PaaS, which is platform as a service, a lot of those things, all this is done for you. That’s not to say it’s always the best approach. If you want to have the configuration, if you want to be able to manage it and fine-tweak the details, great; but if you want to just set up and have all of your activity be writing code that produces business value rather than managing that stuff, PaaS is really good for that.
So we’re just looking at PaaS components. We’re trying to keep everything as, “I don’t want to touch and configure the hardware, I just want it to work.” That’s the plan. So, on my wish list is being able to handle massive data sets. You need to be able to scale really effectively. You need to be able to scale linearly; I want predictable performance. So if I literally throw twice as many compute units at it, I would like it to go twice as fast. I don’t want diminishing returns where I’m paying twice as much and then getting an incremental bit of performance, and then twice as much and even less. It needs to be able to handle near real time. Now, it used to be on the wish list that the business manager had to have real time, but then you talk to them and maybe they only need it every hour, every day; they’re not making decisions that regularly.
But more and more, because we’re getting more sensor data, we’re getting more car telemetry, we’re getting more and more things that are actually connected and actually giving us stuff that can be considered real time. We’re getting more systems that can actually pump out real time data, even though they’re traditional systems. You can have CRM systems that fire events. You can have your tills actually sending messages out with all the transaction details. So the idea of having real time is not only more achievable these days, it’s actually more applicable.
It needs to be fault tolerant. So obviously, if anything breaks I’d like the whole thing not to fall over and have to be, again, fiddling and fixing things, and I want that low barrier to entry for SQL devs. I would like it so people could look at it and understand what’s going on and not have to be digging through a ton of code to figure out what’s going on. So that is our objective. We’re trying to do traditional big data stuff with that wish list. I’ll show you how it’s done.
So, that brings us onto Lambda. This is the historical big data approach. It’s an architectural pattern for achieving near real time data that demonstrates a few of the really key points of big data. It was written by a guy called Nathan Marz, who went on to write things like Storm. He’s kind of one of those guys – he had a technical problem, he wanted to get around it, so he would write his own database management software; he wouldn’t necessarily settle for fixing things. So he’s got that book, and that book is incredibly good, but it digs a lot into the detail. A lot of it is implementation and code to do with Storm and ElephantDB and all those technologies. But in terms of the actual theory of how these patterns work, it’s good. So I recommend reading about half of it.
Fundamentally, it’s trying to fix one specific problem with real time reporting. So, that particular problem – we’ve got a message. That little line is a bit of information coming in. And we get information coming in regularly, and at some point, we kick off a batch process. So we’ve got that chunk of data; maybe that’s an hour, a day, maybe 15 minutes. Regardless, that part of the process takes a little while to run. So while that’s running, we’re actually still getting data in. Then eventually, it will spit out the results. So I’ve now got my processed data. Maybe that’s been aggregated, it’s been looked up against a load of lookup tables. Maybe it’s been validated so I know that that data is fundamentally correct. Whatever I’m doing, I now have my batch produced data. And that process continues and you have data coming in, processing kicking off, results spat out, and it’s just that constant thing.
The problem comes when we have a look halfway through that process. So we’ve had two of our batches spit out their results; they’re finished. We’ve got a third batch that’s in progress that’s currently churning through. We haven’t had any results from it yet, but the window is in the middle of batch processing, and we’ve had some events since.
Now, at that point, if someone was to query the data and say, you know, “What are my real time results?”, they’d be hitting those top two. We’d only be looking at part of the data we’ve got; that’s not real time. That’s real time with a significant delay, because we’re having to batch process, because we’re having to work to try and get to the end. So that is the idea of Lambda. Lambda is trying to fix that problem, asking, “How do we actually make it real time?” – without giving up the fact that we’re using batch processing to get validity. We’ve got accurate data, we’ve got reliable data, but it’s slow; so that’s what we’re trying to fix.
So the solution, the way it roughly kind of works, we’ve got that same thing. We’ve got messages coming in every so often, and then we kick off a batch process. Absolutely the same here, but at the same time as that batch process is going on, we’re streaming that data. So we’re taking the messages that are coming in and we’re putting them into a separate bucket going, “Here’s what’s come in since I kicked off that batch process.” So, actually at any point, we’ve got those two sets of data.
We’ve got the results of the batch processing and we’ve got any records that have come in since. When batch processing finishes, we then wipe down that list so we’re not duplicating data. But at any point, if you want to figure out what’s going on, you can query across both sides. So Lambda Architecture is that idea of having – not an architecture that has some streaming data and some batch data – it’s an architecture that uses both technologies in combination to get near real time data. It’s kind of that idea of eventual accuracy in that the data that’s coming through streaming, that’s quite light touch. We need to be getting through quickly so we’re not doing heavy amounts of validation, we’re not really working on that data, we’re just trying to get the results to the user as quickly as possible. But then when the next batch process comes up, we’ll take those results away, put them through the proper process, get them accurate and get them figured out.
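That batch-plus-speed serving idea can be sketched minimally in Python. All the names here are hypothetical; the point is only that a query stitches the last completed batch result together with the lightweight events received since, and that the speed buffer is wiped when a batch completes so nothing is double counted:

```python
class LambdaServing:
    """Toy model of Lambda serving: batch view + speed buffer, queried together."""

    def __init__(self):
        self.batch_total = 0      # authoritative result of the last completed batch
        self.speed_events = []    # lightweight events since that batch was kicked off

    def ingest(self, value):
        # Speed layer: light touch, no heavy validation, just capture it
        self.speed_events.append(value)

    def batch_complete(self, new_total):
        # The batch has processed everything up to its cut-off, so the
        # speed buffer is wiped to avoid duplicating data
        self.batch_total = new_total
        self.speed_events.clear()

    def query(self):
        # Serving layer: stitch batch and speed together at query time
        return self.batch_total + sum(self.speed_events)

s = LambdaServing()
s.batch_complete(100)      # a batch run finishes with a total of 100
s.ingest(5)                # two events arrive while the next batch churns
s.ingest(7)
result = s.query()         # near real time: 100 from batch + 12 from speed
```

The real pattern obviously deals with queues, cut-off timestamps and overlapping batch windows, but the wipe-on-batch-completion step is the heart of the "eventual accuracy" idea.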
So we’re always having those two things working in unison: batch reliable data and streaming data just to top us up. And that’s kind of the idea. That’s the fundamental of the Lambda architecture, and it comes in three different bits. So batch – that is the thing that’s kicking off. That is some kind of compute, some way of chunking through all your data and doing aggregates and lookups and all that kind of stuff. Speed – that’s your streaming data; a fast way of getting data from one end to the other with maximum throughput of data. And then you’ve got serving – a way of being able to respond to real time queries across both. Because the majority of the time you’re looking at batch, that’s not actually presenting data to a user; that’s transforming data and putting it somewhere to be used further downstream. The same as speed a lot of the time. Serving is this query layer that allows you to interact and actually sort of pull those two together. So they’re the three main components that we’re going to need if we’re going to build our own Lambda Architecture.
So I’m going to really quickly skim through the technologies Marz used in the original one, just to give you a flavor of how they work. But again, the big thing is about how we replicate that; how we use modern Azure tech to do these things. So, batch layer. Now this is traditional MapReduce big data crunching. Essentially, you’ve got distributed storage, HDFS, Hadoop distributed file system.
So you have your data spread out across a load of quite cheap small disks. And then you have compute, and you spin off lots of compute and allocate compute to each of your storage disks. So each bit of compute is doing a small bit of work, so you work out the aggregate from that disk, the aggregate from that disk, the aggregate from that disk, and then you add up the results from each layer, maybe do some more to it. And you have, again, that pyramid of aggregation. You’re taking a huge amount of data processing and chunking it up into as many small sections as possible. That’s roughly the idea. It’s slightly more complicated than that, but as long as you’re happy with the fact that there’s a lot of parallel processing going on, that’s the whole point.
So if we suddenly have twice as much data and we wanted to go just as fast, we can simply spread it out over twice as many disks, twice as much compute; we can just keep adding compute and storage and we should get a linear improvement in terms of speed. So parallel processing is one of the key things in big data. So, in part of the batch layer, we need something that can run in parallel that can run over distributed file systems; and again, the less we can manage it, the better. I don’t want to have to tell it how many disks to spread data across. I want flexibility, sometimes to use lots of compute, sometimes you only use a little bit of compute.
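A rough sketch of that pyramid of aggregation, assuming a simple sum over four hypothetical partitions (one partition standing in for one disk's worth of data): each worker aggregates its own slice in parallel, then the small partial results are combined in a final, much cheaper step.

```python
from concurrent.futures import ThreadPoolExecutor

# Four partitions of data, standing in for four distributed disks
partitions = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]

def partial_sum(part):
    # Each bit of compute does a small bit of work over one partition
    return sum(part)

# Map step: one worker per partition, running in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Reduce step: add up the per-partition results
total = sum(partials)
```

Doubling the data while doubling the partitions and workers keeps each worker's slice the same size, which is the intuition behind the linear scaling claim above.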
The speed layer – now again, this is where Marz decided to write his own event processing system, and wrote a thing called Storm. Essentially, this is the idea of having a big old bucket and messages coming in; so small events, kind of a fact transaction kind of thing. So maybe some IDs, a few numerical figures, something we’re actively trying to measure coming in, being put into this messaging queue. And then something is constantly reading that queue – so you’ve got these things called spouts, which are just firing out events that they find in that messaging queue.
And then bolts – that’s just a small bit of compute that I’m doing. So I might be adding columns together, I might be taking 100 messages a second and aggregating up and spitting out one record per second saying here’s the average speed, temperature, whatever I’m measuring that went through in that second. So it’s essentially a way of getting a whole load of information coming in very quickly, squeezing it down to more controllable sizes, controlling the rate of flow of it and making it more understandable; so I can enrich it.
So if I just had dimension keys in my little fact coming through, I can do lookups onto dimensions and make it just a little bit more understandable. Kind of the speed layer – something that’s able to deal with this level, this speed, of records coming through, still allow me to touch it slightly and have some lightweight compute applied, and then just putting it somewhere I can use it. So we need to be able to connect to various other systems and actually push those out.
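A minimal sketch of what a bolt-style aggregation might do. The event shape and field names are made up for illustration; the idea is just squeezing many raw messages down to one summary record per second:

```python
from collections import defaultdict
from statistics import mean

# Raw events firing out of a hypothetical spout: timestamp + a measure
events = [
    {"ts": 0.1, "speed": 50}, {"ts": 0.7, "speed": 54},
    {"ts": 1.2, "speed": 60}, {"ts": 1.9, "speed": 64},
]

def aggregate_per_second(events):
    """Bolt: bucket events by whole second, emit one summary record per bucket."""
    buckets = defaultdict(list)
    for e in events:
        buckets[int(e["ts"])].append(e["speed"])
    # One record per second instead of every raw message
    return {sec: mean(vals) for sec, vals in sorted(buckets.items())}

summary = aggregate_per_second(events)
```

A real bolt runs continuously against the queue rather than over a finished list, and downstream bolts could then do the dimension-key lookups mentioned above to enrich each summary record.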
And then finally, the serving layer – this is probably where we vary slightly. So Marz built his own system called ElephantDB. It’s a sharded database layer; so having lots of different distributed databases that can all query flat file storage again.
So we’ve got this batch layer working in the kind of Hadoop style, where you’re picking up flat files and you’re putting down flat files. And then a serving layer working like a normal RDBMS, where you can go in and write something a little bit like SQL, and it will go off and query those and return it to the user as a normal dataset – results to a grid, kind of thing. So we need something that accepts SQL – where you could go and write your normal SQL queries – that can go off and take the results of that batch processing, stitch it together with the serving side and present a nice querying front. Essentially something that will return data to users.
So, they are the three things that we need. From the Marz point of view, the tech looks a bit like that. So you’ve got Hadoop, you’ve got Pail, which is how he was distributing data across those disks. That’s basically how you decide which data goes where on all of those different distributions. Cascalog, the way of actually writing those Java MapReduce jobs, you’ve got Storm, which is event processing and ElephantDB is the serving. So you’ve got your whole mix of tech. Again, that isn’t familiar to a lot of people; certainly not from the Microsoft stack. Maybe HDFS we kind of know, but that’s about it.
So we need to get versions of all of that tech; ways of doing all of that kind of stuff in ways that match our original list of criteria, ways that can be configured using SQL. So, the Azure approach – finally getting onto some actual tech we can use. We’ve got some choices. So this is what it looks like. In the batch layer, we’ve got a choice of storage. We’ve got Blob or Data Lake. We’ve got a choice of compute, so we can go for Data Lake Analytics, which is a little bit closer to the original in terms of we’re writing queries that pick up data from one distributed store and put it into another. We’ve got SQL Data Warehouse, which is an MPP querying system, so it looks like SQL Server. It is a flavor of SQL Server. You’ve got Streaming Analytics down at the bottom; that’s our streaming event processing. And then our serving layer – we can use SQL Data Warehouse again, or we’ve got more semantic layers, so Analysis Services and Power BI.
We’ve got loads of different choices. I’m going to quickly run through each and have a talk about which one is right for this particular scenario; essentially why I’ve made the choices I’ve made. To start off, storage. It’s much of a muchness here.
In terms of Blob Storage and Data Lake Store, they’re both Hadoop file systems. They both have the idea of, “I’m going to distribute your files across a lot of independent disks.” A slight difference is that Data Lake is WebHDFS; Blob is normal. They’ve got some differences in terms of how you can use them. For me, Data Lake has a lot more security layers. You can secure folders and individual files and say that team can write to that folder and not to that one, they can read from that folder. So from an enterprise point of view, it kind of makes a bit more sense to stick with Data Lake. But mostly you can use them interchangeably, if you need. There are one or two configuration issues with Data Lake, in that there are still some tools that are only just able to actually talk to it, but I think we’re now at the point where most things can – and certainly everything we’re going through can use Data Lake Store. So for this example I’ll stick with Data Lake.
The fundamental thing you need to know is that when you get a file, it will take that file, and if it’s above a certain size, I think more than 256MB, it will split that up into different sections. So a 1GB file should make four different sections if it’s the kind of file that can be split up; CSV or any kind of row based data. It will then take that data and create three copies of it and put them on different nodes. So you have this idea of you’ve got lots of copies of your data, lots of different bits of your file in lots of different areas, and that means if one of the disks happens to break, you never lose it because you’ve got that part of the file stored somewhere else.
It means you can always be accessing part of the file in parallel. So you can be reading that file with four different bits of compute at once. So again, you’ve got parallelism even just for a single file. Again, if you’ve got loads and loads of files, you can be reading those automatically. Now, you don’t need to configure any of them; so from the user point of view, Data Lake Store is just a file explorer. You go there and you click upload and you put a file there, and behind the scenes, that’s managed. It’s managing disks, it’s saving things to different extents, but you don’t have to see it. You don’t choose how big it is, you don’t have to provision [upfront], “Okay, we’re going to have 1TB of data in our Data Lake.” You can just say, “I would like a Data Lake, please.” Start uploading data and you get charged for as much as you’re using at any one point. But it is minimal configuration, minimal management, highly fault tolerant and highly parallelized. So, kind of just what we need it to do.
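The splitting arithmetic above (assuming the 256MB extent size holds – Simon hedges on the exact figure) can be sketched as a toy calculation. This is illustrative only, not the actual Data Lake Store placement logic:

```python
EXTENT_MB = 256   # assumed split size for a splittable (row-based) file
REPLICAS = 3      # each extent is copied to three different nodes

def extents_for(file_mb):
    """Number of extents a file is split into (ceiling division)."""
    return -(-file_mb // EXTENT_MB)

file_mb = 1024                            # a 1GB CSV
extents = extents_for(file_mb)            # four sections, readable in parallel
copies_on_disk = extents * REPLICAS       # twelve pieces spread across nodes
```

Losing any one disk still leaves two copies of each extent elsewhere, which is where the fault tolerance comes from; the four extents are also why four bits of compute can read one file at once.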
So, storage, we’re very, very happy with; we can just use Data Lake. On the other side we’ve got the batch compute; the thing that’s going to be running every so often to chunk through all the files that we’ve got and produce aggregates, produce valid data, produce another file that we can consume that’s got the summary in there.
So, Data Lake Analytics – you pay per query, so you don’t have to provision a server. You don’t have to be paying for a whole server and then managing who’s using it at a time. You can just say, “For this query, I would like you to use that many compute units.” Hit go, and it will provision itself, run, and close everything down when it’s finished. It’s the definition of a nice, scalable system. Now, it uses something called U-SQL, unified SQL – there are various different flavors of it – which is kind of a blend of C# and SQL.
So technically it is not T-SQL. It is not the SQL language, but it’s built to be familiar. It’s built to feel easy to use for someone from a SQL background. And the whole thing is, it takes structured or unstructured data in and you can spit out structured or unstructured data. You apply structure during processing, but it’s very flexible in terms of the kind of data you can deal with.
Now, SQL Data Warehouse, on the other side, you spin up as a server and you pay per hour, but that is T-SQL. It’s a cut down version, it’s a special version, similar – based on the PDW or APS. So it is that MPP version of SQL Server, massively parallel processing. So there are certain restrictions in terms of some of the language that you can’t use, but it’s a much more fully featured version. And you use management studio and you write stored procs and it feels like normal SQL Server. But again, this is fully structured, so as soon as you’ve pulled the data in, it’s then sitting in a SQL table and acts like a SQL table. So it depends whether you want that data to still be available in the unstructured distribute way or whether you’re happy for that now to be in a relational data engine.
Both of them can actually still access that HDFS layer, because Azure SQL Data Warehouse has PolyBase in there. Now, that is an extension to SQL Server – it’s in SQL Server 2016 – where you can define a table such that, whenever you interact with that table, it goes off and queries flat files and brings a result back. So it feels like you’re querying a local table, it looks like a local table, but every time you hit go, it’s going back to the flat file, reading from disk and bringing it back. So they both have the ability to interact. For this example I’m going to stick with U-SQL, Data Lake Analytics, for our batch, because it’s more in line with that original idea of querying the data, doing some processing, doing some aggregates, spitting it out on the other side.
So, in terms of how it works, say we’ve got those four files. We’ve got a file split up into four extents over our Data Lake Store down at the bottom. If I write a query, selecting from it, doing some aggregates and I’ve only given it one compute unit, then it will need to run, read the first one, produce its results. Spin down, do the second one, third one, et cetera. If I want that to go faster, I can give it two compute units, and then it will be able to read two at once, then the next two. I can make it do all four at once and then it will keep going faster and faster. So it’s very scalable in terms of how I get it to work with that distributed file storage.
Finally, we’ve got Azure Streaming Analytics. So that is our speed layer. That is talking about how we get events going through. And honestly, there aren’t many options here. So if we’re talking about completely platform as a service and we’re talking about T-SQL friendly, then Streaming Analytics is basically our option. So you’ve got Storm in Azure – you can spin up an HDInsight cluster and have Storm working on it. You’ve got Spark streaming, which is very, very similar. But if you want to not have to manage your cluster, not have to decide when it’s turned on and turned off, and then you want to be able to write standard T-SQL in the middle of it and have SELECT, FROM, WHERE, INTO, MIN, MAX, all that kind of stuff, then Streaming Analytics is basically your choice. But it’s super flexible, really easy to set up. And one of the key things I really like about it is the ability to vertically partition files. So if you’ve got a load of data coming in and it’s writing that to a flat file, you can say when you get to the hour mark, start a new file and name it accordingly.
So you end up with something like this, where you’ve got your inputs, your dates, all of your times, and so you know which file you’re currently writing to. So you can say, okay well I’ve started the new file, that means the previous file is now ready for batch processing. You can use it as its own triggering mechanism, and it does that out of the box. So there’s some really neat features that kind of make it work for what we’re trying to do.
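The hour-boundary trigger idea can be sketched as follows. The path pattern here is a hypothetical example (the real naming comes from the output configuration), but it mirrors the mechanism: each event is routed to a file named for its hour, and a new hour starting means the previous hour's file is closed and ready for batch processing.

```python
from datetime import datetime

def file_for(event_time: datetime) -> str:
    # Hypothetical hourly path pattern, e.g. input/2017/05/01/09.csv
    return event_time.strftime("input/%Y/%m/%d/%H.csv")

current = None
ready_for_batch = []   # files whose hour has closed, safe to batch process

for ts in [datetime(2017, 5, 1, 9, 59), datetime(2017, 5, 1, 10, 0)]:
    path = file_for(ts)
    if current is not None and path != current:
        # Crossed an hour boundary: the previous file is now complete
        ready_for_batch.append(current)
    current = path
```

Streaming Analytics does this rollover for you out of the box; the sketch just shows why the appearance of a new file can double as the batch layer's triggering mechanism.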
Finally, the serving layer – how users actually query this data. It would feel natural to say Analysis Services and Power BI, because they are user consumption layers. However, if you’re talking about flat files, both of them have to use scheduled refresh – at most every 15 minutes. So if we’re saying we want real-time data, we need something that can use PolyBase. You could use SQL Server 2016 on a VM, if you want the infrastructure and you want to manage it. Or you can use SQL Data Warehouse in a lightweight manner, and that can do direct query via PolyBase.
So we’re using that as a front end, as a view, as a window onto our Data Lake. All the files we’ve got in there, we can write SQL against within the data warehouse, hit go, and it will go off, query them and bring the results back. The big limitation is concurrency. It is meant for heavy analytical queries. It’s not meant for thousands and thousands of users all hammering it at once, because only 32 queries can run at once, and that’s once you’ve scaled it up a bit. No matter how many times you scale it up, no matter how much money you pay per hour on a SQL Data Warehouse, 32 is your maximum concurrent queries.
Eugene Meidinger: About that, so Brent has a question. He’s asking, “Is that something that Microsoft has improved over time, or is that going to be the default for the foreseeable future?” Because that is pretty low for a data warehouse.
Simon Whiteley: That brings me onto the next bit. To explain how SQL Data Warehouse works and where that concurrency problem comes in, we can have a look at how it actually works. The reason is that control node. That bit at the top is where the concurrency bottleneck is happening. You have compute, and you can scale the number of compute nodes you have, but it’s all going through that one control node. It’s a factor of MPP – massively parallel processing systems have concurrency problems; they’re not meant for highly concurrent interaction. You normally push that somewhere else.
So from the get go, from the very start of Azure SQL Data Warehouse, it’s had 32 concurrent queries as the max; and that’s once you’ve scaled it up – I think it’s DWU 1000 you have to reach to get 32 at once; otherwise it’s less than that. Anything above that is just extra performance. So the queries will finish faster, but you can still only have 32 at once. It will just get through that queue a little bit faster.
So, those colored bits are our distribution nodes. I’ve got 60 distributions in SQL Data Warehouse, and that is fixed. No matter how scaled up you are, whether you’re at the lowest, DWU 100, all the way up to 6000, you always have 60 distributions. What you’re changing when you scale is that middle number, the number of compute nodes. So at DWU 300 I’ve got three compute nodes. If I scale up, I simply have fewer distributions per compute node; each bit of compute has to do less work. That’s why you can go all the way up to DWU 6000, because that simply means there’s one distribution per compute node. But again, there’s only ever one control node.
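As a hedged sketch – these are the documented catalog views for SQL Data Warehouse, but treat the exact names as something to verify – you can see the fixed 60 distributions and the current node layout from T-SQL:

```sql
-- Distributions are fixed at 60, regardless of scale.
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;

-- One CONTROL node, plus however many COMPUTE nodes your DWU buys you.
SELECT type, COUNT(*) AS node_count
FROM sys.dm_pdw_nodes
GROUP BY type;
```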
Eugene Meidinger: Quick question, no one asked this but just for the sake of the audience, you know, you reference massively parallel processing, and I know what that means. But that kind of has a specific definition, because it’s not just like big data, it literally is how you slice things up. So can you take a minute to explain, when you say MPP, what that actually means to people?
Simon Whiteley: I can, and I actually have a slide deck that tells me this, so give me a second and I’ll get some slides up.
Eugene Meidinger: Just for people who don’t understand some of the big data architectures and…
Simon Whiteley: No, that is absolutely fine. I’ll just quit out of that for now. Okay, so I believe this is the question, SMP versus MPP. Okay, so these are the two things, and whenever I’m doing a talk about this, I always start off by asking, “Who’s used MPP?” And a couple of hands go up. Then I ask, “Who’s used SMP?” And everyone goes, I don’t know. That’s the trick question, because SMP is symmetric multiprocessing, AKA a standard SQL Server. It is the way a SQL Server normally works. So in that, you have a load of compute – the CPUs that are on your server – and then you’ve got storage, and any of those CPUs can share that same bit of storage. So you might have a SAN array, you might have lots of different disks, but it doesn’t matter which of your cores, which thread, is working – it can access any of your shared storage.
So as you grow and grow, as that system gets more and more CPUs, that bit in the middle, that server, working out what’s accessing which bit of data at once, has more and more work to do. So that’s the idea of SMP, kind of I’ve got a load of CPU, I’ve got a shared bucket of storage, so any bit of CPU can access any bit of storage and you get to points where you get diminishing returns when you’re dealing with a really, really, really big system.
Now, MPP has a different idea to that, which is much more similar to the batch processing idea, that Hadoop MapReduce style. We have compute aligned with storage. That means, if there’s a particular fact record, only one of those compute nodes actually has access to that one bit of storage. So rather than any CPU being able to access any bit of storage, it’s aligned: that CPU has that storage, that CPU has that storage. That means I can scale out by adding another CPU and more storage, more CPU and more storage, and spread my data in a thinner and thinner line across them, meaning each bit of CPU does less work, and that can just scale and scale and scale. So that’s the idea – MPP came around to get over that problem of diminishing returns, of making bigger and bigger and bigger boxes, and to make a way that is linearly scalable. It takes those MapReduce style ideas and says, how do I do that as a SQL Server, how do I make a database engine work that way?
Now, it does mean that there are concerns and you have different problems than normal SQL Servers. You’ve got to figure out how you layer your data across that storage, how you design things so that you are efficiently using all of those compute nodes at once. You don’t want to write a query that is limited and the data’s only on one node, in which case only one stripe, one bit of your compute is actually working at a time. It’s all about getting it to work efficiently across everything. If that answers the question…
Eugene Meidinger: Yeah, I think that was a good job and helps explain why you run these concurrency limits, because you’ve split it across these different nodes, it’s not like a giant SQL Server. No, I think that’s perfect.
Simon Whiteley: It’s like lots of small SQL Servers reporting into one brain SQL Server, and because of that aggregation, because of that powerful throughput into the aggregation layer, you then have a touch point, a threshold where you can only go so far.
Eugene Meidinger: A bottleneck, absolutely. Yeah, another question came in. So does Azure SQL Database, the PaaS service, support PolyBase? Because someone was saying…
Simon Whiteley: No it does not, and that makes me so sad. So currently it is only the on-premises version of SQL Server – the one that you install on a VM or on a bit of kit – or SQL Data Warehouse that has PolyBase. And the two are not the same. The version of PolyBase that’s in Azure SQL Data Warehouse can talk to Data Lake – you can PolyBase over Data Lake – whereas the one that’s currently in SQL Server 2016, to my knowledge, doesn’t have that update yet. And obviously the release cycle of a SQL Server install version versus a PaaS instance is different as well. Sadly you cannot use PolyBase in SQLDB, so there is currently a bit of a gap: if you are not dealing with large amounts of data, then SQL Data Warehouse is probably going to be a fairly expensive tool – a hammer to crack a nut. These PolyBase ideas don’t quite make sense if you’re dealing with only a few gig of data. That’s a little bit of a gap, but then at that small scale you probably don’t need PolyBase anyway.
Cool. So, in terms of how PolyBase works with SQL Data Warehouse: each one of those bits of compute has its own PolyBase reader, with eight external threads, so every bit of compute can do a certain amount of reading from flat files. If you add more compute, your PolyBase goes faster. So it’s not just that queries over data internal to SQL Data Warehouse go faster when you scale up; it means you can be reading more from flat files, doing more MapReduce style jobs at once. It’s a scalable querying engine on top of flat files, as well as the whole MPP thing. And that’s why it works quite nicely as the serving layer for this example.
So, coming back to that: we’ve gone through our decision process, we’ve had a look at the different tools that are available. We’re going to use Data Lake Store, Data Lake Analytics, Streaming Analytics to do our speed, and then we can query the results of both using SQL Data Warehouse. That is the plan. To give you an idea of how that data flow actually works – because that’s lots of different tools that don’t obviously talk to each other – we’ve got a message. We’ve got our telemetry sensor, our Raspberry Pi, whatever it is, sending us information, and Streaming Analytics is reading from wherever those events have been sent. So the messages might go to an event hub; Streaming Analytics is reading from that event hub, doing a tiny bit of compute and then putting the results into Data Lake Store.
So we’re saving that as a flat file, just as a CSV, and we’re filling that CSV up, and then every hour we’ll just start on a new one. So we’re building up pretty much delta files in a file structure within our Data Lake Store. We’re doing batch processing. Once that file is complete, once we’re happy that that file is full and not going to get touched again, we’ll pull out, do some batch processing in Data Lake Analytics, and produce a summary file.
That summary file can then be read into SQL Data Warehouse via PolyBase. And at the same time as we’re reading that via PolyBase, we can be reading from our flat file that’s in flight, the one that’s slowly filling up with data. And we can have a view which joins the two together – that means when a query comes in, we’re seeing our batch data, but we’re also getting the latest streaming data, and we can see that data come in and fill up as we go. So, via just straightforward integration – you don’t need to build anything particularly special for this – you can actually have a Lambda architecture. I’m not saying you would need that for every scenario; it’s just an example of how you can take big data architecture and use it with T-SQL friendly tech.
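A minimal sketch of that serving view, assuming two PolyBase external tables – the names and columns here are illustrative, not from the demo – one over the batch summary file and one over the in-flight hourly CSVs:

```sql
-- Lambda serving layer: batch summary plus the hour currently in flight.
CREATE VIEW dbo.vw_SalesServing AS
SELECT ProductName, SalesAmount
FROM   dbo.ext_SalesSummary            -- batch layer: pre-aggregated history
UNION ALL
SELECT ProductName, SUM(SalesAmount)   -- speed layer: roll up the raw in-flight rows
FROM   dbo.ext_SalesInFlight
GROUP BY ProductName;
```

Each query against the view then sees the full batch history plus whatever has landed in the current hour's file.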
Now, from an orchestration point of view – it gets a bad rep, but Data Factory actually works quite well for this. A lot of people, when Data Factory originally came out, thought it was SSIS in the cloud, got super excited and tried to do complex control flow things, deciding whether or not to run jobs, and it’s not very good at that kind of thing. It is a straightforward “at this time slice, at this particular regular point, do these actions in a row” tool. But for this example, where we just need to say on the hour every hour – knowing that at that point Streaming Analytics will have started a new file – we can say kick off my U-SQL, do my batch processing, call a stored proc in SQL Data Warehouse to take the result of that and pull it in. You can actually have this as a nice linear flow. So Data Factory works really nicely for that simple, predictable, “I know what’s going to happen and I just want to do these things in a row” scenario.
So orchestration is fairly easy. We don’t need to have a big huge orchestration scheduling event, because a lot of these things just talk to each other naturally. So you’re saying just run that query, run that query, fire that job, and things will talk quite nicely. So, this is where we go and have a look at how it works.
In true British style, this is the Blue Peter moment of “here’s one I made earlier”, and I can just step through a working example with the code all coming through. I was going to try and build it live, but I think that’s too many things to go wrong at once. So I’ve got one that is nicely set up and I know is working.
Okay, so here we have the Azure portal. So here we’ve got our different components all in a line. We’ve got the event hub – I’m going to kick something off in a second that will be sending messages up into the event hub. That’s just a place that can be constantly sitting there, receiving files, receiving small packets of information from whatever my sensor is. Streaming Analytics, which is picking up that sensor’s information, doing a little bit of work to it and putting it down into Data Lake Store. Data Lake Analytics, running jobs on that file and then pushing aggregate files back into the Data Lake. And then my SQL Data Warehouse, putting the two together and giving me a querying front end.
So, my event hub. I’ve got a little bit of PowerShell over here – I’m not going to go too far into it. We’ve got PowerShell sitting there running, and this is a really super simple little PowerShell script that is executing a stored proc locally. I’ve just got AdventureWorks installed in a local SQL Server, and that stored proc is just picking a random record from the fact table. It’s doing a little bit of script to turn it into something a bit like JSON – well, an actual little individual JSON argument. And then it’s calling the REST [inaudible]. It’s just saying there’s the header, there’s a little bit of information, go. So if I hit go, it’s going to start pushing out data – each one of those is a record being sent to the event hub – and so I’ve now got a stream of information. That is information going up into the cloud, being stored in an event hub and waiting to be picked up. We’ll just leave that running in the background.
Okay, so we’re going to have a peek in our event hub. So this is the event hub namespace. Again, it’s a collection of event hubs. And you can see I’ve got messages flying through at different points. You can see, I was doing it late last night, and then this morning I had different amounts of messages just flowing through the system. I’ve got sales data down there, here’s my event hub, so it’s just a container of information, sitting there receiving all these messages as they come. There’s not much configuration to do there. It is literally just a bucket waiting for messages to be sent through.
So the thing where I want to do stuff, I’ve got my streaming layer, my speed. So Streaming Analytics has three parts. You’ve got your inputs, your queries and your outputs. So, the first input I’ve got is my sales feed. Now that’s just my event hub. So we click on that, so I’ve got my service plus Lambda, my event hub name; that’s my sales name, my bucket within my overall event hub container. It has a username and key, and I didn’t want to share that out to the internet, just in case people start sending me things. And finally, I’m telling it to expect JSON. It’s as simple as that – all it needs to know to actually go off, talk to the event hub and start scraping that information, start getting those messages as they arrive.
Then if we look back at the PowerShell, the information that we’re sending, so the stuff down here, I’m just sending an ID and then some keys and some numbers. There’s no real information I can use in there. So one of the things I’m doing on the fly, I’m doing a lookup against a quick CSV. So I’ve just got a file over here, I’ve got this dim product CSV, and that is, as you can see there, a literal list of all the keys and the product names.
So what I want to do is, for each message going through my Streaming Analytics job, I want it to look up against that and return a product name. So I’m going to enrich the message as it’s being sent through. Again, we do that in Streaming Analytics, and that’s what that dim products reference Blob is.
So that is a CSV I’ve put in Blob Storage. It’s, again, limitation currently, you can only have it in Blob Storage, you can’t do it in Data Lake. So it’s got my storage account, it’s got my account key telling it where to find that particular file. And I’m currently just saying that’s a fixed file. I only want it to go to the same one file each time, it’s just going to load that into memory, it’s not going to change. You can set this up so that every hour, every minute, every day, it has a new file and so you can have this data change over time. For this example I’ve just kept it really, really simple, CSV, one off, go look at that table.
Okay, so they’re our two inputs. Then for the outputs, I’ve got ADLS, I’ve got Data Lake Store. So for this – there we go, so I’ve got my Data Lake Store, which I’ll show you in a second. I’ve got the path, so I’m putting it to a certain place within Data Lake, but I’m including the date and time. So those two ones that are within your braces, they are telling it, “I want you to put the current date and I want you to put the current time” And that’s what tells it to start a new file when it reaches those thresholds. So you can have it, when it gets to a new date, you can make these files daily just by having the date in there, or you can make it hourly by having date and time.
So I’ve included those two, so it now knows, as it’s creating these files, as soon as it rolls over into the new hour, to start making a new file and filling that up. That’s literally all you need to do to tell it that, so it’s nice and easy to configure in that way. Then finally, the magic in the middle is a query, which, nicely enough, is SQL. So I’m saying SELECT which columns I want from both sides into my output – I’ve got ADLS as my output there. And I’m telling it to do an inner join from my sales feed – that’s my event hub, my messages coming in – to dim product – that’s the one I defined, my reference table in Blob – joining on those two keys.
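As a hedged reconstruction of that query – the input and output aliases match what's described, but the column names are assumptions:

```sql
-- Enrich each event with the product name from the reference CSV,
-- then land the result in Data Lake Store.
SELECT
    sf.SalesOrderNumber,
    sf.ProductKey,
    dp.ProductName,
    sf.SalesAmount
INTO [ADLS]                 -- the Data Lake Store output
FROM [SalesFeed] sf         -- the event hub input
JOIN [DimProducts] dp       -- the reference blob input
  ON sf.ProductKey = dp.ProductKey
```

Because [DimProducts] is reference data rather than a second stream, a plain inner join like this is all that's needed to do the lookup per message.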
And with that information, that is all I need in there to actually start pushing that data through. Now, I could do a lot more in here. I could be having it aggregating the data up, doing time series, saying give me the average amount for the last five seconds, ten seconds – all this kind of rolling window, tumbling window is really, really useful, but I don’t need to do it at this point, so I’m not making that any more complex.
You can upload files here and then use test. So if you’re doing anything complex with the SQL, you can run that as if you were running the query and it will say, this is the data I’m going to spit out. Again, that’s super simple, and I can just show you the files that have been created. So we go back to our dashboard where we can look at Data Lake Store. Which again is really, really simple – you can just create a Data Lake and it is then there, immediately ready for you to use. And you’ve got this data explorer, which is essentially a file explorer. So you can have directories, and you’ve got access controls, so you can say people are allowed read access to that, write access to that. I’ve set up a service principal – a service account – so all the things can come and talk to my Data Lake and they’ve got permissions.
So I could say, a certain application is allowed to write only into a certain folder. So you could have nice segregation to make sure people aren’t going to go and edit things and make sure once you’ve loaded data into the lake, no one else can access it. So it gives it that kind of – it’s almost an immutable delta. It’s not fully immutable because if you’ve got access you can go ahead and do it, but you can start to build those principles into it.
One of the interesting things when people start to use Data Lake, one of the common problems is what if it becomes a data swamp? They just have no structure and they start putting files in, then start complaining saying this is useless, I can’t find anything. One of the starting principles of using a Data Lake is figuring out, what’s the folder structure, how do you want things to work, how are you going to put stuff in there?
So we use a set structure. So raw, that is raw data, we haven’t changed anything. That is the structure it originally arrived in the lake. Corporate, because it’s one of our internal systems, and sales feed is the actual entity. They’ve used two folders, these were created by Streaming Analytics, because we’ve rolled over from a [inaudible]. Inside of there, we’ll have the individual files that have been made. And it is UTC, so that’s an hour behind me, so you can see that that 12, that is the file that is currently being written to. There are currently records going and landing in there. So, we can currently see, that size is going up. There are records that are landing into the CSV. That is an ongoing process, whereas the other ones aren’t changing. We’ve now rolled away from that, we’re no longer writing to that particular file.
And that’s all managed by that one little bit of date and time with the curly brackets in our Streaming Analytics set up. So there’s no more configuration, there’s no orchestration; I don’t have to have anything coming in and scheduling it and telling it to do that. That just works out of the box really nicely and easily.
So that’s my Data Lake all set up and ready to go. It is as simple as that. I’m not having to tell it where to put those files. I’m not having to say this is how to split these files across different network shares. Now obviously, these files are very, very small. I’m just doing a nice example set so we don’t have to wait a while and I don’t run out of money trying to process this stuff. But again, you can see how this scales up and would become a much bigger example.
Okay, cool, now we start processing stuff. So we’re going to talk compute. So we’ve got our batch layer, Data Lake Analytics. So we can go in – this is one of the really nice things in that you can do it straight through the browser. Again, all of these things I can be doing via visual studio and deploying and being a proper developer and source controlling and being really good about stuff, but I’m just showing the quickest example; you can do all of it through the browser. So I’m going to grab a little bit of code and I’ll show you what that looks like. So a bit of U-SQL, there we go.
So, I’ve got some variables I’m declaring at the top. Again, like SQL, I can just declare variables and use them throughout the rest of my script. But one thing to note is that I’m using C# data types. This is a C# environment made to have SQL style syntax. So I feel like I’m writing SQL – it’s a familiar language to me – but I can extend it. I can write DLLs, I can add new references into it, I can call web services. So it’s a very extensible language, built to be friendly to the SQL developer.
So I’m telling it where to find that file, telling it where to put the output file, and giving it a date reference for the current time. Again, similar to Streaming Analytics, I’ve got a few things in my file path that I’m actually pulling out. So I’m saying, when you’re reading the path, when you’re finding the folder it’s in, understand that the first bit of the path is the year, then the month, then the day. That means I can have file date as something that’s actually in my query. Each row that gets pulled out of my data, I can stamp with the date, stamp with the file name – I can derive information from the physical location of my files. And what this is actually going to do is bring in all the files it finds within that structure. So I’m not saying go and find a particular file; I’m saying recurse through everything in this entire directory tree. Anything that sits underneath sales feed, it’s going to dynamically go through, pull in any file it finds under there, and bring the information about which particular file each record came from into my query. That alone is super useful.
So this extract statement – it’s the same as SELECT, but it means I’m going and doing flat files. So saying go to my storage, expect to find these columns – because again, we’re doing from flat files, flat files don’t necessarily have metadata in there. Especially if it’s a CSV, it doesn’t know what data type each of those columns is. So I’m having to say, sales order number, that’s a string, product key, that’s an int. I’m telling it how to understand the files it’s going to get.
So it’s going to find all of those columns in the file, two virtual columns I’m adding in, based on that path. I’m telling it, “use that variable to find out where to go,” and I’m telling it, “it’s a CSV.” So this extractor.CSV, there’s a few built in flat file extractors saying this is how to understand that flat file. And you can build them yourself, you can build extensions and you can find some open source GitHub ones. There’s a ton you can do to extend this language, but there’s already a fair bit in there. And I’m skipping the first header row because I don’t want it to try and turn my row of headers into those data types because that’s going to break.
So that, as a query, will return me @sales. That’s a variable – it’s like a table variable within my U-SQL script. That is a recordset that I can use later. I’ve got my result down here. So I’m saying SELECT, back to normal SQL, I’m saying SELECT my product name, the sum of sales amount from sales. So I’m using the record set from the extract in my next step. I’m saying take the results of all that reading you were doing and do some SQL on it.
Now, the really smart bit is this bit. I’ve got a WHERE clause – a search predicate – but I’m applying it to the virtual columns. What that means is it won’t read the files that don’t actually meet that predicate. So that file that’s currently being written to by Streaming Analytics – it’s not from a previous day, and it’s not from an earlier hour, because it is the current hour – that means it gets skipped. So I’m not going to read the file that is in progress. Whenever I run this U-SQL script, it will always ignore the current hour. Rather than having to bring those files in and then work out how far through each one is, I can actually control which files get read at all. Even though the predicate is in a later part of the query, it gets pushed back up to the extract. Being able to push down predicates to earlier parts of the query based on these virtual columns makes U-SQL a pretty smart language.
Finally, I’m going to output it. So I’m taking whatever result I get from that and putting it to my output directory, which is the one up here – enhanced sales totals, sales.csv. Because I’ve just given that a fixed name, it’s going to replace that file each time I run it, and again, it’s output as CSV.
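Putting those pieces together, a hedged U-SQL sketch of the whole script – the paths, column names and virtual-column names here are illustrative, not the demo's exact code:

```sql
// U-SQL uses C# types and expressions; // starts a comment.
DECLARE @now DateTime = DateTime.UtcNow;
// {FileDate:...} and {FileHour} define virtual columns derived from the path.
DECLARE @in  string = "/raw/corporate/salesfeed/{FileDate:yyyy}/{FileDate:MM}/{FileDate:dd}/{FileHour}.csv";
DECLARE @out string = "/enhanced/salestotals/sales.csv";

@sales =
    EXTRACT SalesOrderNumber string,
            ProductKey       int,
            ProductName      string,
            SalesAmount      decimal,
            FileDate         DateTime,   // virtual column from the folder path
            FileHour         string      // virtual column from the file name
    FROM @in
    USING Extractors.Csv(skipFirstNRows : 1);   // skip the header row

@result =
    SELECT ProductName, SUM(SalesAmount) AS TotalSales
    FROM @sales
    // Predicates on virtual columns are pushed up to the extract,
    // so the current hour's in-flight file is never read at all.
    WHERE FileDate < @now.Date
          || (FileDate == @now.Date && int.Parse(FileHour) < @now.Hour)
    GROUP BY ProductName;

OUTPUT @result TO @out USING Outputters.Csv();
```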
Finally, up here I’ve got these AUs – the analytics units. How many compute nodes do I want to throw at this? Because I’ve got a few more files to do, I’m going to be reading from five files at once. It means the query’s going to cost me five times as much, but it will only cost a few pence per minute per unit; it’s not an expensive service. Because we’re talking about a small amount of time, you pay pro rata for the amount of compute that you actually use. So I’m going to hit go, and that’s going to submit that job.
Now, this isn’t a server that’s sitting there fixed waiting for me. So I don’t get an instant response; it doesn’t run immediately. It has to go and say, this person has submitted this query, they want five units, go and find those five units, wait for them to spin up, then push the query out to it, run them and get the results. So again, this isn’t that kind of immediate query engine. It’s not a user sitting there trying to get the result back on a page. You wouldn’t put something like this behind a web page or a dashboard, this is a batch processing thing. There’s always that overhead of time we have to wait for it to get processed.
So you can see, I’ve got my input files and it’s found a load of files it wants to process. That does have the current hour in there, because it knows that file exists. But then over here we’ve got the data that is actually being processed, and you can see it won’t actually bother reading anything from that current-hour file. So we’ll have our latest extract over there, going from the hour 12 file with no rows read from it. Again, this is a nice, smart way of doing things: I’ve got 130 rows, 129 rows – I’m reading data from each of my different files, and I’m not bothering to read anything from the one that didn’t meet those criteria. So it’s nice and efficient, it’s good.
So that’s finished, and it’s spat out its output. Back in my Data Lake, click on that, and there we go. I’ve got an output file, a summary. So of all of those transactions, all those different events that were coming in, I’ve now just got, by product name, the totals. And again, I’ve not had to build any big integrations – Data Lake Analytics talks to Data Lake Store natively, and Streaming Analytics talks to Data Lake Store natively.
Okay, so the final step, we need our SQL Data Warehouse. So we’ve got it sitting here and it’s turned on. I need to remember to pause that because I keep running out of money because I forget.
And that’s my server. So I do need management studio here. You can do the query editor too, but the query editor is a little ropey. So let’s go back. I’m going to grab some code and show you what you need to do to set up PolyBase. There’s a couple of prerequisites that I’ve already set up, but I’ll show you what the code looks like.
So firstly, I’m going to store some credentials inside my SQL Server. I need to set up a master key, I need encryption, and I need to be able to put in what my service principal ID is, what the OAuth endpoint is and what the secret is – and I want no one to be able to query the database and actually find out what those are. So I’m creating an encryption key first, and then I’ve got this database scoped credential. That means anyone else can refer to “Azure storage credential”, and that will resolve to the identity and the secret, but I don’t need to share those with anyone. Once I’ve created that, I can just refer to it without having to give anyone my key, which is great.
So I’ve gone ahead and done that with my actual secret, so I don’t have to go and show it to everyone on the internet. Then I can create a data source based on it. So I’m saying my location, that’s my Data Lake, Lambda.AzureDataLakeStore.net. So I can go and connect to that using my credential. I’m telling it there’s where the Data Lake is, this is the username and password to connect to it. So with those bits and pieces, that’s all it needs, all my data warehouse needs to be able to connect and talk to my Data Lake Store.
There’s a second little thing which is called your external file format, and that’s similar to in U-SQL, we’re telling it to use the CSV extractor. This is again saying this is how to understand that file. Expect to find a CSV, expect it to have double quotes for the string delimiter. We’re giving it enough information to be able to understand the files that it’s finding. So I’ve gone ahead and already ran those, so it should be able to go ahead. Then I’ve got my two different bits.
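As a hedged sketch of those setup steps – the object names, the Data Lake account and every secret here are placeholders:

```sql
-- 1. Master key so the credential secret is stored encrypted.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<a strong password>';

-- 2. Service principal identity and secret, referred to by name from now on.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = '<service-principal-id>@<OAuth-2.0-token-endpoint>',
     SECRET   = '<service-principal-key>';

-- 3. The Data Lake Store itself.
CREATE EXTERNAL DATA SOURCE AzureDataLake
WITH ( TYPE       = HADOOP,
       LOCATION   = 'adl://lambda.azuredatalakestore.net',
       CREDENTIAL = AzureStorageCredential );

-- 4. How to understand the files: CSV with double-quoted strings.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',',
                        STRING_DELIMITER = '"' ) );
```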
So firstly, I want to create an external table against the batch layer. This is that summary record. This is the product names and the total sales amount across all of my history, which I’ve just recalculated. So all I’m expecting is the product name and the sales amount, and telling it the data types. Again, the flat file itself doesn’t know its own data types, so I need to say expect to find an nvarchar, expect to find a decimal. PolyBase isn’t great when it comes to data typing. So if there’s anything that doesn’t match that, I can reject it. So I can tell it to reject a certain number of rows. Unfortunately, currently there’s no way to get at those rejections. So if I’m bringing in several million rows and I reject 1,000 because their data types are wrong, I don’t know which 1,000 are gone. So normally when using PolyBase, I data type quite loosely in PolyBase and then retype within the warehouse. That’s a matter of preference because I don’t like having unknown missing rows.
In this case, we’re only dealing with a small file and we’ve just made it, so we’re pretty happy that’s going to be right.
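A minimal sketch of what that batch external table might look like, including the reject threshold and the “type loosely, retype in the warehouse” pattern – the table names, schema and location are illustrative assumptions, not the demo’s actual objects:

```sql
-- External table over the batch summary file(s) in the lake
CREATE EXTERNAL TABLE BatchSales (
    ProductName NVARCHAR(200),
    SalesAmount DECIMAL(18,2)
)
WITH (
    LOCATION    = '/processed/sales/summary/',  -- a file or a directory to recurse
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = CsvFormat,
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 1000   -- rows failing type conversion are dropped silently
);

-- Optional: land it internally with CTAS and retype there,
-- so any conversion failures surface where you can see them
CREATE TABLE BatchSalesTyped
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT CAST(ProductName AS NVARCHAR(100)) AS ProductName,
       CAST(SalesAmount AS DECIMAL(18,2)) AS SalesAmount
FROM BatchSales;
```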
Eugene Meidinger: So quick question about the PolyBase. With the [inaudible] version, instead of the data warehouse, what year versions are supported? So is it just 2016 and higher or?
Simon Whiteley: Yeah, it was released in 2016. So 2016 and 2017, when it comes out, will both be PolyBase friendly; nothing before that.
Eugene Meidinger: Okay, and one other question, unrelated but from much earlier – when you were talking about how Data Factory is good for kind of hourly stuff but not triggered events, so much, someone was asking could you use Logic Apps to be kind of a triggering process?
Simon Whiteley: Not yet. So currently you can force a Data Factory to rerun a slice, but you have to trick it. You have to go and use a bit of PowerShell or C# and say invalidate that slice, retry it, and that’s a bit hacky; it’s not great. There are rumblings that a future version of Data Factory will allow ad hoc pipelines. There’s not a lot of information about that yet; when that does come along then yes, absolutely. So I’ve been using Logic Apps more and more for that kind of event triggered, event driven processing, which doesn’t talk to Data Factory quite nicely yet, but it will do eventually, without saying too much more.
Cool, okay so our batch layer – just going to hit go on that. So that’s created it, it also validates it. It also makes sure that that particular file or directory exists. Again, I don’t have to point it at a particular file. I can say recurse through this directory tree and bring back all files. For this example I’m just pointing at a single file, so I can do that.
So I can do SELECT FROM my batch layer. Again, every time I hit go on that query, every time I execute, that is going back to the Data Lake, reading that into memory using those PolyBase readers that are on each of my compute nodes, and then returning it to me. So that is coming straight from that flat file. So again, I’ve managed to expose my flat file within my relational data store.
Then I’ve got the other side, I’ve got my speed layer. So this is going to go to my latest file. Now PolyBase has a limitation in that you can’t do that kind of same filtering. You can’t say where the file name equals as part of when you’re selecting from the external table; you have to do it as part of the creation of it. Again, I would love it if I could just create a generic one and then say SELECT * FROM my external table WHERE – give it a file name and have it only read from a certain one. Currently it would read everything into memory, bring it all back and then filter it afterwards, which isn’t that performant.
So what I’m going to do is get the name of the latest file, and for now I’m going to hardcode it to that latest file. But again, you can dynamically recreate external tables just using a bit of dynamic SQL in a stored proc. So if I was doing it in an automated fashion, I’d be dropping and recreating that external table and just pointing it at the relevant file. So again, I’m going to go into Data Lake Analytics, I’m going to step into my Data Lake explorer, go raw, corporate, sales feed and grab the name of my latest file. It’s that chap, 13 – we are just over into a new hour. Okay, so I’m going to grab that, copy it and put that over here, and get rid of the double CSV.
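The drop-and-recreate pattern for an automated run might be sketched like this – the table name, columns and file path are hypothetical, and in practice the latest file name would be discovered by the orchestration rather than declared inline:

```sql
-- Repoint the speed-layer external table at the newest file
DECLARE @latestFile NVARCHAR(400) = '/raw/corporate/salesfeed/sales_13.csv';
DECLARE @sql NVARCHAR(MAX);

IF OBJECT_ID('SpeedSales') IS NOT NULL
    DROP EXTERNAL TABLE SpeedSales;

SET @sql = N'
CREATE EXTERNAL TABLE SpeedSales (
    ProductName NVARCHAR(200),
    SalesAmount DECIMAL(18,2)
)
WITH (
    LOCATION    = ''' + @latestFile + N''',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = CsvFormat
);';

EXEC sp_executesql @sql;
```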
So that’s now pointing in a very specific file. I’m going to create my external table – there we go, so that’s now access to that file that’s updating as we speak. So each time I run this query, I get slightly different figures going through as I get more and more products.
So that means I can then have this final query, which is the union of the two. Finally, we’re at the point where we’ve got something over our batch layer, something over our speed layer that’s currently in the middle of updating, and I can do a union of the two.
So I can get my sales amount to just prove the point. We can say what’s my current sales amount by pulling from those two. So it’s 24 million? Yeah, 24 million, with 949 at the end. Do another SELECT and we should see that that number has changed. We’re reflecting – there we go, so it’s now 24,490,818. So every time I refresh this query, that’s going back to my flat files, it’s doing a combination of the batch processed one that I can do loads of validation for, I can have my lookups, I can have any extra work I want to do. I could have gone through ten steps to make sure that that figure is absolutely watertight, and I’ve got my lightweight, minimal processing fast layer, which will eventually go through my batch process to catch up.
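The final serving query is just a union over the two external tables – again a sketch using the illustrative names from the snippets above:

```sql
-- Batch layer: validated history. Speed layer: latest file only.
-- Each run re-reads the flat files in the lake via PolyBase.
SELECT SUM(SalesAmount) AS TotalSales
FROM (
    SELECT ProductName, SalesAmount FROM BatchSales
    UNION ALL
    SELECT ProductName, SalesAmount FROM SpeedSales
) AS combined;
```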
So that is real time streaming all the way through, end to end, using Azure and using modern tech, but taking those principles. So we’re going to slip back to some slides now, just to finish things off. I’m glad everything worked; that’s all good. Okay. So pulling all that together, what does it actually look like? This is the big monster slide of doom. You end up with something that looks a bit like that, you have lots of different bits and pieces in your architecture. And that feels really overwhelming for people when you first look at it, because you’re used to having your SQL Server, and there are different components of that that do different things. You’ve got your Reporting Services, your Integration Services, your Analysis Services all sitting on that one install, or at least on different machines but via the same software package.
And we’re kind of doing the same thing, it’s just we’re distributing them. We’re saying have that scale itself, have that scale itself, each one can adjust itself to make sure you are throwing the power and the money, really, at the right place at the right time. You’re not having to buy a massive server that can cope with the one little spike that you happen to do in your overnight processing. You’re tailoring your architecture exactly to the compute that you need.
So, just to talk through quickly, we’ve got Data Factory, Logic Apps, Functions – you know, various different ways of triggering these bits and pieces, which again, don’t talk to each other that nicely yet. But pretty soon we should be able to have Logic Apps triggering a Data Factory, making these things work together so you can have that best-of-both workflow – conditional logic, for-each loops – your Data Factory triggering your big pipelines, you know, throwing in compute where it’s necessary.
So you’ve got those things bringing files from on premise sources, from cloud data sources, landing in Data Lake. Data Lake is your big staging area. Everything from all different places, whether it’s streaming, your different cloud applications, your on premise servers all kind of land in this one central place and keep building up and up and up. Because storage is cheap, you don’t need to say how big it’s going to be, it will just keep growing and growing and growing. Sure, you can archive it and manage your costs, but essentially, you don’t have to manage the technology. It works itself.
You can have Data Lake Analytics doing your batch processing and the unstructured data. If there’s data that you can’t pull into PolyBase because it’s JSON or XML or it’s unstructured entirely, Data Lake Analytics can read it and can pull it through and sanitize it so it’s more available. So Data Lake Analytics could be your entire processing engine, or it could be your preprocessing engine. Then you can pull it into SQL Data Warehouse, that’s your big querying engine. That is your relational store. Now it might be you just go straight into there, it might be that you go through other steps, but essentially that’s where you can give your SQL analysts access to all these different tools in a familiar environment.
And then you might have people querying that directly. You may have people going through Analysis Services, Power BI. All these things are quite familiar tools and techniques, it’s just we’ve exploded it, we’ve spread them out into different disparate services. So it’s scary at first, but it’s actually a really nice little architecture that now finally everything talks to each other.
So variants on that Lambda Architecture. You know, you might have just a plain relational batch system. So you’re just loading delta files, here’s my day’s extract of sales, here’s my hour’s input – here’s my data from MDS, going into Data Lake Store, and you pull that into the data warehouse, again via PolyBase. And then you may have procedures that do clean, warehouse, turn it into a Kimball style star. You can still do the same things, the approaches, the data modeling that we’re used to. It’s just that we’re using more appropriate cloud technology behind it.
Again, you can expose that via traditional semantic models. So you might have your star being pulled out into Azure Analysis Services, into Power BI, if you want Azure Analysis Services to be a big core central thing that is accessed by several people. It might be that you go straight to Power BI. You have several options as to how you work, especially with Power BI premium, which is much bigger and you can have bigger data sets than the previous 1GB limit.
You may decide to go a different way. You may decide, who actually needs a database if we can do all this stuff anyway? So you’ve got all your flat files going into Data Lake Store, you’re using Data Lake Analytics, so essentially you’re writing SQL on it – but you could make something like a star schema. You know, you can have lots and lots of files as your fact. You can have dimensions that have gone through lookups, they’ve been cleaned, they’ve been joined to other reference data, they’ve got surrogate keys. And then you can expose that directly in a model, because Analysis Services can now pull directly from Data Lake Store, and into Power BI if you wanted to.
If, for some reason, you hate relational databases, which I don’t but some people do, and you’ve got so many options – it might be that you have a corporate enterprise data warehouse and maybe you need the flexibility of not having to go through the whole dev, UAT, test process every time. And so you can use the approach to spin up small flexible marts and then just see if people use it. Then as people get used to it, as you start to get value, as you start to dig into it, you can go great, okay, let’s formalize that, let’s pull that into a relational engine. You’ve got the flexibility now to be able to take these different approaches and change direction because all of these tools just talk to each other.
And finally, you might have what people sometimes refer to as Lambda Architecture, which is when you’ve got two different directions going at once. So I’ve still got my flat files and Data Lake Analytics is doing some processing, pulling that into SQL Data Warehouse, doing some repeated stored procs. I’m also streaming data into it. I might be streaming that data directly into Power BI in the form of live reports. You don’t actually have to touch your streaming data down to a database. You can use Power BI as a destination straight from Streaming Analytics. So you can get your reports constantly ticking up and up and up, getting that operational reporting. By not having a touch point, that’s getting there as fast as physically possible. But you can have multiple outputs of Streaming Analytics, so it’s going to your operational reports, but at the same time, landing it somewhere it can be pulled in for historical analysis. So there’s a ton of flexibility, there’s a load of different patterns, but the technologies behind it aren’t changing.
So if you’re familiar with Streaming Analytics, if you’re familiar with U-SQL and Data Lake Analytics, which is probably the weirdest one to get used to, and you’re happy with SQL Data Warehouse, essentially as a big version of SQL Server with its own quirks and own limitations, you can make a very advanced, high performance, flexible system out of all of these things.
So, that is the end of everything I wanted to go through. I’m on Twitter, blogs and stuff around. If there’s any more questions, anything you guys want to throw at me while I’m here then have at it.
Eugene Meidinger: A couple of things. There was a question on the chat, and I think I can take a first stab at this. Somebody was asking how does this compare with Streaming Analytics via Power BI? And my understanding is that’s pretty limited because Streaming Analytics with Power BI is just the speed layer. And even then, you just basically kind of get the most recent value, or you can do a little bit of aggregation, but it’s kind of like skimming on top of the water, you really don’t have a lot of control over it, to my knowledge.
Simon Whiteley: Yeah, I’m doing a session about that at SQL Cambridge tomorrow. [crosstalk] A little plug. So essentially, you just get the streaming element. It does have a history layer. You can tell Power BI to retain the history, but then it’s only available through Power BI. So it’s only if you’re using Power BI that you can actually get at that stuff. And it’s limited because the streaming data set in Power BI can’t be joined to any other data sources. So you can’t do your normal Power BI style thing of mashing it up with a lookup table and a fact table somewhere else. You can’t pull these things together, it is literally the one thing you send to it. So if you’re sending it to Power BI, you have to have done all your lookups and all your computation and everything within Streaming Analytics before it gets there. But again, you can do both. You can have Streaming Analytics with two queries in there, some of which is going straight to Power BI, some of which is going and being landed. It’s not an either or, it’s just another piece in the puzzle if you need that real time ticking away updating query.
Eugene Meidinger: Okay, so I had two quick questions myself. One of them is – I can see a potential challenge with trying to combine batch and streaming layers, and that’s dealing with aggregations that aren’t idempotent. You know, where you can run them twice and have the same effect. So sum is pretty straightforward because you just say I’ll just sum it again and it works, but like with average, you’d have to include some extra information to unroll that average. You’d have to include the number of entries. So I guess…
Simon Whiteley: In that case, it’s only because I’ve – in the batch layer, I’ve aggregated. Rather than aggregating and making that summary file, I could have just pulled it into a fact table within the warehouse. In which case, I would be running over the lowest granularity data and building it up from scratch each time anyway.
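One way to handle Eugene’s point about averages: if the batch summary stores the sum and the row count rather than a pre-computed average, the two layers recombine cleanly without re-reading history. A sketch, with hypothetical table names:

```sql
-- BatchSummary holds (ProductName, TotalAmount, SaleCount) per product.
-- Combined average = sum of sums / sum of counts (a weighted average),
-- NOT an average of the two layers' averages.
SELECT ProductName,
       SUM(TotalAmount) / SUM(SaleCount) AS AvgSale
FROM (
    SELECT ProductName, TotalAmount, SaleCount
    FROM BatchSummary
    UNION ALL
    SELECT ProductName, SUM(SalesAmount), COUNT(*)
    FROM SpeedSales
    GROUP BY ProductName
) AS combined
GROUP BY ProductName;
```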
Eugene Meidinger: Right, but I’d imagine you’d still be doing some sort of work, otherwise you wouldn’t really need a batch layer.
Simon Whiteley: Yeah, but that might be figure checking, it might be doing validation, it might be saying if there’s something with this segment or [crosstalk] type that hasn’t been gone through that or it hasn’t been shipped yet, or the ship date was before the billing date, whatever. All of those kind of business checks, yeah, you can do in your batch layer.
Eugene Meidinger: Okay, Brent had a question of what types of companies or data warehouses or business scenarios aren’t a good fit for this? Like when should the red flags pop up and you go maybe this isn’t the right solution, let me back off a bit?
Simon Whiteley: So it depends on which bit, how far down. So the SQL Data Warehouse is probably the biggie. That is the one that costs a fair amount of money when it’s turned on. When you’ve got it scaled up, it can cost as much as 60 pounds per hour, and that racks up in terms of cost. Although the pound is quite cheap now, so it depends. So probably if you’ve got a really small amount of data, you’d probably mix and match. You wouldn’t then use SQL Data Warehouse, and that’s where there’s a bit of a question in the architecture, because you can’t use PolyBase. So you’d end up using Data Factory to pull data from the lake into the SQLDB, I guess. And that’s kind of a grey area. That’s the bit where – you can still use a lot of these techniques, you can still be preprocessing with U-SQL, you can still be getting the data ready. But then you’re going back to a slightly more old fashioned approach: use an ETL tool, pick the data up from there, put it into your SQLDB. So it’s only at real low volumes that it starts to fall down on the processing side.
But Streaming Analytics, you can be throwing a message a minute, a message an hour and it will still work. You’re not really getting the benefit from it, but none of these tools, except SQL Data Warehouse, really has a barrier to entry. I think all of them, you can just keep going up and up and up in terms of scale. You can be getting thousands and thousands of messages a second through into Event Hubs. You need to do some more configuration. When you get to the high levels, you need to start partitioning, so you need to be saying that goes into that bucket and that goes into that bucket, and then you can build streams that are pulling from a different bucket. So you need to design a slightly more advanced system if you’re starting to push the boundaries, but in terms of [inaudible] is when this tech doesn’t scale big enough.
Maybe not, unless you’re getting full insane hundreds of petabytes scale, maybe you’d start to hit a limit. But in terms of where the majority of us are sitting, it’s fairly flexible.
Eugene Meidinger: So the other question I had was kind of a career question, and you might be a bit biased because as a cloud architect and a consultant, the more complex things are, the more money you make. But there seems to be this interesting tradeoff where some of the prior technology, when it comes to the data warehousing and databases, was a little bit more like maybe the IKEA model or something, where you’ve got some weird Swedish name and a bunch of instructions and there’s one general way to do it, ideally. Whereas this, it’s a lot more Lowe’s or Home Depot or a hardware store, where it’s like we’ve got 2x4s, we’ve got hammers, we’ve got nails. You can build it any way you want, but you have to figure out which way you’re going to build it. And you showed like four different examples of ways you can go with this kind of stuff. So I guess my question is, how do you feel about the polyglot nature of where data technology, BI technology is going? Where okay, maybe ten years ago you could know SSIS, SSRS, you know, SQL Server, and you were good, and now we’ve got these ten different technologies that build this one solution. And at least in Azure, they’re a little bit more cohesive, whereas if you went the Hadoop route – you talked about big data being like Pokémon, where you’ve got to learn all these names and everything. So what is your feeling about the direction it’s going, where it seems like it’s the smaller, more precise tools – so you’ve got hammers and screwdrivers instead of just a Swiss army knife?
Simon Whiteley: I mean, that is the challenge that – so we obviously faced it as a consultancy. A lot of our guys are SSIS, SSRS, SSAS specialists who are now suddenly going, “I’m going to have to learn a whole raft of new stuff.” But I guess that whole concern, that whole – is it polyglot? How many languages am I going to have to learn? That’s why I originally put this talk together: actually, we’ve now got that end to end – most things are using SQL.
Eugene Meidinger: And a little bit of C#, yeah.
Simon Whiteley: Yeah, I mean if you’re doing it properly, sure you’re starting to learn a little bit of JSON, you know, you need to be familiar with that kind of thing. But then modern tabular, the tabular object model, you know, TMSL, the tabular model scripting language, that’s JSON as well. That’s the way it’s going. So you get more out of this stuff if you can do a little bit of C#, do a little bit of PowerShell, you’re more familiar with that kind of stuff. But if you are literally a data modeler, you’re trying to build BI systems, you can go through all of this writing things that are very much T-SQL-esque. Sure, you’re going to have to keep up with the times, but in the same way as when tabular appeared compared to MDX, all the MDX guys now are probably looking at it going, “Yeah, I should probably learn DAX,” because that’s built into Power BI, it seems to be the way it’s going.
Languages evolve and change, you need to keep on top of the skills as they develop and as the tech changes.
Eugene Meidinger: I think one big last question to kind of wrap things up, as we’re getting close to the half hour. So what are your favorite places to learn about this type of stuff, or where do you suggest people start, or just how to keep up with some of these things or where to go to dive into this?
Simon Whiteley: So there’s a lot to learn. So there are starting to be more and more resources online, Pluralsight kind of places do have courses for some of this stuff now. A little bit of it is a tad cutting edge, so not everything is covered. The big thing for me is – because of the whole nature of Azure, you could spin something up, you can try it and then you can shut it down again. You know, by spinning up a SQL Data Warehouse, you’re not committing to working with the data warehouse for the next few years until you can decommission it, you know.
You can spin one up, have a play for an hour and turn it off. So anyone who’s looking at this – certainly if you’ve got an MSDN account, you get $150 of free Azure credit per month, so you can try this stuff out, you can spin it up, you can get the online docs up, read a few blogs, a few guides, but the best thing is to just try it out. You’ve got 100 pounds a month to try, to sandbox, give it a go, see what you like and what you don’t like. And then if you spend all your money, everything turns off, starting the following month it turns back on again. So the majority of this stuff we’re learning by actually trying to do something, having a challenge. How do I do real-time data? I’ve got an API, how do I get that API into Power BI? Here’s my set of tools, and then just sitting there spinning things up, working end to end, trying to get it working just as a sandboxing environment. And by far that is the way we’ve learned most, just getting hands on and trying it, because you can try it for free if you’ve got MSDN. You can sign up to a free trial and that gives you 50 quid, I think, for a month. So you can at least – even if you don’t have MSDN, you can at least try some of this for free. So yeah, get your hands dirty is probably my advice.
Eugene Meidinger: Right. Adatis, your company, happens to have a blog that you might suggest to people.
Simon Whiteley: Yeah, I should plug ourselves, right. Blogs.Adatis.co.uk and each consultant has got their own blog on there. Mine tend to be SQL Data Warehouse, so I’m MPP heavy in terms of what I was doing. So if you’ve got anything that you want to read, look that way. Obviously, if there’s anything that you think you would like to know and we don’t have a blog about it, just Tweet at me and I’ll kick one of our consultants to write something. So we are fairly open and friendly. The more information people want, the better.
Brent Ozar: Very cool, great session, man. Thank you very much for giving it, Simon, nice job. A big round of virtual applause for that.
- Modern Data Warehousing – The new approach to Azure BI - April 21, 2017