Anyone who’s ever thought it was a good idea to roll their own encryption routines.
SQL Server gives us plenty of options when it comes to encrypting our data. But have you ever wanted to write your own encryption routines, perhaps you think that what SQL Server offers us doesn’t quite fit the bill for you?
I’m going to look into the basics of how encryption works and then we’ll learn how we can go about writing our own encryption routines within SQL Server. When we’re happy that those routines are secure, we’ll look at ways that we can go about cracking those routines.
Writing our own encryption within SQL might sound like a good idea and could even be something that you’ve tried out yourself but there’s a chance you might change your mind when you see how easily amateur cryptography can be broken.
Enjoy the Podcast?
Why I Want to Present This Session:
Brent Ozar: So in this session at GroupBy, David Fowler’s going to be talking about writing your own encryption routines in SQL Server and then breaking them, co-hosted with Daniel Hutmacher, who will be asking the questions on here. So take it away, David.
David Fowler: Thank you very much, and thank you very much for having me at GroupBy. It’s a pretty cool to be here, pretty cool to be talking. So for the next hour or so, we’ve got a little bit of fun with encryption. We’re going to pop the hood, we’re going to get down and dirty with the cogs and the pulleys of encryption, have a little look about how it works underneath. We’re going to have a little look at how we can incorporate – how we can implement that into T-SQL and also look at any vulnerabilities, attack vectors, and how we can look at cracking those routines.
So that’s a good start, my little clicky thing’s not working. The click’s not switched on. That’s it, things work better when they’re switched on, don’t they? If that’s the biggest goof of the day I’ll be happy. So a little bit about me. My name’s David Fowler, I’ve been working as a SQL Server DBA for around about 15 years now. I’ve had a few spells in development. I’ve worked as – I’ve worked in embedded systems written in C, I’ve worked in web apps, written in .NET, but SQL is my love. SQL is my passion, that’s my drive, that’s what I love to do. I blog over at sqlundercover.com, myself, colleague, Adrian Buckman, we have Chris Jones as well who just recently joined us. Just, if we find anything a little bit cool, little bit interesting, we will chuck a blog post up there. Also any scripts that we find a little bit interesting, we think people might find useful tends to go up there as well, so feel free to – feel free, excuse me, to go up there and have a little poke around.
When I’m not nerding over databases, I’m a keen triathlete, I’m a keen cycling, I love racing, I love bikes. I race road bikes, mountain bikes, cycle across, time trial, pretty much anything that’s got two wheels, I’m going to be all over it. If you want to get in touch, you can drop me a line at firstname.lastname@example.org. I’m also on social media, I’m on LinkedIn, I’m on Facebook, I’m not on Twitter as myself, but SQL Undercover has got a Twitter account, so again, if you pop on to sqlundercover.com, you’ll find the links to all of our social media on there, so feel free to get in touch with me how you like.
So the agenda for the next hour. We’re going to start off with a little section that I’ve called The Arcane Sciences. This is just a few fundamentals that we’re going to need to know, we’re going to need to get our heads around before we can start looking at encryption itself. We then got our first algorithm, the first cypher, which is the COR Cypher. We’re going to have a look at the Caesar Cypher. We’ve got something that I’ve called the Caesar-XOR collaboration. A little section called Pimp My Cypher, and we’re going to finally finish up with what I’ve called Shifting the Key.
So before we do get started, there is a little bit of a disclaimer. Basically, all the encryption routines in this presentation are purely for demonstration purposes only. That means that myself, SQL Undercover, or anyone else I might be associated with cannot be held liable for any damages that may be arising from using them. So basically, if you take any of these algorithms, use them for production, someone comes in, nicks all your top secret data, you have been warned, don’t come crying to me.
So that brings us on to the first section, The Arcane Sciences. So what are The Arcane Sciences? What is this ancient knowledge that is known to scholars of your way, way back when and long, long forgotten by the modern DBAs? We’re talking about a couple of things. We’re talking about binary, and we’re talking about bitwise logic. Now, we need to get our head about these things because a lot of encryption algorithms work at a binary level. So binary, what is binary? Now, ask someone that question, ask someone, “What is binary?” A lot of times, they’re going to turn around and say, “You know, it’s some funky code that computers use and it’s made of ones and zeros.” Well, that’s kind of true, but what really is it? What really is binary? Binary is what we call a base two counting system. Each bit, bit, being short for binary digit, can have one or two values. It can be a one, or it can be a zero. Each bit represents a power of two, and most of us know that eight bits form one byte. But what is it? How do you go about reading binary? What do all those little ones and zeros really mean? Well, to understand binary, we need to first have a little think about and understand how our everyday counting system works, how decimals work, or base ten. Now, in decimal, each digit can have one of ten values. It can be a zero, a one, a two, a three, a four, a five, a six, a seven, an eight, or a nine. That’s great for representing numbers through naught and nine, but what happens when we need to represent a number bigger than nine? A ten, eleven, a twelve? We haven’t got symbol for ten, eleven and twelve, so what do we do? Well, we just simply add another digit to the left. So that means that every digit is going to be worth ten times the digit to its right. That little table there just shows us the value or what each digit represents, so we can say over on the right hand side, the least significant digit position, that’s going to represent the ones. The next digit represents the tens, the next digit the hundreds, then the thousands, the ten thousands, and so on.
So let’s just think about the number, 3578. Think about how that’s represented. Have a little look at the position of each of those digits. The first digit, over from the least significant digit position, the eight, that’s going to be worth one. Well, that’s in the one’s position so we know that is going to be worth eight times one. The next digit, the seven, that’s sitting in the 10’s position so we know that’s going to be worth seven times 10. The next digit, the five, in the 100’s position, that’s going to be worth five times 100, and finally, our last digit, the three, over there in 1000’s position, is going to be worth three times 1000. So to understand the value of that number, all we have to do is add together the value of each of those individual digits. In this case, eight plus 70 plus 500 plus 3000, gives us 3578. Now, this is basic stuff, we all know this stuff, we learned this when we were five years old back in primary school. So how does this help us with binary?
Well, binary works in a very, very similar way to our base ten counter system. Instead of ten possibilities of values, binary is going to give us two. It gives us a one and it gives us a zero. This means that each bit is going to be worth two times the bit to its right. So little table there will show us the position of each bit and its value. So we can see over there on the right hand side, the position of the least significant bit, that is going to be worth one. The next bit, that is worth double the bit to its right, two times one is going to be two. The next bit, the third bit, again, double the value of its bit to its right is going to be worth four, then we have eight, 16, 32, 64, 128, and so on. So let’s take a look at the byte 11010010. And think about how we’re going to work out what number that byte represents. The first thing we’re going to need to do is just look at the position of each bit. Now, in binary, as with most counter systems, the zeros are worth nothing. So we can kind of ignore those. The only things we’re really interested in are the ones, or more specifically, the position of those ones. Now, we can see the first bit we come to working from the right hand side, the first number one we come to is in the two’s position, so we know that bit is going to be worth one times two, or straight up two. The next bit we’ve got, the next one, is sitting in 16’s position, so that’s going to be worth one times 16. The next one in the 64’s position is going to be worth one times 64, and the last one we’ve got over there on the left hand side in 128’s position is going to be worth 128. So to work out the value of the byte 11010010, all we need to do is add together the value of each of those individual bits. So in this case, two plus 16 plus 64 plus 128 is going to give us 210. And that is binary. Binary really is that simple. And then before I go on, I just want to show you a little cool little thing that not everyone knows about, and that is the old Windows calculator. Now, Windows calculator has got a few different modes. We’ve got a standard mode, we’ve got a scientific mode, and we’ve got a programmer mode. In that programmer mode, we can select hex decil, decimal, octal or binary. If we set decimal and just pop in a number, 210, what this calculator will give us the value of 210 in hex decil, in octal, and in binary. We can do the same with binary. If we just clear that and bash in some binary number, we can see that value of that binary number in hex decimal, decimal, and octal. So it’s just a quite cool little tool if you need to convert back and forth between binary and decimal.
So with binary nailed, we can look at bitwise logic. So what is bitwise logic and how does it differ from Boolean logic? Now, we’re all familiar with Boolean logic, we use it every single day. We use it in our where clauses, our if statements, our case statements. Basically, Boolean logic acts on two Boolean values. We if we take a look at that example where clause where customer location equals England, and purchases greater than 20. We can see there we have two Boolean values. Customer location equals England is going to be true or false, it’s going to be a Boolean value. Purchase greater than 20 is going to be a Boolean value, it’s going to be true or false. So those two Boolean values we’re applying the Boolean operator and. That and, as we all know, is going to look at those two Boolean values, it’s going to evaluate them and it’s going to give us back a single result in Boolean value. And Boolean logic gives us three operators. Strictly speaking, it gives us four, there is not as well, but for just today we’re going to ignore not. We’re not interested in NOT. We’re only interested in AND, ORs and XORs, or the exclusive or. Now, I did have a little demo showing you Boolean logic but I’m going to assume that everyone knows how Boolean logic works so I’m just going to skip straight through that and get on to bitwise logic.
Now, bitwise logic has very, very similar operators to Boolean logic. In bitwise logic, we have a bitwise and, which in T-SQL is represented using the ampersand. We have a bitwise OR, the pipe, and we have a bitwise XOR, the carrot. So bitwise logic works with individual bits of a value. So I’m just going to skip over to SQL and I’m going to show you bitwise logic in action. Just log in to my machine, and let’s just zoom this in so you can see a little bit better. So to apply bitwise logic in SQL, we can do something like this. We can say our bit of value, our bitwise operator, in this case, an AND, and our second bit of value. So to run this select statement, all we’re going to do here, we’re just going to build up a bit of truth table really for the bitwise AND. We’re going to look at a bitwise AND, apply to a one and one. A bitwise AND, apply to one and zero, and a bitwise AND, apply it to a zero and a zero. If we run that, we can see that this works in a very, very similar way to our Boolean AND, where both bits are a one, or true, we’re going to get back a one. Where one bit or the other is a zero, we’re going to get back a zero, and if both bits are zero, we’re going to get back a zero. So that’s a bitwise AND. What about a bitwise OR? Well, let’s do the same thing. Let’s build up this truth table. If we do that again, we can see this works in the exact same way as our Boolean OR. So where both bits are a one, we’re going to get back a one. Where one or the other is a one, we’re going to get back a one. And where both are zero, obviously, we’re going to get back a zero. And finally, the XOR, the exclusive OR, and you’re not going to be surprised to learn that once again, the bitwise exclusive OR works in exactly the same way as the Boolean XOR. So remember with a bitwise XOR, we’re only going to get a one back where one value or the other value is one, but not both. With an exclusive OR, the two bits, the two values are mutually exclusive. So that is bitwise logic, working on individual bits. What about if I want to apply a bitwise logic to something bigger? Maybe a byte? Well, we can do something like this. Here, we’re going to say apply a bitwise AND to 220 and 177. What’s that going to give us back? What result are we going to get back from this? Well, let’s find out. SQL has given us back 144. And where’s it got that from? How’s it figured out 144 from 220 and 177? Well, to see what it’s doing, we need to look at each of those numbers in their binary form. So 220 in binary is 11011100. 177 in binary is 10110001. So how do we go about applying a bitwise AND to those two bytes? Well, remember what I said early on. Bitwise logic works on the individual bits of the values. So we’re going to have to work through this bit by bit by bit. We’re going to start over on the right hand side, we’re going to start with the most significant bit. We’re going to apply a bitwise AND, the most significant bit of 220, and the most significant bit, its corresponding bit in 177. Do that, we can see the most significant bit of 220 is a one, the most significant bit of 177 is also one. We know if we apply a bitwise AND to a one and a one, we’re going to get back one. Once we’ve got that, we can just move on to the second bit. The one in 220, its corresponding bit in 177 is a zero, bitwise AND one and zero is going to give us a zero. The third bit, a zero and a one will give us a zero, the fourth bit, a one and a one is going to give us a one, the fifth bit, a one and a zero, the sixth bit, a one and a zero, the seventh bit, a zero and a zero, so remember a zero and a zero bitwise will still give us a zero, and last but not least, the least significant bit in 220 is zero, a one in 177, will give us a zero. So applying a bitwise AND to 11011100 and 10110001 results in the byte 10010000 or 144. And that’s bitwise logic, that is how bitwise logic works.
Again, let me just show you something I quite like with the old Windows calculator again. So let’s clear this out. Now, the Windows calculator can perform bitwise logic for us. Up top here, we’ve got a few buttons here, one for each of the bitwise operators. So let’s just try popping in 220 and 177, hit equals, and out pops 144. So again, it’s just quite a neat little tool if you have to be working at this level with bitwise logic or converting decimal into binary and so on, it’s quite a neat little tool to have.
So, now we’ve got bitwise logic nailed, we can move on to our first cypher. Our first cypher is the XOR cypher. Now, the XOR cypher is used in many, many, much more complicated algorithms. It’s using AES, it’s used DES, it’s using all the derivatives of DES. To encrypt, using the XOR cypher, all we have to do is apply a bitwise XOR against some key of some description, and our plaintext, our unencrypted data. To decrypt, we simply reverse the process. We apply an XOR to our cypher text, our encrypted data, and our key, the same key that was used to encrypt that cypher text. So hop over to SQL, we’ll have a little look at that in action. So the first thing – hang on, I’ve jumped ahead of myself, let’s go back to my thing because I have jumped over myself. Okay, so let’s just look at how we’re going to perform this before we jump into SQL. So for example, we want to encrypt the plaintext GroupBy. Now in SQL, T-SQL can only apply bitwise logic to certain data types. We can apply it to a BIT, an INT, a SMALLINT, a TINYINT, or a BIGINT. We can also apply it to binaries and VAR binaries, but there is a slight caveat with this. We can’t apply bitwise logic to a binary in a binary. If we want to use a binary, we have to apply bitwise logic to a binary and either an INT of some description, may it be INT, SMALLINT, TINYINT, BIGINT, or a bit. I don’t know why this is, I don’t know why we can’t apply a bitwise logic to a binary and binary, there might be a very good reason within SQL for that, I have no idea why that is. If anyone is more clever than me on Slack knows, feel free to let us know.
We also in T-SQL cannot apply bitwise logic to textual data, to characters. Some languages will let us do that, T-SQL won’t, and this is a slight problem for us because we want to encrypt the plaintext GroupBy. That’s textual data. How are we going to do that? Well, one thing we’re going to need to do then is somehow convert that textual data into numeric data. How can we do this? Well, one way we can do it is to use the ASCII code of each character. So the ASCII code will give us a bunch of TINYINTs, one TINYINT for each character of our plaintext. With that, we can then just apply an XOR, an 8bit XOR to each byte of our plaintext, each of those ASCII codes. That will give us some encrypted ASCII code, which we can convert back into its character representation and that will reveal our cypher text, our encrypted data.
So now I can jump over to SQL and I can show you some action. So the first thing we need to do is somehow get the ASCII code for our first character of our plaintext, the G. Now, SQL gives us this really beautiful little function, ASCII. Now, ASCII will take in a character and it will spit out for us that character’s ASCII code. So let’s just have a little look at that. We run that, we can see the ASCII code for an upper case G is 71. You can check that against your ASCII table, you’ll find that that is absolutely bang on the money. So now that we know that, now that we’ve got our ASCII code, all we need to do is apply an XOR to that ASCII code and some key value. This key value can be any 8bit number. We’ve just chosen 123 here just because that’s what I’ve decided to choose. If we do that, if we [inaudible] G, using the key, 123, it encrypts as an angled bracket. Now, obviously, if we change the key, we will find that our encrypted character will also change. That’s great, that is really great. But how do we go about decrypting this? Well, decrypt, we just do the reverse. We will apply our key, the key that was originally used to encrypt the data, to our encrypted cypher text, do that, and our will pop G. Obviously, if we don’t have the right key, we’re not going to get back the right plaintext.
So we can put that together in to a little routine, at a proc. Okay, so we’re going to create a proc. Now, once I’ve done this, I’m going to stick all of these demo scripts, the slides are going to go up on my GitHub site so you can download these and play with them as you like. They’re not there at the moment but this afternoon at some point I’ll do that. So we’re going to create a little proc here, we’re going to call it encrypt data XOR. Now, let’s just change my database because I don’t want to put this in my master database. So this proc is just going to loop through every single character in our plaintext and apply our key using the XOR algorithm. So let’s just create this proc, in it goes, and so to encrypt the data, the plaintext GroupBy, we’re going to just plug that into our proc, we’re going to plug in the key, 23, and out will come some encrypted cypher text. To decrypt, we just do the reverse. We plug in our cypher text, we plug in the key that was used to encrypt that cypher text, run the proc, and out will come our now decrypted plaintext. Obviously, if we have the wrong key, if we don’t have the right key, we’re going to get back a load of old gobbledygook.
So we’ve done it. That is how we can now securely encrypt all of our data. We do that, we know now the next time someone hacks us, our data is safe. Well, someone has hacked us, some dirty little thief has come in to our system, they’ve taken away all of our top secret sensitive data. But we don’t care do we? We don’t care about that because we know it’s encrypted. We know there’s no way that they can read that. Well, I hate to burst your bubble here, but they have. Somehow, they’ve managed to decode all that data, they’ve managed to crack our cypher. How have they managed to do this? Well, our cypher, our XOR cypher has got one very, very big weakness. Just have a think for a minute what that weakness may be. That weakness is its key, or more specifically, the length of its key. We used a single byte 8bit key. Now, a single byte 8bit key has only got 256 possible key values. That leaves is very, very vulnerable to what is known as a brute force attack. With a brute force attack, all the attacker does is simply try every single possible key value until they find one that works. It’s about as subtle as subtle as David Banner in a bad mood, but it works quite nicely and well against weaker algorithms.
So let’s have a little look at a brute force attack. So we’re going to run this little section of code. Now, this section of code we’re going to pass in our encrypted cypher text and we’re just going to spin through, we’re going to loop through every single possible key value, and remember, because we’re using an 8bit key, we know the maximum key value we’re going to have is 255. The biggest number you could have in an 8bit byte. So if we run that, we get back a load of possible plaintext values. All we have to do now is scroll through these plaintext values until we find something that makes sense, and there it is. There is our original plaintext now decrypted for the world to see. And we also now know the key, and this is a critical thing. We know the key that that data was encrypted with. So any other data we have encrypted using the key of 23 is now compromised. So that is what my junior DBA would call a copious conundrum. How do you go about protecting against a brute force attack? Well, there’s a couple of things that we can do here. We could make a longer key, or we could try and increase the complexity of our algorithm.
Very, very few modern algorithms rely on a single round of encryption. The more rounds of encryption we use, the more complex our algorithm is going to become, and more difficult it’s going to be to crack. So before we think about any more, I just want to have a look at another cypher. I want to have a look at the Caesar cypher. Now, the Caesar cypher was allegedly invented by the Roman emperor, Julius Caesar. He was a clever guy. To encrypt using the Caesar cypher, all we do is just shift each character of our plaintext by a set number of characters in the alphabet. So for example, if you want to encrypt SQL Undercover, we could just shift each character in our alphabet by two, that would give us the cypher text, USNWPFGTEQXGT. So just to see how we got to that, let’s take a look at the position of S in the alphabet. We shift by two, that will give us a U. So how can we go about implementing the Caesar cypher in T-SQL? Well, it’s very, very simple. All we need to do is add a numeric key to a character’s ASCII code. Simple. With 256 possible ASCII codes, that gives us 256 possible shift positions. Hang on a minute, hang on a minute. With 256 possible shift positions, this is no more secure than our XOR cypher was. It’s going to be just as vulnerable to brute force attack, this is not going to be good enough. I’m sorry Julius, I’m sure 2000 years ago this was fantastic, this was great for parse and encrypted all your armies, but today it’s just not going to cut it. But what about if we were to combine the XOR and the Caesar cyphers?
We could then perform maybe an XOR cypher followed by a character shift. Our cypher will suddenly become much more complicated, much more complex, and much more difficult to crack. So the XOR cypher has 256 possible key values. It’s weak, we know it’s easy to break, we can brute force that no worries. The Caesar cypher, 256 possible shift positions, that’s no better than our XOR cypher. But just think about these numbers. 256 possible key values, 256 possible shift positions. If we were to combine those two cyphers, that would give us 65536 possible shift key combinations. So by combining these two relatively weak cyphers, we can make something much, much stronger. Now okay, I’ll admit to you, 65000 off shift key combinations in a world of modern computing is not going to be difficult to crack. That’s going to be fairly easy to brute force, but for the sake of this demo today, I’m not going to sit here scrolling through that many possible plaintext values, so we’re going to say that 65536 possible shift key combinations is good enough, that is going to be resistant to a brute force attack.
So let’s just hop over to SQL, and we’ll have a little look at this in action. So the first thing we need to do is get our ASCII code and perform a shift on the ASCII code. Essentially, move the G two places along in the alphabet. We do that, we end up with an I. So A, B, C, D, E, F, G, H, I. To decrypt the Caesar cypher, we just reverse it. We’ll do a reverse shift on our cypher text, we do that and out pops the G. So how do you combine the Caesar cypher and the XOR cypher? Well, quite simple. We’ll first perform a Caesar cypher, that will give us some encrypted ASCII code. We’ll then apply a bitwise AND to that – a bitwise XOR, sorry, to that, with some key value. Doing that on a G, will give us the character two. Now, obviously if we change the shift, our cypher text will change. If we change the key, our cypher text will change. So to decrypt, to decrypt we just turn the whole thing on its head. We get the ASCII code of our cypher text, we apply a XOR encryption to that, we then do a reverse shift, doing that on a character two, gets us back to our original plaintext. Again, we can pop that into a little proc. This time we’re going to pass in our plaintext as before, we’re going to pass in our key as before, we’re also now going to pass in a shift position, and because the encryption decryption algorithms are slightly different, we can pass in a decrypt flag. This is going to tell us whether it’s encrypt or decrypt, so which algorithm we’re going to be using.
So let me just run that in. So as you can see, exactly the same as before, we’re just going to loop through each character in our plaintext, we’ll apply the relevant algorithm to that, which will build up our cypher text. So let’s just run that in, and try encrypting the data GroupBy using the key of 210, the shift of two, and because we’re encrypting, the decrypt switch is switched to zero, which will give us some cypher text back. To decrypt, we just do it the other way round. We pass in our cypher text, we use the same key that was used to encrypt, the same shift value that was used to encrypt, but this time because we’re decrypt, we switch the decrypt switch to one, and out pops our decrypted plaintext. Now obviously, once again, if we don’t have the correct key, we’re not going to get back the correct values, and the same is true of the shift. So now we have it, there we have it. We can now securely encrypt all of our data so that it is now resistant to a brute force attack. So now next time, our dirty little thief comes in, steals all of our data and goes running of with it, we can laugh at it. We can laugh at him, we can point at him because we know he can’t read any of our data.
Well, he has hacked us again, he has got our data, and he has somehow once again, managed to decode our cypher text. He has cracked our cypher yet again. How has he managed to do it this time? Well, the XOR-Caesar cypher has got one big weakness. What is that weakness? Have a little think what that may be. It is a straight substitution cypher. Now, what that means is that every unencrypted character’s encrypted counterpart will always be the same. So if we hop over to SQL I’ll show you what that means. Let’s try encrypting the text AAABBBCCC using the key 113, a shift of two. If we do that, we can see that A always encrypts as two, B always encrypts as five, C always encrypts as 4. Let’s just pop ABC on to the end of that, let’s run that again, and again, we can still see A is two, B is five, C is four. That is a problem for us. It is a problem because it means if we can work out what an encrypted character represents, we will know all the occurrences of that character in our cypher text. Substitution cypher are very, very vulnerable to what is known as statistical attacks. Statistical attacks also known as statistical analysis or frequency analysis, work on the fact that certain letters and combinations of letters in English language occur more often than others. So by analyzing the cypher text, we’ve got a pretty good chance at being able to identify these common patterns. That little table there shows us the most frequently occurring characters in the English language. So we can see that E, E shows up around about 12.02% of the time. A T shows up a little over 9% of the time. An A, 8.12% of the time. And again, down at the bottom end of the scale, we can see a Z. A Z is only going to rock up around about 0.07% of the time, a J, 0.1%. So we can use these sorts of numbers, these sorts of statistics to guess, to try and predict what each character in our cypher text represents.
So let me again, just pop over to SQL, and I’ll show you this in action. So first of all, we’re going to try and encrypt this nice chunk of data here, this nice chunk of text. Now, this is just something I’ve grabbed off the BBC News website the other day. So if we encrypt this, we get back a nice big chunk of fairly unintelligible cypher text. How do we now go about analyzing this? How do we go about breaking this down? Well, the first thing that we need to do is analyze the frequency of each character. Count up how often each character appears in that cypher text. So we’ve got a little bit of code that will do that for us. First, we’re going to create a temp table character frequency, to hold these frequencies for us. Let me run that, and then we’re going to run a little bit of code now which is going to populate that for us. This is going to go through each of our cypher text which I have plugged in here, and it’s just going to count up how often each character appears. So let’s just do that now. And just select my character frequency table, just to see what we’ve got. So straight away, we can see that certain characters are showing up more often than others. Just to make things a little bit easier, we’re going to fill out that frequency rank row. Again, I’m not going to go through all the code and how it all works because I say, it’s all up on GitHub, so if you do want to have a poke around, a play around with it afterwards, jump up there, grab it down and you’re free to do so.
So now if we select from our character frequency table now, we can clearly see what the most commonly occurring characters are, and those that show up not very often. Knowing this, we can now start analyzing the character, we can start making assumptions and guessing on what each of those characters is going to represent. Now, one thing we do know is that the most commonly occurring character in a big old chunk of text like that is always going to be the space. And you can see, sitting up here, number one, the open bracket which shows up 267 times, almost double the amount of times our next most frequently seen character appears. So we can make a pretty good guess then that that open bracket represents a space. Now, a lot of algorithms actually get rid of spaces or pad them out or do something with them just to stop this, but just for today’s demo, I haven’t done that. So knowing the space, we can now just update our frequency table, so we can now see that our first character, the open bracket, the predicted character is now – that is a space. And for now it’s very, very simple for us now to just take our original cypher text and just replace each occurrence of that space, of the open bracket in our case with a space. If we do that, it then suddenly makes it very, very easy for us to be able to recognize and be able to identify individual words in our cypher text. Knowing individual words lets us now perform a bit more analysis. We can start looking for common patterns. One thing we all know is that the most commonly occurring single character letter in the English language is going to be an A. So all we need to do now, now that we know spaces, we can have a little look through, we can count up every occurrence of a singularly occurring letter, which we’ll do with this little bit of code here. We run that, we now select from our character frequency chart, down at the bottom, we will see each occurrence of a singularly occurring character. So this time, we’ve actually only got one character here that shows up on its own, and that is an I. That shows up seven times. So we can be pretty confident that I represents an A. So again, let’s just update our character frequency table with that. We’ll just clean it up so we’ll just remove all those little patterns we last put in. And now select from the character frequency table, and we can see the I, we predict that is going to represent an A in our cypher text. And we can carry on, and the demo here goes on a lot further. I’m not going to go through the whole demo because it takes time, but we can carry on now. We can look at other patterns we can look for. We know the most commonly occurring three letter word is the, so we can analyze all the three letter words, find the most commonly occurring one, and we’ve now identified a T, an H, and an E. We also know the most commonly occurring three letter word that begins with an A is an and. Well, we know what A is, so it shouldn’t be too difficult for us to find an and. Do that, and we’ve got an N and a D. So this is just the way that these sort of statistical analysis can be done. As I said, I’m not going to go through the whole script, you can grab it off GitHub later on and play around as you like. Again, try using different plaintext and see how it gets on.
So how do we now go about producing something that is resilient to a statistical analysis? Well, to avoid this vulnerable statistical analysis, we need to make sure that a character isn’t always going to be encrypted in the same way. Look what happens when we try and encrypt the string AAAA, as we have been doing using an 8bit key. We start off with our plaintext. Every byte is the same. We use the word encrypting in identical characters. Because we’re using an 8bit key, that means that our 8bit key – the same key is going to be applied to the same character each time. That means that every single character is always going to be encrypted in the same way. How can we get around this? How can we prevent this from happening? Well, what happens if you’re trying to encrypt data using a 32bit key for example? We’ll start out again with the same plaintext, the same four identical characters. But this time, because we’re using a 32bit key, we can encrypt a different portion, a different byte of our key against each byte of our plaintext, and that means that every character, even if they all started out the same, is going to be encrypted differently. This makes it far, far more difficult to carry out a statistical attack. But how does this do against a brute force attack? Well, using a 32bit key, that gives us just over three million possible key values. So I’m going to say that’s doing pretty well. So again, let’s just jump over to SQL and see about how we can build this up.
Let’s get to the bottom of that demo. So again, we’re going to create a little procedure. This time I’m going to call our procedure encrypt data the pimped XOR. We’re going to pass in a key; now this time, we’re going to use a BIGINT. We’re using a BIGINT because before it was a TINYINT, now we’re using a much bigger number. We’re using a 32bit number, so we’re going to need a BIGINT to hold that number. We’re going to pass in our plaintext. Now, instead of looping through every character individually, we’re going to break our plaintext down into four bytes blocks. We’re then going to just apply our key, our four byte key, against that four-byte block, and that will build up our cypher text. So let’s just run this in. And now, let’s see some action, so if we attempt to encrypt the string, AAAA using our now much longer key, before those A’s were all encrypted to the same character. Now with our new pimped XOR cypher, our new longer key, we can see – let me just scroll that up – we can see that each character, each occurrence of that A encrypts differently. To decrypt, we again, we’re just going to plug in our cypher text, we’re going to use the same key that was used to encrypt, and out will pop our now decrypted plaintext.
So there we have it. We have done it. Now we have done it. Now we know we can produce a cypher that is resistant to brute force attacks, that is resistant to statistical analysis, so now as he has done again, that dirty little data thief has come in, stolen all our data, he’s gone and run off with it again, we know we can laugh, we can point at him because we know he can’t read that. He can’t decrypt that using the methods he’s done before. Well, we have another problem. Once again, this devious little fiend has managed to decode our cypher text. He’s read all of our data. He has cracked our cypher. How? How has he managed to do that this time? Well, let me tell you about something called a plaintext attack. In a plaintext attack, the idea is that by analyzing the cypher text and even just part of the plaintext, we can reveal a whole or part of the key that was used to encrypt it. XOR cyphers regardless of the key length are very, very susceptible to plaintext attacks. Let me show you why that is. Let’s start at the plaintext 10111001. We’ll apply the key 10101010 to that. That gives us the cypher text, 00010011. Now, look what happens if we try and put XOR, the plaintext, and the cypher text. We have our plaintext, we have our cypher text. Now XOR those together, and we get the result, 10101010. We’re back, we’ve got our key value out. And that is a massive, massive problem. So if we can somehow discover a portion of the plaintext, we can use that to retrieve the key and crack the rest of the message.
Now, there’s a few ways we could get hold of that plaintext, that portion of plaintext. We may have performed some sort of statistical analysis already, which has revealed part of the text that we can use. Perhaps we’ve managed to intersect part of the original plaintext, or we may have just guessed. Now, guessing isn’t as silly as it sounds either. Imagine some sort of things we encrypt. Perhaps we’re encrypting a name, a first name. Well, there are only so many common first names out there, so we can have a pretty good go at guessing what that first name could be. As I say, yes, so look at the following. The following was encrypted using a 16bit key.
Somehow, we have managed to get first two bytes of the plaintext. All we now need to do is XOR those two bytes of the plaintext against a cypher text, we’ll have the key, we can easily decrypt the rest of that cypher text. We don’t even really need to reveal the whole key. Just a single byte in that key is going to allow us to decrypt half a character in the message. If we can do that, we seriously weaken the cypher. It was a 16bit key, there’s only 8bits left, we could probably perform a brute force attack on that, we might be able to do some other sorts of attacks.
So just hop over to SQL and I’ll show you a plaintext attack happening. So we’re going to create a little proc, we’re going to call it the crack-a-matic. Inside that proc, we’re going to pass some plaintext, we’re going to pass our plaintext and our cypher text. We’re going to build up a four-byte ASCII representation of each of those plaintexts and the cypher texts, and just apply an XOR against the two four-byte blocks. So let’s run this in. And now look what happens when we try and encrypt the data GroupBy using the key value 4027558485. We get back some cypher text as we would expect. Now, our attacker has managed to somehow get hold of the first four characters of our plaintext. He plugs that into the crack-a-matic along with its corresponding characters, its corresponding bytes from the cypher text. Do that, we get back the value 4027558485. We’ve got back our original key, and that is a massive, massive problem because we can now use that key to decrypt the rest of that data. And also, obviously we now have the key value that was used, so any other data that was encrypted using that key value is now compromised. That, in the words of my junior DBA, is a copious conundrum. So how do we go about protecting against that?
Well, we need to go back to a little bit more binary. We need to look at a bit shift. So what is a bit shift? A bit shift is kind of like what it sounds. We simply shift the bits of a byte either left or right by a specified number of positions. So if we were to perform a left bit shift on the byte 10000001, let’s see what happens. Well, the first thing that happens is the most significant bit is dropped off. We get rid of that, that’s gone. We then shift every single bit along by one, and then the least significant bit we pad out with a zero. That’s a bit shift. Circular bit shift is slightly different. What is a circular bit shift? Well, it’s similar to a standard bit shift, except that instead of throwing away that most significant bit, we move it to the position of the least significant bit. So again, let’s have a little look, see this happening.
First of all, let’s most that most significant bit aside. We shift our byte by one bit, and instead now of padding our least significant bit with a zero, we take whatever that most significant bit was, in this case, it was a one, we move it into the position of the least significant bit. So that most significant bit always cycles around. How can we perform this in T-SQL? Well, unlike some languages, T-SQL does not provide us with bit shift operator. So we’re going to have to do it some other way. Let’s just have a little look at what happens when we perform a left bit shift on the value one. Left bit shift will provide us with two. Let’s do a bit shift on the value two, we get a four. Do you notice a pattern here? Every time we perform a left bit shift, the value is doubling. So we can kind of cheat a little bit really. In T-SQL, all we need to do is by multiplying that value by two, we’ll have the same effect as a left bit shift. We’ve effectively shifted all the bits to the left by one.
What about an overflow? Well, remember what I said. With a bit shift, the most significant bit is usually dropped off, that’s got shut off. If we tried doing this in SQL, it’s going to cause us an issue. Either SQL is going to exceed the maximum value of the data type we’re using and chuck us out an error, or the value is going to simply exceed what we’re expecting it to be. So what can we do? How can we check for an overflow? Well, the first thing to do is check the value of the most significant bit. We can use this funky little bitwise logic to do that if we want to, using a bitwise AND, or much simply, we can just check if the value of our number is equal or greater than that of the most significant bit. If that’s true, then we know that most significant bit is going to be a one. If it detects an overflow, we just subtract the value of the most significant bit from the value and then we shift. So a circular bit shift in T-SQL, what are we going to do there? It’s very, very simple. We’re going to first check for an overflow as we’ve already done. If we detect an overflow, we’ll deal with it the same way as described earlier, we’ll shift the bits and then we simply add a one to our number. Adding a one just simply will effectively change that least significant bit to a one. If we do not detect an overflow, if we decide the most significant bit was a zero, we just don’t need to worry about it.
So how can we now make an algorithm that’s resilient to plaintext attack? Well, let’s just look at the following example. We start with the plaintext, 01100101. We’ll apply the key, 11110000. That will give us the cypher text, 10010101. Now, we know if we XOR that cypher text, with that plaintext, we’re going to get back our original key. But what if we were to perform a second round of XOR? What if we were to shift that with a circular bit shift on that key and use that bit shifted key to perform another round of XOR against our cypher text? Well, shift that key by one, we end up with a key 111000001. Apply that to our cypher text, and we get the final cypher text of 01110100. Let’s have a little look at that. The plaintext and the cypher text apply an XOR to those two, and we get 00010001. Now, that does not match our key. So – and this is exactly how algorithms such as DES works. DES does it – there’s a bit more to [inaudible] with DES, but essentially, what DES does is it will perform 16 rounds of an XOR cypher. Each round is performed using a bit shifted key. So the key will either shift one or two bits, depending on what round of XOR it’s on, and out will pop at the end, some cypher text. This makes things much, much more difficult to perform a plaintext attack against. So let’s just hop over to SQL and I can show you that in action here.
So again, we’re going to pass in to our new algorithm – actually, I’ve called this mini DES. Originally I had the algorithm called mini DES before I changed it on the slide. So the demo script’s a little bit out of date, but we’re going to pass in our BIGINT key. We’re going to pass in our plaintext. Now, the first thing we now need to do is produce two keys. Now, the first key which will be used in the first round of the XOR is simply going to be the key that we pass in. The second key, we are going to just perform a bit shift against. We then just loop through each four byte block from our plaintext as we did before. First of all, we apply a first round of XOR. So that will be using our original key. We then, against the encrypted data we get back from that, will perform a second round of XOR using our bit shifted key to give us our final cypher text. So let’s just create this proc, let’s just try encrypting some data. So we’re going to encrypt the data GroupBy, using the key 4027558485. And we can just decrypt the same way, passing in our cypher text and the key that was used the encrypt that cypher text, and out will pop our original data, our original plaintext GroupBy.
But what about if we want to perform a plaintext attack at this? Let’s try. Once again, our attacker has managed to get a hold of the first four characters of our plaintext. So he’s then going to plug that now into the crack-a-matic, along with its corresponding characters, its corresponding bytes from the cypher text. If he does that, now, we get back the value 269549310. That does not match our original key value, and if we try and encrypt that cypher text using the key we’ve just got back from our crack-a-matic, we will see we get back a load of old gobbledygook. So by doing that, by shifting the key, we can now build something that is resistant to the plaintext attack.
So that pretty much brings me to the end of the session, but there is one final question. SQL Server gives us some really, really great ways to encrypt our data. We’ve got all sorts of brilliant ways of encrypting data. So what is the point of hand cranking our own routines? Well, that is a good question, and was kind of the point of today’s session. First of all, I hope it’s been quite interesting to see how encryption works inside and how the cogs turn. But also, I would try and make a point that DIY encryption is pretty easy to crack. Unless you really know what you’re doing, all we’ve looked at is some very, very basic, very fundamental attack vectors. There’s much, much more clever guys out there than me that can do much, much more clever stuff, on much, much more complicated algorithms. So if you do decide to hand crank your own routines, think about it, be very, very careful. Just bear in mind that DES, DES as its 52bit key, and its 16 rounds of XOR is now considered insecure. So that pretty much brings me to the end of the session. So thank you very, very much for listening. I hope you did find it useful, and that’s me, thank you.
Brent Ozar: Nice job David, very cool. And of course, as we’re saying in Slack, everybody is pretty much now terrified and glad that they don’t have to write anything for encryption.
Latest posts by David Fowler (see all)
- Writing your own encryption routines in SQL… and then cracking them - August 11, 2017