Encryption, hashing, and obfuscation

“Secure it or lose it,” says PreEmptive Solutions' chief scientist. In a world of scoundrels, hackers, and thieves, protecting data has become an integral part of computing.
Written by Paul Tyma, Contributor
COMMENTARY--I can imagine a world where computers needed no security. Where there were no passwords, no security checks, and no firewalls. Where computers communicated freely and shared information rather than hiding it. I can also imagine the scoundrels of Earth hacking that world’s net and stealing all its credit cards.

Unfortunately, that is our reality. In today’s world of interconnected networks, the rule is: secure it, or lose it. We don’t have the luxury of running insecure networks or exposing insecure data. Worse, it is often our task to sell that data while preventing anyone who has not paid for it from seeing it and profiting from our work.

That’s where encryption, hashing, and obfuscation come in. This article hopes to clear up some misconceptions in the industry as to how these technologies work and how they relate to each other. In practice, they do not conflict at all. Simply put, each works best in a specific problem domain.

Definitions
Let’s start by providing some contextual definitions. Encryption is the act of converting data from an understandable form to a non-understandable one in such a way that it can be converted back with no loss of information. Notice that lossless compression algorithms also fit that description. Although they are not normally thought of as encryption, they do generally remove the understandability of data as they compress. We can add intent to our definition by stating that encryption’s primary intent is to hide data (whereas compression’s primary intent is to make it smaller).

Because of our environment, we often think of encryption as something very strong. Contemporary algorithms are practically uncrackable. That is, it would take millions of years for our best computers to decipher highly encrypted data.

It’s interesting to note that the methodology of encrypting data (or the key used) may not be identical to the methodology of decrypting that data. The important part is that the original message can be obtained in some way. This implies that there is a one-to-one relation between every encryptable message and its encrypted representation. That one-to-one relation is another characteristic of encryption.
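As a minimal sketch of these two properties (a different key for each direction, and a lossless round trip), the following uses the RSA support in Java's standard `javax.crypto` package. The class name, method names, and message are assumptions made for illustration, not anything prescribed by this article.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

// Hypothetical demo class; the names and the message are illustrative only.
class RoundTripDemo {
    // A transformation that Java platform implementations are required to support.
    private static final String TRANSFORMATION = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    static KeyPair newKeyPair() throws Exception {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        return generator.generateKeyPair();
    }

    // Encrypts with the public half of the key pair.
    static byte[] encrypt(KeyPair keys, String message) throws Exception {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, keys.getPublic());
        return cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
    }

    // Decrypts with the private half, which is a different key.
    static String decrypt(KeyPair keys, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, keys.getPrivate());
        return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        KeyPair keys = newKeyPair();
        byte[] ciphertext = encrypt(keys, "Here it is! 40000");
        System.out.println(ciphertext.length + " bytes that mean nothing without the private key");
        System.out.println(decrypt(keys, ciphertext));
    }
}
```

Whatever the ciphertext looks like, decrypting it with the matching private key yields exactly the original message and nothing else.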

A topic related to encryption is one-way hashing. One-way hashing algorithms work in many ways the same as encryption, with a few important differences. As the name implies, these algorithms are one-way. That is, once the data is converted, the hashed version of the data cannot (theoretically) be used to recreate the original data.

Have you ever lost your password and asked a system administrator what it was? System administrators often have access to the password file, but they can’t tell you your password, because one-way hashing is used to protect it. This is quite an appropriate application of these algorithms. After all, the computer doesn’t really care that your password is “snookums”; it only cares that you enter precisely that every time you attempt to log in. Therefore, when you enter your password, the computer can one-way hash it and compare the result to the version in the password file. Since the same input always produces the same hash, and it is practically impossible for two different passwords to produce the same one, a match means you are allowed access.
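The login check described above can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not how any particular system stores passwords: it uses a bare SHA-256 digest from the standard `java.security.MessageDigest` class, whereas real password files typically add a per-user salt and a deliberately slow hashing function. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of a hash-and-compare login check.
class PasswordCheck {
    // One-way hash of the password; only this value would be kept in the password file.
    static byte[] hash(String password) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return digest.digest(password.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    // Hashes the attempt and compares it to the stored hash; the password itself is never stored.
    static boolean matches(String attempt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt), storedHash);
    }

    public static void main(String[] args) {
        byte[] stored = hash("snookums");
        System.out.println(matches("snookums", stored)); // prints true
        System.out.println(matches("Snookums", stored)); // prints false
    }
}
```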

If a hacker obtains access to the password file, all they are greeted with is a collection of mashed data. Their only recourse is brute force: one-way hashing every word in the dictionary and comparing each result to the password file. If your password is “lucky”, it will probably be found via this attack. If it’s “J0E#$Cu.uL”, it probably won’t. It’s important to realize that because one-way hashing is one-way, it is in some ways stronger than encryption. Even though encryption is mathematically sound, if a hacker somehow obtains the decryption key, it’s all over. They have complete and full access to the data. This isn’t true for one-way hashing: there is no key that can help a hacker decipher a one-way hash. Their only option is to persistently guess what the data was and then test to see if they were right.
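The guess-and-test attack is easy to sketch, which is exactly why dictionary words make poor passwords. The following assumes, purely for illustration, that the stolen password file holds unsalted SHA-256 hashes and that the attacker has a tiny word list; the class, method, and word list are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a dictionary attack against an unsalted hash.
class DictionaryAttack {
    static byte[] sha256(String text) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(text.getBytes(StandardCharsets.UTF_8));
    }

    // Hashes each candidate word and compares it to the stolen hash.
    static Optional<String> guess(byte[] stolenHash, List<String> wordList) throws Exception {
        for (String candidate : wordList) {
            if (Arrays.equals(sha256(candidate), stolenHash)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // the hash itself gives no hint about what to try next
    }

    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("password", "letmein", "lucky", "dragon");
        System.out.println(guess(sha256("lucky"), words));      // prints Optional[lucky]
        System.out.println(guess(sha256("J0E#$Cu.uL"), words)); // prints Optional.empty
    }
}
```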

Finally, obfuscation works similarly to the above two concepts, again with important differences. Obfuscation is a term generally applied to the protection of program code, not general data. It has gained interest in recent years because computer programs written in newer languages such as Java and C# are easy to understand even after compilation. Programs written in languages such as C and C++, which are typically compiled into machine code and then manipulated by optimizing compilers, are difficult to follow in that form.

In fact, Java and C# are so understandable after compilation that programs called decompilers exist to automatically reassemble your Java or C# source program from the compiled executables. The problem with this is that anytime you release software written in these languages, it’s painfully easy for someone to reverse-engineer it. From there, they can bypass licensing code, steal intellectual property, or even take pieces out to help construct their own competing product. Certainly, the need for this type of protection is, like encryption, a case-by-case evaluation.

An important restriction placed upon obfuscation is that the data it works upon must remain active. That is, obfuscation attempts to preserve the meaning of data for one viewer while shrouding it from another. In computing terms, its primary use is the protection of program code. It must retain the execution integrity of that program for a computer, but hinder the reverse-engineering efforts of humans. After all, we are attempting to protect a computer program that will someday run inside a computer. This is the direct reason we can’t simply encrypt computer programs. Surely, encryption would be an excellent solution, except that a computer can’t run encrypted programs. Per the definition of encryption, the program would no longer be understandable (even by a computer).

Solutions do exist to encrypt program data and then decrypt that data immediately prior to execution, but these solutions always suffer from a few classic flaws. For one, before the computer executes the program, the unencrypted program must exist in memory somewhere. It’s not hard for a hacker to take snapshots of that memory. Secondly, any strong encryption requires a key for decryption. Where is that key to be stored? Obviously, if that key is stored somewhere in the program, it will be found and the program will be in naked, unprotected view. Even if it is somehow transmitted with every execution, it can be intercepted.

Encryption has not historically solved this problem adequately. Obfuscation works because it needs no decryption phase. The program is still in plain view; it just means far less to a human. Obfuscated programs can even run faster than their unobfuscated versions, since small optimizations can be applied during obfuscation (and abstraction and indirection introduced by programmers can often be removed).

Removing context
Obfuscation works by removing the context we (and automated decompilers) use to understand the program as a whole. For example, assume you intercepted the following email in a rare, irreproducible email error:

----------------------------------
From: fred@xyz.com
To: Jane@xyz.com
Subject: Here it is!

40000
---------------------------------

Overall, the fact that you intercepted this email gives you nothing. You have full knowledge of the data Fred sent to Jane and you have the knowledge that Fred did send some data to Jane but without any further context, that data is meaningless. That number could indicate a salary, an answer to a math problem, or an order for 40000 widgets.

Conversely, we assume that this same exact message has a great deal of meaning to Jane. She knows who Fred is, and she knows why he sends her numbers. Jane has context, you do not.

In simple terms, this is how obfuscators work. They perform several operations on program code to make it less understandable for humans without removing any of the meaning for computers. Some example obfuscating transforms are:

1) Rename all program identifiers to meaningless names. Things like “getPayroll” become “a”.
2) Apply simple (and fast) encryption to strings embedded in the program.
3) Change the control flow created by compilers. Generally, a Java compiler always outputs for-loops in the same constructive way. Decompilers look for those constructions to recognize the loop. An obfuscator can reorder the compiled instructions so that they still “loop”, but in highly unrecognizable ways. In some ways, you can think of this as introducing “spaghetti code”.
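The three transforms above can be pictured with a small before-and-after sketch. Real obfuscators work on compiled code rather than source, so the "after" class below is shown as Java source only for readability. Every name here, the XOR key, and the state-machine layout are assumptions made for illustration and are not the output of any actual product.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical before-and-after demo; nothing here comes from a real obfuscator.
class ObfuscationDemo {
    // Stands in for the obfuscator's build-time step that encodes string literals.
    static byte[] hide(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] ^= 0x5A;
        }
        return bytes;
    }

    public static void main(String[] args) {
        int[] hours = {40, 38, 45};
        System.out.println(Payroll.getPayroll(hours, 25)); // prints 3075
        System.out.println(a.a(hours, 25));                // prints 3075
        System.out.println(a.c(hide("License expired")));  // prints License expired
    }
}

// What the programmer wrote.
class Payroll {
    static int getPayroll(int[] hoursWorked, int hourlyRate) {
        int total = 0;
        for (int i = 0; i < hoursWorked.length; i++) {
            total += hoursWorked[i] * hourlyRate;
        }
        return total;
    }
}

// Roughly what might ship instead: same behavior, far less context.
class a {
    // Transforms 1 and 3: meaningless names, and the for-loop recast as a state machine.
    static int a(int[] a, int b) {
        int c = 0, d = 0, e = 2;
        while (true) {
            switch (e) {
                case 0: c += a[d] * b; d++; e = 2; break;
                case 1: return c;
                default: e = d < a.length ? 0 : 1; break;
            }
        }
    }

    // Transform 2: strings are stored XOR-encoded and decoded only when needed.
    static String c(byte[] a) {
        byte[] b = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = (byte) (a[i] ^ 0x5A);
        }
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

To the computer, `Payroll.getPayroll` and `a.a` are the same calculation. To a human reader, or to a decompiler looking for a familiar loop shape, the second version says very little about payroll.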

Some obfuscators also introduce bogus code that is cleverly never executed, to throw off would-be hackers. Others remove dead code, achieving surprising reductions in code size.
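One way to picture never-executed code is a branch guarded by a condition that is always false but does not look it. The sketch below is a hypothetical illustration, not a technique attributed to any particular obfuscator: the product of two consecutive integers is always even, so the decoy branch can never run.

```java
// Hypothetical illustration of a decoy branch that never executes.
class DeadCodeDemo {
    static int original(int x) {
        return x * 2 + 1;
    }

    static int obfuscated(int x) {
        // x * (x + 1) is always even, even when int arithmetic overflows,
        // so this condition is never true and the decoy below never runs.
        if ((x * (x + 1)) % 2 != 0) {
            return x ^ 0x1F3D; // decoy: plausible-looking but meaningless
        }
        return x * 2 + 1;
    }

    public static void main(String[] args) {
        for (int x = -3; x <= 3; x++) {
            System.out.println(original(x) == obfuscated(x)); // prints true each time
        }
    }
}
```

A reader, or a tool, that cannot prove the condition is always false has to treat the decoy as real logic.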

In this context, it can be argued that obfuscation is stronger than encryption. Certainly, data with strong encryption is practically impossible to decrypt. However, it also cannot be executed in that form, and if the key is obtained, it is in complete view. Obfuscation is a one-way, lossy transformation that destroys the structure that reverse-engineering programs look for.

The information that existed prior to obfuscation does not equivalently exist after obfuscation. Unlike encryption, and unlike one-way hashing in practice, there is no one-to-one relation between unobfuscated and obfuscated code. Even if a brute force attack were possible, it could find many (and under the right circumstances, infinitely many) possible correct “original versions” of the code. There isn’t enough information to be sure which original version is the right one. Given that even the tiniest logic error in heuristically reconstructed code could crash an application, brute forcing to undo obfuscation isn’t particularly viable.

Protect now, Protect later
Encryption and one-way hashing have been part of passive-data protection for many years. With the advent of dynamically-linked, intermediately-compiled languages such as Java and C#, the research into obfuscation is sure to increase.

Computing leaders saw this problem coming a mile away. Microsoft has recognized it and included a third-party obfuscation product in its Visual Studio .NET development product, giving its developers a solution right from the start. Obfuscation is destined to be added to the post-compilation phase of all future development.

Biography
Paul Tyma is Chief Scientist of PreEmptive Solutions, Inc., a code security company. Paul is a frequent writer for JavaPro magazine, Dr. Dobb's Journal, and Communications of the ACM, and is the lead author of Macmillan's Java Primer Plus. He has spoken at the Software Development conferences for five years and is currently chair of the Java and Security tracks. Paul is the lead architect of PreEmptive's DashO line of products and has developed the engine for PreEmptive's Dotfuscator product. He is currently writing Designing for Performance, to be published by Addison-Wesley.
