loading image... Gichuki P Mwangi

# Hash Function

This might sound like a Twitter hashtag gimmick, but it is not. It is actually nothing to do with social media, at all. When hash functions are mentioned, a lot comes into scope ranging from security to integrity, to crypto. In fact, you use hash functions each day, and you are right now. It is one of the things your computer did while you connected to this blog; a TLS handshake – fancy as it may seem. In this article, we shall properly segment the hash function concept into understandable parts with the aim of building an overall understanding – ‘hash functions demystified‘.

To understand hash functions, one must accept to take in the meaning some few terms which include; functions, map, size and code. All these terms are as used in computer science, not other subjects. A function is a callable unit of code and code is a set of instructions forming a computer program executable by a computer. By ‘callable’, the note here is; a block of code that can be called using an expression. Let us use a simple, universal example, alright?
Please note that you may need to know a bit of programming to understand this. If you don’t, do not worry because this is not a programming article. Check the example below:

# …using python programming language for this example

#Defining the function
define test_function(t):
return t+5

#Calling the function
test_function(t)

Next terms that need understanding; map and size. A map is a data element structure between two distinct data models while size means the number of bits specific data is composed of. With those few definitions, you can now fully understand the meaning of a hash functions.

A hash function is a function that can be used to map data of arbitrary size to data of a fixed size. It could be tempting to leave the definition as is but let us expound on it a little further. Assume you have data (1 GB) in a flash disk, and you decide to use a hash function on it. The product after using the hash function – you get data that is 256 bits. You then get more data (10 GB) and you take it though the same hash – you’ll get the same-size 256 bit output. How this works may seem out of the box but, it really is a wonderful process that is quite simple. This article shall explain just that.

loading image... Gichuki P Mwangi

Before we go ahead, I would like to strictly specify that hashing is not encryption and encryption is not hashing. This is because of two very significant characteristics of hashing; determinism and irreversibility. This basically means that when hashing happens, the same input always leads to the same output and you can never get the original input. Encryption is reversible. If a hash function cannot meet at least these minimum requirements, it is broken. Not useful.

It is also important to properly highlight the properties of a good hash function starting with the two mentioned earlier. These characteristics are dependent on the applications of the hash but are important in all hash functions.
Determinism
For all hash functions; any specific input value must always result in the same hash value. This excludes some scenarios but for the purposes of this article, it shall stand as is.
Randomization
For a good hash function, an input value must be evenly mapped over the hash output range and with the same probability. This is important because guessing the value of the input should prove difficult.
Non-invertibility
Given the hashed value h(x), it would be unfeasible to reconstruct the input x in a reasonable amount of time.

Hash functions can be grouped into three distinct categories based on their purpose: indexing, integrity and cryptographic hash functions. Let us consider each of the categories.

Indexing
This kind of hash is used to improve performance by grouping values of equal or similar parts in collective operations. This in turn improves fundamental tasks such as searching. For instance, each file stored in a computer can be pre-computed into a hash and stored in an array. Once a search is performed, the array is searched instead of the files storage. This reduces the time significantly. These hash functions are generally specialized and have high performance. An example of such a hash is Java’s hash code.

Integrity
Consider a scenario where you are sending files to a pal over the internet. It is however paramount that the files sent be the exact same copy because accidental corruption can occur. One of the ways you could ensure that the file is the exact copy is to send them several times and let your pal confirm that they are all the same. This would be impractical while sending files that are say; 1 TB. Is it even economically feasible to do that?
To counter such a bottleneck, integrity hashes (redundancy hashes) are used. You send a file with its hash and your pal just needs to compare the hashes once he / she computes the hash of the file you sent them. An example of this type of hash is the CRC32. It is important to note that CRC32 is broken.
A redundancy hash has to have a high avalanche coefficient, work with all types of data and have a high performance.

Cryptography
Cryptographic hash functions are very secure and hard to break. Emphasis is put on making sure that they are irreversible and also have a very high avalanche coefficient. Their applications include but are not limited to; password storage and blockchain technologies. They are also resistant to attacks and collision, plus they should be universal in the nature of data they work with.

So, how do hash functions work? 

For us to understand the working of hash functions, we could consider as simple one and you can build your knowledge from there; the mode hash function.

int h(int x) {
  return x % 10;
}

This function maps any integer onto one of the indexes of an array of ten slots. Remember that array indexing starts at zero (0). See the image below.

an array with 10 slots

The % symbol is used to mean modulus as in mathematics – the remainder after division. If you have a number 3 and you divide it by 2, you get 1 remainder 1. This expression can be represented as;
3 % 2 = 1.
This is the logic we shall use with our mod hash function. 

With our function, the divider we shall use is 10 therefore consider a value x. To store it in our array, we store it at index x % 10. 

x = 27 therefore, we store it at 27 % 10
x = 122 therefore, we store it at 122 % 10

Consider searching a value where x = 392 using the same function.
392 % 10 = 2
Therefore, all you have to do is check if the value in index 2 is equal to 392. If it is not, then there is no such value stored in the array.

This example can form a basis for how hash functions work and with it, you can understand truly more complex hash functions – over time.

A hash function alone cannot be used for purposes that are commercially viable. The main reason is because hash functions always take in data at a fixed length. 1 GB of data has to be divided into several chunks for it to be digested by the hash function. To implement this, hashing algorithms are used. There are many different types of hashing algorithms but let us consider one of the most popular; SHA – Secure Hashing Algorithm. From this algorithm, there are different variations; SHA-1, SHA-2, SHA-3… Let us stick with SHA-1.

SHA-1 takes in data in chunks of 512 bits. Consider then an input that is 1024 bits long. The message shall have to be divided into two, then run through the hash function. That is not all though. If the products of the hashing were to be taken as is plainly and output, then they could become compromised. An extra step is added to the process to ensure security and this is called padding. In this technique, the message is divided into data blocks and then the hash function is repeated as many times as the data blocks.

Now you are in the clear, you are wiser.

Gichuki P. Mwangi

A computer scientist with a passion to solve real world, day to day problems using new computer technologies and those already in existence.

Leave a Reply

Your email address will not be published. Required fields are marked *