Password Hashing 101

By Suhana on Dec 14, 2009

It's fairly common practice these days for web sites that require simple user
authentication to store the users passwords in a hashed form rather than plain
text.

This document attempts to show how this process itself works, and discusses the
pros and cons of this technique and suggests alternatives.

To start with, just what is a hash function? Quite simple it's a one-way
process of turning plain text into a "hash" (usually consisting of hexadecimal
characters only) of a fixed length.

crc32( "password" ) = 35C246D5

Now a hash collision is something that happens when too different bits of plain
text, when passed through a hash function result in identical hashes. This is
something we really don't want as it gives the attacker the ability to gain
access without knowing the original password.

crc32( "qzzxipe" ) = 35C246D5

Now, how is an attacker going to know that the text "qzzxipe" produces the same
hash as "password" when passed through our crc32? Simple: tables.

Hash            Text
----------      ----------
E8B7BE43        a
71BEEFF9        b
06B9DF6F        c
...
8CDC1683        x
FBDB2615        y
62D277AF        z
078A19D7        aa
9E83486D        ab
E98478FB        ac
...

In order to generate these tables, the attacker must simple run through a large
loop a bit like an odometer on a car, and for each sample word, pass it through
the hash and store the result. It's then a simple case of looking up the hash
to see what text is needed.

As a side note, while writing this, I was looking for a set of collisions for
the word "password", by checking every possible combination of lowercase
letters. The result:

Searching for a collision for "password" using 'a'-'z'

7 character collision found after 1,739,233,772 comparisons - qzzxipe
7 character collision found after 3,314,480,911 comparisons - bgxayrj
7 character collision found after 6,541,920,606 comparisons - aioqodu
8 character collision found after 12,948,840,982 comparisons - mefxuwoa
8 character collision found after 12,987,867,295 comparisons - nkqhcapa
8 character collision found after 15,007,178,291 comparisons - bwwdbova
8 character collision found after 17,281,565,592 comparisons - scxemxcb
8 character collision found after 22,080,843,462 comparisons - cpmlklsb
...

And how does he know what hash is needed to start with? Well that's down to
your application's security. If he can somehow ask it for a username and hash,
he's then got all the tools at hand to masquerade as that user. This means that
your application will have to be pretty solid to protect against a number of
attacks, including XSS and SQL injection. Not to mention, you should never
store these hashes in cookies, or in a hidden field / comment on the web page.

My search program used only the lower case letters, so it has a character set
of only 26 letters. I could easily create a set of tables containing the hash
and the input required to produce that hash as follows:

Table                   Number of Entries
--------------------    --------------------
one letter              26
two letters             676
three letters           17,576
four letters            456,976

You can see that as the number of letters in the sample input increases, the
storage space required increases .. a lot!

Okay, but just using lowercase letters is not very good, lets extend this idea
to lowercase and uppercase letters.

Table                   Number of Entries
--------------------    --------------------
one letter              52
two letters             2,704
three letters           140,608
four letters            7,311,616

That's getting a bit larger. But what about adding digits?

Table                   Number of Entries
--------------------    --------------------
one letter              62
two letters             3,844
three letters           238,328
four letters            14,776,336

But why are we even bothering about limiting the input? It's not like anybody
will ever see the plain text password, so lets just allow !-~ That's basically
the entire printable ASCII character set from character #33 (exclamation mark)
to character #126 (tilde).

Table                   Number of Entries
--------------------    --------------------
one letter              94
two letters             8,836
three letters           830,584
four letters            78,074,896

That's a lot of entries! Or is it? One of my passwords is 16 characters long,
which in turn would need a table with:

37,157,429,083,410,091,685,945,089,785,856

entries in it!

You might be thinking at this stage, that there is no way an attacker could
create such big tables and you'd probably be correct, however... Lets look at
my example above again:

Searching for a collision for "password" using 'a'-'z'

7 character collision found after 1,739,233,772 comparisons - qzzxipe
7 character collision found after 3,314,480,911 comparisons - bgxayrj
7 character collision found after 6,541,920,606 comparisons - aioqodu
8 character collision found after 12,948,840,982 comparisons - mefxuwoa
8 character collision found after 12,987,867,295 comparisons - nkqhcapa
8 character collision found after 15,007,178,291 comparisons - bwwdbova
8 character collision found after 17,281,565,592 comparisons - scxemxcb
8 character collision found after 22,080,843,462 comparisons - cpmlklsb
...

This was only using the lowercase letters. What happens if we extend the search
to encompass all our printable ascii characters...

Searching for a collision for "password" using '!'-'~'

5 character collision found after 4,359,811,489 comparisons - <Y(oW
6 character collision found after 10,367,151,886 comparisons - q[giF!
6 character collision found after 16,601,948,120 comparisons - CY:\8"
6 character collision found after 19,692,249,221 comparisons - ~Hv4`"
6 character collision found after 20,347,202,700 comparisons - 3eKYh"
6 character collision found after 24,628,881,407 comparisons - `jMJA#
6 character collision found after 25,224,716,054 comparisons - -Gp'I#
6 character collision found after 30,415,206,444 comparisons - !h%U-$
...

You might conclude that limiting the characters that are acceptable would be
better as there appears to be a lot more collisions, however the time taken to
perform this search was a lot longer. Remember for each character, instead of
looking at just 26 possibilites, I was looking at 94 possibilites. You can see
the difference by looking at the number of comparisons performed.

The main thing to note here is that I now don't even require just the 7 letter
table, but the 5 letter tables which are obviously smaller and faster to
generate than the 7 or greater letter tables.

Okay, what can we do about this? Assuming a hacker can:

a) Get a hash from your system
b) Knows the hashing function involved
c) Has a set of tables (or can generate them)

they will be able to get into your system and start masquerading as another
user - maybe stealing information, changeing the way things etc.

Lets see what facilities there are in prevention.

The first thing that can be done is to "salt" the input to the hash function:

crc32( "password" + "salt" ) = 451929C1

The obvious change here is in the resulting value of the hash. This means that
those lovely tables the attacker created are no use as looking up the value of
521C2314 will no longer result in "suhana" being returned.

Okay, this doesn't sound so special, what's the advantage? Well two things
here:

The first is that the attacker doesn't know the salt. So he has no frame of
reference from where to even look. It's not really feasible to construct a set
of tables that are pre-salted as each the attacker has no idea how the salt was
applied, be it to the front of the input text, or the rear, or even
intertwined.

The trouble is that if the salt is discovered, then the attacker can create a
new set of tables, perform the lookup and again he's in.

So how about a two part salt?

crc32( "password" + "salt1" + "salt2" ) = CC6077BE

Why two part? Well this has proved fairly popular and is simple to implement.
The developer picks a random string for salt1 and stores this inside the
application's source. The second salt is a small one (say up to 10 characters)
and is created just before the hash function is called, and stored in the
database.

Okay, so you may say that if the attacker can get a hash, he could probably get
salt2 as well, and you'd be right, however, again he doesn't know where it's
applied.

crc32( "salt2" + "password" + "salt1" ) = 88C33989

How you generate these salts is off course up to you. Personally when I was
using this system, I'd use a cryptographically secure source to generate one
salt which was locked to the application itself. Nobody else could see it and
assuming my server was secure, nobody else would see it. I'd generate the
smaller salt on the fly with just a simple random-number generator.

If you were to use a really simple one lowercase character salt appended to the
password, then the attacker would need to generate 26 new sets of tables in
order to stand a chance of finding a collision. If you use a 63 character salt
using the full ascii character set that would require:

20,279,384,836,208,608,938,076,368,687,847,075,393,042,404,657,
326,807,978,041,929,338,841,131,091,451,128,748,772,209,263,378,
585,156,010,424,226,296,549,588,598,784

sets of tables!

Ideally in fact, you should be using the full range of the character set - so
256 possibilites per character instead of 94.

So at this point you may well be thinking your system is secure. Well before
explain why not, lets just take a look at the actual hashing function.

I've been using crc32 as simply it's the easiest tool for me to produce a
number of collisions in. You will have all hopefully heard of md5, sha1 etc. so
lets take a look at these:

Hash Function       Length of output in bits
---------------     ------------------------------
CRC32               32
MD5                 128
SHA1                160
SHA256              256

CRC32 we can ignore in real situations as collisions can be generated with ease
and very high speed on any hardware.

MD5 itself looks like a much better solution with SHA1 marginally better than
that, but if you have been paying attention to the press you may have heard
that somebody "cracked" md5.

It is now considered possible to generate collisions in md5 very fast. Around
the mid 1990's a collision was discovered using the MD5 hash which prompted
security researchers to suggest an alternative. In 2006, on eminent researcher
produced an application capable of generating a collision on standard computing
hardware in less than a minute!

Okay, so lets up the ante and use SHA1? Well again of you do a little research
you will discover that there are weaknesses in the SHA1 algorithm allowing an
attacker to create a collision in far less time than expected. So again,
ideally, we need to use something stronger.

SHA256 any good? Well it's even stronger but there is still some research
suggesting that a certain type of attack can render this not quite as strong as
initially considered.

Surely, there is something that could be used... Sure there is: SHA1. Yes, I
know I pointed out that research has been done to prove that SHA1 is not secure
from certain types of attacks, however for most common applications it will be
usable.

Okay, but lets be sensible about this. The speed that MD5 can be broken
suggests that perhaps SHA256 would be better as hopefully it will take a few
more years before anybody can generate a collision in a few minutes.

But wait... What about using multiple hashes, or a chain of hashes? So maybe:

hash = sha1( md5( "password" . "salt" ) )

would be better? No, in fact it's worse than the standard sha1 function. Why?
Consider the data going into the md5 routine. It's nicely complex however we've
stated that md5 is thoroughly broken, and given we know the output of md5 is
32 hexadecimal characters we already know how to generate tables for sha1.

As far as I'm aware, it's still not computationally feasible to use tables to
break sha1, however there are algorithmic methods of breaking it and given the
advances in cloud computing for one, and just general computing power, this
will not present a major problem.

The simple rule here is do not try and construct your owh hashing routine as
it's highly unlikely you will succeed in producing anything that is even
close to md5 in strength - and again, as I've pointed out, that is very easy
to break.

So the suggestion may well be:

hash = sha256( "password" + "long-salt" + "short-salt" )

where the "short-salt" is generated randomly just prior to calling this
function and stored along with the resulting hash in the database. The
"long-salt" is something specific to that application and only stored somewhere
in the source code.

By the way, when talking about salts, instead of using just simple ascii-
-printable characters, you should really be using raw binary data. This means
you will have 256 possibilities for every character in your salt!

Of course you still have a problem however. If we go back to what an attacker
needs to break your system:

a) Get a hash from your system
b) Knows the hashing function involved
c) Has a set of tables (or can generate them)

So, (C) we have kind off protected against by using large salts, we can't
really protect against (B) as frankly all the so-called "super-duper" hash
functions I've seen on the net are simple to break as the author hasn't done
the math nor the research to mathematically prove his solution is solid.
There's also the question of using something from the net itself, surely
everybody has seen it? In some cases this can prove fatal. I've seen certain
applications which have large support forums with people banging away about how
great their latest hash function only for a lot of people to deploy it and
wonder how attackers just walk straight in.

What worries me here is that we still have (A). An attacker can still get a
hash from the system. If he can get this, it's entirely possible he can get the
salt, the hash function and any stored salt to boot.

Maybe... it's time to rethink the ideas and use a totally different method.

Now the above hashing principal is fairly reasonable, but if you are
considering it because it's "good practice" or it's been suggestion by a
friend, don't.

What on earth makes you think you have the skills to outwit the talents and
huge computing power available to todays attacker? What happens if some
researcher discovers a hugh flaw in sha256 that allows collisions to be
generated in seconds?

So where do we go for here. Well it's a case of working out why your solution
to the perfect hash is flawed that holds the key. Hashes by their very nature
are generally fast to process the data. And you really want to make things
difficult for the attacker not just today, but at some point in the future as
well.

There's an article:

Enough With The Rainbow Tables:
What You Need To Know About Secure Password Schemes

written by some talented professionals over at Matasano security

http://chargen.matasano.com/

how describe this process in more detail and while I still have some
reservations about the pros and cons of using bcrypt, I suspect that is down to
my own inabilities as a programmer to come up with certain aspects of the
password hashing / verification routines.

/**
 **  Password Hashing 101 - Source Code
 **
 **  Copyright (C) 2009, Suhana
 **  All rights reserved.
 **  
 **  Redistribution and use in source and binary forms, with or without
 **  modification, are permitted provided that the following conditions are
 **  met:
 **
 **  1. Redistributions of source code must retain the above copyright
 **     notice, this list of conditions and the following disclaimer.
 **  2. Redistributions in binary form must reproduce the above copyright
 **     notice, this list of conditions and the following disclaimer in the
 **     documentation and/or other materials provided with the distribution.
 **  3. Neither the name of the copyright holder nor the
 **     names of their contributors may be used to endorse or promote products
 **     derived from this software without specific prior written permission.
 **
 **  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ''AS IS'' AND ANY
 **  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 **  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 **  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
 **  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 **  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 **  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 **  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 **  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 **  OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 **  SUCH DAMAGE.
 **
 **  The following source code was used to generate some of the hashes in the
 **  accompanying article.
 **
 **  Unfortunately, I've misplaced the original author / copyright of the
 **  crc routines so my apologies to them however I believe that given it's
 **  wide usage, it is acceptable to reprint here.
 **
 **  Compile with:
 **
 **    gcc -W -Wall -O3 -o run main.c
 **
 **  Run with:
 **
 **    ./run
 **
 **  Terminate with:
 **
 **    Ctrl-C
 **
 **/

/* main.c */

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "crc32.h"

#define FIRST_CHAR '!'
#define LAST_CHAR  '~'

static unsigned long int crc_tab[256];

void crc32_gentab( void )
{
   unsigned long crc, poly = 0xEDB88320L;
   int           i, j;

    for (i = 0; i < 256; i++)
    {
        crc = i;

        for (j = 8; j > 0; j--)
            if (crc & 1)
                crc = (crc >> 1) ^ poly;
            else
                crc >>= 1;

        crc_tab[i] = crc;
    }
}

unsigned long int crc32( unsigned char* block, unsigned int length )
{
    register unsigned long crc = 0xFFFFFFFF;
    unsigned long          i;

    for (i = 0; i < length; i++)
        crc = ((crc >> 8) & 0x00FFFFFF) ^ crc_tab[(crc ^ *block++) & 0xFF];

    return (crc ^ 0xFFFFFFFF);
}

void find_collision( char* input )
{
    unsigned long      hash = crc32((unsigned char *)input, strlen(input));
    char               word[100];
    unsigned long long compared = 0;

    word[0] = FIRST_CHAR;
    word[1] = 0;

    printf("Searching for a collision for \"%s\" (%08lX) using '%c'-'%c'\n\n", input, hash, FIRST_CHAR, LAST_CHAR);

    while (1)
    {
        if (crc32((unsigned char *)word, strlen(word)) == hash)
        {
            printf("%u character collision found after %llu comparisons - %s\n", (unsigned)strlen(word), compared, word);
        }

        compared++;

        char* ptr = word;

        while (*ptr)
        {
            if (*ptr == LAST_CHAR)
            {
                *ptr++ = FIRST_CHAR;

                if (!*ptr)
                {
                    *ptr++ = FIRST_CHAR;
                    *ptr   = 0;

                    break;
                }
            }
            else
            {
                (*ptr)++;

                break;
            }
        }
    }
}

int main( void )
{
    crc32_gentab();

    find_collision("password");

    return 0;
}

Comments

Sign in to comment.
Are you sure you want to unfollow this person?
Are you sure you want to delete this?
Click "Unsubscribe" to stop receiving notices pertaining to this post.
Click "Subscribe" to resume notices pertaining to this post.