The MD5 algorithm

The MD5 algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. It can still be used as a checksum to verify data integrity, but only against unintentional corruption.

Like most hash functions, MD5 is neither encryption nor encoding. It can be cracked by brute-force attack and suffers from extensive vulnerabilities as detailed in the security section below.

MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4. The source code in RFC 1321 contains a "by attribution" RSA license. The abbreviation "MD" stands for "Message Digest."

The security of the MD5 has been severely compromised, with its weaknesses having been exploited in the field, most infamously by the Flame malware in 2012. The CMU Software Engineering Institute considers MD5 essentially "cryptographically broken and unsuitable for further use". Despite this known vulnerability, MD5 remains in use.

MD5 is one in a series of message digest algorithms designed by Professor Ronald Rivest of MIT (Rivest, 1992). When analytic work indicated that MD5's predecessor MD4 was likely to be insecure, Rivest designed MD5 in 1991 as a secure replacement. (Hans Dobbertin did indeed later find weaknesses in MD4.)

In 1993, Den Boer and Bosselaers gave an early, although limited, result of finding a "pseudo-collision" of the MD5 compression function; that is, two different initialization vectors that produce an identical digest.

In 1996, Dobbertin announced a collision of the compression function of MD5 (Dobbertin, 1996). While this was not an attack on the full MD5 hash function, it was close enough for cryptographers to recommend switching to a replacement, such as SHA-1 or RIPEMD-160.

The size of the hash value (128 bits) is small enough to contemplate a birthday attack. MD5CRK was a distributed project started in March 2004 with the aim of demonstrating that MD5 is practically insecure by finding a collision using a birthday attack.

MD5CRK ended shortly after 17 August 2004, when collisions for the full MD5 were announced by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu. Their analytical attack was reported to take only one hour on an IBM p690 cluster.

On 1 March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated construction of two X.509 certificates with different public keys and the same MD5 hash value, a demonstrably practical collision. The construction included private keys for both public keys. A few days later, Vlastimil Klima described an improved algorithm, able to construct MD5 collisions in a few hours on a single notebook computer. On 18 March 2006, Klima published an algorithm that could find a collision within one minute on a single notebook computer, using a method he calls tunneling.

Various MD5-related RFC errata have been published. In 2009, the United States Cyber Command used an MD5 hash value of their mission statement as a part of their official emblem.

On 24 December 2010, Tao Xie and Dengguo Feng announced the first published single-block (512-bit) MD5 collision. (Previous collision discoveries had relied on multi-block attacks.) For "security reasons", Xie and Feng did not disclose the new attack method. They issued a challenge to the cryptographic community, offering a US$10,000 reward to the first finder of a different 64-byte collision before 1 January 2013. Marc Stevens responded to the challenge and published colliding single-block messages as well as the construction algorithm and sources.

In 2011 an informational RFC 6151 was approved to update the security considerations in MD5 and HMAC-MD5.

How can I generate an MD5 hash?

java.security.MessageDigest is your friend. Call getInstance("MD5") to get an MD5 message digest you can use.

The MessageDigest class can provide you with an instance of the MD5 digest.

When working with strings and the crypto classes be sure to always specify the encoding you want the byte representation in. If you just use string.getBytes() it will use the platform default. (Not all platforms use the same defaults)

Getting a File's MD5 Checksum in Java

There's an input stream decorator, java.security.DigestInputStream, so that you can compute the digest while using the input stream as you normally would, instead of having to make an extra pass over the data.

MessageDigest md = MessageDigest.getInstance("MD5");
try (InputStream is = Files.newInputStream(Paths.get("file.txt"));
 DigestInputStream dis = new DigestInputStream(is, md))
{
 /* Read decorated stream (dis) to EOF as normal... */
}
byte[] digest = md.digest();

Calculate MD5 checksum for a file

It's very simple using System.Security.Cryptography.MD5:

using (var md5 = MD5.Create())
{
 using (var stream = File.OpenRead(filename))
 {
 return md5.ComputeHash(stream);
 }
}

(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)

How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)

Different results with Java's digest versus external utilities

Got it. The Windows file system is behaving differently depending on the architecture of your process. This article explains it all - in particular:

But what about 32-bit applications that have the system path hard coded and is running in a 64-bit Windows? How can they find the new SysWOW64 folder without changes in the program code, you might think. The answer is that the emulator redirects calls to System32 folder to the SysWOW64 folder transparently so even if the folder is hard coded to the System32 folder (like C:\Windows\System32), the emulator will make sure that the SysWOW64 folder is used instead. So same source code, that uses the System32 folder, can be compiled to both 32-bit and 64-bit program code without any changes.

Try copying calc.exe to somewhere else... then run the same tools again. You'll get the same results as Java. Something about the Windows file system is giving different data to the tools than it's giving to Java... I'm sure it's something to do with it being in the Windows directory, and thus probably handled "differently".

Is it possible to decrypt md5 hashes?

You can't. The whole point of a hash is that it's one way only. This means that if someone manages to get the list of MD5 hashes, they still can't get your password. (Not that MD5 is as secure as it might be, but never mind.) Additionally it means that even if someone uses the same password on multiple sites (yes, we all know we shouldn't, but...) administrators of site A won't be able to use the user's password on site B.

Even if you could, you shouldn't email them their password - that's sensitive information which might remain sensitive.

Instead, create a tool for resetting the password based on a timestamped hash value that can only be used once. Email a url to them which includes that timestamped hash value and at that URL let them change their password. Additionally, if you can, make that timestamped hash expire after a short period (e.g. 24 hours) so that even if they don't change the password (because they don't log in), there's a reduced vulnerability window.

Many systems generate a new, random password and email that to them, forcing them to change it on first login. This is not desirable because someone could change every password in your system by brute-forcing the password reset form.

Technically, it's 'possible', but under very strict conditions (rainbow tables, brute forcing based on the very small possibility that a user's password is in that hash database).

But that doesn't mean it's

You don't want to 'reverse' an MD5 hash. Using the methods outlined below, you'll never need to. 'Reversing' MD5 is actually considered malicious - a few websites offer the ability to 'crack' and bruteforce MD5 hashes - but all they are are massive databases containing dictionary words, previously submitted passwords and other words. There is a very small chance that it will have the MD5 hash you need reversed. And if you've salted the MD5 hash - this won't work either! :)


The way logins with MD5 hashing should work is:

During Registration:
User creates password -> Password is hashed using MD5 -> Hash stored in database

During Login:
User enters username and password -> (Username checked) Password is hashed using MD5 -> Hash is compared with stored hash in database

When 'Lost Password' is needed:

2 options:

or

Generating an MD5 checksum of a file

You can use hashlib.md5()

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the Md5 function:

def md5(fname):
 hash_md5 = hashlib.md5()
 with open(fname, "rb") as f:
 for chunk in iter(lambda: f.read(4096), b""):
 hash_md5.update(chunk)
 return hash_md5.hexdigest()

There is a way that's pretty memory inefficient.

single file:

print hashlib.md5(open(full_path, 'rb').read()).hexdigest()

list of files:

import hashlib
[(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]

But, MD5 is known broken and (IMHO) should come with scary deprecation warnings and removed from the library, so here's how you should actually do it:

import hashlib
[(fname, hashlib.sha256(open(fname, 'rb').read()).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should be at least using SHA1. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib
def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
 for block in bytesiter:
 hasher.update(block)
 return (hasher.hexdigest() if ashexstr else hasher.digest())
def file_as_blockiter(afile, blocksize=65536):
 with afile:
 block = afile.read(blocksize)
 while len(block) > 0:
 yield block
 block = afile.read(blocksize)
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
 for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
 for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.

fastest MD5 Implementation in JavaScript

I've heard Joseph's Myers implementation is quite fast. Additionally, he has a lengthy article on Javascript optimization describing what he learned while writing his implementation. It's a good read for anyone interested in performant javascript.

http://www.webreference.com/programming/javascript/jkm3/

His MD5 implementation can be found here

I would suggest you use CryptoJS in this case.

Basically CryptoJS is a growing collection of standard and secure cryptographic algorithms implemented in JavaScript using best practices and patterns. They are fast, and they have a consistent and simple interface.

So if you want to calculate the MD5 hash of your password string then do as follows:

<script src="http://crypto-js.googlecode.com/svn/tags/3.0.2/build/rollups/md5.js"></script>
<script>
 var passhash = CryptoJS.MD5(password);
 $.post(
 'includes/login.php',
 { user: username, pass: passhash },
 onLogin,
 'json' );
</script>

So this script will post the hash of your password string to the server.

For further info and support on other hash calculating algorithms you can visit:

http://code.google.com/p/crypto-js/

While selecting library it's also important to see if it supports modern frameworks such as Bower, passes jslint, supports plugin model for JQuery or module systems such as AMD/RequireJS in addition to being in active development and have more than 1 contributors. There are couple of options that satisfies some or all of these additional criteria:

Example from CryptoJS:

//just include md5.js from the CryptoJS rollups folder
var hash = CryptoJS.MD5("Message");
console.log(hash.toString()); 

There is a performance comparison between above libraries at http://jsperf.com/md5-shootout/7. On my machine current tests (which are admittedly old) shows that if you are looking for speed Spark MD5 is your best bet (and so is plain JKM code). However if you looking for more comprehensive library then CryptoJS is your best bet although it is 79% slower than Spark MD5. However I would imagine CryptoJS would eventually achieve same speed as it is bit more active project.

Get MD5 hash of big files in Python

Break the file into 128-byte chunks and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 has 128-byte digest blocks. Basically, when MD5 digest()s the file, this is exactly what it is doing.

If you make sure you free the memory on each iteration (i.e. not read the entire file to memory), this shall take no more than 128 bytes of memory.

One example is to read the chunks like so:

f = open(fileName)
while not endOfFile:
 f.read(128)

How to get MD5 sum of a string?

Use pythons hashlib

import hashlib
m = hashlib.md5()
m.update("000005fab4534d05api_key9a0554259914a86fb9e7eb014e4e5d52permswrite")
print m.hexdigest()

Output: a02506b31c1cd46c2e0b6380fb94eb3d

In Java, how do I convert a byte array to a string of hex digits while keeping leading zeros?

A simple approach would be to check how many digits are output by Integer.toHexString() and add a leading zero to each byte if needed. Something like this:

public static String toHexString(byte[] bytes) {
 StringBuilder hexString = new StringBuilder();
 for (int i = 0; i < bytes.length; i++) {
 String hex = Integer.toHexString(0xFF & bytes[i]);
 if (hex.length() == 1) {
 hexString.append('0');
 }
 hexString.append(hex);
 }
 return hexString.toString();
}

MD5 algorithm in Objective C

md5 is available on the iPhone and can be added as an addition for ie NSString and NSData like below.

MyAdditions.h

@interface NSString (MyAdditions)
- (NSString *)md5;
@end
@interface NSData (MyAdditions)
- (NSString*)md5;
@end

MyAdditions.m

#import "MyAdditions.h"
#import <CommonCrypto/CommonDigest.h> // Need to import for CC_MD5 access
@implementation NSString (MyAdditions)
- (NSString *)md5
{
 const char *cStr = [self UTF8String];
 unsigned char result[CC_MD5_DIGEST_LENGTH];
 CC_MD5( cStr, (int)strlen(cStr), result ); // This is the md5 call
 return [NSString stringWithFormat:
 @"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
 result[0], result[1], result[2], result[3],
 result[4], result[5], result[6], result[7],
 result[8], result[9], result[10], result[11],
 result[12], result[13], result[14], result[15]
 ];
}
@end
@implementation NSData (MyAdditions)
- (NSString*)md5
{
 unsigned char result[CC_MD5_DIGEST_LENGTH];
 CC_MD5( self.bytes, (int)self.length, result ); // This is the md5 call
 return [NSString stringWithFormat:
 @"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
 result[0], result[1], result[2], result[3],
 result[4], result[5], result[6], result[7],
 result[8], result[9], result[10], result[11],
 result[12], result[13], result[14], result[15]
 ];
}
@end

SHA1 vs md5 vs SHA256: which to use for a PHP login?

Neither. You should use bcrypt. The hashes you mention are all optimized to be quick and easy on hardware, and so cracking them share the same qualities. If you have no other choice, at least be sure to use a long salt and re-hash multiple times.

Using bcrypt in PHP 5.5+

PHP 5.5 offers new functions for password hashing. This is the recommend approach for password storage in modern web applications.

// Creating a hash
$hash = password_hash($password, PASSWORD_DEFAULT, ['cost' => 12]);
// If you omit the ['cost' => 12] part, it will default to 10
// Verifying the password against the stored hash
if (password_verify($password, $hash)) {
 // Success! Log the user in here.
}

If you're using an older version of PHP you really should upgrade, but until you do you can use password_compat to expose this API.

Also, please let password_hash() generate the salt for you. It uses a [CSPRNG](http://

Two caveats of bcrypt

  1. Bcrypt will silently truncate any password longer than 72 characters.
  2. Bcrypt will truncate after any NUL characters.

(Proof of Concept for both caveats here.)

You might be tempted to resolve the first caveat by pre-hashing your passwords before running them through bcrypt, but doing so can cause your application to run headfirst into the second.

Instead of writing your own scheme, use an existing library written and/or evaluated by security experts.

TL;DR - Use bcrypt.

Maximum length for MD5 input/output

MD5 processes an arbitrary-length message into a fixed-length output of 128 bits, typically represented as a sequence of 32 hexadecimal digits.

How many random elements before MD5 produces collisions?

Probability of just two hashes accidentally colliding is 1/2128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However if you keep all the hashes then the probability is a bit higher thanks to birthday paradox. To have a 50% chance of any hash colliding with any other hash you need 264 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.

Is MD5 still good enough to uniquely identify files?

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.

Is calculating an MD5 hash less CPU intensive than SHA family functions?

Yes, MD5 is somewhat less CPU-intensive. On my Intel x86 (Core2 Quad Q6600, 2.4 GHz, using one core), I get this in 32-bit mode:

MD5 411
SHA-1 218
SHA-256 118
SHA-512 46

and this in 64-bit mode:

MD5 407
SHA-1 312
SHA-256 148
SHA-512 189

Figures are in megabytes per second, for a "long" message (this is what you get for messages longer than 8 kB). This is with sphlib, a library of hash function implementations in C (and Java). All implementations are from the same author (me) and were made with comparable efforts at optimizations; thus the speed differences can be considered as really intrinsic to the functions.

As a point of comparison, consider that a recent hard disk will run at about 100 MB/s, and anything over USB will top below 60 MB/s. Even though SHA-256 appears "slow" here, it is fast enough for most purposes.

Note that OpenSSL includes a 32-bit implementation of SHA-512 which is quite faster than my code (but not as fast as the 64-bit SHA-512), because the OpenSSL implementation is in assembly and uses SSE2 registers, something which cannot be done in plain C. SHA-512 is the only function among those four which benefits from a SSE2 implementation.

Edit: on this page, one can find a report on the speed of many hash functions (click on the "Telechargez maintenant" link). The report is in French, but it is mostly full of tables and numbers, and numbers are international. The implemented hash functions do not include the SHA-3 candidates (except SHABAL) but I am working on it.

PHP best way to MD5 multi-dimensional array?

md5(serialize($array));

However, it's worth noting that (ironically) json_encode performs noticeably faster:

md5(json_encode($array));

In fact, the speed increase is two-fold here as (1) json_encode alone performs faster than serialize, and (2) json_encode produces a smaller string and therefore less for md5 to handle.

Edit: Here is evidence to support this claim:

<?php //this is the array I'm using -- it's multidimensional.
$array = unserialize('a:6:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}i:5;a:5:{i:0;a:0:{}i:1;a:4:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}i:3;a:6:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}i:5;a:5:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}}}');
//The serialize test
$b4_s = microtime(1);
for ($i=0;$i<10000;$i++) {
 $serial = md5(serialize($array));
}
echo 'serialize() w/ md5() took: '.($sTime = microtime(1)-$b4_s).' sec<br/>';
//The json test
$b4_j = microtime(1);
for ($i=0;$i<10000;$i++) {
 $serial = md5(json_encode($array));
}
echo 'json_encode() w/ md5() took: '.($jTime = microtime(1)-$b4_j).' sec<br/><br/>';
echo 'json_encode is <strong>'.( round(($sTime/$jTime)*100,1) ).'%</strong> faster with a difference of <strong>'.($sTime-$jTime).' seconds</strong>';

JSON_ENCODE is consistently over 250% (2.5x) faster (often over 300%) -- this is not a trivial difference. You may see the results of the test with this live script here:

Now, one thing to note is array(1,2,3) will produce a different MD5 as array(3,2,1). If this is NOT what you want. Try the following code:

//Optionally make a copy of the array (if you want to preserve the original order)
$original = $array;
array_multisort($array);
$hash = md5(json_encode($array));

Edit: There's been some question as to whether reversing the order would produce the same results. So, I've done that (correctly) here:

As you can see, the results are exactly the same. Here's the (corrected) test originally created by someone related to Drupal:

And for good measure, here's a function/method you can copy and paste (tested in 5.3.3-1ubuntu9.5):

function array_md5(Array $array) {
 //since we're inside a function (which uses a copied array, not
 //a referenced array), you shouldn't need to copy the array
 array_multisort($array);
 return md5(json_encode($array));
}

Can two different strings generate the same MD5 hash code?

For a set of even billions of assets, the chances of random collisions are negligibly small -- nothing that you should worry about. Considering the birthday paradox, given a set of 2^64 (or 18,446,744,073,709,551,616) assets, the probability of a single MD5 collision within this set is 50%. At this scale, you'd probably beat Google in terms of storage capacity.

However, because the MD5 hash function has been broken (it's vulnerable to a collision attack), any determined attacker can produce 2 colliding assets in a matter of seconds worth of CPU power. So if you want to use MD5, make sure that such an attacker would not compromise the security of your application!

Also, consider the ramifications if an attacker could forge a collision to an existing asset in your database. While there are no such known attacks (preimage attacks) against MD5 (as of 2011), it could become possible by extending the current research on collision attacks.

If these turn out to be a problem, I suggest looking at the SHA-2 series of hash functions (SHA-256, SHA-384 and SHA-512). The downside is that it's slightly slower and has longer hash output.