value is called a collision. Estimating how often a particular algorithm will produce collisions is an application of a problem in probability theory known as the birthday paradox. The birthday paradox says that to have a 50% probability that at least two people in a given room share a birthday (month and day), all you need is 23 people in the room. Getting to 100% probability is a different matter. There is a very slim chance of having 366 people in a room who all have different birthdays, so to guarantee a duplicate you would need 367 people (365 + 1 for leap day + 1 to get the duplicate). The same mathematics opens the door to attacks against a hash algorithm, because collisions turn up after far fewer attempts than the number of possible hash values would suggest.
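The birthday figures above can be checked with a few lines of Python. This is a minimal sketch that assumes every birthday is equally likely; the function names are illustrative only.

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    if people > days:
        # More people than possible birthdays: a duplicate is guaranteed.
        return 1.0
    p_all_distinct = 1.0
    for i in range(people):
        # The (i + 1)th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group_for(target: float, days: int = 365) -> int:
    """Smallest group size whose collision probability reaches `target`."""
    n = 1
    while birthday_collision_probability(n, days) < target:
        n += 1
    return n


print(smallest_group_for(0.5))                  # group size for a 50% chance
print(birthday_collision_probability(23))       # just over 0.5
print(birthday_collision_probability(367, 366)) # guaranteed, counting leap day
```

The same reasoning applies to hash values if you treat each possible hash as a "day": the number of inputs needed for a likely collision grows with the square root of the number of possible values, not with the number itself.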
When you hear cryptographic, you may think encryption. We are not talking about encrypting the evidence. Instead, we are talking about passing the evidence through a very complicated mathematical function in order to get a single output value. Hashing algorithms used for this purpose are sometimes called one-way functions because there is no practical way to get the original data back from just the hash value. Similarly, for a hash algorithm to be acceptable for verifying integrity, there should be no practical way to produce two files with different contents that generate the same hash value. This means that if we get the same hash value each time we test a file, we can be highly confident that the content of that file hasn't changed, because it shouldn't be feasible to change the content of the file in a way that still returns the original hash value. The only way to get the original hash value is for the data to remain unaltered.
NOTE
A cryptographic hash takes into consideration only the data that resides within the file. It does not use any of the metadata like the filename or dates. As a result, you can change the name of the file and the hash value for that file will remain the same.
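A short Python sketch can illustrate this point. It writes a small file into a temporary directory, hashes it, renames it, and hashes it again. The filenames and the helper md5_of_file are made up for the example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str) -> str:
    """Return the MD5 hex digest of a file's contents (not its metadata)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "evidence.txt")
    renamed = os.path.join(workdir, "renamed.txt")

    with open(original, "wb") as f:
        f.write(b"Hi, this is some text.\n")

    before = md5_of_file(original)
    os.rename(original, renamed)   # only the name changes, not the contents
    after = md5_of_file(renamed)

print(before == after)  # the hash is unchanged by the rename
```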
NOTE
Cryptography is really just about secret writing, which isn't necessarily the same as encryption. Hashes are used in encryption processes as a general rule because they are so good at determining whether something has changed. If you have encrypted something, you want to make sure it hasn't been tampered with in any fashion. You want to know that what you receive is exactly what was sent. The same is true when we are talking about forensic evidence.
For many years, the cryptographic hash standard used by most digital forensic practitioners and tools was Message Digest 5 (MD5). MD5 was published in 1992 and it generates a 128-bit value that is typically represented as 32 hexadecimal digits because that is shorter and easier to read than other methods like printing out all 128 binary bits. To demonstrate the process of hashing, I placed the following text into a file:
Hi, this is some text. It is being placed in this file in order to get a hash value from the file.
The MD5 hash value for that file is 2583a3fab8faaba111a567b1e44c2fa4. No matter how many times I run the MD5 hash utility against that file, I will get the same value back. The MD5 hash algorithm is non-linear, however. This means that a change of even a single bit in the file will yield an entirely different result, and not just a result that is one bit different from the original hash. Every bit in the file makes a difference to the calculation. If you have an extra space or an end of line where there wasn't one in the original input, the value will be different. To demonstrate this, I changed the first letter of the text file from an H to a G. This is a change to a single byte, and a small one: the value for H is 72 and the value for G is 71 on the ASCII table, so only the four low-order bits of that byte differ. The hash value resulting from this altered file is 2a9739d833abe855112dc86f53780908. This is a substantive change, demonstrating the complexity of the hashing function.
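The effect of a small change can be measured directly. The sketch below uses Python's hashlib on a shortened version of the sample text (an assumption made to keep the example brief, so the digests will not match the ones printed above) and counts how many bits differ in the input and in the two MD5 digests.

```python
import hashlib

original = b"Hi, this is some text."
altered = b"Gi, this is some text."   # first letter changed from H to G


def bits_differing(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# H is 72 (01001000) and G is 71 (01000111): four bits differ.
input_bits = bits_differing(original, altered)

digest_original = hashlib.md5(original).digest()
digest_altered = hashlib.md5(altered).digest()
digest_bits = bits_differing(digest_original, digest_altered)

print(input_bits, "input bits differ")
print(digest_bits, "of 128 digest bits differ")
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(altered).hexdigest())
```

For a well-behaved hash function, you would expect roughly half of the digest bits to differ, no matter how small the change to the input.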
NOTE
MD5 is the algorithm but there are countless implementations of that algorithm. Every program that can generate an MD5 hash value contains an implementation of the MD5 algorithm.
One of the problems with the MD5 algorithm, though, is that it is only 128 bits. This isn't an especially large space in which to be generating values and, combined with weaknesses that have been found in the algorithm itself, it leaves MD5 vulnerable to collisions. As a result, for many purposes, the MD5 hash has been superseded by the Secure Hash Algorithm 1 (SHA-1) hash. The SHA-1 hash generates a 160-bit value, which can be rendered using 40 hexadecimal digits. Even this isn't always considered large enough. As a result, the SHA-2 standard for cryptographic hashing has several alternatives that generate longer values. One that you may run into, particularly in the encryption space, is SHA-256, which generates a 256-bit value. Where the 128-bit MD5 hash algorithm has the potential to generate roughly 3.4 × 10^38 unique values, the SHA-256 hash algorithm can yield roughly 1.16 × 10^77 unique values. It boggles the mind to think about how large those numbers are, frankly. Generating a SHA-1 hash against our original text file gives us a value of 286f55360324d42bcb1231ef5706a9774ed0969e. The SHA-256 hash value of our original file is 3ebcc1766a03b456517d10e315623b88bf41541595b5e9f60f8bd48e06bcb7ba. These are all different values that were generated against the same input file.
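The digest sizes and value counts mentioned here can be confirmed with Python's hashlib. This is only an illustrative sketch; the sample data is a shortened stand-in for the text file, so the digests it prints will differ from the ones shown above.

```python
import hashlib

data = b"Hi, this is some text."

digest_bits = {}
hex_lengths = {}
for name in ("md5", "sha1", "sha256"):
    hasher = hashlib.new(name, data)
    bits = hasher.digest_size * 8            # digest_size is in bytes
    digest_bits[name] = bits
    hex_lengths[name] = len(hasher.hexdigest())
    # 2**bits is the number of distinct values the algorithm can produce.
    print(f"{name}: {bits} bits, {hex_lengths[name]} hex digits, "
          f"about {2**bits:.2e} possible values")
```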
One thing to keep in mind is that any change at all to the data in the source file will generate a completely different value. Adding or removing a line break, for example, would constitute adding or removing an entire character from the file. If that were done, the file may look identical to your eyes but the hash values would be completely different. To see the difference, you would have to view the file using something like a hexadecimal editor to see how it is truly represented in storage and not just how it is displayed.
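As a rough illustration, the following Python sketch compares two byte strings that differ only by a trailing line break. Printing the bytes in hexadecimal plays the role of the hexadecimal editor; the sample text is an assumption for the example.

```python
import hashlib

without_newline = b"Hi, this is some text."
with_newline = without_newline + b"\n"   # one extra byte, invisible on screen

# The hexadecimal view shows the extra 0a (line feed) byte at the end.
print(without_newline.hex())
print(with_newline.hex())

hash_without = hashlib.sha256(without_newline).hexdigest()
hash_with = hashlib.sha256(with_newline).hexdigest()
print(hash_without)
print(hash_with)
print(hash_without == hash_with)  # the two digests do not match
```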
You can use a number of utilities to generate these values. The preceding values were generated using the built-in, command-line utilities on a macOS system. Linux has similar command-line utilities available. On Microsoft Windows, current versions include the certutil command and the PowerShell Get-FileHash cmdlet, both of which can generate hash values, and you can also download a number of other programs. Microsoft has also offered a downloadable utility that will generate the different hash values for you. The name of the utility is File Checksum Integrity Verifier (FCIV).
Any time you obtain a file such as a packet capture or a log file, you should immediately generate a hash value for that file. MD5 hash values are considered acceptable in court cases as of the time of this writing, though an investigation would be more durable if algorithms like SHA-1 or SHA-256, which generate longer values, were to be used. MD5 continues to demonstrate flaws the longer it is used and those flaws may eventually make evidence verification from MD5 hashes suspect in a court case.
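One way to do this at acquisition time is sketched below. The function name hash_file, its parameters, and the choice of algorithms are assumptions for illustration, not a prescribed procedure. The file is read in chunks because packet captures and log files can be too large to load into memory at once, and several digests are computed in a single pass.

```python
import hashlib


def hash_file(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1024 * 1024):
    """Return a dict mapping each algorithm name to the file's hex digest."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling hash_file on a capture file (for example, a hypothetical capture.pcap) returns all three values, which can then be recorded alongside the evidence and recomputed later to confirm nothing has changed.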
Over the course of looking at packet captures in Chapter 4, we will talk about some other values that perform similar functions. One of those is the cyclic redundancy check (CRC), which is also mathematically computed and is often used to validate that data hasn't been altered. These sorts of values, though, are commonly called checksums rather than hashes.
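To give a sense of the difference in scale, the sketch below computes a CRC using Python's zlib module alongside a SHA-256 hash. The CRC-32 variant in zlib is used here as a stand-in and the sample data is made up. The CRC is only 32 bits long and is designed to catch accidental corruption, not deliberate tampering.

```python
import hashlib
import zlib

data = b"Hi, this is some text."

crc = zlib.crc32(data)                       # unsigned 32-bit integer
sha256 = hashlib.sha256(data).hexdigest()

print(f"CRC-32:  {crc:08x}  (32 bits)")
print(f"SHA-256: {sha256}  (256 bits)")
```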
Chain of Custody
Sometimes it seems as though TV shows like NCIS, CSI, Bones, and others that portray forensics simultaneously advance and set back the field of forensics. Although some of the technical aspects of forensics as they portray them, including the language, are ridiculous, these shows do sometimes get things right. This was especially true in the early days of NCIS, as an example, where everything they collected was bagged and tagged. If evidence is handed off from one person to another, it must be documented. This documentation is the chain of custody. Evidence should be kept in a protected and locked location if you are going to be presenting any of it in court. Though this may be less necessary if you are involved in investigating an incident on a corporate network, it's still a good habit. For a start, as noted earlier in this chapter, you never know when the event you are investigating may turn from a localized incident to something where legal proceedings are required. As an example, the first well-known distributed denial of service (DDoS) attack in February 2000 appeared as a number of separate incidents to the companies involved. However, when it came time to prosecute Michael Calce, known as Mafiaboy, the FBI would have needed evidence and that evidence would have come from the individual companies who were targets of the attacks: Yahoo, Dell, Amazon, and so on.
Even in the case of investigating a network incident in a business setting, documenting the chain of custody is a good strategy. This ensures that you know who was handling the potential evidence at any given time. It provides for accountability and a history. If anything were to go wrong at any point, including loss of or damage to the evidence, you would have a historical record of who was