Processing & Productions

Processing FAQs

We get a lot of good questions about our processing and production capabilities from our partners and clients. This page will collect the best of these and provide answers. If you have other questions, please email them to us.

1. Does Catalyst handle Office 2007 files?

Yes, we have been handling Office 2007 files for several years now.

We index the text of the files and create preview versions of the files to facilitate review. In addition, our Excel viewer is compatible with the Office 2007 format.

You can also download the native files themselves. If you don't have Office 2007 installed on your computer, you can download a free plug-in from Microsoft for viewing the native files.

2. What technology do you use for hashing and why?

Two hash algorithms are in common use for this kind of work: MD5 and SHA-1. We use SHA-1 for our work but can provide you with MD5 values as well if desired.

In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications and is also commonly used to check the integrity of files.

The SHA hash functions are a set of cryptographic hash functions designed by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard. SHA stands for Secure Hash Algorithm. SHA-1 is the best established of the existing SHA hash functions and forms part of several widely used security applications and protocols, including TLS and SSL, PGP, SSH, S/MIME, and IPsec.

For our purposes, the key is that one or the other approach is used consistently across your document population. Documents hashed using MD5 will not dedupe against documents hashed using SHA-1, and vice versa. Likewise, you have to make sure that the fields used for hashing are identical across your population. If the methodology changes partway through, hashes computed before and after the change will not match, and deduplication will not work.
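The consistency point can be illustrated with a minimal sketch using Python's standard `hashlib` module. The helper name `file_hash` and the sample byte strings are assumptions made for illustration; this is not a description of our production tooling.

```python
import hashlib


def file_hash(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of `data` using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


# Two byte-for-byte identical documents (hypothetical sample content).
doc_a = b"Quarterly report, final version."
doc_b = b"Quarterly report, final version."

sha1_a = file_hash(doc_a, "sha1")
sha1_b = file_hash(doc_b, "sha1")
md5_b = file_hash(doc_b, "md5")

# Same algorithm on identical content: the hashes match, so the
# documents dedupe against each other.
same_algo_match = sha1_a == sha1_b

# Different algorithms on identical content: the values never match
# (SHA-1 digests are 160 bits, MD5 digests are 128 bits).
mixed_algo_match = sha1_a == md5_b

print(same_algo_match, mixed_algo_match)
```

The same reasoning applies to the choice of input: if one batch is hashed on the full file and another on selected fields only, the values are not comparable even when the algorithm is the same.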

3. How do you hash MSG files (Outlook email) to find duplicates?

We first convert the Outlook MSG files to HTML. Then we hash the resulting HTML file.

Our system allows us to choose which fields and body text we want to use for the hashing. Typically, we use these fields: from, to, cc, bcc, subject, attachment name, sent date/time, and body.

We don't hash MSG files directly because each MSG file contains a field that stores its creation date and time. This value changes every time the MSG is saved, so two copies of the same email would produce different hashes and would not be treated as duplicates.
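The idea of hashing only selected fields can be sketched as follows. This is a simplified illustration, not our actual pipeline: it skips the HTML conversion step and hashes a dictionary of fields directly. The function name `email_dedupe_hash`, the dictionary keys, and the normalization choices (trimming whitespace, joining lists with semicolons) are all assumptions.

```python
import hashlib

# Fields that feed the hash, in a fixed order. The order and the exact
# field list must stay the same across the whole document population.
HASH_FIELDS = ("from", "to", "cc", "bcc", "subject",
               "attachment_names", "sent", "body")


def email_dedupe_hash(message: dict) -> str:
    """Hash only the selected fields; any other key is ignored."""
    parts = []
    for field in HASH_FIELDS:
        value = message.get(field, "")
        if isinstance(value, (list, tuple)):
            value = ";".join(value)
        parts.append(f"{field}={str(value).strip()}")
    canonical = "\n".join(parts)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


original = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "subject": "Contract draft",
    "attachment_names": ["draft.doc"],
    "sent": "2009-03-02T14:05:00",
    "body": "Please review the attached draft.",
    "file_created": "2009-03-02T14:06:11",  # changes on every save
}

# The same email saved again later: only the creation stamp differs.
resaved = dict(original, file_created="2009-04-17T09:30:45")

# A genuinely different email.
different = dict(original, subject="Contract draft v2")

print(email_dedupe_hash(original) == email_dedupe_hash(resaved))
print(email_dedupe_hash(original) == email_dedupe_hash(different))
```

Because `file_created` is not in `HASH_FIELDS`, the resaved copy hashes to the same value as the original, while a change to any selected field produces a different hash.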

4. My file has the wrong extension or no extension. Why is that?

There could be several reasons for this. First, if the file had already been processed before we received it, we don't investigate the file extensions further to make sure they are correct. Rather, we load the files directly into our system.

During the course of indexing, however, our system interrogates the files to determine their true file type. If the file type can be recognized, the system will attempt to extract text for indexing, preview and search. However, we do not change the file type shown for the native file.

If the file was embedded in an email file (as an attachment for example) or an Office file, there could be a different reason. Embedded files in the Microsoft applications often do not have a file extension. Without a file extension, the Windows Operating System does not know what the file is. You have doubtless seen this when you tried to open a Word document that had no "doc" extension. Your computer had to ask you what application should be used to open it.

When we process the files directly, we interrogate files with missing or unknown extensions and add the identified extension to the file name. This helps later, when you want to view the native file and have it open in the proper program. While the identification software is good at this, it doesn't always succeed. The reason is that the file formats don't have a simple header advising of the file type. Rather, the software has to look within the binary content and make this analysis. If it is just a Word document, the process is pretty simple. If, however, the Word file has Excel files embedded within it, the software has to try to determine which file is the dominant file.

Most of the time, the system correctly analyzes the file type. On rare occasions, it can miss. That is why the file might be mislabeled.
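To show why identification can be ambiguous, here is a minimal sketch of content-based type detection. It is not the software we use; the function name `guess_extension` and the small set of checks are assumptions for illustration. It relies on two well-known facts: legacy Office files (and MSG files) share one container signature, and Office 2007 files are ZIP archives that conventionally contain a format-specific main part.

```python
import io
import zipfile

# Signature shared by legacy .doc, .xls, .ppt and .msg files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP archive that starts with a file entry.
ZIP_MAGIC = b"PK\x03\x04"

# Conventional main-part names inside Office 2007 packages.
OOXML_MARKERS = {
    "word/document.xml": ".docx",
    "xl/workbook.xml": ".xlsx",
    "ppt/presentation.xml": ".pptx",
}


def guess_extension(data: bytes):
    """Return a likely extension, or None if the type is unclear."""
    if data.startswith(b"%PDF-"):
        return ".pdf"
    if data.startswith(OLE2_MAGIC):
        # The signature alone cannot tell Word from Excel or Outlook;
        # a real tool must inspect the streams inside the container.
        return None
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return None
        for marker, extension in OOXML_MARKERS.items():
            if marker in names:
                return extension
        return ".zip"
    return None
```

In this sketch, a legacy Office file is deliberately left unidentified. Resolving it means reading the container's internal streams, and a Word document with an embedded spreadsheet holds streams for both, which is the "dominant file" problem described above.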

 