Friday, May 20, 2011

Classes of Protected Information and DLP

Data Loss Prevention (DLP) systems must handle data in a variety of formats and identify protected data within them.  In general, protected information takes one of these forms:
  • Unstructured text - as found in text documents - including various types of information:
    • Corporate proprietary information or trade secrets
    • Personal health records
    • Personal financial records
    • Personal identifying information
  • Structured data - as found in spreadsheets, tables, database output, and CSV files
To deal with these different formats of protected information, a variety of approaches are used in a DLP system.

For corporate proprietary information, document fingerprinting is the predominant approach to identifying parts or complete copies of proprietary documents.  This requires the administrator to register proprietary documents with the DLP system, and then the DLP system can match fragments or wholesale copies of the proprietary documents.
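As a rough illustration of the fingerprinting idea, here is a minimal sketch that hashes overlapping word windows ("shingles") of a registered document and measures how many of an outbound message's shingles match. The function names, the five-word window, and the sample text are my own assumptions for illustration; commercial DLP products use their own, more robust fingerprinting schemes.

```python
import hashlib
import re


def shingle_hashes(text, k=5):
    """Return the set of hashes of every k-word window in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        windows = [" ".join(words)] if words else []
    else:
        windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha256(w.encode("utf-8")).hexdigest() for w in windows}


def overlap_ratio(registered_hashes, outbound_text, k=5):
    """Fraction of the outbound text's shingles found in the registered set."""
    outbound = shingle_hashes(outbound_text, k)
    if not outbound:
        return 0.0
    return len(outbound & registered_hashes) / len(outbound)


# Hypothetical registered document and two outbound messages.
registered = shingle_hashes(
    "The Q3 roadmap moves the storage engine to a log structured design "
    "and retires the legacy replication protocol by the end of the year."
)
leak = ("FYI: the Q3 roadmap moves the storage engine to a log structured "
        "design, keep it quiet.")
unrelated = "Lunch is at noon on Friday in the second floor kitchen for everyone."

print(overlap_ratio(registered, leak))       # a large share of shingles match
print(overlap_ratio(registered, unrelated))  # no shingles match
```

Because only hashes of the registered document are stored, the DLP system can match copied fragments without keeping the proprietary text itself in its rule set.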

Another approach that can be used for proprietary documents is to embed tags in the documents, such as "Company Confidential", and then add a simple rule to the DLP system to watch for that tag.  However, this depends on corporate users applying the correct tags to the documents, and is easy for a malicious insider to circumvent, for example, by simply removing the tag before transmitting the document to an unauthorized recipient.

For data like personal health information (PHI) or personal financial information (PFI), several approaches (or a combination of approaches) are typically used.  A combination of search terms can be used to determine whether data refers to a particular individual or group of individuals, and whether it contains significant information about those individuals.  For example, an email message from a bank containing a customer's account number, name, and account balance might be considered information protected under the Gramm-Leach-Bliley Act (GLBA).

Another approach to PHI and PFI is to use information from a corporate database, such as account numbers and customer names, in the DLP system to search for matches.  If an account number and the associated customer name turn up in an email message, the message might be considered to contain information protected under GLBA.
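The database-matching approach can be sketched as follows. The `CUSTOMERS` table and the function name are hypothetical stand-ins for an extract from a real customer database, and a production system would typically store hashed values rather than plaintext.

```python
import re

# Hypothetical extract from a customer database: account number -> customer name.
CUSTOMERS = {
    "4400123456": "Alice Example",
    "4400987654": "Bob Sample",
}


def find_customer_matches(message, customers=CUSTOMERS):
    """Return (account, name) pairs where both values appear in the message."""
    lowered = message.lower()
    # Compare whole digit runs so a short account number cannot match
    # inside a longer, unrelated number.
    digit_runs = set(re.findall(r"\d+", message))
    return [
        (account, name)
        for account, name in customers.items()
        if account in digit_runs and name.lower() in lowered
    ]


print(find_customer_matches(
    "Dear Alice Example, the balance on account 4400123456 is $1,200."
))
```

Requiring both the account number and the associated name keeps a stray ten-digit number from triggering a match on its own.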

A third approach, specific to personal financial information, is to look for credit card information.  Credit card numbers use a standard format, with issuer-specific prefixes and a check digit, so it is possible to look at a sixteen-digit number and determine with a high degree of accuracy whether it is a Visa or MasterCard credit card number.
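A minimal sketch of such a check follows, combining the Luhn checksum that payment card numbers carry with the issuer prefixes (4 for Visa; 51 through 55 and 2221 through 2720 for MasterCard). The function names and the candidate pattern are my own choices for illustration, and the numbers in the usage line are widely published test numbers, not real accounts.

```python
import re


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def card_brand(digits):
    """Return 'Visa' or 'MasterCard' for a 16-digit string, else None."""
    if len(digits) != 16 or not digits.isdigit():
        return None
    if digits[0] == "4":
        return "Visa"
    if 51 <= int(digits[:2]) <= 55 or 2221 <= int(digits[:4]) <= 2720:
        return "MasterCard"
    return None


# Sixteen digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def find_card_numbers(text):
    """Return (digits, brand) for each candidate with a known prefix and valid checksum."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        brand = card_brand(digits)
        if brand and luhn_valid(digits):
            hits.append((digits, brand))
    return hits


print(find_card_numbers("Cards: 4111 1111 1111 1111 and 5555-5555-5555-4444"))
```

The checksum rejects most random sixteen-digit strings, which is what keeps the false-positive rate low for this class of data.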

For personal identifying information, one approach is to look for national identification numbers, state driver's license numbers, or account numbers.  In the United States, the Social Security Number (SSN) is often used (and abused) for purposes of identification and authentication for financial and health purposes, and as such has gained status as a protected piece of information.  Unfortunately, the format of the SSN was developed without the concept of check digits or embedded validators, so it is easy for a DLP system to mistake any number in the form 123-45-6789 for an SSN.
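To show how little the SSN format gives a DLP system to work with, here is a minimal sketch that applies the few structural rules that do exist: the area (first three digits) is never 000, 666, or in the 900s, the group is never 00, and the serial is never 0000. The function name and sample text are illustrative assumptions.

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def plausible_ssns(text):
    """Return SSN-shaped strings that pass the basic structural rules."""
    hits = []
    for area, group, serial in SSN_PATTERN.findall(text):
        if area in ("000", "666") or area.startswith("9"):
            continue  # areas that are never issued
        if group == "00" or serial == "0000":
            continue  # all-zero group or serial is never issued
        hits.append(f"{area}-{group}-{serial}")
    return hits


print(plausible_ssns("Part no. 123-45-6789, ref 000-12-3456, ext 900-11-2222"))
```

Note that the part number 123-45-6789 still passes every rule. Without a check digit, the only ways to reduce false positives are surrounding context (nearby terms such as "SSN") or matching against known values from a database.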

As for structured data, DLP systems can identify protected contents in a couple of ways.  One is to write rules for the DLP system that match the format of data typically used in a company, such as forms that are often used for things like customer orders.  Another, as with PHI and PFI, is to use information from a corporate database, such as account numbers and customer names, in the DLP system to search for matches.
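A format-matching rule for structured data might look something like the sketch below, which flags CSV content whose header row contains a company's typical customer-record columns. The column names and the function name are hypothetical; a real rule would be written against the forms and exports a particular company actually uses.

```python
import csv
import io

# Hypothetical column names that together indicate a customer-record export.
PROTECTED_COLUMNS = {"account_number", "customer_name", "balance"}


def looks_like_customer_export(csv_text, required=PROTECTED_COLUMNS):
    """Return True if the CSV header row contains every required column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    normalized = {col.strip().lower().replace(" ", "_") for col in header}
    return required <= normalized


sample = "Account Number,Customer Name,Balance\n4400123456,Alice Example,1200\n"
print(looks_like_customer_export(sample))
```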

These formats cover the majority of the ways I have seen protected information stored and transmitted where a DLP system can help identify and protect the data.
