Data classification can be a confusing concept, largely because the term is not standardized across the industry. It usually evokes one of two ideas: determining what type of information a piece of data contains, or marking (tagging) a piece of data based on that determination. Both matter to an organization’s overall data governance plan, for different reasons.
Data Classification as Identification
Frequently, data classification means identifying what type of information a piece of data contains. This usually falls into two areas:
Classifying Based on Sensitivity
This approach is used especially in highly regulated industries and in organizations that must meet a specific compliance requirement. It generally ranks data by intended audience, for instance determining whether something is confidential, top-secret, and so on. The rank is assigned by finding the pieces of information that indicate a classification type, whether that’s intellectual property or personal, private, or protected information belonging to an individual or the organization.
This method can be complex, but there are many ways to scale and automate it. When patterns, words, and phrases can be defined, regular expressions can find most matching data points in a data set, and validation methods layered on top add an extra level of confidence to the analysis.
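As a minimal sketch of the pattern-plus-validation idea, the example below pairs two regular expressions with a Luhn checksum that filters out card-like numbers that cannot be real. The pattern names, the patterns themselves, and the `find_sensitive` helper are illustrative assumptions, not the rules any particular product uses.

```python
import re

# Hypothetical patterns for illustration; real rule sets are tuned per data set.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or dashes.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str):
    """Return (label, matched text) pairs, dropping card matches that fail Luhn."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if label == "credit_card" and not luhn_valid(value):
                continue  # looks like a card number but fails validation
            hits.append((label, value))
    return hits
```

The regular expression casts a wide net, and the checksum is the validation step that raises confidence: a 16-digit order number matches the pattern but will usually fail the Luhn check and be discarded.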
Classifying Based on Content
Another option is determining the theme of the content in a data set and using that to mark a file. For instance, a mortgage application may fall into the category of “Finance”, while an offer letter may fall under “Human Resources”. This is a more complex way to classify data than matching regular expressions. It requires a heavy human component, either to classify content as it is created or to go through and evaluate it at a later point. Natural language processing and machine learning technologies can help, but overall it is likely to remain a very manual process.
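To show why theme-based classification is harder to automate, here is a deliberately crude keyword-scoring sketch. It is a stand-in for real natural language processing, not an example of it, and the keyword lists, the `min_hits` threshold, and the `categorize` function are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real taxonomy would be far larger and curated.
CATEGORY_KEYWORDS = {
    "Finance": {"mortgage", "loan", "interest", "escrow", "lender"},
    "Human Resources": {"offer", "salary", "benefits", "employment", "onboarding"},
}


def categorize(text: str, min_hits: int = 2):
    """Return the best-scoring category, or None if no category has enough hits."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Anything below the threshold is left for a human to review.
    return best if scores[best] >= min_hits else None
```

Anything that returns `None`, or that scores similarly in two categories, still needs a person to look at it, which is where the manual effort comes from.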
Data Classification as Marking
As part of a true data governance strategy, files that are deemed sensitive or that have been categorized benefit greatly from some form of marking. This can take a few different forms:
One option is to place an actual tag on the file itself. For example, Microsoft Office files support this type of tagging directly: the tag is added as data inside the package (the “envelope”) that makes up an Office file. Tagged this way, content can be searched for programmatically or fed directly into a Data Loss Prevention (DLP) solution to implement Rights Management. Tagging can be done manually in the file itself, or in bulk with a product (such as StealthAUDIT) or from an interface like PowerShell.
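To make the idea of a tag stored inside the file concrete: modern Office files (.docx, .xlsx, .pptx) are ZIP packages, and custom document properties live in a part named `docProps/custom.xml`. The sketch below reads those properties using only the Python standard library. It assumes the tag was written as a custom document property, which is only one of the places a tagging tool might store it, and the property name `Classification` and the `build_demo_package` helper are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Part name and namespaces for custom document properties in Office Open XML.
CUSTOM_PROPS_PART = "docProps/custom.xml"
CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"


def read_custom_properties(package_file):
    """Return {property name: value} from an Office package, or {} if none."""
    with zipfile.ZipFile(package_file) as package:
        if CUSTOM_PROPS_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CUSTOM_PROPS_PART))
    props = {}
    for prop in root.findall(f"{{{CUSTOM_NS}}}property"):
        # Each <property> wraps one typed value element such as <vt:lpwstr>.
        props[prop.get("name")] = prop[0].text if len(prop) else None
    return props


def build_demo_package(tag_value):
    """Build an in-memory ZIP holding only a custom-properties part.

    This is NOT a complete Office file; it exists so the reader above can be
    tried without a real document. "Classification" is a hypothetical name.
    """
    xml = (
        f'<Properties xmlns="{CUSTOM_NS}" xmlns:vt="{VT_NS}">'
        '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" '
        'name="Classification">'
        f"<vt:lpwstr>{tag_value}</vt:lpwstr>"
        "</property></Properties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(CUSTOM_PROPS_PART, xml)
    buffer.seek(0)
    return buffer
```

Calling `read_custom_properties` with a path to a real document works the same way as with the in-memory demo package, which is what makes this kind of tag easy to search for in bulk.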
Another method, one of the weakest yet most common, is to mark a file by its place in a hierarchical taxonomy, most often (though not always) implemented as a folder structure. There is usually a high-level split of content (by department, by date, by sensitivity, etc.) with several subsections beneath it. This frequently leads to content overlapping across multiple locations, or to no overlap where it is needed, which makes content harder to locate when it must be secured or removed. This method is probably the most common because it was the first real method available in file systems and is easy to implement, but it is also the least useful from a classification perspective.
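A small sketch shows how little a folder-based taxonomy actually says about a file. The folder names, the label mapping, and the `labels_from_path` helper are hypothetical; the point is only that the "classification" is whatever the path happens to contain.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from folder names to classification labels.
FOLDER_LABELS = {
    "finance": "Finance",
    "hr": "Human Resources",
}


def labels_from_path(path: str):
    """Return every label implied by the folders in `path`, in path order."""
    folders = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return [
        FOLDER_LABELS[folder.lower()]
        for folder in folders
        if folder.lower() in FOLDER_LABELS
    ]
```

A file saved under `/shares/hr/finance/` picks up two labels, and a file saved under `/shares/misc/` picks up none, regardless of what either file contains.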
Extended File Metadata
Many modern collaboration platforms offer the ability to add metadata to content within the platform without changing the file itself. SharePoint is an excellent example, and others include Box, Dropbox, and Google Drive. These platforms provide options to add a different level of metadata to a file, which improves searchability and classification and ties into other features of the platform.
None of these are new problems, however. Many organizations have run into these exact challenges before. While I strongly encourage taking advantage of a content collaboration platform for sharing and classifying content, there are plenty of options if you are still on a file share. I encourage taking a moment to look into StealthAUDIT with Sensitive Data Discovery and file tagging options to truly manage your internal data classifications.