I know that this blog entry is a bit long, but you try explaining this kind of abstract concept in fewer words. It is hard!
I work for EMC. As well as making disk drives (I'm a software guy, so that's just exhausted my total knowledge of hardware), we are actually the 6th largest software company in the world. One of the most challenging things about this is understanding the capabilities of all of the other applications in the company -- more importantly, one needs to be able to assess how the other applications can be used to increase the value of your specific software offering. Bear in mind that here at EMC we have exactly 7498 different software products (OK, I completely made that up, but we do have a lot - it could easily be more than this!).
Occasionally it all comes together -- you have an epiphany and realize that you can take a couple of existing products and solve an age-old problem. It is amazingly satisfying when this happens. One relatively recent example of this is the "Structured Data as a Record" conundrum. Let me explain...
I used to get a lot of questions about storing 'structured data' as a record. To those of us who live in the enterprise content management space, the term 'structured data' refers to data that comes from a relational database. Customers want to take data currently stored in a database and make parts of it into a record. They want to do this without removing the content from the database. I used to think that this showed a misunderstanding of what a record is, but now I realize that it actually highlights a classic problem with understanding software solutions. The customer was asking for a specific solution rather than defining the problem -- they were asking how to make database content immutable in place, but that's a solution, not a problem. The problem they are really trying to solve is how to produce the output of this data at a future time.
Actions on Structured Data
Typically people want to do 3 things to data in a database (other than just use it, of course). They want to optimize the database, archive the data and apply compliance controls to the data (these are not mutually exclusive BTW; they may want to do all 3 at once). We have in-house software and partners galore that deal with the first two, but the last one is a different animal. If you asked most people in the know how to do this, they would tell you that it was not possible - with good cause. Before we dive into the solution, let's consider the three main types of structured data to which you typically need to apply compliance controls.
Types of Structured Data
Structured data is usually generated by computer applications and systems, not people. It can be subdivided into page-oriented, line-oriented and native database data.
Page-oriented Data is designed to be printed in a page format. It is made to be easily read and interpreted by a human, and the page breaks add context to the final report. It has what we call "print fidelity", which means that the layout is significant. Typical examples would be bank statements or house-sale closing documents.
Line-oriented Data is just as it sounds...line after line of boring data. It is typically in date/time sequence and may have a fixed or variable line structure. It may be character delimited, fixed column, undelimited lines of text or even theoretically XML (XML is just delimited text anyway). This is not usually designed to be read by people on a day to day basis. Good examples are transaction logs from a trading system or audit logs from one of your enterprise systems.
Database Data is the stuff at rest in its native environment: it is sitting in your relational database of choice. To be clear, the two types above will often have been derived from a database in the first place.
So what's the problem?
Imagine trying to make a single record in a relational database immutable. I cannot even define what that means. Are you taking a row from a table and locking it down, or just a single value? Do you lock down the relationships between the data too? What about the queries and reports used to consolidate the values into usable data? The questions keep on going...
So what's the solution?
Back to the advantage of having access to all of the resources of a 13-billion-dollar company then... EMC purchased a company called Acartus a few years ago. I always thought that their software was cool: it allowed you to grab a print run and, instead of letting it stream to a printer, the software converted it to a PDF file. You could then archive this PDF as an exact facsimile of the printed documents. In the case of a page-oriented print run, the software knew where the physical and logical breaks in the run were, so it knew where each page started and ended but also where each logical section was. It could take you to page 4 of a 15 page bank statement in a PDF that contained 132,000 statements. When we bought Acartus they embarked on a project to use the EMC Documentum repository to natively store these documents. The final product is now called Archive Services for Reports -- ASR.
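To make the "page 4 of a 15 page statement" idea a bit more concrete, here is a minimal Python sketch of the kind of index that would let you jump straight to a page inside one logical section of a big print run. This is not ASR's actual implementation or API; `Section`, `build_index`, `absolute_page` and `section_at` are hypothetical names I made up purely for illustration, and the account keys are invented.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """One logical section (e.g. one statement) inside a large print-run PDF."""
    key: str          # hypothetical identifier, e.g. an account number
    first_page: int   # 1-based page number within the whole print run
    page_count: int   # number of pages in this section


def build_index(sections):
    """Map each section key to its Section for direct lookup."""
    return {s.key: s for s in sections}


def absolute_page(index, key, page_in_section):
    """Translate 'page N of this statement' into a page of the print run."""
    section = index[key]
    if not 1 <= page_in_section <= section.page_count:
        raise ValueError("page is outside this section")
    return section.first_page + page_in_section - 1


def section_at(sections, page):
    """Find the section that owns a print-run page.

    Assumes `sections` is sorted by first_page.
    """
    starts = [s.first_page for s in sections]
    pos = bisect_right(starts, page) - 1
    if pos < 0:
        raise ValueError("page is before the first section")
    section = sections[pos]
    if page >= section.first_page + section.page_count:
        raise ValueError("page is not covered by any section")
    return section
```

With three sections starting at pages 1, 4 and 19, asking for page 4 of the second section resolves to print-run page 7, and looking up page 7 leads back to that same section.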
Recently I started working with the ASR team on a specific account. We already knew that we could manage report output as a record but we realized that we could also use the capabilities of ASR and RM to address some of the structured content compliance challenges. I'll keep it brief, (well Chapman brief), but email me with your questions if you want to understand more and I'll add more detail to the blog.
Redefining "Structured Content"
Firstly, while the content is in the active database you really cannot make it into a record. However, if you take the data out of the database you have a fighting chance. This sounds like a cop out but it isn't.
Consider the real requirement. The regulations do not say that your active database data needs to be retained in situ; the regulations say that you must retain and be able to produce documentary evidence of a specific event. For example, you may need to guarantee to the courts or customers that you can provide details of all of your customers' credit card transactions for 7 years.
So, the solution just needs to allow you to recover specific details drawn from the structured data. In order to do this using today's technology we need to scrub the attempt to manage data in place in the database and focus on meeting the regulatory requirements by using page- and line-oriented data.
Page-oriented Structured Data as a record
Let's return to the way too ubiquitous credit card statement. Your personal account details and the details of each transaction are stored in a relational database. Assume that the regulations state that you must keep those statements for 7 years.
If you tried to lock down the data in the native database you'd be heading for trouble. However, if you use something like ASR then you'd be able to store the entire day's credit card statements in a single uber-PDF file and retain it for 7 years. Sounds fairly trivial doesn't it? In fact, we can do this today without using a variety of products. Now consider the nightmare caused by "exceptions". What happens if a customer sues the credit card company over a specific transaction that happened 6 years ago?
The credit card company needs to place a legal hold on one of the 132,000 statements in that uber-PDF file. With most technologies you would have to hold all 132,000 statements until the legal hold was removed. This creates a huge liability for a company that would rather dispose of its data as soon as legally allowable.
The epiphany that we had was to realize that because ASR is more intelligent than the other report archiving solutions we could actually demonstrate a much more sophisticated approach. Out of the box, ASR allows you to check out a single logical section of a print run and view it as a single object separate from the uber-PDF. I request someone's statement and ASR extracts just those pages and creates a new object in the repository that only contains that individual statement. It then updates its internal index so that it knows that the logical section is now a separate object. The original data is left in the uber-PDF.
Assume that you need to place a legal hold on a single statement. Using ASR you find the statement and apply the hold. Behind the scenes that single statement turns into a new object separate from the original PDF file, and it is that new object that has the hold applied to it. When the original 7 years are up you can safely delete the original uber-PDF file knowing that the held statement is protected. There's a little more to it than this to implement the housekeeping, but the bottom line is that it allows you to do records management on the full print run by exception only.
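The hold-by-exception flow described above can be sketched in a few lines of Python. To be clear, this is a toy model under my own assumptions and not how ASR or Documentum actually work: `Repository`, `RepoObject`, `place_hold` and `dispose` are hypothetical names, and a list of strings stands in for the real PDF pages.

```python
from dataclasses import dataclass, field


@dataclass
class RepoObject:
    """A stored object; `pages` stands in for real PDF content."""
    name: str
    pages: list
    on_hold: bool = False


@dataclass
class Repository:
    objects: dict = field(default_factory=dict)
    # section key -> (owning object name, first page offset, page count)
    index: dict = field(default_factory=dict)
    extracted: set = field(default_factory=set)

    def add_print_run(self, name, sections):
        """Store one big object and index each logical section inside it."""
        pages, offset = [], 0
        for key, section_pages in sections.items():
            self.index[key] = (name, offset, len(section_pages))
            pages.extend(section_pages)
            offset += len(section_pages)
        self.objects[name] = RepoObject(name, pages)

    def extract(self, key):
        """Copy one section into its own object and repoint the index at it."""
        owner, start, count = self.index[key]
        if key in self.extracted:
            return self.objects[owner]
        copy = RepoObject(f"{owner}#{key}",
                          self.objects[owner].pages[start:start + count])
        self.objects[copy.name] = copy
        self.index[key] = (copy.name, 0, count)
        self.extracted.add(key)
        return copy

    def place_hold(self, key):
        """Apply the hold to the extracted copy, not to the whole print run."""
        self.extract(key).on_hold = True

    def dispose(self, name):
        """Delete an object at the end of retention unless it is on hold."""
        if self.objects[name].on_hold:
            raise PermissionError(f"{name} is on legal hold")
        del self.objects[name]
        # Drop index entries that still pointed into the disposed object.
        self.index = {k: v for k, v in self.index.items() if v[0] != name}
```

In this sketch, placing a hold on one statement and then disposing of the print run leaves only the held statement behind, which is the "by exception" behaviour. The real housekeeping (retention clocks, audit trails, releasing holds) is deliberately left out.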
Line-oriented Structured Data as a record
Line-oriented is a little different because the data does not have such a complex logical structure. In fact, each line typically represents a single discrete piece of information: the results of a transaction, a security log event, etc.
The key requirement with line-oriented data is typically to archive it because the production systems get very full very quickly. However, the data that is going to be archived is usually discoverable or more often required to defend an action.
Again, during our epiphany session we realized that ASR could be used not only to archive line-oriented data but also to apply compliance controls to that same data. We archive content from the live database and print it to a PDF file. The PDF file might contain lines representing transactions from a single day -- this could be hundreds of thousands of lines of data. The PDF file is stored in the EMC Documentum repository until a discovery or an analysis needs to take place. At this point the contents of the PDF file are loaded into a temporary database; you query the temporary database, get your data as a standalone XML file and then make that into a record. The XML extract file is generated automatically by the system, which means that it can be shown to be a good record of the original data -- we can effectively show a coherent chain of custody.
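Here is a rough Python sketch of that load, query and extract cycle, using an in-memory SQLite database as the temporary database. Everything specific in it is an assumption of mine rather than anything ASR does: the pipe-delimited `timestamp|account|amount` line layout, the `extract_record` function name, and especially the SHA-256 digests, which are just one plausible way to tie the XML extract back to the archived source when arguing chain of custody.

```python
import hashlib
import sqlite3
import xml.etree.ElementTree as ET


def extract_record(archived_lines, account):
    """Load archived lines into a throwaway database, query it, export XML.

    Assumes a hypothetical pipe-delimited layout: timestamp|account|amount.
    Returns the XML bytes and a SHA-256 digest of those bytes.
    """
    # Fingerprint the archived source so the extract can point back at it.
    archive_text = "\n".join(archived_lines)
    source_digest = hashlib.sha256(archive_text.encode("utf-8")).hexdigest()

    # Temporary database: exists only for the duration of this query.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE txn (ts TEXT, account TEXT, amount TEXT)")
    db.executemany(
        "INSERT INTO txn VALUES (?, ?, ?)",
        (line.split("|") for line in archived_lines),
    )
    rows = db.execute(
        "SELECT ts, amount FROM txn WHERE account = ? ORDER BY ts",
        (account,),
    ).fetchall()
    db.close()

    # Standalone XML extract that would then be declared as the record.
    root = ET.Element("extract", source_sha256=source_digest, account=account)
    for ts, amount in rows:
        ET.SubElement(root, "transaction", ts=ts, amount=amount)
    xml_bytes = ET.tostring(root, encoding="utf-8")
    return xml_bytes, hashlib.sha256(xml_bytes).hexdigest()
```

The point of the sketch is that no human hand-edits anything between the archive and the extract: the same automated path produces the XML every time, and the embedded source digest gives you something to check it against.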
Conclusion
Does this actually solve the core problem of managing structured content as a record or is it just a software vendor using their tools to manufacture a half-baked pretend solution? I believe it truly is the former. It only looks like the latter if you forget the real objective -- to retain discoverable content in a way that shows a chain of custody.
It doesn't work in every case but believe me, it works a lot better in reality than it does in theory. In theory there are loads of reasons why you would not want to do this but in reality it is quite achievable.
As always, email me if you want more information...