This is the second in a series of articles that discuss the benefits of moving content out of SQL Server. This article discusses pairing SQL Server to an archive system; the next discusses pairing to a traditional ECM solution. The series wraps up with an overview of the pros and cons of each approach, including not putting the content in SQL Server in the first place.
Store the unstructured content in a classic archive system
I believe that we are entering what I’ll call the era of “Archive 2.0” when it comes to SharePoint content. Once SharePoint really took off we saw a flurry of archiving solutions hit the market, but I class them as V1.0 type solutions. When you look at how invasive they are, the underlying technologies that they rely on and their policy management, you realize that there’s a lot of meat left on that bone (lousy metaphor, I must be hungry).
Let’s define an archive and look at the business problem it solves, consider how Archive 1.0 solutions address those problems and then maybe how Archive 2.0 might do a better job.
What is an Archive and why does SharePoint care?
Classic archiving systems are typically focused on pulling fixed, infrequently accessed or "old" information out of production systems and managing it in a separate repository. Generally, they provide value through policy-based management of the content for storage optimization, compliance and eDiscovery. In SharePoint’s case this might be content that is fixed in nature, such as scanned images; equally, it could be stuff that you have finished with: old versions of SharePoint content, content in abandoned SharePoint sites, etc. These content types may still be accessed after they have been archived but generally they do not represent your most active content.
The benefit of archiving is that you leave only active content in the production system, which allows that system to scale more effectively; it also gives you better storage and backup management. An archive is not just somewhere for content to go to die, although that is a real value too. Consider, for example, decommissioning a SharePoint site: if all of your discoverable content is already stored in the archive, you can decommission the site knowing that if you have to refer back to the content you can either search for a specific item or restore the entire site from the archive.
These systems differ from the file system approach because they are not interested only in storing unstructured content on a day-forward basis. They want all of your content, structured and unstructured, and they want to be able to specify exactly what they want and when. They are optimized for high rates of ingestion, usually tuned to handle the volumes of email content that typically dwarf all other data types. They will store your documents, calendars, blogs, wikis, site collections, site configuration and even custom metadata. As an added bonus, many of these solutions let you restore individual objects or even complete sites from the stored information instead of having to use SharePoint’s own backup/restore tools.
Unlike using a file system, archives can store more metadata – context that allows you to understand what the objects are. This allows you to implement more intelligent policies for security, retention, classification, etc.
Many systems will support some form of basic 2-way synchronization model; for example, you might be able to run rules that dispose of content in the archive, and the archive process will reach into SharePoint and delete the corresponding objects. (Personally I find this idea scary; it reminds me of the challenges we have in automated Records Management disposition. Do you really want to be the person that clicked on the ‘Dispose All’ button?) As scary as it seems, it is a necessity in many cases; otherwise you’ll just end up with thousands of sites full of thousands of unwanted documents. If eDiscovery is an issue for you then this is something that you might want to think about especially hard.
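One way to take some of the fear out of that ‘Dispose All’ button is to make every disposition run a preview by default. The Python sketch below is only an illustration of that idea, not how any particular archive product works: `ArchivedItem`, `dispose_expired`, the `on_legal_hold` flag and the two delete callbacks are all hypothetical names that I have made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ArchivedItem:
    """Hypothetical record linking an archived object to its SharePoint source."""
    archive_id: str
    sharepoint_url: str
    retain_until: date
    on_legal_hold: bool = False  # assumption: eDiscovery holds block disposal


def dispose_expired(
    items: Iterable[ArchivedItem],
    today: date,
    delete_from_archive: Callable[[str], None],
    delete_from_sharepoint: Callable[[str], None],
    dry_run: bool = True,
) -> List[ArchivedItem]:
    """Return the items whose retention has expired; delete them only if dry_run is False."""
    candidates = [
        item for item in items
        if item.retain_until < today and not item.on_legal_hold
    ]
    if not dry_run:
        for item in candidates:
            # The "2-way" part: remove the SharePoint object, then the archive copy.
            delete_from_sharepoint(item.sharepoint_url)
            delete_from_archive(item.archive_id)
    return candidates
```

Calling `dispose_expired(...)` with the default `dry_run=True` returns the list of candidates without touching anything, so somebody can review it before the same call is repeated with `dry_run=False`.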
How is it done in Archive 1.0?
Objects are removed from SharePoint and replaced with shortcuts.
With unstructured content, these Archive 1.0 systems often move objects out of SharePoint and store them in the archive, leaving behind shortcuts back to the archived objects. I’ve noticed that vendors come up with very creative names to disguise these technology travesties: you’ll see them called links, stubs, pointers, proxies, placeholders, etc. These are fundamentally flawed because SharePoint does not have the concept of a native shortcut object. This means that many important SharePoint features may not survive the shortcutting process, including Microsoft Office integrations, full text indexing, custom metadata and workflows.
To be fair, early on there weren’t many options and shortcuts to archives are not as hideous as they are to ECM systems because archived data should not be terribly active; read-only access via a shortcut is less of a catastrophe than it would be with active content. In fact, if your policies are well defined then shortcuts are acceptable but I expect changes in SharePoint, SQL Server, CMIS and elsewhere to make this draconian approach redundant. (I’ll have to chat to Dr. Pie about the CMIS one at EMC World this year.)
The Archive is a ‘black box’.
The archive is pretty much treated as a black box. In Archive 1.0 land this was seen as an advantage – you neither knew nor cared where the content was – it just wasn’t in your production system. This seemed like a good thing but you are leaving critical business assets out of reach to anything other than maybe your eDiscovery process.
Access to the archived content is expected to go through SharePoint, not directly to the archive, so you are not really able to re-use the archived content from elsewhere. To be fair, if your policies are implemented correctly then this is exactly what you want, because the archived content is not going to be re-used anyway, but policies are never 100% watertight. Typically you can search the archive and might be able to apply localized policies to objects to support activities like eDiscovery. You’ll see that this area forms one of the biggest differentiators between storing content in an archive and storing it in your ECM system (next posting).
How might Archive 2.0 Improve things?
It is not that Archive 2.0 is going to change the basic premise of what an archive does; it is just that it will be more sophisticated, less invasive and more integrated into other parts of your business. This makes it more cost effective because you’ll be able to leverage what it does across more systems.
No shortcuts.
With Archive 1.0 your SharePoint content is either in SharePoint or in the archive; in the latter case it is replaced by a shortcut. In Archive 2.0 I expect to see a more sophisticated and granular approach to how and where content is managed. For example, it might start in SharePoint as a native object for a while, then get virtualized using RBS. Later in its lifecycle it might get replaced with a shortcut, then the entire SharePoint site might be deleted, leaving the objects available only to the archive application. Finally, after the retention policies have expired, the content might be deleted. During these phase changes the objects might be on local storage, move off premises, onto tape (real and virtual), etc. This more granular approach will provide a better end user experience, more cost effective storage utilization and a greater ability to leverage new storage paradigms as they appear.
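The lifecycle described above can be pictured as a simple policy that maps an object's age to a stage. This is a minimal Python sketch under loud assumptions: the stage names follow the paragraph, but the `LifecyclePolicy` class and every threshold in it are hypothetical numbers I picked for illustration. In a real deployment the transitions would more likely be driven by events (a site being decommissioned, a retention rule firing) than by age alone.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Stage(Enum):
    NATIVE = "native object in SharePoint"
    RBS = "BLOB moved out of SQL Server via RBS"
    SHORTCUT = "replaced by a shortcut"
    ARCHIVE_ONLY = "available only to the archive application"
    DELETED = "deleted after retention expired"


@dataclass(frozen=True)
class LifecyclePolicy:
    # Thresholds are in days since last modification and are purely illustrative.
    rbs_after: int = 90
    shortcut_after: int = 365
    archive_only_after: int = 3 * 365
    delete_after: int = 7 * 365

    def stage_for(self, last_modified: date, today: date) -> Stage:
        """Pick the latest stage whose age threshold has been reached."""
        age = (today - last_modified).days
        if age >= self.delete_after:
            return Stage.DELETED
        if age >= self.archive_only_after:
            return Stage.ARCHIVE_ONLY
        if age >= self.shortcut_after:
            return Stage.SHORTCUT
        if age >= self.rbs_after:
            return Stage.RBS
        return Stage.NATIVE
```

For example, `LifecyclePolicy().stage_for(date(2008, 1, 1), date(2009, 5, 15))` lands in the shortcut stage under these made-up thresholds, because the object is more than a year old but less than three.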
Aggregation
2.0 Archives will aggregate data not just from multiple SharePoint sites but from SharePoint, email, file shares… you name it. The ability to aggregate across multiple systems does give you some real benefits. You can implement a single set of policies across the different systems’ data (when that makes sense; the rules that govern SharePoint content don’t always make sense for email content). You can get object-level de-duplication across all systems, which can save storage, and you can manage your centralized storage utilization better because you don’t end up with multiple back-end storage solutions, each with its own quota overhead. One area where aggregation does deliver real cost savings is eDiscovery: if all of your SharePoint sites and all of your email content are discoverable from within a single archive then you can realize fairly impressive gains, and the single instance storage helps with this one too.
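To show what object-level de-duplication across systems means in practice, here is a small Python sketch of single instance storage keyed on a content hash. It is an assumption-laden toy, not a description of any vendor's implementation: the `SingleInstanceStore` class, its method names and the choice of SHA-256 are all mine, and a real archive would keep the bytes on disk rather than in a dictionary.

```python
import hashlib
from typing import Dict, Tuple


class SingleInstanceStore:
    """Toy archive store that keeps one copy of each distinct piece of content."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}                   # digest -> content
        self._references: Dict[Tuple[str, str], str] = {}    # (source, id) -> digest

    def ingest(self, source: str, object_id: str, content: bytes) -> str:
        """Record an object from any source system; store its bytes only once."""
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)
        self._references[(source, object_id)] = digest
        return digest

    def stored_bytes(self) -> int:
        """Bytes actually held after de-duplication."""
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self) -> int:
        """Bytes the source systems believe they have archived."""
        return sum(len(self._blobs[d]) for d in self._references.values())
```

If the same attachment is ingested once from a SharePoint library and once from a mailbox, `logical_bytes()` counts it twice while `stored_bytes()` counts it once; the gap between the two numbers is the storage saved.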
Conclusion
The likelihood is that at some time in the near future you will have to implement an archive for some part of your business. It might be Exchange that cracks first, you might decide to start harvesting that fine collection of crap on your network file servers, or you might want to try to get control of your SharePoint content before you drown in it. You might be able to get away with using your ECM system as a back end (in some cases it will be the better option) but IMHO the 80/20 rule kicks in here. It is likely that in time 80% of your content would be better off in a nice old archive and 20% should be in your ECM system (the actual number is probably 94/6 but it does not sound as good). An archive will typically be less expensive to administer and is more likely to scale over time as you grow your deployments, so think hard about using an archive solution regardless of what else you are looking at.
What’s next?
In the next post I’ll look at pairing SharePoint to a traditional ECM solution. It brings a different approach to storing your content: it looks similar to archiving, but the differences, while subtle, are very important.
Since you speak of Archive 2.0 capabilities in the future tense, when do you expect to see actual working incarnations of same?
Posted by: John Heckendorn | 05/15/2009 at 04:55 PM