I'm down in sunny Orlando for the Microsoft TechEd Conference this week, which is odd because I am neither technical nor particularly educated; maybe I'll return a different person. One of the things that I'm going to be validating with customers, partners and the Microsoft team is the general concept of "SharePoint Archiving."
It would seem that one man's archive is another man's backup/journal/file system. This has not always been the case, but these days we seem to overload the term 'archiving' way too much IMHO. In fact, the definition of an archive seems to be multidimensional, based on why you archive, how you archive or even where you archive from - I'd proffer that there is no right or wrong definition, just different ones. At the end of this piece I'll suggest specific terms to describe each 'flavor' of archive that might help clarify the different approaches.
There are three main areas that I am considering right now:
- What do we mean by an archive?
- Why bother archiving at all?
- At what level of the SharePoint architecture would you perform the archiving?
Let's consider each of these in turn - if you are at TechEd and you have an opinion that you'd like to share then head over to the EMC booth and they should be able to get hold of me.
What?
This seems like a redundant question - "What is an archive?" seems like a question that contains the answer... let me give you a few different definitions that I've heard bandied around and you'll see how broad the range is:
- Some vendors have solutions that copy an entire document library or site to an 'archive'.
- The ability to declare a piece of content in SharePoint as a record can be termed 'archiving'.
- Allegedly, Microsoft call their remote blob storage capabilities 'archiving'. (RBS and the associated new functionality support the ability to store content on a file system instead of in the SQL Server database.)
- If you are the DB admin then backing up the underlying SQL Server database is archiving.
I think that the confusion, and hence the clarification, actually comes from the next question - "Why?"
Why?
Your personal definition of archiving is going to be driven by the rationale for archiving in the first place. Is it driven by compliance, is it a way to increase performance, does it provide operational backups, does it reduce scalability constraints or is it a means to re-purpose the content? Also consider whether the archive is active or inactive - do you need to archive the content and then continue to access it from SharePoint or will the archive only be used offline?
I think that it is fair to say that until recently an archive was something that took your content offline. Consider the original archives that we had before electronic document management systems appeared - they were physical vaults that your 'archived content' was shipped off to in a box. We continue to use the term archive when the reality of what an archive looks like has changed dramatically.
Where?
This is an interesting question that forms another dimension of the definition of archiving, although I recognize that this is the technology solution wagging the business problem (there's a metaphorical mix that you don't hear too often!).
Will you archive from a very low level by grabbing content from the file system or directly from SQL Server? Will you archive transparently from SharePoint so that the end user does not even know that the archive happened? Will the archive process be something that the end user invokes intentionally (e.g. by declaring an object as being a record)?
The 'where' question will determine the number of objects that you will archive and what you can do to the archived content. Archiving content at a low level without SharePoint knowing that you've done it makes it harder to apply extra constraints to the content; for example, if you archive something out of SQL Server and make it immutable and then try to delete it from SharePoint what will happen? If you archive content as a manual process at a higher level in the stack then the chance is greater that you'll be able to interact with the archived content intelligently.
Who Cares?
I think that we all should; we cannot just keep throwing content into SharePoint without an 'exit strategy' for the content. If you don't know what is going to happen to the content in the future then you'd better start saving up for all of the nice EMC hardware that you'll need!
Chapman's Definitions:
As promised, here are my definitions for the different types of archiving...
- Archive: A copy of content and/or metadata that is stored offline for compliance purposes. This includes formal compliance requirements like records management and also informal business-driven/best practice compliance such as keeping copies of documents for historical reasons.
- Backup: Differs from an archive because the sole purpose of the copy is to allow the content to be recovered for operational reasons.
- Journal: Differs from an archive because the content is moved from SharePoint into the journal provider and remains online. Not all content needs to be moved to the journal - maybe just content that is over a specific size or of a certain type.
- Centralized repository: Similar to a journal but the content is consolidated into one location (a journal solution could legitimately have one journal location per SharePoint site). This approach allows you to have an aggregated view of content across sites and to then manage the content en masse.
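For the more technically minded, here is a rough sketch of how the four definitions above differ once you write them down as a handful of yes/no properties. Everything in it is an illustrative assumption rather than anything SharePoint or a vendor defines: the field names, the purpose strings, and the size threshold and file types in the hypothetical `should_journal` rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Kind(Enum):
    ARCHIVE = "archive"
    BACKUP = "backup"
    JOURNAL = "journal"
    CENTRALIZED_REPOSITORY = "centralized repository"


@dataclass(frozen=True)
class Approach:
    """Hypothetical description of how a solution treats content."""
    purpose: str                # e.g. "compliance" or "recovery" (assumed labels)
    moves_content: bool         # True: content is moved out of SharePoint; False: copied
    online: bool                # True: still reachable from SharePoint afterwards
    consolidated: bool = False  # True: one location shared across sites


def classify(a: Approach) -> Optional[Kind]:
    """Map an approach onto the definitions above; None if it fits none of them."""
    if a.moves_content and a.online:
        # Moved but still online: a journal, or a centralized repository
        # when everything is consolidated into one location.
        return Kind.CENTRALIZED_REPOSITORY if a.consolidated else Kind.JOURNAL
    if not a.moves_content and a.purpose == "recovery":
        # A copy kept solely for operational recovery.
        return Kind.BACKUP
    if not a.moves_content and not a.online and a.purpose == "compliance":
        # An offline copy kept for compliance purposes.
        return Kind.ARCHIVE
    return None


def should_journal(
    size_bytes: int,
    extension: str,
    min_size_bytes: int = 50 * 1024 * 1024,                  # assumed threshold
    extensions: FrozenSet[str] = frozenset({".mp4", ".iso"}),  # assumed types
) -> bool:
    """Journal only content over a given size or of a certain type."""
    return size_bytes > min_size_bytes or extension.lower() in extensions
```

An approach that matches none of the four comes back as `None`, which is a handy reminder that the list is not comprehensive.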
I'm sure that this list is not comprehensive; I'd love to add more definitions to it and also tighten up the scope of each one. If you have feedback I'll incorporate it.