If you only read one of my series of posts in this lifetime, you might want to make it this one. It is certainly not my most coherent or interesting series, but it is the one that is likely to save you the most grief over the next few years (assuming you have something to do with SharePoint deployments).
This was going to be a single post but I got a bit carried away, so I've split it up into the following posts:
- This one - An Overview of the Issues
- The next one - An Overview of the Potential Solutions.
- The one after that - EBS vs. RBS...the ultimate grudge match (because I begrudge having to deal with it.)
- Maybe a bonus final posting but probably not.
An Overview of the Issues
I have dedicated a lot of posts in the past to talking about how and why it makes sense to create a nice symbiotic relationship between your ECM system of choice and SharePoint. I won't ramble on about silos, compliance, long term archiving, over-duplication of content, scalability, etc. ad nauseam (again); instead I will take a different tack.
Imagine a world where content created in SharePoint was automatically routed to the most appropriate location depending on factors such as values in the object's attributes, where the object is in its lifecycle and/or who created it. Imagine that this was done without in any way affecting the SharePoint end user experience or any applications built on top of SharePoint. Imagine if doing this didn't just reduce risk and costs but it also made your SharePoint deployments more scalable and robust.
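To make that dream a little more concrete, here's a rough sketch in plain Python of the kind of attribute-driven routing I have in mind. To be clear, this has nothing to do with SharePoint's actual APIs; every name, attribute and rule in it is made up for illustration.

```python
# A minimal, hypothetical sketch of rule-based routing; not SharePoint code.
from dataclasses import dataclass

@dataclass
class Document:
    name: str
    department: str       # an attribute filled in on the SharePoint properties screen
    lifecycle_stage: str   # e.g. "draft", "approved", "record"
    created_by: str

def choose_store(doc: Document) -> str:
    """Pick a target store from simple rules; the SharePoint user experience
    is untouched because the routing happens behind the scenes."""
    if doc.lifecycle_stage == "record":
        return "ecm-repository"     # governed, retention-managed store
    if doc.department == "Legal":
        return "secure-archive"     # tighter access and audit controls
    return "file-store"             # cheap bulk storage by default

print(choose_store(Document("contract.pdf", "Legal", "approved", "achapman")))
# -> secure-archive
```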
Before we continue to dream of such a thing let's talk about one of the fundamental issues with the SharePoint architecture. SharePoint stores everything in SQL Server, not just the structured content but also the unstructured content, (the actual binaries - Word documents, PDFs, JPGs, etc.). If we could get these binaries out of SQL Server and manage them in a more appropriate way then many of the limitations and concerns around SharePoint are lessened.
Let's start by reviewing why getting content out of SharePoint's back end is such a big deal. You might be asking what's wrong with how SharePoint manages content today; to save you going back through my blog for the answers, let me give you a high-level overview.
How SharePoint Manages Content Today
- If you import a 10GB object into SharePoint and fill out the properties screen, the 10GB object and its associated metadata cascade down the SharePoint stack and both end up in SQL Server.
- The attributes get written to a database table and the 10GB file is also stored in the database as a Binary Large OBject (BLOB).
- In my ever so humble opinion the only thing good about BLOBs is the cool acronym. When I was a DB admin we were told never to store binaries in the database unless they were teeny weeny little files – XML, for example.
- In fact, many of the limitations that you hear about SharePoint's scalability actually come from SQL Server trying to store the BLOBs, not from SharePoint or IIS themselves.
In summary, storing the BLOBs in SQL Server creates issues around both scalability and the creation of silos.
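If you want a picture of the difference, here's a toy illustration using SQLite. The table and column names are invented – this is emphatically not the real SharePoint content database schema – but it shows the contrast between a binary stored inline as a BLOB and a small pointer to an externalized binary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs_inline (doc_id TEXT, title TEXT, content BLOB)")
conn.execute("CREATE TABLE docs_external (doc_id TEXT, title TEXT, blob_ref TEXT)")

payload = b"\x00" * (1024 * 1024)   # pretend this is a large Word document

# Inline: the binary itself travels down the stack and into the database row.
conn.execute("INSERT INTO docs_inline VALUES (?, ?, ?)", ("1", "Plan.docx", payload))

# Externalized: the database keeps only a small pointer; the binary lives elsewhere.
conn.execute("INSERT INTO docs_external VALUES (?, ?, ?)",
             ("1", "Plan.docx", "store://tier2/ab12cd34"))
```

The first table is the one that gives SQL Server indigestion at scale; the second is what the rest of this series is about.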
How bad is using BLOBs really?
That really depends on who you are and what SharePoint is being used for. For small, non-regulated deployments you may not care, however if you want to use SharePoint as part of your enterprise infrastructure then it potentially really sucks. For example, there's a good chance that you will be forced to structure your site and farm topology based on SQL Server capacity rather than based on the business need. Technology driving deployments is never a good idea...in fact it always bites you in the end (the rear end usually).
What would avoiding Database BLOBs give me really?
Oh, I am so glad I asked, (I am a Gemini so this split personality thing is perfectly normal according to my team of psychiatrists). I use the term “Data Aggregation” in my architectural postings to describe the concept of storing all of the binary objects in a single location. Here's the picture to hold in your mind: binary objects that are being managed by SharePoint would be stored in a single centralized system; not all of them all of the time but whenever it makes sense. Along with the binary object we also take a "convenience copy" of some of the object's metadata and any other contextual information of interest, (the folder it came from for example). This metadata and context data is captured simply to allow us to be able to work intelligently with the object.
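For the sake of illustration, a "convenience copy" might look something like this. Every field name here is hypothetical – the point is simply that the binary's reference, a little metadata and the context travel together.

```python
# A hypothetical "convenience copy" kept alongside an externalized binary.
convenience_copy = {
    "blob_ref": "store://tier2/ab12cd34",   # where the binary now lives
    "sha256": "9f2c...",                     # a content hash comes in handy for de-duplication
    "metadata": {                            # copied from the SharePoint item's attributes
        "Title": "Q3 Forecast",
        "Author": "achapman",
        "ContentType": "Financial Report",
    },
    "context": {                             # where the object came from
        "site": "https://intranet/sites/finance",
        "library": "Shared Documents",
        "folder": "Forecasts/2008",
        "version": 3,
    },
}
```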
So what would this aggregated view give me? Let's break it down into three key areas:
- Operational Efficiencies
- Use the metadata values to intelligently perform HSM (Hierarchical Storage Management) on the objects – i.e. move content to different storage devices based on a set of rules. This reduces hardware costs considerably. (A rough sketch of this, together with the de-duplication below, follows this list.)
- While you are at it why not de-duplicate the content, (think of the storage and backup savings there). This de-duplication is not just within a single site, you could de-dupe across all SharePoint sites in the entire company. Bear in mind that if you have versioning switched on in SharePoint it tends to create a huge number of identical copies of documents…more than you might expect.
- Don't stop there; you can also more efficiently deploy your SQL Servers because they will scale up the wazoo if they only contain the structured data. ("up the wazoo" is a DBA technical term.)
- Backup and recovery…well the jury is out on this one. Depending on where you aggregate your content to and how you do your backups (hot vs. offline) this could either be hugely efficient or a bit of a nightmare. We are working on that one - if you have any ideas contact me directly for a chat.
- You will certainly increase the scalability of your deployments by removing the payload from SQL Server; this alone might justify the effort to aggregate your data.
- Many companies do not manage all of their SharePoint deployments from within the data center simply because they do not have the capacity to service that many separate systems. This aggregation approach means that IT could allow departmental rollouts of SharePoint but mandate that those deployments use the aggregated store, so that IT retains responsibility for the actual content.
- Governance, Risk and Compliance (GRC)
- Once you have all, or a subset, of your content in a central repository you can start to apply good governance controls to it. If it is an ECM system you can start applying retention controls, you can apply digital rights management to objects and have more robust data protection/audit controls.
- Leveraging the HSM feature you can apply different levels of control to different categories of content. For example, some content might just be stored on a secure file system, other content might be dumped into your ECM system and some might go to a CAS device.
- Probably the biggest compliance advantage is the ability to centralize the application and management of your controls. Given that all of your corporate assets are now in one place you can apply a common set of controls to them – one file plan, one set of retention schedules and one disposition process.
- Business Gains
- What about your end users? To be fair, they’d probably love this system simply because it is non-invasive but we can throw them another bone or two just to be kind.
- eDiscovery is a case in point – obviously having everything in one place, de-duplicated and indexed, is a boon for the legal eagles, but there is another bonus feature…as I mention in the first list, right now there is a plethora of content lurking in hidden, department-deployed SharePoint systems; aggregation means that you can ensure you have visibility into all of this content.
- Long term archiving is another interesting use-case; this could be compliance driven or just good business practice. If your content resides in SharePoint then in order to have access to it in 20 years' time you will still need to have SharePoint running. This is neither attractive nor realistic! However, if the key business documents have been archived out with their metadata then you have many more options, such as leaving them in your existing long term repository, exporting them to a file system with an XML tag file, storing them on a CAS device, etc. (There's a small sketch of such an export after this list too.)
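As promised in the Operational Efficiencies bullets, here's a rough sketch of hash-based de-duplication and rule-driven tiering (HSM). The tier names, thresholds and the store callable are all invented for illustration – real products will have far more sophisticated policies.

```python
import hashlib
from datetime import datetime, timedelta

seen_hashes = {}   # content hash -> blob_ref of the copy we already keep

def store_or_dedupe(content: bytes, store) -> str:
    """Store a binary once; identical copies (SharePoint versioning is a big
    culprit here) collapse to a single stored blob."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return seen_hashes[digest]
    ref = store(content)          # 'store' is whatever writes to the aggregated store
    seen_hashes[digest] = ref
    return ref

def pick_tier(last_accessed: datetime, size_bytes: int) -> str:
    """A toy HSM rule set: cold or bulky content drifts to cheaper storage."""
    age = datetime.now() - last_accessed
    if age > timedelta(days=365):
        return "tier3-archive"      # cheapest device for content nobody touches
    if size_bytes > 100 * 1024 * 1024:
        return "tier2-nearline"     # big files off the expensive disks
    return "tier1-primary"
```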
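And for the long term archiving bullet, here's a small sketch of exporting a binary to a file system with an XML "tag file" sidecar carrying its convenience-copy metadata. The element names and layout are invented; the point is that the document remains readable in 20 years without SharePoint.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def export_with_sidecar(blob: bytes, metadata: dict, target_dir: str, name: str) -> None:
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / name).write_bytes(blob)            # the document itself, as a plain file

    root = ET.Element("document", attrib={"file": name})
    for key, value in metadata.items():       # the convenience-copy metadata travels with it
        ET.SubElement(root, "property", attrib={"name": key}).text = str(value)
    ET.ElementTree(root).write(out / f"{name}.xml", encoding="utf-8", xml_declaration=True)

export_with_sidecar(b"...", {"Title": "Q3 Forecast", "Author": "achapman"},
                    "archive/2008", "forecast.docx")
```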
So, hopefully you now have a feel for why you might want to take a look at aggregating your content in to one location. In the next post I’ll discuss the different approaches that are available and then in the third post I’ll focus on two new capabilities that Microsoft have made available – EBS and RBS.
Sheetal, I'll discuss this at length in the 3rd post...EBS vs. RBS. Bottom line is that getting access to the actual BLOB object is only 5% of the problem…being able to get the object's context from SharePoint, being able to intelligently manage the content, and being able to apply protection/compliance controls is the other 95% of the issue. RBS gives you the file stream (the BLOB object), but try finding out which site/library/folder it came from, try being selective about what gets externalized and to where, try to associate that BLOB with a previous version of that same BLOB for intelligent management…you get the idea. With EBS you have a fighting chance to do this (EBS is implemented in SharePoint whereas RBS is a SQL Server technology), but it is still exceedingly difficult to do it properly.
To be fair, a lot of these issues are only significant if you think past simply getting the BLOBs out of SQL and think about the intelligent management of those BLOBs. I'll try to get the next two entries finished up because I think that they will make everything much clearer.
Andrew
Posted by: Andrew Chapman | 08/26/2008 at 03:00 AM
Thanks for the great post. Wouldn't SQL Server 2008 (FILESTREAM) make the BLOB issue moot? Last week I noticed an entry on the MSFT blog about SQL Server 2008 support: http://blogs.msdn.com/sharepoint/archive/2008/08/15/sql-server-2008-support-for-sharepoint-products-and-technologies.aspx
I am currently testing SQL Server 2008 and MOSS for one of the compliance solutions we are building and so far the results look promising. Any comments would be appreciated.
Posted by: Sheetal Jain | 08/26/2008 at 03:00 AM
Look forward to reading about the solutions Andrew! Thanks!
Posted by: Rich Klahne | 08/29/2008 at 03:00 AM
If you'd like a less biased overview of SharePoint archiving in relation to Documentum then "Pie" has posted an article, "Forecasting the Future of Documentum and SharePoint", over at his Blog. It helps that he also said nice things about me…
Posted by: Andrew Chapman | 09/03/2008 at 03:00 AM
Just to back up Andrew on the architectural side of things with an anecdote ('no names, no pack drill'): I worked on a Documentum 5.3 implementation where performance was so dire using MS SQL Server that the organisation took the hit and paid to move over to Oracle to get better performance from its ECMS. And that is with only the metadata and some other properties held in database tables; the binary objects sit in a flat file store, not as BLOBs in the database. Just extrapolate those performance issues out from there...
Posted by: Jed Cawthorne | 09/03/2008 at 03:00 AM
Great post. Looking forward to seeing that 3rd post. Three questions:
1. Does the EBS provider give you control over what file data should be externalized (a certain size, for example) or is it all or nothing?
2. In relation to your comment, 'Along with the binary object we also take a "convenience copy" of some of the object's metadata and any other contextual information of interest' – are there any guidelines/templates one should be aware of when extracting this associated metadata to make it relevant to ECM vendors?
Posted by: Paul Morrissey | 10/27/2008 at 03:00 AM
This is just a nit, but your example of a 10GB file wouldn't actually "fly," since there's a 2GB limit on BLOBs in SQL and, consequently, in SharePoint as well (there's also a hard-coded maximum size limit built into SharePoint, with a default of 50MB). That said, your point is well taken. In Microsoft's own Interactive Media Manager application (built "on top" of SharePoint), the product team stores the very large media files on a secondary storage mechanism and only keeps track of the assets in SharePoint.
Posted by: Shawn Shell | 11/24/2008 at 03:00 AM