Document DBs store data as “documents”, which can hold any structure you like. However, relationships can still be created between common elements (e.g. an author field may exist on documents of type Book as well as documents of type Article).
Document databases are ideally suited to scenarios where the structure of the data is highly volatile, unstructured, or only partly structured.
Example: SurveyXML contains a Binary Large OBject (BLOB) of survey answers. There is some common ground between different survey types, but the structure has evolved over time. There are also totally different structures, like the Green Deal Assessment (type 81).
Mining “Big Data”
In SQL Server this data is totally impenetrable. For example (assuming the data wasn’t zipped), a query to find the answer to a question across a sample of 500,000 surveys of varying types would be very difficult and very slow in SQL. This sort of query could be up to 20,000x faster in MongoDB with the data in a comparable BSON format [i].
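A minimal sketch of this kind of cross-type query, simulated in plain Python over in-memory documents. The field names and survey types here are hypothetical; the MongoDB equivalent is noted in a comment.

```python
# Hypothetical survey documents of varying shape, sharing a common
# "answers" element keyed by questionId. In MongoDB the equivalent
# query would be roughly: db.surveys.find({"answers.questionId": 102})
surveys = [
    {"surveyType": 1, "answers": [{"questionId": 102, "answer": "Gas"}]},
    {"surveyType": 81, "answers": [{"questionId": 205, "answer": "N/A"}]},  # Green Deal
    {"surveyType": 2, "answers": [{"questionId": 102, "answer": "Oil"}]},
]

def find_answers(docs, question_id):
    """Return every answer to question_id, regardless of survey type."""
    return [
        a["answer"]
        for doc in docs
        for a in doc.get("answers", [])
        if a["questionId"] == question_id
    ]

print(find_answers(surveys, 102))  # → ['Gas', 'Oil']
```

The point is that the query only cares about the one element the types have in common; documents with completely different structures simply fall through the filter.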
SQL Server XML queries get dramatically slower as the volume of data increases. As a direct result, we can draw few useful conclusions from the mountains of data surveyors have collected, beyond the occasional recall of individual records. For the most part that data lies in wait, and the majority will never see the light of day.
Document databases like Mongo are designed to perform searches over large volumes of unstructured data very quickly. That capability could potentially open up the data and allow us to get more out of it. For example, calculating averages and trends in sale and rental prices, regional uptake of energy-saving measures, etc. That sort of information could be very useful MI for eTech and could be a saleable service in itself.
Alignment with agile
Sticking with the same example, the structure of survey data has changed many times during the lifecycle of the SmartSurvey. That is not the exception – it is fairly typical of most strategic systems.
In SQL Server, if we wanted to maintain survey data in a relational form that would allow SQL Server to query it efficiently, the data would have to be split across many different tables. Each change in schema would then require a schema change to the relevant tables and a migration path for all existing documents, which is difficult, risky (in a production setting) and time-consuming.
Document databases don’t constrain you to the structure you started with. You can, whenever you choose, start inserting documents with a different schema, but (crucially) the integrity of data in the pre-existing format is not compromised, so there is no need to migrate existing data to the new format. Furthermore, we can query across our entire collection wherever data is common to both schemas, e.g. QuestionId.
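A sketch of two schema versions living side by side in one collection (the collection is just a list here, and the field names are hypothetical). No migration is needed when the v2 documents start arriving, and a query on the shared field still spans both versions.

```python
collection = [
    # v1 schema
    {"schemaVersion": 1, "questionId": 10, "answer": "Yes"},
    # v2 schema adds metadata the v1 documents never had
    {"schemaVersion": 2, "questionId": 10, "answer": "No",
     "answeredBy": "surveyor-42", "confidence": 0.9},
]

# Query on the field common to both schemas: the v1 documents are
# untouched, yet both versions participate in the result.
answers = [d["answer"] for d in collection if d["questionId"] == 10]
print(answers)  # → ['Yes', 'No']
```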
Finally, how many people would trust Entity Framework migrations with their life??
Alignment with web technologies
There is a wide range of document DBs supporting a variety of document formats, such as binary, XML and, perhaps most interestingly, JSON. In common with most of the IT sector, eTech is investing heavily in technologies like WebApi, MVC and WCF, as well as various mobile platforms. JSON is generally the transport of choice that underpins much of this technology.
Importantly, JSON is NOT the data format that underpins SQL Server.
Consider a common scenario where we get data from the database to push straight out over HTTP via WebApi/MVC. With SQL Server this process entails a series of conversions: from SQL’s binary stream, mapping into a POCO, before being serialized to JSON. Some of these conversions are implicit and we get a lot of help from ORMs to facilitate the mappings, but nonetheless there is an overhead to all this.
Consider the same scenario with a CouchDB database, where data is stored as JSON. In its simplest form, CouchDB has a web API built in, so documents can be served over HTTP with no conversion at all…
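The two paths can be contrasted in a few lines. This is an illustrative sketch in Python rather than .NET: the tuple stands in for an ADO.NET data reader row, and the class stands in for a POCO.

```python
import json

# SQL Server path (sketch): binary row -> POCO -> JSON
class SurveyPoco:
    def __init__(self, survey_id, postcode):
        self.survey_id = survey_id
        self.postcode = postcode

row = (42, "AB1 2CD")                # stand-in for a data reader row
poco = SurveyPoco(*row)              # ORM-style mapping step
payload = json.dumps(poco.__dict__)  # serialization step for the HTTP response

# Document DB path (sketch): the stored document already *is* JSON
stored_document = '{"survey_id": 42, "postcode": "AB1 2CD"}'
payload2 = stored_document           # nothing to convert before sending

assert json.loads(payload) == json.loads(payload2)
```

Both payloads are identical on the wire; the document DB path simply skips the mapping and serialization steps.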
Sticking with CouchDB, if you’re using the data in an application and you want to de-serialize it into POCOs, there is a range of APIs that will do this for you in familiar fluent syntax. You can use the de-serializer built into the API, or you can take the raw JSON and use any of a number of JSON de-serializers from various vendors. You are no longer locked into ADO.NET.
If you’re lucky, you’ll get a JSON data type in SQL 2016…
Alignment with OOP
Have you seen the mess Entity Framework makes if you attempt to store data from two derived classes? In a simple scenario of two classes derived from a base class, you first have to decide between table per hierarchy (TPH), table per type (TPT) and table per concrete class (TPC). Some perform better than others; some are simpler than others. There are performance and complexity trade-offs with each strategy.
Why is this even a thing?? Why do I have to tell SQL Server how to cope with a straightforward, mainstream data structure?
None of that is a bother in document DBs. If you can represent it in JSON (and you can), you can store it in a document DB.
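A sketch of the same inheritance scenario as documents, using Python dataclasses in place of C# classes (the type names and `_type` discriminator are hypothetical). Both derived types land in the same collection with no mapping strategy to choose.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Survey:
    survey_id: int

@dataclass
class HomeSurvey(Survey):
    bedrooms: int = 0

@dataclass
class GreenDealAssessment(Survey):
    measures: list = field(default_factory=list)

# Both derived types serialize naturally; a type discriminator is the
# only extra piece of information needed to round-trip them later.
collection = [
    {"_type": "HomeSurvey", **asdict(HomeSurvey(1, bedrooms=3))},
    {"_type": "GreenDealAssessment",
     **asdict(GreenDealAssessment(2, measures=["loft insulation"]))},
]
print(json.dumps(collection, indent=2))
```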
Document Databases: The Catches
Not really a catch, but unsurprisingly document databases manage relationships differently to relational databases, so you need to use them differently. Documents can be embedded in other documents, and documents can reference other documents. Complex multi-table queries do not perform as well as queries on one or a few tables, so embedding documents is generally the better choice for performance. Quite often a “copy” of the dependency as it was at the time is perfectly acceptable, even preferable to a consistent shared instance.
The exception is where there are shared dependencies that must remain consistent between documents, e.g. if you change a shared dependency in one document, the expectation is that the change is reflected in the other documents that reference it.
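The two approaches can be sketched side by side (the address fields here are hypothetical). Embedding takes a snapshot; referencing keeps one shared copy that updates everywhere.

```python
# Embedding: the address is copied into the survey as it was at the
# time - a snapshot, often preferable for historical records.
embedded = {
    "surveyId": 1,
    "address": {"line1": "1 High St", "postcode": "AB1 2CD"},
}

# Referencing: the survey points at a shared document, so a change to
# the address is reflected everywhere it is referenced.
addresses = {"addr-7": {"line1": "1 High St", "postcode": "AB1 2CD"}}
referenced = {"surveyId": 2, "addressId": "addr-7"}

addresses["addr-7"]["postcode"] = "XY9 8ZW"  # shared update

assert embedded["address"]["postcode"] == "AB1 2CD"              # snapshot kept
assert addresses[referenced["addressId"]]["postcode"] == "XY9 8ZW"  # change seen
```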
A typical document may contain an object graph that might have some embedded documents which may be repeated in other documents. Obviously, in a relational database we would try and optimise this by normalizing the data. This means that the databases are typically larger than a SQL Server equivalent. To what extent that is an issue in an era of relatively cheap storage is subjective.
Relational databases are good at constraining and validating what you can store, preserving the integrity of the data. You can’t rely on that in document databases.
That said, the elements of your POCO are strongly typed, and any dependencies (e.g. static data) will generally have been selected from a list that came from the database anyway. Do you need to validate all of that again? I would suggest that if you are relying on the database to validate the types of your data and the integrity of your relationships, there are other problems…
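Where validation is genuinely needed, it can live in application code before the insert. A minimal sketch, with a hypothetical required-field list:

```python
# Required fields and their expected types - hypothetical example.
REQUIRED = {"surveyType": int, "answers": list}

def validate(doc):
    """Check a document against REQUIRED before inserting it."""
    for name, expected_type in REQUIRED.items():
        if name not in doc:
            raise ValueError(f"missing field: {name}")
        if not isinstance(doc[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    return doc

validate({"surveyType": 1, "answers": []})  # passes silently
```

In practice many document databases also offer server-side options for this (for example, MongoDB supports schema validation rules on a collection), so the choice is where to validate, not whether you can.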
Furthermore, document databases are horizontally scalable, which helps manage the size problem (SQL Server typically scales vertically).
Most document databases support atomic (per-document) transactions; however, multi-document transactions are not universally supported. Some platforms do now support multi-document transactions (FoundationDB, RavenDB, etc.), and in time this will likely propagate to the majority of platforms as the technology matures.
Relational databases are relevant BUT they are a solution to a traditional problem. They are very good at maintaining consistent data integrity and their performance is optimised around querying well-defined, strongly-typed datasets with a stable structure.
That feature-set is an excellent choice for a waterfall, up-front-designed project where deviation from the original design is strictly controlled. It is excellent for a flat POCO structure where one class can be accurately represented as one table. It is space-optimised and highly performant, even on the most complicated queries spanning many joins.
Where managing change (Agile), web technologies, big data and OOP are involved there are some headaches with SQL Server. We’ve papered over the cracks to a certain extent with technology like Entity Framework but those headaches often show through.
This is where document databases come in. They are a solution to these modern problems: they outperform SQL Server when it comes to mining large non/semi-structured datasets, evolving your data structures throughout the lifecycle of your application to keep up with new requirements, and integrating with prevailing web technologies.
[i] http://lifeinvistaprint.com/techblog/mongodb-vs-sql-servers-xml-data-type/ – MongoDB vs SQL Server XML data type comparison