Typically, all data for a given application is stored in a single DBMS. The usual data maintenance tasks then boil down to taking backups on a predefined schedule and testing the disaster recovery procedures. A potential loss of data is straightforward to handle: the database is either recovered to a point in time right before the failure or, failing that, restored from the most recent successful backup. Either way, it is always possible to tell exactly what time period the data was lost for and ask users to redo their work.
Data administration gets more complicated with a heterogeneous persistence layer. With multiple data stores there is an additional concern: keeping the data consistent across the stores.
Our application used three different data stores:
- SQL Server tables
- MongoDB collections
- Uploaded files (PDFs and such) on the file system

The data was interconnected pairwise: SQL Server and files, SQL Server and MongoDB.
The dependency between SQL Server and the files was simple and easy to deal with. The uploaded files were referenced from the SQL Server tables by path and name, i.e. the tables held pointers to the data files. Since files rarely needed to be deleted, hard deletes were eliminated and replaced with soft deletes (a sketch follows below). That meant that as long as the file backup was taken any time later than the SQL Server backup, the recovered data would be consistent: every file referenced by a restored row would still exist on disk.
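As an illustration, here is a minimal sketch of the soft-delete approach in Python with pyodbc; the table and column names are hypothetical:

```python
import pyodbc

# Hypothetical schema: Documents(DocumentId, FilePath, IsDeleted, DeletedAt),
# where FilePath points at the uploaded file on disk.
def soft_delete_document(conn: pyodbc.Connection, document_id: int) -> None:
    cur = conn.cursor()
    # Instead of: DELETE FROM Documents WHERE DocumentId = ?
    # the row is only flagged, so the file it points to is never orphaned.
    cur.execute(
        "UPDATE Documents SET IsDeleted = 1, DeletedAt = SYSUTCDATETIME() "
        "WHERE DocumentId = ?",
        document_id,
    )
    conn.commit()
```

Queries then simply filter on IsDeleted = 0, and the file itself stays on the file system, so a file backup taken after the database backup always contains every file the restored tables refer to.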
The interdependency between SQL Server and MongoDB was more complicated. For a given transaction, some data might be stored in SQL Server while overlapping data went to MongoDB (the dual-write pattern is sketched after this list). That poses two problems for recovery:
- When recovering to a point in time right before the failure, the most recent transactions might have made it into the SQL Server log but not into the MongoDB journal (or vice versa), so the two stores could come back slightly out of step.
- When recovering from a backup snapshot, two restores are actually required: one from the SQL Server backup and one from the MongoDB backup. The snapshots run independently and might be a couple of minutes apart.
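To make the failure mode concrete, here is a hedged sketch of the dual-write pattern (all database, table, and field names are made up). Nothing ties the two writes together, which is exactly why the stores can disagree after a crash, or after restores to slightly different points in time:

```python
import pyodbc
from pymongo import MongoClient

def save_order(sql_conn: pyodbc.Connection, mongo: MongoClient, order: dict) -> None:
    # Write the relational part of the transaction to SQL Server.
    cur = sql_conn.cursor()
    cur.execute(
        "INSERT INTO Orders (OrderId, Total) VALUES (?, ?)",
        order["id"], order["total"],
    )
    sql_conn.commit()  # durable in the SQL Server transaction log from here on
    # A crash at this point leaves the order in SQL Server but not in MongoDB;
    # restoring the two stores to slightly different times has the same effect.
    mongo.appdb.orders.insert_one(order)  # becomes durable in the Mongo journal
```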
MongoDB has good durability in its recommended configuration, i.e. with both journaling and replication enabled. In the initial configuration journaling was on, but replication was not. Without a replica set the backups suffered from a rolling effect: if data changes over the course of a backup, the resulting snapshot may end up in an inconsistent state that doesn't correspond to how the data looked in MongoDB at any single moment. So the first priority was enabling replication. It added failover capability, significantly reducing the chances of a disaster, and it let the backups use the --oplog option for precise backup timing (see the sketch below).
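Roughly, the backup step then looks like the call below (the replica set name, host, and output path are placeholders). With --oplog, mongodump also captures the oplog entries written while the dump is running, so mongorestore --oplogReplay can roll the data forward to one well-defined moment: the time the dump finished.

```python
import subprocess

# Nightly MongoDB backup, sketched under the assumption of a replica set
# named rs0; the host and output directory are placeholders.
subprocess.run(
    [
        "mongodump",
        "--host", "rs0/mongo1.example.com:27017",
        "--oplog",  # record concurrent writes for a point-in-time restore
        "--out", "/backups/mongo/nightly",
    ],
    check=True,
)
```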
One more trick was used on the SQL Server side. Its recovery procedures are more sophisticated: the database can be restored to any point in time. So whenever the two backups carried slightly different times, the SQL Server restore was tuned to match the MongoDB timing, as in the sketch below.
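A hedged sketch of that point-in-time restore, again via pyodbc. The database name, backup file paths, and the timestamp are placeholders; the STOPAT value would be the moment the MongoDB dump completed.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,  # RESTORE cannot run inside a user transaction
)
cur = conn.cursor()
# Restore the last full backup without recovery, then replay the transaction
# log only up to the MongoDB snapshot time.
cur.execute("RESTORE DATABASE AppDb FROM DISK = 'D:\\Backup\\AppDb.bak' WITH NORECOVERY")
while cur.nextset():
    pass  # drain informational result sets before issuing the next statement
cur.execute(
    "RESTORE LOG AppDb FROM DISK = 'D:\\Backup\\AppDb.trn' "
    "WITH STOPAT = '2024-01-15T01:30:00', RECOVERY"
)
```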