Sunday, November 18, 2012

Definition of Big Data -- 1 TB

Amazon Relational Database Service allows databases up to 1 TB.
So there is a clear definition of Big Data according to Amazon -- any database bigger than 1 TB.

Thursday, August 9, 2012

Recovery Strategy for Polyglot Persistence


Typically, all data for a given application is stored in a single DBMS. The usual data maintenance tasks then include backups on a predefined schedule and periodic testing of disaster recovery procedures.  A potential loss of data can be dealt with: the database is either recovered to a point in time right before the failure or, failing that, restored from the most recent successful backup copy.  With these measures it is always possible to tell exactly what period the data was lost for and ask users to redo their work.

Data administration gets more complicated with a heterogeneous persistence layer.  With multiple data stores there is an additional concern: data inconsistency across the stores.

Our application utilized three different data stores:
  •  SQL Server tables
  •  MongoDB collections
  •  Uploaded files (pdf and such) on the file system
Data was interconnected pair-wise: SQL Server and files, SQL Server and MongoDB.

The dependency between SQL Server and the files was simple and easy to handle. The uploaded files were referenced from the SQL Server tables by path and name, i.e. the tables held pointers to the data files.  Since files rarely need to be deleted, hard deletes were eliminated and replaced with soft deletes.  As a result, as long as the file backup was taken any time later than the SQL Server backup, the recovered data would be consistent.
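Here is a minimal sketch of the soft-delete approach in Python with pyodbc; the table and column names (dbo.UploadedFiles, IsDeleted, DeletedAt, FileId) are hypothetical stand-ins for the real schema:

    import pyodbc  # assumes the Microsoft ODBC Driver for SQL Server is installed

    def soft_delete_file(conn, file_id):
        """Mark an uploaded file as deleted instead of removing the row and
        the file on disk; the file stays available for consistent recovery."""
        cur = conn.cursor()
        cur.execute(
            "UPDATE dbo.UploadedFiles "                        # hypothetical table
            "SET IsDeleted = 1, DeletedAt = SYSUTCDATETIME() "
            "WHERE FileId = ?",
            file_id,
        )
        conn.commit()
        cur.close()

A periodic job can later purge rows and files that have been soft-deleted for longer than the backup retention window.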

The interdependency between SQL Server and MongoDB was more complicated.  For a given transaction, some data might get stored in SQL Server and overlapping data might go to MongoDB.  This poses problems for recovery:
  • When recovering to a point in time right before the failure, the most recent transactions might have made it into the SQL Server log but not yet into the MongoDB journal (or vice versa), so the two recovered stores might be slightly out of sync.
  • When recovering from a backup, two backups actually have to be restored: the SQL Server backup and the MongoDB backup.   The snapshots are taken independently and might be a couple of minutes apart.

MongoDB has good durability in the recommended configuration, i.e. with journaling and replication. In the initial configuration journaling was enabled, but replication was not. Without a replica set, the backups suffered from a rolling-snapshot problem: if data changes over the course of a backup, the resulting snapshot may be in an inconsistent state that does not correspond to how the data looked in MongoDB at any single moment.   So the first priority was enabling replication. It added failover capability, significantly reducing the chances of a disaster.  The backups were also run with the --oplog option for precise backup timing.
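A minimal sketch of such a backup job, as a Python wrapper around the standard mongodump tool; the host name and backup directory are hypothetical:

    import subprocess
    from datetime import datetime, timezone

    def backup_mongodb(host="mongo-secondary:27017", out_dir="/backups/mongo"):
        """Run mongodump with --oplog so the dump also captures oplog entries
        written while it runs and can be restored to one consistent moment."""
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        target = f"{out_dir}/{stamp}"
        subprocess.run(
            ["mongodump", "--host", host, "--oplog", "--out", target],
            check=True,
        )
        return target  # restore later with: mongorestore --oplogReplay <target>

Pointing the dump at a secondary keeps the load off the primary, and --oplog requires the source to be a replica set member, which is exactly the configuration described above.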

Yet another trick was used on the SQL Server side.  Its recovery procedures are more sophisticated, and the database can be restored to any point in time.  So whenever the two backups carried slightly different times, the SQL Server restore was tuned to match the MongoDB timing.
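A rough sketch of what that can look like, again in Python with pyodbc; the database name, backup file paths, and connection string are hypothetical, and the T-SQL STOPAT clause is what pins the restore to the MongoDB snapshot time:

    import pyodbc

    def restore_to_point_in_time(conn_str, stop_at):
        """Restore the full backup, then roll the transaction log forward only
        up to stop_at so SQL Server matches the MongoDB snapshot time."""
        conn = pyodbc.connect(conn_str, autocommit=True)  # RESTORE cannot run inside a transaction
        cur = conn.cursor()
        cur.execute("RESTORE DATABASE AppDb FROM DISK = N'D:\\backup\\AppDb_full.bak' "
                    "WITH NORECOVERY, REPLACE")
        while cur.nextset():   # drain progress messages so the restore completes
            pass
        cur.execute("RESTORE LOG AppDb FROM DISK = N'D:\\backup\\AppDb_log.trn' "
                    f"WITH STOPAT = '{stop_at}', RECOVERY")
        while cur.nextset():
            pass
        conn.close()

    # example: restore_to_point_in_time(conn_str, "2012-08-09 02:15:00")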

Wednesday, August 8, 2012

Two Roads to NoSQL: Why and Why Not

There are two common threads on the path to a NoSQL database: the Why and the Why Not. Talks and presentations define objective drivers for the new technology and painstakingly delineate the reasons Why it is preferable in particular situations. The discussion boards, on the other hand, are full of advice on where the new technology can be tried out, striking down the reasons Why Not to use it.

The Why reasons include:
  • Big Data (scalability)
  • Cloud grid (commodity hardware, high availability)
  • Concurrency
  • Connectedness (social media)
  • Diversity (semi-structured data, Web 2.0 decentralization of content generation)
  • Low impedance mismatch between the programming model and the data model
From my experience:

- A while ago I was involved in an application for analyzing network traffic. To keep the software price down, an open source stack with MySQL was used.  The application had great-looking screens and a well-thought-out navigation flow, but it failed miserably. The database could not even keep up with writing transactions, to say nothing of retrieving information. That would have been a good candidate for NoSQL.
- A company moving its software from a shrink-wrapped version to a cloud platform must look at highly scalable NoSQL solutions.
The Why Not looks from a different perspective. Each new type of application brings its own wave of database innovation, some of which survives and prospers. That was true for data warehousing with its multidimensional cubes and data warehouse appliances. Now it is the Web 2.0 turn with NoSQL.  You need to build up expertise in the new technology. You need to acquire confidence in the technology and in your ability to execute. And that is easier to accomplish on a non-critical portion of an application: execution logs, configuration and metadata, redundant data, etc.

Tuesday, August 7, 2012

Multi-tenant database requirements

The following is a partial wish list for a multi-tenant DBMS:
  • Distribute data across multiple nodes for scalability
  • Restrict user access by account/tenant (a sketch of an application-level approach follows the list)
  • Backup multiple accounts/tenants at once
  • Restore data for a single account/tenant
  • Allow multiple schema versions
  • Maintain multiple replicas for durability
  • Meter and limit activity by account/tenant. Note: activity should be measured in CPU cycles, not in clock time (a query clock-time limit, as in Salesforce, depends on the overall database load and adds unnecessary limitations)
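As an illustration of restricting access by tenant, here is a minimal application-level sketch using pymongo; the collection and the tenant_id field are hypothetical, and a real multi-tenant DBMS would enforce this on the server side:

    from pymongo import MongoClient

    class TenantScopedCollection:
        """Thin wrapper that stamps and filters every operation by tenant,
        so application code can never read or write across tenants."""
        def __init__(self, collection, tenant_id):
            self._coll = collection
            self._tenant_id = tenant_id

        def find(self, query=None, **kwargs):
            query = dict(query or {})
            query["tenant_id"] = self._tenant_id    # force the tenant filter
            return self._coll.find(query, **kwargs)

        def insert_one(self, doc):
            doc = dict(doc, tenant_id=self._tenant_id)   # stamp the owner
            return self._coll.insert_one(doc)

    # usage: orders = TenantScopedCollection(MongoClient().appdb.orders, tenant_id=42)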




Monday, August 6, 2012

Schema-less or schema-LOSS


One of my applications utilized MongoDB. Right from the start, we gained a lot of benefits from the choice. Data got stored exactly the way it was needed by the application. It all fit into memory, and performance was amazing. The MongoDB replica sets in the recommended 3-node configuration provided redundancy, and assured high availability.  MongoDB would get good marks from me; it is a solid, mature product.
And it’s schema-less. The database just stores objects from the application. There is no overhead of defining and maintaining a schema. The application goes to production. Whenever a problem arises, code is changed and quickly deployed without the hassle of dealing with database administrators. And then, after three emergency fixes, comes a realization. What was an initial advantage is now a big problem. After each fix, data was created in a new format, and old data was retained “as is”. There is data in four different formats. Four different schemas coexist, without a clear way to tell them apart. It’s no longer “schema-less”; now it’s SCHEMA-LOSS.
Before NoSQL, fixes were accompanied by database management tasks: database structures were altered and old data was migrated to the new schema. Now there is a functionality gap caused by the disappearance of the schema. Unless some other tool or convention fills the gap, maintenance becomes a nightmare.
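One convention that can fill part of the gap is to stamp every document with an explicit schema version and migrate old documents lazily as they are read. A minimal pymongo sketch, with hypothetical field names and migration steps:

    from pymongo import MongoClient

    CURRENT_VERSION = 4   # four formats coexisted in the example above

    def upgrade(doc):
        """Bring a document from whatever version it carries up to the current one."""
        version = doc.get("schema_version", 1)           # the earliest documents had no marker
        if version < 2:
            doc["status"] = doc.pop("state", "new")      # hypothetical rename
        if version < 3:
            doc.setdefault("tags", [])                   # hypothetical new field
        if version < 4:
            doc["amount"] = float(doc.get("amount", 0))  # hypothetical type fix
        doc["schema_version"] = CURRENT_VERSION
        return doc

    def load_order(orders, order_id):
        doc = orders.find_one({"_id": order_id})
        if doc and doc.get("schema_version", 1) < CURRENT_VERSION:
            doc = upgrade(doc)
            orders.replace_one({"_id": doc["_id"]}, doc)  # persist the migrated form
        return doc

    # orders = MongoClient().appdb.orders

This does not bring back the relational schema, but it at least makes the formats distinguishable and gives each emergency fix a place to record its migration.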

Sunday, August 5, 2012

Polyglot Persistence


The Web posed new challenges: big data, cloud grid, multi-tenancy, high availability, social media connectivity, and others. In solving these challenges, new types of databases have been developed and have become popular. Even though they all go under the umbrella of NoSQL, each one was developed for, and is best at, solving specific problems. Martin Fowler[1] argues that the future of data storage in the enterprise is in polyglot persistence: “Polyglot persistence will occur over the enterprise as different applications use different data storage technologies.”

My only point of dispute is that the same was true before NoSQL.  The diagram below shows the two most recent waves of database innovation: data warehouses and Web x.0.
[Diagram: Waves of Database Innovation]


Applications used before, and continue to use, the most suitable data storage mechanisms. Data flows through an enterprise and takes different forms: in-memory storage for a web application, multidimensional cubes for interactive reporting, bi-temporal structures for master data management, flattened result sets for statistical analysis, hierarchical XML messages for interfaces, and so on.
With the same data taking different forms, it needs to be transformed between those forms.  Some representations lack constraints imposed by others and cannot be converted directly.  The relational schema has played the intermediary role for many years. In my opinion, it is still the best form for expressing business domains, and it will continue to be used for systems of record.


[1] Martin Fowler and Pramod Sadalage, February 8, 2012. http://martinfowler.com/articles/nosql-intro.pdf

Saturday, August 4, 2012

Links to my prior articles

I have been publishing articles in “The Data Administration Newsletter” (www.tdan.com) for a number of years. The topics covered different aspects of data modeling:

- A Project Model is a Constrained Subset of an Enterprise Model, March ’98
- Documenting Meta Data Transformations, March ’99
- Graphical Patterns for Data Models, October ’02
- Importance of Definitions, October ’03
- Metadata Improvements – A Case Study, May ’08
- Real Men Don't Read Instructions, April ’09


Tuesday, January 10, 2012

No Laughing Matter


After years of working in the same area we get entrenched in our ways: we go to the same user groups, seek out bulletin boards strictly within our area of expertise, etc. Even jokes become interspersed with professional terms, making them obscure to the uninitiated. Only a data modeler might appreciate a pun such as: “An entity with a dangling relationship walks into a bar....” There is a real danger in such entrenchment. However, it does sometimes help to extend familiar concepts to a new area.
I wrote the above lines back in 2008. Then I realized no one was laughing at data modeling jokes; everyone was busy learning new concepts. Enterprise architecture and relational databases were “so yesterday”, and NoSQL was the new king. The article was never published, as I redirected my energy as well.  Nowadays I sign up for user groups, only to switch to yet another technology before I can attend their next meeting.