“I hate big data. That is, I hate the term, mainly because it implies that size is the interesting aspect of big data. Data volume has been exploding for decades. Big data has many interesting aspects, but bigness isn’t one of them. I’d like to highlight two of the more interesting big data problems in this article.
Load Now, Transform Later
Presented with a data source, traditional data warehouse folks take their extract, transform, and load (ETL) tools, point them at the data and extract and transform it into something that can be mapped into relational tables. That works if you’re getting data from a core banking system or a billing system, but as you start getting data from a greater number of diverse sources—which is the goal of big data—those techniques break down.
If we point our traditional data warehouse team at web log data, they’ll define it as a fixed set of columns in a relational table, applying a schema. When marketing knocks the next day and says, ‘Guess what? We added a new tag on the webpage and want to track that too,’ the schema for that table has to be altered. And changing schemas is no fun.
The frontier for big data is therefore to create value from that huge data store without having to change the schema again and again. … We need to bring the world of ETL and modern BI query tools operationally closer so that data transformation happens when someone asks a question and not before. That’s one interesting challenge.
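To make the load-now, transform-later idea concrete, here is a minimal sketch of applying structure at query time instead of at load time. It assumes the web log arrives as one JSON object per line; the sample records, the field names, and the `query` helper are all hypothetical illustrations, not part of any particular ETL or BI product.

```python
import json

# Raw web-log events stored exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"ts": "10:00", "url": "/home", "user": "a"}',
    '{"ts": "10:01", "url": "/pricing", "user": "b"}',
    # A record written after marketing added a new tag; nothing stored was altered.
    '{"ts": "10:02", "url": "/home", "user": "a", "campaign": "spring"}',
]


def query(raw_events, fields, default=None):
    """Parse each raw record when the question is asked and project the requested fields.

    Records that predate a field simply yield `default` for it.
    """
    rows = []
    for line in raw_events:
        record = json.loads(line)
        rows.append({field: record.get(field, default) for field in fields})
    return rows


# Yesterday's question and today's question run against the same untouched store.
old_report = query(RAW_EVENTS, ["url", "user"])
new_report = query(RAW_EVENTS, ["url", "campaign"])
```

The trade-off in this sketch is that parsing cost is paid on every query rather than once at load time, which is part of why bringing transformation and querying closer together is described above as a challenge.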
Some Like It Hot
… As one of the professors I had at MIT said, “Hardware is the easyware; software is the hardware.” It is easy to configure hardware to suit our needs. The tricky part is the software that helps us determine that today’s hot data is tomorrow’s cold. Having an army of DBAs evaluate petabytes of data is cost-prohibitive. That’s why the next big challenge is figuring out what data goes where without human intervention. You need to know how data is really being used. You need to understand different types of data and different access patterns, and be able to predict those patterns. The question is: can you predict what you should do for tomorrow’s workload, or the next hour’s?”
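One simple way to picture automated hot/cold placement is a recency-weighted access score. The sketch below is an illustrative heuristic only, not the method the speaker has in mind: it assumes access timestamps are available per dataset (in hours), and the half-life, the threshold, the dataset names, and both function names are hypothetical choices.

```python
import math


def decayed_score(access_times, now, half_life_hours=24.0):
    """Sum the accesses, halving the weight of each for every half-life of age."""
    rate = math.log(2) / half_life_hours
    return sum(math.exp(-rate * (now - t)) for t in access_times if t <= now)


def assign_tiers(access_log, now, hot_threshold=1.0):
    """Label each dataset 'hot' or 'cold' from its recency-weighted access score."""
    return {
        name: "hot" if decayed_score(times, now) >= hot_threshold else "cold"
        for name, times in access_log.items()
    }


# Hypothetical access log: dataset name -> access times, in hours since some epoch.
access_log = {
    "orders_this_week": [90.0, 95.0, 99.0],  # touched within the last ten hours
    "orders_last_year": [1.0, 2.0],          # last touched about four days ago
}

tiers = assign_tiers(access_log, now=100.0)
```

A score like this only reacts to past accesses. Predicting the next hour's workload, as the passage asks, would need a forecasting step on top of it, which is the hard part.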