I recently saw a TV show about people who can’t throw things away. It featured an old lady who had kept every copy of the local newspaper since 1953, as well as pretty much everything else that had ever entered her house. The problem came to a head when her family visited and couldn’t open the front door due to the piles of junk.
Coincidentally, I have a new customer with a similar problem - not newspapers but data. Truckloads of the stuff. So much, in fact, that they have no idea what they have or what they have lost. There’s too much to search, too much to index, too much to back up, and who knows what would happen if they had a hardware failure. What would be affected?
One of the problems organizations face is that when they use relational database technology they end up making whatever raw data they have much bigger. To put this in context, a single gigabyte of raw data expands three to five times upon loading. Add some indexes and aggregate tables and it’s seven or eight times. Add some data marts and it’s 10 times. Add some cubes and the data gets bigger still. Factor in duplication and it’s easily possible to store 20 or 30 times your original data. So what? Storage is cheap, isn’t it?
My customer thinks they have about 200 terabytes of data, or thereabouts. They back this up every day and keep a copy of the weekly tapes and the monthly tape - that’s a petabyte of tape each month! They also have a disaster recovery site - which may be needed at this rate. They also have to pay an electricity bill for power and cooling - and their data center is in a hot city. They are not exactly fulfilling their promise to be eco-friendly.
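The petabyte figure is easy to sanity-check. Here is a rough sketch, assuming each retained tape set is a full copy of the 200 terabytes and that about four weekly sets are kept alongside one monthly set. Both are my assumptions for illustration; the daily tapes that get recycled are ignored.

```python
# Back-of-the-envelope tape volume. Assumes every retained tape set is a
# full copy of the data, which is an illustrative assumption.
DATA_TB = 200          # the customer's own estimate
WEEKLY_SETS_KEPT = 4   # assumed: roughly four weekly sets per month
MONTHLY_SETS_KEPT = 1  # one monthly set


def retained_tape_tb(data_tb, weekly_sets, monthly_sets):
    """Terabytes of tape retained per month under a full-copy scheme."""
    return data_tb * (weekly_sets + monthly_sets)


total_tb = retained_tape_tb(DATA_TB, WEEKLY_SETS_KEPT, MONTHLY_SETS_KEPT)
# Decimal units: 1 PB = 1000 TB.
print(f"{total_tb} TB retained per month = {total_tb / 1000:g} PB")
```

Five full copies of 200 terabytes comes to 1,000 terabytes, which is where the petabyte comes from.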
Regulation poses another problem - it is likely that they are out of compliance with one or more of the regulations they are governed by. Some regulators require that they keep seven years of data, yet it’s also in their interest to delete old data, and they can do neither reliably without knowing what they hold.
This customer is not sure what they’ve got but one thing is for sure: we need to help them make a plan for what to keep and what to throw away. We also need to help them find some technologies to de-duplicate their data, and to store what they do need in a way that compresses it - not makes it bigger.
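To make the idea concrete, here is a minimal sketch of de-duplication plus compression using only the Python standard library. It is not the technology we will end up choosing, just an illustration: the file names and contents are made up, and a real system would work on chunks or database pages rather than whole in-memory blobs.

```python
import hashlib
import zlib


def dedupe_and_compress(blobs):
    """Keep one zlib-compressed copy per distinct content hash.

    `blobs` maps a hypothetical name (file, table extract, ...) to its bytes.
    Returns (store, index): store maps digest -> compressed bytes, and
    index maps each name -> digest so every name can still be resolved.
    """
    store = {}
    index = {}
    for name, data in blobs.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            # First time we see this content: compress and keep it.
            store[digest] = zlib.compress(data)
        index[name] = digest
    return store, index


# Illustrative data only: two identical extracts and one different one.
blobs = {
    "sales_2019.csv": b"region,amount\n" + b"north,100\n" * 1000,
    "sales_2019_copy.csv": b"region,amount\n" + b"north,100\n" * 1000,
    "risk_2019.csv": b"desk,var\n" + b"rates,42\n" * 1000,
}
store, index = dedupe_and_compress(blobs)
raw_bytes = sum(len(data) for data in blobs.values())
stored_bytes = sum(len(packed) for packed in store.values())
print(f"{len(blobs)} items, {len(store)} unique, {raw_bytes} -> {stored_bytes} bytes")
```

The index still lets you find every original name, so nothing is lost - you just stop paying to store the same bytes twice.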
I encounter so many companies jumping on the big data bandwagon, the “let’s store everything” approach - and they claim it’s strategic. Sure, there are some systems that have tables with a lot of rows - like a general ledger, or a transaction system in a bank - but not all systems do. It’s everyone’s responsibility to make sure that they only keep what they need, and actively delete what they don’t.
What these companies really need to do is define a data strategy. They should define policies for what to keep, for how long, what to keep in-memory for instant analysis, what to keep close by and what to file away in cheap storage for emergencies. You can only find something if you know where it is - which means that you can only delete something if you know where it is. And the more copies you store the harder it becomes.
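A policy like this can be written down very simply. The sketch below assigns data to a tier by age. The tier names and the 90-day and two-year thresholds are hypothetical; the seven-year limit echoes the retention period mentioned earlier, and a real policy would come from the business and its regulators.

```python
from datetime import date

# Hypothetical thresholds, for illustration only.
HOT_DAYS = 90             # keep in memory for instant analysis
NEARLINE_DAYS = 2 * 365   # keep close by
RETENTION_DAYS = 7 * 365  # file away in cheap storage until this age


def tier_for(created, today):
    """Return the storage tier for data created on `created`, as of `today`."""
    age_days = (today - created).days
    if age_days <= HOT_DAYS:
        return "in-memory"
    if age_days <= NEARLINE_DAYS:
        return "nearline"
    if age_days <= RETENTION_DAYS:
        return "archive"
    return "delete"


today = date(2024, 2, 1)
for created in (date(2024, 1, 1), date(2023, 1, 1), date(2020, 1, 1), date(2015, 1, 1)):
    print(created, "->", tier_for(created, today))
```

Age is only one possible rule. The point is that once the policy is explicit, it can be applied to every dataset you know about.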
I’m helping my customer to build a platform to audit their data and building Qlik apps to show them what they’ve got, what they need to keep and what they can throw away. It’s an innovative use of data discovery, but an essential one. Once this project is complete we will have just a few terabytes of organized data to deal with, and we’ll be able to build some apps to show them how their business is actually performing.
In almost all of the BI and Analytics projects I’ve seen over the last 20 years, the real insights have come from analyzing data from different departments and bringing it together. The value comes from searching for subtle associations between sales and marketing, or finance and risk, and using those insights to drive decision-making and change. It’s not how much you’ve got, it’s what you do with it that matters.
Being organized is the key to success and having a data strategy is a good place to start.
Next time a storage vendor tells you that big data is the answer to the world’s problems, keep all this in mind.
Maybe small data is the new big data.