This specific post is to help promote the launch of the new IAPA website and increase focus on Analytics in Australia (and Sydney, where I am normally based). The topic of this post is something that has been at the forefornt of my mind and seems to be a central theme of many of the projects I have been working on recently. It is certaininly a current problem for many Marketing/Customer Analytics departments. So here are a few thoughts and comments on 'big data'. Apologies for typos, it is mostly written piecemeal on my iPhone during short 5 mins breaks...
So, below is a series of my most recent observations from Analytics projects I have been involved with that involved resolving, or encountered 'big data' problems:
After a week of design and preliminary work I began to conasider ways to optimise the performance of my queries and computations, and I asked about the server specifications. I assumed some big server with dozens of processors, but unfortunately what I was connecting to was a dual core 4GB desktop PC under an Analyst's desk...
So, what is the best way to account for outliers, skewed distributions, poor data sparsity, or highly likely erreonous data features? Well an approach (that i am not keen on) taken by some is to apply several variable transformations indiscriminatly to all 'raw' variables and subsequentially let a variable selection process pick the best input variables for propensity modeling etc. When combined with data which represents transposed time series (so a variable represents a value in 'month1' the next variable the same value dimension in 'month2' etc) then this can easily generate in excess of 20,000 variables (by say 10 million customers...). It is true there are variable selection methods that handle 20,000 quite well, but the metadata and processing to create those datasets is often significant and the whole process often incurs excessive costs in terms of time to delivery of results.
Additional problems that may arise when you start working with many thousands of variables is that variable naming needs to be easily understood and interpretable. The last thing a data miner wants to do is spend hours working out what those transformed and selected important variables in the propensity model actually mean and represent in the raw data.
Which leads me to my next point..
- Variable / Data Understanding
One of the core skills of a good data miner is the understanding and translate complex data in order to solve business problems.
As organisations obtain more data it is not just about more records, often the data reveals new subtle operational details and customer behaviors not previously known, or completely new sources of data (FaceBook, social chat, location based services etc). This in turn often requires extended knowledge of the business and operational systems to enable the correct data warehouse values or variable manipulations and selections to be made.
An analyst is expected to understand most parts of an organization's data at a level of detail most individuals in the organisation are not concerned with, and this is often a momental task.
As an example of 'big data' bad practice, I've encountered verbose variables names which immediately require truncation (due to IT / variable name limit reasons), others which make understand the value or meaning of the variable difficult, or naming conventions which are undocumented. For example: "number_of_broken_promises" is one of the funniest long max variable names I've seen, whilst others such as "ccxs_ytdspd_m1_pct" can be guessed when you have the business context but definitely require detailed documentation or a key.
- Diverse Skillsets