Wednesday, November 19, 2008

A simple Data Transformation example...

In my experience of customer focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format. Often the data is transformed into a ‘single row per customer’ or similar summarised format, and many columns (aka variables or fields) are created to act as inputs to predictive or clustering models. Such data transformation can also be referred to as ETL (extract, transform, load), although my work is usually SQL within the data warehouse, so it is just the ‘T’ part.

Granted, a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple. I hope the example below, which transforms transactional data into a useful customer-centric format, is generic enough to apply widely. Feedback and open discussion might broaden my habits.

Strangely, many ‘data mining’ books almost completely avoid the topic of data processing and transformation. Those that do mention it often simply refer to feature selection algorithms, or to applying a log function to rescale numeric data for use as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc), or measures of dispersion (range, variance, standard deviation, etc).

There seems to be a glaring gap! I’m specifically referring to data processing steps that are separate from the mandatory statistical requirements of the modelling algorithm. In my experience, relatively simple data processing steps can yield significantly better results than tweaking algorithm parameters. Some of these steps are likely to be industry or data specific, but I’m guessing many are widely useful. They don’t necessarily have to be statistical in nature.
So (to put my money where my mouth is) I've started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I’ve attached a small demo data file (which I created, and which is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum

I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it’s exactly the kind of thing I would find useful. This is just the start of a planned series of posts on ETL.

I’ve also added a poll for feedback on whether this is helpful, too basic, etc.

- Tim

Example data processing steps

a) Creation of additional dummy columns
Where the data has a single category column containing one of several values (in this example voice calls, sms calls, data calls, etc) we can use a CASE statement to create a new column for each category. We can use 0 or 1 as an indicator of whether the category value occurs in a specific row, or we can instead use the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value.
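In SQL this is a single SELECT with one CASE expression per category value. A minimal sketch via Python's sqlite3, using a hypothetical calls table with made-up rows (table and column names are illustrative, not from the original data file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical transactional table: one row per call, with a category column
cur.execute("CREATE TABLE calls (customer_id INTEGER, call_type TEXT, duration INTEGER)")
cur.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    (10, "voice", 120),
    (10, "sms",     0),
    (11, "data",  300),
    (11, "voice",  60),
])

# One 0/1 dummy column per category value, built with CASE
rows = cur.execute("""
    SELECT customer_id,
           CASE WHEN call_type = 'voice' THEN 1 ELSE 0 END AS is_voice,
           CASE WHEN call_type = 'sms'   THEN 1 ELSE 0 END AS is_sms,
           CASE WHEN call_type = 'data'  THEN 1 ELSE 0 END AS is_data
    FROM calls
""").fetchall()

print(rows)  # [(10, 1, 0, 0), (10, 0, 1, 0), (11, 0, 0, 1), (11, 1, 0, 0)]
```

Swapping the `1` for `duration` in each CASE gives the numeric-valued variant described above.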

For example, the before-and-after tables (a raw transactional table, the same data with a 0/1 dummy column per call type, and a variant using call durations as the column values) were shown as images in the original post.

b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns.
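A minimal sketch of steps (a) and (b) combined, again via Python's sqlite3 with a hypothetical calls table: the CASE dummies are wrapped in SUM() and grouped to one row per customer (names and rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE calls (customer_id INTEGER, call_type TEXT, duration INTEGER)")
cur.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    (10, "voice", 120),
    (10, "sms",     0),
    (10, "voice",  60),
    (11, "data",  300),
])

# One row per customer: dummy columns summed into per-type call counts,
# plus total voice duration as a numeric summary
summary = cur.execute("""
    SELECT customer_id,
           SUM(CASE WHEN call_type = 'voice' THEN 1 ELSE 0 END) AS voice_calls,
           SUM(CASE WHEN call_type = 'sms'   THEN 1 ELSE 0 END) AS sms_calls,
           SUM(CASE WHEN call_type = 'data'  THEN 1 ELSE 0 END) AS data_calls,
           SUM(CASE WHEN call_type = 'voice' THEN duration ELSE 0 END) AS voice_secs
    FROM calls
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

print(summary)  # [(10, 2, 1, 0, 180), (11, 0, 0, 1, 0)]
```

Replacing SUM with AVG (or adding both) gives averages per customer instead of totals.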
So we could change the previous step to something like this (the aggregated one-row-per-customer table was shown as an image in the original post).
Adam Lange said...

My two cents... I find it's often useful to look at the data that's not given. So, in the case of your call log data, there are some days in which a customer does not make a phone call or send a text message.

Take customer "10", for example. There is a 4 day gap between his first call on October 10th and his second call on October 15th. Then there is a 5 day gap between his 2nd call and his 3rd call on October 21st.

On the other hand, customer "11" uses his phone multiple times per day and also on consecutive days.

By "filling in the gaps" in ordinal log data, you're able to tease out hidden information. You're also able to perform true time series analysis, which is extremely powerful in both segmentation and classification.

Pedro said...

I agree with you both! :-)
Very good post and comment!

Shane said...

I think there are a lot of ways you can transform data, especially transaction data. There is a technique called "data spiders". I have not used it, but I think the essence of it is the following: generate all possible ways to chop up your transaction data, then use some sort of genetic algorithm feature selection on the result.
eg. generate data for [time window] x [feature] x [level]
where: time could be 1 hour, 2hrs, 1week, 1year etc
feature could be number of calls/call volume/total spend/etc
level could be customer/hhold/etc
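Generating that grid of candidate features is a one-liner; a minimal Python sketch (the window, feature, and level names below are made up for illustration):

```python
from itertools import product

# Hypothetical grid of candidate columns: [time window] x [feature] x [level]
windows  = ["1hr", "1day", "1week", "1month"]
features = ["call_count", "call_volume", "total_spend"]
levels   = ["customer", "household"]

# One candidate column name per combination, e.g. "call_count_1hr_customer"
candidates = ["{}_{}_{}".format(f, w, l) for w, f, l in product(windows, features, levels)]

print(len(candidates))  # 24 candidate columns to feed into feature selection
```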

Will Dwinnell said...

My first thought regarding this post is: Bravo! I agree completely that data preparation and transformation is a very important step in data mining. Some problems can be made much easier through this process.

Optical character recognition provides a good example: a 10x10 bitmap of a single character to be classified, for instance, contains 100 distinct variables, most of which, individually, are not terribly informative. Simple transformations such as profile projections can reduce the number of candidate predictors (10 horizontal + 10 vertical = 20: an 80% reduction!) and enhance their information content at the same time.
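Those profile projections are just row and column sums; a minimal Python sketch on a toy bitmap (the "character" here is a crude made-up vertical stroke):

```python
# Toy 10x10 bitmap of a character (1 = ink); here a vertical stroke in column 4
bitmap = [[1 if col == 4 else 0 for col in range(10)] for row in range(10)]

# Profile projections: sum each row (horizontal) and each column (vertical)
horizontal = [sum(row) for row in bitmap]
vertical   = [sum(col) for col in zip(*bitmap)]

features = horizontal + vertical
print(len(features))  # 20 features instead of 100 raw pixels
```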

I also agree that shockingly little seems to be written on this subject. For my part, I very much like reading about things like this, though I suspect that many data mining authors may fear that their own "bag of tricks" is too problem- or industry-specific to bother sharing, which is a shame.

I also agree that the number of possible transformations is almost limitless. Be creative!

In your example, I like your Summarization idea. I would go further, and consider counting the individual types of calls by person, and possibly the mean call score by person.

Tim Manns said...

Thanks for your feedback guys.

I've some responses;

- Adam;
I completely agree! I do try to do this with account and payment level data, but for call detail records there is just too much data (billions of rows). I was putting it out there for feedback, and yours was very insightful. It's an inspired idea I hadn't considered, but further investigation made me conclude it wasn't a viable option for technical warehouse reasons (I'd bring down the warehouse :).

I'm looking into using data summarised to a daily level though (so only 30 rows per customer per month). This should be possible and quick.

- Pedro;

- Shane;
I'll look into this. Reckon it'll take me a while to figure out :)
btw - Your previous pagerank suggestion for social networks within our cell-phone customer base is very powerful. I owe you a few beers for that idea!
I modified the concept by creating several variations based upon customer spend, churn, and call count.

- Will;
Thanks for the feedback. As long as I'm not wasting my time :)
I'm thinking of creating a fairly comprehensive example using a simple Access database (that'll help avoid vendor-specific SQL differences). I'll be adding more to this topic later, part 2 etc.

Thanks all!


Adam Lange said...

To follow up on my comment, I'll share my most common method for extracting full time series data from ordinal logs.

Let's say you have a table called Customers that contains all customer id's. And then let's say you have a table called Days which contains a separate record for each day of the month.

"Customers" table:

"Days" table:

And then let's say that your blog6.txt file was a table called "October_2008_CDR". You can write a query like this to extract a full time series of daily call durations.



SELECT Customers.customer_id,
       Days.day,
       (SELECT SUM(duration)
        FROM October_2008_CDR
        WHERE October_2008_CDR.customer_id = Customers.customer_id
          AND Days.day = DAY(October_2008_CDR.call_date)
          AND October_2008_CDR.type = 'voice') AS Total_Calling_Time
FROM Customers CROSS JOIN Days
ORDER BY Customers.customer_id, Days.day

This would give you a separate month-long time series for each customer. It's likely that each series will display some level of periodicity (or at least that's usually the case in my industry). You can then use autocorrelation and partial autocorrelation functions to examine the frequency of those intervals.
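A sample autocorrelation function is only a few lines of Python; a minimal sketch on a made-up daily-duration series with a period-7 (weekly) pattern:

```python
def acf(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Made-up daily call durations: 4 weeks with a strong weekly pattern
series = [5, 1, 1, 1, 1, 1, 1] * 4

print(acf(series, 7))  # 0.75: strong positive autocorrelation at lag 7
print(acf(series, 1))  # negative: adjacent days are dissimilar in this series
```

A spike at lag 7 like this is exactly the weekly periodicity described above.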

Shane Butler said...

Hi Tim,

Basically what I was trying to say in the comment above was that there are a lot of possible ways to transform transaction data ... feature selection can be useful here!

Glad to hear your social network ideas are panning out! I would be very interested to hear more about your experiences with this!


Tim Manns said...

Hi Adam,

Thanks for your help. Great info and thanks for sharing! Here's one back at ya;

If the analysis simply needs to compare each row with the previous one (for example, the number of days or the time between a row and the subsequent row for the same customer), then I instead create a table with a rank index for each customer. I then read the same table 'twice' with an inner join where the customer_id matches and the rank index is one less (so I match the rank index to the rank index minus 1).

In Teradata this runs really fast and is an option if you only need a previous-row comparison.
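A sketch of this rank-and-self-join pattern, using Python's sqlite3 rather than Teradata (the table, column names, and dates are hypothetical; assumes a SQLite build with window-function support for ROW_NUMBER, i.e. SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical call log: one row per call
cur.execute("CREATE TABLE calls (customer_id INTEGER, call_date TEXT)")
cur.executemany("INSERT INTO calls VALUES (?, ?)", [
    (10, "2008-10-10"), (10, "2008-10-15"), (10, "2008-10-21"),
    (11, "2008-10-10"), (11, "2008-10-11"),
])

# Step 1: build a per-customer rank index ordered by date
cur.execute("""
    CREATE TABLE ranked AS
    SELECT customer_id, call_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY call_date) AS rnk
    FROM calls
""")

# Step 2: read the table 'twice', joining each row to the row with rank - 1
gaps = cur.execute("""
    SELECT curr.customer_id,
           CAST(julianday(curr.call_date) - julianday(prev.call_date) AS INTEGER) AS gap_days
    FROM ranked AS curr
    INNER JOIN ranked AS prev
      ON curr.customer_id = prev.customer_id
     AND prev.rnk = curr.rnk - 1
    ORDER BY curr.customer_id, curr.rnk
""").fetchall()

print(gaps)  # [(10, 5), (10, 6), (11, 1)]
```

Each output row is the number of days between a call and the customer's previous call, computed without the full cross-join against a Days table.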

- Tim

Adam Lange said...


Good point. In cases where only the previous row matters, then your approach is probably ideal. My approach is most applicable in cases where you are interested in the full time series.


Brielle Franklin said...

I actually read the entire article and all the comments. Thanks so much for sharing this helpful information. I have been looking up information on data transformation and I have not found anything that explained it quite like this. Thanks so much for this post it was very helpful.
