Sunday, November 30, 2008

Smart Data Collective

I was recently asked to join the Smart Data Collective, a social community of bloggers and professionals interested in data warehousing and enterprise analytics. It is sponsored by Teradata and is editorially independent. Some of my posts will appear there, and perhaps some articles written specifically for the site.

They asked me to be a founding member and featured blogger, which sounds like a lot of work, but I've been reassured it isn't :)

On a related but separate note, I recently recorded a podcast for Teradata about the data analytics we do at Optus (legal disclaimer: I do not represent Optus in any way in this personal blog :). I discussed previously presented material on our churn prevention analysis, as well as my recent social network analysis. The podcast was only completed recently and still needs approval through the appropriate legal channels, but hopefully it will find its way onto the Teradata site in the New Year. I'll keep you posted.

- Tim

Monday, November 24, 2008

Movember Madness

Through executive meetings and BBQs with friends I have worn a 'Mo' as a symbol of my support for Movember.

That's right, we're talking about men's love bumps and feeling down in the dumps. To raise awareness of prostate cancer and depression, I have grown a truly dodgy moustache.

I have found it a challenge and am looking forward to shaving the damn thing off...

With only a few days to go I will soon be posting 'before' and 'after' pictures. There is still time to donate your hard-earned cash if you wish.



- Edit: ok, here's the dodgy picture of my moustache. I raised $125 for my hardship :)

Wednesday, November 19, 2008

A simple Data Transformation example...

In my experience of customer-focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format. Often the data is transformed to a 'single row per customer' or similarly summarised format, and many columns (aka variables or fields) are created to act as inputs into predictive or clustering models. Such data transformation is also referred to as ETL (extract, transform, load), although my work is usually done as SQL within the data warehouse, so it is just the 'T' bit.

Granted, a lot of the ETL you perform will be data- and industry-specific, so I've tried to keep things very simple. I hope the example below, which transforms transactional data into a useful customer-centric format, is generic enough to be widely applicable. Feedback and open discussion might also broaden my own habits.

Strangely, many 'data mining' books almost completely avoid the topic of data processing and data transformation. Those that do mention it often simply refer to feature selection algorithms, or to applying a log function to rescale numeric data for use as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc.), or measures of dispersion (range, variance, standard deviation, etc.).

There seems to be a glaring gap! I'm specifically referring to data processing steps that are separate from the mandatory or statistical requirements of the modelling algorithm. In my experience, relatively simple data processing steps can yield significantly better results than tweaking algorithm parameters. Some of these steps are likely to be industry- or data-specific, but I'm guessing many are widely useful. They don't necessarily have to be statistical in nature.
So (to put my money where my mouth is) I've started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I've attached a small demo data file (which I created, and which is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum

I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it’s exactly what I would find useful. This is just the start of a series of ETL examples you could perform.

I’ve also added a poll for feedback on whether this is helpful, too basic, etc.

- Tim

Example data processing steps

a) Creation of additional dummy columns
Where the data has a single category column that contains one of several values (in this example voice calls, sms calls, data calls, etc.) we can use a CASE statement to create a new column for each category. We can use 0 or 1 as an indicator of whether the category value occurs in any specific row, or we can use the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value.
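As a concrete sketch of this CASE pattern (the table and column names here are my own invention, not taken from the forum files), here is some fictitious call data. I'm running the SQL through Python's built-in sqlite3 so it's executable as-is, but the same query works in most data warehouses:

```python
import sqlite3

# Hypothetical transactional data: one row per call, with a category
# column call_type ('voice', 'sms', 'data') and a numeric duration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (customer_id TEXT, call_type TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("A", "voice", 120), ("A", "sms", 1), ("B", "data", 300), ("B", "voice", 60)],
)

# One CASE expression per category value: 0/1 indicator columns,
# or the numeric duration in place of the 1 where that is more useful.
rows = conn.execute("""
    SELECT customer_id,
           CASE WHEN call_type = 'voice' THEN 1 ELSE 0 END AS voice_flag,
           CASE WHEN call_type = 'sms'   THEN 1 ELSE 0 END AS sms_flag,
           CASE WHEN call_type = 'data'  THEN 1 ELSE 0 END AS data_flag,
           CASE WHEN call_type = 'voice' THEN duration ELSE 0 END AS voice_duration
    FROM calls
""").fetchall()
for r in rows:
    print(r)
```

Each input row keeps its identity; the single category column has simply been spread across one indicator column per category value.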

For example:

Can be changed to:

Or even:

b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is), and sum or average the dummy and/or raw columns.
So we could change the previous step to something like this:

Thursday, November 13, 2008

People are Age-ist!

- Just an interesting customer insight that made me laugh the other day....

As a small part of further social network analysis of a mobile (cell-phone) customer base, I have examined age differences between customers and the people they communicate with most frequently.

I was also looking at how reliably we might guess someone's age (a customer or non-customer) by extrapolating from known individuals. We have a customer's age approx. 97% of the time, and it's accurate approx. 92% of the time (unusually large numbers of people claim to have been born on 1st Jan 2000 :)

I was surprised to see (but then maybe I'm naive :) that so many people have 'mobile calling relationships' mainly with people of the same age...

The chart below shows the percentage of customers by the average age difference between them and the people they communicate with most frequently. Age differences over 4 years are comparatively rare...

The small spike at 30 years difference is probably parent-to-child communication.

I will be using this to support an estimate of age for prospects and customers whose age is unknown but whose social group members' ages are known.
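One simple way such an estimate could work (a sketch of my own; the function name and method are my assumptions, not the actual approach used) is to take the median of the known ages in a person's calling circle. Since most frequent contacts tend to be within ~4 years of a person's age, the median gives a plausible point estimate while resisting outliers like the parent-to-child spike:

```python
from statistics import median

def estimate_age(contact_ages):
    """Estimate an unknown person's age from the known ages of the
    people they call most frequently; None marks an unknown contact."""
    known = [a for a in contact_ages if a is not None]
    if not known:
        return None  # nothing to extrapolate from
    return median(known)

# A 28-ish person's calling circle: three peers, one unknown, one
# likely parent. The median shrugs off the 30-year outlier.
print(estimate_age([27, 29, 31, None, 58]))  # → 30.0
```

A weighted version (by call frequency) would be a natural refinement, but even this crude form shows how same-age calling patterns make the extrapolation workable.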

- Tim