Monday, October 20, 2008

Distribution of a prediction score

Ok, what are people's views on this?

I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.

- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDRs).

We often use neural networks to create a decimal score between zero and one (0.0 – 1.0), where zero means no churn and 1.0 means the highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.
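To make the ranking step concrete, here is a minimal sketch of 'sort descending and take the top slice'. The customer ids, the scores and the function name `top_fraction` are made up for illustration; the real process runs against an output table, not a Python dict.

```python
import math

def top_fraction(scores, fraction=0.05):
    """Return the ids of the highest-scoring `fraction` of customers.

    `scores` maps a (hypothetical) customer id to a churn score in [0.0, 1.0].
    """
    # Number of customers to mail, rounded up so the list is never empty.
    n = math.ceil(len(scores) * fraction)
    # Highest churn score first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:n]]

# Made-up scores for eight customers.
scores = {
    "c001": 0.03, "c002": 0.91, "c003": 0.12, "c004": 0.08,
    "c005": 0.47, "c006": 0.02, "c007": 0.15, "c008": 0.05,
}

# A 25% mailing size on eight customers selects the top two.
mailing_list = top_fraction(scores, fraction=0.25)
print(mailing_list)
```

Note that only the order of the scores matters for this step, not their absolute values.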

- problem -
We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good fare-paying customers). So in the actual outcome data, 98% of customers have a value of 0.0 and 2% have a value of 1.0.

When I build my predictive model I try to ensure the model mimics this distribution. My view is that most churn prediction scores, say 95% of all scored customers, should be skewed toward 0.1 or 0.2, with scores from 0.3 to 1.0 applying to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.
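For illustration, one way such a re-scaling could be done is a simple rank transform. This is my assumption about the general idea, not necessarily what my colleagues actually run, and the name `rescale_to_uniform` and the scores are made up.

```python
def rescale_to_uniform(scores):
    """Replace each score with its rank divided by the number of customers.

    The result is spread evenly over (0, 1]. Tied scores are separated
    arbitrarily by sort order in this sketch.
    """
    n = len(scores)
    # Indices of the customers, lowest raw score first.
    order = sorted(range(n), key=lambda i: scores[i])
    rescaled = [0.0] * n
    for rank, i in enumerate(order, start=1):
        rescaled[i] = rank / n
    return rescaled

# Made-up raw scores, skewed toward zero like a typical churn model output.
raw = [0.02, 0.10, 0.05, 0.90, 0.15]
uniform = rescale_to_uniform(raw)
print(uniform)

# The ranking of customers is the same before and after re-scaling.
raw_order = sorted(range(len(raw)), key=lambda i: raw[i])
uniform_order = sorted(range(len(uniform)), key=lambda i: uniform[i])
print(raw_order == uniform_order)
```

Because a transform like this keeps the ordering, the top 5% mailing list is the same either way. What changes is the shape of the score distribution, which is the thing I look at as a sanity check.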

I often examine the distribution as a sanity check before deployment. If the distribution is as expected it is something like this;

If it looks screwy, maybe something like this;

- then there may be a problem with the data processing, or the behaviour of customers has changed sufficiently over time to require a model refresh. The subsequent actual model performance is often not as good in this case.
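I do this check by eye, but it could also be put into numbers. A rough sketch, assuming scores in the range 0.0 to 1.0, is below. It bins the scores into ten equal-width bands and compares this month's shares against the shares from when the model was built, using a population stability index (PSI). PSI is a technique I am swapping in for illustration, and the function names, the floor value and the data are all made up.

```python
import math

def bin_shares(scores, n_bins=10):
    """Share of customers falling in each equal-width score band."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 goes into the last band.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def psi(expected, actual, n_bins=10, floor=1e-6):
    """Population stability index between two sets of scores.

    `floor` (an assumed value) avoids log(0) when a band is empty.
    """
    total = 0.0
    for e, a in zip(bin_shares(expected, n_bins), bin_shares(actual, n_bins)):
        e = max(e, floor)
        a = max(a, floor)
        total += (a - e) * math.log(a / e)
    return total

# Made-up scores: the build month is skewed toward zero, as expected.
build_month = [0.05] * 90 + [0.15] * 5 + [0.5] * 3 + [0.9] * 2
# A 'screwy' month with far too many customers in the middle and top bands.
screwy_month = [0.05] * 40 + [0.45] * 30 + [0.85] * 30

print(psi(build_month, build_month))   # identical distributions give zero
print(psi(build_month, screwy_month))  # a large shift gives a large value
```

The larger the value, the more the distribution has moved. Where to draw the line for 'screwy' is a judgment call that this sketch does not settle.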

- question -
What are your views/preferences on this?
What steps, if any, do you take in an attempt to validate the model prior to deployment (let's assume testing, validation and prior months' performance are great)?

- Tim

Thursday, October 16, 2008

new book "Marketing Calculator"

I'm famous! :) I'm delighted to say that I’ve been referenced in a new marketing book by Guy Powell titled “Marketing Calculator”.

I met Guy in Oct 2007 whilst presenting a couple of data mining case studies at a Marketing Analytics conference in Singapore. My presentation title was ‘given’ to me by the conference organisers, but it allowed some freedom regarding the content. I discussed how we use a comprehensive data warehouse, and how having access to detailed customer data enriched with demographic data can enable you to get some impressive response rates from campaigns, and sales numbers by up-selling to existing customers, and most importantly retain and grow your customer base.

Guy and I spent some time discussing our work and he asked if he could include it as a case study in his book. I received the book yesterday afternoon and am already through the first 4 chapters (60 pages). It reads easily and is proving to be a very worthwhile book. It is a marketing book (it doesn’t contain statistical equations or examples of data mining algorithms) and I would recommend it for anyone involved in marketing. So far it has provided some well-structured and clear insights into how you can improve your marketing practices.

- Tim