Blog by Tim Manns (data mining blog)

Tim Manns

Ok, what are people's views on this?

I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.

- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s).

We often use neural networks to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.

- problem -
We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, lets say 2% (it is voluntary churn of good fare paying customers) per month. So 98% of customers have a score of 0.0 and 2% have a score of 1.0.

When I build my predictive model I try to ensure the model mimics this distribution. My view is that most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.

I often examine the distribution as a sanity check before deployment. If the distribution is as expected it is something like this;

If it looks screwy, maybe something like this;

- then there may be a problem with the data processing or the behaviour of customers has sufficently changed over time to require a model refresh. The subsequent actual model outcome preformance is often not as good in this case.

- question -
What are your views/preferences on this?
What steps, if any, do you take in an attempt to validate the model prior to deployment (lets assume testing, validation and prior months performance is great) ?

- Tim

Blog by Tim Manns (data mining blog)

Monday, October 20, 2008

Distribution of a prediction score

4 comments: