Monday, October 20, 2008

Distribution of a prediction score

Ok, what are people's views on this?

I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.

- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDRs).

We often use neural networks to create a decimal score between zero and one (0.0 – 1.0), where 0.0 means no churn and 1.0 means the highest likelihood of churn. Another department then simply sorts the output table in descending order and runs the marketing campaigns on the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.

- problem -
We have differing preferences about the distribution of our prediction score for churn. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good, paying customers). So in the actual outcomes, 98% of customers have a value of 0.0 and 2% have a value of 1.0.

When I build my predictive model I try to ensure the model mimics this distribution. My view is that most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.
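To make the rescaling concrete, here is a minimal sketch that assumes it is done by rank (the exact method my colleagues use may differ, and `rescale_to_uniform` is a made-up name). Because a rank transform keeps the ordering, the top N% of customers picked for a campaign is the same either way; only the shape of the score distribution changes.

```python
def rescale_to_uniform(scores):
    """Map raw scores to evenly spread values in (0, 1] by rank.

    Tied raw scores share the average rank of their run.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    rescaled = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied raw scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            rescaled[order[k]] = avg_rank / n
        i = j + 1
    return rescaled


# Hypothetical raw scores, skewed toward zero as described above.
raw = [0.02, 0.05, 0.03, 0.90, 0.04, 0.10, 0.01, 0.35, 0.06, 0.07]
uniform = rescale_to_uniform(raw)

# The two highest-ranked customers are the same before and after rescaling.
top_raw = sorted(range(len(raw)), key=lambda i: -raw[i])[:2]
top_uniform = sorted(range(len(raw)), key=lambda i: -uniform[i])[:2]
```

The trade-off is that the rescaled value no longer reads as a likelihood: a customer at 0.9 is simply in the top 10% of the ranking.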

I often examine the distribution as a sanity check before deployment. If the distribution is as expected it is something like this;

If it looks screwy, maybe something like this;

- then there may be a problem with the data processing, or the behaviour of customers has sufficiently changed over time to require a model refresh. The subsequent actual model performance is often not as good in this case.
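The eyeball check above can also be put into numbers. The sketch below uses a population stability index (PSI) over ten equal-width score bins. PSI is my choice for illustration here, not something we currently use, and the function names and sample scores are invented. It assumes scores lie between 0.0 and 1.0.

```python
import math


def bin_shares(scores, n_bins=10):
    """Share of scores falling in each equal-width bin of [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def psi(expected, actual, n_bins=10, floor=1e-6):
    """Population stability index between two sets of scores.

    0.0 means identical binned distributions; larger means more shift.
    Empty bins are floored to avoid log(0).
    """
    total = 0.0
    for pe, pa in zip(bin_shares(expected, n_bins), bin_shares(actual, n_bins)):
        pe, pa = max(pe, floor), max(pa, floor)
        total += (pa - pe) * math.log(pa / pe)
    return total


# Hypothetical scores: last month's expected shape versus a "screwy" month.
last_month = [0.05] * 90 + [0.15] * 5 + [0.5] * 3 + [0.9] * 2
screwy = [0.05] * 40 + [0.55] * 60

stable_psi = psi(last_month, last_month)
shifted_psi = psi(last_month, screwy)
```

What counts as "too large" is a judgment call; comparing against the values seen in previous healthy months is probably safer than a fixed cut-off.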

- question -
What are your views/preferences on this?
What steps, if any, do you take in an attempt to validate the model prior to deployment (let's assume testing, validation and prior months' performance is great)?

- Tim


zyxo said...

Obviously your first distribution is the one we always want to see, and it is the only one I accept. Otherwise there is something wrong with the model, probably some kind of overfitting.
I avoid overfitting in two ways. First, I use training data sampled from a relatively long history (one or two years), in order to obtain time-robust models. Second, I use a lot of training data (~2 million records) spread over a few handfuls of models, each model trained on a different sample (sort of bagging). The resulting score is the average of the different scores. This gives you very robust models with high quality.
Interesting blog, keep on going.
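The "average of several models, each trained on a different sample" idea from the comment above might look roughly like this. It is only a sketch: the base learner is a toy churn-rate lookup standing in for a real model, sampling with replacement is an assumption (the comment only says "a different sample"), and all names and data are hypothetical.

```python
import random
from collections import defaultdict


def fit_rate_model(rows):
    """Toy base learner: churn rate per segment, overall rate as fallback."""
    totals, churned = defaultdict(int), defaultdict(int)
    for segment, label in rows:
        totals[segment] += 1
        churned[segment] += label
    overall = sum(churned.values()) / len(rows)
    rates = {s: churned[s] / totals[s] for s in totals}
    return lambda segment: rates.get(segment, overall)


def fit_bagged(rows, n_models=5, sample_size=None, seed=0):
    """Train n_models on different random samples and average their scores."""
    rng = random.Random(seed)
    sample_size = sample_size or len(rows)
    models = []
    for _ in range(n_models):
        # Each model sees its own sample, drawn with replacement.
        sample = [rng.choice(rows) for _ in range(sample_size)]
        models.append(fit_rate_model(sample))
    return lambda segment: sum(m(segment) for m in models) / len(models)


# Hypothetical training rows: (segment, churned flag).
rows = ([("prepaid", 1)] * 30 + [("prepaid", 0)] * 70
        + [("postpaid", 1)] * 5 + [("postpaid", 0)] * 95)

score = fit_bagged(rows, n_models=7, seed=42)
prepaid_score = score("prepaid")
postpaid_score = score("postpaid")
```

Averaging tends to smooth out the quirks any single sample picks up, which is presumably where the robustness comes from.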

Jeff Zanooda said...

In my experience models often simply spit out a probability estimate, which is then converted to a score (an integer in some arbitrary range, e.g., 0-100 or 300-900).

If the score distribution is extremely skewed (we are predicting a very rare event), this can lead to some loss of precision due to quantization errors.

Another approach is to transform model output in such a way that N point increase in score doubles the odds.
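A small sketch of that transform, assuming the usual log-odds scaling score = offset + factor * ln(odds) with factor = N / ln 2. The defaults (20 points to double the odds, 600 points at even odds) and the function name are arbitrary choices for illustration.

```python
import math


def scaled_score(p, pdo=20, base_score=600, base_odds=1.0):
    """Convert a probability into a score where `pdo` points double the odds.

    p must be strictly between 0 and 1; odds are p / (1 - p).
    base_score is the score assigned at odds equal to base_odds.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds = p / (1 - p)
    return offset + factor * math.log(odds)


# Doubling the odds of a 2% churner should add exactly `pdo` points.
p = 0.02
odds = p / (1 - p)
p_doubled = 2 * odds / (1 + 2 * odds)
gap = scaled_score(p_doubled) - scaled_score(p)
```

Because the score is linear in log-odds, rare events get spread over a wide score range instead of being squeezed near zero.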

If you use multiple models, having non-overlapping score ranges might help from an implementation standpoint. When a score cut is applied using the wrong score, there are either zero passes or zero fails, so it is easier to spot.

Tim Manns said...

I also asked Michael Berry this question, which he posted on the 'Data Miners' Blog

here is the post;

Thanks everyone, much appreciated!

- Tim

Phil said...


First you check the score distributions, and if they are as you see, then quite obviously something is going on that needs investigating. As a data miner it should be simple to put your finger on the root cause.

The reason will be a change in the distributions of your predictor variables. To quickly get to the bottom of which variable is causing the problem, merge the model development data set (tag as 1) with the scoring dataset (tag as 0) and build a model (logistic regression should do) to see if there is any 'lift' between the two data sets (there will be).

Calculate the variable importance in this model and you will probably find that there is only 1 variable that is important, and that is the cause of your problems.
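As a rough illustration, here is a simpler univariate stand-in for that check, not the combined logistic regression described above. For each variable it computes how well that variable alone separates the development data from the scoring data (an AUC, where 0.5 means no separation) and ranks variables by distance from 0.5. The variable names and values are hypothetical.

```python
def auc(dev_values, scoring_values):
    """Chance that a random dev value exceeds a random scoring value.

    Ties count half. Pairwise loop is O(n * m), fine for a sketch only.
    """
    wins = 0.0
    for d in dev_values:
        for s in scoring_values:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(dev_values) * len(scoring_values))


def rank_drift(dev, scoring):
    """Rank variables by |AUC - 0.5|; 0.0 means the two sets look alike."""
    drift = {name: abs(auc(dev[name], scoring[name]) - 0.5) for name in dev}
    return sorted(drift.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical predictor columns from the two data sets.
dev = {
    "calls_per_day": [3, 5, 4, 6, 5, 4],
    "annual_fee_due": [0, 0, 0, 0, 0, 1],
}
scoring = {
    "calls_per_day": [4, 5, 3, 6, 4, 5],
    "annual_fee_due": [1, 1, 1, 1, 0, 1],
}

ranking = rank_drift(dev, scoring)
```

A univariate check like this misses shifts that only show up in combinations of variables, which is one reason to prefer a proper model on the merged data when you have the tools for it.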

I have come across this issue many times, and there is always an underlying reason why it happens. Sometimes it is a data error. I worked with credit card data where every customer's annual fee was due in January. Fees cause churn, and you always get funny model results for certain months because of this.