Sunday, September 13, 2009

I'll show you mine if you show me yours...

Analysts don't usually quote predictive model performance. Data Mining within each industry is different, and even within the telecommunications industry definitions of churn are inconsistent. This often makes reported outcomes tricky to fully understand.

I decided to post some churn model outcomes after reading a post by the enigmatic Zyxo on his (or maybe her :)) blog ;

I'd like to know if the models rate well :)

I'd love to see reports of the performance of any predictive classification models (anything like churn models) you've been working on, but I realise that is unlikely... For like-minded data miners a simple lift chart might suffice.

The availability of data will greatly influence your ability to identify and predict churn (for the purpose of this post churn is defined as when good fare paying customers voluntarily leave). In this case churn outcome incidence is approx 0.5% per month, where the total population shown in each chart is a few million.

Below are two pictures of recent churn model Lift charts I built. Both models use the previous three months call summary data and the previous month's social group analysis data to predict a churn event occurring in the subsequent month. Models are validated against real unseen historical data.

I'm assuming you know what a lift chart is. Basically, it shows the magnitude increase in the proportions of your target outcome (in this case churn) within small sub-groups of your total population. Sub-groups are rank/sorted by propensity. For example, in the first chart we obtain 10 times more churn in the top 1% of our customers we suspected of churning using the predictive model.

The first model is built for a customer base of prepaid (purchase recharge credit prior to use) mobile customers, where the main sources of data are usage and social network analysis.

The second model is postpaid (usage is subsequently billed to customer) mobile customers, where contract information and billing are additionally available. Obviously contracts commit customers for specified periods of time, so act as very 'predictive' inputs for any model.

- first churn model lift

- second churn model lift

Both charts show our model lift in blue and the best possible result in dotted red. For the first model we are obtaining a lift of approximately 6 or 7 for the top 5% population (where the best possibly outcome would be 20 (eg. (100 / 5) = 20).

The second model is significantly better, with our model able to obtain a lift of approximately 10 for the top 5% of population (half way to perfection :)

I mention lift at 5% population because this gives us the reasonable mailing size and catches a large number of subsequent churners.

Obviously I can't discuss the analysis itself in any depth. I'm just curious what the first impressions are of the lift. I think its good, but I could be delusional! And just to confirm, it is real and validated against unseen data.

- enjoy!