Monday, March 30, 2009

Tips for the KDD challenge :)

I recently heard about this year's KDD challenge. It's a telco-based challenge to build churn, cross-sell, and up-sell propensity models using the supplied train and test data.

For more info see:
http://www.kddcup-orange.com/index.php

I am not able to download the data at work (security / download limits), so I might have to try this at home. I haven't even seen the data yet. I'm hoping it's transactional CDRs (call detail records) and not in some summarised form (which it sounds like it is).

I don't have a lot of free time so I might not get around to submitting an entry, but if I do, these are some of the data preparation steps and issues I'd consider:

- handle outliers
If the data is real-world then you can pretty much guarantee that some values will be at least a thousand times bigger than anything else. A log transform might not be enough, so try a trimmed mean or frequency binning to limit the influence of outliers.
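As a rough sketch of those two ideas (the usage values, the trim proportion, and the number of bins are all made up for illustration), a trimmed mean and equal-frequency binning could look something like this in plain Python:

```python
from statistics import mean

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # how many values to drop at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

def frequency_bins(values, n_bins=4):
    """Give each value an equal-frequency bin index (0 = lowest values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical usage figures with one extreme value.
usage = [12, 15, 9, 14, 11, 13, 10, 25000]
print(trimmed_mean(usage, 0.125))  # drops one value from each end
print(frequency_bins(usage, 4))    # the extreme value just lands in the top bin
```

Note that with binning the extreme value is not deleted, it simply ends up in the highest bin with its neighbours. Tied values can fall either side of a bin boundary in this simple version, so treat it as a starting point only.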

- missing values
The KDD guide suggests that missing or undetermined values were converted into zero. Consider changing this. Many algorithms will treat a zero very differently from a null. You might get better results by treating these zeros as nulls.
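A minimal sketch of that conversion, assuming the rows are plain dictionaries (the column names here are hypothetical, and which columns to convert is a judgment call, since a genuine zero is lost too):

```python
def zeros_to_missing(rows, columns):
    """Return copies of the rows with 0 replaced by None in the named columns."""
    return [
        {key: (None if key in columns and value == 0 else value)
         for key, value in row.items()}
        for row in rows
    ]

# Hypothetical records: a zero sms_count may really mean "unknown".
rows = [
    {"sms_count": 0, "voice_minutes": 3.5},
    {"sms_count": 7, "voice_minutes": 0},
]
cleaned = zeros_to_missing(rows, {"sms_count"})
print(cleaned)
```

Only the columns you name are touched, so a field where zero is a meaningful value (no voice minutes at all, say) can be left alone.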

- percentage comparisons
If a customer can make a voice or sms call, what's the percentage split between them? (e.g. 30% voice vs 70% sms calls). If there are only voice calls, then consider splitting by time of day or peak vs offpeak as percentages. The use of percentages helps remove differences of scale between high and low quantity customers. If telephony usage covers a number of days or weeks, then consider a similar metric that shows increased or decreased usage over time.
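To make the percentage idea concrete, here is a small sketch. The function names and the call counts are mine, purely for illustration:

```python
def usage_shares(voice_calls, sms_calls):
    """Return (voice share, sms share) of all calls, or (None, None) if no calls."""
    total = voice_calls + sms_calls
    if total == 0:
        return None, None
    return voice_calls / total, sms_calls / total

def usage_trend(earlier, later):
    """Relative change in usage between two periods (None if no earlier usage)."""
    if earlier == 0:
        return None
    return (later - earlier) / earlier

# A light user and a heavy user with the same 30/70 voice/sms mix.
print(usage_shares(30, 70))
print(usage_shares(300, 700))
# Usage dropping from 200 calls to 150 calls between two periods.
print(usage_trend(200, 150))
```

Both customers get the same shares despite a tenfold difference in volume, which is the point of using percentages. The trend figure is negative when usage is falling.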

- social networking analysis
If the data is raw transactional CDRs (call detail records) then give a lot of consideration to performing a basic social networking analysis. Even if all you can manage is to identify a circle of friends for each customer, this may have a big impact upon identification of high churn individuals or up-sell opportunities.
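As a very basic sketch of the circle of friends idea, assume each CDR has been reduced to a (caller, callee) pair. The rule used here (calls in both directions, plus a minimum number of calls in total) is just one possible definition of a friend, and the records are invented:

```python
from collections import Counter, defaultdict

def circles_of_friends(cdrs, min_calls=2):
    """Map each customer to the contacts they exchange calls with in both directions."""
    counts = Counter(cdrs)  # (caller, callee) -> number of calls
    circles = defaultdict(set)
    for (caller, callee), outgoing in counts.items():
        incoming = counts.get((callee, caller), 0)
        # Require calls in both directions and enough calls overall.
        if caller != callee and incoming > 0 and outgoing + incoming >= min_calls:
            circles[caller].add(callee)
            circles[callee].add(caller)
    return dict(circles)

# Hypothetical call records as (caller, callee) pairs.
cdrs = [
    ("A", "B"), ("B", "A"),              # reciprocal, so A and B are friends
    ("A", "C"), ("A", "C"),              # one-way only, so not counted
    ("C", "D"), ("D", "C"), ("D", "C"),  # reciprocal, so C and D are friends
]
print(circles_of_friends(cdrs))
```

From there you could derive inputs such as the size of each customer's circle, or how many of their friends have already churned.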

- not all churn is equal
Rank customers by usage and scale the rank to a score from zero (low rank) to 1.0 (high rank). No telco should still be treating every churn as an equal loss. It's not! The loss of a highly valuable customer (high rank) is worse than the loss of a low spend customer (low rank). Develop a model to handle this and argue your reasons for why treating all churn the same is a fool's folly. This is difficult if you have no spend information or history of usage over multiple billing cycles.
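The rank scaling itself is simple. A sketch, with made-up customers, usage figures, and churn probabilities (multiplying the two is only one naive way of weighting churn by value):

```python
def scaled_ranks(usage_by_customer):
    """Rank customers by usage and scale the rank to 0.0 (lowest) .. 1.0 (highest)."""
    ordered = sorted(usage_by_customer, key=usage_by_customer.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {customer: i / (n - 1) for i, customer in enumerate(ordered)}

# Hypothetical total usage per customer.
usage = {"a": 5, "b": 500, "c": 50, "d": 20, "e": 90}
ranks = scaled_ranks(usage)

# Hypothetical churn probabilities from some model, weighted by rank.
churn_probability = {"a": 0.8, "b": 0.2, "c": 0.5, "d": 0.5, "e": 0.4}
weighted_risk = {c: churn_probability[c] * ranks[c] for c in usage}
print(ranks)
print(weighted_risk)
```

Customer "a" is the most likely to churn but is also the lowest ranked, so its weighted risk comes out at zero, while the top-ranked customer "b" still registers even with a low churn probability.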

Hope this helps

Good luck everyone!

4 comments:

Datalligence said...

i downloaded the data sets yesterday and just confirmed with the organizers that all variables have been renamed as var1,var2...var15000. without the real variables names, this will be a pure mathematical/statistical problem.

for example, take your "percentage comparisons", you can't do that with this data. what about variable selection, variable correlations....all these will be solely decided by statistical results.

suddenly, i am no longer interested in KDD09 as the business meaning/relevance of the data is simply not there. there is a VERY STRONG possibility that lots of people will come up with a model that's very accurate but without or with very little business benefits.

Allan Engelhardt said...

I agree: very disappointing that this is a pure statistical exercise with no business content.

Tim Manns said...

bloody hell, when I read re: "var1,var2...var15000" my heart sank!

It's stuff like this that makes me wonder whether most people at conferences like KDD realise data mining is more than applying math to data in a database (or big text file).

My other concern was whether the data was transactional call detail records (CDRs) or in some summarised form (which would also ruin the fun of doing the KDD for me).

I'm disheartened :(

Tim Manns said...

I added a comment to the KDD cup forum;

http://213.56.130.10/board/viewtopic.php?id=11

it says;

"Hello KDD Cup,

I'd just like to give some feedback regarding the KDD challenge this year.

I think the challenge is a great way to motivate students and new starters to data mining, thus increasing the number of potential candidates for the many data mining roles out there. If this is your aim, I believe you have succeeded. Unfortunately I don't think the challenge will attract industry or experienced data miners.

I work for a Telco and perform data mining every day to solve exactly the same problems described in the challenge. I nearly always use call detail records, and if the data set were composed of CDRs I could demonstrate many different methods in my submission for the KDD challenge. Data mining is about taking detailed unwieldy data, transforming, and processing it into a form that makes it easier to understand and perform better. That sometimes includes building predictive models. It is widely accepted that the majority of the effort and time is required for data preparation, and any data mining challenge must involve this. It concerns me that the KDD cup omitted this fact. In my submission I would have used social network analysis, non-standard methods of data summarisation, and the creation of predictively powerful input variables.

I have had problems downloading the data, so I asked about the details of the data. It is my understanding that the data is not CDRs and has already been summarised into a specific format. Therefore the KDD challenge does not require the full scope of a data mining analyst, but is instead a statistical and predictive modelling accuracy exercise. Because the summarised data restricts the challenge to simply building a predictive model I will not be submitting an application.

Please give my comments consideration in your next challenge if you are keen to involve industry and established data miners. As it stands KDD are further distancing themselves from industry (ironic considering it's from a telco...)."