Monday, March 30, 2009

Tips for the KDD challenge :)

I recently heard about the KDD challenge this year. It's a telco-based challenge to build churn, cross-sell, and up-sell propensity models using the supplied train and test data.

For more info see;

I am not able to download the data at work (security / download limits), so I might have to try this at home. I haven't even seen the data yet. I'm hoping it's transactional CDRs (call detail records) and not in some summarised form (which it sounds like it is).

I don't have a lot of free time so I might not get around to submitting an entry, but if I do, these are some of the data preparation steps and issues I'd consider;

- handle outliers
If the data is real-world then you can guarantee that some values will be at least a thousand times bigger than anything else. A log transform might not be enough, so try a trimmed mean or frequency binning as a way to deal with outliers.
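
As a rough illustration of those two options, here is a minimal Python sketch. The function names, the 10% trim and the sample call minutes are all made up for the example; they are not from the KDD data.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of values.

    Assumes trim_fraction is below 0.5 so that some values remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def frequency_bins(values, n_bins=4):
    """Give each value an equal-frequency bin index (0 = lowest bin).

    Bins are assigned by rank, so tied values can land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical monthly call minutes; the last customer is an extreme outlier.
minutes = [12, 15, 9, 20, 18, 11, 14, 16, 13, 25000]

print(sum(minutes) / len(minutes))   # plain mean, dominated by the outlier
print(trimmed_mean(minutes, 0.1))    # mean of the middle 80% of customers
print(frequency_bins(minutes, 5))    # the outlier simply joins the top bin
```

With binning the outlier is still in the data, it just can't dominate the scale any more, which is often all you need.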

- missing values
The KDD guide suggests that missing or undetermined values were converted into zero. Consider changing this. Many algorithms will treat a zero very differently from a null. You might get better results by treating these zeros as nulls.
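
A small sketch of what that could look like in Python. The column names and rows are invented, and deciding which columns have zeros that really mean "missing" is a judgment call you would have to make from the data itself.

```python
def zeros_to_null(rows, columns):
    """Return copies of the rows with 0 replaced by None in the named columns."""
    return [
        {k: (None if k in columns and v == 0 else v) for k, v in row.items()}
        for row in rows
    ]


def mean_ignoring_null(rows, column):
    """Mean of a column, skipping None values; None if nothing is present."""
    present = [r[column] for r in rows if r[column] is not None]
    return sum(present) / len(present) if present else None


# Hypothetical customer rows. A handset age of 0 is assumed to mean "unknown",
# while an sms_count of 0 is assumed to be a genuine zero.
rows = [
    {"cust": "a", "handset_age_months": 14, "sms_count": 0},
    {"cust": "b", "handset_age_months": 0, "sms_count": 35},
    {"cust": "c", "handset_age_months": 22, "sms_count": 5},
]

cleaned = zeros_to_null(rows, {"handset_age_months"})

print(mean_ignoring_null(rows, "handset_age_months"))     # zero counted as a real value
print(mean_ignoring_null(cleaned, "handset_age_months"))  # zero treated as missing
```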

- percentage comparisons
If a customer can make voice or sms calls, what's the percentage split between them? (eg 30% voice vs 70% sms calls). If there are only voice calls, then consider splitting by time of day or peak vs offpeak as percentages. The use of percentages helps remove differences of scale between high and low quantity customers. If telephony usage covers a number of days or weeks, then consider a similar metric that shows increased or decreased usage over time.
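
To make that concrete, a minimal sketch in Python. The helper names and the numbers are hypothetical, and comparing the first half of the period with the second half is just one simple way to express a usage trend.

```python
def share(part, total):
    """Fraction of total usage made up by part; None when there is no usage."""
    return part / total if total else None


def usage_trend(period_totals):
    """Relative change from the first half of the period to the second half.

    With an odd number of periods the middle one is ignored.
    Returns None when there was no usage in the first half.
    """
    half = len(period_totals) // 2
    early = sum(period_totals[:half])
    late = sum(period_totals[len(period_totals) - half:])
    return (late - early) / early if early else None


# Two hypothetical customers with very different volumes but the same mix.
heavy = {"voice": 300, "sms": 700}
light = {"voice": 3, "sms": 7}

print(share(heavy["voice"], sum(heavy.values())))  # 0.3
print(share(light["voice"], sum(light.values())))  # 0.3

# Hypothetical weekly call counts for one customer whose usage is dropping.
print(usage_trend([40, 40, 30, 10]))               # -0.5
```

Both customers come out as 30% voice even though one makes a hundred times more calls, which is the point of using percentages.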

- social networking analysis
If the data is raw transactional CDRs (call detail records) then give a lot of consideration to performing a basic social networking analysis. Even if all you can manage is to identify a circle of friends for each customer, then this may have a big impact upon identification of high churn individuals or up-sell opportunities.
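
A very basic version of the circle-of-friends idea might look like the sketch below. Everything in it is an assumption for illustration: the CDRs are reduced to (caller, callee) pairs, a "friend" is anyone with at least two calls in either direction, and the churned set is invented.

```python
from collections import defaultdict


def circles_of_friends(cdrs, min_calls=2):
    """Map each number to the numbers it shares at least min_calls calls with."""
    counts = defaultdict(int)
    for caller, callee in cdrs:
        # Count calls per pair regardless of who dialled.
        counts[tuple(sorted((caller, callee)))] += 1

    circles = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_calls and a != b:
            circles[a].add(b)
            circles[b].add(a)
    return dict(circles)


def churned_friend_share(circles, churned):
    """For each customer, the fraction of their circle that has churned."""
    return {c: len(friends & churned) / len(friends)
            for c, friends in circles.items()}


# Hypothetical call records: (caller, callee).
cdrs = [
    ("A", "B"), ("B", "A"),
    ("A", "C"), ("C", "A"),
    ("A", "D"),              # a single call, below the threshold
    ("B", "C"), ("C", "B"),
]

circles = circles_of_friends(cdrs)
print(circles)
print(churned_friend_share(circles, churned={"B"}))
```

The share of a customer's circle that has already churned can then be fed into the churn model as one more input variable.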

- not all churn is equal
Rank customers by usage and scale the rank to a score from zero (lowest rank) to 1.0 (highest rank). No telco should still be treating every churn as an equal loss. It's not! The loss of a highly valuable customer (high rank) is worse than the loss of a low spend customer (low rank). Develop a model to handle this and argue your reasons for why treating all churn the same is a fool's folly. This is difficult if you have no spend information or history of usage over multiple billing cycles.
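
The ranking step could be sketched as below. The usage figures and churn probabilities are invented, and multiplying the churn probability by the scaled rank is only one possible way to weight churn by customer value, not a recommended formula.

```python
def scaled_rank(usage):
    """Map each customer to a rank score in [0, 1]: 0 = lowest usage, 1 = highest."""
    ordered = sorted(usage, key=usage.get)
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    return {cust: i / (len(ordered) - 1) for i, cust in enumerate(ordered)}


# Hypothetical usage (eg monthly minutes) and churn scores from some model.
usage = {"a": 50, "b": 5000, "c": 400, "d": 120, "e": 900}
churn_prob = {"a": 0.9, "b": 0.4, "c": 0.5, "d": 0.2, "e": 0.1}

rank = scaled_rank(usage)

# Value-weighted churn priority: likely churners who are also high value.
priority = {cust: churn_prob[cust] * rank[cust] for cust in usage}
by_priority = sorted(priority, key=priority.get, reverse=True)

print(rank)
print(by_priority)
```

Customer "a" has the highest churn score but the lowest usage, so it drops to the bottom of the list, while "b" moves to the top despite a more modest churn score.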

Hope this helps

Good luck everyone!

Friday, March 27, 2009

Presenting at conference Uniscon 2009

I've been asked to present at Uniscon 2009. One of the professors involved at the University of Western Sydney is a relative of an analyst I work with and requested that I present. I usually find academic conferences are snooze city, but they promised me free beer and I live in Sydney anyway, so I can get home to see the baby before the night's end. I hope I'm just one of many industry persons there and it proves to be an insightful event.

I'm not presenting work. I will be presenting from a personal perspective as an industry data miner (I've not enough time to prepare my presentation and get legal approval from work) and I'll be discussing generic topics instead of describing recent data mining projects and quoting numbers or factual business outcomes.

I suspect a large part of my attendance is to drive some enthusiasm and make the students interested in data mining and aware of what challenges you face in data mining roles.

If you are attending then feel free to say 'hi'.
For info on the conference see;

wider website

Below is the presentation title and abstract I threw together (now I just have to write it...). There is a social networking analysis (SNA) element to it (because that's what I'm focused on at the moment).

Presentation title:
Know your customers. Know who they know, and who they don't.

Presentation Abstract:
Tim's presentation will describe some of the types of marketing analysis a typical telecommunications company might do, including social network analysis (SNA, which is a hot topic right now). He will also elaborate on the technical and practical side of data mining, and what business impacts data mining may have.

More importantly the presentation will help answer questions such as;
- What skills are required for Data Mining?
- What problems are commonly faced during Data Mining projects?
- And just what is this Data Mining stuff all about anyway?

- Tim

Thursday, March 26, 2009

Closing days of the Data Mining survey

I got a quick email yesterday from Karl Rexer. There are a few days remaining to participate in his yearly data mining survey.

Survey Link:
Access Code: TM42P

If you frequently conduct data analysis on large amounts of data (ie data mining!) then I urge you to participate.

- Tim

Wednesday, March 11, 2009

And then there were Three, not!

I haven't been contributing to forums or making posts with my usual vigour because of a few recent events;

1) becoming a daddy
-> lots of fun!

2) recent announcement of a merger between the telcos Vodafone and Hutchison.
-> pain in the arse!
For info see

Australia's population is approximately 20 million, which is pretty small, and there were four players in the mobile service provider market (in probable order of market share); Telstra, Optus, Vodafone, Three.

The announcement that Vodafone and Three are merging reduces this to three players, which reshapes the landscape of Australia to closely match many other countries with mature telecommunications markets. Most such countries have a few players and, in this current economic climate, it's not surprising that there will be mergers and therefore consolidation of customers into larger groups.

As a result of the merger, the competitors (Telstra & Optus) will have to review their strategies and probably re-examine customer analysis. Lots of work for us Data Miners...

Tuesday, March 3, 2009

How many models is enough?

I recently missed a presentation by a data mining software vendor (due to my recent paternity break) but I've been reviewing my colleagues' notes and the vendor's presentation slides. I won't name the vendor; you can probably work it out.

A significant part of the vendor solution is the ability to manage many, we're talking hundreds, of data mining models (predictive, clustering etc).

In my group we do not have many data mining models, maybe a dozen, that we run on a weekly or monthly basis. Each model is quite comprehensive and will score the entire customer base (or near to it) for a specific outcome (churn, up-sell, cross-sell, acquisition, inactivity, credit risk, etc). We can subsequently select sub-populations from the customer base for targeted communications based upon the score or outcome of any single model or a combination of models, or any criteria taken from customer information.

I'm not entirely sure why you would want hundreds of models in a Telco (or similar) space. Any selection criteria applied to specific customers (say, by age, or gender, or state, or spend) before modeling will of course force a biased sample that feeds into the model and affects its inherent nature. Once this type of selective sampling is performed you can't easily track the corresponding model over time *if* the sampled sub-population ever changes (which is likely because people do get older, move house, or change spend etc). For this reason I can't understand why someone would want or have many models. It makes perfect sense in Retail (for example a model for each product, or association rules for product recommendations), but not many models that apply to sub-populations of your customer base.

Am I missing something here? If you are working with a few products or services and a large customer base why would you prefer many models over a few?

Comments please :)