Tuesday, November 24, 2009

When sharing isn't a good idea

Ensemble models seem to be all the buzz at the moment. The Netflix Prize was won by a conglomerate of various models and approaches, each of which excelled on subsets of the data.

A number of data miners have presented findings based upon simple ensembles that use the mean prediction of a number of models. I was surprised that some form of weighting isn’t commonly used, and that a simple mean average of multiple models could yield such an improvement in global predictive power. It kinda reminds me of the Gestalt theory phrase "the whole is greater than the sum of the parts". It’s got me thinking: when is it best not to share predictive power? What if one model is the best? There are also a ton of considerations regarding scalability and the trade-off between additional processing, added business value, and practicality (don’t mention random forests to me..), but we’ll pretend those don’t exist for the purpose of this discussion :)
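To make the "simple mean average" idea concrete, here is a toy sketch in Python. The two "models" and their numbers are entirely made up for illustration; in practice each component would be a trained model scoring the same record.

```python
# Toy sketch of an unweighted mean-average ensemble. Each "model" is just
# a function returning a score for an input; both are invented examples.

def model_a(x):
    return 0.8 * x          # hypothetical model A

def model_b(x):
    return 0.2 + 0.5 * x    # hypothetical model B

def ensemble_score(x, models):
    """Unweighted mean of the component model scores."""
    return sum(m(x) for m in models) / len(models)

print(ensemble_score(0.5, [model_a, model_b]))  # 0.425
```

A weighted version would just multiply each model's score by a weight before summing, which is exactly the refinement I was surprised not to see more often.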

So this has got me thinking: do ensembles work best in situations where there are clearly different sub-populations of customers? For example, Netflix is in the retail space, with many customers that rent the same popular blockbuster movies, and a moderate number of customers that rent rarer (or far more diverse, i.e. long tail) movies. I haven’t looked at the Netflix data, so I’m guessing that most customers don’t have hundreds of transactions; generalising the correct behaviour of the masses to specific customers is therefore important, because Netflix data on any specific customer could be quite scant (in terms of rentals/transactions). In other industries such as telecom there are parallels: customers can also be differentiated by the nature of their communication (voice calls, SMS, data consumption etc.), just like types of movies. Telecom is mostly about quantity though (customer x used to make a lot of calls etc.). More importantly, there is a huge amount of data about each customer, often many hundreds of transactions per customer, so there is relatively less reliance upon the supporting behaviour of the masses (although it helps a lot) to understand any specific customer.

Following this logic, I’m thinking that ensembles are great at reducing the error of incorrectly applying insights derived from the generalised masses to those weirdos that rent obscure sci-fi movies!  Combining models that each explain a sub-population very well makes sense, but what if you don’t have many sub-populations (or can identify and model their behaviour with one model)?

But you may shout "hey, what about the KDD Cup?".  Yes, the recent KDD Cup challenge (anonymous, featureless telecom data from Orange) was also won by an ensemble of over a thousand models created by IBM Research.  I'd like to have had some information about what the hundreds of columns represented; that might have helped me better understand the Orange data and build more insightful, better-performing models.  Aren't ensemble models used in this way simply a brute-force approach to over-learn the data?  I'd also really like to know how the performance of the winning entry tracks over the subsequent months for Orange.

Well, I haven’t had a lot of success using ensemble models on the telecom data I work with, and I’m hoping that is more a reflection of the data than any ineptitude on my part. I’ve tried simply building multiple models on the entire dataset and averaging the scores, but this doesn’t generate much additional improvement (granted, on already good models, and I already combine K-means and neural nets on the whole base).  In my free time I’m just starting to try splitting the entire customer base into dozens of small sub-populations, building a neural net model on each, then combining the results to see if that yields an improvement. It’ll take a while.
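The segment-then-model plan can be sketched in a few lines. Everything here is invented for illustration (the segment rule, the per-segment scoring functions); in my case each segment model would be a separately trained neural net, but the routing logic is the same.

```python
# Rough sketch: split customers into sub-populations, score each with its
# own model, and stitch the scores back together. Rules are made up.

def segment(customer):
    # hypothetical segmentation rule: heavy vs light callers
    return "heavy" if customer["calls"] > 100 else "light"

# one (made-up) scoring function per segment, standing in for per-segment models
segment_models = {
    "heavy": lambda c: min(1.0, c["calls"] / 1000),
    "light": lambda c: c["calls"] / 200,
}

def score(customer):
    # route each customer to the model for its own sub-population
    return segment_models[segment(customer)](customer)

customers = [{"id": 1, "calls": 500}, {"id": 2, "calls": 20}]
scores = {c["id"]: score(c) for c in customers}
print(scores)  # {1: 0.5, 2: 0.1}
```

The appeal is that each model only has to explain behaviour within its own sub-population, rather than one model stretching across all of them.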


Tuesday, November 3, 2009

Predictive Analytics World (PAW) was a great event

I found this year’s PAW in Washington a great success. Although I was only able to attend for one day (the day I presented), the handful of varied presentations I did see were very informative and stimulated lots of ideas for my own data mining in the telecommunications industry. PAW is an event clearly run and aimed at industry practitioners. The emphasis of the presentations was lessons learnt, implementation and business outcomes. I strongly recommend attending PAW if you get the chance.

Other bloggers have reviewed PAW and encapsulated my views perfectly. For example, see some of James Taylor’s blog entries: http://jtonedm.com/tag/predictive-analytics-world

James also provides a short overview of my presentation at PAW http://jtonedm.com/2009/10/20/know-your-customers-by-knowing-who-they-know-paw

My presentation at PAW was 35 minutes followed by 10 minutes for questions. I think I over-ran a little because I was very stretched to fit all the content in. For me the problem of data mining is a data manipulation one. I usually spend all my time building a comprehensive customer focused dataset, and usually a simple back-propagation neural network gives great results. I tried to convey that in my presentation, and as James points out I am able to do all my data analysis within a Teradata data warehouse (all my data analysis and model scoring runs as SQL) which isn't common. I'm definitely a believer that more data conquers better algorithms, although that doesn't necessarily mean more rows (girth is important too :))

Sunday, November 1, 2009

Building Neural Networks on Unbalanced Data (using Clementine)

I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World.  I think my presentations went down well (well, I got good feedback).  There were also a few questions and issues that were posed to me.  One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

Rightly so, Dean pointed out that the building of neural nets can actually work perfectly fine against unbalanced data.  The problem is that when the neural net determines a categorical outcome it must know the incidence (probability) of that outcome.  By default Clementine will simply take the output neuron value: if the value is above 0.5 the prediction will be true, else if the output neuron value is below 0.5 the category outcome will be false.   This is why in Clementine you need to balance a categorical outcome to roughly 50%/50% when you build the neural net model.  In the case of multiple categorical values it is the highest output neuron value which becomes the prediction.

But there is a simple solution!

It is something I have always done out of habit because it has proved to generate better models, and I find a decimal score more useful. Being a cautious individual (and at the time a bit jet-lagged) I wanted to double-check first, but simply by converting the categorical outcome into a numeric range you avoid this problem.

In situations where you have a binary categorical outcome (say churn yes/no, or response yes/no) then in Clementine you can use a Derive (flag) node to create alternative outcome values: simply change the true outcome to 1.0 and the false outcome to 0.0.

By changing the categorical outcome values to a decimal range between 0.0 and 1.0, the neural network model will instead expose the output neuron values, and the Clementine output score will be a decimal ranging from 0.0 to 1.0.  The distribution of this score should also closely match the incidence in the data input into the model during building.  In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score, sorted/ranked, to determine which customers to contact first.  I subsequently use the lift metric and the incidence of actual outcomes in sub-populations of predicted high-scoring customers.  I rarely try to create a categorical 'true' or 'false' outcome, so I didn't give it much thought until now.
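In Clementine this is all done through nodes, but the ranking idea is trivial to show in plain Python. The customer names and scores below are made up; the point is that no true/false cut at 0.5 is ever needed when you just contact customers in score order.

```python
# Made-up (customer, model score) pairs; the scores stand in for the
# decimal output a neural net produces with a 0.0/1.0 numeric target.
scored = [("cust_a", 0.03), ("cust_b", 0.61), ("cust_c", 0.12)]

# contact the highest-scoring customers first
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
print([cust for cust, _ in ranked])  # ['cust_b', 'cust_c', 'cust_a']
```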

If you want to create an incidence matrix that simply shows how many 'true' or 'false' outcomes the model achieves, then instead of using a neural net score of 0.5 to determine the true or false outcome, you simply use the probability of the outcome in the data used to build the model.  For example, if I *build* my neural net using data balanced as 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be roughly 0.04 (10,000 / 260,000).  If my neural network score exceeds that cut-off I predict true, otherwise I predict false.  A simple Derive node can be used to do this.
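The cut-off arithmetic in one small sketch (same counts as the example above; the `predict` helper is just an illustration of what the Derive node would compute):

```python
# Use the training incidence as the cut-off instead of 0.5.
true_count, false_count = 10_000, 250_000
cutoff = true_count / (true_count + false_count)  # ~0.0385, i.e. roughly 0.04

def predict(score, cutoff=cutoff):
    # score at or above the incidence cut-off -> predict true
    return score >= cutoff

print(round(cutoff, 4))              # 0.0385
print(predict(0.05), predict(0.01))  # True False
```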

If you have a categorical output with multiple values (say 5 products, or 7 spend bands etc.) then you can use a Set-To-Flag node in a similar way to create many new fields, each with a value of either 0.0 or 1.0.  Make *all* the new set-to-flag fields outputs and the neural network will create a decimal score for each output field.  This is essentially exposing the raw output neuron values, which you can then use in many ways similar to the above (or use all the output scores in a rough 'fuzzy' logic way as I have in the past :)
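The Set-To-Flag transformation itself is just one-hot encoding; here is the idea in plain Python, with invented category names standing in for the products or spend bands:

```python
# Turn one categorical field into several 0.0/1.0 flag fields,
# one per category (what Clementine's Set-To-Flag node produces).
categories = ["product_a", "product_b", "product_c"]

def set_to_flag(value, categories=categories):
    return {c: (1.0 if value == c else 0.0) for c in categories}

print(set_to_flag("product_b"))
# {'product_a': 0.0, 'product_b': 1.0, 'product_c': 0.0}
```

Each of these flag fields then becomes its own numeric output for the neural net, which is what exposes a separate output-neuron score per category.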
I posted a small example stream on the kdkeys Clementine forum http://www.kdkeys.net/forums/70/ShowForum.aspx

Just change the file suffix from .zip to .str and open the Clementine stream file.  It was created using version 12.0, but should work in some older versions.

I hope this makes sense.  Feel free to post a comment if elaboration is needed!

 - enjoy!