Sunday, November 1, 2009

Building Neural Networks on Unbalanced Data (using Clementine)

I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World.  I think my presentations went down well (well, I got good feedback).  There were also a few questions and issues that were posed to me.  One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

Dean rightly pointed out that building neural nets can actually work perfectly well on unbalanced data.  The problem is that when the neural net determines a categorical outcome it must know the incidence (probability) of that outcome.  By default Clementine simply takes the output neuron value: if it is above 0.5 the prediction is true, and if it is below 0.5 the prediction is false.  This is why in Clementine you need to balance a categorical outcome to roughly 50%/50% when you build the neural net model.  When the outcome has multiple categorical values, the category with the highest output neuron value becomes the prediction.
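
The default decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Clementine's actual code; the function names and the example values are made up for the sketch.

```python
def default_flag_prediction(output_value, cutoff=0.5):
    """Predict True when the single output neuron value exceeds the cutoff."""
    return output_value > cutoff


def default_set_prediction(output_values):
    """Predict the category whose output neuron has the highest value.

    output_values maps each category name to its output neuron value.
    """
    return max(output_values, key=output_values.get)


# On heavily unbalanced data most scores sit well below 0.5,
# so a fixed 0.5 cutoff labels nearly everything as False.
print(default_flag_prediction(0.12))
print(default_set_prediction({"product_a": 0.2, "product_b": 0.7, "product_c": 0.1}))
```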

But there is a simple solution!

It is something I have always done out of habit, because it has proved to generate better models and I find a decimal score more useful. Being a cautious individual (and at the time a bit jet-lagged) I wanted to double-check first, but simply converting a categorical outcome into a numeric range avoids this problem.

In situations where you have a binary categorical outcome (say, churn yes/no, or response yes/no etc) then in Clementine you can use a Derive (flag) node to create alternative outcome values.  In a Derive (flag) node simply change the true outcome to 1.0 and the false outcome to 0.0. 
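
Outside Clementine, the same recoding might look like the Python sketch below. It is a rough equivalent of what the Derive (flag) node does here, not its implementation; the field names `churn` and `churn_numeric` are hypothetical.

```python
def derive_numeric_target(records, field="churn", true_value="yes",
                          new_field="churn_numeric"):
    """Add a numeric copy of a binary categorical field: 1.0 for true, 0.0 for false."""
    derived = []
    for record in records:
        new_record = dict(record)  # copy so the input rows are left untouched
        new_record[new_field] = 1.0 if record[field] == true_value else 0.0
        derived.append(new_record)
    return derived


customers = [
    {"id": 1, "churn": "yes"},
    {"id": 2, "churn": "no"},
]
for row in derive_numeric_target(customers):
    print(row)
```

The new numeric field is then used as the model output in place of the original categorical one.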

By changing the categorical outcome values to a decimal range outcome between 0.0 and 1.0, the Neural Network model will instead expose the output neuron values and the Clementine output score will be a decimal range from 0.0 to 1.0.  The distribution of this score should also closely match the probability of the data input into the model during building.  In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score sorted / ranked to determine which customers to contact first.  I subsequently use the lift metric and the incidence of actual outcomes in sub-populations of predicted high scoring customers.  I rarely try to create a categorical 'true' or 'false' outcome, so didn't give it much thought until now.
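
As a rough illustration of ranking by score and checking lift, here is a minimal Python sketch. It assumes lift means the incidence of actual outcomes in the top-scored fraction divided by the overall incidence; the function name and the toy data are invented for the example.

```python
def lift_at(scores, actuals, top_fraction=0.1):
    """Lift of the top-scored fraction of customers versus the whole population.

    scores  : model scores between 0.0 and 1.0
    actuals : 1 for an actual 'true' outcome, 0 for 'false'
    """
    overall_rate = sum(actuals) / len(actuals)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no true outcomes")
    # Rank customers from highest to lowest score.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top_rate = sum(actual for _, actual in ranked[:n_top]) / n_top
    return top_rate / overall_rate


scores = [0.90, 0.80, 0.30, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01]
actuals = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(lift_at(scores, actuals, top_fraction=0.2))
```

A lift well above 1.0 in the top-ranked group suggests the ranking is useful for deciding who to contact first, regardless of where a true/false cut-off would fall.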

If you want to create an incidence matrix that simply shows how many 'true' or 'false' outcomes the model achieves, then instead of using a Neural Net score of 0.5 to determine the true or false outcome, you simply use the probability of the outcome in the data used to build the model.  For example, if I *build* my neural net using data containing 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be roughly 0.04 (10,000 true out of 260,000 records is about 0.038).  If my neural network score exceeds the cut-off then I predict true, else if my neural network score is below it then I predict false.  A simple derive node can be used to do this.
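
The cut-off and the incidence matrix can be sketched in Python as follows. This assumes the cut-off is the proportion of true outcomes in the build data, which for the figures above comes to about 0.038, or roughly 0.04; the function names and the four example scores are illustrative only.

```python
def incidence_cutoff(n_true, n_false):
    """Proportion of true outcomes in the data used to build the model."""
    return n_true / (n_true + n_false)


def incidence_matrix(scores, actuals, cutoff):
    """Count (actual, predicted) combinations, predicting True when score > cutoff."""
    matrix = {(actual, predicted): 0
              for actual in (True, False)
              for predicted in (True, False)}
    for score, actual in zip(scores, actuals):
        matrix[(bool(actual), score > cutoff)] += 1
    return matrix


cutoff = incidence_cutoff(n_true=10_000, n_false=250_000)
print(round(cutoff, 3))

scores = [0.30, 0.05, 0.03, 0.01]
actuals = [1, 0, 1, 0]
for (actual, predicted), count in incidence_matrix(scores, actuals, cutoff).items():
    print(f"actual={actual} predicted={predicted}: {count}")
```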

If you have a categorical output with multiple values (say, 5 products, or 7 spend bands etc) then you can use a Set-To-Flag node in a similar way to create many new fields, each with a value of either 0.0 or 1.0.  Make *all* the new set-to-flag fields outputs and the Neural Network will create a decimal score for each output field.  This essentially exposes the raw output neuron values, which you can then use in many of the ways described above (or use all the output scores in a rough 'fuzzy logic' way, as I have in the past).
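
For readers outside Clementine, the Set-To-Flag step corresponds roughly to the Python sketch below, which creates one 0.0/1.0 field per category. The field name `product` and the category values are hypothetical.

```python
def set_to_flag(records, field, categories):
    """Create one numeric 0.0/1.0 field per category of a multi-valued field."""
    flagged = []
    for record in records:
        new_record = dict(record)  # copy so the input rows are left untouched
        for category in categories:
            new_record[f"{field}_{category}"] = 1.0 if record[field] == category else 0.0
        flagged.append(new_record)
    return flagged


customers = [{"id": 1, "product": "A"}, {"id": 2, "product": "C"}]
for row in set_to_flag(customers, "product", ["A", "B", "C"]):
    print(row)
```

Each of the new fields then becomes a separate model output with its own decimal score.
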
I posted a small example stream on the kdkeys Clementine forum http://www.kdkeys.net/forums/70/ShowForum.aspx

http://www.kdkeys.net/forums/thread/9347.aspx
Just change the file suffix from .zip to .str and open the Clementine stream file.  Created using version 12.0, but should work in some older versions.
http://www.kdkeys.net/forums/9347/PostAttachment.aspx


I hope this makes sense.  Feel free to post a comment if elaboration is needed!

 - enjoy!

10 comments:

Anonymous said...

Tim,

Don't assume the output of a neural net for a binary classification problem can be interpreted as a probability - you should always calibrate.


I don't know about Clementine, but a smart way to deal with unbalanced data is for the nn learning algorithm to learn by consecutively looking at one true, then one false case, so the actual balance of true and false in the data is balanced by the algorithm itself.
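
One possible reading of that alternating scheme, sketched in Python. This is an assumption about what is meant, not a description of how Clementine or any particular library trains; the function name is invented, and both lists are assumed to be non-empty.

```python
import itertools


def alternating_cases(true_cases, false_cases, n_pairs):
    """Yield one true case, then one false case, for n_pairs rounds.

    The smaller class is simply cycled, so the learner sees a 50/50
    mix regardless of the balance in the underlying data.
    """
    trues = itertools.cycle(true_cases)
    falses = itertools.cycle(false_cases)
    for _ in range(n_pairs):
        yield next(trues)
        yield next(falses)


print(list(alternating_cases(["t1"], ["f1", "f2", "f3"], n_pairs=3)))
```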

For example, if I *build* my neural net using data balanced as 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be 0.04. If my neural network score exceeds 0.04 then I predict true, else if my neural network score is below 0.04 then I predict false.


Why 0.04? If you had 500,000 false and still only 10,000 true, then by your logic it should be 0.02, but the neural net model could be exactly the same, if it is being 'smart' as described above.


By default Clementine will simply take the output neuron values, and if the value is above 0.5 the prediction will be true, else if the output neuron value is below 0.5 the category outcome will be false. This is why in Clementine you need to balance categorical outcome to roughly 50%/50% when you build the neural net model.

I think you are misguided with this assumption. What if there were a perfect model, and the net predicted all the trues as 1 and all the falses as 0? If there were such an obvious pattern in the data then it would make no difference how unbalanced the data was, and 0.5 would be the best choice if you were a betting man.

Food for thought,

Phil

Tim Manns said...

Hi Phil,

I finally got around to testing this again, and I can confirm that (as we would expect) the model ranking doesn't change much even if the distribution of true and false reaches huge disparities.

When you build a model with a target output that is a decimal number, the predicted score will be a decimal prediction. In my case 0.0 is false and 1.0 is true, therefore my neural net output is a decimal number ranging from 0.0 to 1.0. The distribution of this score reflects the number of 0.0 and 1.0 values fed into the neural net during building.

If I use 250,000 false (0.0 values) and 10,000 true (1.0 values) as inputs to the model then the score output will be heavily skewed toward 0.0. If I rank these scores then the model is nearly the same as a similar model built on fewer false values and the same number of trues.

The problem is that Clementine always splits a decimal score at 0.5 rather than the much lower approx 0.04 we might expect in my case (approx 0.04 because 10,000 / 250,000 = 0.04). If I had 500,000 false cases then the model would still be practically the same, but the scores would be further skewed toward zero.

I know it's weird, but that is how it ‘works’ in Clementine.

Tim

Anonymous said...

Hi Tim,

The kdkeys Clementine forum has disabled new account creation.

Could you please send the example stream to me as well at bigfootk@gmail.com

Thanks !

hunter said...

In pursuing a high response rate, it's quite likely you will get some low value or negative value customers.

Do you use some way to eliminate high response, negative value candidates?

Tim Manns said...

Yes, it is fairly straightforward to subsequently filter out any customers with bad payment history or low value.

- but that's assuming customer value exists :)

I believe that is a tricky question to answer, because it partly depends upon the customer life-cycle. In my opinion all customers have negative value when they first join; the cost of advertising on TV, web banners, billboards, it all adds up!

Only after a few Prepaid recharges or monthly Postpaid bills have been paid does a customer start to have positive value.

Then you need to consider a lot of factors to determine customer value. There are costs such as:
- interaction to call centre or customer services
- early changes to price/rate plan
- lots of calls to competitor mobile networks.

Not many telcos consider all the factors when developing a customer value, and few of these factors have trivial impact.

Anonymous said...

I want to know how to interpret the neural network model in SPSS Clementine. As far as I know, you just look at its specificity, sensitivity and error rate values; is that right? I'm a bit confused.

Anonymous said...

I wrote a similar piece @ HT blog on balancing datasets before modelling.




Mila B said...

Good reading this post