Wednesday, December 16, 2009

Meetup for Sydney Data Miners (11th February 2010)

Last month I attended a gathering of Sydney-based data miners.

There are lots of parallels to IAPA (Institute of Analytics Professionals of Australia), but the audience seemed to be more hands-on analysts. Being held at Google, it had quite a few web-based analysts too.

The next meet-up is 11th February 2010. I'll be there having a chat and a few beers.

Tuesday, November 24, 2009

When sharing isn't a good idea

Ensemble models seem to be all the buzz at the moment. The Netflix prize was won by a conglomerate of various models and approaches that each excelled in subsets of the data.

A number of data miners have presented findings based upon using simple ensembles that use the mean prediction of a number of models. I was surprised that some form of weighting isn’t commonly used, and that a simple mean of multiple models could yield such an improvement in the global predictive power. It kinda reminds me of the Gestalt theory phrase "The whole is greater than the sum of the parts". It’s got me thinking: when is it best not to share predictive power? What if one model is the best? There are also a ton of considerations regarding scalability and the trade-off between additional processing, added business value, and practicality (don’t mention random forests to me..), but we’ll pretend those don’t exist for the purpose of this discussion :)
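To make the difference between a simple mean and a weighted ensemble concrete, here is a minimal sketch. The three score lists and the weights are invented for illustration, and the function names are my own rather than anything from a particular tool.

```python
def mean_ensemble(scores_per_model):
    """Average the scores of several models, customer by customer."""
    n_models = len(scores_per_model)
    return [sum(scores) / n_models for scores in zip(*scores_per_model)]


def weighted_ensemble(scores_per_model, weights):
    """Weighted average of model scores; weights need not sum to 1."""
    total_weight = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, scores)) / total_weight
        for scores in zip(*scores_per_model)
    ]


# Hypothetical scores from three models for the same three customers.
model_a = [0.9, 0.2, 0.4]
model_b = [0.7, 0.4, 0.6]
model_c = [0.8, 0.0, 0.2]

simple = mean_ensemble([model_a, model_b, model_c])
# Assumed weights: trust model_a twice as much as the other two.
weighted = weighted_ensemble([model_a, model_b, model_c], weights=[2.0, 1.0, 1.0])
```

If one model really is the best, a weighting scheme lets it dominate, whereas a simple mean dilutes it with the weaker models.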

So this has got me thinking: do ensembles work best in situations where there are clearly different sub-populations of customers? For example Netflix is in the retail space, with many customers that rent the same popular blockbuster movies, and a moderate number of customers that rent rarer (or far more diverse, i.e. long tail) movies. I haven’t looked at the Netflix data so I’m guessing that most customers don’t have hundreds of transactions, so generalising the correct behaviour of the masses to specific customers is important. Netflix data on any specific customer could be quite scant (in terms of rents/transactions). In other industries such as telecom, there are parallels; customers can also be differentiated by nature of communication (voice calls, sms calls, data consumption etc) just like types of movies. Telecom is mostly about quantity though (customer x used to make a lot of calls etc). More importantly there is a huge amount of data about each customer, often with many hundreds of transactions per customer. There is therefore relatively less reliance upon the supporting behaviour of the masses (although it helps a lot) to understand any specific customer.

Following this logic, I’m thinking that ensembles are great at reducing the error of incorrectly applying insights derived from the generalised masses to those weirdos that rent obscure sci-fi movies! Combining models that explain sub-populations very well makes sense, but what if you don’t have many sub-populations (or can identify and model their behaviour with one model)?

But you may shout "hey what about the KDD Cup". Yes, the recent KDD Cup challenge (anonymous featureless telecom data from Orange) was also won by an ensemble of over a thousand models created by IBM Research. I'd like to have had some information about what the hundreds of columns represented, as this might have helped me better understand the Orange data and build more insightful and better-performing models. Aren't ensemble models used in this way simply a brute force approach to over-learn the data? I'd also really like to know how the performance of the winning entry tracks over the subsequent months for Orange.

Well, I haven’t had a lot of success in using ensemble models in the telecom data I work with, and I’m hoping it is more a reflection of the data than any ineptitude on my part. I’ve tried simply building multiple models on the entire dataset and averaging the scores, but this doesn’t generate much additional improvement (granted on already good models, and I already combine K-means and Neural Nets on the whole base).  During my free time I’m just starting to try splitting the entire customer base into dozens of small sub-populations and building a Neural Net model on each, then combining the results and seeing if that yields an improvement. It’ll take a while.


Tuesday, November 3, 2009

Predictive Analytics World (PAW) was a great event

I found this year’s PAW in Washington a great success. Although I was only able to attend for one day (the day I presented), the handful of varied presentations I did see were very informative and stimulated lots of ideas for my own data mining in the telecommunications industry. PAW is an event clearly run and aimed at industry practitioners. The emphasis of the presentations was lessons learnt, implementation and business outcomes. I strongly recommend attending PAW if you get the chance.

Other bloggers have reviewed PAW and encapsulate my views perfectly. For example see some of James Taylor’s blog entries

James also provides a short overview of my presentation at PAW

My presentation at PAW was 35 minutes followed by 10 minutes for questions. I think I over-ran a little because I was very stretched to fit all the content in. For me the problem of data mining is a data manipulation one. I usually spend all my time building a comprehensive customer focused dataset, and usually a simple back-propagation neural network gives great results. I tried to convey that in my presentation, and as James points out I am able to do all my data analysis within a Teradata data warehouse (all my data analysis and model scoring runs as SQL) which isn't common. I'm definitely a believer that more data conquers better algorithms, although that doesn't necessarily mean more rows (girth is important too :))

Sunday, November 1, 2009

Building Neural Networks on Unbalanced Data (using Clementine)

I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World.  I think my presentations went down well (well, I got good feedback).  There were also a few questions and issues that were posed to me.  One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

Rightly so, Dean pointed out that the building of neural nets can actually work perfectly fine against unbalanced data. The problem is that when the Neural Net determines a categorical outcome it must know the incidence (probability) of that outcome. By default Clementine will simply take the output neuron values, and if the value is above 0.5 the prediction will be true, else if the output neuron value is below 0.5 the category outcome will be false. This is why in Clementine you need to balance a categorical outcome to roughly 50%/50% when you build the neural net model. In the case of multiple categorical values it is the highest output neuron value which becomes the prediction.

But there is a simple solution!

It is something I have always done out of habit because it has proved to generate better models, and I find a decimal score more useful. Being a cautious individual (and at the time a bit jet lagged) I wanted to double check first, but simply by converting a categorical outcome into a numeric range you will avoid this problem.

In situations where you have a binary categorical outcome (say, churn yes/no, or response yes/no etc) then in Clementine you can use a Derive (flag) node to create alternative outcome values.  In a Derive (flag) node simply change the true outcome to 1.0 and the false outcome to 0.0. 

By changing the categorical outcome values to a decimal range outcome between 0.0 and 1.0, the Neural Network model will instead expose the output neuron values and the Clementine output score will be a decimal range from 0.0 to 1.0.  The distribution of this score should also closely match the probability of the data input into the model during building.  In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score sorted / ranked to determine which customers to contact first.  I subsequently use the lift metric and the incidence of actual outcomes in sub-populations of predicted high scoring customers.  I rarely try to create a categorical 'true' or 'false' outcome, so didn't give it much thought until now.

If you want to create an incidence matrix that simply shows how many 'true' or 'false' outcomes the model achieves, then instead of using the Neural Net score of 0.5 to determine the true or false outcome, you simply use the probability of the outcome in the data used to build the model. For example, if I *build* my neural net using data with 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be approximately 0.04 (10,000 / 260,000). If my neural network score exceeds 0.04 then I predict true, else if my neural network score is below 0.04 then I predict false. A simple derive node can be used to do this.
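The cut-off arithmetic above can be sketched outside Clementine in a few lines of Python. This is only an illustration of the calculation, not Clementine's own logic; the function names and the two example scores are hypothetical.

```python
def incidence_cutoff(n_true, n_false):
    """Incidence of the 'true' outcome in the data used to build the model."""
    return n_true / (n_true + n_false)


def predict_flag(score, cutoff):
    """Predict true when the decimal neural net score exceeds the cut-off."""
    return score > cutoff


# Build data from the example: 10,000 true and 250,000 false outcomes.
cutoff = incidence_cutoff(n_true=10_000, n_false=250_000)  # roughly 0.04

# Two made-up scores either side of the cut-off.
likely_true = predict_flag(0.10, cutoff)
likely_false = predict_flag(0.01, cutoff)
```

This is the same test a derive node would apply, just with the cut-off calculated from the build data instead of fixed at 0.5.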

If you have a categorical output with multiple values (say, 5 products, or 7 spend bands etc) then you can use a Set-To-Flag node in a similar way to create many new fields, each with a value of either 0.0 or 1.0. Make *all* new set-to-flag fields outputs and the Neural Network will create a decimal score for each output field. This is essentially exposing the raw output neuron values, which you can then use in many ways similar to above (or use all output scores in a rough 'fuzzy' logic way as I have in the past:).
I posted a small example stream on the kdkeys Clementine forum.
Just change the file suffix from .zip to .str and open the Clementine stream file. Created using version 12.0, but should work in some older versions.
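For readers without Clementine, the Set-To-Flag idea can be sketched in plain Python. This shows the general technique (one 0.0/1.0 field per category), not the node's actual implementation. The product names, the field prefix and the example scores are all invented.

```python
def set_to_flag(values, categories, prefix="product"):
    """Turn one categorical field into one 0.0/1.0 field per category."""
    rows = []
    for value in values:
        rows.append(
            {f"{prefix}_{c}": 1.0 if value == c else 0.0 for c in categories}
        )
    return rows


def highest_scoring(scores):
    """Pick the output field with the highest decimal score."""
    return max(scores, key=scores.get)


categories = ["A", "B", "C"]          # hypothetical product codes
flags = set_to_flag(["B", "A"], categories)

# Made-up decimal scores for one customer, one per output field.
customer_scores = {"product_A": 0.12, "product_B": 0.55, "product_C": 0.33}
best_field = highest_scoring(customer_scores)
```

Keeping every output score, not just the highest, is what allows the scores to be ranked or combined afterwards.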

I hope this makes sense. Feel free to post a comment if elaboration is needed!

 - enjoy!

Monday, October 12, 2009

See you at PAW (Predictive Analytics World) and Teradata Partners

Next week I'll be in Washington DC for Teradata Partners and also Predictive Analytics World.

I'm presenting how leveraging the social interactions of the Optus mobile/cellphone customer base has enabled unparalleled insights into customers and prospects.

In my opinion the presenters and topics being discussed are interesting and worth attending.  These conferences are the few events where industry analysts congregate and discuss their work.

I will probably have a few meetings and activities lined up, but I'm always happy to chat over a few beers. If you are there feel free to say 'hi'.  I'm in Washington for 4 days, then taking a few days holiday with family in New York.

Sunday, September 13, 2009

I'll show you mine if you show me yours...

Analysts don't usually quote predictive model performance. Data Mining within each industry is different, and even within the telecommunications industry definitions of churn are inconsistent. This often makes reported outcomes tricky to fully understand.

I decided to post some churn model outcomes after reading a post by the enigmatic Zyxo on his (or maybe her :)) blog ;

I'd like to know if the models rate well :)

I'd love to see reports of the performance of any predictive classification models (anything like churn models) you've been working on, but I realise that is unlikely... For like-minded data miners a simple lift chart might suffice.

The availability of data will greatly influence your ability to identify and predict churn (for the purpose of this post churn is defined as when good fare paying customers voluntarily leave). In this case churn outcome incidence is approx 0.5% per month, where the total population shown in each chart is a few million.

Below are two pictures of recent churn model Lift charts I built. Both models use the previous three months call summary data and the previous month's social group analysis data to predict a churn event occurring in the subsequent month. Models are validated against real unseen historical data.

I'm assuming you know what a lift chart is. Basically, it shows the magnitude increase in the proportion of your target outcome (in this case churn) within small sub-groups of your total population. Sub-groups are ranked/sorted by propensity. For example, in the first chart the top 1% of customers ranked by the predictive model contains 10 times more churn than a random 1% of customers would.
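As a rough illustration of how a single point on a lift chart is calculated, here is a minimal sketch. The twenty customers, their scores and their outcomes are invented; this is not the data or code behind the charts in this post.

```python
def lift_at(scores, outcomes, top_fraction):
    """Lift = churn rate in the top-scoring slice / churn rate overall."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate


# Twenty hypothetical customers with scores 1.00, 0.95, ... 0.05.
scores = [i / 20 for i in range(20, 0, -1)]
# Two of them churned (1 = churn): the top scorer and one mid-ranked customer.
outcomes = [0] * 20
outcomes[0] = 1
outcomes[10] = 1

lift_top_10pct = lift_at(scores, outcomes, top_fraction=0.10)
```

Here the top 10% (two customers) contains one churner, a 50% churn rate against 10% overall, so the lift is 5.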

The first model is built for a customer base of prepaid (purchase recharge credit prior to use) mobile customers, where the main sources of data are usage and social network analysis.

The second model is postpaid (usage is subsequently billed to customer) mobile customers, where contract information and billing are additionally available. Obviously contracts commit customers for specified periods of time, so act as very 'predictive' inputs for any model.

- first churn model lift

- second churn model lift

Both charts show our model lift in blue and the best possible result in dotted red. For the first model we are obtaining a lift of approximately 6 or 7 for the top 5% of the population (where the best possible outcome would be 20, i.e. 100 / 5 = 20).
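The "best possible" figure can be checked with a couple of lines. This sketch assumes a perfect model ranks every churner first; the function name is hypothetical, and the 0.5% churn incidence is the approximate monthly figure quoted earlier in this post.

```python
def best_possible_lift(top_fraction, base_rate):
    """Lift of a perfect ranking at a given slice of the population.

    If churners are rarer than the slice, the slice holds all of them and
    the lift is 1 / top_fraction; otherwise the slice is all churners and
    the lift is 1 / base_rate.
    """
    return min(1.0 / top_fraction, 1.0 / base_rate)


best_at_5pct = best_possible_lift(top_fraction=0.05, base_rate=0.005)

# A lift of 10 at 5% (the second model) as a share of the best possible.
share_of_perfect = 10 / best_at_5pct
```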

The second model is significantly better, with our model able to obtain a lift of approximately 10 for the top 5% of population (half way to perfection :)

I mention lift at 5% population because this gives us the reasonable mailing size and catches a large number of subsequent churners.

Obviously I can't discuss the analysis itself in any depth. I'm just curious what the first impressions are of the lift. I think it's good, but I could be delusional! And just to confirm, it is real and validated against unseen data.

- enjoy!

Tuesday, July 21, 2009

Books on my desk...

Over the years I have purchased a few data mining, machine learning, and even statistics books. I'll confess that I haven't read every book page by page, in fact some I've speed-read hoping to catch some interesting highlights.

Below is a summary and short review of the books that are sitting on my desk at work...

- from left to right;

- Marketing Calculator (author Guy Powell)
I got a free copy because I contributed to some of the industry examples. I'm even quoted in it! I found the book very useful and would recommend it for any marketing analyst. It talks about ROI and measuring every type of marketing event and customer interaction. Lots of case studies, which I always like. No detail in terms of data analysis itself, but plenty of food for ideas.

- Advances In Knowledge Discovery And Data Mining (editors Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy)
I first bought this for Rakesh Agrawal's article on Association Rules (Apriori in Clementine), but also found John Elder's Statistical Perspective on Knowledge Discovery very informative. It provides a great concise history of data mining.

- Data Mining Using SAS Applications (author George Fernandez)
I bought this hoping to get a different opinion or learn something new (compared to the SPSS Clementine User Guide I have far too much experience of..). I thought maybe SAS analysts had a better way to do a specific type of data handling or followed an alternative thought pattern to accomplish a goal. Sadly I was disappointed. Like many data mining books it spends hundreds of pages describing algorithms and expert options for refining your model building and less than 10 pages on data transformations and/or data cleaning. Those 10 pages are well written though. Not worth the purchase in my view. I only hope SAS analysts have better books out there.

- Data Mining Techniques (authors Michael Berry and Gordon Linoff)
Written by practitioners, which means a lot. The one book I often re-read just in case I missed something the previous time :) Maybe because it is very applicable to my role as an analyst in a marketing dept in a telecommunications carrier, but I find this book invaluable. Lots of case studies. 100 pages of background and practical tips before it even reaches 'algorithms' is good in my view, and when you do reach the algorithms they are described very well in practical terms as techniques (rather than a laborious stats class, and I didn't do stats at University). I find the whole book a joy to read. A must for every data miner.

- Data Mining. A Tutorial Based Primer (authors Richard Roiger and Michael Geatz)
Whilst going through a phase of keen hobby programming in VB.NET I tried my hand at writing a neural net, decision tree etc from scratch. I found this book really helpful since it goes through every detail a programmer would need to implement their own data mining code in Excel. I work with huge amounts of data, so the thought of doing data mining in Excel makes me giggle (maybe that's a bad thing...) but the principles of data manipulation, cleaning and prediction etc can easily be applied in Excel. If you really want to understand how algorithms work and build your own, then this book is very useful for that purpose.

- Data Mining. Introductory and Advanced Topics
If you did spend several years studying mathematics or statistics then this book would probably act as a great reference and reminder of how algorithms work.
It's very academic and sometimes that's useful. I think there's one line in there somewhere that mentions data cleaning or data transformations as being an industry thing... It is also quite a hard heavy book, so could be useful to rest stuff on.

- Data Mining. Practical Machine Learning Tools and Techniques (Ian Witten and Eibe Frank)
This is a classic example of bait advertising that some authors should be jailed for. On page 52 of this book the authors write;
"Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process. Although this book is not really about the problems of data preparation, we want to give you a feeling for the issues involved...." Fuck me, it's not a data mining book then is it? Not only that, they actually use the term "Practical" in the title. Clearly it is not practical at all if it involves absolutely zero data manipulation. If I ever meet one of these authors I will slap them in the face and demand my money back... Oh and over half the book is a damn Weka user guide.

- The Elements Of Statistical Learning. Data Mining, Inference and Prediction (authors Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Very heavy on the stats and squiggly equations (which take me ages to make sense of) but quite well written because I usually manage to understand it. Explains the algorithms stuff very well. I don't refer to it much and only read a few chapters in depth, but it was worth the purchase.

- The Science Of Superheroes (authors Lois Gresh and Robert Weinberg).
Not everything is about data mining. There's a whole world out there, and just maybe it includes super heroes with laser beams shooting out of their eyes. It's a soft-core science book discussing concepts such as faster than light speed, cosmic rays, genetically engineered hulks, flying without wings, and black holes, and how it all relates to real-life superheroes (if they existed). Really good geeky material.

- Data Preparation For Data Mining (author Dorian Pyle)
A good book, and like "Data Mining Techniques" it clearly covers topics with a practical understanding (no 'real-world' case studies though). Where it differs is that this book has a stronger academic or statistics focus. I didn't get a sense that the examples would always relate to large real-world data sets, and many methods I use were not mentioned at all (for example frequency binning) because they have no statistical basis. Here's the problem; this is a great data mining book, but only for the statistics in practical data mining. It is a book I frequently refer to and would recommend, although I'd like to see stuff added that *isn't* based on statistics.

- The Essence Of Databases (author F. Rolland)
101 database for dummies. It describes database schemas, relational concepts, tons of SQL examples for queries and data transformations, describes object oriented databases etc.
Essential stuff for anyone querying a corporate data warehouse. It reads easily and is recommended.

- Data Mining. Concepts, Models, Methods, and Algorithms (author Mehmed Kantardzic)
Another 'list all the algorithms I know' book. I'll be honest; I only quickly flicked through it hoping to see some case studies or something new. It seemed good, but didn't seem to have anything to set it apart from any other algorithms book.

- Statistics Explained. Basic concepts and methods. (authors R. Kapadia and G. Andersson)
Just in case I forget what a t-test is. Has lots of pictures :)

- Clementine User Guides (author: many at SPSS, well if memory serves me Clay Helberg did a fair chunk of it) . When I was at SPSS I had a small part to play in these. I provided some examples and proof read where possible. I've been using Clementine daily for over a decade, but still refer to the user guide occasionally. I find them useful, but they could benefit from some new examples to take advantage of the many new features that have been added in recent years.

- Enjoy!

Monday, June 15, 2009

See you at the US Teradata User Conference 2009

Quick post because I'm swamped with work...

Last year, at the Asia Teradata User Group in Beijing, I presented some generic data mining that was being performed at Optus (mostly simple churn analysis and behavioural segmentation). I also had a few meetings with analysts from some Chinese telcos about how relatively simple data analysis can scale up to many millions of customers and billions of rows of data.

This year I'll be presenting at the US Teradata User Conference some of the more advanced analysis that I've recently done, notably surrounding social network analysis in the mobile customer base on large amounts of data (several billions of rows). I'm hoping to be able to quote some actual business outcomes and put up some $ numbers.

The US 2009 Teradata User Group Conference & Expo, October 18–22, 2009, at the Gaylord National Resort.

I'll be presenting on Wednesday 21st October 2009 at Maryland D on the Business Track. Judging from the large number of presentations I'm guessing it will be a much smaller and more personal room than the 1000+ conference hall I was in last year in Beijing :)

Feel free to say hi and ask lots of questions if you see me there. I might have one free evening for a few beers if anyone wants.

Wednesday, May 20, 2009

Teradata Podcasts on Data Mining And SNA

I sometimes get asked by vendors to present case studies or examples of work so that they can attract new customers or demonstrate how existing customers can use the software/solution. Below are details of a podcast I did over the phone with Teradata (I was in Sydney, they were in US). There wasn't any preparation, I just kinda 'winged-it'. Any numbers I quoted were rough estimates from memory (not official numbers!). And yes, my voice is a bit high pitched and I do sometimes sound like a 50 year old lady....apologies :)

The podcasts discuss customer insights and the data mining analyses that are performed. We then discussed social networking analysis and how linking customers by social calling groups helps predict customer action (such as churn or acquisition of an iPhone handset). TCRM is a Teradata tool I am not familiar with, but my colleagues do use it for campaign delivery, and it has the capability to perform trigger based campaigns (such as sending a retention offer to other members of a social group when one member of that group churns).

I'm very fortunate that I am occasionally permitted to present my work. One of my main arguments for doing this is that I get peer review and feedback from other data miners, and an idea whether the analytics we do is 'better than most'.

So, I beg you! Please let me know either way; If this stuff is good or bad I need to know (especially if you work in Telco).



- - - - - - - - -

Enhancing Customer Knowledge and Retention at Optus

In This Podcast
Optus is an Australian telecommunications carrier that uses analytics to increase customer retention. The data being analyzed comes from call centers, mobile phone call details, census geo-demographic data, and a history of customer behavior. Teradata CRM and the data warehouse environment from Teradata are key to Optus’ success with reliably identifying customers that might churn and offering marketing campaigns that are relevant and timely. Optus saw a 20% reduction in churn.

Social Networking Analysis at Optus

In This Podcast
Tim Manns from Optus discusses how the company uses detailed network data from its Teradata system to look at calling behavior. With 40% of the Australian telecommunications market, the company cross-references each customer with every other customer, groups them together based on who they communicate with, looks at the behavior of the group, and can then predict next steps and target those groups with appropriate products and services.

Monday, May 4, 2009

Telstra found guilty of abuses of telecommunications network data

- Disclaimer. I do not represent any organisation. This is a personal blog and I talk freely about data mining from a personal perspective only. --

See a recent news article;

And also;,25197,25414690-15306,00.html

As a Data Miner for a telecommunications provider I frequently use network data in my analysis. How many calls the customer makes, at what time of day, do they communicate using voice or sms etc. I examine data pertaining to *customers* only.

Telecommunications companies often provision services wholesale for another company. This 'wholesale recipient' company will pay for the use of the network, but manage all other activities such as marketing, customer account and billing. In these cases, although the telecommunications company is responsible for supplying the network service and ensuring calls are successfully established (and likely stores data about these calls), it doesn't own the call data for that customer (who belongs to the 'wholesale recipient' company). Make sense? Use of the data that pertains to the actions of someone that is not a customer of that telecommunications company must be treated with the utmost caution.

Every data miner must be aware of data privacy laws, and in many countries failure to adhere to these laws attracts heavy financial penalties for the organisation and individuals involved. In Australia some invasion of privacy laws could even potentially involve 2 years jail time.

Recently Telstra, an Australian telecommunications company (and the previous incumbent), was found guilty of serious breaches of data privacy. For the 130 page publicly accessible transcript see;

I'm guessing that the significant legal costs and the years it has taken to get this result are prohibitive for many telcos, so they let it slide. Optus didn't.

Basically, the bit that caught my eye was on item 108 (yes, I speed read the whole thing...). It is legal jargon and reads;

"Telstra asserted that total traffic travelling across its network belonged to Telstra. Optus submitted that whether it belonged to Telstra is not the question posed by cl 15.1 of the Access Agreement. The question under cl 15.1 is whether Telstra owed an obligation under that clause with respect to traffic information recorded by Telstra of communications by Optus customers on the Telstra network because that information was Confidential Information of Optus. The definition of Confidential Information identifies what is the Confidential Information of Optus. Once a CCR records information in relation to a call made by an Optus customer, that information becomes the Confidential Information of Optus because it falls within the definition of ‘Confidential Information’. "

The first sentence is shocking. In English it basically suggests that Telstra treats all network calling data as its own, and freely uses information about calls made by anyone on that network as it sees fit. That includes calls made by customers of wholesale or competitor companies on its network. In the case of wholesale for fixed line (land) networks Telstra will know the address and likely also the name of the customer. In the early days Optus had little choice but to use some of Telstra's fixed line infrastructure, often the last bits of copper wire that reach a household. The information about this usage was passed to Executive and board members so that they knew customer size and market share by age, geography etc. It is also highly likely (although difficult to prove) that the Telstra retail arm used the data for marketing activities and actioned direct communications to that customer. Anti-competitive to say the least...

One of the short conclusions of the legal findings was;

"For the foregoing reasons, I find that Telstra has used traffic information of Optus, or Communication Information of Optus for the purposes of the Access Agreement, both in the preparation of market share reports and in distributing those reports among Telstra personnel. I also find that such information is Confidential Information of Optus for the purposes of the Access Agreement, or is otherwise subject to the requirements of confidentiality in cl 15 of the Access Agreement, by force of cl 10 of that agreement. I also find that neither such use of such information nor its disclosure for such purposes is permitted by the Access Agreement."

I guess the information here is probably too much in 'telco land', but hopefully it's clear enough to understand the gravity of this. I've known this type of stuff was being conducted by some telcos for a long time, but I'm shocked it was so brazen.

Knowing the big differences between what we (as Data Miners) are 'able to do' regarding insights and personal information (particularly in mobile telecommunications) and what we 'should do' is very important. Years ago the industry passed the early developmental stage of storing data; in recent years we have learned how to understand the data and convert it into useful insights. I still think that many data miners don't realise how important (now more than ever before) it is that we act responsibly in the use of the personal information we obtain from 'our' data.

Wednesday, April 22, 2009

When graphs, piecharts and all else fails... Dilbert to the rescue!

If you work in a Marketing or Sales department then you probably have the challenging task of convincing your less technical colleagues of the benefit of using your awesome customer insights.

I'm quite proud of the social network analysis (SNA) that I'd first completed months ago. It is refreshed each month (the data warehouse load is too high to run it daily or weekly as I would like). I've been tracking its performance, and am continually surprised.

The trouble is that my colleagues are having trouble understanding how they can use it to formalise customer communications, so I decided to try a different approach than graphs and piecharts etc.

Instead I thought I might try something humorous, hence Dilbert to the rescue! I have created a dozen or so custom Dilbert slides that each provide some info about a customer insight made available by the SNA and also have a humorous conclusion to those insights. I'll pass these around the department in a series of daily emails.

Here is one example (I had to change the project nickname to "SNA" for this blog);

Monday, April 6, 2009

Clementine is dead, long live PASW Modeller

For the new Clementine homepage see

SPSS have gone for new product names, including changing Clementine to PASW. I'm more interested in the new features and bug fixes than buzz words. I'll hopefully be getting the new version shortly and will let you know if Clementine 13 (aka Predictive Analytics Soft Ware Modeller) adds value.

Monday, March 30, 2009

Tips for the KDD challenge :)

I recently heard about the KDD challenge this year. It's a telco-based challenge to build churn, cross-sell, and up-sell propensity models using the supplied train and test data.

For more info see;

I am not able to download the data at work (security / download limits), so I might have to try this at home. I haven't even seen the data yet. I'm hoping it's transactional CDRs and not in some summarised form (which it sounds like it is).

I don't have a lot of free time so I might not get around to submitting an entry, but if I do these are some of the data preparation steps and issues I'd consider;

- handle outliers
If the data is real-world then you can guarantee that some values will be at least a thousand times bigger than anything else. A log transform might not work, so try a trimmed mean or frequency binning as a method to deal with outliers.

- missing values
The KDD guide suggests that missing or undetermined values were converted into zero. Consider changing this. Many algorithms will treat a zero very differently from a null. You might get better results by treating these zeros as nulls.

- percentage comparisons
If a customer can make a voice or sms call, what's the percentage split between them? (eg 30% voice vs 70% sms calls). If there are only voice calls, then consider splitting by time of day or peak vs offpeak as percentages. The use of percentages helps remove differences of scale between high and low quantity customers. If telephony usage covers a number of days or weeks, then consider a similar metric that shows increased or decreased usage over time.

- social networking analysis
If the data is raw transactional CDRs (call detail records) then give a lot of consideration to performing a basic social networking analysis. Even if all you can manage is to identify a circle of friends for each customer, this may have a big impact upon identification of high churn individuals or up-sell opportunities.

- not all churn is equal
Rank customers by usage and scale the rank to a score from zero (low rank) to 1.0 (high rank). No telco should still be treating every churn as an equal loss. It's not! The loss of a highly valuable customer (high rank) is worse than the loss of a low spend customer (low rank). Develop a model to handle this and argue your reasons for why treating all churn the same is a fool's folly. This is difficult if you have no spend information or history of usage over multiple billing cycles.
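
A few of the steps above can be sketched in a handful of lines. This is only a rough illustration in Python, not something I have run against the KDD data (which I haven't seen). The function names and the example numbers are made up for the sketch. It shows treating zeros as nulls, a voice vs sms percentage, and a usage rank scaled from zero to 1.0.

```python
from typing import Dict, Optional


def zero_to_none(value: float) -> Optional[float]:
    """Treat a zero as a missing value (None). Whether a zero really means
    'missing' depends on the data, so this is an option to test, not a rule."""
    return None if value == 0 else value


def voice_share(voice: float, sms: float) -> Optional[float]:
    """Fraction of all calls that are voice calls.
    Returns None when the customer has no usage at all."""
    total = voice + sms
    return voice / total if total > 0 else None


def scaled_rank(usage: Dict[str, float]) -> Dict[str, float]:
    """Rank customers by usage and scale the rank to 0.0 (lowest) .. 1.0 (highest).
    Ties are broken by input order in this simple version."""
    ordered = sorted(usage, key=usage.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {customer: i / (n - 1) for i, customer in enumerate(ordered)}


# Hypothetical customers and usage counts, purely for illustration.
usage = {"A": 100, "B": 0, "C": 1000, "D": 5}
ranks = scaled_rank(usage)
```

A scaled rank like this could then be multiplied by a churn propensity so that the loss of a high rank customer counts for more than the loss of a low rank one.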

Hope this helps

Good luck everyone!

Friday, March 27, 2009

Presenting at conference Uniscon 2009

I've been asked to present at Uniscon 2009. One of the professors involved at the University of Western Sydney is a relative of an analyst I work with and requested I present. I usually find academic conferences are snooze city, but they promised me free beer and I live in Sydney anyway, so I can get home to see the baby before the night's end. I hope I'm just one of many industry persons there and it proves to be an insightful event.

I'm not presenting work. I will be presenting from a personal perspective as an industry data miner (I've not enough time to prepare my presentation and get legal approval from work) and I'll be discussing generic topics instead of describing recent data mining projects and quoting numbers or factual business outcomes.

I suspect a large part of my attendance is to drive some enthusiasm and make the students interested in data mining and aware of what challenges you face in data mining roles.

If you are attending then feel free to say 'hi'.
For info on the conference see;

wider website

Below was the presentation title and abstract I threw together (now just have to write it...). There is a social networking analysis (SNA) element to it (because that's what I'm focused on at the moment).

Presentation title:
Know your customers. Know who they know, and who they don't.

Presentation Abstract:
Tim's presentation will describe some of the types of marketing analysis a typical telecommunications company might do, including social network analysis (SNA, which is a hot topic right now). He also elaborates on the technical and practical side of data mining, and what business impacts data mining may have.

More importantly the presentation will help answer questions such as;
- What skills are required for Data Mining?
- What problems are commonly faced during Data Mining projects?
- And just what is this Data Mining stuff all about anyway?

- Tim

Thursday, March 26, 2009

Closing days of the Data Mining survey

I got a quick email yesterday from Karl Rexer. There are a few days remaining to participate in his yearly data mining survey.

Survey Link:
Access Code: TM42P

If you frequently conduct data analysis on large amounts of data (ie data mining!) then I urge you to participate.

- Tim

Wednesday, March 11, 2009

And then there were Three, not!

I haven't been contributing to forums or making posts with my usual vigour because of a few recent events;

1) becoming a daddy
-> lots of fun!

2) recent announcement of a merger between the telcos Vodafone and Hutchison.
-> pain in the arse!
For info see

Australia's population is approximately 20 million, which is pretty small, and there were four players in the mobile service provider market (in probable order of market share); Telstra, Optus, Vodafone, Three.

The announcement that Vodafone and Three are merging reduces this to three players, which reshapes the landscape of Australia to closely match many other countries with mature telecommunications markets. Most of those countries have only a few players and, in this current economic climate, it's not surprising that there will be mergers and therefore consolidation of customers into larger groups.

As a result of the merger, the competitors (Telstra & Optus) will have to review their strategies and probably re-examine customer analysis. Lots of work for us Data Miners...

Tuesday, March 3, 2009

How many models is enough?

I recently missed a presentation by a data mining software vendor (due to my recent paternity break) but I've been reviewing my colleagues' notes and the vendor presentation slides. I won't name the vendor, you can probably work it out.

A significant part of the vendor solution is the ability to manage many, we're talking hundreds, of data mining models (predictive, clustering etc).

In my group we do not have many data mining models, maybe a dozen, that we run on a weekly or monthly basis. Each model is quite comprehensive and will score the entire customer base (or near to it) for a specific outcome (churn, up-sell, cross-sell, acquisition, inactivity, credit risk, etc). We can subsequently select sub-populations from the customer base for targeted communications based upon the score or outcome of any single model or a combination of models, or any criteria taken from customer information.

I'm not entirely sure why you would want hundreds of models in a Telco (or similar) space. Any selection criteria applied to specific customers (say, by age, or gender, or state, or spend) before modeling will of course force a biased sample that feeds into the model and affects its inherent nature. Once this type of selective sampling is performed you can't easily track the corresponding model over time *if* the sampled sub-population ever changes (which is likely because people do get older, move house, or change spend etc). For this reason I can't understand why someone would want or have many models. It makes perfect sense in Retail (for example a model for each product, or association rules for product recommendations), but not many models that apply to sub-populations of your customer base.

Am I missing something here? If you are working with a few products or services and a large customer base why would you prefer many models over a few?

Comments please :)

Monday, January 19, 2009

re: "Thoughts on Understanding Neural Networks"

Great post by Gordon Linoff on the Data Blog about visualising Neural Networks

I usually get better predictive success using neural nets, but the lack of explain-ability is always a downside. I'm always keen to see ways that might help explain or interpret a Neural Network. A few years ago I tried a simple graphical way to show a Neural Net, but I think Gordon's recent post highlights better options.

My example is written in VB.NET and parses a single hidden layer Neural Network saved as PMML into data grids. Once the neural net neurons and weights are loaded into the data grids I then read from the data grids and create a graphic of the Neural Net. Transparency is used to show the strength of the weight, whilst colour (blue or red) is used to show the negative or positive effect. You can view my example at;
and download the source code, executable and example PMML directly from here;
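
For anyone who wants the gist without the VB.NET, below is a rough Python sketch of the same idea. It is not my original code. The tiny PMML fragment is invented for the illustration and leaves out the namespace and most of the elements a real PMML file carries. Which colour goes with which sign (blue for negative, red for positive) is an assumption in this sketch too.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed PMML-style fragment (no namespace, one hidden layer).
PMML_SNIPPET = """
<NeuralNetwork>
  <NeuralLayer>
    <Neuron id="h1" bias="0.1">
      <Con from="in1" weight="1.5"/>
      <Con from="in2" weight="-0.5"/>
    </Neuron>
    <Neuron id="h2" bias="-0.2">
      <Con from="in1" weight="-3.0"/>
      <Con from="in2" weight="0.75"/>
    </Neuron>
  </NeuralLayer>
</NeuralNetwork>
"""


def load_connections(pmml_text):
    """Return a list of (from_id, to_id, weight) tuples, one per connection."""
    root = ET.fromstring(pmml_text.strip())
    connections = []
    for neuron in root.iter("Neuron"):
        for con in neuron.iter("Con"):
            connections.append(
                (con.get("from"), neuron.get("id"), float(con.get("weight")))
            )
    return connections


def style_connections(connections):
    """Map each weight to a colour (sign) and an opacity (relative strength)."""
    if not connections:
        return []
    max_abs = max(abs(w) for _, _, w in connections) or 1.0
    styled = []
    for src, dst, weight in connections:
        colour = "red" if weight >= 0 else "blue"  # assumed colour convention
        alpha = abs(weight) / max_abs              # 0.0 transparent .. 1.0 solid
        styled.append((src, dst, colour, alpha))
    return styled


styled = style_connections(load_connections(PMML_SNIPPET))
```

Each styled tuple could then be handed to whatever drawing library you prefer to draw one line per connection.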

I can't post images as comments on Gordon's blog so below are two snapshots of the simple UI application that displays the Neural Net PMML graphic.

a) graphic

b) PMML loaded into data grids


- Tim

Thursday, January 8, 2009

Isn't In-database processing old news yet?

A bit of a Clementine plug, but hear me through...

I'm puzzled by a few recent articles I've read describing in-database processing, the practice of doing very sophisticated data warehouse analysis (let's call it data mining :) on large amounts of data without having to extract the data into an external analytics tool (for example, a tool like SPSS or SAS).

As an example see the current Teradata magazine article; I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) on these topics during a Teradata conference in Beijing in June 2008 *. I think he's right on the money concerning his top 4 predictions for data warehousing. As a Data Miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I'm not so sure it is a 'future' thing though. In my view it's here now, just maybe not so widespread. My only guess is that it's another plug for the SAS partnership. Although I don't use SAS I do like the thinking and development plan going forward. I simply don't think it's the new concept it is being touted as. I'm sure it's not necessary for a data miner or data analyst to need custom plug-ins (and corresponding expense) to reach in-database data mining nirvana.

In-database processing is nothing new. I've been doing it using SPSS Clementine and Teradata for a few years now. SPSS Clementine has supported this functionality for quite a few years. In real time Clementine converts the stream (a graphical icon-based proprietary query file) into SQL (structured query language) and submits the SQL query(s) to the data warehouse. Any computation that cannot be represented as SQL will cause a data extraction and further processing by the Clementine analysis engine itself (commonly the Clementine Server on a dedicated analytical server box will do this, and keep the data and temp files on the server file system. Not the desktop). In practice I usually avoid heavy statistical functions, so all my data processing is usually performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. When it comes to scoring the created data mining model (such as a neural network or decision tree) Clementine also converts the data mining model into SQL transparently for truly high scale processing on the data warehouse.
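
To give a feel for what 'converting a model into SQL' means, here is a toy sketch in Python. It is emphatically not how Clementine does it, and the SQL it produces is not what Clementine generates. The tree, the column names, the table name and the scores are all invented. It simply turns a small decision tree into a nested CASE expression that a data warehouse could evaluate for every customer.

```python
# Invented two-level decision tree. Leaves carry a churn score,
# branches test "column <= threshold".
TREE = {
    "column": "calls_last_month",
    "threshold": 10,
    "left": {"score": 0.80},
    "right": {
        "column": "months_on_network",
        "threshold": 6,
        "left": {"score": 0.45},
        "right": {"score": 0.05},
    },
}


def tree_to_sql(node):
    """Recursively convert a tree node into a SQL CASE expression."""
    if "score" in node:
        return repr(node["score"])
    return "CASE WHEN {column} <= {threshold} THEN {left} ELSE {right} END".format(
        column=node["column"],
        threshold=node["threshold"],
        left=tree_to_sql(node["left"]),
        right=tree_to_sql(node["right"]),
    )


def scoring_query(tree, table):
    """Wrap the CASE expression in a SELECT that scores every row of a table."""
    return "SELECT customer_id, {} AS churn_score FROM {}".format(
        tree_to_sql(tree), table
    )


query = scoring_query(TREE, "customer_summary")
```

Because the whole model ends up as one ordinary query, the warehouse can schedule and optimise it like any other SQL, and no rows need to leave the database to be scored.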

The real advantage comes from not just being able to score existing data mining models, but also being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (a couple of years). Not something I do much of (I've done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do) then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and this usually beats tweaking algorithm options.

The data warehouses have actually supported embedded code and adding custom functions that might include a data mining algorithm for quite a while. For info see a recent post by Seth Grimes titled "In-Database Analytics: A Passing Lane for Complex Analysis";

Only in the past year or so have full blown embedded data mining algorithms taken off. The question is though, will these algorithms always run fast(er)? Custom code can be good or bad! One advantage of the 'algorithms converted into SQL' route is that the data warehouse can quite easily determine and control how to process and prioritise the SQL query and be optimised specifically for it. Custom code and embedded data mining algorithms can also be optimised, but I'm guessing that requires far more effort (and expense!). One worry is also that custom code brings dangers and risks (not to mention the testing and issues for IT and the DB admin). Still, it's necessary for in-database data mining model building capability.

Ok, I'm guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me;
- considering that once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?

My preference is for an easy tool that makes querying the data warehouse and constructing highly complex analysis simple. My queries could not possibly be prepared by hand since they are often transformed into many thousands of lines of SQL code. I use a clever user interface to make understanding the logic of the analysis possible. The data mining tool I use is primarily a tool that optimises my interaction with the data warehouse.

So considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS :) will stick around for quite a bit longer because the user interface optimises a data miner's ability to query the data warehouse and perform data mining efficiently, but maybe in a few years we will start to see what we commonly refer to as data mining 'algorithms' being developed for data warehouses and no longer in data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!

- Tim

* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my data analysis work on churn prediction and product upsell in telecommunications, and chatted to the China Mobile analysts afterward. On a more personal note, that is also when my soon-to-be-born son was conceived. Don't worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him 'Beijing' or 'Teradata' but my wife doesn't share my enthusiasm :)

Tuesday, January 6, 2009

book review "Data Preparation For Data Mining"

Just before Christmas I bought myself yet another data mining book (I have a few dozen). This one somehow slipped by me for 10 years but I'm glad I finally stumbled upon it. Dorian Pyle's "Data Preparation For Data Mining" was originally published in 1999, when Data Mining was less widespread and 'Predictive Analytics' wasn't the buzz word it is today.

The only criticisms I could possibly raise are;
1) that everything has a statistical basis.
- For example one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don’t work so I often use SQL to convert variables into 100 percentile bins (where each bin has the same number of rows (customers) in it). That type of insight isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and useful insights.
2) no mention of SQL or step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.
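
The binning by count mentioned in point 1 is easy to sketch. I do it in SQL at work, but the rough Python version below shows the idea. The function name and sample values are made up for the illustration. It is similar in spirit to the SQL NTILE window function, although the way leftover rows are shared between bins may differ.

```python
def equal_count_bins(values, n_bins=100):
    """Assign each value a bin number from 1 to n_bins so that every bin holds
    (as near as possible) the same number of rows. Only the order of the
    values matters, so one extreme outlier cannot stretch the scale."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # Integer arithmetic spreads the ranks evenly across the bins.
        # Tied values can land in neighbouring bins in this simple version.
        bins[idx] = rank * n_bins // len(values) + 1
    return bins


# Heavily skewed, invented usage values: the last customer dwarfs the rest.
usage = [1, 2, 3, 4, 5, 6, 7, 1000000]
quartiles = equal_count_bins(usage, n_bins=4)
```

Here the customer with a usage of 1000000 simply ends up in the top bin alongside the customer with a usage of 7, which is exactly the flattening effect that a log function fails to give on extremely skewed data.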

So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. It's definitely a book I will refer to, and well worth buying.

In February 2004 Dorian Pyle made an interesting post about things to avoid when data mining;
"This Way Failure Lies "

- Tim