Monday, December 15, 2008

No Long Tail from iTunes

There have been quite a few posts in recent months about analytics involving the Long Tail.
For an overview see;

Personally, I definitely fall into a Long Tail demographic regarding my music habits. I buy many relatively uncommon (and some very obscure) funk and jazz albums or compilations. A significant number of these were introduced to me by Amazon's recommendation engine or similar.

I thought I might be missing out on some good music by not being part of the 'iTunes generation'. It came as a huge shock to me when I joined iTunes this weekend and found that *none* of my dozen most recent purchases are even listed in iTunes (all my recent purchases have been from Amazon). Granted, some of the artists have died and their albums are a few decades old, but other titles were released last year.

Not much chance of a Long Tail when 'stock' is limited. With Amazon I have actually ordered and received something physical: a compact disc that was sitting on a shelf somewhere in the US and travelled across the Pacific to me in Sydney.

What excuse do iTunes have? Downloading .mp3s hardly has the same requirements for stock management, inventory and distribution. A long tail in the iTunes business model should be easier to support and yield greater benefits (than Amazon, for example) because of the lesser requirement for physical stock management (I'm guessing it would be a simple case of more disk space). A long tail is not likely to exist where there is less choice for consumers!

In my view the whole idea of the 'Long Tail' applies to circumstances where the constraints of physical stock management are removed (as with iTunes). The fact that Amazon excel at this with physical stock is a credit to them. iTunes is simply a mass-market disgrace :)

Guess I'll continue to buy CDs and burn them into my own mp3s and stream music from my home network...

- Tim

btw. See my wish list for examples :)

Sunday, December 7, 2008

Wikipedia entry for SPSS Clementine

I was doing a search on Wikipedia today, and out of curiosity I wondered what the Clementine entry said.

Although there is some information on data mining, I was surprised that there was nothing about the SPSS Clementine data mining tool, so I added a brief entry.

Feel free to contribute.

I might add more when I get a block of free time.

- Tim

Sunday, November 30, 2008

Smart Data Collective

I was recently asked to join the Smart Data Collective, which is a social community of bloggers and professionals interested in data warehousing and enterprise analytics. It is sponsored by Teradata and is editorially independent. Some of my posts will appear there, and maybe even some specific articles from me.

They asked me to be a founding member and featured blogger which sounds like a lot of work, but I've been reassured isn't :)

On a related but separate note, I recently participated in a podcast for Teradata concerning the data analytics we do at Optus (legal disclaimer: I do not represent Optus in any way in this personal blog :). I discussed previously presented material regarding our churn prevention analysis, and also my recent social network analysis. The podcast was only completed recently, so it's still early days and it needs to pass approval from the appropriate legal channels, but hopefully it will find its way onto the Teradata site in the New Year. I'll keep you posted.

- Tim

Monday, November 24, 2008

Movember Madness

Through executive meetings and BBQs with friends I have worn a 'Mo' as a symbol of my support for Movember.

That's right we're talking about men's love bumps and feeling down in the dumps. To raise awareness of prostate cancer and depression I have grown a truly dodgy moustache.

I have found it a challenge and am looking forward to shaving the damn thing off...

With only a few days to go I will soon be posting 'before' and 'after' pictures. There is still time to donate your hard earned cash if you wish;



- Edit: ok, here's the dodgy picture of my moustache. I raised $125 for my hardship :)

Wednesday, November 19, 2008

A simple Data Transformation example...

In my experience of customer-focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format. Often the data is transformed to a 'single row per customer' or similar summarised format, and many columns (aka variables or fields) are created to act as inputs into predictive or clustering models. Such data transformation can also be referred to as ETL (extract, transform, load), although my work is usually done as SQL within the data warehouse, so it is just the ‘T’ bit.

Granted, a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple. I hope the example below, which transforms transactional data into a useful customer-centric format, is generic enough to apply widely. Feedback and open discussion might broaden my habits.

Strangely many ‘data mining’ books almost completely avoid the topic of data processing and data transformations. Often data mining books that do mention data processing simply refer to feature selection algorithms or applying a log function to rescale numeric data to act as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc), or measures of dispersion (range, variance, standard deviation etc).

There seems to be a glaring big gap! I’m specifically referring to data processing steps that are separate from those mandatory or statistical requirements of the modelling algorithm. In my experience relatively simple steps in data processing can yield significantly better results than tweaking algorithm parameters. Some of these data processing steps are likely to be industry or data specific, but I’m guessing many are widely useful. They don’t necessarily have to be statistical in nature.
So (to put my money where my mouth is) I've started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I’ve attached a small demo data file (which I created, and which is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum

I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it is what I would find useful. This is just the start of a series of ETL steps you could perform.

I’ve also added a poll for feedback on whether this is helpful, too basic, etc.

- Tim

Example data processing steps

a) Creation of additional dummy columns
Where the data has a single category column that contains one of several values (in this example voice calls, sms calls, data calls etc) we can use a CASE statement to create a new column for each category. We can use 0 or 1 as indicators of whether the category value occurs in any specific row, but you can also use the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value.

For example;

Can be changed to;


Or even;


b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns.
So we could change the previous step to something like this;
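
A minimal sketch of the summarisation step, again using an in-memory SQLite table from Python. The table, column names and sample rows are hypothetical; the point is only that the CASE expressions move inside SUM() and the query groups by the unique identifier.

```python
import sqlite3

# Hypothetical transactional rows: (customer_id, call_type, call_count)
rows = [
    (1, "voice", 3),
    (1, "sms", 5),
    (1, "voice", 2),
    (2, "data", 2),
    (2, "voice", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (customer_id INTEGER, call_type TEXT, call_count INTEGER)"
)
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", rows)

# One output row per customer: the dummy columns are summed within each group.
summary_sql = """
SELECT customer_id,
       SUM(CASE WHEN call_type = 'voice' THEN call_count ELSE 0 END) AS voice_calls,
       SUM(CASE WHEN call_type = 'sms'   THEN call_count ELSE 0 END) AS sms_calls,
       SUM(CASE WHEN call_type = 'data'  THEN call_count ELSE 0 END) AS data_calls,
       COUNT(*) AS row_count
FROM calls
GROUP BY customer_id
ORDER BY customer_id
"""
summary = conn.execute(summary_sql).fetchall()
for row in summary:
    print(row)
```

Swapping SUM for AVG (or adding both) gives the averaged variant mentioned above.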

Thursday, November 13, 2008

People are Age-ist !

- Just an interesting customer insight that made me laugh the other day....

As a small part of further social network analysis of a mobile (cell-phone) customer base I have examined age differences between customers and the people with whom they communicate most frequently.

I was also looking at how reliable it might be to guess someone's age (a customer or non-customer) by extrapolating from known individuals. We hold a customer's age approx 97% of the time, and it's accurate approx 92% of the time (unusually large numbers of people claim to be born on 1st Jan 2000 :)

I was surprised to see (but then maybe I'm naive :) how so many people have 'mobile calling relationships' mainly with people the same age...

The chart below shows the percentage of customers against the average age difference between them and the people they communicate with most frequently. Age differences over 4 years are comparatively rare...

The small spike at 30 years difference is probably parent-to-child communication.

I will be using this to support an estimation of age for prospects and customers where age is unknown, but age of social group members is known.
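
One simple way such an estimate could work is sketched below. It assumes the estimate is the median age of a person's most frequent contacts whose age is known; the median, the `min_known` threshold and all names and ages are illustrative assumptions, not the method actually used.

```python
from statistics import median

# Hypothetical data: known ages, and each person's most frequent contacts.
known_age = {"A": 34, "B": 36, "C": 31, "D": 62}
top_contacts = {"X": ["A", "B", "C"], "Y": ["D"], "Z": ["Q"]}


def estimate_age(person, min_known=1):
    """Estimate age as the median age of contacts whose age is known."""
    ages = [known_age[c] for c in top_contacts.get(person, []) if c in known_age]
    if len(ages) < min_known:
        return None  # not enough known contacts to make a guess
    return median(ages)


print(estimate_age("X"))  # median of 34, 36 and 31
print(estimate_age("Z"))  # no contact with a known age
```

A median is used rather than a mean so that a single parent-to-child relationship (the 30 year spike) does not drag the estimate.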

- Tim

Monday, October 20, 2008

Distribution of a prediction score

Ok, what are people's views on this?

I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.

- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s).

We often use neural networks to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.
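
The ranking step can be sketched in a few lines. The customer ids and scores here are invented, and the 5% cut-off is just the figure mentioned above.

```python
def top_fraction(scores, fraction=0.05):
    """Rank customers by descending churn score and keep the top fraction."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return [customer for customer, _ in ranked[:n]]


# Hypothetical scores between 0.0 and 1.0 for 100 customers.
scores = {f"cust{i:03d}": i / 100 for i in range(100)}
campaign = top_fraction(scores, 0.05)
print(campaign)
```

Only the ordering of the scores matters for this selection, which is part of why the shape of the score distribution is open to debate.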

- problem -
We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good fare paying customers). So, in terms of actual outcomes, 98% of customers have a score of 0.0 and 2% have a score of 1.0.

When I build my predictive model I try to ensure the model mimics this distribution. My view is that most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.
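
How that re-scaling is done is not described here, so the following is only one plausible version: a rank-based rescale, where each customer's new score is their rank divided by the number of customers. The names and raw scores are made up.

```python
def rank_rescale(scores):
    """Map raw scores onto (0, 1] by rank so customers spread evenly."""
    ordered = sorted(scores, key=scores.get)  # lowest raw score first
    n = len(ordered)
    return {customer: (rank + 1) / n for rank, customer in enumerate(ordered)}


raw = {"a": 0.02, "b": 0.05, "c": 0.04, "d": 0.90}
rescaled = rank_rescale(raw)
print(rescaled)
```

The ordering of customers is unchanged, so a 'top 5%' campaign list is identical either way; what is lost is any sense of how much more likely customer "d" is to churn than customer "b".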

I often examine the distribution as a sanity check before deployment. If the distribution is as expected it is something like this;

If it looks screwy, maybe something like this;

- then there may be a problem with the data processing, or the behaviour of customers has sufficiently changed over time to require a model refresh. The subsequent actual model outcome performance is often not as good in this case.
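
A sanity check along these lines could be automated roughly as follows. The ten bins, the 0.3 cut-off and the 90% threshold are arbitrary assumptions chosen to match the rough proportions described earlier, and both sample score lists are invented.

```python
def score_histogram(scores, bins=10):
    """Share of customers falling in each equal-width score bin."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def looks_expected(scores, low_cutoff=0.3, min_low_share=0.9):
    """True if most customers have a low churn score, as rare churn implies."""
    low = sum(1 for s in scores if s < low_cutoff)
    return low / len(scores) >= min_low_share


skewed = [0.1] * 95 + [0.6, 0.7, 0.8, 0.9, 1.0]  # mostly low scores
flat = [i / 100 for i in range(100)]             # spread evenly

print(score_histogram(skewed))
print(looks_expected(skewed), looks_expected(flat))
```

Comparing this month's histogram against last month's would also flag a sudden shift even when both pass a fixed threshold.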

- question -
What are your views/preferences on this?
What steps, if any, do you take in an attempt to validate the model prior to deployment (let's assume testing, validation and prior months' performance is great)?

- Tim

Thursday, October 16, 2008

new book "Marketing Calculator"

I'm famous! :) I'm delighted to say that I’ve been referenced in a new marketing book by Guy Powell titled “Marketing Calculator”.

I met Guy in Oct 2007 whilst presenting a couple of data mining case studies at a Marketing Analytics conference in Singapore. My presentation title was ‘given’ to me by the conference organisers, but it allowed some freedom regarding the content. I discussed how we use a comprehensive data warehouse, and how having access to detailed customer data enriched with demographic data can enable you to get some impressive response rates from campaigns, and sales numbers by up-selling to existing customers, and most importantly retain and grow your customer base.

Guy and I spent some time discussing our work and he asked if he could include it as a case study in his book. I received the book yesterday afternoon and am already through the first 4 chapters (60 pages). It reads easily and is proving to be a very worthwhile book. It is a Marketing book (it doesn’t contain statistical equations or examples of algorithms on data mining) and I would recommend it for anyone involved in marketing. So far it has provided some well structured and clear insights into how you can improve your marketing practices.

- Tim

Sunday, September 21, 2008

Social Network Analysis in mobile networks

Ok this is a big project that has consumed a lot of my time. It was actually completed a few months ago, but I’ve only recently had the time to present it or mention it in a public blog. I’m writing this free-form whilst some large queries are running in the background. I’ll add more to the thread when I get some more free time. Hopefully it will make interesting reading. I do tend to get excited with my projects like this, so please forgive me if my propaganda rambles on a bit...

The aim of these posts is to reveal some typical data mining problems I encounter. It will superficially describe a social networks project I have recently completed. Hopefully enough to give insight, but not enough to replicate the whole thing exactly :)

I would like to extend my sincere thanks to Jure Leskovec and Eric Horvitz at Microsoft for their work on social networks within the IM user base, and also Amit Nanavati et al at IBM India Research Labs for their work on social networks in a US telco regarding churn prediction. Both were kind enough to send me their published papers detailing their work in large scale social network analysis. I’d already completed most of my work, but both of their papers gave me some very informative insights and ideas.

I'd like to emphasise that my work is significantly simpler in terms of the social analysis computation itself. As much as I would like, I can't afford to investigate whether we have 6.6 degrees of separation or not. Much of the ground breaking work from these researchers involves continuous processing of data for days. Processing is often performed against binary files or flat files using dedicated systems. My data is stored within a terabyte scale data warehouse with hundreds of concurrent demands. Constraints in terms of data warehouse load and computing restrictions mean that my analysis must complete within a practical timeframe. In order to 'production-ise' the analysis it must be a series of SQL queries that someone can simply start with the click of a button. I perform data cleaning, summarisation and transformations on several hundred million CDRs (call detail records) and calculate social groups for our customer base in less than 3 hours, entirely in SQL on a Teradata data warehouse. I think that in itself is pretty cool, but consequently I must acknowledge that my social networks are comparatively basic and my analysis does not investigate the attributes of the social networks in as much depth as others have.

Why do this?
Working for an Australian telco, in a market with 100% mobile (cell-phone) saturation, the primary focus is customer retention. From a data mining perspective this usually means we examine call usage and, based upon recent customer behaviour, we identify which customers might have problems or want to leave (telcos call this churn, banks often call it attrition). It costs a lot to win a new customer, far less to do something nice to keep an existing customer. The core of my data mining is to use the customer data within an integrated data warehouse to better understand the customer and deliver a service that appears specific to them as an individual. More recently I've tried to focus on communities and using the social fabric surrounding a customer to ensure we better adapt to and anticipate customer actions. Hence the need for a social network analysis, a method to identify and investigate the communities that naturally exist within our mobile customer base. This is quite different from the standard analysis that focuses on customers as individuals.

What is it all about?
From a customer-focused point of view the theory is that the influence of work colleagues, friends, and family is far stronger than any message a company can convey through TV or the media. By identifying influencers and social relationships within our customer base we can more effectively anticipate customer actions such as churn. For example, targeting the leaders of social groups and ensuring they are happy will spread positive word-of-mouth effects virally throughout social groups (which may even include competitors' customers). Being able to even monitor and measure the viral nature of communications with customers is valuable enough.

How do you do it?
So, recently I have been working on a project to develop analysis that identifies social groups, leaders, followers, churn risks and similar attributes within our customer base. It’s difficult to give too many details without risk of divulging intellectual property, so please assume any details or numbers I provide are rough estimates only...

Some Numbers...
- Let's suppose we have 4 million mobile customers.
- Suppose average outbound is approx 10 calls per day.
- Suppose average inbound is approx 10 calls per day.
- So, we have approx 80 million rows of data every day.
- The terminating number dialled can vary to include country codes, area codes etc.
- People communicate using voice, sms, picture messaging, and video.
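
The row counts follow directly from these rough figures; a quick check of the arithmetic, using only the illustrative numbers in the list:

```python
customers = 4_000_000
outbound_per_day = 10
inbound_per_day = 10

# Every outbound and every inbound call is one row of call detail data.
rows_per_day = customers * (outbound_per_day + inbound_per_day)
rows_per_week = rows_per_day * 7

print(f"{rows_per_day:,} rows/day, {rows_per_week:,} rows/week")
# prints "80,000,000 rows/day, 560,000,000 rows/week"
```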

Early Data Manipulation Issues
Already you can see a few problems to deal with;
A) A lot of data! One week of data alone is over 500 million rows.
B) The same terminating number can be dialled multiple ways (with or without country codes). In order to identify 'who' a customer communicates with we need to 'clean' the number dialled by resolving country codes, area codes etc so that the same number is resolved irrespective of whether country prefixes are used or not. Yes, we have to perform SQL string cleaning functions on all the data in order to resolve all dialled phone numbers. I did this using a conceptually simple but long winded SQL case statement. It doesn’t actually take long in our data warehouse, we’re talking several minutes, not hours.

C) Different forms of communication (voice, sms, picture messaging, video).
Once the dialled numbers have been cleaned, summarisation by customer number and dialled recipient can be performed. In our case this summarisation involves calculating totals for calls of different forms of communication. The summarised data is one row per customer vs recipient combination. Numerous columns contain sums regarding different calls.

D) Calls can be outbound or inbound. Each is distinguished and processed separately at first. String cleaning is also performed to resolve the originating telephone numbers. Outbound calls started by our customers are summarised as above, so too are inbound calls received by our customers from any source.
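
Steps B and C can be sketched in Python, although the real work is done as a long SQL CASE statement in the warehouse. The cleaning rules below (strip punctuation, then fold `+61` and `0061` prefixes back to a leading `0`) are a simplified assumption for Australian numbers, and the sample CDRs are invented.

```python
import re
from collections import defaultdict


def clean_number(dialled, country_code="61"):
    """Resolve a dialled number to one national form (simplified rules)."""
    digits = re.sub(r"\D", "", dialled)  # drop '+', spaces, dashes
    if digits.startswith("00" + country_code):
        # international access prefix, e.g. 0061412...
        digits = "0" + digits[2 + len(country_code):]
    elif dialled.strip().startswith("+") and digits.startswith(country_code):
        # plus-prefixed country code, e.g. +61 412...
        digits = "0" + digits[len(country_code):]
    return digits


# Hypothetical outbound CDRs: (customer, number dialled, call type)
cdrs = [
    ("C1", "+61 412 345 678", "voice"),
    ("C1", "0412 345 678", "sms"),
    ("C1", "0061412345678", "voice"),
    ("C2", "0298765432", "voice"),
]

# One entry per customer vs recipient combination, with a total per call type.
summary = defaultdict(lambda: defaultdict(int))
for customer, dialled, call_type in cdrs:
    summary[(customer, clean_number(dialled))][call_type] += 1

for key, totals in summary.items():
    print(key, dict(totals))
```

The three different ways C1 dialled the same mobile collapse into a single customer vs recipient row, which is the whole point of the cleaning step.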

Simple Calling Relationships
Once we have both separate (outbound and inbound calls) summarisations complete, then we can join them together (matching recipient telephone number for outbound calls with originating number for inbound calls) to understand if the calling behaviour is reciprocal.

We could use some business logic to limit the definition of a calling relationship, for example if a customer makes over 5 and receives over 5 calls from the same recipient/originating telephone number. From this point you have a simple framework from which you can rank, transform and manipulate the relationships a specific customer has with recipients. The limiting of call counts can help reduce data, and also ensure one-off calls or uni-directional communication to the local pizza shop doesn’t count…
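
A minimal sketch of that join, assuming the two summaries are keyed by (customer, other party) and hold call counts. The keys, counts and the threshold of 5 are illustrative only.

```python
# Hypothetical summaries: (customer, other party) -> number of calls
outbound = {("C1", "N1"): 12, ("C1", "PIZZA"): 7, ("C2", "N1"): 3}
inbound = {("C1", "N1"): 9, ("C2", "N1"): 8}


def calling_relationships(outbound, inbound, min_calls=5):
    """Keep pairs with more than min_calls in both directions."""
    pairs = {}
    for key in outbound.keys() & inbound.keys():  # inner join on the pair
        out_n, in_n = outbound[key], inbound[key]
        if out_n > min_calls and in_n > min_calls:
            pairs[key] = (out_n, in_n)
    return pairs


relationships = calling_relationships(outbound, inbound)
print(relationships)
```

The pizza shop drops out because it never calls back, and C2's link to N1 drops out because only 3 outbound calls were made.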

Important Legal Stuff…
Okay, a quick little important tangent. At this point I’d like to touch on an important topic which is far too often taboo in data mining, especially in the telecommunications industry. When you’ve got the capability to do some analysis you often need to stop and think what you should do (ethically and legally), as opposed to what you can do (technically). As a telco it is possible to get and use customer data for lots of things, but taking action based upon a specific number dialled is illegal in some countries. For example, suppose a customer calls a competitor’s sales number or speaks to a competitor’s tech support line. It may be illegal to track these events and perform some kind of retention activity. It could be an invasion of privacy. It also crosses into anti-competitive issues because other companies don’t have access to the same data. I've not done this type of activity. Still, I know for a fact that some industry analysts do this.

What I am doing is analysis at this sensitive level, but not reacting to specific telephone numbers. I don’t know (or care) anything about the recipient’s telephone number. I am only interested in how many times it is called, at what time of the day, using voice or SMS calls etc. It’s the nature of the relationship a customer has with a recipient (and their behaviour) that interests me, not necessarily who the recipient is. Understanding and generalising the calling relationships, for example, allows us to build very accurate predictive models that can quantify how likely a customer is to churn based upon recent behaviour of them *and* their closest friends (still sounds ‘Big Brother’ though doesn’t it :)

Formation of Simple Social Networks
So, in my analysis I have summarised outbound calls and inbound calls. Next step is to cross-join both summarisations together so that we list all the customers that also called the same recipient and all the recipients that also called the same customers (and yes, recipients can be customers!). That’s one big query, so you might want to reduce the number of recipients or customers by using some business logic of your choice. This is where restrictions to make the processing complete in a practical timeframe really come into play. A true social network wouldn’t reduce the relationship criteria. Maybe you’d put some logic in place whereby you take the top 10 ranked recipients (who each customer calls the most) for each customer. This would drastically reduce the complexity of the cross-join, but obviously limit the potential social networks you will discover (at this point Jure may be screaming in agony, and if you are I’m really sorry :)

The result of such a join would enable you to know which customers (and how many) communicate with any given recipient (who could be your customer or a customer of a competitor). Likewise, we can identify customers that have larger numbers of other customers or 'competitor customers' that call and rate them highly in their social groups. Such individuals can be given classifications as 'leaders', 'bridges' etc.
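
The shape of that join can be illustrated with a small Python sketch. It is not the production SQL: the call counts are invented, `top_n` stands in for the 'top 10 recipients' style of business logic, and how callers are turned into 'leader' or 'bridge' labels is left out because it is not specified here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical summarised data: customer -> {recipient: call count}
call_counts = {
    "C1": {"R1": 20, "R2": 15, "R3": 1},
    "C2": {"R1": 30, "R4": 5},
    "C3": {"R1": 8, "R2": 9},
}


def top_recipients(counts, top_n):
    """The top_n most-called recipients (ties broken by recipient id)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [recipient for recipient, _ in ranked[:top_n]]


def shared_recipient_links(call_counts, top_n=10):
    """Find which customers call each recipient, and which customers share one."""
    callers = defaultdict(set)  # recipient -> customers who rank it highly
    for customer, counts in call_counts.items():
        for recipient in top_recipients(counts, top_n):
            callers[recipient].add(customer)

    links = defaultdict(set)  # (customer, customer) -> shared recipients
    for recipient, customers in callers.items():
        for a, b in combinations(sorted(customers), 2):
            links[(a, b)].add(recipient)
    return callers, links


callers, links = shared_recipient_links(call_counts, top_n=2)
print({r: sorted(c) for r, c in callers.items()})
print({pair: sorted(r) for pair, r in links.items()})
```

With `top_n=2`, recipient R3 never appears, which shows how the restriction trims the join at the cost of dropping weaker ties.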

It is difficult to avoid going into too much detail, but simply by examining customer churn and attributes such as number of 'competitor customer' friends and any friends that recently churned, we can very accurately predict churn behaviour with a month lead time (even better if we predict just 1 week in advance). In terms of churn, we're talking an increased churn propensity by a factor of at least five simply by having social group affinity with another customer that has already recently churned.

Going forward I will be further analysing these social factors and, time permitting, examine some of the finer customer insights that this type of analysis can highlight.

If anyone is doing similar stuff, I'd love to chat. If you are anywhere near Sydney I'll happily buy the beers!

- Tim

My best learning model yet...

My wife and I have successfully created a new learning model that should perform really well. At first I don't expect to see any good results and there will be plenty of errors, but eventually it'll learn to solve simple problems intuitively.

The downside is that it'll be by far the most time demanding and expensive model ever developed...

Here's a picture of our progress after 12 weeks of development time :)

Monday, September 1, 2008

Run multiple instances of Clementine 12

Something I find very useful...

One good change to Clementine in recent versions is to allow you to double click on a Clementine stream file and have that stream load in an already open Clementine client application. In the past double clicking a stream file always resulted in a new Clementine application being started.

A minor downside to this new feature is that by default you can't open two instances of the Clementine client even if you wanted to. I sometimes need to do this in order to run some analysis on the server box and other smaller analysis locally on my laptop.

By adding an additional command flag to the Clementine start command you can force it to open multiple application windows, each of which could be configured to connect to a different server or run locally;

"C:\Program Files\SPSS\Clementine12.0\bin\clementine.exe" -noshare

-> add the "-noshare" option to the Clementine application start command.

Then simply clicking on any stream file will still open in the first Clementine client application window that started, but you have the option of opening additional Clementine client applications directly if you want.



Tuesday, August 26, 2008

101 reasons not to upgrade to Clementine 12

Why 101 reasons?
-> because that's close to how many bugs have been added :(
Ok, I'm exaggerating…a little.

We recently 'upgraded' (and I use that term similarly to how a Windows XP user 'upgrades' to Vista) from Clementine 10.1 to version 12.0.2.

For what's new in Clementine 12 see;

I agree the new version has some really cool features, but after using it for a few weeks now I have also found that it has obviously been released far too early in the development and testing cycle. UI design is not up to the usual highly polished standard, and there are notably more bugs (granted, usually Clementine has *very few* bugs or problem areas). Clementine is still incredibly stable; the problems I have reported relate to UI design and interface problems. I've yet to find a problem or fault on the data processing side. Some of the problems are minor but have been around for 4+ years, and it's frustrating that they are still not fixed. Kinda leaves customers thinking there's little point in providing product feedback at all...

It's been a few years since I left SPSS, so I'm not privy to what's going on in Development. In my view Clementine is still easily the best data mining software out there, but SPSS have clearly rushed this release out the door.

SPSS have replied to me and appear to be taking my criticism onboard. Fingers crossed that an update in the coming months will resolve the issues I have raised.

A few of the main points I raised;
- can no longer save the stream whilst data execution is occurring. Nasty loss of feature. I consider this one quite serious. Users should *always* be able to save the stream at any time.
- Partial outer joins no longer auto-tick the first table connected to a merge node if the order of the connected tables is different. This is a change in default behaviour (always dangerous), and will affect old streams opened in the new version 12 (so your join condition could be different – beware!)
- new pop-up ‘info’ windows that have no purpose and cannot be turned off. Really bad UI design, and not akin to Clementine’s usual interface.
- Charts always prompt to be deleted. Like the pop-up windows, this is a new behaviour and quite annoying. There is no way to prevent “Are you sure you want to delete this chart” pop-up messages. Didn't they learn from the old version 7.0? (oldies will remember the "Are you sure you want to exit Clementine" message...)
- Quality Node has gone and there is no replacement functionality. Sure, just delete something from the software for no good reason…

Granted, I use Clementine 6 hours a day and am probably going to encounter problems other users wouldn't, but some of these issues are glaringly obvious.

- Tim

Monday, August 18, 2008

Stratified Sampling in SQL

If you use SPSS Clementine as I do, then you are probably familiar with the Balance node. It performs the function of selectively and randomly sampling your data based upon the values of a field or number of fields. Also known as stratified sampling!

If your data is managed by a data warehouse, then Clementine has this cool behaviour of automatically converting functions into SQL, so the data processing can be performed by the database and less data needs to be extracted and duplicated on another file system.

Unfortunately the Balance node isn't one of the functions automatically converted into SQL. In order to perform stratified sampling you have to take a different approach and selectively pick the values of your target column/field and sample them individually.

On KDKEYS.NET I attached one Clementine version 12.0.2 stream (balance node.str) as one example of how to do this. By using a select condition, followed by a random sample, followed by a union (append) it is possible to easily obtain a stratified sample from a huge dataset efficiently.

I have also pasted below an example of the type of simple SQL that gets processed;

WHERE (Drug = 'drugA')
) AS TimTemp1
WHERE (Drug = 'drugX')
) AS TimTemp2
) AS TimTable

- sorry, I couldn't work out how to format the SQL properly in this blog :(
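
For readers without Clementine, the same select, random sample and append pattern can be sketched in plain Python. This is an illustration of the technique rather than a translation of the generated SQL: the `Drug` field and its values come from the fragment above, while the row data, sampling fractions and seed are made up.

```python
import random


def stratified_sample(rows, key, fractions, seed=0):
    """Sample each value of `key` at its own rate, then append the results."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for value, fraction in fractions.items():
        stratum = [r for r in rows if r[key] == value]  # select condition
        k = round(len(stratum) * fraction)
        sample.extend(rng.sample(stratum, k))           # random sample, then append
    return sample


# Hypothetical data: 100 rows of the rare class, 900 of the common one.
rows = [{"id": i, "Drug": "drugA" if i % 10 == 0 else "drugX"} for i in range(1000)]

# Keep every drugA row but only 10% of drugX rows.
balanced = stratified_sample(rows, "Drug", {"drugA": 1.0, "drugX": 0.1})
print(len(balanced))
```

In the warehouse version each stratum is a sub-query with its own WHERE clause and sample size, and the append is a UNION of those sub-queries.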



Wednesday, August 13, 2008

iPhone update (what bill shock?)

Update to my previous post: subsequent monitoring of iPhone users is showing that most are within their data download limits. Although the new 3G iPhones are showing slightly more data download than their 2G counterparts, it is doubtful that mobile customers are going to be receiving unexpected bills with excess data charges.

I've resisted getting an iPhone so far, but my colleagues keep tempting me...

The pros;
- it has a great user interface. The scrolling nature and design of the UI is amazing. The concept of momentum that exists when you scroll through menus and music library is very cool.
- Optimum size. It's not that small, but yes it has a screen you can actually see. It fits in your back pocket.
- some versions have decent sized storage (16GB etc) for music and video.
- most importantly it has little apps such as the StarWars LightSaber. This app uses the momentum / gyro within the iPhone to react as you wave the iPhone around. It sounds just like you have a LightSaber. Being able to turn on a LightSaber during a meeting when a colleague makes a dim-witted comment and chop them into pieces is priceless...
They just took this off the apps library, but they will be replacing it with an official one (hopefully still free)

The cons;
- battery life isn't so good. The screen uses a lot of power.
- battery cannot easily be replaced, can't carry a replacement for emergencies.
- no support for video calls
- no support for picture messaging

For me long battery life is quite important, and although I could send videos via email etc using the iPhone, I'm surprised it only supports the basic forms of mobile communication (especially considering it's a 3G phone).

But the LightSaber app is really cool... :)

Monday, July 21, 2008

re iPhone "excessive data charges"

Working as a data miner for an Australian telco (I'll try not to pick sides :) I know that Optus have designed plans that will easily cover the iPhone data usage requirements for the majority of customers.

I agree with the ACCC that there is definitely a possibility this could happen, but at least one telco (the one I work for) is behaving itself and offering generous plans with free data for the first few months to avoid any possibility of bill shock. I don't know where the ACCC get their info, but I doubt many of our customers will have 'bill shock'.

"ACCC warns about iPhone bill costs, additional charges"
and also;
"ACCC warns 'iPhoners' on bill shock"

Before we launched the iPhone (last weekend) I did some analysis examining early adopters of the 'old' 2G iPhone. A simple graph showing data usage of existing iPhone users gave us a rough idea of what the new 3G iPhone might require (assuming the new 3G iPhone and customers behave the same way...). The data was roughly something like this;

A proportionally large number of early iPhone adopters were using 100 MB or less a month, but small percentages of customers would use far more (actually going up to a few GB).

Based on this early analysis I'm guessing that the data included in the Optus iPhone plans should be sufficient for most new users of the 3G iPhone. We'll see...

Thursday, July 17, 2008

I finally started a blog...

I try to contribute to data mining related forums and blogs, but never got around to writing my own blog...until now!

Since most of my work revolves around data mining (and using Clementine) that will be the focus of my posts, but other topics might creep in.

It'll be difficult to discuss my work freely because of intellectual property concerns, but I'll try to discuss the data mining problems we face and how we solve them. My intention will be to foster peer review and feedback from other data miners, especially anyone also tackling analysis within terabyte sized data warehouses.

Stay tuned, more to follow.