Ok this is a big project that has consumed a lot of my time. It was actually completed a few months ago, but I’ve only recently had the time to present it or mention it in a public blog. I’m writing this free-form whilst some large queries are running in the background. I’ll add more to the thread when I get some more free time. Hopefully it will make interesting reading. I do tend to get excited with my projects like this, so please forgive me if my propaganda rambles on a bit...
The aim of these posts is to reveal some typical data mining problems I encounter. They will superficially describe a social networks project I have recently completed. Hopefully enough to give insight, but not enough to replicate the whole thing exactly :)
I would like to extend my sincere thanks to Jure Leskovec (http://www.cs.cmu.edu/~jure/) and Eric Horvitz at Microsoft for their work on social networks within the IM user base, and also Amit Nanavati (http://domino.research.ibm.com/comm/research_people.nsf/pages/nanavati.index.html) et al. at IBM India Research Labs for their work on social networks in a US telco regarding churn prediction. Both were kind enough to send me their published papers detailing their work in large-scale social network analysis. I’d already completed most of my work, but both of their papers gave me some very informative insights and ideas.
I'd like to emphasise that my work is significantly simpler in terms of the social analysis computation itself. As much as I would like to, I can't afford to investigate whether we have 6.6 degrees of separation or not. Much of the groundbreaking work from these researchers involves continuous processing of data for days. Processing is often performed against binary files or flat files using dedicated systems. My data is stored within a terabyte-scale data warehouse with hundreds of concurrent demands. Constraints in terms of data warehouse load and computing restrictions mean that my analysis must complete within a practical timeframe. In order to 'production-ise' the analysis it must be a series of SQL queries that someone can simply start with the click of a button. I perform data cleaning, summarisation and transformations on several hundred million CDRs (call detail records) and calculate social groups for our customer base in less than 3 hours, entirely in SQL on a Teradata data warehouse. I think that in itself is pretty cool, but consequently I must acknowledge that my social networks are comparatively basic and my analysis does not investigate the attributes of the social networks in as much depth as others have.
Why do this?
Working for an Australian telco, in a market with 100% mobile (cell-phone) saturation, the primary focus is customer retention. From a data mining perspective this usually means we examine call usage and, based upon recent customer behaviour, identify which customers might have problems or want to leave (telcos call this churn, banks often call it attrition). It costs a lot to win a new customer, and far less to do something nice to keep an existing one. The core of my data mining is to use the customer data within an integrated data warehouse to better understand the customer and deliver a service that appears specific to them as an individual. More recently I've tried to focus on communities, using the social fabric surrounding a customer to better adapt to and anticipate customer actions. Hence the need for a social network analysis: a method to identify and investigate the communities that naturally exist within our mobile customer base. This is quite different from the standard analysis that focuses on customers as individuals.
What is it all about?
From a customer-focused point of view, the theory is that the influence of work colleagues, friends, and family is far stronger than any message a company can convey through TV or other media. By identifying influencers and social relationships within our customer base we can more effectively anticipate customer actions such as churn. For example, targeting the leaders of social groups and ensuring they are happy should spread positive word-of-mouth effects virally throughout their social groups (which may even include competitors' customers). Simply being able to monitor and measure the viral nature of communications with customers is valuable in itself.
How do you do it?
So, recently I have been working on a project to develop analysis that identifies social groups, leaders, followers, churn risks and similar attributes within our customer base. It’s difficult to give too many details without risk of divulging intellectual property, so please assume any details or numbers I provide are rough estimates only...
Some Numbers...
- Let's suppose we have 4 million mobile customers.
- Suppose average outbound is approx 10 calls per day.
- Suppose average inbound is approx 10 calls per day.
- So, we have approx 80 million rows of data every day.
- The terminating number dialled can vary to include country codes, area codes etc.
- People communicate using voice, sms, picture messaging, and video.
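As a quick sanity check, the arithmetic behind these rough figures can be written out in a few lines of Python. The inputs are only the illustrative estimates listed above, not real production numbers.

```python
# Rough, illustrative estimates taken from the list above.
customers = 4_000_000
outbound_per_day = 10   # average outbound calls per customer per day
inbound_per_day = 10    # average inbound calls per customer per day

# Each call (in either direction) is one row of call detail data.
rows_per_day = customers * (outbound_per_day + inbound_per_day)
rows_per_week = rows_per_day * 7

print(f"{rows_per_day:,} rows per day")    # 80,000,000 rows per day
print(f"{rows_per_week:,} rows per week")  # 560,000,000 rows per week
```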
Early Data Manipulation Issues
Already you can see a few problems to deal with:
A) A lot of data! One week of data alone is over 500 million rows.
B) The same terminating number can be dialled multiple ways (with or without country codes). In order to identify 'who' a customer communicates with we need to 'clean' the number dialled by resolving country codes, area codes etc., so that the same number is resolved irrespective of whether country prefixes are used or not. Yes, we have to perform SQL string cleaning functions on all the data in order to resolve all dialled phone numbers. I did this using a conceptually simple but long-winded SQL CASE statement. It doesn’t actually take long in our data warehouse; we’re talking several minutes, not hours.
C) Different forms of communication (voice, sms, picture messaging, video).
Once the dialled numbers have been cleaned, summarisation by customer number and dialled recipient can be performed. In our case this summarisation involves calculating totals for calls of different forms of communication. The summarised data is one row per customer vs recipient combination. Numerous columns contain sums regarding different calls.
D) Calls can be outbound or inbound. Each is distinguished and processed separately at first. String cleaning is also performed to resolve the originating telephone numbers. Outbound calls started by our customers are summarised as above, so too are inbound calls received by our customers from any source.
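To make the cleaning and summarisation steps concrete, here is a minimal sketch in Python rather than the SQL used in the project. The prefix rules (country code 61, international access code 0011, a leading 0 as the national trunk prefix) are my assumptions for an Australian numbering plan, and the function names and sample calls are hypothetical. The real CASE statement would cover many more cases.

```python
from collections import defaultdict

COUNTRY_CODE = "61"   # assumption: Australian country code
INTL_PREFIX = "0011"  # assumption: Australian international access code


def clean_number(dialled: str) -> str:
    """Resolve a dialled string to one canonical form: country code + national number."""
    digits = "".join(ch for ch in dialled if ch.isdigit())
    if dialled.strip().startswith("+"):
        return digits                       # +61 412... -> 61412...
    if digits.startswith(INTL_PREFIX):
        return digits[len(INTL_PREFIX):]    # 0011 61 412... -> 61412...
    if digits.startswith("0"):
        return COUNTRY_CODE + digits[1:]    # 0412... -> 61412...
    return digits                           # anything else is left unchanged


def summarise(calls):
    """Return one entry per (customer, recipient) with a call count per communication type."""
    totals = defaultdict(lambda: defaultdict(int))
    for customer, dialled, kind in calls:
        totals[(customer, clean_number(dialled))][kind] += 1
    return {pair: dict(counts) for pair, counts in totals.items()}


# Hypothetical sample: the same recipient dialled three different ways.
calls = [
    ("61400000001", "0412 345 678", "voice"),
    ("61400000001", "+61 412 345 678", "sms"),
    ("61400000001", "0011 61 412 345 678", "voice"),
]
summary = summarise(calls)
print(summary)
```

All three dialled strings collapse to the same recipient, so the summary holds a single customer vs recipient row with separate voice and sms totals. The same treatment would apply to originating numbers on inbound calls.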
Simple Calling Relationships
Once we have both separate (outbound and inbound calls) summarisations complete, then we can join them together (matching recipient telephone number for outbound calls with originating number for inbound calls) to understand if the calling behaviour is reciprocal.
We could use some business logic to limit the definition of a calling relationship, for example if a customer makes over 5 and receives over 5 calls from the same recipient/originating telephone number. From this point you have a simple framework from which you can rank, transform and manipulate the relationships a specific customer has with recipients. The limiting of call counts can help reduce data, and also ensure one-off calls or uni-directional communication to the local pizza shop doesn’t count…
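The join and threshold described above can be sketched as follows, again in Python rather than SQL. The "over 5 each way" rule is the example from the text; the function name, the dictionary layout and the sample names are hypothetical.

```python
MIN_CALLS = 5  # example threshold from the text: over 5 calls in each direction


def reciprocal_relationships(outbound, inbound, min_calls=MIN_CALLS):
    """Join outbound and inbound summaries on (customer, other party).

    Both inputs map (customer, other_party) -> call count. A pair is kept
    only if the customer both made and received more than min_calls calls.
    """
    related = {}
    for pair, made in outbound.items():
        received = inbound.get(pair, 0)
        if made > min_calls and received > min_calls:
            related[pair] = (made, received)
    return related


# Hypothetical summaries for one customer.
outbound = {("alice", "bob"): 12, ("alice", "pizza_shop"): 9, ("alice", "carol"): 3}
inbound = {("alice", "bob"): 8, ("alice", "carol"): 7}

relationships = reciprocal_relationships(outbound, inbound)
print(relationships)  # {('alice', 'bob'): (12, 8)}
```

The pizza shop drops out because it never calls back, and carol drops out because too few calls were made to her.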
Important Legal Stuff…
Okay, a quick little important tangent. At this point I’d like to touch on an important topic which is far too often taboo in data mining, especially in the telecommunications industry. When you’ve got the capability to do some analysis you often need to stop and think what you should do (ethically and legally), as opposed to what you can do (technically). As a telco it is possible to get and use customer data for lots of things, but taking action based upon a specific number dialled is illegal in some countries. For example, suppose a customer calls a competitor’s sales number or speaks to a competitor’s tech support line. It may be illegal to track these events and perform some kind of retention activity. It could be an invasion of privacy. It also crosses into anti-competitive issues because other companies don’t have access to the same data. I've not done this type of activity. Still, I know for a fact that some industry analysts do this.
What I am doing is analysis at this sensitive level, but not reacting to specific telephone numbers. I don’t know (or care) anything about the recipient’s telephone number. I am only interested in how many times it is called, at what time of the day, using voice or SMS calls etc. It’s the nature of the relationship a customer has with a recipient (and their behaviour) that interests me, not necessarily who the recipient is. Understanding and generalising the calling relationships allows us, for example, to build very accurate predictive models that can quantify how likely a customer is to churn based upon recent behaviour of them *and* their closest friends (still sounds ‘Big Brother’ though doesn’t it :)
Formation of Simple Social Networks
So, in my analysis I have summarised outbound calls and inbound calls. The next step is to cross-join both summarisations together so that we list all the customers that also called the same recipient and all the recipients that also called the same customers (and yes, recipients can be customers!). That’s one big query, so you might want to reduce the number of recipients or customers by using some business logic of your choice. This is where restrictions to make the processing complete in a practical timeframe really come into play. A true social network wouldn’t reduce the relationship criteria. Maybe you’d put some logic in place whereby you take the top 10 ranked recipients (who each customer calls the most) for each customer. This would drastically reduce the complexity of the cross-join, but obviously limit the potential social networks you will discover (at this point Jure may be screaming in agony, and if you are I’m really sorry :)
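A minimal sketch of that idea, in Python rather than SQL: first keep only each customer's top-ranked recipients, then pair up customers who share a recipient. The function names and the sample data are hypothetical, and n is set to 2 here only to keep the example small.

```python
from collections import defaultdict
from itertools import combinations


def top_recipients(call_counts, n=10):
    """Keep each customer's n most-called recipients."""
    by_customer = defaultdict(list)
    for (customer, recipient), count in call_counts.items():
        by_customer[customer].append((count, recipient))
    return {
        customer: {recipient for _, recipient in sorted(rows, reverse=True)[:n]}
        for customer, rows in by_customer.items()
    }


def customers_sharing_recipient(top):
    """Map each pair of customers to the set of recipients they have in common."""
    by_recipient = defaultdict(set)
    for customer, recipients in top.items():
        for recipient in recipients:
            by_recipient[recipient].add(customer)
    shared = defaultdict(set)
    for recipient, customers in by_recipient.items():
        for a, b in combinations(sorted(customers), 2):
            shared[(a, b)].add(recipient)
    return dict(shared)


# Hypothetical call counts: (customer, recipient) -> number of calls.
call_counts = {
    ("a", "x"): 9, ("a", "y"): 4, ("a", "z"): 1,
    ("b", "x"): 6, ("b", "z"): 5,
    ("c", "y"): 7,
}
top = top_recipients(call_counts, n=2)
links = customers_sharing_recipient(top)
print(links)
```

With n=2, customer a's weak link to z is discarded before the join, which is exactly the trade-off described: less work, but some potential connections are never seen.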
The result of such a join would enable you to know which customers (and how many) communicate with any given recipient (who could be your customer or a customer of a competitor). Likewise, we can identify customers that have larger numbers of other customers or 'competitor customers' that call and rate them highly in their social groups. Such individuals can be given classifications as 'leaders', 'bridges' etc.
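One simple way to approximate a 'leader' classification is to count how many customers rank a given person among their top recipients. The sketch below assumes that definition; the threshold of 3, the function names and the sample data are hypothetical, and 'bridges' would need a different measure that is not attempted here.

```python
from collections import Counter


def count_admirers(top):
    """For each person, count how many customers rank them among their top recipients."""
    counts = Counter()
    for customer, recipients in top.items():
        for recipient in recipients:
            if recipient != customer:
                counts[recipient] += 1
    return counts


def label_leaders(counts, min_admirers=3):
    """Hypothetical rule: a leader is ranked highly by at least min_admirers customers."""
    return {person for person, n in counts.items() if n >= min_admirers}


# Hypothetical input: customer -> set of their top-ranked recipients.
top = {
    "a": {"d", "b"},
    "b": {"d"},
    "c": {"d", "a"},
    "d": {"a"},
}
counts = count_admirers(top)
leaders = label_leaders(counts)
print(leaders)  # {'d'}
```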
It is difficult to avoid going into too much detail, but simply by examining customer churn and attributes such as number of 'competitor customer' friends and any friends that recently churned, we can very accurately predict churn behaviour with a month lead time (even better if we predict just 1 week in advance). In terms of churn, we're talking an increase in churn propensity by a factor of at least five simply by having social group affinity with another customer that has already recently churned.
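A "factor of N" statement like this is a ratio of two churn rates: the rate among customers who have a recently churned friend, divided by the rate among those who do not. The sketch below shows that calculation on made-up toy data; the field names are hypothetical and the resulting number is not a real result.

```python
def churn_rate(customers):
    """Fraction of the given customers who churned."""
    return sum(c["churned"] for c in customers) / len(customers)


def churn_lift(customers):
    """Churn rate with a churned friend divided by churn rate without one.

    Assumes both groups are non-empty and the unexposed group has some churn.
    """
    exposed = [c for c in customers if c["churned_friends"] > 0]
    unexposed = [c for c in customers if c["churned_friends"] == 0]
    return churn_rate(exposed) / churn_rate(unexposed)


# Made-up toy data: 4 customers with a churned friend, 8 without.
customers = (
    [{"churned": True, "churned_friends": 1},
     {"churned": True, "churned_friends": 2},
     {"churned": False, "churned_friends": 1},
     {"churned": False, "churned_friends": 1}]
    + [{"churned": True, "churned_friends": 0}]
    + [{"churned": False, "churned_friends": 0}] * 7
)
lift = churn_lift(customers)
print(lift)  # 4.0 for this toy data (0.5 / 0.125)
```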
Going forward I will be further analysing these social factors and, time permitting, examine some of the finer customer insights that this type of analysis can highlight.
If anyone is doing similar stuff, I'd love to chat. If you are anywhere near Sydney I'll happily buy the beers!
- Tim