Sunday, September 21, 2008

Social Network Analysis in mobile networks

Ok this is a big project that has consumed a lot of my time. It was actually completed a few months ago, but I’ve only recently had the time to present it or mention it in a public blog. I’m writing this free-form whilst some large queries are running in the background. I’ll add more to the thread when I get some more free time. Hopefully it will make interesting reading. I do tend to get excited with my projects like this, so please forgive me if my propaganda rambles on a bit...

The aim of these posts is to reveal some typical data mining problems I encounter. It will superficially describe a social networks project I have recently completed. Hopefully enough to give insight, but not enough to replicate the whole thing exactly :)

I would like to extend my sincere thanks to Jure Leskovec (http://www.cs.cmu.edu/~jure/)
and Eric Horvitz at Microsoft for their work on social networks within the IM user base, and also Amit Nanavati (http://domino.research.ibm.com/comm/research_people.nsf/pages/nanavati.index.html) et al at IBM India Research Labs for their work on social networks in a US telco regarding churn prediction. Both were kind enough to send me their published papers detailing their work in large scale social network analysis. I’d already completed most of my work, but both of their papers gave me some very informative insights and ideas.

I'd like to emphasise that my work is significantly simpler in terms of the social analysis computation itself. As much as I would like, I can't afford to investigate whether we have 6.6 degrees of separation or not. Much of the ground breaking work from these researchers involves continuous processing of data for days. Processing is often performed against binary files or flat files using dedicated systems. My data is stored within a terabyte scale data warehouse with hundreds of concurrent demands. Constraints in terms of data warehouse load and computing restrictions mean that my analysis must complete within a practical timeframe. In order to 'production-ise' the analysis it must be a series of SQL queries that someone can simply start with the click of a button. I perform data cleaning, summarisation and transformations on several hundred million CDR's (call detail rows) and calculate social groups for our customer base in less than 3 hours, entirely in SQL on a Teradata data warehouse. I think that in itself is pretty cool, but consequently I must acknowledge that my social networks are comparatively basic and my analysis does not investigate the attributes of the social networks as in-depth as others have.

Why do this?
Working for an Australian telco, in a market with 100% mobile (cell-phone) saturation, the primary focus is customer retention. From a data mining perspective this usually means we examine call usage and, based upon recent customer behaviour, we identify which customers might have problems or want to leave (telco's call this churn, banks often call it attrition). It costs a lot of win a new customer, far less to do something nice to keep an existing customer. The core to my data mining is to use the customer data within an integrated data warehouse to better understand the customer and deliver a service that appears specific to them as an individual. More recently I've tried to focus on communities and using the social fabric surrounding a customer to ensure we better adapt and anticipate customer actions. Hence the need for a social network analysis, a method to identify and investigate the communities that naturally exists within our mobile customer base. This is quite different from the standard analysis that focuses on customers as individuals.

What is it all about?
In a customer focused point of view the theory is that the influences of work colleagues, friends, and family are far stronger and influential than any messages a company can convey through TV or the media. By identifying influencers and social relationships within our customer base we can more effectively anticipate customer actions such as churn. For example, targeting the leaders of social groups and ensuring they are happy will spread with viral positive to word-of-mouth affects throughout social groups (which may even include competitor's customers). Being able to even monitor and measure the viral nature of communications with customers is valuable enough.

How do you do it?
So, recently I have been working on a project to develop analysis that identifies social groups, leaders, followers, churn risks and similar attributes within our customer base. It’s difficult to give too many details without risk of divulging intellectual property, so please assume any details or numbers I provide are rough estimates only...

Some Numbers...
- Lets suppose we have 4 million mobile customers.
- Suppose average outbound is approx 10 calls per day.
- Suppose average inbound is approx 10 calls per day.
- So, we have approx 80 million rows of data every day.
- The terminating number dialled can vary to include country codes, area codes etc.
- People communicate using voice, sms, picture messaging, and video.

Early Data Manipulation Issues
Already you can see a few problems to deal with;
A) A lot of data! One week of data alone is over 500 million rows.
B) The same terminating number can be dialled multiple ways (with or without country codes). In order to identify 'who' a customer communicates with we need to 'clean' the number dialled by resolving country codes, area codes etc so that the same number is resolved irrespective of whether country prefixes are used or not. Yes, we have to perform SQL string cleaning functions on all the data in order to resolve all dialled phone numbers. I did this using a conceptually simple but long winded SQL case statement. It doesn’t actually take long in our data warehouse, we’re talking several minutes, not hours.

C) Different forms of communication (voice, sms, picture messaging, video).
Once the dialled numbers have been cleaned, summarisation by customer number and dialled recipient can be performed. In our case this summarisation involves calculating totals for calls of different forms of communication. The summarised data is one row per customer vs recipient combination. Numerous columns contain sums regarding different calls.

D) Calls can be outbound or inbound. Each is distinguished and processed separately at first. String cleaning is also performed to resolve the originating telephone numbers. Outbound calls started by our customers are summarised as above, so too are inbound calls received by our customers from any source.

Simple Calling Relationships
Once we have both separate (outbound and inbound calls) summarisations complete, then we can join them together (matching recipient telephone number for outbound calls with originating number for inbound calls) to understand if the calling behaviour is reciprocal.

We could use some business logic to limit the definition of a calling relationship, for example if a customer makes over 5 and receives over 5 calls from the same recipient/originating telephone number. From this point you have a simple framework from which you can rank, transform and manipulate the relationships a specific customer has with recipients. The limiting of call counts can help reduce data, and also ensure one-off calls or uni-directional communication to the local pizza shop doesn’t count…

Important Legal Stuff…
Okay, a quick little important tangent. At this point I’d like to touch on an important topic which is far too often taboo in data mining, especially in the telecommunications industry. When you’ve got the capability to do some analysis you often need to stop and think what you should do (ethically and legally), as opposed to what you can do (technically). As a telco it is possible to get and use customer data for lots of things, but taking action based upon a specific number dialled is illegal in some countries. For example, suppose a customer calls a competitor’s sales number or speaks to a competitor’s tech support line. It may be illegal to track these events and perform some kind of retention activity. It could be an invasion of privacy. It also crosses into anti-competitive issues because other companies don’t have access to the same data. I've not done this type of activity. Still, I know for a fact that some industry analysts do this.

What I am doing is analysis at this sensitive level, but not reacting to specific telephone numbers. I don’t know (or care) anything about the recipient’s telephone number. I am only interested in how many times it is called, at what time or the day, using voice or SMS calls etc. It’s the nature of the relationship a customer has with a recipient (and their behaviour) that interests me, not necessarily who the recipient is. Understanding and generalising the calling relationships, for example allows us to build very accurate predictive models that can quantify how likely a customer is to churn based upon recent behaviour of them *and* their closest friends (still sounds ‘Big Brother’ though doesn’t it :)

Formation of Simple Social Networks
So, in my analysis I have summarised outbound calls and inbound calls. Next step is to cross-join both summarisations together so that we list all the customers that also called the same recipient and all the recipients that also called the same customers (and yes, recipients can be customers!). That’s one big query, so you might want to reduce the number of recipients or customers by using some business logic of your choice. This is where restrictions to make the processing complete in a practical timeframe really come into play. A true social network wouldn’t reduce the relationship criteria. Maybe you’d put some logic in place whereby you take the top 10 ranked recipients (who each customer calls the most) for each customer. This would drastically reduce the complexity of the cross-join, but obviously limit the potential social networks you will discover (at this point Jure maybe screaming in agony, and if you are I’m really sorry :)

The result of such a join would enable you to know which customers (and how many) communicate with any given recipient (who could be your customer or a customer of a competitor). Likewise, we can identify customers that have larger numbers of other customers or 'competitor customers' that call and rate them highly in their social groups. Such individuals can be given classifications as 'leaders', 'bridges' etc.

It is difficult to avoid going into too much detail, but simply by examining customer churn and attributes such as number of 'competitor customer' friends and any friends that recently churned, we can very accurately predict churn behaviour with a month lead time (even better if we predict just 1 week in advance). In terms of churn, we're talking an increased churn propensity by a factor of five times at least simply by having social group affinity with a another customer that has already recently churned.

Going forward I will be further analysing these social factors and, time permitting, examine some of the finer customer insights that this type of analysis can highlight.

If anyone is doing similar stuff, I'd love to chat. If you anywhere near Sydney I'll happily buy the beers!

- Tim

12 comments:

Anonymous said...

Tim,

This is really interesting stuff you're doing here. It is very impressive how strong the social relationship is in predicting churn, etc. I wonder if these customers are any harder to convince to stay, given their churn is being influenced by their friends and family!

Do you have a feel for how often the social network changes? How often do you envisage you will need to rerun the analysis?

Regarding the computational constraints, the problem of propagating churn or uptake of products through a graph of customers sounds similar to the problem of propagating reputation through a web graph (ie. Google PageRank). I doubt this would be easy with SQL - the difficulty being that it requires several iterations of the algorithm until a particular page's score converges. Google solved this using their MapReduce framework -- I wonder if this paradigm would be of any use for you? Greenplum and AsterData have announced they are extended their databases with MapReduce (not sure about Teradata's plans). You could look into Hadoop though.

Regards,
Shane Butler

Tim Manns said...

Thanks for your feedback Shane.

I'm planning on running the analysis weekly and tracking a rolling history of three months. I'm still finalising how to track changes and sequential effects whereby I can attribute a score based on whether a specific customer leads or follows the actions of other members of the same social group or friends.

As much as I do like the idea of the MapReduce framework, I don't have the influence or budget to do anything about it :) I'm restricted to using the data warehouse (which is very good) and I'm happy performance wise.

Over time we'll see how well we are able to

Do you have a blog or similar?

Cheers

Tim

Tim Manns said...

I've been thinking further about your suggestion to use PageRank.

To be honest I wasn't aware of how it works, but I have been reading up on the topic and think I have got a sound grasp of it. I think I might be able work out a way to do something like this in SQL. I'll try a few things over the next week or so and let you know how it goes.

Thanks!

Anonymous said...

Tim,

Let me know how you go, I will be keen to hear about more of your experiences. I think you will be able to make a suitable approximation using SQL, its just question of how many passes you make over the data.

BTW, have you considered spatial elements to the social network relationships?

Cheers,
Shane

PS. My site is http://sbutler.com/blog but I will be moving to http://www.dataminingdownunder.com within the week.

Tim Manns said...

Shane,

a) by "spatial elements" are you referring to physical location?
-> in answer to that, no. I am not identifing the location of the cell-sites (nearest mobile tower used to ferry the call) or customer address.
My analysis is summarised to a relationship level (and people move).
I *do* record whether the relationship involves domestic or international calls, but not where.

b) if by "spatial elements" you are referring to some kind of weighting based upon the quantity of calls and nature (voice, sms, picture, video etc) then yes, I have all that info.

I was considering using some metric such as customer usage in a damping factor or factor in the PageRank summation, but I'll see how a basic PageRank goes first :)

I'm working on this stuff concurrently with higher priority requests over the next week or so which is slowing my progress :(

Tim

Anonymous said...

Hi Tim,

Yes, I was referring to (a), in that spatial relationships between customers may help to supplement a call-based relationship metric. However, from what you've said it sounds like call patterns are pretty strong anyway.

Cheers,
Shane

Anonymous said...

Tim

I was really interested to see your thoughts on this area.

I am an analyst and am about to write a note on SNA in mobile.

I wondered whether you would have time for a longer chat? (Not specifics about the project - but generally about the area).

I am hoping my email address is displayed below!

Thanks
Charlotte

Anonymous said...

Hi Tim,

This is a very interesting blog and valuable blog I have ever read. I have been involved in data mining and modeling for a year plus, consider new to this area. I have never come across that social relationship can play an important role in the churn model. I believe it's really work based on apriori, because as you said worth of mouth is certainly more important than general advertisement on the TV, magazines and etc. I am looking forward to read more about the data mining.

Regards,
Kelvid Pang (Malaysia)
kelvidpang@gmail.com

John Held said...

Tim
I was fascinated by your original post, and had a few questions.

It would seem that using SNA to create additional indicators for churn would have been very innovative back in Sept 2008; what’s your impression on the state of the art today if you were to look at telco best practices generally?

The article contrasted the typical way of measuring churn—using individual customer attributes—and you discussed some representative results of using SNA analysis in identifying higher likelihood of churn. I wasn’t sure how you approached both together-would you advocate simply adding SNA leader/group/follower/bridge attributes into the mix and performing a single churn model? Or do you replace all your churn modeling with SNA only, or would you separately predict churn based on individual and SNA and somehow merge the two scores? I guess what I am getting at is how would you combine the two approaches?

Last question: since the original post, have you continued to use SNA as part of measurement of likely churn?

A final comment: I appreciated your addition of considerations of ethics. I believe most practitioners behave ethically, but taking the time to point out ethical considerations in a public forum reinforces that. Thanks!

John
P.S. Anything mentioned here represents my own opinions and don’t represent those of my employer.

Tim Manns said...

Hi John,

Its not innovative anymore?

Some replies;
I presented my SNA work (which was wholly in SQL on Teradata) at the Teradata Partners conference, and also at PAW Oct 09 in Washington. Judging from the feedback I got from my peers that also work in Telcom other's aren't doing this in-house. A few do have vendor provided solutions which have nice interfaces and metrics, but many are 'black box' and not integrated with a data warehouse. I guess your queries about applying multiple scores apply more in these cases.

For me, the SNA simply generates more inputs for a Neural Net.

Your query about using SNA inputs as just one piece of the customer view matches my thinking, and is what I have implemented in predictive churn models. For example our models use billing history, age, monthly usage profile, other product holdings, etc and social relationship data derived from SNA as inputs into a simple backprop nueral network.

The addition of the SNA inputs was significant! -> increased the lift metric at the top 1% of postpaid mobile customer base from approx 8 to 14 (churn at approx 0.5% per month), so just under doubling our predictive accuracy for highest churn likiehood. We rerun the SNA analysis and refresh all inputs (ie refresh the monthly customer view) each month. Yes, that is processing several billions rows of data over the weekend...

One of the earlier tips/suggestions I got from Shane Butler was to use Pagerank. I did implement a simple metric (just a single pass with weighted means) for churn, spend, etc so that one customer is also judged on the merits of their friends. This can also help a lot for up-sell (eg. customer has a cool handset, we see same uptake by friends).

Regarding ethics, don't be fooled! I've had vendors unknowingly describe illegal activities as part of their solution which they have sold and implemented elsewhere. In some cases analysing phone usage of anyone that is not a customer is illegal because you have no consent. Exactly how you use SNA that involves analysing everyone your customers talk to is tricky.

Arnd Winter said...

Hi Tim,

very interesting blog.

Nice to see that you are still around doing Data Mining with Clementine!

Greetings from Baden-Baden, Germany
Arnd

Sanket said...

Hi Tim,

I work for IBM in AAO practice (Advanced Analytics and Optimization). I got your reference from John Held.

I am trying to build an in-house SPSS model for churn prediction catering to prepaid mobile phone users. In that regard, I am writing a paper for IBM.

Your blog is really well written and throws lot of light on SNA part (Social Network Analysis). I am exploring some variables that could be of interest for the purpose of churn prediction/prevention.

Thanks and regards,
Sanket