Monday, June 15, 2009
See you at the US Teradata User Conference 2009
Last year, at the Asia Teradata User Group in Beijing, I presented some generic data mining that was being performed at Optus (mostly simple churn analysis and behavioural segmentation). I also had a few meetings with analysts from some Chinese telcos about how relatively simple data analysis can scale up to many millions of customers and billions of rows of data.
This year I'll be presenting at the US Teradata User Conference some of the more advanced analysis that I've recently done, notably surrounding social network analysis in the mobile customer base on large amounts of data (several billions of rows). I'm hoping to be able to quote some actual business outcomes and put up some $ numbers.
The US 2009 Teradata User Group Conference & Expo, October 18–22, 2009, at the Gaylord National Resort.
I'll be presenting on Wednesday 21st October 2009 at Maryland D on the Business Track. Judging from the large number of presentations I'm guessing it will be a much smaller and more personal room than the 1000+ conference hall I was in last year in Beijing :)
http://www.teradata.com/teradata-partners/conference/session_abstracts.cfm?cdate=10%2F21%2F09#7743173
Feel free to say hi and ask lots of questions if you see me there. I might have one free evening for a few beers if anyone wants.
Wednesday, May 20, 2009
Teradata Podcasts on Data Mining And SNA
The podcasts discuss customer insights and the data mining analyses that we perform. We then discussed social networking analysis and how linking customers by social calling groups helps predict customer action (such as churn or acquisition of an iPhone handset). TCRM is a Teradata tool I am not familiar with, but my colleagues do use it for campaign delivery, and it has the capability to perform trigger-based campaigns (such as sending a retention offer to other members of a social group when one member of that group churns).
I'm very fortunate that I am occasionally permitted to present my work. One of my main arguments for doing this is that I get peer review and feedback from other data miners, and an idea whether the analytics we do is 'better than most'.
So, I beg you! Please let me know either way; whether this stuff is good or bad, I need to know (especially if you work in Telco).
Cheers!
Tim
- - - - - - - - -
Enhancing Customer Knowledge and Retention at Optus
http://www.teradata.com/t/podcast.aspx?id=10736
In This Podcast
Optus is an Australian telecommunications carrier that uses analytics to increase customer retention. The data being analyzed comes from call centers, mobile phone call details, census geo-demographic data, and a history of customer behavior. Teradata CRM and the data warehouse environment from Teradata are key to Optus’ success with reliably identifying customers that might churn and offering marketing campaigns that are relevant and timely. Optus saw a 20% reduction in churn.
Social Networking Analysis at Optus
http://www.teradata.com/t/podcasts/social-networking-analysis-optus/
In This Podcast
Tim Manns from Optus discusses how the company uses detailed network data from its Teradata system to look at calling behavior. With 40% of the Australian telecommunications market, the company cross-references each customer with every other customer, groups them together based on who they communicate with, looks at the behavior of the group, and can then predict next steps and target those groups with appropriate products and services.
Monday, May 4, 2009
Telstra found guilty of abuses of telecommunications network data
See a recent news article;
http://www.itwire.com/content/view/24786/1095/
And also;
http://www.australianit.news.com.au/story/0,25197,25414690-15306,00.html
As a Data Miner for a telecommunications provider I frequently use network data in my analysis: how many calls the customer makes, at what time of day, whether they communicate using voice or sms, etc. I examine data pertaining to *customers* only.
Telecommunications companies often provision services wholesale for another company. This 'wholesale recipient' company will pay for the use of the network, but manage all other activities such as marketing, customer account and billing. In these cases, although the telecommunications company is responsible for supplying the network service and ensuring calls are successfully established (and likely stores data about these calls), it doesn't own the call data for that customer (who belongs to the 'wholesale recipient' company). Make sense? Use of the data that pertains to the actions of someone that is not a customer of that telecommunications company must be treated with the utmost caution.
Every data miner must be aware of data privacy laws, and in many countries failure to adhere to these laws attracts heavy financial penalties for the organisation and individuals involved. In Australia some invasion of privacy laws could even potentially involve 2 years jail time.
Recently Telstra, an Australian telecommunications company (and the previous incumbent), was found guilty of serious breaches of data privacy. For the 130 page publicly accessible transcript see;
http://www.austlii.edu.au/cgi-bin/sinodisp/au/cases/cth/FCA/2009/422.html
I'm guessing that the significant legal costs and the years it takes to get a result like this are prohibitive for many telcos, so they let it slide. Optus didn't.
Basically, the bit that caught my eye was on item 108 (yes, I speed read the whole thing...). It is legal jargon and reads;
"Telstra asserted that total traffic travelling across its network belonged to Telstra. Optus submitted that whether it belonged to Telstra is not the question posed by cl 15.1 of the Access Agreement. The question under cl 15.1 is whether Telstra owed an obligation under that clause with respect to traffic information recorded by Telstra of communications by Optus customers on the Telstra network because that information was Confidential Information of Optus. The definition of Confidential Information identifies what is the Confidential Information of Optus. Once a CCR records information in relation to a call made by an Optus customer, that information becomes the Confidential Information of Optus because it falls within the definition of ‘Confidential Information’. "
The first sentence is shocking. In plain English it basically suggests that Telstra treats all network calling data as its own, and freely uses information about calls made by anyone on that network as it sees fit. That includes calls made by customers of wholesale or competitor companies on its network. In the case of wholesale for fixed line (land) networks Telstra will know the address and likely also the name of the customer. In the early days Optus had little choice but to use some of Telstra's fixed line infrastructure, often the last bits of copper wire that reach a household. Information about this usage was passed to executives and board members so that they knew customer size and market share by age, geography etc. It is also highly likely (although difficult to prove) that the Telstra retail arm used the data for marketing activities and actioned direct communications to those customers. Anti-competitive to say the least...
One of the short conclusions of the legal findings was;
"For the foregoing reasons, I find that Telstra has used traffic information of Optus, or Communication Information of Optus for the purposes of the Access Agreement, both in the preparation of market share reports and in distributing those reports among Telstra personnel. I also find that such information is Confidential Information of Optus for the purposes of the Access Agreement, or is otherwise subject to the requirements of confidentiality in cl 15 of the Access Agreement, by force of cl 10 of that agreement. I also find that neither such use of such information nor its disclosure for such purposes is permitted by the Access Agreement."
I guess the information here is probably too much in 'telco land', but hopefully it's clear enough to understand the gravity of this. I've known for a long time that some telcos were doing this type of thing, but I'm shocked it was so brazen.
Knowing the big difference between what we (as Data Miners) are 'able to do' regarding insights and personal information (particularly in mobile telecommunications) and what we 'should do' is very important. Years ago the industry passed the early developmental stage of storing data; in recent years we have learned how to understand the data and convert it into useful insights. I still think that many data miners don't realise how important it is (now more than ever before) that we act responsibly in the use of the personal information we obtain from 'our' data.
Wednesday, April 22, 2009
When graphs, piecharts and all else fails... Dilbert to the rescue!
I'm quite proud of the social network analysis (SNA) that I'd first completed months ago. It is refreshed each month (the data warehouse load is too high to run it daily or weekly as I would like). I've been tracking its performance, and am continually surprised.
The trouble is that my colleagues are having trouble understanding how they can use it to formalise customer communications, so I decided to try a different approach than graphs and piecharts etc.
Instead I thought I might try something humorous, hence Dilbert to the rescue! I have created a dozen or so custom Dilbert slides that each provide some info about a customer insight made available by the SNA and end with a humorous conclusion to that insight. I'll pass these around the department in a series of daily emails.
Here is one example (I had to change the project nickname to "SNA" for this blog);
Monday, April 6, 2009
Clementine is dead, long live PASW Modeller
http://www.spss.com/software/modeling/modeler/
SPSS have gone for new product names, including changing Clementine to PASW. I'm more interested in the new features and bug fixes than buzz words. I'll hopefully be getting the new version shortly and will let you know if Clementine 13 (aka Predictive Analytics Soft Ware Modeller) adds value.
Monday, March 30, 2009
Tips for the KDD challenge :)
For more info see;
http://www.kddcup-orange.com/index.php
I am not able to download the data at work (security / download limits), so I might have to try this at home. I haven't even seen the data yet. I'm hoping it's transactional CDRs and not in some summarised form (which it sounds like it is).
I don't have a lot of free time so I might not get around to submitting an entry, but if I do these are some of the data preparation steps and issues I'd consider;
- handle outliers
If the data is real-world then you can guarantee that some values will be at least a thousand times bigger than anything else. Log might not work, so try trimmed mean or frequency binning as a method to remove outliers.
- missing values
The KDD guide suggests that missing or undetermined values were converted into zero. Consider changing this. Many algorithms will treat zero very differently from a null. You might get better results by treating these zeros as nulls.
- percentage comparisons
If a customer can make a voice or sms call, what's the percentage between them? (eg 30% voice vs 70% sms calls). If only voice calls, then consider splitting by time of day or peak vs offpeak as percentages. The use of percentages helps remove differences of scale between high and low quantity customers. If telephony usage covers a number of days or weeks, then consider a similar metric that shows increased or decreased usage over time.
- social networking analysis
If the data is raw transactional CDRs (call detail records) then give a lot of consideration to performing a basic social networking analysis. Even if all you can manage is to identify a circle of friends for each customer, this may have a big impact upon identification of high churn individuals or up-sell opportunities.
- not all churn is equal
Rank customers by usage and scale the rank to a score from zero (low rank) to 1.0 (high rank). No telco should still be treating every churn as an equal loss. It's not! The loss of a highly valuable customer (high rank) is worse than the loss of a low spend customer (low rank). Develop a model to handle this and argue your reasons for why treating all churn the same is a fool's folly. This is difficult if you have no spend information or history of usage over multiple billing cycles.
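A couple of the tips above (percentage comparisons and rank scaling) are easy to sketch in a few lines of Python. This is only an illustration: the customer names, the counts and the helper names `usage_shares` and `rank_scores` are all made up, and I still haven't seen the KDD data, so treat it as a starting point rather than a recipe.

```python
# Hypothetical per-customer call counts; names and numbers are invented.
usage = {
    "cust_a": {"voice": 30, "sms": 70},
    "cust_b": {"voice": 5, "sms": 0},
    "cust_c": {"voice": 400, "sms": 100},
}


def usage_shares(voice, sms):
    """Return (voice_share, sms_share) as fractions of total calls.

    Customers with no usage at all get (None, None) rather than zero,
    so that 'no data' is not confused with 'no sms'.
    """
    total = voice + sms
    if total == 0:
        return None, None
    return voice / total, sms / total


def rank_scores(values):
    """Scale ranks to 0.0 (lowest value) .. 1.0 (highest value).

    Ties are broken by input order in this simple version.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {key: i / (n - 1) for i, key in enumerate(ordered)}


shares = {c: usage_shares(u["voice"], u["sms"]) for c, u in usage.items()}
value_rank = rank_scores({c: u["voice"] + u["sms"] for c, u in usage.items()})

for cust in usage:
    print(cust, shares[cust], value_rank[cust])
```

The shares take scale out of the picture (a 500-call customer and a 100-call customer can have the same voice/sms mix), while the rank score puts it back in a controlled way, for example as a weight on how much each churner matters.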
Hope this helps
Good luck everyone!
Friday, March 27, 2009
Presenting at conference Uniscon 2009
I've been asked to present at Uniscon 2009. One of the professors involved at the University of Western Sydney is a relative of an analyst I work with and requested that I present. I usually find academic conferences are snooze city, but they promised me free beer and I live in Sydney anyway, so I can get home to see the baby before the night's end. I hope I'm just one of many industry persons there and it proves to be an insightful event.
I'm not presenting work. I will be presenting from a personal perspective as an industry data miner (I've not enough time to prepare my presentation and get legal approval from work) and I'll be discussing generic topics instead of describing recent data mining projects and quoting numbers or factual business outcomes.
I suspect a large part of my attendance is to drive some enthusiasm and make the students interested in data mining and aware of what challenges you face in data mining roles.
If you are attending then feel free to say 'hi'.
For info on the conference see;
http://openresearch.org/wiki/UNISCON_2009
wider website http://www.uniscon2009.org/
Below was the presentation title and abstract I threw together (now just have to write it...). There is a social networking analysis (SNA) element to it (because that's what I'm focused on at the moment).
Presentation title:
Know your customers. Know who they know they know, and who they don't.
Presentation Abstract:
Tim's presentation will describe some of the types of marketing analysis a typical telecommunications company might do, including social network analysis (SNA, which is a hot topic right now). He also elaborates on the technical and practical side of data mining, and what business impacts data mining may have.
More importantly the presentation will help answer questions such as;
- What skills are required for Data Mining?
- What problems are commonly faced during Data Mining projects?
- And just what is this Data Mining stuff all about anyway?
- Tim
Thursday, March 26, 2009
Closing days of the Data Mining survey
Survey Link: www.RexerAnalytics.com/Data-Miner-Survey-Intro2.html
Access Code: TM42P
If you frequently conduct data analysis on large amounts of data (ie data mining!) then I urge you to participate.
- Tim
Wednesday, March 11, 2009
And then there were Three, not!
1) becoming a daddy
-> lots of fun!
2) recent announcement of a merger between the telcos Vodafone and Hutchison.
-> pain in the arse!
For info see
http://www.ft.com/cms/s/0/1e1af810-f68e-11dd-8a1f-0000779fd2ac.html
http://www.vodafone.com/start/media_relations/news/group_press_releases/2009/hutchison_and_vodafone.html
Australia's population is approximately 20 million, which is pretty small, and there were four players in the mobile service provider market (in probable order of market share); Telstra, Optus, Vodafone, Three.
The announcement that Vodafone and Three are merging reduces this to three players, which reshapes the landscape of Australia to closely match many other countries with mature telecommunications markets. Most such countries have only a few players and, in this current economic climate, it's not surprising that there will be mergers and therefore consolidation of customers into larger groups.
As a result of the merger, the competitors (Telstra & Optus) will have to review their strategies and probably re-examine customer analysis. Lots of work for us Data Miners...
Tuesday, March 3, 2009
How many models is enough?
A significant part of the vendor solution is the ability to manage many, we're talking hundreds, of data mining models (predictive, clustering etc).
In my group we do not have many data mining models, maybe a dozen, that we run on a weekly or monthly basis. Each model is quite comprehensive and will score the entire customer base (or near to it) for a specific outcome (churn, up-sell, cross-sell, acquisition, inactivity, credit risk, etc). We can subsequently select sub-populations from the customer base for targeted communications based upon the score or outcome of any single model or combination of models, or any criteria taken from customer information.
I'm not entirely sure why you would want hundreds of models in a Telco (or similar) space. Any selection criteria applied to specific customers (say, by age, or gender, or state, or spend) before modeling will of course force a biased sample that feeds into the model and affects its inherent nature. Once this type of selective sampling is performed you can't easily track the corresponding model over time *if* the sampled sub-population ever changes (which is likely because people do get older, move house, or change spend etc). For this reason I can't understand why someone would want or have many models. It makes perfect sense in Retail (for example a model for each product, or association rules for product recommendations), but not many models that apply to sub-populations of your customer base.
Am I missing something here? If you are working with a few products or services and a large customer base why would you prefer many models over a few?
Comments please :)
Monday, January 19, 2009
re: "Thoughts on Understanding Neural Networks"
http://www.data-miners.com/blog/2009/01/thoughts-on-understanding-neural.html
I usually get better predictive success using neural nets, but the lack of explain-ability is always a downside. I'm always keen to see ways that might help explain or interpret a Neural Network. A few years ago I tried a simple graphical way to show a Neural Net, but I think Gordon's recent post highlights better options.
My example is written in VB.NET and parses a single hidden layer Neural Network saved as PMML into data grids. Once the neural net neurons and weights are loaded into the data grids I read from the data grids and create a graphic of the Neural Net. Transparency is used to show the strength of the weight, whilst colour (blue or red) is used to show the negative or positive effect. You can view my example at;
http://www.kdkeys.net/forums/thread/6495.aspx
and download the source code, executable and example PMML directly from here;
http://www.kdkeys.net/forums/6495/PostAttachment.aspx
I can't post images as comments on Gordon's blog so below are two snapshots of the simple UI application that displays the Neural Net PMML graphic.
a) graphic
b) PMML loaded into data grids
Cheers
- Tim
Thursday, January 8, 2009
Isn't In-database processing old news yet?
A bit of a Clementine plug, but hear me through...
I'm puzzled by a few recent articles I've read describing in-database processing, the practice of doing very sophisticated data warehouse analysis (let's call it data mining :) on large amounts of data without having to extract the data into an external analytics tool (for example, a tool like SPSS or SAS).
As an example see the current Teradata magazine article;
http://www.teradata.com/tdmo/v08n04/Features/OnTheHorizon.aspx
I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) on these topics during a Teradata conference in Beijing in June 2008 *. I think he's right on the money concerning his top 4 predictions for data warehousing. As a Data Miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I'm not so sure it is a 'future' thing though. In my view it's here now, just maybe not so widespread. My only guess is that it's another plug for the SAS partnership. Although I don't use SAS I do like the thinking and development plan going forward. I simply don't think it's the new concept it's being touted as. I'm sure a data miner or data analyst doesn't need custom plug-ins (and the corresponding expense) to reach in-database data mining nirvana.
In-database processing is nothing new. I've been doing it using SPSS Clementine and Teradata for a few years now. SPSS Clementine has supported this functionality for quite a few years. In real time Clementine converts the stream (a graphical, icon-based proprietary query file) into SQL (structured query language) and submits the SQL queries to the data warehouse. Any computation that cannot be represented as SQL will cause a data extraction and further processing by the Clementine analysis engine itself (commonly the Clementine Server on a dedicated analytical server box does this, keeping the data and temp files on the server file system, not the desktop). In practice I usually avoid heavy statistical functions, so all my data processing is usually performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. When it comes to scoring the created data mining model (such as a neural network or decision tree) Clementine also converts the data mining model into SQL transparently for truly high scale processing on the data warehouse.
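To make the 'model converted into SQL' idea concrete, here is a minimal Python sketch. It is not the SQL that Clementine actually generates (I'm not reproducing that here); the tiny decision tree, the `usage` table, its columns and the `tree_to_sql` helper are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical decision tree. A node is (column, threshold, if_below, otherwise);
# a leaf is just a churn score. The tree and the scores are invented.
TREE = ("calls_last_month", 5, 0.8, ("complaints", 3, 0.1, 0.6))


def tree_to_sql(node):
    """Render the tree as a nested SQL CASE expression."""
    if not isinstance(node, tuple):
        return repr(float(node))
    column, threshold, below, otherwise = node
    return (
        f"CASE WHEN {column} < {threshold} THEN {tree_to_sql(below)} "
        f"ELSE {tree_to_sql(otherwise)} END"
    )


score_sql = (
    f"SELECT customer, {tree_to_sql(TREE)} AS churn_score "
    "FROM usage ORDER BY customer"
)

# SQLite stands in for the warehouse; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage (customer TEXT, calls_last_month INTEGER, complaints INTEGER)"
)
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [("ann", 2, 0), ("bob", 40, 5), ("cat", 40, 1)],
)

# The scoring happens inside the database; only the scores come back.
scores = dict(conn.execute(score_sql).fetchall())
print(score_sql)
print(scores)
```

The point of the sketch is that the scoring logic travels to the data as one query, so the database can plan and parallelise it like any other SQL, and no customer rows need to be extracted to score them.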
The real advantage comes from not just being able to score existing data mining models, but also being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (a couple of years). Not something I do much of (I've done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do) then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and this usually beats tweaking algorithm options.
The data warehouses have actually supported embedded code and adding custom functions that might include a data mining algorithm for quite a while. For info see a recent post by Seth Grimes titled "In-Database Analytics: A Passing Lane for Complex Analysis";
http://www.intelligententerprise.com/info_centers/data_int/showArticle.jhtml?articleID=212500351&cid=RSSfeed_IE_News
Only in the past year or so have full blown embedded data mining algorithms taken off. The question, though, is whether these algorithms will always run fast(er). Custom code can be good or bad! One advantage of the 'algorithms converted into SQL' route is that the data warehouse can quite easily determine and control how to process and prioritise the SQL query and be optimised specifically for it. Custom code and embedded data mining algorithms can also be optimised, but I'm guessing that requires far more effort (and expense!). One worry is also that custom code brings dangers and risks (not to mention the testing and issues for IT and the DB admin). Still, it's necessary for in-database data mining model building capability.
Ok, I'm guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me;
- considering that once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?
My preference is for a tool that makes querying the data warehouse and constructing highly complex analysis easy. My queries could not possibly be prepared by hand since they are often transformed into many thousands of lines of SQL code. I use a clever user interface to make understanding the logic of the analysis possible. The data mining tool I use is primarily a tool that optimises my interaction with the data warehouse.
So considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS :) will stick around for quite a bit longer because the user interface optimises a data miner's ability to query the data warehouse and perform data mining efficiently, but maybe in a few years we start to see what we commonly refer to as data mining 'algorithms' being developed for data warehouses and no longer in data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!
- Tim
* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my data analysis work on churn prediction and product upsell in telecommunications, and chatted to the China mobile analysts afterward. On a more personal note, that is also when my soon-to-be-born son was conceived. Don't worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him 'Beijing' or 'Teradata' but my wife doesn't share my enthusiasm :)
Tuesday, January 6, 2009
book review "Data Preparation For Data Mining"
The only few criticisms I could possibly raise are;
1) that everything has a statistical basis.
- For example one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don’t work so I often use SQL to convert variables into 100 percentile bins (where each bin has the same number of rows (customers) in it). That type of insight isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and useful insights.
2) no mention of SQL or step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.
So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. It's definitely a book I will refer to, and well worth buying.
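For anyone wondering what I mean by binning by count in point 1, here is a rough Python sketch of the idea. I do this in SQL on the warehouse, so this is just an illustration of the logic; the function name `equal_count_bins` and the sample values are made up.

```python
def equal_count_bins(values, n_bins=100):
    """Assign each value a bin number 0..n_bins-1 by rank.

    Each bin ends up with (nearly) the same number of rows, no matter
    how skewed the values are. Tied values can be split across adjacent
    bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Invented, heavily skewed usage values: one customer dwarfs the rest.
minutes = [1, 2, 2, 3, 5, 8, 13, 40, 900, 250000]
bins = equal_count_bins(minutes, n_bins=5)
print(list(zip(minutes, bins)))
```

The extreme value simply lands in the top bin alongside the next largest one, so it no longer dominates the scale the way it would with a raw or log-transformed variable.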
In February 2004 Dorian Pyle made an interesting post about things to avoid when data mining;
"This Way Failure Lies " http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328
- Tim
Monday, December 15, 2008
No Long Tail from iTunes
There have been quite a few posts in recent months about analytics involving the Long Tail.
For an overview see;
http://en.wikipedia.org/wiki/The_Long_Tail
Personally, I definitely fall into a Long Tail demographic regarding my music habits. I buy many relatively uncommon (and some very obscure) funk and jazz albums or compilations. A significant number of these were introduced to me by Amazon's recommendation engine or similar.
I thought I might be missing out on some good music by not being part of the 'iTunes generation'. It came as a huge shock to me when I joined iTunes this weekend and found that *none* of my dozen most recent purchases are even listed in iTunes (all my recent purchases have been from Amazon.com). Granted some of the artists have died and their albums are a few decades old, but other titles were released last year.
Not much chance of the Long Tail when 'stock' is limited. In terms of Amazon I have actually ordered and received something physical; a compact disc that was sitting on a shelf somewhere in the US and travelled transatlantic to me in Sydney.
What excuse do iTunes have? Downloading .mp3's hardly has the same requirements for stock management, inventory and distribution. A long tail in the iTunes business model should be easier to support and yield greater benefits (than Amazon for example) because of the lesser requirement for physical stock management (I'm guessing it would be a simple case of more disk space). A long tail is not likely to exist where there is less choice for consumers!
In my view the whole idea of the 'Long Tail' applies to circumstances where the constraints of physical stock management are removed (as with iTunes). The fact that Amazon excel at this with physical stock is a credit to them. iTunes is simply a mass-market disgrace :)
Guess I'll continue to buy CD's and burn them into my own mp3's and stream music from my home network...
- Tim
btw. See my wish list for examples :)
Sunday, December 7, 2008
Wikipedia entry for SPSS Clementine
Although there is some information on data mining (http://en.wikipedia.org/wiki/Data_mining), I was surprised that there was nothing about the SPSS Clementine data mining tool, so I added a brief entry;
http://en.wikipedia.org/wiki/SPSS_Clementine
Feel free to contribute.
I might add more when I get a block of free time.
- Tim
Sunday, November 30, 2008
Smart Data Collective
They asked me to be a founding member and featured blogger which sounds like a lot of work, but I've been reassured isn't :)
On a related but separate note, I recently participated in a podcast for Teradata concerning the data analytics we do at Optus (legal disclaimer: I do not represent Optus in any way in this personal blog :). I discussed previously presented material regarding our churn prevention analysis, and also my recent social network analysis. The podcast was only completed recently, so it's still early days and it needs to pass approval from appropriate legal channels, but hopefully it will find its way onto the Teradata site in the New Year. I'll keep you posted.
- Tim
Monday, November 24, 2008
Movember Madness
That's right we're talking about men's love bumps and feeling down in the dumps. To raise awareness of prostate cancer and depression I have grown a truly dodgy moustache.
I have found it a challenge and am looking forward to shaving the damn thing off...
With only a few days to go I will soon be posting 'before' and 'after' pictures. There is still time to donate your hard earned cash if you wish;
https://www.movember.com/au/donate/donate-details.php?action=sponsorlink&rego=1944466&country=au
Cheers
Tim
- Edit: ok, here's the dodgy picture of my moustache. I raised $125 for my hardship :)
Wednesday, November 19, 2008
A simple Data Transformation example...
Granted a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple. I hope that the example below to transform transactional data into some useful customer-centric format will be generic. Feedback and open discussion might broaden my habits.
Strangely many ‘data mining’ books almost completely avoid the topic of data processing and data transformations. Often data mining books that do mention data processing simply refer to feature selection algorithms or applying a log function to rescale numeric data to act as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc), or measures of dispersion (range, variance, standard deviation etc).
There seems to be a glaring big gap! I’m specifically referring to data processing steps that are separate from those mandatory or statistical requirements of the modelling algorithm. In my experience relatively simple steps in data processing can yield significantly better results than tweaking algorithm parameters. Some of these data processing steps are likely to be industry or data specific, but I’m guessing many are widely useful. They don’t necessarily have to be statistical in nature.
So (to put my money where my mouth is) I've started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I’ve attached a small demo data file (I created, and entirely fictitious) and SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum
I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it is what I would find useful. This is just the start of a series of ETL steps you could perform.
I’ve also added a poll for feedback on whether this is helpful, too basic, etc.
- Tim
Example data processing steps
a) Creation of additional dummy columns
Where the data has a single category column that contains one of several values (in this example voice calls, sms calls, data calls etc) we can use a CASE statement to create a new column for each category. We can use 0 or 1 as indicators of whether the category value occurs in any specific row, but you can also use the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value.
For example;
| customer | category | score |
| bill | food | 10 |
| bill | drink | 20 |
| ben | food | 15 |
| bill | drink | 25 |
| ben | drink | 20 |
Can be changed to;
| customer | category | score | food_ind | drink_ind |
| bill | food | 10 | 1 | 0 |
| bill | drink | 20 | 0 | 1 |
| ben | food | 15 | 1 | 0 |
| bill | drink | 25 | 0 | 1 |
| ben | drink | 20 | 0 | 1 |
Or even;
| customer | category | score | food_score | drink_score |
| bill | food | 10 | 10 | 0 |
| bill | drink | 20 | 0 | 20 |
| ben | food | 15 | 15 | 0 |
| bill | drink | 25 | 0 | 25 |
| ben | drink | 20 | 0 | 20 |
b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns.
So we could change the previous step to something like this;
| customer | food_score | drink_score |
| bill | 10 | 45 |
| ben | 15 | 20 |
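Steps a) and b) can be sketched together in one query. This is only an illustration, assuming the example rows above are loaded into a hypothetical table called purchases; it uses Python's built-in sqlite3 module purely so the SQL can be run anywhere, and the same CASE and GROUP BY pattern should carry over to most databases.

```python
import sqlite3

# The example rows from the tables above (fictitious data).
rows = [
    ("bill", "food", 10),
    ("bill", "drink", 20),
    ("ben", "food", 15),
    ("bill", "drink", 25),
    ("ben", "drink", 20),
]

conn = sqlite3.connect(":memory:")
# "purchases" is a made-up table name for this sketch.
conn.execute("CREATE TABLE purchases (customer TEXT, category TEXT, score INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# a) one CASE expression per category value creates the dummy columns,
# b) SUM plus GROUP BY collapses them to one row per customer.
summary_sql = """
SELECT customer,
       SUM(CASE WHEN category = 'food'  THEN score ELSE 0 END) AS food_score,
       SUM(CASE WHEN category = 'drink' THEN score ELSE 0 END) AS drink_score
FROM purchases
GROUP BY customer
ORDER BY customer DESC
"""
summary = conn.execute(summary_sql).fetchall()
print(summary)
```

Replacing score with 1 inside the CASE expressions would give counts per category instead of summed scores.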
Thursday, November 13, 2008
People are Age-ist !
As a small part of further social network analysis of a mobile (cell-phone) customer base I have examined age differences between customers and the people with whom they communicate most frequently.
I was also looking at how reliable it might be to guess someone's age (a customer or non-customer) by extrapolating from known individuals. We have a recorded age for a customer approx 97% of the time, and it's accurate approx 92% of the time (unusually large numbers of people claim to be born on 1st Jan 2000 :)
I was surprised to see (but then maybe I'm naive :) how so many people have 'mobile calling relationships' mainly with people the same age...
The chart below shows the percentage of customers by the average age difference between them and the people they communicate with most frequently. Age differences over 4 years are comparatively rare...
The small spike at 30 years difference is probably parent-to-child communication.
I will be using this to support an estimation of age for prospects and customers where age is unknown, but age of social group members is known.
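One simple way to turn this into an estimate is sketched below. It is an assumption for illustration only, not the method actually used: take the median of the known ages among a person's most frequent contacts. The median is chosen here so that a single parent-to-child relationship does not drag the estimate far from the peer group. The function name estimate_age is made up for this example.

```python
from statistics import median

def estimate_age(contact_ages):
    """Estimate an unknown age from the ages of the most frequent contacts.

    contact_ages: list of ages, with None where a contact's age is unknown.
    Returns the median of the known ages, or None if no ages are known.
    """
    known = [age for age in contact_ages if age is not None]
    return median(known) if known else None

# Three peers, one contact of unknown age, and one probable parent.
estimated = estimate_age([31, 29, None, 33, 62])
print(estimated)
```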
- Tim
Monday, October 20, 2008
Distribution of a prediction score
I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.
- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s).
We often use neural networks to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.
- problem -
We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good fare paying customers). So in the actual outcomes 98% of customers have a value of 0.0 and 2% have a value of 1.0.
When I build my predictive model I try to ensure the model mimics this distribution. My view is that most of the churn prediction scores would be skewed toward the low end, with scores of around 0.1 or 0.2 for say 95% of all scored customers, and scores from 0.3 to 1.0 applying to maybe 5% of the customer base.
Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.
I often examine the distribution as a sanity check before deployment. If the distribution is as expected it is something like this;
If it looks screwy, maybe something like this;
- then there may be a problem with the data processing or the behaviour of customers has sufficiently changed over time to require a model refresh. The subsequent actual model outcome performance is often not as good in this case.
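The sanity check described above can be roughed out in a few lines. This is a sketch under assumptions: the ten equal-width bins, the 0.3 cut-off and the 90% share are illustrative values loosely based on the numbers in this post, and the function names are made up.

```python
def score_histogram(scores, bins=10):
    """Share of scores falling into each equal-width bin over 0.0 to 1.0."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def looks_skewed_low(scores, threshold=0.3, min_share=0.9):
    """True when at least min_share of the scores fall below threshold."""
    low = sum(1 for s in scores if s < threshold)
    return low / len(scores) >= min_share

# Fictitious scores: most customers score low, a few score high.
expected_shape = [0.05] * 90 + [0.15] * 5 + [0.65] * 3 + [0.95] * 2
# Fictitious scores spread evenly across the range.
flat_shape = [i / 100 for i in range(100)]

print(score_histogram(expected_shape))
print(looks_skewed_low(expected_shape), looks_skewed_low(flat_shape))
```

A check like this only flags a change in shape. It would not apply to scores that have deliberately been re-scaled to an even spread, and it says nothing about whether the ranking itself is still any good.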
- question -
What are your views/preferences on this?
What steps, if any, do you take in an attempt to validate the model prior to deployment (let's assume testing, validation and prior months' performance is great) ?
- Tim
Thursday, October 16, 2008
new book "Marketing Calculator"
http://www.marketing-calculator.com/
I met Guy in Oct 2007 whilst presenting a couple of data mining case studies at a Marketing Analytics conference in Singapore. My presentation title was ‘given’ to me by the conference organisers, but it allowed some freedom regarding the content. I discussed how we use a comprehensive data warehouse, and how having access to detailed customer data enriched with demographic data can enable you to get some impressive response rates from campaigns, and sales numbers by up-selling to existing customers, and most importantly retain and grow your customer base.
Guy and I spent some time discussing our work and he asked if he could include it as a case study in his book. I received the book yesterday afternoon and am already through the first 4 chapters (60 pages). It reads easily and is proving to be a very worthwhile book. It is a Marketing book (it doesn’t contain statistical equations or examples of algorithms on data mining) and I would recommend it for anyone involved in marketing. So far it has provided some well structured and clear insights into how you can improve your marketing practices.
- Tim
Sunday, September 21, 2008
Social Network Analysis in mobile networks
Ok this is a big project that has consumed a lot of my time. It was actually completed a few months ago, but I’ve only recently had the time to present it or mention it in a public blog. I’m writing this free-form whilst some large queries are running in the background. I’ll add more to the thread when I get some more free time. Hopefully it will make interesting reading. I do tend to get excited with my projects like this, so please forgive me if my propaganda rambles on a bit...
The aim of these posts is to reveal some typical data mining problems I encounter. It will superficially describe a social networks project I have recently completed. Hopefully enough to give insight, but not enough to replicate the whole thing exactly :)
I would like to extend my sincere thanks to Jure Leskovec (http://www.cs.cmu.edu/~jure/) and Eric Horvitz at Microsoft for their work on social networks within the IM user base, and also Amit Nanavati (http://domino.research.ibm.com/comm/research_people.nsf/pages/nanavati.index.html) et al at IBM India Research Labs for their work on social networks in a US telco regarding churn prediction. Both were kind enough to send me their published papers detailing their work in large scale social network analysis. I’d already completed most of my work, but both of their papers gave me some very informative insights and ideas.
I'd like to emphasise that my work is significantly simpler in terms of the social analysis computation itself. As much as I would like, I can't afford to investigate whether we have 6.6 degrees of separation or not. Much of the ground breaking work from these researchers involves continuous processing of data for days. Processing is often performed against binary files or flat files using dedicated systems. My data is stored within a terabyte scale data warehouse with hundreds of concurrent demands. Constraints in terms of data warehouse load and computing restrictions mean that my analysis must complete within a practical timeframe. In order to 'production-ise' the analysis it must be a series of SQL queries that someone can simply start with the click of a button. I perform data cleaning, summarisation and transformations on several hundred million CDR's (call detail records) and calculate social groups for our customer base in less than 3 hours, entirely in SQL on a Teradata data warehouse. I think that in itself is pretty cool, but consequently I must acknowledge that my social networks are comparatively basic and my analysis does not investigate the attributes of the social networks in as much depth as others have.
Why do this?
Working for an Australian telco, in a market with 100% mobile (cell-phone) saturation, the primary focus is customer retention. From a data mining perspective this usually means we examine call usage and, based upon recent customer behaviour, we identify which customers might have problems or want to leave (telco's call this churn, banks often call it attrition). It costs a lot to win a new customer, far less to do something nice to keep an existing customer. The core to my data mining is to use the customer data within an integrated data warehouse to better understand the customer and deliver a service that appears specific to them as an individual. More recently I've tried to focus on communities and using the social fabric surrounding a customer to ensure we better adapt and anticipate customer actions. Hence the need for a social network analysis, a method to identify and investigate the communities that naturally exist within our mobile customer base. This is quite different from the standard analysis that focuses on customers as individuals.
What is it all about?
From a customer-focused point of view the theory is that the influences of work colleagues, friends, and family are far stronger and more influential than any messages a company can convey through TV or the media. By identifying influencers and social relationships within our customer base we can more effectively anticipate customer actions such as churn. For example, targeting the leaders of social groups and ensuring they are happy will spread positive, viral word-of-mouth effects throughout social groups (which may even include competitor's customers). Being able to even monitor and measure the viral nature of communications with customers is valuable enough.
How do you do it?
So, recently I have been working on a project to develop analysis that identifies social groups, leaders, followers, churn risks and similar attributes within our customer base. It’s difficult to give too many details without risk of divulging intellectual property, so please assume any details or numbers I provide are rough estimates only...
Some Numbers...
- Let's suppose we have 4 million mobile customers.
- Suppose average outbound is approx 10 calls per day.
- Suppose average inbound is approx 10 calls per day.
- So, we have approx 80 million rows of data every day.
- The terminating number dialled can vary to include country codes, area codes etc.
- People communicate using voice, sms, picture messaging, and video.
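The row counts above are simple multiplication, written out here using only the rough estimates from the list (which are not real figures). Note that a call between two customers would show up twice in a count like this, once as an outbound row and once as an inbound row.

```python
# Rough, illustrative estimates taken from the list above.
customers = 4_000_000
outbound_per_day = 10
inbound_per_day = 10

rows_per_day = customers * (outbound_per_day + inbound_per_day)
rows_per_week = rows_per_day * 7

print(f"{rows_per_day:,} rows per day")
print(f"{rows_per_week:,} rows per week")
```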
Early Data Manipulation Issues
Already you can see a few problems to deal with;
A) A lot of data! One week of data alone is over 500 million rows.
B) The same terminating number can be dialled multiple ways (with or without country codes). In order to identify 'who' a customer communicates with we need to 'clean' the number dialled by resolving country codes, area codes etc so that the same number is resolved irrespective of whether country prefixes are used or not. Yes, we have to perform SQL string cleaning functions on all the data in order to resolve all dialled phone numbers. I did this using a conceptually simple but long winded SQL case statement. It doesn’t actually take long in our data warehouse, we’re talking several minutes, not hours.
C) Different forms of communication (voice, sms, picture messaging, video).
Once the dialled numbers have been cleaned, summarisation by customer number and dialled recipient can be performed. In our case this summarisation involves calculating totals for calls of different forms of communication. The summarised data is one row per customer vs recipient combination. Numerous columns contain sums regarding different calls.
D) Calls can be outbound or inbound. Each is distinguished and processed separately at first. String cleaning is also performed to resolve the originating telephone numbers. Outbound calls started by our customers are summarised as above, so too are inbound calls received by our customers from any source.
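To make points B) and C) concrete, here is a small sketch of number cleaning followed by summarisation. It is written in Python rather than the SQL CASE statement described above, purely for readability. The rules are assumptions for illustration: it uses the Australian conventions of country code 61, trunk prefix 0 and international access prefix 0011, and the customer names and phone numbers are fictitious. Real cleaning rules would have many more cases.

```python
from collections import Counter

def clean_number(dialled, country_code="61", intl_prefix="0011", trunk_prefix="0"):
    """Resolve a dialled number to one canonical form (country code + number)."""
    digits = "".join(ch for ch in dialled if ch.isdigit())
    if dialled.strip().startswith("+"):
        return digits                              # already in international form
    if digits.startswith(intl_prefix):
        return digits[len(intl_prefix):]           # strip the international access prefix
    if digits.startswith(trunk_prefix):
        return country_code + digits[len(trunk_prefix):]  # national form
    return digits                                  # anything else is left unchanged

# Fictitious call records: (customer, number as dialled, type of communication).
calls = [
    ("cust_a", "0412 345 678", "voice"),
    ("cust_a", "+61 412 345 678", "sms"),
    ("cust_a", "0011 61 412 345 678", "voice"),
    ("cust_b", "0412 345 678", "voice"),
]

# One count per customer, cleaned recipient and communication type.
summary = Counter((cust, clean_number(num), kind) for cust, num, kind in calls)
for key, count in sorted(summary.items()):
    print(key, count)
```

All three ways of dialling resolve to the same recipient, so cust_a ends up with two voice calls and one SMS to a single number.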
Simple Calling Relationships
Once we have both separate (outbound and inbound calls) summarisations complete, then we can join them together (matching recipient telephone number for outbound calls with originating number for inbound calls) to understand if the calling behaviour is reciprocal.
We could use some business logic to limit the definition of a calling relationship, for example if a customer makes over 5 and receives over 5 calls from the same recipient/originating telephone number. From this point you have a simple framework from which you can rank, transform and manipulate the relationships a specific customer has with recipients. The limiting of call counts can help reduce data, and also ensure one-off calls or uni-directional communication to the local pizza shop doesn’t count…
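The join and the 'over 5 each way' rule can be sketched as follows. This is an illustration in Python rather than SQL, with made-up names and counts; the two dictionaries stand in for the outbound and inbound summary tables, keyed by customer and the other party's cleaned number.

```python
def reciprocal_relationships(outbound, inbound, min_calls=5):
    """Keep pairs where the customer both made and received more than min_calls.

    outbound and inbound map (customer, other_number) to a call count.
    Returns a dict of (customer, other_number) to (calls made, calls received).
    """
    return {
        pair: (out_count, inbound[pair])
        for pair, out_count in outbound.items()
        if out_count > min_calls and inbound.get(pair, 0) > min_calls
    }

# Fictitious summaries.
outbound = {("alice", "bob"): 12, ("alice", "pizza_shop"): 8, ("alice", "carol"): 3}
inbound = {("alice", "bob"): 9, ("alice", "carol"): 20}

relationships = reciprocal_relationships(outbound, inbound)
print(relationships)
```

The pizza shop drops out because it never calls back, and carol drops out because alice makes too few calls to her.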
Important Legal Stuff…
Okay, a quick little important tangent. At this point I’d like to touch on an important topic which is far too often taboo in data mining, especially in the telecommunications industry. When you’ve got the capability to do some analysis you often need to stop and think what you should do (ethically and legally), as opposed to what you can do (technically). As a telco it is possible to get and use customer data for lots of things, but taking action based upon a specific number dialled is illegal in some countries. For example, suppose a customer calls a competitor’s sales number or speaks to a competitor’s tech support line. It may be illegal to track these events and perform some kind of retention activity. It could be an invasion of privacy. It also crosses into anti-competitive issues because other companies don’t have access to the same data. I've not done this type of activity. Still, I know for a fact that some industry analysts do this.
What I am doing is analysis at this sensitive level, but not reacting to specific telephone numbers. I don’t know (or care) anything about the recipient’s telephone number. I am only interested in how many times it is called, at what time of the day, using voice or SMS calls etc. It’s the nature of the relationship a customer has with a recipient (and their behaviour) that interests me, not necessarily who the recipient is. Understanding and generalising the calling relationships, for example, allows us to build very accurate predictive models that can quantify how likely a customer is to churn based upon recent behaviour of them *and* their closest friends (still sounds ‘Big Brother’ though doesn’t it :)
Formation of Simple Social Networks
So, in my analysis I have summarised outbound calls and inbound calls. Next step is to cross-join both summarisations together so that we list all the customers that also called the same recipient and all the recipients that also called the same customers (and yes, recipients can be customers!). That’s one big query, so you might want to reduce the number of recipients or customers by using some business logic of your choice. This is where restrictions to make the processing complete in a practical timeframe really come into play. A true social network wouldn’t reduce the relationship criteria. Maybe you’d put some logic in place whereby you take the top 10 ranked recipients (who each customer calls the most) for each customer. This would drastically reduce the complexity of the cross-join, but obviously limit the potential social networks you will discover (at this point Jure may be screaming in agony, and if you are I’m really sorry :)
The result of such a join would enable you to know which customers (and how many) communicate with any given recipient (who could be your customer or a customer of a competitor). Likewise, we can identify customers that have larger numbers of other customers or 'competitor customers' that call and rate them highly in their social groups. Such individuals can be given classifications as 'leaders', 'bridges' etc.
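A heavily simplified sketch of the 'top ranked recipients' idea is shown below, again in Python rather than SQL and with fictitious data. It keeps each customer's most-called recipients, then inverts the result to see which customers share a recipient. It is not the actual production logic, and the function names are made up.

```python
from collections import defaultdict

def top_recipients(call_counts, top_n=10):
    """Keep each customer's top_n recipients, ranked by call count.

    call_counts maps (customer, recipient) to a call count.
    """
    per_customer = defaultdict(list)
    for (customer, recipient), count in call_counts.items():
        per_customer[customer].append((count, recipient))
    return {
        customer: [recipient for _, recipient in sorted(pairs, reverse=True)[:top_n]]
        for customer, pairs in per_customer.items()
    }

def callers_by_recipient(top):
    """Invert the mapping: recipient to the set of customers who rank them highly."""
    callers = defaultdict(set)
    for customer, recipients in top.items():
        for recipient in recipients:
            callers[recipient].add(customer)
    return dict(callers)

# Fictitious call counts.
call_counts = {
    ("a", "x"): 10, ("a", "y"): 5, ("a", "z"): 1,
    ("b", "x"): 7, ("b", "z"): 6,
    ("c", "x"): 2,
}

top = top_recipients(call_counts, top_n=2)
callers = callers_by_recipient(top)
print(top)
print(callers)
```

Here recipient x is ranked highly by three customers, which is the kind of signal that might feed a 'leader' classification, although the real criteria would need more than a simple count.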
It is difficult to avoid going into too much detail, but simply by examining customer churn and attributes such as number of 'competitor customer' friends and any friends that recently churned, we can very accurately predict churn behaviour with a month's lead time (even better if we predict just 1 week in advance). In terms of churn, we're talking an increased churn propensity by a factor of five times at least simply by having social group affinity with another customer that has already recently churned.
Going forward I will be further analysing these social factors and, time permitting, examine some of the finer customer insights that this type of analysis can highlight.
If anyone is doing similar stuff, I'd love to chat. If you are anywhere near Sydney I'll happily buy the beers!
- Tim
My best learning model yet...
The downside is that it'll be by far the most time demanding and expensive model ever developed...
Here's a picture of our progress after 12 weeks of development time :)
Monday, September 1, 2008
Run multiple instances of Clementine 12
One good change to Clementine in recent versions is to allow you to double click on a Clementine stream file and have that stream load in an already open Clementine client application. In the past double clicking a stream file always resulted in a new Clementine application being started.
A minor downside to this new feature is that by default you can't open two instances of the Clementine client even if you wanted to. I sometimes need to do this in order to run some analysis on the server box and other smaller analysis locally on my laptop.
By adding an additional command flag to the Clementine start command you can force it to open multiple application windows; each one could be configured to connect to a different server or run locally;
"C:\Program Files\SPSS\Clementine12.0\bin\clementine.exe" -noshare
-> add the "-noshare" option to the Clementine application start command.
Then double clicking on any stream file will still open it in the first Clementine client application window that started, but you have the option of opening additional Clementine client applications directly if you want.
Cheers
Tim
Tuesday, August 26, 2008
101 reasons not to upgrade to Clementine 12
-> because that's close to how many bugs have been added :(
Ok, I'm exaggerating…a little.
We recently 'upgraded' (and I use that term similarly to how a Windows XP user 'upgrades' to Vista) from Clementine 10.1 to version 12.0.2.
For what's new in Clementine 12 see;
http://www.spss.com/clementine/whats_new.htm
I agree the new version has some really cool features, but after using it for a few weeks now I have also found that it has obviously been released far too early in the development and testing cycle. UI design is not up to the usual high polished standard, and there are notably more bugs (granted, usually Clementine has *very few* bugs or problem areas). Clementine is still incredibly stable; the problems I have reported relate to UI design and interface problems. I've yet to find a problem or fault on the data processing side. Some of the problems are minor but have been around for 4+ years, and it's frustrating that they are still not fixed. Kinda leaves customers thinking there's little point in providing product feedback at all...
It's been a few years since I left SPSS, so I'm not privy to what's going on in Development. In my view Clementine is still easily the best data mining software out there, but SPSS have clearly rushed this release out the door.
SPSS have replied to me and appear to be taking my criticism onboard. Fingers crossed that an update in the coming months will resolve the issues I have raised.
A few of the main points I raised;
- can no longer save the stream whilst data execution is occurring. Nasty loss of feature. I consider this one quite serious. Users should *always* be able to save the stream at any time.
- Partial outer joins no longer auto-tick the first table connected to a merge node if the order of the connected tables is different. This is a change in default behaviour (always dangerous), and will affect old streams opened in the new version 12 (so your join condition could be different – beware!)
- new pop-up ‘info’ windows that have no purpose and cannot be turned off. Really bad UI design, and not akin to Clementine’s usual interface.
- Charts always prompt to be deleted. Like the pop-up windows, this is a new behaviour and quite annoying. There is no way to prevent “Are you sure you want to delete this chart” pop-up messages. Didn't they learn from the old version 7.0? (oldies will remember the "Are you sure you want to exit Clementine" message...)
- Quality Node has gone and there is no replacement functionality. Sure, just delete something from the software for no good reason…
Granted, I use Clementine 6 hours a day and am probably going to encounter problems other users wouldn't, but some of these issues are glaringly obvious.
- Tim
Monday, August 18, 2008
Stratified Sampling in SQL
If your data is managed by a data warehouse, then Clementine has this cool behaviour of automatically converting functions into SQL, so the data processing can be performed by the database and less data needs to be extracted and duplicated on another file system.
Unfortunately the Balance node isn't one of the functions automatically converted into SQL. In order to perform stratified sampling you have to take a different approach and selectively pick the values of your target column/field and sample them individually.
On KDKEYS.NET I attached one Clementine version 12.0.2 stream (balance node.str) as one example of how to do this. By using a select condition, followed by a random sample, followed by a union (append) it is possible to easily obtain a stratified sample from a huge dataset efficiently.
I have also pasted below an example of the type of simple SQL that gets processed;
SELECT *
FROM (
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugA')
        SAMPLE 0.5
    ) AS TimTemp1
    UNION ALL
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugX')
        SAMPLE 0.2
    ) AS TimTemp2
) AS TimTable
;
- sorry, I couldn't work out how to format the SQL properly in this blog :(
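For anyone without a database to hand, the same select, sample and append pattern can be sketched in plain Python. This is only an illustration of the idea on a small in-memory list, not what Clementine generates; the function name and the test rows are made up, and the fractions mirror the 0.5 and 0.2 used in the SQL above.

```python
import random

def stratified_sample(rows, key, fractions, seed=None):
    """Sample a different fraction of rows for each value of the target field.

    rows: list of dicts, key: the target field name,
    fractions: maps each target value to the fraction of its rows to keep.
    """
    rng = random.Random(seed)
    sample = []
    for value, fraction in fractions.items():
        stratum = [row for row in rows if row[key] == value]  # the select condition
        k = round(len(stratum) * fraction)                    # rows to keep from this stratum
        sample.extend(rng.sample(stratum, k))                 # random sample, then append
    return sample

# Fictitious data: 100 rows of drugA followed by 100 rows of drugX.
rows = [{"id": i, "Drug": "drugA" if i < 100 else "drugX"} for i in range(200)]

sample = stratified_sample(rows, "Drug", {"drugA": 0.5, "drugX": 0.2}, seed=1)
print(len(sample))
```

Values of the target field that are not listed in fractions are simply left out of the sample, just as they would be in the SQL.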
Cheers
Tim
Wednesday, August 13, 2008
iPhone update (what bill shock?)
I've resisted getting an iPhone so far, but my colleagues keep tempting me...
The pros;
- it has a great user interface. The scrolling nature and design of the UI is amazing. The concept of momentum that exists when you scroll through menus and music library is very cool.
- Optimum size. It's not that small, but yes it has a screen you can actually see. It fits in your back pocket.
- some versions have decent sized storage (16GB etc) for music and video.
- most importantly it has little apps such as the StarWars LightSaber. This app uses the momentum / gyro within the iPhone to react as you wave the iPhone around. It sounds just like you have a LightSaber. Being able to turn on a Light Saber during a meeting when a colleague makes a dumb witted comment and chop them into pieces is priceless...
They just took this off of the apps library, but they will be replacing it with an official one (hopefully still free)
see: http://blog.laptopmag.com/best-most-useless-iphone-application-phonesaber
and: http://macenstein.com/default/archives/1559
The cons;
- battery life isn't so good. The screen uses a lot of power.
- battery cannot easily be replaced, can't carry a replacement for emergencies.
- no support for video calls
- no support for picture messaging
For me long battery life is quite important, and although I could send videos via email etc using the iPhone, I'm surprised it only supports the basic forms of mobile communication (especially considering it's a 3G phone).
But the LightSaber app is really cool... :)
Monday, July 21, 2008
re iPhone "excessive data charges"
I agree with the ACCC that there is definitely a possibility this could happen, but at least one telco (the one I work for) is behaving itself and offering generous plans with free data for the first few months to avoid any possibility of bill shock. I don't know where the ACCC get their info, but I doubt many of our customers will have 'bill shock'.
See;
"ACCC warns about iPhone bill costs, additional charges"
and also;
"ACCC warns 'iPhoners' on bill shock"
Before we launched the iPhone (last weekend) I did some analysis examining early adopters of the 'old' 2G iPhone. A simple graph showing data usage of existing iPhone users gave us a rough idea of what the new 3G iPhone might require (assuming the new 3G iPhone and customers behave the same way...). The data was roughly something like this;
A proportionally large number of early iPhone adopters were using 100 MB or less a month, but small percentages of customers would use far more (actually going up to a few GB).
Based off of this early analysis I'm guessing that the data included in the Optus iPhone plans should be sufficient for most new users of the 3G iPhone. We'll see...
Thursday, July 17, 2008
I finally started a blog...
Since most of my work revolves around data mining (and using Clementine) that will be the focus of my posts, but other topics might creep in.
It'll be difficult to discuss my work freely because of intellectual property concerns, but I'll try to discuss the data mining problems we face and how we solve them. My intention will be to foster peer review and feedback from other data miners, especially anyone also tackling analysis within terabyte sized data warehouses.
Stay tuned, more to follow.
Tim