<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6028114151548461320</id><updated>2012-01-28T10:38:05.197-08:00</updated><title type='text'>Blog by Tim Manns (data mining blog)</title><subtitle type='html'>Data Mining Blog, Data Mining, Analysing terabyte data warehouses, SPSS Clementine, SAS Enterprise Miner, Telecommunications</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>42</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6374061418711769600</id><published>2011-11-27T14:18:00.000-08:00</published><updated>2011-11-27T14:18:49.449-08:00</updated><title type='text'>When is 'Big Data' too big for Analytics?</title><content type='html'>&lt;span style="font-family: Verdana, sans-serif;"&gt;&lt;strong&gt;- 'Foreword'&lt;/strong&gt;Apologies for the lack of recent posts.&amp;nbsp; 
I've been *very* busy on many Data Mining Analytics projects in my role as a Data Mining Consultant for SAS.&amp;nbsp; The content of my work is usually sensitive and therefore discussing it in any level of detail in public blog posts is difficult.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;This specific post is to help promote&amp;nbsp;the launch of the new IAPA website and increase focus on Analytics in Australia (and Sydney, where I am normally&amp;nbsp;based).&amp;nbsp; The topic of this post is something that has been at the forefront of my mind and seems to be a central theme of many of the projects I have been working on recently.&amp;nbsp; It is certainly a current problem for many Marketing/Customer Analytics departments.&amp;nbsp; So here are a few thoughts and comments on 'big data'. Apologies for typos, it is mostly written piecemeal on my iPhone during short 5 minute breaks...&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;strong&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;How big is too big (for Analytics)?&lt;/span&gt;&lt;/strong&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;I frequently read Analytics blogs and e-magazines that talk about the 'new' explosion of big data. Although&amp;nbsp;I am unconvinced it is new, or will improve anytime soon, I do agree that despite technology advances in analytics the growth of data generation and storage seems to be outpacing most analysts' ability to transform data into information and utilize it to greater benefit (both operationally and analytically). 
The term 'Analysis Paralysis' has never been so relevant!&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;  &lt;/span&gt;&lt;br /&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;But from a practical perspective what conditions&amp;nbsp;cause data to&amp;nbsp;become unwieldy? For&amp;nbsp;example, take a typical customer services based organisation such as a bank, telco, or public department: how can the data (de)-evolve&amp;nbsp;to a state that&amp;nbsp;makes it 'un-analysable' (what a horrible thought...)? Even given modest (by today's standards) numbers of variables and records, certain practices and conditions can lead to bottlenecks, widespread performance problems, and delays that make any delivery of Analytics very challenging.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;   &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;So, below is a series of my most recent observations from Analytics projects&amp;nbsp;I have been involved with that either set out to resolve, or ran into, 'big data' problems:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;  &lt;/span&gt;&lt;br /&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;strong&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;- Scalable Infrastructure.&lt;/span&gt;&lt;/strong&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;Data will grow. Fast. In fact it will probably more than double in the next few years. 
The CPU capacity of data warehousing and analytics servers needs to improve to match.&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;As an example, I was working on a telco Social Network Analysis project recently where we were processing weekly summaries of mobile telephone calls for approximately 18 million individuals. My role was to analyse the social interactions between all customers and build dozens of propensity scores, using the social influence of others to predict behaviour. In total I was probably processing hundreds of millions of records of data (by a dozen or so variables). This was more than the client typically analysed.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;After a week of design and preliminary work I began to consider ways to optimise the performance of my queries and computations, and I asked about the server specifications. I assumed some big server with dozens of processors, but unfortunately what I was connecting to was a dual core 4GB desktop PC under an Analyst's desk...&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;  &lt;/span&gt;&lt;br /&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;strong&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;- Variable Transformations&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/strong&gt;&lt;/div&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;  A common mistake by inexperienced data miners is to ignore or short-cut comprehensive data preparation steps. All data that involves analysis of people is certain to include unusual characteristics. One person's outlier is another's screw-up :) &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;So, what is the best way to account for outliers, skewed distributions, sparse data, or data features that are highly likely to be erroneous? 
Well, one approach (that I am not keen on) taken by some is to apply several variable transformations indiscriminately to all 'raw' variables and subsequently let a variable selection process pick the best input variables for propensity modeling etc. When combined with data which represents transposed time series (so one variable represents a value in 'month1', the next variable the same value dimension in 'month2', etc) this can easily generate in excess of 20,000 variables (by say 10 million customers...). It is true there are variable selection methods that handle 20,000 variables quite well, but the metadata and processing to create those datasets is often significant and the whole process often incurs excessive costs in terms of time to delivery of results. &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;An additional problem that arises when you start working with many thousands of variables is that variable names need to be easily understood and interpretable. 
The last thing a data miner wants to do is spend hours working out what those transformed and selected important variables in the propensity model actually mean and represent in the raw data.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;Which leads me to my next point..&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;- Variable / Data Understanding&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;One of the core skills of a good data miner is the ability to understand and translate&amp;nbsp;complex data in order to solve business problems.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;As organisations obtain more data it is not just about more records; often the data reveals new subtle operational details and customer behaviors not previously known, or there are completely new sources of data (Facebook, social chat, location based services etc). This in turn often requires extended knowledge of the business and operational systems to enable the correct data warehouse values or variable manipulations and selections to be made.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;An analyst is expected to understand most parts of an organisation's data at a level of detail most individuals in the organisation are not concerned with, and this is often&amp;nbsp;a monumental task.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;As an example of 'big data' bad practice, I've encountered verbose variable names which immediately require truncation (due to&amp;nbsp;IT / variable name limit reasons), others which make understanding the value or meaning of the variable difficult, or naming conventions which are undocumented. 
For example: "number_of_broken_promises" is one of the funniest long variable names I've seen, whilst others such as "ccxs_ytdspd_m1_pct" can be guessed when you have the business context but definitely require detailed documentation or a key.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;- Diverse Skillsets&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;'Big data' often requires big warehouse and analytics systems (see point 1) and so an analyst must have a proper understanding of how these systems work. &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;Through personal experience I'm always aware of table indexes on a Teradata system, for example. By default the first column in a warehouse table will be the index, so if you inadvertently index on a poorly managed or repetitive variable such as 'gender' or 'end_date' then the technology of a big data system works against you. I've seen this type of user error on temp tables or analytics output tables far too many times. 
&amp;nbsp;Big Data often involves bringing information from a greater number of sources, so understanding the source systems and data warehouse involved is an important challenge.&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;I hope this helps.&amp;nbsp; I strongly recommend getting involved with the IAPA and Sydney Data Miner's Meetup if you are based in&amp;nbsp; Australia or Sydney.&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoPlainText" style="margin: 0cm 0cm 0pt;"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;&amp;nbsp;- Tim&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6374061418711769600?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6374061418711769600/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6374061418711769600' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6374061418711769600'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6374061418711769600'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2011/11/when-is-big-data-too-big-for-analytics.html' title='When is &apos;Big Data&apos; too big for Analytics?'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1974185098738063860</id><published>2011-02-13T18:51:00.000-08:00</published><updated>2011-02-13T18:51:44.497-08:00</updated><title type='text'>Just how much do you trust your telco?</title><content type='html'>Although not one of my favourites, I confess to owning the film ‘Minority Report’. The film is set a few decades in the future, and in one section Tom Cruise walks through a shopping centre and is inundated with targeted offers to buy products and services sold in adjacent retail stores. Is this futuristic scenario really that distant? Of course in the film there are sinister eye scanners and creepy robot spiders to identify the individual instead of a near field device such as a smart phone (iPhone or Android), but the principle is the same (and let's leave the issue of stolen identity for another discussion :)&lt;br /&gt;&lt;br /&gt;Many Android handsets now have near field communications (NFC) technology. 
According to some reputable sources (&lt;a href="http://www.bloomberg.com/news/2011-01-25/apple-plans-service-that-lets-iphone-users-pay-with-handsets.html"&gt;http://www.bloomberg.com/news/2011-01-25/apple-plans-service-that-lets-iphone-users-pay-with-handsets.html&lt;/a&gt;) the 5th generation of iPhone will also include near field communications (NFC) technology, which amongst other things can allow users to pay for goods and services just like they currently do with their credit card.&lt;br /&gt;&lt;br /&gt;Many iPhone users already buy songs and applications from iTunes, which has made it a significant global billing platform that provides a notable proportion of revenue for Apple (4.1% of its total quarterly earnings for Q1, see &lt;a href="http://www.fiercemobilecontent.com/story/apples-itunes-revenues-top-11-billion-q1/2011-01-19)"&gt;http://www.fiercemobilecontent.com/story/apples-itunes-revenues-top-11-billion-q1/2011-01-19)&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If iPhone, iPad and iPod users adopt widespread use of NFC for purchase of everyday groceries and general retail goods, then iTunes could quickly carve a huge slice out of the VISA and Mastercard revenue stream.&lt;br /&gt;&lt;br /&gt;The use of smart phone applications for communication (such as Facebook and Twitter) has already taken significant chunks out of telcos’ revenues from traditional voice communication. As smart devices and apps further empower users, telcos face the greater danger of becoming a dumb pipe. In my opinion there is the opportunity for NFC to enable telcos to develop a closer relationship with customers and act as the information conduit (rather than Google or Apple). &lt;br /&gt;&lt;br /&gt;With varying degrees of success, telcos currently perform a lot of data mining to understand usage patterns and household demographics, forecast network demand etc. 
Much of this analysis is marketing focused, with the objective of gaining new customers, retaining existing customers, and/or encouraging them to spend more. Most importantly for data miners these marketing activities usually involve intelligently processing very large amounts of data. There are a lot of parallels with data mining performed by VISA and Mastercard, so you would think that telcos might have the infrastructure and experience to play in the area of credit cards.&lt;br /&gt;&lt;br /&gt;Some telcos are able to provide single billing, whereby the entire household has a single bill for multiple mobile services, wireless broadband, fixed/land telephony, cable TV etc. If a telco already has the rating system to charge for usage of high transaction telephony services, and can also provide a single unified household billing platform, then incorporating the purchase of retail goods and an NFC system should not be a challenge for a telco. In my experience I’ve not seen VISA or many retail banks offer a single bill for your household purchases, across multiple individuals and products. This capability places telcos head and shoulders above banks and credit card companies in the customer experience stakes.&lt;br /&gt;&lt;br /&gt;Most developed countries have 3G or better mobile networks which, when combined with smart phones, can easily pin-point the location of a customer. If telcos used NFC to process and learn each customer’s (or household’s) purchase habits and preferences, then there is no reason why they couldn’t recommend products and offers for shopping centres or stores in your immediate vicinity in real-time. The additional revenue opportunities might even be able to cover the cost of moderate telephony usage, so customers could get a mobile plan subsidised by advertising and purchase revenue. 
For example, the telco would develop the trusted relationship with the customer, and many retailers could pay a commission to target specific customer segments, or individuals in the vicinity that buy similar products. Retailers wouldn’t need to implement their own loyalty cards to identify customers; they could simply get summarised information about who shops at their stores, how often, share of wallet etc from the telco company that manages the relationship with the customer. I would relish the opportunity to analyse *that* kind of data! &lt;br /&gt;&lt;br /&gt;Granted there are a lot of challenges, but the fantasy of Minority Report might not be that unrealistic…&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1974185098738063860?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1974185098738063860/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1974185098738063860' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1974185098738063860'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1974185098738063860'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2011/02/just-how-much-do-you-trust-your-telco.html' title='Just how much do you trust your telco?'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8261899303843831988</id><published>2010-10-28T20:04:00.000-07:00</published><updated>2010-10-28T20:04:55.358-07:00</updated><title type='text'>Not your typical financial risk model…</title><content type='html'>I’ve not done a lot of analysis in the finance industry, and my Google searches didn’t yield helpful insights for similar data mining. I just finished a project and would like some feedback. I’m trying to explain this as a data preparation and analysis approach to solve a specific problem. I’ve described it as best I could without names or actual data. I also produced a lot of presentation material and extra information for the segments that is not described here.&amp;nbsp;If anyone has relevant words of wisdom, or suggestions for a different approach they would have taken, then please describe it! Otherwise, perhaps this will be helpful to others…&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The business problem to solve was generating customer insight (for businesses with loans), with consideration for each client business’s financial health and business loan repayment risk.&lt;br /&gt;&lt;br /&gt;The first thing we concentrated on was tax payments. The data I had access to contained typical finance account monthly summaries (eg. balance at close of month, total $ of transactions etc) but also two years of detailed transactional history of all outgoing and inbound money transfers/payments (eg. including tax payments made by many thousands of businesses). We examined two years of summary data and also the transactions for only those money transfers/payments that involved the account number belonging to the tax man.&lt;br /&gt;&lt;br /&gt;The core idea was to understand each business’s tax payments over time in order to get an accurate view of their financial health. 
Obviously this would have great importance in predicting future loan repayments or the likelihood of future financial problems. The main objective was to understand if tax payment behavior differed significantly between customers, and a secondary consideration was the risk profiles of any subgroups or segments that could be identified.&lt;br /&gt;&lt;br /&gt;It was a quick preliminary investigation (less than two weeks’ work) so I tackled the problem very simplistically to meet deadlines.&lt;br /&gt;&lt;br /&gt;For the majority of client businesses tax payments occur quarterly or monthly, so I first summarized the data to a quarterly aggregation, for example;&lt;br /&gt;&lt;br /&gt;&lt;div align="left" class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://1.bp.blogspot.com/_octsRin8yY0/TMo1MBxzTGI/AAAAAAAAADU/99_njLFfKQs/s1600/blog1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="307" nx="true" src="http://1.bp.blogspot.com/_octsRin8yY0/TMo1MBxzTGI/AAAAAAAAADU/99_njLFfKQs/s400/blog1.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;As you can see above, each customer could have many records (actually it was a maximum of 8, one for each quarter over a two year period), each record showing the account balance at the end of the quarter and the net sum of payments made to (or from!) 
the tax man.&lt;br /&gt;&lt;br /&gt;Then I created two offset copies of Tax Payments, one being the previous record (Lag) and the other being the subsequent record (Lead) like so;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://1.bp.blogspot.com/_octsRin8yY0/TMo1XqTX5tI/AAAAAAAAADY/6MtqWo38N_Y/s1600/blog2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="380" nx="true" src="http://1.bp.blogspot.com/_octsRin8yY0/TMo1XqTX5tI/AAAAAAAAADY/6MtqWo38N_Y/s400/blog2.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;I then simply scaled the data so that everything was between 0 and 1 by using;&lt;br /&gt;&lt;br /&gt;(X - (minimum of X)) / ((maximum of X) - (minimum of X)) &lt;br /&gt;&lt;br /&gt;Here X is one of the variables representing quarterly account balance or tax payments, and the minimum and maximum are calculated within each Customer ID.&lt;br /&gt;&lt;br /&gt;For example the raw data here;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://4.bp.blogspot.com/_octsRin8yY0/TMo1hgktxKI/AAAAAAAAADc/XDkNPagwxMQ/s1600/blog3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="218" nx="true" src="http://4.bp.blogspot.com/_octsRin8yY0/TMo1hgktxKI/AAAAAAAAADc/XDkNPagwxMQ/s400/blog3.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Got rescaled to;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://4.bp.blogspot.com/_octsRin8yY0/TMo1oc1DwLI/AAAAAAAAADg/uiGvhlwmSTw/s1600/blog4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="218" nx="true" src="http://4.bp.blogspot.com/_octsRin8yY0/TMo1oc1DwLI/AAAAAAAAADg/uiGvhlwmSTw/s400/blog4.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;I rescaled all of the raw balance and tax payment variables this way so that I could later 
run a Pearson’s correlation and k-means clustering, and also graph the data easily on the same axis (to directly compare balance and tax payments). Some business customers had very large account balances, but small tax payments.&lt;br /&gt;&lt;br /&gt;For example I could eventually generate a line chart like this showing a specific business’s relationship between balance (dotted line) and tax payments (bold red line);&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/TMo2DY4290I/AAAAAAAAADk/sm-ZyhrK7e8/s1600/blog7.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="195" nx="true" src="http://3.bp.blogspot.com/_octsRin8yY0/TMo2DY4290I/AAAAAAAAADk/sm-ZyhrK7e8/s640/blog7.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;I then ran a simple Pearson’s correlation with the variable ‘Balance’ correlated against the 3 tax payment variables (original, lag, and lead) with a correlation Group By clause on the Customer ID. This would output three correlation scores per customer: one for the original (account balance and tax payments in the same quarter), a second for the correlation between current account balance and the previous quarter’s tax payments, and a third for the current account balance and the following quarter’s tax payments.&lt;br /&gt;&lt;br /&gt;My thought process was to use the highest correlation score (along with balance and tax payment amounts as described below) to build k-means clusters to segment the customer base. Hopefully the segments would reflect, amongst other things, the strongest relationship between account balance and tax payments.&lt;br /&gt;&lt;br /&gt;I joined the correlation outputs to the data and then I flipped/transposed and summarized the data so that each quarter was a new column for balance and tax payments, creating a very wide and summarized data set. 
For example;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/TMo2SM_Ok6I/AAAAAAAAADo/-lSH5m35aow/s1600/blog5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="203" nx="true" src="http://3.bp.blogspot.com/_octsRin8yY0/TMo2SM_Ok6I/AAAAAAAAADo/-lSH5m35aow/s640/blog5.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;…also including the correlation, lag, lead and original value variables in the single record per customer…&lt;br /&gt;&lt;br /&gt;Now I had a dataset with a nice single record per customer, and I concentrated on representing the growth or decline in tax payments over the 2 year period. I did this quite simply by converting the raw payments into percentages (of the sum of each customer’s payments over the two years). In some cases a high proportion of the customer’s payments occurred many months ago, which represents a decline in recent quarters.&lt;br /&gt;&lt;br /&gt;I then built a k-means model using inputs such as;&lt;br /&gt;&lt;br /&gt;- The highest correlation score (of the three per customer) and a categorical encoding of the correlations (eg. ‘negative correlation’ / ‘positive correlation’, ‘lag’ / ‘lead’ etc)&lt;br /&gt;- Data manipulated payment sums&lt;br /&gt;- Variables representing growth or decline in payments over time.&lt;br /&gt;&lt;br /&gt;The segments that were generated have proved to perform very well. 
Many features of the client business that were not used in the segmentation (eg number of accounts per client, and risk propensity) could be distinguished quite clearly by each segment.&lt;br /&gt;&lt;br /&gt;When I examined the incidence of risk (failure or problems repaying a business loan) for a three month period (also with a three month gap) I found some segments had almost double the risk propensity.&lt;br /&gt;&lt;br /&gt;Timeline described below;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://4.bp.blogspot.com/_octsRin8yY0/TMo2iqr8dNI/AAAAAAAAADs/Mio-CAHW-wY/s1600/blog8.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="235" nx="true" src="http://4.bp.blogspot.com/_octsRin8yY0/TMo2iqr8dNI/AAAAAAAAADs/Mio-CAHW-wY/s640/blog8.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;As you can see, there were a very small number of risk outcomes (just 204 in three months) but each of these is very high value, so any lift in risk prediction is beneficial. 
I hate working with such small samples, but sometimes you get given lemons…&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Suppose I built five clusters; here’s an example summary of the type of results I managed to get;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://2.bp.blogspot.com/_octsRin8yY0/TMo2wagoGmI/AAAAAAAAADw/7SvKBPXmOYY/s1600/blog6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="139" nx="true" src="http://2.bp.blogspot.com/_octsRin8yY0/TMo2wagoGmI/AAAAAAAAADw/7SvKBPXmOYY/s640/blog6.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Where ‘Risk Index’ is simply calculated as;&lt;br /&gt;&lt;br /&gt;(‘% Of Total Risk’ - ‘% Of Client Count’) / ‘% Of Client Count’&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So, this is showing that cluster 5 has a 67.91% higher propensity to be a bad risk than the entire base (well, in the analysis…). Conversely cluster 2 is much less (-70%) likely to be a bad risk than the average customer.&lt;br /&gt;&lt;br /&gt;Maybe not your typical financial risk model…&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8261899303843831988?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8261899303843831988/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8261899303843831988' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8261899303843831988'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8261899303843831988'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2010/10/not-your-typical-financial-risk-model.html' title='Not your typical financial risk model…'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_octsRin8yY0/TMo1MBxzTGI/AAAAAAAAADU/99_njLFfKQs/s72-c/blog1.jpg' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-4369302188522507027</id><published>2010-06-16T16:09:00.000-07:00</published><updated>2010-06-16T16:09:37.434-07:00</updated><title type='text'>SNA Presentation in Melbourne (IAPA event)</title><content type='html'>The Institute of Analytics Professionals of Australia (IAPA) requested I give a generic presentation next week on social network analysis and how it can be used for activities such as customer insights, marketing (acquisition, retention, up-sell), fraud detection etc.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;See their current newsletter;&lt;br /&gt;&lt;a href="http://alerts.iapa.org.au/"&gt;http://alerts.iapa.org.au/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;or website;&lt;br /&gt;&lt;a href="http://www.iapa.org.au/"&gt;http://www.iapa.org.au/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'll make the presentation as vendor neutral and informative as possible (but obviously I can't discuss details of any&amp;nbsp;previous confidential work by myself or SAS).&lt;br /&gt;&lt;br /&gt;If you are in Melbourne on Wednesday 23rd June, then feel free to&amp;nbsp;book and attend the presentation.&amp;nbsp; As with all IAPA events it is free and a great opportunity to 'social network' :) with others 
interested in analysis and data mining.&lt;br /&gt;&lt;br /&gt;I hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-4369302188522507027?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/4369302188522507027/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=4369302188522507027' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4369302188522507027'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4369302188522507027'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2010/06/sna-presentation-in-melbourne-iapa.html' title='SNA Presentation in Melbourne (IAPA event)'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1020644672491099666</id><published>2010-04-14T16:40:00.000-07:00</published><updated>2010-04-14T16:41:43.187-07:00</updated><title type='text'>Personal Changes Are Afoot</title><content type='html'>My blog posts and contributions to forums may have to take a back seat for a while. 
There are a multitude of reasons for this, a few being;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;- baby No.2 due in 5 months&lt;br /&gt;&amp;nbsp;- starting a new job&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - which means lots of work finalising and handing over data mining projects at my current employer (Optus)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; - lots of new stuff to read and learn at the new employer (SAS)&lt;br /&gt;&lt;br /&gt;&lt;div&gt;My new job is going to take me away from using Clementine and SPSS software (which will be weird after using it every day for over 10 years..), although I might be working on some data mining projects.&lt;/div&gt;&lt;div&gt;I’ll try to contribute a post if I'm doing really cool data analysis that I can talk about…&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1020644672491099666?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1020644672491099666/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1020644672491099666' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1020644672491099666'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1020644672491099666'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2010/04/personal-changes-are-afoot.html' title='Personal Changes Are Afoot'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-3287920721538778718</id><published>2010-03-11T00:51:00.000-08:00</published><updated>2010-03-11T00:52:25.163-08:00</updated><title type='text'>Breaches of data confidentiality can be costly</title><content type='html'>&lt;span style="font-family: Verdana, sans-serif;"&gt;In a &lt;/span&gt;&lt;a href="http://timmanns.blogspot.com/2009/05/telstra-found-guilty-of-abuses-of.html"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;previous post&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt; last year I mentioned a particularly nasty and blatant breach of confidentiality regarding fixed line telephony data. The update is that Optus recently won a court case in Federal Court to seek damages against Telstra;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.itnews.com.au/News/168876,optus-wins-telstra-confidentiality-breach-ruling.aspx"&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;http://www.itnews.com.au/News/168876,optus-wins-telstra-confidentiality-breach-ruling.aspx&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;This news seemed to slip the major national newspapers, which is quite surprising as it is likely to involve significant amounts of money. To be honest I’m not concerned with the consequences, but as a data miner it does interest me how data &lt;strong&gt;*is*&lt;/strong&gt; used, and how it &lt;strong&gt;*could*&lt;/strong&gt; be used. 
&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Verdana, sans-serif;"&gt;As technology advances I’m certain&amp;nbsp;the general public&amp;nbsp;will see more examples of invasions of personal privacy and breaches of data confidentiality that enable organisations to gain the upper hand (unless or until they are caught).&amp;nbsp; Keep it honest people!&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-3287920721538778718?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/3287920721538778718/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=3287920721538778718' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3287920721538778718'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3287920721538778718'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2010/03/breaches-of-data-confidentiality-can-be.html' title='Breaches of data confidentiality can be costly'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8078585618226120910</id><published>2009-12-16T20:53:00.000-08:00</published><updated>2009-12-16T20:55:16.838-08:00</updated><title type='text'>Meetup for Sydney Data Miners (11th February 
2010)</title><content type='html'>Last month I attended a gathering of Sydney based data miners. &lt;br /&gt;&lt;a href="http://www.meetup.com/datarati/"&gt;http://www.meetup.com/datarati/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are lots of parallels to IAPA (Institute of Analytics Professionals of Australia http://www.iapa.org.au/), but the audience seemed to be more hands-on analysts. Being based at Google it had quite a few web based analysts too.&lt;br /&gt;&lt;br /&gt;The next meet-up is 11th February 2010. I'll be there having a chat and a few beers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8078585618226120910?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.meetup.com/datarati/calendar/12078335/' title='Meetup for Sydney Data Miners (11th February 2010)'/><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8078585618226120910/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8078585618226120910' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8078585618226120910'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8078585618226120910'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/12/meetup-for-sydney-data-miners-11th.html' title='Meetup for Sydney Data Miners (11th February 2010)'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8193973998572022374</id><published>2009-11-24T00:11:00.000-08:00</published><updated>2009-11-24T00:11:38.482-08:00</updated><title type='text'>When sharing isn't a good idea</title><content type='html'>Ensemble models seem to be all the buzz at the moment. The Netflix prize was won by a conglomerate of various models and approaches that each excelled in subsets of the data. &lt;br /&gt;&lt;br /&gt;A number of data miners have presented findings based upon using simple ensembles that use the mean prediction of a number of models. I was surprised that some form of weighting isn’t commonly used, and that a simple mean average of multiple models could yield such an improvement in the global predictive power. It kinda reminds me of the Gestalt theory phrase "The whole is greater than the sum of the parts". It’s got me thinking: when is it best not to share predictive power? What if one model is the best? There are also a ton of considerations regarding scalability and the trade-off between additional processing, added business value, and practicality (don’t mention random forests to me..), but we’ll pretend those don’t exist for the purpose of this discussion :)&lt;br /&gt;&lt;br /&gt;So this has got me thinking: do ensembles work best in situations where there are clearly different sub-populations of customers? For example Netflix is in the retail space, with many customers that rent the same popular blockbuster movies, and a moderate number of customers that rent rarer (or far more diverse, i.e. long tail) movies. I haven’t looked at the Netflix data so I’m guessing that most customers don’t have hundreds of transactions, so generalising the correct behaviour of the masses to specific customers is important. Netflix data on any specific customer could be quite scant (in terms of rentals/transactions). 
In other industries such as telecom, there are parallels; customers can also be differentiated by nature of communication (voice calls, SMS, data consumption etc) just like types of movies. Telecom is mostly about quantity though (customer x used to make a lot of calls etc). More importantly there is a huge amount of data about each customer, often with many hundreds of transactions per customer. There is therefore relatively less reliance upon the supporting behaviour of the masses (although it helps a lot) to understand any specific customer.&lt;br /&gt;&lt;br /&gt;Following this logic, I’m thinking that ensembles are great at reducing the error of incorrectly applying insights derived from the generalised masses to those weirdos that rent obscure sci-fi movies! &amp;nbsp;Combining models that explain sub-populations very well makes sense, but what if you don’t have many sub-populations (or can identify and model their behaviour with one model)?&lt;br /&gt;&lt;br /&gt;But you may shout "hey what about the KDD Cup".&amp;nbsp; Yes, the recent KDD Cup challenge (anonymous featureless telecom data from Orange) was also won by an ensemble of over a thousand models created by IBM Research.&amp;nbsp; I'd like to have had some information about what the hundreds of columns represented, and this might have helped better understand the Orange data and build more insightful and better performing models.&amp;nbsp; Aren't&amp;nbsp;ensemble models used in this way simply a brute force approach to over-learn the data?&amp;nbsp; I'd also really like to know how the performance of the winning entry tracks over the subsequent months for Orange.&lt;br /&gt;&lt;br /&gt;Well, I haven’t had a lot of success in using ensemble models in the telecom data I work with, and I’m hoping it is more a reflection of the data than any ineptitude on my part. 
I’ve tried simply building multiple models on the entire dataset and averaging the scores, but this doesn’t generate much additional improvement (granted on already good models, and I already combine K-means and Neural Nets on the whole base). &amp;nbsp;During my free time I’m just starting to try splitting the entire customer base into dozens of small sub-populations and building a Neural Net model on each, then combining the results and seeing if that yields an improvement. It’ll take a while.&lt;br /&gt;&lt;br /&gt;Thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8193973998572022374?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8193973998572022374/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8193973998572022374' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8193973998572022374'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8193973998572022374'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/11/when-sharing-isnt-good-idea.html' title='When sharing isn&apos;t a good idea'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1604523549452274931</id><published>2009-11-03T16:34:00.000-08:00</published><updated>2009-11-03T16:34:01.414-08:00</updated><title 
type='text'>Predictive Analytics World (PAW) was a great event</title><content type='html'>I found this year’s PAW in Washington a great success. Although I was only able to attend for one day (the day I presented), the handful of varied presentations I did see were very informative and stimulated lots of ideas for my own data mining in the telecommunications industry. PAW is an event clearly run and aimed at industry practitioners. The emphasis of the presentations was lessons learnt, implementation and business outcomes. I strongly recommend attending PAW if you get the chance.&lt;br /&gt;&lt;br /&gt;Other bloggers have reviewed PAW and encapsulate my views perfectly. For example see some of James Taylor’s blog entries &lt;a href="http://jtonedm.com/tag/predictive-analytics-world"&gt;http://jtonedm.com/tag/predictive-analytics-world&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;James also provides a short overview of my presentation at PAW &lt;a href="http://jtonedm.com/2009/10/20/know-your-customers-by-knowing-who-they-know-paw"&gt;http://jtonedm.com/2009/10/20/know-your-customers-by-knowing-who-they-know-paw&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;My presentation at PAW was 35 minutes followed by 10 minutes for questions. I think I over-ran a little because I was very stretched to fit all the content in. For me the problem of data mining is a data manipulation one. I usually spend all my time building a comprehensive customer focused dataset, and usually a simple back-propagation neural network gives great results. I tried to convey that in my presentation, and as James points out I am able to do all my data analysis within a Teradata data warehouse (all my data analysis and model scoring runs as SQL) which isn't common. 
I'm definitely a believer that more data conquers better algorithms, although that doesn't necessarily mean more rows (girth is important too :))&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1604523549452274931?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1604523549452274931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1604523549452274931' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1604523549452274931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1604523549452274931'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/11/predictive-analytics-world-paw-was.html' title='Predictive Analytics World (PAW) was a great event'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-116346338278953789</id><published>2009-11-01T15:26:00.000-08:00</published><updated>2009-11-01T15:26:56.589-08:00</updated><title type='text'>Building Neural Networks on Unbalanced Data (using Clementine)</title><content type='html'>I&amp;nbsp;got a ton of ideas whilst attending&amp;nbsp;the Teradata Partners conference and also Predictive Analytics World.&amp;nbsp; I think my presentations went down well (well, I got good feedback).&amp;nbsp; There were also a few questions and issues that were posed to 
me.&amp;nbsp; One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.&lt;br /&gt;&lt;br /&gt;Rightly so, Dean pointed out that the building of neural nets can actually work perfectly fine against unbalanced data.&amp;nbsp; The problem is that when the Neural Net determines a categorical outcome it must know the incidence (probability)&amp;nbsp;of that outcome.&amp;nbsp; By default Clementine will simply take the output neuron&amp;nbsp;values, and if the value is&amp;nbsp;above 0.5 the prediction will be true, else if the output neuron value is below 0.5&amp;nbsp;the category&amp;nbsp;outcome will be false.&amp;nbsp;&amp;nbsp; This is why in Clementine you need to balance the categorical outcome to roughly 50%/50% when you build the neural net&amp;nbsp;model.&amp;nbsp; In the case of multiple categorical values it is the highest output neuron value which becomes the prediction.&lt;br /&gt;&lt;br /&gt;But there is a simple solution!&lt;br /&gt;&lt;br /&gt;It is&amp;nbsp;something I&amp;nbsp;have always done out of habit because it has&amp;nbsp;proved to generate better models, and I find a decimal score more useful. 
Being a cautious individual (and at the time a bit jet lagged) I wanted to double check first, but simply by converting a categorical outcome into a numeric range you will avoid this problem.&lt;br /&gt;&lt;br /&gt;In situations where you have a binary categorical outcome (say, churn yes/no, or response yes/no etc) then in&amp;nbsp;Clementine you can use a Derive (flag) node to create alternative outcome values.&amp;nbsp; In a Derive (flag) node simply change the true outcome to 1.0 and the false outcome to 0.0.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;By changing the categorical outcome values to&amp;nbsp;a decimal range outcome between 0.0 and 1.0, the Neural Network model will instead&amp;nbsp;expose the output neuron&amp;nbsp;values and the Clementine output score will be a decimal&amp;nbsp;range from 0.0 to 1.0.&amp;nbsp; The distribution of this score should also closely match the probability of the data input into the model during building.&amp;nbsp; In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score sorted / ranked&amp;nbsp;to determine which customers to contact first.&amp;nbsp; I subsequently use the lift metric and the incidence of actual outcomes in sub-populations&amp;nbsp;of predicted&amp;nbsp;high scoring customers.&amp;nbsp; I rarely try to create a categorical 'true' or 'false' outcome, so didn't give it much thought until now.&lt;br /&gt;&lt;br /&gt;If you want to create an incidence matrix that simply shows how many 'true' or 'false' outcomes the model achieves, then instead of using the&amp;nbsp;Neural Net score of&amp;nbsp;0.5 to determine the true or false outcome, you simply use the probability of the outcome used to build the model.&amp;nbsp; For example, if I *build* my neural net using data balanced as 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be approximately 0.04 (10,000 / 260,000).&amp;nbsp; If my neural network score exceeds 0.04 
then I predict true, else if my neural network score is below 0.04 then I predict false.&amp;nbsp; A simple derive node can be used to do this.&lt;br /&gt;&lt;br /&gt;If you have a&amp;nbsp;categorical output with multiple values (say, 5 products, or 7 spend bands etc) then you can use a Set-To-Flag node&amp;nbsp;in a similar way to create many new fields, each with a value of either 0.0 or 1.0.&amp;nbsp; Make *all* new set-to-flag fields outputs and the Neural Network will create a decimal score for each output field.&amp;nbsp; This is essentially exposing the raw output neuron values, which you can then use in many ways similar to above (or use all output scores in a rough 'fuzzy' logic way as I have in the past :).&lt;br /&gt;I posted a small example stream on the kdkeys Clementine forum &lt;a href="http://www.kdkeys.net/forums/70/ShowForum.aspx"&gt;http://www.kdkeys.net/forums/70/ShowForum.aspx&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/thread/9347.aspx"&gt;http://www.kdkeys.net/forums/thread/9347.aspx&lt;/a&gt;&lt;br /&gt;Just change the file suffix from .zip to .str and open the Clementine stream file.&amp;nbsp; Created using version 12.0, but should work in some older versions.&lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/9347/PostAttachment.aspx"&gt;http://www.kdkeys.net/forums/9347/PostAttachment.aspx&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I hope this makes sense.&amp;nbsp; Feel free to post a comment if elaboration is needed!&lt;br /&gt;&lt;br /&gt;&amp;nbsp;- enjoy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-116346338278953789?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/116346338278953789/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=116346338278953789' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/116346338278953789'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/116346338278953789'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/11/building-neural-networks-on-unbalanced.html' title='Building Neural Networks on Unbalanced Data (using Clementine)'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-3627403642249021338</id><published>2009-10-12T19:10:00.000-07:00</published><updated>2009-10-12T19:10:06.695-07:00</updated><title type='text'>See you at PAW (Predictive Analytics World) and Teradata Partners</title><content type='html'>Next week I'll be in Washington DC for Teradata Partners and also Predictive Analytics World. &lt;br /&gt;&lt;br /&gt;I'm presenting how leveraging the social interactions of the Optus mobile/cellphone&amp;nbsp;customer base has enabled unparalleled insights into customers and prospects.&lt;br /&gt;&lt;br /&gt;In my opinion the presenters and topics being discussed are interesting and worth attending.&amp;nbsp; These conferences&amp;nbsp;are the few events where industry analysts congregate and discuss their work.&lt;br /&gt;&lt;br /&gt;I will probably have a few meetings and activities lined up, but I'm always happy to chat over a few beers. 
If you are there feel free to say 'hi'.&amp;nbsp; I'm in Washington for 4 days, then taking a few days holiday with family in New York.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-3627403642249021338?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/3627403642249021338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=3627403642249021338' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3627403642249021338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3627403642249021338'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/10/see-you-at-paw-predictive-analytics.html' title='See you at PAW (Predictive Analytics World) and Teradata Partners'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8047148861904054696</id><published>2009-09-13T20:40:00.000-07:00</published><updated>2009-09-13T21:49:22.493-07:00</updated><title type='text'>I'll show you mine if you show me yours...</title><content type='html'>Analysts don't usually quote predictive model performance. Data Mining within each industry is different, and even within the telecommunications industry definitions of churn are inconsistent. 
This often makes reported outcomes tricky to fully understand.&lt;br /&gt;&lt;br /&gt;I decided to post some churn model outcomes after reading a post by the enigmatic Zyxo on his (or maybe her :)) blog ;&lt;a href="http://zyxo.wordpress.com/2009/08/29/data-mining-for-marketing-campaigns-interpretation-of-lift/"&gt;http://zyxo.wordpress.com/2009/08/29/data-mining-for-marketing-campaigns-interpretation-of-lift/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'd like to know if the models rate well :)&lt;br /&gt;&lt;br /&gt;I'd love to see reports of the performance of any predictive classification models (anything like churn models) you've been working on, but I realise that is unlikely... For like-minded data miners a simple lift chart might suffice.&lt;br /&gt;&lt;br /&gt;The availability of data will greatly influence your ability to identify and predict churn (for the purpose of this post churn is defined as when good fare paying customers voluntarily leave). In this case churn outcome incidence is approx 0.5% per month, where the total population shown in each chart is a few million.&lt;br /&gt;&lt;br /&gt;Below are two pictures of recent churn model Lift charts I built. Both models use the previous three months call summary data and the previous month's social group analysis data to predict a churn event occurring in the subsequent month. Models are validated against real unseen historical data.&lt;br /&gt;&lt;br /&gt;I'm assuming you know what a lift chart is. Basically, it shows the magnitude increase in the proportions of your target outcome (in this case churn) within small sub-groups of your total population. Sub-groups are rank/sorted by propensity. 
For example, in the first chart we obtain 10 times more churn in the top 1% of our customers we suspected of churning using the predictive model.&lt;br /&gt;&lt;br /&gt;The first model is built for a customer base of prepaid (purchase recharge credit prior to use) mobile customers, where the main sources of data are usage and social network analysis.&lt;br /&gt;&lt;br /&gt;The second model is postpaid (usage is subsequently billed to customer) mobile customers, where contract information and billing are additionally available. Obviously contracts commit customers for specified periods of time, so act as very 'predictive' inputs for any model.&lt;br /&gt;&lt;br /&gt;- first churn model lift&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_octsRin8yY0/Sq3GAomkVSI/AAAAAAAAACY/pc2F1ywYLNI/s1600-h/pp+lift.bmp"&gt;&lt;img id="BLOGGER_PHOTO_ID_5381174843979093282" style="WIDTH: 100%; CURSOR: hand; HEIGHT: 100%" alt="" src="http://2.bp.blogspot.com/_octsRin8yY0/Sq3GAomkVSI/AAAAAAAAACY/pc2F1ywYLNI/s400/pp+lift.bmp" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;- second churn model lift&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_octsRin8yY0/Sq3Gqmo8nYI/AAAAAAAAACg/KFhbheZmcfc/s1600-h/pt+lift.bmp"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 100%; height: 100%;" src="http://2.bp.blogspot.com/_octsRin8yY0/Sq3Gqmo8nYI/AAAAAAAAACg/KFhbheZmcfc/s400/pt+lift.bmp" border="0" alt="" id="BLOGGER_PHOTO_ID_5381175565006708098" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Both charts show our model lift in blue and the best possible result in dotted red. For the first model we are obtaining a lift of approximately 6 or 7 for the top 5% population (where the best possible outcome would be 20 (eg. 
(100 / 5) = 20).&lt;br /&gt;&lt;br /&gt;The second model is significantly better, with our model able to obtain a lift of approximately 10 for the top 5% of population (half way to perfection :)&lt;br /&gt;&lt;br /&gt;I mention lift at 5% population because this gives us a reasonable mailing size and catches a large number of subsequent churners.&lt;br /&gt;&lt;br /&gt;Obviously I can't discuss the analysis itself in any depth. I'm just curious what the first impressions are of the lift. I think it's good, but I could be delusional! And just to confirm, it is real and validated against unseen data.&lt;br /&gt;&lt;br /&gt;- enjoy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8047148861904054696?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8047148861904054696/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8047148861904054696' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8047148861904054696'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8047148861904054696'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/09/ill-show-you-mine-if-you-show-me-yours.html' title='I&apos;ll show you mine if you show me yours...'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/_octsRin8yY0/Sq3GAomkVSI/AAAAAAAAACY/pc2F1ywYLNI/s72-c/pp+lift.bmp' height='72' width='72'/><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-3174987207595140821</id><published>2009-07-21T19:55:00.000-07:00</published><updated>2009-07-27T01:20:28.861-07:00</updated><title type='text'>Books on my desk...</title><content type='html'>Over the years I have purchased a few data mining, machine learning, and even statistics books. I'll confess that I haven't read every book page by page, in fact some I've speed-read hoping to catch some interesting highlights.&lt;br /&gt;&lt;br /&gt;Below is a summary and short review of the books that are sitting on my desk at work...&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/Sm1J0oGJInI/AAAAAAAAACQ/nLf0ldrsI-4/s1600-h/books.jpg"&gt;&lt;img id="BLOGGER_PHOTO_ID_5363023899733336690" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 300px" alt="" src="http://3.bp.blogspot.com/_octsRin8yY0/Sm1J0oGJInI/AAAAAAAAACQ/nLf0ldrsI-4/s400/books.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt; - from left to right;&lt;br /&gt;&lt;/p&gt;- Marketing Calculator (author Guy Powell)&lt;br /&gt;I got a free copy because I contributed to some of the industry examples. I'm even quoted in it!  I found the book very useful and would recommend it for any marketing analyst. It talks about ROI and measuring every type of marketing event and customer interaction. Lots of case studies, which I always like. No detail in terms of data analysis itself, but plenty of food for ideas.&lt;br /&gt;&lt;br /&gt;- Advances In Knowledge Discovery And Data Mining (editors Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy)&lt;br /&gt;I first bought this for Rakesh Agrawal's article on Association Rules (Apriori in Clementine), but also found John Elder's Statistical Perspective on Knowledge Discovery very informative. 
It provides a great concise history of data mining.&lt;br /&gt;&lt;br /&gt; - Data Mining Using SAS Applications (author George Fernandez)&lt;br /&gt;I bought this hoping to get a different opinion or learn something new (compared to the SPSS Clementine User Guide I have far too much experience of..).  I thought maybe SAS analysts had a better way to do a specific type of data handling or followed an alternative thought pattern to accomplish a goal.   Sadly I was disappointed.  Like many data mining books it spends hundreds of pages describing algorithms and expert options for refining your model building and less than 10 pages on data transformations and/or data cleaning.  Those 10 pages are well written though.  Not worth the purchase in my view. I only hope SAS analysts have better books out there.&lt;br /&gt;&lt;br /&gt; - Data Mining Techniques (authors Michael Berry and Gordon Linoff)&lt;br /&gt;Being written by practitioners means a lot. The one book I often re-read just in case I missed something the previous time :)  Maybe because it is very applicable to my role as an analyst in a marketing dept in a telecommunications carrier, but I find this book invaluable.  Lots of case studies.  100 pages of background and practical tips before it even reaches 'algorithms' is good in my view, and when you do reach the algorithms they are described very well in practical terms as techniques (rather than a laborious stats class, and I didn't do stats at University).  I find the whole book a joy to read. A must for every data miner.&lt;br /&gt;&lt;br /&gt; - Data Mining. A Tutorial Based Primer (authors Richard Roiger and Michael Geatz)&lt;br /&gt;Whilst going through a phase of keen hobby programming in VB.NET I tried my hand at writing a neural net, decision tree etc from scratch.  I found this book really helpful since it goes through every detail a programmer would need to implement their own data mining code in Excel.  
I work with huge amounts of data, so the thought of doing data mining in Excel makes me giggle (maybe that's a bad thing...) but the principles of data manipulation, cleaning and prediction etc can easily be applied in Excel.  If you really want to understand how algorithms work and build your own, then this book is very useful for that purpose.&lt;br /&gt;&lt;br /&gt; - Data Mining.  Introductory and Advanced Topics&lt;br /&gt;If you did spend several years studying mathematics or statistics then this book would probably act as a great reference and reminder of how algorithms work. &lt;br /&gt;It's very academic and sometimes that's useful.  I think there's one line in there somewhere that mentions data cleaning or data transformations as being an industry thing...  It is also quite a hard, heavy book, so it could be useful to rest stuff on.&lt;br /&gt;&lt;br /&gt; - Data Mining. Practical Machine Learning Tools and Techniques (authors Ian Witten and Eibe Frank)&lt;br /&gt;This is a classic example of bait advertising that some authors should be jailed for.  On page 52 of this book the authors write;&lt;br /&gt;&lt;em&gt;"Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process. Although this book is not really about the problems of data preparation, we want to give you a feeling for the issues involved...."&lt;/em&gt;  Fuck me, it's not a data mining book then is it?  Not only that, they actually use the term "Practical" in the title.  Clearly it is not practical at all if it involves absolutely zero data manipulation.  If I ever meet one of these authors I will slap them in the face and demand my money back...   Oh and over half the book is a damn Weka user guide.&lt;br /&gt;&lt;br /&gt; - The Elements Of Statistical Learning. 
Data Mining, Inference and Prediction (authors Trevor Hastie, Robert Tibshirani, Jerome Friedman)&lt;br /&gt;Very heavy on the stats and squiggly equations (which take me ages to make sense of) but quite well written because I usually manage to understand it.  Explains the algorithms stuff very well. I don't refer to it much and only read a few chapters in depth, but it was worth the purchase.&lt;br /&gt;&lt;br /&gt; - The Science Of Superheroes (authors Lois Gresh and Robert Weinberg).&lt;br /&gt;Not everything is about data mining.  There's a whole world out there, and just maybe it includes superheroes with laser beams shooting out of their eyes.  It's a soft-core science book discussing concepts such as faster-than-light speed, cosmic rays, genetically engineered hulks, flying without wings, and black holes, and how it all relates to real-life superheroes (if they existed).  Really good geeky material.&lt;br /&gt;&lt;br /&gt; - Data Preparation For Data Mining (author Dorian Pyle)&lt;br /&gt;A good book, and like "Data Mining Techniques" it clearly covers topics with a practical understanding (no 'real-world' case studies though).  Where it differs is that this book has a stronger academic or statistics focus.  I didn't get a sense that the examples would always relate to large real-world data sets, and many methods I use were not mentioned at all (for example frequency binning) because they have no statistical basis.  Here's the problem: this is a great data mining book, but only for the statistics in practical data mining.  It is a book I frequently refer to and would recommend, although I'd like to see stuff added that *isn't* based on statistics.&lt;br /&gt;&lt;br /&gt;- The Essence Of Databases (author F. Rolland)&lt;br /&gt;Databases 101 for dummies.  
It covers database schemas and relational concepts, gives tons of SQL examples for queries and data transformations, describes object-oriented databases, etc.&lt;br /&gt;Essential stuff for anyone querying a corporate data warehouse.  It reads easily and is recommended.&lt;br /&gt;&lt;br /&gt;- Data Mining. Concepts, Models, Methods, and Algorithms (author Mehmed Kantardzic)&lt;br /&gt;Another 'list all the algorithms I know' book.  I'll be honest; I only quickly flicked through it hoping to see some case studies or something new.  It seemed good, but didn't seem to have anything to set it apart from any other algorithms book.&lt;br /&gt;&lt;br /&gt; - Statistics Explained. Basic concepts and methods.  (authors R. Kapadia and G. Andersson)&lt;br /&gt;Just in case I forget what a t-test is.  Has lots of pictures :)&lt;br /&gt;&lt;br /&gt; - Clementine User Guides (author: many at SPSS, well if memory serves me Clay Helberg did a fair chunk of them).  When I was at SPSS I had a small part to play in these.  I provided some examples and proofread where possible.  I've been using Clementine daily for over a decade, but still refer to the user guide occasionally.  
I find them useful, but they could benefit from some new examples to take advantage of the many new features that have been added in recent years.&lt;br /&gt;&lt;br /&gt; - Enjoy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-3174987207595140821?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/3174987207595140821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=3174987207595140821' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3174987207595140821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/3174987207595140821'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/07/books-on-my-desk.html' title='Books on my desk...'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_octsRin8yY0/Sm1J0oGJInI/AAAAAAAAACQ/nLf0ldrsI-4/s72-c/books.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-9071746120931075754</id><published>2009-06-15T23:54:00.000-07:00</published><updated>2009-06-16T00:19:09.087-07:00</updated><title type='text'>See you at the US Teradata User Conference 2009</title><content type='html'>Quick post because I'm swamped with work...&lt;br /&gt;&lt;br /&gt;Last year, at the Asia &lt;span class="blsp-spelling-error" 
id="SPELLING_ERROR_0"&gt;Teradata&lt;/span&gt; User Group in Beijing, I presented some generic data mining that was being performed at &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Optus (mostly simple churn analysis and behavioural segmentation)&lt;/span&gt;.  I also had a few meetings with the analysts from some Chinese &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;telcos&lt;/span&gt; about how relatively simple data analysis can scale up to many millions of customers and billions of rows of data.&lt;br /&gt;&lt;br /&gt;This year I'll be presenting at the US &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Teradata&lt;/span&gt; User Conference some of the more advanced analysis that I've recently done, notably surrounding social network analysis in the mobile customer base on large amounts of data (several billions of rows).  I'm hoping to be able to quote some actual business outcomes and put up some $ numbers.&lt;br /&gt;&lt;br /&gt;The US 2009 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Teradata&lt;/span&gt; User Group Conference &amp;amp; Expo runs October 18–22, 2009, at the Gaylord National Resort.&lt;br /&gt;&lt;br /&gt;I'll be presenting on Wednesday 21st October 2009 at Maryland D on the Business Track.  Judging from the large number of &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_5"&gt;presentations&lt;/span&gt; I'm guessing it will be a much smaller and more personal room than the 1000+ conference hall I was in last year in Beijing :)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.teradata.com/teradata-partners/conference/session_abstracts.cfm?cdate=10%2F21%2F09#7743173"&gt;http://www.teradata.com/teradata-partners/conference/session_abstracts.cfm?cdate=10%2F21%2F09#7743173&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Feel free to say hi and ask lots of questions if you see me there.  
I might have one free evening for a few beers if anyone wants.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-9071746120931075754?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/9071746120931075754/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=9071746120931075754' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/9071746120931075754'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/9071746120931075754'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/06/see-you-at-us-teradata-user-conference.html' title='See you at the US Teradata User Conference 2009'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-34769033043158672</id><published>2009-05-20T19:52:00.000-07:00</published><updated>2009-05-20T21:37:57.570-07:00</updated><title type='text'>Teradata Podcasts on Data Mining And SNA</title><content type='html'>I sometimes get asked by vendors to present case studies or examples of work so that they can attract new customers or demonstrate how existing customers can use the software/solution. Below are details of a podcast I did over the phone with Teradata (I was in Sydney, they were in US). There wasn't any preparation, I just kinda 'winged-it'. 
Any numbers I quoted were rough estimates from memory (not official numbers!). And yes, my voice is a bit high-pitched and I do sometimes sound like a 50-year-old lady....apologies :)&lt;br /&gt;&lt;br /&gt;The podcasts discuss customer insights and data mining analyses that are performed.  We then discussed social networking analysis and how linking customers by social calling groups helps predict customer action (such as churn or acquisition of an iPhone handset).   TCRM is a Teradata tool I am not familiar with, but my colleagues do use it for campaign delivery, and it has the capability to perform trigger-based campaigns (such as sending a retention offer to other members of a social group when one member of that group churns). &lt;br /&gt;&lt;br /&gt;I'm very fortunate that I am occasionally permitted to present my work. One of my main arguments for doing this is that I get peer review and feedback from other data miners, and an idea whether the analytics we do is 'better than most'.&lt;br /&gt;&lt;br /&gt;&lt;span style="color:#ff0000;"&gt;&lt;strong&gt;So, I beg you! Please let me know either way; if this stuff is good or bad I need to know (especially if you work in Telco).&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Cheers!&lt;br /&gt;&lt;br /&gt;Tim&lt;br /&gt;&lt;br /&gt;- - - - - - - - -&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Enhancing Customer Knowledge and Retention at Optus&lt;br /&gt;&lt;/strong&gt;&lt;a href="http://www.teradata.com/t/podcast.aspx?id=10736"&gt;http://www.teradata.com/t/podcast.aspx?id=10736&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;In This Podcast&lt;br /&gt;&lt;/strong&gt;Optus is an Australian telecommunications carrier that uses analytics to increase customer retention. The data being analyzed comes from call centers, mobile phone call details, census geo-demographic data, and a history of customer behavior. 
Teradata CRM and the data warehouse environment from Teradata is key to Optus’ success with reliably identifying customers that might churn and offering marketing campaigns that are relevant and timely. Optus saw a 20% reduction in churn.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Social Networking Analysis at Optus&lt;/strong&gt;&lt;br /&gt;&lt;a href="http://www.teradata.com/t/podcasts/social-networking-analysis-optus/"&gt;http://www.teradata.com/t/podcasts/social-networking-analysis-optus/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;In This Podcast&lt;br /&gt;&lt;/strong&gt;Tim Manns from Optus discusses how the company uses detailed network data from its Teradata system to look at calling behavior. With 40% of the Australian telecommunications market, the company cross-references each customer with every other customer, groups them together based on who they communicate with, looks at the behavior of the group, and can then predict next steps and target those groups with appropriate products and services.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-34769033043158672?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/34769033043158672/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=34769033043158672' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/34769033043158672'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/34769033043158672'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/05/teradata-podcasts-on-data-mining-and.html' title='Teradata Podcasts on Data Mining And 
SNA'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-437952389531007532</id><published>2009-05-04T15:56:00.000-07:00</published><updated>2009-05-04T19:14:20.338-07:00</updated><title type='text'>Telstra found guilty of abuses of telecommunications network data</title><content type='html'>- Disclaimer. I do not represent any organisation. This is a personal blog and I talk freely about data mining from a personal perspective only. --&lt;br /&gt;&lt;br /&gt;See a recent news article;&lt;br /&gt;&lt;a href="http://www.itwire.com/content/view/24786/1095/"&gt;http://www.itwire.com/content/view/24786/1095/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And also;&lt;br /&gt;&lt;a href="http://www.australianit.news.com.au/story/0,25197,25414690-15306,00.html"&gt;http://www.australianit.news.com.au/story/0,25197,25414690-15306,00.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As a Data Miner for a telecommunications provider I frequently use network data in my analysis. How many calls the customer makes, at what time of day, do they communicate using voice or sms etc. I examine data pertaining to *customers* only. &lt;br /&gt;&lt;br /&gt;Telecommunications companies often provision services wholesale for another company. This 'wholesale recipient' company will pay for the use of the network, but manage all other activities such as marketing, customer account and billing. In these cases, although the telecommunications company is responsible for supplying the network service and ensuring calls are successfully established (and likely stores data about these calls), it doesn't own the call data for that customer (who belongs to the 'wholesale recipient' company). Make sense? 
Use of the data that pertains to the actions of someone who is not a customer of that telecommunications company must be treated with the utmost caution.&lt;br /&gt;&lt;br /&gt;Every data miner must be aware of data privacy laws, and in many countries failure to adhere to these laws attracts heavy financial penalties for the organisation and individuals involved. In Australia some invasion-of-privacy laws could potentially even involve 2 years' jail time.&lt;br /&gt;&lt;br /&gt;Recently Telstra, an Australian telecommunications company (and the previous incumbent), was found guilty of serious breaches of data privacy. For the 130-page publicly accessible transcript see;&lt;br /&gt;&lt;a href="http://www.austlii.edu.au/cgi-bin/sinodisp/au/cases/cth/FCA/2009/422.html"&gt;http://www.austlii.edu.au/cgi-bin/sinodisp/au/cases/cth/FCA/2009/422.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'm guessing that the significant legal costs and the years it has taken to get this result are prohibitive for many telcos, so they let it slide. Optus didn't.&lt;br /&gt;&lt;br /&gt;Basically, the bit that caught my eye was item 108 (yes, I speed-read the whole thing...). It is legal jargon and reads;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;"Telstra asserted that total traffic travelling across its network belonged to Telstra. Optus submitted that whether it belonged to Telstra is not the question posed by cl 15.1 of the Access Agreement. The question under cl 15.1 is whether Telstra owed an obligation under that clause with respect to traffic information recorded by Telstra of communications by Optus customers on the Telstra network because that information was Confidential Information of Optus. The definition of Confidential Information identifies what is the Confidential Information of Optus. 
Once a CCR records information in relation to a call made by an Optus customer, that information becomes the Confidential Information of Optus because it falls within the definition of ‘Confidential Information’."&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;The first sentence is shocking. In plain English it basically suggests that Telstra treats all network calling data as its own, and freely uses information about calls made by anyone on that network as it sees fit. That includes calls made by customers of wholesale or competitor companies on its network. In the case of wholesale for fixed line (land) networks Telstra will know the address and likely also the name of the customer. In the early days Optus had little choice but to use some of Telstra's fixed line infrastructure, often the last bits of copper wire that reach a household. Information about this usage was passed to executives and board members so that they knew customer size and market share by age, geography etc. It is also highly likely (although difficult to prove) that the Telstra retail arm used the data for marketing activities and actioned direct communications to those customers. Anti-competitive to say the least...&lt;br /&gt;&lt;br /&gt;One of the short conclusions of the legal findings was;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;"For the foregoing reasons, I find that Telstra has used traffic information of Optus, or Communication Information of Optus for the purposes of the Access Agreement, both in the preparation of market share reports and in distributing those reports among Telstra personnel. I also find that such information is Confidential Information of Optus for the purposes of the Access Agreement, or is otherwise subject to the requirements of confidentiality in cl 15 of the Access Agreement, by force of cl 10 of that agreement. 
I also find that neither such use of such information nor its disclosure for such purposes is permitted by the Access Agreement."&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;I guess the information here is probably too deep in 'telco land', but hopefully it's clear enough to understand the gravity of this. I've known for a long time that this type of stuff was being done by some telcos, but I'm shocked it was so brazen. &lt;br /&gt;&lt;br /&gt;Knowing the big differences between what we (as Data Miners) are 'able to do' regarding insights and personal information (particularly in mobile telecommunications) and what we 'should do' is very important. Years ago the industry passed the early developmental stage of storing data; in recent years we have learned how to understand the data and convert it into useful insights.  I still think that many data miners don't realise how important (now more than ever before) it is that we act responsibly in the use of the personal information we obtain from 'our' data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-437952389531007532?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/437952389531007532/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=437952389531007532' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/437952389531007532'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/437952389531007532'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/05/telstra-found-guilty-of-abuses-of.html' title='Telstra found guilty of abuses of telecommunications network 
data'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1054930634406420017</id><published>2009-04-22T15:10:00.000-07:00</published><updated>2009-04-22T15:39:54.236-07:00</updated><title type='text'>When graphs, piecharts and all else fails... Dilbert to the rescue!</title><content type='html'>If you work in a Marketing or Sales department then you probably have the challenging task of convincing your less technical colleagues of the benefit of using your awesome customer insights.&lt;br /&gt;&lt;br /&gt;I'm quite proud of the social network analysis (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;SNA&lt;/span&gt;) that I'd first completed months ago. It is refreshed each month (the data warehouse load is too high to run it daily or weekly as I would like). I've been tracking its performance, and am continually surprised.&lt;br /&gt;&lt;br /&gt;The trouble is that my colleagues are having trouble understanding how they can use it to formalise customer communications, so I decided to try a different approach than graphs and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;piecharts&lt;/span&gt; etc.&lt;br /&gt;&lt;br /&gt;Instead I thought I might try something &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;humorous&lt;/span&gt;, hence Dilbert to the rescue! 
I have created a dozen or so custom Dilbert slides that each provide some info about a customer insight made &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;available&lt;/span&gt; by the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;SNA&lt;/span&gt; and also have a &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_5"&gt;humorous&lt;/span&gt; &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_6"&gt;conclusion&lt;/span&gt; to that insight. I'll pass these around the department in a series of daily emails.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Here is one example (I had to change the project nickname to "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;SNA&lt;/span&gt;" for this blog);&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/Se-bfY_kPkI/AAAAAAAAACI/W_isASeeCqI/s1600-h/blog+version.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5327647847789903426" style="WIDTH: 100%; CURSOR: hand; HEIGHT: 100%" alt="" src="http://3.bp.blogspot.com/_octsRin8yY0/Se-bfY_kPkI/AAAAAAAAACI/W_isASeeCqI/s400/blog+version.JPG" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1054930634406420017?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1054930634406420017/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1054930634406420017' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1054930634406420017'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1054930634406420017'/><link rel='alternate' type='text/html' 
href='http://timmanns.blogspot.com/2009/04/when-graphs-piecharts-and-all-else.html' title='When graphs, piecharts and all else fails... Dilbert to the rescue!'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_octsRin8yY0/Se-bfY_kPkI/AAAAAAAAACI/W_isASeeCqI/s72-c/blog+version.JPG' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-7217638966265223244</id><published>2009-04-06T20:02:00.000-07:00</published><updated>2009-04-06T20:18:33.915-07:00</updated><title type='text'>Clementine is dead, long live PASW Modeller</title><content type='html'>For the new Clementine homepage see&lt;br /&gt;&lt;a href="http://www.spss.com/software/modeling/modeler/"&gt;http://www.spss.com/software/modeling/modeler/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;SPSS have gone for new product names, including changing Clementine to PASW.  I'm more interested in the new features and bug fixes than buzz words.  
I'll hopefully be getting the new version shortly and will let you know if Clementine 13 (aka Predictive Analytics Soft Ware Modeller) adds value.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-7217638966265223244?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/7217638966265223244/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=7217638966265223244' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/7217638966265223244'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/7217638966265223244'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/04/clementine-is-dead-long-live-pasw.html' title='Clementine is dead, long live PASW Modeller'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-9027052160123189187</id><published>2009-03-30T14:12:00.000-07:00</published><updated>2009-03-30T15:04:16.894-07:00</updated><title type='text'>Tips for the KDD challenge :)</title><content type='html'>I recently heard about the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;KDD&lt;/span&gt; challenge this year.  
It's a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;telco&lt;/span&gt;-based challenge to build churn, cross-sell, and up-sell propensity models using the supplied train and test data.&lt;br /&gt;&lt;br /&gt;For more info see;&lt;br /&gt;&lt;a href="http://www.kddcup-orange.com/index.php"&gt;http://www.kddcup-orange.com/index.php&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I am not able to download the data at work (security / download limits), so I might have to try this at home.  I haven't even seen the data yet.  I'm hoping it's transactional &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;CDRs&lt;/span&gt; and not in some summarised form (which it sounds like it is).&lt;br /&gt;&lt;br /&gt;I don't have a lot of free time so I might not get around to submitting an entry, but if I do, these are some of the data preparation steps and issues I'd consider;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt; - handle outliers &lt;br /&gt;&lt;/strong&gt;If the data is real-world then you can guarantee that some values will be at least a thousand times bigger than anything else.  A log transform might not work, so try trimmed mean or frequency binning as a method to remove outliers.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt; - missing values&lt;/strong&gt;&lt;br /&gt;The &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;KDD&lt;/span&gt; guide suggests that missing or undetermined values were converted into zero.  Consider changing this.  Many algorithms will treat zero very differently from a null.  You might get better results by treating these zeros as nulls.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt; - percentage &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;comparisons&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;If a customer can make a voice or &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;sms&lt;/span&gt; call, what's the percentage split between them?  
(&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;eg&lt;/span&gt; 30% voice vs 70% &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;sms&lt;/span&gt; calls).  If only voice calls, then consider &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_8"&gt;splitting&lt;/span&gt; by time of day or peak vs &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;offpeak&lt;/span&gt; as percentages.  The use of percentages helps remove differences of scale between high and low quantity customers.  If telephony usage covers a number of days or weeks, then consider a similar metric that shows increased or &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_10"&gt;decreased&lt;/span&gt; usage over time.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt; - social networking analysis&lt;/strong&gt;&lt;br /&gt;If the data is raw transactional &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;CDRs&lt;/span&gt; (call detail records) then give a lot of consideration to performing a basic social networking analysis.  Even if all you can manage is to identify a circle of friends for each customer, then this may have a big impact upon identification of high churn individuals or up-sell opportunities.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt; - not all churn is equal&lt;/strong&gt;&lt;br /&gt;Rank customers by usage and scale the rank to a score from zero (low rank) to 1.0 (high rank).  No &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;telco&lt;/span&gt; should still be treating every churn as an equal loss.  It's not!   The loss of a highly valuable customer (high rank) is worse than a low spend customer (low rank).  Develop a model to handle this and argue your reasons for why treating all churn the same is a fool's folly.  
This is difficult if you have no spend information or history of usage over multiple billing cycles.&lt;br /&gt;&lt;br /&gt;Hope this helps&lt;br /&gt;&lt;br /&gt;Good luck everyone!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-9027052160123189187?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/9027052160123189187/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=9027052160123189187' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/9027052160123189187'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/9027052160123189187'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/03/tips-for-kdd-challenge.html' title='Tips for the KDD challenge :)'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6191681601770481279</id><published>2009-03-27T14:34:00.000-07:00</published><updated>2009-03-26T17:44:20.527-07:00</updated><title type='text'>Presenting at conference Uniscon 2009</title><content type='html'>&lt;p&gt;I've been asked to present at &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Uniscon&lt;/span&gt; 2009. One of the professors involved at the University of Western Sydney is a relative of an analyst I work with and requested I present. 
I usually find academic conferences are snooze city, but they promised me free beer and I live in Sydney anyway, so I can get home to see the baby before the night's end. I hope I'm just one of many industry persons there and it proves to be an insightful event.&lt;br /&gt;&lt;br /&gt;I'm not presenting work. I will be presenting from a personal perspective as an &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;industry&lt;/span&gt; data miner (I've not enough time to prepare my presentation and get legal approval from work) and I'll be discussing generic topics instead of describing recent data mining projects and quoting numbers or factual business outcomes.&lt;br /&gt;&lt;br /&gt;I suspect a large part of my &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;attendance&lt;/span&gt; is to drive some &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;enthusiasm&lt;/span&gt; and make the students interested in data mining and aware of what challenges you face in data mining roles.&lt;br /&gt;&lt;br /&gt;If you are attending then feel free to say 'hi'.&lt;br /&gt;For info on the conference see;&lt;br /&gt;&lt;a href="http://openresearch.org/wiki/UNISCON_2009"&gt;http://openresearch.org/wiki/UNISCON_2009&lt;/a&gt;&lt;span style="color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;wider website &lt;a href="http://www.uniscon2009.org/"&gt;http://www.uniscon2009.org/&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Below was the presentation title and abstract I threw together (now just have to write it...). There is a social networking analysis (SNA) element to it (because that's what I'm focused on at the moment).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Presentation title:&lt;br /&gt;&lt;/strong&gt;Know your customers. 
Know who they know they know, and who they don't.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Presentation Abstract:&lt;br /&gt;&lt;/strong&gt;Tim's presentation will describe some of the types of marketing analysis a typical telecommunications company might do, including social network analysis (SNA, which is a hot topic right now). He also elaborates on the technical and practical side of data mining, and what business impacts data mining may have.&lt;br /&gt;&lt;br /&gt;More importantly the presentation will help answer questions such as;&lt;br /&gt;- What skills are required for Data Mining?&lt;br /&gt;- What problems are commonly faced during Data Mining projects?&lt;br /&gt;- And just what is this Data Mining stuff all about anyway?&lt;br /&gt;&lt;br /&gt;- Tim&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6191681601770481279?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6191681601770481279/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6191681601770481279' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6191681601770481279'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6191681601770481279'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/03/presenting-at-conference-uniscon-2009.html' title='Presenting at conference Uniscon 2009'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-4713613448961013080</id><published>2009-03-26T17:13:00.000-07:00</published><updated>2009-03-26T17:20:31.818-07:00</updated><title type='text'>Closing days of the Data Mining survey</title><content type='html'>I got a quick email yesterday from Karl Rexer. There are a few days remaining to participate in his yearly data mining survey.&lt;br /&gt;&lt;br /&gt;Survey Link: &lt;a href="http://www.rexeranalytics.com/Data-Miner-Survey-Intro2.html"&gt;www.RexerAnalytics.com/Data-Miner-Survey-Intro2.html&lt;/a&gt;&lt;br /&gt;Access Code: TM42P&lt;br /&gt;&lt;br /&gt;If you frequently conduct data analysis on large amounts of data (i.e. data mining!) then I urge you to participate.&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-4713613448961013080?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/4713613448961013080/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=4713613448961013080' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4713613448961013080'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4713613448961013080'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/03/i-got-quick-email-yesterday-from-karl.html' title='Closing days of the Data Mining survey'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8351393575204557000</id><published>2009-03-11T13:52:00.000-07:00</published><updated>2009-03-17T14:33:07.998-07:00</updated><title type='text'>And then there were Three, not!</title><content type='html'>I haven't been contributing to forums or making posts with my usual vigour because of a few recent events;&lt;br /&gt;&lt;br /&gt;1) becoming a daddy&lt;br /&gt;-&gt; lots of fun!&lt;br /&gt;&lt;br /&gt;2) recent announcement of a merger between the telcos Vodafone and Hutchison.&lt;br /&gt;-&gt; pain in the arse!&lt;br /&gt;For info see&lt;br /&gt;&lt;a href="http://www.ft.com/cms/s/0/1e1af810-f68e-11dd-8a1f-0000779fd2ac.html"&gt;http://www.ft.com/cms/s/0/1e1af810-f68e-11dd-8a1f-0000779fd2ac.html&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.vodafone.com/start/media_relations/news/group_press_releases/2009/hutchison_and_vodafone.html"&gt;http://www.vodafone.com/start/media_relations/news/group_press_releases/2009/hutchison_and_vodafone.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Australia's population is approximately 20 million, which is pretty small, and there were four players in the mobile service provider market (in probable order of market share); Telstra, Optus, Vodafone, Three.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The announcement that Vodafone and Three are merging reduces this to three players, which reshapes the landscape of Australia to closely match many other countries with mature telecommunications markets. 
Most countries with mature telecommunications markets have a few players and, in this current economic climate, it's not surprising that there will be mergers and therefore consolidation of customers into larger groups.&lt;br /&gt;&lt;br /&gt;As a result of the merger, the competitors (Telstra &amp;amp; Optus) will have to review their strategies and probably re-examine customer analysis. Lots of work for us Data Miners...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8351393575204557000?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8351393575204557000/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8351393575204557000' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8351393575204557000'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8351393575204557000'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/03/and-then-there-were-three-not.html' title='And then there were Three, not!'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-4150838419336114281</id><published>2009-03-03T12:53:00.000-08:00</published><updated>2009-03-03T13:24:04.769-08:00</updated><title type='text'>How many models is enough?</title><content type='html'>I recently missed a presentation by a data mining software 
vendor (due to my recent paternity break) but I've been &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;reviewing&lt;/span&gt; my colleague's notes and vendor presentation slides.  I won't name the vendor, you can probably work it out.&lt;br /&gt;&lt;br /&gt;A &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;significant&lt;/span&gt; part of the vendor solution is the ability to manage many, we're talking hundreds, of data mining models (predictive, clustering etc).&lt;br /&gt;&lt;br /&gt;In my group we do not have many data mining models, maybe a dozen, that we run on a weekly or monthly basis.  Each model is quite comprehensive and will score the entire customer base (or near to it) for a specific outcome (churn, up-sell, cross-sell, &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;acquisition&lt;/span&gt;, inactivity, credit risk, etc).  We can &lt;strong&gt;subsequently&lt;/strong&gt; select sub-populations from the customer base for &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;targeted&lt;/span&gt; communications based upon the score or outcome of any single or a combination of models, or any criteria taken from customer information.&lt;br /&gt;&lt;br /&gt;I'm not entirely sure why you would want hundreds of models in a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Telco&lt;/span&gt; (or similar) space.  Any selection criteria applied to specific customers (say, by age, or gender, or state, or spend) before modeling will of course force a biased sample that feeds into the model and affects its inherent nature.  Once this type of selective sampling is performed you can't easily track the corresponding model over time *if* the sampled sub-population ever changes (which is likely because people do get older, move house, or change spend etc).  For this reason I can't understand why someone would want or have many models.  
It makes perfect sense in Retail (for example a model for each product or association rules for product recommendations), but not many models that apply to sub-populations of your customer base.&lt;br /&gt;&lt;br /&gt;Am I missing something here? If you are working with a few products or services and a large customer base why would you prefer many models over a few?   &lt;br /&gt;&lt;br /&gt;Comments please :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-4150838419336114281?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/4150838419336114281/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=4150838419336114281' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4150838419336114281'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4150838419336114281'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/03/how-many-models-is-enough.html' title='How many models is enough?'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-8154767923488147662</id><published>2009-01-19T15:20:00.000-08:00</published><updated>2009-01-19T15:40:41.264-08:00</updated><title type='text'>re: "Thoughts on Understanding Neural Networks"</title><content type='html'>Great post by Gordon &lt;span class="blsp-spelling-error" 
id="SPELLING_ERROR_0"&gt;Linoff&lt;/span&gt; on the Data Miners.com Blog about &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;visualising&lt;/span&gt; Neural Networks&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.data-miners.com/blog/2009/01/thoughts-on-understanding-neural.html"&gt;http://www.data-miners.com/blog/2009/01/thoughts-on-understanding-neural.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I usually get better predictive success using neural nets, but the lack of explain-ability is always a downside. I'm always keen to see ways that might help explain or interpret a Neural Network. A few years ago I tried a simple graphical way to show a Neural Net, but I think Gordon's recent post highlights better options.&lt;br /&gt;&lt;br /&gt;My example is written in VB.NET and parses a single hidden layer Neural Network saved as &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;PMML&lt;/span&gt; into data grids. Once the neural net neurons and weights are loaded into the data grids I then read from the data grids and create a graphic of the Neural Net. Transparency is used to show the strength of the weight, whilst colour (blue or red) is used to show the negative or positive effect. 
You can view my example at;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/thread/6495.aspx"&gt;http://www.kdkeys.net/forums/thread/6495.aspx&lt;/a&gt;&lt;br /&gt;and download the source code, executable and example &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;PMML&lt;/span&gt; directly from here;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/6495/PostAttachment.aspx"&gt;http://www.kdkeys.net/forums/6495/PostAttachment.aspx&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I can't post images as comments on Gordon's blog so below are two snapshots of the simple &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;UI&lt;/span&gt; application that displays the Neural Net &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;PMML&lt;/span&gt; graphic.&lt;br /&gt;&lt;br /&gt;a) graphic&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_octsRin8yY0/SXUOGW8VZGI/AAAAAAAAABo/GpjaX2GJBb8/s1600-h/nn+visual+1.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5293152439444530274" style="WIDTH: 320px; CURSOR: hand; HEIGHT: 198px" alt="" src="http://2.bp.blogspot.com/_octsRin8yY0/SXUOGW8VZGI/AAAAAAAAABo/GpjaX2GJBb8/s320/nn+visual+1.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;b) &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;PMML&lt;/span&gt; loaded into data grids&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/SXUOLTN2TsI/AAAAAAAAABw/OxhgNHlmWSo/s1600-h/nn+visual+2.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5293152524343594690" style="WIDTH: 320px; CURSOR: hand; HEIGHT: 176px" alt="" src="http://3.bp.blogspot.com/_octsRin8yY0/SXUOLTN2TsI/AAAAAAAAABw/OxhgNHlmWSo/s320/nn+visual+2.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Cheers&lt;br /&gt;&lt;br /&gt; - Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-8154767923488147662?l=timmanns.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/8154767923488147662/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=8154767923488147662' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8154767923488147662'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/8154767923488147662'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/01/re-thoughts-on-understanding-neural.html' title='re: &quot;Thoughts on Understanding Neural Networks&quot;'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_octsRin8yY0/SXUOGW8VZGI/AAAAAAAAABo/GpjaX2GJBb8/s72-c/nn+visual+1.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-666481801628178250</id><published>2009-01-08T02:04:00.000-08:00</published><updated>2009-01-13T23:50:26.003-08:00</updated><title type='text'>Isn't In-database processing old news yet?</title><content type='html'>&lt;p&gt;A bit of a Clementine plug, but hear me through...&lt;br /&gt;&lt;br /&gt;I'm puzzled by a few recent articles I've read describing in-database processing, the practice of doing very sophisticated data warehouse analysis (let's call it data mining :) on large amounts of data without having to extract the data into an external analytics tool (for example, a tool like SPSS or SAS).&lt;br /&gt;&lt;br /&gt;As an example see the 
current Teradata magazine article;&lt;br /&gt;&lt;a href="http://www.teradata.com/tdmo/v08n04/Features/OnTheHorizon.aspx"&gt;http://www.teradata.com/tdmo/v08n04/Features/OnTheHorizon.aspx&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) on these topics during a Teradata conference in Beijing in June 2008 *. I think he's right on the money concerning his top 4 predictions for data warehousing. As a Data Miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I'm not so sure it is a 'future' thing though. In my view it's here now, just maybe not so widespread. My only guess is that it's another plug for the SAS partnership. Although I don't use SAS I do like the thinking and development plan going forward. I simply don't think it's the new concept it's being touted as. I'm sure it's not necessary for a data miner or data analyst to need custom plug-ins (and corresponding expense) to reach in-database data mining nirvana.&lt;br /&gt;&lt;br /&gt;In-database processing is nothing new. I've been doing it using SPSS Clementine and Teradata for a few years now. SPSS Clementine has supported this functionality for quite a few years. In real-time Clementine will convert the stream (a graphical icon-based proprietary query file) into SQL (structured query language) and submit the SQL queries to the data warehouse. Any computation that cannot be represented as SQL will cause a data extraction and further processing by the Clementine analysis engine itself (commonly the Clementine Server on a dedicated analytical server box will do this, and keep the data and temp files on the server file system, not the desktop). In practice I usually avoid heavy statistical functions and all my data processing is usually performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. 
When it comes to scoring the created data mining model (such as a neural network or decision tree) Clementine also converts the data mining model into SQL transparently for truly high scale processing on the data warehouse.&lt;br /&gt;&lt;br /&gt;The real advantage comes from not just being able to score existing data mining models, but also being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (a couple of years). Not something I do much of (I've done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do) then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and this usually beats tweaking algorithm options.&lt;br /&gt;&lt;br /&gt;The data warehouses have actually supported embedded code and adding custom functions that might include a data mining algorithm for quite a while. For info see a recent post by Seth Grimes titled "In-Database Analytics: A Passing Lane for Complex Analysis";&lt;a href="http://www.intelligententerprise.com/info_centers/data_int/showArticle.jhtml?articleID=212500351&amp;amp;cid=RSSfeed_IE_News"&gt;http://www.intelligententerprise.com/info_centers/data_int/showArticle.jhtml?articleID=212500351&amp;amp;cid=RSSfeed_IE_News&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Only in the past year or so have full blown embedded data mining algorithms taken off. The question is though will these algorithms always run fast(er)? Custom code can be good or bad! One advantage of the 'algorithms converted into SQL' route is that the data warehouse can quite easily determine and control how to process and prioritise the SQL query and be optimised specifically for it. 
Custom code and embedded data mining algorithms can also be optimised, but I'm guessing that requires far more effort (and expense!). One worry is also that custom code brings dangers and risks (not to mention the testing and issues for IT and the DB admin). Still, it's necessary for in-database data mining model building capability.&lt;br /&gt;&lt;br /&gt;Ok, I'm guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me;&lt;br /&gt;- considering that once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?&lt;br /&gt;&lt;br /&gt;My preference is for an easy tool that makes querying the data warehouse and constructing highly complex analysis easy. My queries could not possibly be prepared by hand since they are often transformed into many thousands of lines of SQL code. I use a clever user interface to make understanding the logic of the analysis possible. The data mining tool I use primarily is a tool that optimises my interaction with the data warehouse.&lt;br /&gt;&lt;br /&gt;So considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS :) will stick around for quite a bit longer because the user interface optimises a data miner's ability to query the data warehouse and perform data mining efficiently, but maybe in a few years we start to see what we commonly refer to as data mining 'algorithms' being developed for data warehouses and no longer in data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!&lt;br /&gt;&lt;br /&gt;- Tim&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my data analysis work on churn prediction and product upsell in telecommunications, and chatted to the China Mobile analysts afterward. 
On a more personal note, that is also when my soon-to-be-born son was conceived. Don't worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him 'Beijing' or 'Teradata' but my wife doesn't share my enthusiasm :)&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-666481801628178250?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/666481801628178250/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=666481801628178250' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/666481801628178250'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/666481801628178250'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/01/isnt-in-database-processing-old-news.html' title='Isn&apos;t In-database processing old news yet?'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-4936320770290944420</id><published>2009-01-06T15:10:00.001-08:00</published><updated>2009-01-07T12:15:40.232-08:00</updated><title type='text'>book review "Data Preparation For Data Mining"</title><content type='html'>Just before Christmas I bought myself yet another data mining book (I have a few dozen). 
This one somehow slipped by me for 10 years but I'm glad I finally stumbled upon it. Originally published in 1999, Dorian Pyle wrote "Data Preparation For Data Mining" when Data Mining was less widespread and 'Predictive Analytics' wasn't the buzz word it is today.&lt;br /&gt;&lt;br /&gt;The only few criticisms I could possibly raise are;&lt;br /&gt;1) that everything has a statistical basis.&lt;br /&gt;- For example one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don’t work so I often use &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;SQL&lt;/span&gt; to convert variables into 100 percentile bins (where each bin has the same number of rows (customers) in it). That type of insight isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and useful insights.&lt;br /&gt;2) no mention of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;SQL&lt;/span&gt; or step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.&lt;br /&gt;&lt;br /&gt;So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. 
It's &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;definitely&lt;/span&gt; a book I will refer to, and well worth buying.&lt;br /&gt;&lt;br /&gt;In &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;February&lt;/span&gt; 2004 Dorian Pyle made an interesting post about things to avoid when data mining;&lt;br /&gt;"This Way Failure Lies " &lt;a href="http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328"&gt;http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-4936320770290944420?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/4936320770290944420/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=4936320770290944420' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4936320770290944420'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4936320770290944420'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2009/01/book-review-data-preparation-for-data.html' title='book review &quot;Data Preparation For Data Mining&quot;'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6199018178049578461</id><published>2008-12-15T18:03:00.000-08:00</published><updated>2008-12-15T18:46:40.925-08:00</updated><title type='text'>No Long Tail from iTunes</title><content type='html'>&lt;p&gt;There have been quite a few posts in recent months about analytics involving the Long Tail. &lt;br /&gt;For an overview see;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/The_Long_Tail"&gt;http://en.wikipedia.org/wiki/The_Long_Tail&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Personally, I definitely fall into a Long Tail demographic regarding my music habits.  I buy many relatively uncommon (and some very obscure) funk and jazz albums or compilations.  A significant number of these were introduced to me by Amazon's recommendation engine or similar.&lt;br /&gt;&lt;br /&gt;I thought I might be missing out on some good music by not being part of the 'iTunes generation'.  It came as a huge shock to me when I joined iTunes this weekend and found that *none* of my dozen most recent purchases are even listed in iTunes (all my recent purchases have been from Amazon.com). Granted some of the artists have died and their albums are a few decades old, but other titles were released last year.&lt;br /&gt;&lt;br /&gt;Not much chance of the Long Tail when 'stock' is limited.  In terms of Amazon I have actually ordered and received something physical; a compact disc that was sitting on a shelf somewhere in the US and travelled transatlantic to me in Sydney.&lt;br /&gt;&lt;br /&gt;What excuse do iTunes have?  Downloading .mp3's hardly has the same requirements for stock management, inventory and distribution.  
A long tail in the iTunes business model should be easier to support and yield greater benefits (than Amazon for example) because of the lesser requirement for physical stock management (I'm guessing it would be a simple case of more disk space).  A long tail is not likely to exist where there is less choice for consumers! &lt;br /&gt;&lt;br /&gt;In my view the whole idea of the 'Long Tail' applies to circumstances where the constraints of physical stock management are removed (as with iTunes).  The fact that Amazon excel at this with physical stock is a credit to them.   iTunes is simply a mass-market disgrace :)&lt;br /&gt;&lt;br /&gt;Guess I'll continue to buy CDs and burn them into my own mp3s and stream music from my home network... &lt;/p&gt;&lt;p&gt;- Tim&lt;/p&gt;&lt;p&gt;btw.  See my wish list for examples :)&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6199018178049578461?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6199018178049578461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6199018178049578461' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6199018178049578461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6199018178049578461'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/12/no-long-tail-from-itunes.html' title='No Long Tail from iTunes'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-2977108852487312924</id><published>2008-12-07T18:00:00.000-08:00</published><updated>2008-12-07T18:09:20.537-08:00</updated><title type='text'>Wikipedia entry for SPSS Clementine</title><content type='html'>I was doing a search on Wikipedia today, and out of curiosity I wondered what the Clementine entry said. &lt;br /&gt;&lt;br /&gt;Although there is some information on data mining (&lt;a href="http://en.wikipedia.org/wiki/Data_mining"&gt;http://en.wikipedia.org/wiki/Data_mining&lt;/a&gt;), I was surprised that there was nothing about the SPSS Clementine data mining tool, so I added a brief entry;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/SPSS_Clementine"&gt;http://en.wikipedia.org/wiki/SPSS_Clementine&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Feel free to contribute. &lt;br /&gt;&lt;br /&gt;I might add more when I get a block of free time.&lt;br /&gt;&lt;br /&gt; - Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-2977108852487312924?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://en.wikipedia.org/wiki/SPSS_Clementine' title='Wikipedia entry for SPSS Clementine'/><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/2977108852487312924/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=2977108852487312924' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/2977108852487312924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/2977108852487312924'/><link rel='alternate' 
type='text/html' href='http://timmanns.blogspot.com/2008/12/wikipedia-entry-for-spss-clementine.html' title='Wikipedia entry for SPSS Clementine'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6659858001951897086</id><published>2008-11-30T21:12:00.000-08:00</published><updated>2008-12-02T15:25:16.519-08:00</updated><title type='text'>Smart Data Collective</title><content type='html'>I was recently asked to join the Smart Data Collective, which is a social community of bloggers and professionals interested in data warehousing &lt;strong&gt;and&lt;/strong&gt; enterprise analytics. It is sponsored by Teradata and is editorially independent. Some of my posts will appear there, and maybe even some specific articles from me.&lt;br /&gt;&lt;br /&gt;They asked me to be a founding member and featured blogger which sounds like a lot of work, but I've been reassured it isn't :)&lt;br /&gt;&lt;br /&gt;On a related but separate note, I recently participated in a podcast for Teradata concerning the data analytics we do at Optus &lt;em&gt;(legal disclaimer:  I do not represent Optus in any way in this personal blog :)&lt;/em&gt;.  I discussed previously presented material regarding our churn prevention analysis, and also my recent social network analysis. The podcast was only completed recently, it's still early days and needs to pass approval from appropriate legal channels, but hopefully it will find its way onto the Teradata site in the New Year. 
I'll keep you posted.&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6659858001951897086?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://smartdatacollective.com' title='Smart Data Collective'/><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6659858001951897086/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6659858001951897086' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6659858001951897086'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6659858001951897086'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/11/smart-data-collective.html' title='Smart Data Collective'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-107262877171119500</id><published>2008-11-24T14:07:00.000-08:00</published><updated>2009-01-08T01:33:35.332-08:00</updated><title type='text'>Movember Madness</title><content type='html'>Through executive meetings and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;BBQ's&lt;/span&gt;&lt;/span&gt; with friends I have worn a 'Mo' as a symbol of my support for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;&lt;span class="blsp-spelling-error" 
id="SPELLING_ERROR_1"&gt;Movember&lt;/span&gt;&lt;/span&gt; (&lt;a href="http://www.movember.com/"&gt;http://www.movember.com/&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;That's right we're talking about &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;men's&lt;/span&gt;&lt;/span&gt; love bumps and feeling down in the dumps. To raise awareness of prostate cancer and depression I have grown a truly dodgy moustache.&lt;br /&gt;&lt;br /&gt;I have found it a challenge and am looking forward to shaving the damn thing off...&lt;br /&gt;&lt;br /&gt;With only a few days to go I will soon be posting 'before' and 'after' pictures. There is still time to donate your hard earned cash if you wish;&lt;br /&gt;&lt;br /&gt;&lt;a href="https://www.movember.com/au/donate/donate-details.php?action=sponsorlink&amp;amp;rego=1944466&amp;amp;country=au"&gt;https://www.movember.com/au/donate/donate-details.php?action=sponsorlink&amp;amp;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;rego&lt;/span&gt;&lt;/span&gt;=1944466&amp;amp;country=&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;au&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Cheers&lt;br /&gt;&lt;br /&gt;Tim&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;- Edit: &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;ok&lt;/span&gt;, here's the dodgy picture of my &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_6"&gt;moustache&lt;/span&gt;. 
I raised $125 for my hardship :)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_octsRin8yY0/SWXIDGDIUBI/AAAAAAAAABc/5qAR508WSBQ/s1600-h/dodgy+mo+tim.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5288853292905418770" style="WIDTH: 230px; CURSOR: hand; HEIGHT: 255px" alt="" src="http://4.bp.blogspot.com/_octsRin8yY0/SWXIDGDIUBI/AAAAAAAAABc/5qAR508WSBQ/s320/dodgy+mo+tim.JPG" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-107262877171119500?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='https://www.movember.com/au/donate/donate-details.php?action=sponsorlink&amp;rego=1944466&amp;country=au' title='Movember Madness'/><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/107262877171119500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=107262877171119500' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/107262877171119500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/107262877171119500'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/11/movember-madness.html' title='Movember Madness'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_octsRin8yY0/SWXIDGDIUBI/AAAAAAAAABc/5qAR508WSBQ/s72-c/dodgy+mo+tim.JPG' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1191087892560390761</id><published>2008-11-19T20:14:00.000-08:00</published><updated>2008-11-18T22:05:55.316-08:00</updated><title type='text'>A simple Data Transformation example...</title><content type='html'>In my experience of customer focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format.  Often the data is transformed to a 'single row per customer' or similar summarised format, and many columns (aka variables or fields) are created to act as inputs into predictive or clustering models.  Such data transformation can also be referred to as ETL (extract transform load), although my work is usually as SQL within the data warehouse so it is just the ‘T’ bit.&lt;br /&gt;&lt;br /&gt;Granted a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple.  I hope that the example below to transform transactional data into some useful customer-centric format will be generic. Feedback and open discussion might broaden my habits.&lt;br /&gt;&lt;br /&gt;Strangely many ‘data mining’ books almost completely avoid the topic of data processing and data transformations.  Often data mining books that do mention data processing simply refer to feature selection algorithms or applying a log function to rescale numeric data to act as predictive algorithm inputs.  Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc), or measures of dispersion (range, variance, standard deviation etc).&lt;br /&gt;&lt;br /&gt;There seems to be a glaring big gap!  I’m specifically referring to data processing steps that are separate from those mandatory or statistical requirements of the modelling algorithm.  In my experience relatively simple steps in data processing can yield significantly better results than tweaking algorithm parameters. 
 Some of these data processing steps are likely to be industry or data specific, but I’m guessing many are widely useful.  They don’t necessarily have to be statistical in nature.&lt;br /&gt;So (to put my money where my mouth is) I've started by illustrating a very simple data transformation that I expect is common.  On a public SPSS Clementine forum I’ve attached a small demo data file (which I created; it is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine). &lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/8440/PostAttachment.aspx"&gt;Clementine Stream and text data files&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.kdkeys.net/forums/8440/ShowThread.aspx"&gt;my post to a Clementine user forum&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I’m hoping that my peers might exchange similar ideas (hint!).  A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about and what I would find useful.  This is just the start of a series of ETL steps you could perform.&lt;br /&gt;&lt;br /&gt;I’ve also added a poll for feedback on whether this is helpful, too basic, etc.&lt;br /&gt;&lt;br /&gt; - Tim&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Example data processing steps&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt; a) Creation of additional dummy columns &lt;/b&gt;&lt;br /&gt;Where the data has a single category column that contains one of several values (in this example voice calls, sms calls, data calls etc) we can use a CASE statement to create a new column for each category.  We can use 0 or 1 as indicators if the category value occurs in any specific row, but you can also use the value of a numeric field (for example call count or duration if the data is already partly summarised).  A new column is created for each category value. 
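&lt;br /&gt;&lt;br /&gt;The CASE approach described above can be sketched roughly as follows. This is only an illustration, not the Clementine stream from the forum post: it uses Python's built-in sqlite3 module so the SQL can be run anywhere, and the table name (transactions) and its column names are assumptions made up for this sketch. The rows are the small food/drink example used in this post.&lt;br /&gt;
&lt;pre&gt;
```python
import sqlite3

# Hypothetical demo table: one row per transaction (names are illustrative only).
rows = [
    ("bill", "food", 10),
    ("bill", "drink", 20),
    ("ben", "food", 15),
    ("bill", "drink", 25),
    ("ben", "drink", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, category TEXT, score INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# One CASE expression per category value: a 0/1 indicator column and a
# column that carries the numeric score only when the category matches.
DUMMY_SQL = """
SELECT customer,
       category,
       score,
       CASE WHEN category = 'food'  THEN 1 ELSE 0 END AS food_ind,
       CASE WHEN category = 'drink' THEN 1 ELSE 0 END AS drink_ind,
       CASE WHEN category = 'food'  THEN score ELSE 0 END AS food_score,
       CASE WHEN category = 'drink' THEN score ELSE 0 END AS drink_score
FROM transactions
ORDER BY rowid
"""

dummy_rows = conn.execute(DUMMY_SQL).fetchall()
for row in dummy_rows:
    print(row)
```
&lt;/pre&gt;
The row count is unchanged at this stage; only the extra indicator and score columns are added.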
&lt;br /&gt;&lt;br /&gt;For example;&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;td&gt;customer&lt;/td&gt;&lt;td&gt;category&lt;/td&gt;&lt;td&gt;score&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Can be changed to;&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;td&gt;customer&lt;/td&gt;&lt;td&gt;category&lt;/td&gt;&lt;td&gt;score&lt;/td&gt;&lt;td&gt;food_ind&lt;/td&gt;&lt;td&gt;drink_ind&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Or even;&lt;br /&gt;&lt;table 
border="1"&gt;&lt;tr&gt;&lt;td&gt;customer&lt;/td&gt;&lt;td&gt;category&lt;/td&gt;&lt;td&gt;score&lt;/td&gt;&lt;td&gt;food_score&lt;/td&gt;&lt;td&gt;drink_score&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;food&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;drink&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt; b) Summarisation &lt;/b&gt;&lt;br /&gt;Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns. 
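&lt;br /&gt;&lt;br /&gt;A minimal sketch of this aggregation step, again using Python's built-in sqlite3 module purely for illustration. The table name (transactions) and column names are assumptions invented for the sketch; the CASE expressions are simply wrapped in SUM and grouped by customer so that each customer ends up on one row.&lt;br /&gt;
&lt;pre&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction table holding the same food/drink demo rows.
conn.execute("CREATE TABLE transactions (customer TEXT, category TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("bill", "food", 10),
        ("bill", "drink", 20),
        ("ben", "food", 15),
        ("bill", "drink", 25),
        ("ben", "drink", 20),
    ],
)

# Sum each per-category dummy column so there is one row per customer.
SUMMARY_SQL = """
SELECT customer,
       SUM(CASE WHEN category = 'food'  THEN score ELSE 0 END) AS food_score,
       SUM(CASE WHEN category = 'drink' THEN score ELSE 0 END) AS drink_score
FROM transactions
GROUP BY customer
ORDER BY customer
"""

summary = conn.execute(SUMMARY_SQL).fetchall()
for row in summary:
    print(row)
```
&lt;/pre&gt;
Swapping SUM for AVG (or adding COUNT) would give the other summaries mentioned above, at the cost of one more column per category.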
&lt;br /&gt;So we could change the previous step to something like this;&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;td&gt;customer&lt;/td&gt;&lt;td&gt;food_score&lt;/td&gt;&lt;td&gt;drink_score&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;bill&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;ben&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1191087892560390761?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1191087892560390761/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1191087892560390761' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1191087892560390761'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1191087892560390761'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/11/simple-data-transformation-example.html' title='A simple Data Transformation example...'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-5264139723774666400</id><published>2008-11-13T16:55:00.000-08:00</published><updated>2009-01-05T15:08:49.303-08:00</updated><title type='text'>People are Age-ist !</title><content type='html'>- Just an interesting customer insight that made me laugh 
the other day....&lt;br /&gt;&lt;br /&gt;As a small part of further social network analysis of a mobile (cell-phone) customer base I have examined age differences between customers and those with whom they communicate most frequently.&lt;br /&gt;&lt;br /&gt;I was also looking at how reliable it might be to guess someone's age (a customer or non-customer) by extrapolating from known individuals. Customer age is recorded approx 97% of the time, and it's accurate approx 92% of the time (unusually large numbers of people claim to be born on 1st Jan 2000 :)&lt;br /&gt;&lt;br /&gt;I was surprised to see (but then maybe I'm naive :) how so many people have 'mobile calling relationships' mainly with people the same age...&lt;br /&gt;&lt;br /&gt;The chart below shows the percentage of customers by the average age difference between them and the people they communicate with most frequently. Age differences over 4 years are comparatively rare...&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/SSINhHNq9oI/AAAAAAAAABM/9pp5dS_uVR0/s1600-h/herds+age+diff+3.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5269789376500135554" style="WIDTH: 320px; CURSOR: hand; HEIGHT: 238px" alt="" src="http://3.bp.blogspot.com/_octsRin8yY0/SSINhHNq9oI/AAAAAAAAABM/9pp5dS_uVR0/s320/herds+age+diff+3.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The small spike at 30 years difference is probably parent-to-child communication.&lt;br /&gt;&lt;br /&gt;I will be using this to support an estimation of age for prospects and customers where age is unknown, but age of social group members is known.&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-5264139723774666400?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/5264139723774666400/comments/default' title='Post Comments'/><link rel='replies' 
type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=5264139723774666400' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5264139723774666400'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5264139723774666400'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/11/people-are-age-ist.html' title='People are Age-ist !'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_octsRin8yY0/SSINhHNq9oI/AAAAAAAAABM/9pp5dS_uVR0/s72-c/herds+age+diff+3.JPG' height='72' width='72'/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-2812128055295484958</id><published>2008-10-20T22:19:00.000-07:00</published><updated>2008-10-20T22:49:47.268-07:00</updated><title type='text'>Distribution of a prediction score</title><content type='html'>Ok, what are people's views on this?&lt;br /&gt;&lt;br /&gt;I've tried to refer to a few textbooks but haven't found anything to help me 'answer' this.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;- background -&lt;br /&gt;&lt;/strong&gt;I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s).&lt;br /&gt;&lt;br /&gt;We often use neural networks to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. 
Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers. We rescore the customer base each month.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;- problem -&lt;/strong&gt;&lt;br /&gt;We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, let's say 2% per month (it is voluntary churn of good fare paying customers). So in the actual outcome data 98% of customers have a score of 0.0 and 2% have a score of 1.0.&lt;br /&gt;&lt;br /&gt;When I build my predictive model I try to ensure the model mimics this distribution. My view is that most of the churn prediction scores should be skewed toward 0.1 or 0.2 (say 95% of all scored customers), with scores from 0.3 to 1.0 applying to maybe 5% of the customer base.&lt;br /&gt;&lt;br /&gt;Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout the score range.&lt;br /&gt;&lt;br /&gt;I often examine the distribution as a sanity check before deployment. 
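&lt;br /&gt;&lt;br /&gt;A rough sketch of such a check, assuming the scores are available as a plain list of numbers between 0 and 1. The function name and the sample scores below are made up for illustration and are not real customer data; in practice the same bin counts could just as easily come from a query against the scored table.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import Counter


def score_histogram(scores, bins=10):
    """Return the share of customers in each equal-width score bin.

    Assumes a non-empty list of scores between 0.0 and 1.0 inclusive.
    """
    # A score of exactly 1.0 is placed in the top bin rather than a new one.
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    total = len(scores)
    return [counts.get(b, 0) / total for b in range(bins)]


# Made-up scores for illustration only: most customers low, very few high.
example_scores = [0.05] * 50 + [0.15] * 30 + [0.25] * 15 + [0.6] * 3 + [0.95] * 2

shares = score_histogram(example_scores)
for b, share in enumerate(shares):
    print(f"{b / 10:.1f}-{(b + 1) / 10:.1f}: {share:.0%}")
```
&lt;/pre&gt;
Comparing these shares month to month is one simple way to notice when the shape of the score distribution has drifted.&lt;br /&gt;&lt;br /&gt;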
If the distribution is as expected it is something like this;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_octsRin8yY0/SP1pbFqXWrI/AAAAAAAAAAk/nfIPFAMXm30/s1600-h/churn+score+distribution.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5259475853935860402" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_octsRin8yY0/SP1pbFqXWrI/AAAAAAAAAAk/nfIPFAMXm30/s320/churn+score+distribution.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If it looks screwy, maybe something like this;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_octsRin8yY0/SP1qW15viKI/AAAAAAAAAAs/2VbveXwPZpY/s1600-h/churn+score+distribution2.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5259476880497543330" style="CURSOR: hand" alt="" src="http://2.bp.blogspot.com/_octsRin8yY0/SP1qW15viKI/AAAAAAAAAAs/2VbveXwPZpY/s320/churn+score+distribution2.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;- then there may be a problem with the data processing or the behaviour of customers has sufficiently changed over time to require a model refresh. 
The subsequent actual model outcome performance is often not as good in this case.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;- question -&lt;/strong&gt;&lt;br /&gt;What are your views/preferences on this?&lt;br /&gt;What steps, if any, do you take in an attempt to validate the model prior to deployment (let's assume testing, validation and prior months' performance is great) ?&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-2812128055295484958?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/2812128055295484958/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=2812128055295484958' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/2812128055295484958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/2812128055295484958'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/10/distribution-of-prediction-score.html' title='Distribution of a prediction score'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_octsRin8yY0/SP1pbFqXWrI/AAAAAAAAAAk/nfIPFAMXm30/s72-c/churn+score+distribution.JPG' height='72' 
width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6479196469729577573</id><published>2008-10-16T15:18:00.000-07:00</published><updated>2008-10-16T15:50:56.654-07:00</updated><title type='text'>new book "Marketing Calculator"</title><content type='html'>I'm famous! :)  I'm delighted to say that I’ve been referenced in a new marketing book by Guy Powell titled “Marketing Calculator”.&lt;br /&gt;&lt;a href="http://www.marketing-calculator.com/"&gt;http://www.marketing-calculator.com/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I met Guy in Oct 2007 whilst presenting a couple of data mining case studies at a Marketing Analytics conference in Singapore.  My presentation title was ‘given’ to me by the conference organisers, but it allowed some freedom regarding the content.  I discussed how we use a comprehensive data warehouse, and how having access to detailed customer data enriched with demographic data can enable you to get some impressive response rates from campaigns, and sales numbers by up-selling to existing customers, and most importantly retain and grow your customer base.&lt;br /&gt;&lt;br /&gt;Guy and I spent some time discussing our work and he asked if he could include it as a case study in his book.   I received the book yesterday afternoon and am already through the first 4 chapters (60 pages).   It reads easily and is proving to be a very worthwhile book.  It is a Marketing book (it doesn’t contain statistical equations or examples of algorithms on data mining) and I would recommend it for anyone involved in marketing.  So far it has provided some well structured and clear insights into how you can improve your marketing practices. 
&lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6479196469729577573?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6479196469729577573/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6479196469729577573' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6479196469729577573'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6479196469729577573'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/10/new-book-marketing-calculator.html' title='new book &quot;Marketing Calculator&quot;'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6426864074732996917</id><published>2008-09-21T20:19:00.000-07:00</published><updated>2009-01-05T15:18:40.732-08:00</updated><title type='text'>Social Network Analysis in mobile networks</title><content type='html'>&lt;p&gt;Ok this is a big project that has consumed a lot of my time. It was actually completed a few months ago, but I’ve only recently had the time to present it or mention it in a public blog. I’m writing this free-form whilst some large queries are running in the background. I’ll add more to the thread when I get some more free time. Hopefully it will make interesting reading. 
I do tend to get excited with my projects like this, so please forgive me if my propaganda rambles on a bit...&lt;br /&gt;&lt;br /&gt;The aim of these posts is to reveal some typical data mining problems I encounter. It will superficially describe a social networks project I have recently completed. Hopefully enough to give insight, but not enough to replicate the whole thing exactly :)&lt;br /&gt;&lt;br /&gt;I would like to extend my sincere thanks to Jure Leskovec (&lt;a href="http://www.cs.cmu.edu/~jure/"&gt;http://www.cs.cmu.edu/~jure/&lt;/a&gt;)&lt;br /&gt;and Eric Horvitz at Microsoft for their work on social networks within the IM user base, and also Amit Nanavati (&lt;a href="http://domino.research.ibm.com/comm/research_people.nsf/pages/nanavati.index.html"&gt;http://domino.research.ibm.com/comm/research_people.nsf/pages/nanavati.index.html&lt;/a&gt;) et al at IBM India Research Labs for their work on social networks in a US telco regarding churn prediction. Both were kind enough to send me their published papers detailing their work in large scale social network analysis. I’d already completed most of my work, but both of their papers gave me some very informative insights and ideas.&lt;br /&gt;&lt;br /&gt;I'd like to emphasise that my work is significantly simpler in terms of the social analysis computation itself. As much as I would like, I can't afford to investigate whether we have 6.6 degrees of separation or not. Much of the ground breaking work from these researchers involves continuous processing of data for days. Processing is often performed against binary files or flat files using dedicated systems. My data is stored within a terabyte scale data warehouse with hundreds of concurrent demands. Constraints in terms of data warehouse load and computing restrictions mean that my analysis must complete within a practical timeframe. 
In order to 'production-ise' the analysis it must be a series of SQL queries that someone can simply start with the click of a button. I perform data cleaning, summarisation and transformations on several hundred million CDRs (call detail records) and calculate social groups for our customer base in less than 3 hours, entirely in SQL on a Teradata data warehouse. I think that in itself is pretty cool, but consequently I must acknowledge that my social networks are comparatively basic and my analysis does not investigate the attributes of the social networks in as much depth as others have.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Why do this?&lt;br /&gt;&lt;/strong&gt;Working for an Australian telco, in a market with 100% mobile (cell-phone) saturation, the primary focus is customer retention. From a data mining perspective this usually means we examine call usage and, based upon recent customer behaviour, we identify which customers might have problems or want to leave (telcos call this churn, banks often call it attrition). It costs a lot to win a new customer, far less to do something nice to keep an existing customer. The core of my data mining is to use the customer data within an integrated data warehouse to better understand the customer and deliver a service that appears specific to them as an individual. More recently I've tried to focus on communities and using the social fabric surrounding a customer to ensure we better adapt and anticipate customer actions. Hence the need for a social network analysis, a method to identify and investigate the communities that naturally exist within our mobile customer base. 
This is quite different from the standard analysis that focuses on customers as individuals.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;What is it all about?&lt;br /&gt;&lt;/strong&gt;From a customer-focused point of view the theory is that the influences of work colleagues, friends, and family are far stronger and more influential than any messages a company can convey through TV or the media. By identifying influencers and social relationships within our customer base we can more effectively anticipate customer actions such as churn. For example, targeting the leaders of social groups and ensuring they are happy will spread positive viral word-of-mouth effects throughout social groups (which may even include competitors' customers). Being able to even monitor and measure the viral nature of communications with customers is valuable enough.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How do you do it?&lt;br /&gt;&lt;/strong&gt;So, recently I have been working on a project to develop analysis that identifies social groups, leaders, followers, churn risks and similar attributes within our customer base. It’s difficult to give too many details without risk of divulging intellectual property, so please assume any details or numbers I provide are rough estimates only...&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Some Numbers...&lt;/strong&gt;&lt;br /&gt;- Let's suppose we have 4 million mobile customers.&lt;br /&gt;- Suppose average outbound is approx 10 calls per day.&lt;br /&gt;- Suppose average inbound is approx 10 calls per day.&lt;br /&gt;- So, we have approx 80 million rows of data every day.&lt;br /&gt;- The terminating number dialled can vary to include country codes, area codes etc.&lt;br /&gt;- People communicate using voice, sms, picture messaging, and video.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Early Data Manipulation Issues&lt;br /&gt;&lt;/strong&gt;Already you can see a few problems to deal with;&lt;br /&gt;A) A lot of data! 
One week of data alone is over 500 million rows.&lt;br /&gt;B) The same terminating number can be dialled multiple ways (with or without country codes). In order to identify 'who' a customer communicates with we need to 'clean' the number dialled by resolving country codes, area codes etc so that the same number is resolved irrespective of whether country prefixes are used or not. Yes, we have to perform SQL string cleaning functions on all the data in order to resolve all dialled phone numbers. I did this using a conceptually simple but long-winded SQL case statement. It doesn’t actually take long in our data warehouse; we’re talking several minutes, not hours.&lt;br /&gt;&lt;br /&gt;C) Different forms of communication (voice, sms, picture messaging, video).&lt;br /&gt;Once the dialled numbers have been cleaned, summarisation by customer number and dialled recipient can be performed. In our case this summarisation involves calculating totals for calls of different forms of communication. The summarised data is one row per customer vs recipient combination. Numerous columns contain sums regarding different calls.&lt;br /&gt;&lt;br /&gt;D) Calls can be outbound or inbound. Each is distinguished and processed separately at first. String cleaning is also performed to resolve the originating telephone numbers. 
Outbound calls started by our customers are summarised as above, so too are inbound calls received by our customers from any source.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Simple Calling Relationships&lt;br /&gt;&lt;/strong&gt;Once we have both separate (outbound and inbound calls) summarisations complete, then we can join them together (matching recipient telephone number for outbound calls with originating number for inbound calls) to understand if the calling behaviour is reciprocal.&lt;br /&gt;&lt;br /&gt;We could use some business logic to limit the definition of a calling relationship, for example if a customer makes over 5 and receives over 5 calls from the same recipient/originating telephone number. From this point you have a simple framework from which you can rank, transform and manipulate the relationships a specific customer has with recipients. The limiting of call counts can help reduce data, and also ensure one-off calls or uni-directional communication to the local pizza shop doesn’t count…&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Important Legal Stuff…&lt;br /&gt;&lt;/strong&gt;Okay, a quick little important tangent. At this point I’d like to touch on an important topic which is far too often taboo in data mining, especially in the telecommunications industry. When you’ve got the capability to do some analysis you often need to stop and think what you should do (ethically and legally), as opposed to what you can do (technically). As a telco it is possible to get and use customer data for lots of things, but taking action based upon a specific number dialled is illegal in some countries. For example, suppose a customer calls a competitor’s sales number or speaks to a competitor’s tech support line. It may be illegal to track these events and perform some kind of retention activity. It could be an invasion of privacy. It also crosses into anti-competitive issues because other companies don’t have access to the same data. 
I've not done this type of activity.  Still, I know for a fact that some industry analysts do this.&lt;br /&gt;&lt;br /&gt;What I am doing is analysis at this sensitive level, but not reacting to specific telephone numbers. I don’t know (or care) anything about the recipient’s telephone number. I am only interested in how many times it is called, at what time of the day, using voice or SMS calls etc. It’s the nature of the relationship a customer has with a recipient (and their behaviour) that interests me, not necessarily who the recipient is. Understanding and generalising the calling relationships, for example, allows us to build very accurate predictive models that can quantify how likely a customer is to churn based upon recent behaviour of them *and* their closest friends (still sounds ‘Big Brother’ though doesn’t it :)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Formation of Simple Social Networks&lt;br /&gt;&lt;/strong&gt;So, in my analysis I have summarised outbound calls and inbound calls. The next step is to cross-join both summarisations together so that we list all the customers that also called the same recipient and all the recipients that also called the same customers (and yes, recipients can be customers!). That’s one big query, so you might want to reduce the number of recipients or customers by using some business logic of your choice. This is where restrictions to make the processing complete in a practical timeframe really come into play. A true social network wouldn’t reduce the relationship criteria. Maybe you’d put some logic in place whereby you take the top 10 ranked recipients (who each customer calls the most) for each customer. 
This would drastically reduce the complexity of the cross-join, but obviously limit the potential social networks you will discover (at this point Jure may be screaming in agony, and if you are I’m really sorry :)&lt;br /&gt;&lt;br /&gt;The result of such a join would enable you to know which customers (and how many) communicate with any given recipient (who could be your customer or a customer of a competitor). Likewise, we can identify customers that have larger numbers of other customers or 'competitor customers' that call and rate them highly in their social groups. Such individuals can be given classifications as 'leaders', 'bridges' etc.&lt;br /&gt;&lt;br /&gt;It is difficult to avoid going into too much detail, but simply by examining customer churn and attributes such as number of 'competitor customer' friends and any friends that recently churned, we can very accurately predict churn behaviour with a month lead time (even better if we predict just 1 week in advance). In terms of churn, we're talking at least a fivefold increase in churn propensity simply by having social group affinity with another customer that has already recently churned.&lt;br /&gt;&lt;br /&gt;Going forward I will be further analysing these social factors and, time permitting, examine some of the finer customer insights that this type of analysis can highlight.&lt;br /&gt;&lt;br /&gt;If anyone is doing similar stuff, I'd love to chat. 
If you are anywhere near Sydney I'll happily buy the beers!&lt;br /&gt;&lt;br /&gt;- Tim&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6426864074732996917?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6426864074732996917/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6426864074732996917' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6426864074732996917'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6426864074732996917'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/09/social-network-analysis-in-mobile.html' title='Social Network Analysis in mobile networks'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-6835643277845016788</id><published>2008-09-21T18:19:00.000-07:00</published><updated>2008-09-21T18:29:40.189-07:00</updated><title type='text'>My best learning model yet...</title><content type='html'>My wife and I have successfully created a new learning model that should perform really well. At first I don't expect to see any good results and there will be plenty of errors, but eventually it'll learn to solve simple problems intuitively. 
&lt;br /&gt;&lt;br /&gt;The downside is that it'll be by far the most time demanding and expensive model ever developed...&lt;br /&gt;&lt;br /&gt;Here's a picture of our progress after 12 weeks of development time :)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_octsRin8yY0/SNbzlUsCE2I/AAAAAAAAAAc/RPNMI4HPgws/s1600-h/12+week+scan.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5248650238280995682" style="CURSOR: hand" alt="" src="http://4.bp.blogspot.com/_octsRin8yY0/SNbzlUsCE2I/AAAAAAAAAAc/RPNMI4HPgws/s320/12+week+scan.JPG" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-6835643277845016788?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/6835643277845016788/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=6835643277845016788' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6835643277845016788'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/6835643277845016788'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/09/my-best-learning-model-yet.html' title='My best learning model yet...'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_octsRin8yY0/SNbzlUsCE2I/AAAAAAAAAAc/RPNMI4HPgws/s72-c/12+week+scan.JPG' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-824790620554507631</id><published>2008-09-01T17:39:00.000-07:00</published><updated>2008-09-01T18:04:01.735-07:00</updated><title type='text'>Run multiple instances of Clementine 12</title><content type='html'>Something I find very useful...&lt;br /&gt;&lt;br /&gt;One good change to Clementine in recent versions is to allow you to double click on a Clementine stream file and have that stream load in an already open Clementine client application.  In the past double clicking a stream file always resulted in a new Clementine application being started.&lt;br /&gt;&lt;br /&gt;A minor downside to this new feature is that by default you can't open two instances of the Clementine client even if you wanted to.  I sometimes need to do this in order to run some analysis on the server box and other smaller analysis locally on my laptop.&lt;br /&gt;&lt;br /&gt;By adding an additional command flag to the Clementine start command you can force it to open multiple application windows, each of which can be configured to connect to a different server or run locally;&lt;br /&gt;&lt;br /&gt;"C:\Program Files\SPSS\Clementine12.0\bin\clementine.exe" -noshare&lt;br /&gt;&lt;br /&gt;-&gt; add the "-noshare" option to the Clementine application start command.  
&lt;br /&gt;&lt;br /&gt;Then simply clicking on any stream file will still open in the first Clementine client application window that started, but you have the option of opening additional Clementine client applications directly if you want.&lt;br /&gt;&lt;br /&gt;Cheers&lt;br /&gt;&lt;br /&gt;Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-824790620554507631?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/824790620554507631/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=824790620554507631' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/824790620554507631'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/824790620554507631'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/09/run-multiple-instances-of-clementine-12.html' title='Run multiple instances of Clementine 12'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-1248873175734362498</id><published>2008-08-26T23:21:00.000-07:00</published><updated>2008-08-27T00:04:16.929-07:00</updated><title type='text'>101 reasons not to upgrade to Clementine 12</title><content type='html'>Why 101 reasons? &lt;br /&gt;-&gt; because that's close to how many bugs have been added :(&lt;br /&gt;Ok, I'm exaggerating…a little. 
&lt;br /&gt;&lt;br /&gt;We recently 'upgraded' (and I use that term similarly to how a Windows XP user 'upgrades' to Vista) from Clementine 10.1 to version 12.0.2.  &lt;br /&gt;&lt;br /&gt;For what's new in Clementine 12 see;&lt;br /&gt;&lt;a href="http://www.spss.com/clementine/whats_new.htm"&gt;http://www.spss.com/clementine/whats_new.htm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I agree the new version has some really cool features, but after using it for a few weeks now I have also found that it has obviously been released far too early in the development and testing cycle. UI design is not up to the usual highly polished standard, and there are notably more bugs (granted, usually Clementine has *very few* bugs or problem areas).  Clementine is still incredibly stable; the problems I have reported relate to UI design and interface problems. I've yet to find a problem or fault on the data processing side.  Some of the problems are minor but have been around for 4+ years, and it's frustrating that they are still not fixed.  Kinda leaves customers thinking there's little point in providing product feedback at all...   &lt;br /&gt;&lt;br /&gt;It's been a few years since I left SPSS, so I'm not privy to what's going on in Development. In my view Clementine is still easily the best data mining software out there, but SPSS have clearly rushed this release out the door.  &lt;br /&gt;&lt;br /&gt;SPSS have replied to me and appear to be taking my criticism on board.  Fingers crossed that an update in the coming months will resolve the issues I have raised. &lt;br /&gt;&lt;br /&gt;A few of the main points I raised;&lt;br /&gt; - can no longer save the stream whilst data execution is occurring. Nasty loss of feature. I consider this one quite serious. Users should *always* be able to save the stream at any time. &lt;br /&gt; - Partial outer joins no longer auto-tick the first table connected to a merge node if the order of the connected tables is different.  
This is a change in default behaviour (always dangerous), and will affect old streams opened in the new version 12 (so your join condition could be different – beware!)&lt;br /&gt; - new pop-up ‘info’ windows that have no purpose and cannot be turned off.  Really bad UI design, and not akin to Clementine’s usual interface.&lt;br /&gt; - Charts always prompt to be deleted.  Like the pop-up windows, this is a new behaviour and quite annoying.  There is no way to prevent “Are you sure you want to delete this chart” pop-up messages. Didn't they learn from the old version 7.0?  (oldies will remember the "Are you sure you want to exit Clementine" message...)&lt;br /&gt; - Quality Node has gone and there is no replacement functionality.  Sure, just delete something from the software for no good reason…&lt;br /&gt;&lt;br /&gt;Granted, I use Clementine 6 hours a day and am probably going to encounter problems other users wouldn't, but some of these issues are glaringly obvious. &lt;br /&gt;&lt;br /&gt;- Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-1248873175734362498?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/1248873175734362498/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=1248873175734362498' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1248873175734362498'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/1248873175734362498'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/08/101-reasons-not-to-upgrade-to.html' title='101 reasons not to upgrade to Clementine 12'/><author><name>Tim 
Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-4980137879361221088</id><published>2008-08-18T18:15:00.000-07:00</published><updated>2008-08-18T18:53:43.616-07:00</updated><title type='text'>Stratified Sampling in SQL</title><content type='html'>If you use SPSS Clementine as I do, then you are probably familiar with the Balance node. It performs the function of selectively and randomly sampling your data based upon the values of a field or number of fields. Also known as stratified sampling!&lt;br /&gt;&lt;br /&gt;If your data is managed by a data warehouse, then Clementine has this cool behaviour of automatically converting functions into SQL, so the data processing can be performed by the database and less data needs to be extracted and duplicated on another file system.&lt;br /&gt;&lt;br /&gt;Unfortunately the Balance node isn't one of the functions automatically converted into SQL. In order to perform stratified sampling you have to take a different approach and selectively pick the values of your target column/field and sample them individually.&lt;br /&gt;&lt;br /&gt;On KDKEYS.NET I attached one Clementine version 12.0.2 stream (&lt;a href="http://www.kdkeys.net/forums/8229/PostAttachment.aspx"&gt;balance node.str&lt;/a&gt;) as one example of how to do this. 
By using a select condition, followed by a random sample, followed by a union (append) it is possible to easily obtain a stratified sample from a huge dataset efficiently.&lt;br /&gt;&lt;br /&gt;I have also pasted below an example of the type of simple SQL that gets processed;&lt;br /&gt;&lt;br /&gt;SELECT *&lt;br /&gt;FROM (&lt;br /&gt;SELECT *&lt;br /&gt;FROM (&lt;br /&gt;SELECT *&lt;br /&gt;FROM IPSHARE.TMANNS_DRUG4n&lt;br /&gt;WHERE (Drug = 'drugA')&lt;br /&gt;SAMPLE 0.5&lt;br /&gt;) AS TimTemp1&lt;br /&gt;UNION ALL&lt;br /&gt;SELECT *&lt;br /&gt;FROM (&lt;br /&gt;SELECT *&lt;br /&gt;FROM IPSHARE.TMANNS_DRUG4n&lt;br /&gt;WHERE (Drug = 'drugX')&lt;br /&gt;SAMPLE 0.2&lt;br /&gt;) AS TimTemp2&lt;br /&gt;) AS TimTable&lt;br /&gt;;&lt;br /&gt;&lt;br /&gt; - sorry, I couldn't work out how to format the SQL properly in this blog :(&lt;br /&gt;&lt;br /&gt;Cheers&lt;br /&gt;&lt;br /&gt;Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-4980137879361221088?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/4980137879361221088/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=4980137879361221088' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4980137879361221088'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/4980137879361221088'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/08/stratified-sampling-in-sql.html' title='Stratified Sampling in SQL'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-5584397491023993935</id><published>2008-08-13T15:24:00.000-07:00</published><updated>2008-08-13T16:01:06.957-07:00</updated><title type='text'>iPhone update (what bill shock?)</title><content type='html'>An update to my previous post: subsequent monitoring of iPhone users is showing that most are within their data download limits.  Although the new 3G iPhones are showing slightly more data download than their 2G counterparts, it is doubtful that mobile customers are going to be receiving unexpected bills with excess data charges.&lt;br /&gt;&lt;br /&gt;I've resisted getting an iPhone so far, but my colleagues keep tempting me...  &lt;br /&gt;&lt;br /&gt;The &lt;strong&gt;pros&lt;/strong&gt;;&lt;br /&gt; - it has a great user interface.  The scrolling nature and design of the UI is amazing.  The concept of momentum that exists when you scroll through menus and music library is very cool.&lt;br /&gt; - Optimum size.  It's not that small, but yes it has a screen you can actually see.  It fits in your back pocket.&lt;br /&gt; - some versions have decent sized storage (16GB etc) for music and video.&lt;br /&gt; - most importantly it has little apps such as the StarWars LightSaber.  This app uses the momentum / gyro within the iPhone to react as you wave the iPhone around. It sounds just like you have a LightSaber. 
Being able to turn on a Light Saber during a meeting when a colleague makes a dim-witted comment and chop them into pieces is priceless...&lt;br /&gt;They just took this off the apps library, but they will be replacing it with an official one (hopefully still free)&lt;br /&gt;see: http://blog.laptopmag.com/best-most-useless-iphone-application-phonesaber&lt;br /&gt;and: http://macenstein.com/default/archives/1559&lt;br /&gt;&lt;br /&gt;The &lt;strong&gt;cons&lt;/strong&gt;;&lt;br /&gt; - battery life isn't so good.  The screen uses a lot of power.&lt;br /&gt; - battery cannot easily be replaced, can't carry a replacement for emergencies.&lt;br /&gt; - no support for video calls&lt;br /&gt; - no support for picture messaging &lt;br /&gt;&lt;br /&gt;For me long battery life is quite important, and although I could send videos via email etc using the iPhone, I'm surprised it only supports the basic forms of mobile communication (especially considering it's a 3G phone).&lt;br /&gt;&lt;br /&gt;But the LightSaber app is really cool...  
:)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-5584397491023993935?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/5584397491023993935/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=5584397491023993935' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5584397491023993935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5584397491023993935'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/08/iphone-update-what-bill-shock.html' title='iPhone update (what bill shock?)'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-5728978304470589516</id><published>2008-07-21T15:28:00.000-07:00</published><updated>2008-07-21T17:41:47.711-07:00</updated><title type='text'>re iPhone "excessive data charges"</title><content type='html'>Working as a data miner for an Australian telco (I'll try not to pick sides :) I know that Optus have designed plans that will easily cover the iPhone data usage requirements for the majority of customers.&lt;br /&gt;&lt;br /&gt;I agree with the ACCC that there is definitely a possibility this could happen, but at least one telco (the one I work for) is behaving itself and offering generous plans with free data for the first few months to avoid any possibility of bill 
shock. I don't know where the ACCC get their info, but I doubt many of our customers will have 'bill shock'.&lt;br /&gt;&lt;br /&gt;See:&lt;br /&gt;&lt;a href="http://www.news.com.au/couriermail/story/0,23739,24055615-953,00.html"&gt;"ACCC warns about iPhone bill costs, additional charges"&lt;/a&gt;&lt;br /&gt;and also:&lt;br /&gt;&lt;a href="http://www.cnet.com.au/mobilephones/0,239025893,339290741,00.htm"&gt;"ACCC warns 'iPhoners' on bill shock"&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Before we launched the iPhone (last weekend) I did some analysis examining early adopters of the 'old' 2G iPhone.  A simple graph showing data usage of existing iPhone users gave us a rough idea of what the new 3G iPhone might require (assuming the new 3G iPhone and its customers behave the same way...).  The data looked roughly like this:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://bp1.blogger.com/_octsRin8yY0/SIUo4qZwpaI/AAAAAAAAAAM/fLQwfRZ3oRQ/s1600-h/data+usage.JPG"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_octsRin8yY0/SIUo4qZwpaI/AAAAAAAAAAM/fLQwfRZ3oRQ/s320/data+usage.JPG" border="0" alt="" id="BLOGGER_PHOTO_ID_5225627896553448866" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;A proportionally large number of early iPhone adopters were using 100 MB or less a month, but a small percentage of customers used far more (up to a few GB).&lt;br /&gt;&lt;br /&gt;Based on this early analysis, I'm guessing that the data included in the Optus iPhone plans should be sufficient for most new users of the 3G iPhone.  
We'll see...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-5728978304470589516?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://timmanns.blogspot.com/feeds/5728978304470589516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6028114151548461320&amp;postID=5728978304470589516' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5728978304470589516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/5728978304470589516'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/07/re-iphone-excessive-data-charges.html' title='re iPhone &quot;excessive data charges&quot;'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_octsRin8yY0/SIUo4qZwpaI/AAAAAAAAAAM/fLQwfRZ3oRQ/s72-c/data+usage.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6028114151548461320.post-7847986849306415107</id><published>2008-07-17T21:36:00.000-07:00</published><updated>2008-07-17T22:02:38.163-07:00</updated><title type='text'>I finally started a blog...</title><content type='html'>I try to contribute to data mining related forums and blogs, but never got around to writing my own blog...until now! 
&lt;br /&gt;&lt;br /&gt;Since most of my work revolves around data mining (and using Clementine), that will be the focus of my posts, but other topics might creep in. &lt;br /&gt;&lt;br /&gt;It'll be difficult to discuss my work freely because of intellectual property concerns, but I'll try to discuss the data mining problems we face and how we solve them.  My intention is to foster peer review and feedback from other data miners, especially anyone also tackling analysis within terabyte-sized data warehouses.&lt;br /&gt;&lt;br /&gt;Stay tuned, more to follow.&lt;br /&gt;&lt;br /&gt;Tim&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6028114151548461320-7847986849306415107?l=timmanns.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/7847986849306415107'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6028114151548461320/posts/default/7847986849306415107'/><link rel='alternate' type='text/html' href='http://timmanns.blogspot.com/2008/07/i-finally-started-blog.html' title='I finally started a blog...'/><author><name>Tim Manns</name><uri>http://www.blogger.com/profile/17405266346372888597</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry></feed>
