Monday, January 19, 2009

re: "Thoughts on Understanding Neural Networks"

Great post by Gordon Linoff on the Data Blog about visualising Neural Networks.

I usually get better predictive success using neural nets, but the lack of explainability is always a downside. I'm always keen to see ways that might help explain or interpret a Neural Network. A few years ago I tried a simple graphical way to show a Neural Net, but I think Gordon's recent post highlights better options.

My example is written in VB.NET and parses a single-hidden-layer Neural Network saved as PMML into data grids. Once the neurons and weights are loaded into the data grids, I read them back and create a graphic of the Neural Net. Transparency is used to show the strength of each weight, whilst colour (blue or red) is used to show the negative or positive effect. You can view my example at;
and download the source code, executable and example PMML directly from here;
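
For readers who want to try the idea without the VB.NET application, here is a minimal Python sketch of the same approach. It is not the code from the download above. The PMML fragment is a made-up, heavily trimmed example, the function names are hypothetical, and the choice of red for positive and blue for negative weights is an assumption. It reads the `Neuron` and `Con` elements and turns each weight into a colour plus a transparency value scaled by the largest absolute weight.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML fragment (not the file from the post).
PMML = """
<PMML>
  <NeuralNetwork>
    <NeuralLayer>
      <Neuron id="h1"><Con from="in1" weight="1.5"/><Con from="in2" weight="-0.5"/></Neuron>
      <Neuron id="h2"><Con from="in1" weight="-3.0"/><Con from="in2" weight="0.75"/></Neuron>
    </NeuralLayer>
  </NeuralNetwork>
</PMML>
"""


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}Neuron'."""
    return tag.rsplit("}", 1)[-1]


def read_connections(pmml_text):
    """Return (from_id, to_id, weight) for every Con element under a Neuron."""
    root = ET.fromstring(pmml_text.strip())
    conns = []
    for neuron in root.iter():
        if local(neuron.tag) != "Neuron":
            continue
        for con in neuron:
            if local(con.tag) == "Con":
                conns.append(
                    (con.get("from"), neuron.get("id"), float(con.get("weight")))
                )
    return conns


def edge_style(weight, max_abs):
    """Colour encodes the sign (assumed: red positive, blue negative);
    alpha encodes strength relative to the largest absolute weight."""
    colour = "red" if weight >= 0 else "blue"
    alpha = abs(weight) / max_abs if max_abs else 0.0
    return colour, round(alpha, 2)


def styled_edges(conns):
    """Attach a (colour, alpha) pair to every connection."""
    max_abs = max((abs(w) for _, _, w in conns), default=0.0)
    return [(src, dst, *edge_style(w, max_abs)) for src, dst, w in conns]


if __name__ == "__main__":
    for edge in styled_edges(read_connections(PMML)):
        print(edge)
```

Each resulting tuple can be handed to whatever drawing library you prefer, one line per connection, with the alpha value used as the line's opacity.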

I can't post images as comments on Gordon's blog so below are two snapshots of the simple UI application that displays the Neural Net PMML graphic.

a) graphic

b) PMML loaded into data grids


- Tim

Thursday, January 8, 2009

Isn't In-database processing old news yet?

A bit of a Clementine plug, but hear me through...

I'm puzzled by a few recent articles I've read describing in-database processing, the practice of doing very sophisticated data warehouse analysis (let's call it data mining :) on large amounts of data without having to extract the data into an external analytics tool (for example, a tool like SPSS or SAS).

As an example see the current Teradata magazine article. I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) on these topics during a Teradata conference in Beijing in June 2008 *. I think he's right on the money concerning his top 4 predictions for data warehousing. As a Data Miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I'm not so sure it is a 'future' thing though. In my view it's here now, just maybe not so widespread. My only guess is that it's another plug for the SAS partnership. Although I don't use SAS I do like the thinking and development plan going forward. I simply don't think it's the new concept it's being touted as. I'm sure it's not necessary for a data miner or data analyst to need custom plug-ins (and corresponding expense) to reach in-database data mining nirvana.

In-database processing is nothing new. I've been doing it using SPSS Clementine and Teradata for a few years now, and SPSS Clementine has supported this functionality for quite a few years. In real time Clementine will convert the stream (a graphical icon-based proprietary query file) into SQL (structured query language) and submit the SQL query(s) to the data warehouse. Any computation that cannot be represented as SQL will cause a data extraction and further processing by the Clementine analysis engine itself (commonly the Clementine Server on a dedicated analytical server box will do this, keeping the data and temp files on the server file system, not the desktop). In practice I usually avoid heavy statistical functions, so all my data processing is usually performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. When it comes to scoring the created data mining model (such as a neural network or decision tree) Clementine also converts the data mining model into SQL transparently for truly high scale processing on the data warehouse.
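
To make the 'model converted into SQL' idea concrete, here is a small Python sketch that turns a toy decision tree into a nested SQL CASE expression and scores a table without pulling the rows out. This is only an illustration of the general technique, not how Clementine generates its SQL. The tree, the column names, the table and the function names are all made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical decision tree as nested dicts; a leaf holds a score.
TREE = {
    "column": "calls_per_month", "threshold": 10,
    "below": {"score": 0.8},
    "at_or_above": {
        "column": "tenure_months", "threshold": 24,
        "below": {"score": 0.4},
        "at_or_above": {"score": 0.1},
    },
}


def tree_to_sql(node):
    """Recursively render the tree as a nested SQL CASE expression."""
    if "score" in node:
        return str(node["score"])
    return "CASE WHEN {col} < {thr} THEN {lo} ELSE {hi} END".format(
        col=node["column"],
        thr=node["threshold"],
        lo=tree_to_sql(node["below"]),
        hi=tree_to_sql(node["at_or_above"]),
    )


def score_in_database(conn, table, tree):
    """Score every row inside the database; only the scores come back."""
    sql = (
        f"SELECT customer_id, {tree_to_sql(tree)} AS churn_score "
        f"FROM {table} ORDER BY customer_id"
    )
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, calls_per_month REAL, tenure_months REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, 5, 30), (2, 20, 12), (3, 20, 36)],
    )
    print(tree_to_sql(TREE))
    print(score_in_database(conn, "customers", TREE))
```

Note that in this sketch a NULL in a split column fails the `<` test and falls through to the ELSE branch. A real tool would need an explicit rule for missing values.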

The real advantage comes from not just being able to score existing data mining models, but also being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (a couple of years). Not something I do much of (I've done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do) then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and this usually beats tweaking algorithm options.

The data warehouses have actually supported embedded code and adding custom functions that might include a data mining algorithm for quite a while. For info see a recent post by Seth Grimes titled "In-Database Analytics: A Passing Lane for Complex Analysis";

Only in the past year or so have full blown embedded data mining algorithms taken off. The question, though, is whether these algorithms will always run fast(er). Custom code can be good or bad! One advantage of the 'algorithms converted into SQL' route is that the data warehouse can quite easily determine and control how to process and prioritise the SQL query and be optimised specifically for it. Custom code and embedded data mining algorithms can also be optimised, but I'm guessing that requires far more effort (and expense!). One worry is also that custom code brings dangers and risks (not to mention the testing and issues for IT and the DB admin). Still, it's necessary for in-database data mining model building capability.

Ok, I'm guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me;
- considering that once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?

My preference is for an easy tool that makes querying the data warehouse and constructing highly complex analysis easy. My queries could not possibly be prepared by hand since they are often transformed into many thousands of lines of SQL code. I use a clever user interface to make understanding the logic of the analysis possible. The data mining tool I use primarily is a tool that optimises my interaction with the data warehouse.

So considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS :) will stick around for quite a bit longer because the user interface optimises a data miner's ability to query the data warehouse and perform data mining efficiently, but maybe in a few years we will start to see what we commonly refer to as data mining 'algorithms' being developed for data warehouses and no longer in data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!

- Tim

* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my data analysis work on churn prediction and product upsell in telecommunications, and chatted to the China Mobile analysts afterward. On a more personal note, that is also when my soon-to-be-born son was conceived. Don't worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him 'Beijing' or 'Teradata' but my wife doesn't share my enthusiasm :)

Tuesday, January 6, 2009

book review "Data Preparation For Data Mining"

Just before Christmas I bought myself yet another data mining book (I have a few dozen). This one somehow slipped by me for 10 years but I'm glad I finally stumbled upon it. Originally published in 1999, Dorian Pyle wrote "Data Preparation For Data Mining" when Data Mining was less widespread and 'Predictive Analytics' wasn't the buzzword it is today.

The only criticisms I could possibly raise are;
1) that everything has a statistical basis.
- For example one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don’t work so I often use SQL to convert variables into 100 percentile bins (where each bin has the same number of rows (customers) in it). That type of insight isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and useful insights.
2) no mention of SQL or step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.
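
As a rough illustration of the binning by count mentioned under criticism 1, here is a minimal Python sketch. It is not the SQL used in the post (many SQL dialects offer a similar result through the standard NTILE window function). The function name and the sample data are made up. Rows are ranked by value and the ranks are cut into bins holding the same number of rows, so even extremely skewed values end up evenly spread.

```python
def percentile_bins(values, n_bins=100):
    """Assign each value a bin number from 1 to n_bins so that every bin
    holds (as nearly as possible) the same number of rows."""
    n = len(values)
    # Row indices ordered by value; sorted() is stable, so tied values
    # keep their original order and may be split across adjacent bins.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n + 1
    return bins


if __name__ == "__main__":
    # Hypothetical, heavily skewed usage figures for ten customers.
    usage = [1, 2, 3, 4, 1000, 5, 6, 7, 8, 100000]
    print(percentile_bins(usage, n_bins=5))
```

The two extreme customers simply land in the top bin alongside nothing else unusual, which is the point: the bin number depends only on rank, not on how far out the value is.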

So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. It's definitely a book I will refer to, and well worth buying.

In February 2004 Dorian Pyle made an interesting post about things to avoid when data mining;
"This Way Failure Lies"

- Tim