Thursday, January 8, 2009

Isn't In-database processing old news yet?

A bit of a Clementine plug, but hear me out...

I'm puzzled by a few recent articles I've read describing in-database processing: the practice of doing very sophisticated data warehouse analysis (let's call it data mining :) on large amounts of data without having to extract the data into an external analytics tool (for example, SPSS or SAS).

As an example, see the current Teradata Magazine article: http://www.teradata.com/tdmo/v08n04/Features/OnTheHorizon.aspx
I was fortunate enough to spend a few evenings chatting with Stephen Brobst (chief technology officer of Teradata) about these topics during a Teradata conference in Beijing in June 2008*. I think he's right on the money concerning his top 4 predictions for data warehousing. As a data miner I am concerned with how I might be expected to analyse the data, so in-database processing is the biggest topic for me. I'm not so sure it is a 'future' thing though. In my view it's here now, just maybe not so widespread. My only guess is that it's another plug for the SAS partnership. Although I don't use SAS, I do like the thinking and the development plan going forward. I simply don't think it's the new concept it's being touted as, and I'm sure a data miner or data analyst doesn't need custom plug-ins (and the corresponding expense) to reach in-database data mining nirvana.

In-database processing is nothing new. I've been doing it using SPSS Clementine and Teradata for a few years now; Clementine has supported this functionality for quite a few years. On the fly, Clementine converts the stream (a graphical icon-based proprietary query file) into SQL (structured query language) and submits the SQL query (or queries) to the data warehouse. Any computation that cannot be represented as SQL causes a data extraction and further processing by the Clementine analysis engine itself (commonly Clementine Server on a dedicated analytical box does this, keeping the data and temp files on the server file system rather than on the desktop). In practice I usually avoid heavy statistical functions, so almost all my data processing is performed in the Teradata warehouse and only the data sample required to build a predictive or clustering model is extracted. When it comes to scoring the created data mining model (such as a neural network or decision tree), Clementine also transparently converts the model into SQL for truly high-scale processing on the data warehouse.
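To give a feel for what that conversion produces, here is a hand-written sketch (not Clementine's actual generated SQL; the table and column names are invented purely for illustration) of how a small churn decision tree might be scored as a single CASE expression that runs entirely in the warehouse:

    -- A tiny churn decision tree expressed as plain SQL, so the whole
    -- scoring pass runs inside the warehouse (illustrative names only).
    SELECT
        customer_id,
        CASE
            WHEN tenure_months < 6 AND calls_to_helpdesk >= 3      THEN 'HIGH_CHURN_RISK'
            WHEN tenure_months < 6                                  THEN 'MEDIUM_CHURN_RISK'
            WHEN monthly_spend < 20 AND contract_type = 'PREPAID'   THEN 'MEDIUM_CHURN_RISK'
            ELSE 'LOW_CHURN_RISK'
        END AS churn_segment
    FROM customer_summary;

The point is that the scoring itself never leaves the warehouse; only the resulting segments (or aggregates of them) need to come back out.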

The real advantage comes not just from being able to score existing data mining models, but from being able to build predictive models entirely in the data warehouse, and this is a comparatively new development (only a couple of years old). It's not something I do much of (I've done it for fun on my home SQL Server, but not in a corporate production environment). If the data warehouse provides the capability to create neural networks or clustering models (which some now do), then there is no need to ever extract data from the data warehouse into an external analysis application such as SPSS or SAS. More data can therefore be used to build models, and that usually beats tweaking algorithm options.
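You don't necessarily need a vendor algorithm to get a feel for what in-database model building means. As a toy sketch (nothing to do with any particular vendor's implementation; the table and column names are invented), the 'training' step of a simple naive-Bayes-style model is just a set of GROUP BY aggregations, which is exactly the kind of work a warehouse is built for:

    -- Build a conditional probability table for one predictor, entirely in
    -- the warehouse. Exact CREATE TABLE AS syntax varies by database;
    -- training_data and its columns are illustrative only.
    CREATE TABLE nb_contract_type AS
    SELECT
        contract_type,
        churned,
        COUNT(*) AS n,
        CAST(COUNT(*) AS FLOAT)
            / SUM(COUNT(*)) OVER (PARTITION BY churned) AS p_given_class
    FROM training_data
    GROUP BY contract_type, churned;

Repeat for each predictor and the 'model' is just a few more tables sitting in the warehouse; scoring then becomes a handful of joins.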

Data warehouses have actually supported embedded code and custom functions (which might include a data mining algorithm) for quite a while. For more on this, see a recent post by Seth Grimes titled "In-Database Analytics: A Passing Lane for Complex Analysis": http://www.intelligententerprise.com/info_centers/data_int/showArticle.jhtml?articleID=212500351&cid=RSSfeed_IE_News

Only in the past year or so have full-blown embedded data mining algorithms taken off. The question, though, is whether these algorithms will always run fast(er). Custom code can be good or bad! One advantage of the 'algorithms converted into SQL' route is that the data warehouse can quite easily determine how to process and prioritise the SQL query, and optimise specifically for it. Custom code and embedded data mining algorithms can also be optimised, but I'm guessing that requires far more effort (and expense!). Another worry is that custom code brings dangers and risks (not to mention the testing and support issues for IT and the DB admin). Still, it's necessary for full in-database model building capability.
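To make that concrete: because a model scored as SQL is just another query, the warehouse can plan, parallelise and prioritise it with its normal optimiser. For example (illustrative only, reusing the made-up churn scoring query from earlier; EXPLAIN syntax and output differ between databases):

    -- Ask the warehouse how it would execute the scoring query; it is
    -- planned and parallelised like any other piece of SQL.
    EXPLAIN
    SELECT customer_id,
           CASE WHEN tenure_months < 6 THEN 'HIGH_CHURN_RISK'
                ELSE 'LOW_CHURN_RISK'
           END AS churn_segment
    FROM customer_summary;

A custom embedded algorithm, by contrast, is largely a black box to that optimiser, which is part of the trade-off described above.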

OK, I'm guessing some of my peers might know this stuff anyway, but one thing has recently occurred to me:
- once we have data processing, model building and model scoring all occurring within the data warehouse, what need have we for the data mining tool?

My preference is for an easy tool that makes querying the data warehouse and constructing highly complex analyses straightforward. My queries could not possibly be prepared by hand, since they are often transformed into many thousands of lines of SQL code; I use a clever user interface to make the logic of the analysis possible to follow. The data mining tool I use is, primarily, a tool that optimises my interaction with the data warehouse.

So, considering these things, my current view is that data mining applications/tools such as SPSS Clementine (or even SAS :) will stick around for quite a bit longer, because the user interface optimises a data miner's ability to query the data warehouse and perform data mining efficiently. But maybe in a few years we will start to see what we commonly refer to as data mining 'algorithms' being developed for data warehouses rather than for data mining tools (or simply as plug-ins for data warehouses). An interesting thought indeed!

- Tim


* Whilst at the Teradata User Conference in Beijing I presented some data mining analysis work, mostly my work on churn prediction and product upsell in telecommunications, and chatted to the China Mobile analysts afterwards. On a more personal note, that is also when my soon-to-be-born son was conceived. Don't worry, my wife was with me at the time! In true Hollywood fashion I thought it appropriate we therefore name him 'Beijing' or 'Teradata', but my wife doesn't share my enthusiasm :)

5 comments:

Anonymous said...

Tim, I agree the capability has been around for a while but never caught on. I wrote in my article about mid-'90s object-relational DBMSes. I used in-DB programmability myself with Illustra and Informix ORDBMSes, but I was a DBA/database programmer, not a business analyst. What's new is 1) exploitation of DBMS parallelization, 2) easier programmability, 3) availability of in-database *applications* and analytical software modules (R, SAS, etc.) and libraries and not just coding hooks, and, of course, 4) visibility.

Tim Manns said...

Hello Seth,

I didn't know you were a viewer :)

I loved your post btw!

1) I agree, and am at that stage. There are great benefits to be had with advanced analytics (including neural nets etc.) processed on massively parallel DBs. I use Teradata and am always in awe of the performance.

2) SQL is easy, and tools that convert statistical functions into some SQL-ish format would be cool. I don't want to learn yet another language (SAS = argh!).

3) Thanks for the clarification. Maybe I'm still not on the same page or fully understanding the concept, but what I am unconvinced of is the added benefit of "in-database *applications* and analytical software modules". They are exciting new features and I like the idea, but I'm thinking a lot of data mining can be successfully achieved through relatively simple SQL and libraries (or extensions to SQL). At the moment I'm at the 'data mining as SQL' stage, and I think the extra effort and expense of analytical software modules is unlikely to provide much additional ROI; I simply can't see a business case justifying the effort or costs.
I think a big advantage would be in moving from back-office analytics performed and presented by people like me, to more front line and real-time analytics used in every interaction with the customer (or whatever). But I see this as a shift of focus in the use of existing technology, and not as a result of new technology.

Sandro Saitta said...

Hi Tim,

Very interesting blog post. This is a very nice view from the business side of data mining (rather than the technical side, which I'm more used to).

I'm looking forward to reading your next posts.

Anonymous said...

While in-database mining capabilities have been around for a while in Oracle, SQL Server, etc., perhaps the current trend is more about scaling to the next generation of MPP databases like those listed in Seth's article?

Wayne Thompson said...

Hi Tim,

Enjoyed your post. Here are some papers about SAS in-Teradata processing that may be of interest to you:
http://support.sas.com/resources/papers/proceedings09/337-2009.pdf

http://support.sas.com/resources/papers/sgf09/339-2009.pdf

Coding in SAS is easy. It would be great to meet you at Teradata Partners this fall for a demonstration and to discuss your in-database data mining needs further.

Wayne Thompson
SAS Analytics Product Manager.