Tuesday, January 6, 2009

Book review: "Data Preparation For Data Mining"

Just before Christmas I bought myself yet another data mining book (I have a few dozen). This one somehow slipped by me for 10 years, but I'm glad I finally stumbled upon it. Dorian Pyle wrote "Data Preparation For Data Mining", originally published in 1999, when data mining was less widespread and 'Predictive Analytics' wasn't the buzzword it is today.

The only criticisms I could possibly raise are:
1) that everything has a statistical basis.
- For example one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don’t work so I often use SQL to convert variables into 100 percentile bins (where each bin has the same number of rows (customers) in it). That type of insight isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and useful insights.
2) no mention of SQL or step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.
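
For readers who want to try the binning by count mentioned in point 1, here is a minimal sketch in Python rather than the SQL I actually use. The function name percentile_bins and the sample usage figures are made up for illustration. It is similar in spirit to the NTILE(100) window function available in many SQL databases, though the exact handling of leftover rows may differ.

```python
def percentile_bins(values, n_bins=100):
    """Assign each value a bin number from 1 to n_bins so that every bin
    holds (as near as possible) the same number of rows.

    Bins are based purely on sort order, so extreme outliers do not
    stretch the scale. Note that tied values can fall into different bins.
    """
    n = len(values)
    # Indices of the rows, ordered by their value (smallest first).
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        # Map the 0-based rank onto 1..n_bins in equal-sized steps.
        bins[idx] = rank * n_bins // n + 1
    return bins


# Hypothetical, heavily skewed usage figures for ten customers.
usage = [0, 1, 1, 2, 3, 5, 8, 40, 900, 25000]
print(percentile_bins(usage, n_bins=5))
```

With ten rows and five bins, each bin receives two customers, and the customer with 25000 simply lands in the top bin alongside the one with 900.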

So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. It's definitely a book I will refer to, and well worth buying.

In February 2004 Dorian Pyle made an interesting post about things to avoid when data mining:
"This Way Failure Lies" http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328

- Tim


Sandro Saitta said...

I also just bought the book. Seems really interesting. Thanks for the post and the link (by the way, the date is February 2004).

Tim Manns said...

Oops! Thanks for the correction! Yes, I mentioned the date because it was quite old and put the wrong year by mistake - doh!

I'm still looking for any data manipulation books, so if you see any (good ones) out there please let me know.

Themos Kalafatis said...


Try also "Exploratory Data Mining and Data Cleaning" ISBN-10: 0471268518, which is the next best book (after Dorian Pyle's) I have read regarding data preparation and data cleaning.

Anonymous said...

Which are your favorite Data Mining books? Ken

Tim Manns said...

- Themos
Thanks for the recommendation. Amazon will be posting it to me soon!

- Ken
Good suggestion. I will shortly list my favourite data mining books and who each would be most suitable for.


Brian said...

Hey Tim,

When you say you create a percentile for skewed distributions, are you replacing the number of the variable with its percentile rank (a number 1 through 100)?

Tim Manns said...

Hello Brian,

Yes. I especially find using simple frequency bins very useful when building neural networks.

The outliers I encounter are often thousands of times higher than the mode.

Because much of my analysis is at a granularity of 'people' (i.e. customers), converting the raw business numbers into 'top 10% of customers' etc. lends itself very well to charts and descriptive displays.