Tuesday, July 21, 2009

Books on my desk...

Over the years I have purchased a few data mining, machine learning, and even statistics books. I'll confess that I haven't read every book page by page; in fact, some I've only speed-read, hoping to catch some interesting highlights.

Below is a summary and short review of the books that are sitting on my desk at work...

- from left to right;

- Marketing Calculator (author Guy Powell)
I got a free copy because I contributed to some of the industry examples. I'm even quoted in it! I found the book very useful and would recommend it for any marketing analyst. It talks about ROI and measuring every type of marketing event and customer interaction. Lots of case studies, which I always like. No detail in terms of data analysis itself, but plenty of food for ideas.

- Advances In Knowledge Discovery And Data Mining (editors Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy)
I first bought this for Rakesh Agrawal's article on Association Rules (Apriori in Clementine), but also found John Elder's Statistical Perspective on Knowledge Discovery very informative. It provides a great concise history of data mining.

- Data Mining Using SAS Applications (author George Fernandez)
I bought this hoping to get a different opinion or learn something new (compared to the SPSS Clementine User Guide, which I have far too much experience of...). I thought maybe SAS analysts had a better way to do a specific type of data handling, or followed an alternative thought pattern to accomplish a goal. Sadly, I was disappointed. Like many data mining books, it spends hundreds of pages describing algorithms and expert options for refining your model building, and fewer than 10 pages on data transformations and/or data cleaning. Those 10 pages are well written, though. Not worth the purchase in my view. I only hope SAS analysts have better books out there.

- Data Mining Techniques (authors Michael Berry and Gordon Linoff)
Being written by practitioners means a lot. This is the one book I often re-read just in case I missed something the previous time :) Maybe it's because it is very applicable to my role as an analyst in the marketing dept of a telecommunications carrier, but I find this book invaluable. Lots of case studies. 100 pages of background and practical tips before it even reaches 'algorithms' is good in my view, and when you do reach the algorithms, they are described very well in practical terms as techniques (rather than as a laborious stats class, and I didn't do stats at University). I find the whole book a joy to read. A must for every data miner.

- Data Mining. A Tutorial Based Primer (authors Richard Roiger and Michael Geatz)
Whilst going through a phase of keen hobby programming in VB.NET I tried my hand at writing a neural net, decision tree etc. from scratch. I found this book really helpful since it goes through every detail a programmer would need to implement their own data mining code in Excel. I work with huge amounts of data, so the thought of doing data mining in Excel makes me giggle (maybe that's a bad thing...), but the principles of data manipulation, cleaning, prediction etc. can easily be applied in Excel. If you really want to understand how algorithms work and build your own, then this book is very useful for that purpose.

- Data Mining. Introductory and Advanced Topics
If you did spend several years studying mathematics or statistics then this book would probably act as a great reference and reminder of how algorithms work.
It's very academic, and sometimes that's useful. I think there's one line in there somewhere that mentions data cleaning or data transformations as being an industry thing... It is also quite a hard, heavy book, so could be useful to rest stuff on.

- Data Mining. Practical Machine Learning Tools and Techniques (Ian Witten and Eibe Frank)
This is a classic example of bait advertising that some authors should be jailed for. On page 52 of this book the authors write:
"Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process. Although this book is not really about the problems of data preparation, we want to give you a feeling for the issues involved...." Fuck me, it's not a data mining book then, is it? Not only that, they actually use the term "Practical" in the title. Clearly it is not practical at all if it involves absolutely zero data manipulation. If I ever meet one of these authors I will slap them in the face and demand my money back... Oh, and over half the book is a damn Weka user guide.

- The Elements Of Statistical Learning. Data Mining, Inference and Prediction (authors Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Very heavy on the stats and squiggly equations (which take me ages to make sense of) but quite well written because I usually manage to understand it. Explains the algorithms stuff very well. I don't refer to it much and only read a few chapters in depth, but it was worth the purchase.

- The Science Of Superheroes (authors Lois Gresh and Robert Weinberg).
Not everything is about data mining. There's a whole world out there, and just maybe it includes superheroes with laser beams shooting out of their eyes. It's a soft-core science book discussing concepts such as faster-than-light speed, cosmic rays, genetically engineered hulks, flying without wings, and black holes, and how it all relates to real-life superheroes (if they existed). Really good geeky material.

- Data Preparation For Data Mining (author Dorian Pyle)
A good book, and like "Data Mining Techniques" it clearly covers topics with a practical understanding (no 'real-world' case studies though). Where it differs is that this book has a stronger academic or statistics focus. I didn't get a sense that the examples would always relate to large real-world data sets, and many methods I use were not mentioned at all (for example frequency binning) because they have no statistical basis. Here's the problem: this is a great data mining book, but only for the statistics in practical data mining. It is a book I frequently refer to and would recommend, although I'd like to see stuff added that *isn't* based on statistics.
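
For readers who haven't met frequency binning: the review doesn't spell out the exact method, so the following is only a minimal sketch of one common reading, equal-frequency binning, where values are ranked and split into groups of roughly equal size. The function name and the spend figures are made up for illustration, and tied values may land in different bins in this simple version.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so bins hold roughly equal counts."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    n = len(values)
    # Rank positions by value; sorted() is stable, so ties keep their input order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Scale the rank (0..n-1) into a bin index (0..n_bins-1).
        bins[i] = rank * n_bins // n
    return bins


# Hypothetical monthly spend per customer (illustrative values only).
monthly_spend = [5, 120, 33, 8, 61, 47, 210, 15]
spend_bins = equal_frequency_bins(monthly_spend, 4)
print(spend_bins)  # two customers per bin, lowest spenders in bin 0
```

The appeal of this kind of binning is that it needs no distributional assumptions: every bin gets about the same number of records regardless of how skewed the raw values are.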

- The Essence Of Databases (author F. Rolland)
101 database for dummies. It describes database schemas, relational concepts, tons of SQL examples for queries and data transformations, describes object oriented databases etc.
Essential stuff for anyone querying a corporate data warehouse. It reads easily and is recommended.

- Data Mining. Concepts, Models, Methods, and Algorithms (author Mehmed Kantardzic)
Another 'list all the algorithms I know' book. I'll be honest; I only quickly flicked through it hoping to see some case studies or something new. It seemed good, but didn't seem to have anything to set it apart from any other algorithms book.

- Statistics Explained. Basic concepts and methods (authors R. Kapadia and G. Andersson)
Just in case I forget what a t-test is. Has lots of pictures :)

- Clementine User Guides (author: many at SPSS; if memory serves me, Clay Helberg did a fair chunk of it)
When I was at SPSS I had a small part to play in these. I provided some examples and proofread where possible. I've been using Clementine daily for over a decade, but still refer to the user guides occasionally. I find them useful, but they could benefit from some new examples to take advantage of the many new features that have been added in recent years.

- Enjoy!


Themos Kalafatis said...


Thanks for the list. DM Techniques is a must; I guess it can easily be seen that the authors have actually done DM in real-world applications. One small disagreement, though, is about the book written by Witten et al. Data quality and problem representation are very important for achieving good results (no doubt about that); however, knowing how the algorithms work (and this is what this book focuses on in a very good way IMHO) is also very important for increasing the usefulness of models.

Gerard said...

Thanks for the bibliography. As an intelligent layperson end user, I would really like to see a better Clem (now PASW) application manual, especially instructions on "hooking" into external programs.


Adam said...

Thanks for the list. This field is relatively new to me (but not SQL, normal stats techniques etc.)

I've found the Data Mining Techniques book very useful!

Michael J. A. Berry said...

How cool to see our book on your shelf, but you should really get the second edition. It's not green, which I regret, but we got much more experience between editions and I hope it shows. -Michael