Sunday, November 27, 2011

When is 'Big Data' too big for Analytics?

- Foreword
Apologies for the lack of recent posts. I've been *very* busy on many Data Mining Analytics projects in my role as a Data Mining Consultant for SAS. The content of my work is usually sensitive, so discussing it in any level of detail in public blog posts is difficult.

This specific post is to help promote the launch of the new IAPA website and increase focus on Analytics in Australia (and Sydney, where I am normally based). The topic of this post has been at the forefront of my mind and seems to be a central theme of many of the projects I have been working on recently. It is certainly a current problem for many Marketing/Customer Analytics departments. So here are a few thoughts and comments on 'big data'. Apologies for typos; it was mostly written piecemeal on my iPhone during short 5 minute breaks...


How big is too big (for Analytics)?
I frequently read Analytics blogs and e-magazines that talk about the 'new' explosion of big data. Although I am unconvinced it is new, or that it will improve anytime soon, I do agree that despite technology advances in analytics, the growth of data generation and storage seems to be outpacing most analysts' ability to transform data into information and utilize it to greater benefit (both operationally and analytically). The term 'Analysis Paralysis' has never been so relevant!

But from a practical perspective, what conditions cause data to become unwieldy? For example, take a typical customer services based organisation such as a bank, telco, or public department: how can the data (de)-evolve to a state that makes it 'un-analysable' (what a horrible thought..)? Even given mild (by today's standards) numbers of variables and records, certain practices and conditions can lead to bottlenecks, widespread performance problems, and delays that make any delivery of Analytics very challenging.

So, below is a series of my most recent observations from Analytics projects I have worked on that either encountered 'big data' problems or involved resolving them:

- Scalable Infrastructure.
Data will grow. Fast. In fact it will probably more than double in the next few years. The CPU capacity of data warehousing and analytics servers needs to improve to match.
As an example, I was working on a telco Social Network Analysis project recently where we were processing weekly summaries of mobile telephone calls for approx 18 million individuals. My role was to analyse the social interactions between all customers and build dozens of propensity scores, using the social influence of others to predict behaviour. In total I was probably processing hundreds of millions of records of data (by a dozen or so variables). This was more than the client typically analysed.
After a week of design and preliminary work I began to consider ways to optimise the performance of my queries and computations, and I asked about the server specifications. I assumed some big server with dozens of processors, but unfortunately what I was connecting to was a dual core 4GB desktop PC under an Analyst's desk...
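To put rough numbers on that mismatch, here is a back-of-envelope sketch in Python. The row count (300 million), the 12 columns, and the 8 bytes per value are illustrative assumptions standing in for "hundreds of millions of records by a dozen or so variables"; only the 4GB of memory comes from the story above.

```python
def table_size_gb(rows: int, columns: int, bytes_per_value: int = 8) -> float:
    """Approximate size of a dense numeric table in gigabytes (10**9 bytes)."""
    return rows * columns * bytes_per_value / 1e9


# Assumed figures for illustration only, not the client's real numbers.
ASSUMED_ROWS = 300_000_000
ASSUMED_COLUMNS = 12
DESKTOP_RAM_GB = 4  # the desktop PC described in the post

size_gb = table_size_gb(ASSUMED_ROWS, ASSUMED_COLUMNS)
print(f"approx table size: {size_gb:.1f} GB vs {DESKTOP_RAM_GB} GB of RAM")
```

Under these assumptions the raw table alone is several times larger than the machine's memory, before any joins or intermediate results are counted.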

- Variable Transformations
A common mistake by inexperienced data miners is to ignore or short-cut comprehensive data preparation steps. All data that involves analysis of people is certain to include unusual characteristics. One person's outlier is another's screw-up :)
So, what is the best way to account for outliers, skewed distributions, poor data sparsity, or highly likely erroneous data features? Well, an approach (that I am not keen on) taken by some is to apply several variable transformations indiscriminately to all 'raw' variables and subsequently let a variable selection process pick the best input variables for propensity modeling etc. When combined with data which represents transposed time series (so one variable represents a value in 'month1', the next variable the same value dimension in 'month2', etc) this can easily generate in excess of 20,000 variables (by say 10 million customers...). It is true there are variable selection methods that handle 20,000 variables quite well, but the metadata and processing to create those datasets is often significant and the whole process often incurs excessive costs in terms of time to delivery of results.
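The arithmetic behind that explosion can be sketched in a few lines of Python. The counts of base measures, months, and transformations below are hypothetical, chosen only to show how quickly the total passes 20,000; the 10 million customers figure is the one mentioned above.

```python
def wide_variable_count(base_measures: int, months: int, transformations: int) -> int:
    """Count columns when every measure is transposed into one column per month
    and every monthly column is kept raw plus once per transformation."""
    monthly_columns = base_measures * months
    return monthly_columns * (1 + transformations)


# Hypothetical inputs for illustration only.
ASSUMED_BASE_MEASURES = 140
ASSUMED_MONTHS = 24
ASSUMED_TRANSFORMATIONS = 5  # e.g. log, square root, binning, capping, standardising
CUSTOMERS = 10_000_000

total_variables = wide_variable_count(
    ASSUMED_BASE_MEASURES, ASSUMED_MONTHS, ASSUMED_TRANSFORMATIONS
)
total_cells = total_variables * CUSTOMERS
print(f"{total_variables:,} variables, {total_cells:,} cells")
```

Even a modest set of raw measures multiplies into a dataset with hundreds of billions of cells once months and transformations are crossed.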
An additional problem that may arise when you start working with many thousands of variables is that variable names need to be easily understood and interpretable. The last thing a data miner wants to do is spend hours working out what those transformed and selected important variables in the propensity model actually mean and represent in the raw data.
Which leads me to my next point..

- Variable / Data Understanding
One of the core skills of a good data miner is the ability to understand and translate complex data in order to solve business problems.
As organisations obtain more data it is not just about more records; often the data reveals new subtle operational details and customer behaviours not previously known, or completely new sources of data (Facebook, social chat, location based services etc). This in turn often requires extended knowledge of the business and operational systems to enable the correct data warehouse values or variable manipulations and selections to be made.
An analyst is expected to understand most parts of an organisation's data at a level of detail most individuals in the organisation are not concerned with, and this is often a monumental task.
As an example of 'big data' bad practice, I've encountered verbose variable names which immediately require truncation (due to IT / variable name limit reasons), others which make understanding the value or meaning of the variable difficult, and naming conventions which are undocumented. For example: "number_of_broken_promises" is one of the funniest long variable names I've seen, whilst others such as "ccxs_ytdspd_m1_pct" can be guessed when you have the business context but definitely require detailed documentation or a key.
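A simple automated check can catch both problems early. The following is a minimal Python sketch, assuming a hypothetical name-length limit of 20 characters and a hypothetical data dictionary in which only one of the two names is documented; neither reflects any real system's settings.

```python
def audit_variable_names(names, documented, max_length):
    """Return (too_long, undocumented) lists for a collection of variable names."""
    too_long = [n for n in names if len(n) > max_length]
    undocumented = [n for n in names if n not in documented]
    return too_long, undocumented


# The two names quoted in the post; everything else here is hypothetical.
variable_names = ["number_of_broken_promises", "ccxs_ytdspd_m1_pct"]
documented_names = {"number_of_broken_promises"}  # assumed data dictionary contents
ASSUMED_MAX_LENGTH = 20  # assumed platform limit, for illustration

too_long, undocumented = audit_variable_names(
    variable_names, documented_names, ASSUMED_MAX_LENGTH
)
print("would be truncated:", too_long)
print("missing from data dictionary:", undocumented)
```

Run against the full variable list before modelling starts, a check like this turns an hours-long decoding exercise into a short list of names to fix or document.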

- Diverse Skillsets
'Big data' often requires big warehouse and analytics systems (see point 1), so an analyst must properly understand how these systems work.
Through personal experience I'm always aware of table indexes on a Teradata system, for example. By default the first column in a warehouse table will be the index, so if you incorrectly use a poorly managed or repetitive variable such as 'gender' or 'end_date' then the technology of a big data system works against you. I've seen this type of user error on temp tables or analytics output tables far too many times. Big Data often involves bringing together information from a greater number of sources, so understanding the source systems and data warehouse involved is an important challenge.
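Why a repetitive index column hurts can be shown with a toy simulation. Teradata distributes rows across its parallel processing units by hashing the primary index value; the Python sketch below mimics that idea with MD5 and 8 units, which are stand-in assumptions and not Teradata's actual hash function or configuration. The customer ids are also made up.

```python
import hashlib
from collections import Counter


def unit_for(value: str, n_units: int) -> int:
    """Map an index value to a processing unit via a deterministic hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_units


def rows_per_unit(index_values, n_units: int) -> list:
    """Count how many rows land on each unit for the given index values."""
    counts = Counter(unit_for(v, n_units) for v in index_values)
    return [counts.get(u, 0) for u in range(n_units)]


N_UNITS = 8      # assumed number of parallel units
N_ROWS = 10_000  # assumed table size

# Index on a two-value column: every row hashes to one of at most two units.
by_gender = rows_per_unit(("F" if i % 2 else "M" for i in range(N_ROWS)), N_UNITS)
# Index on a unique id: rows spread across all units.
by_customer = rows_per_unit((f"cust{i}" for i in range(N_ROWS)), N_UNITS)

print("gender as index:     ", by_gender)
print("customer id as index:", by_customer)
```

With 'gender' as the index, most units sit idle while one or two do all the work, which is exactly how the parallel design ends up working against you.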

I hope this helps. I strongly recommend getting involved with the IAPA and Sydney Data Miner's Meetup if you are based in Australia or Sydney.
 - Tim

62 comments:

Pritish said...

Good information, Tim. Understanding your business/organization data is a really critical thing in today's world. Moreover, the more you know about your data, the better your data models are!

Keep posting your experience.

Thanks,
Pritish

hunterdong said...

Can I still ask you questions about SPSS Clementine (although you use SAS now)?

When building predictive models from imbalanced data (say 1% response rate or 5% churn rate), I can use the "Weight" function in CHAID/CR&T, or re-balance the data using the Balance node.

Is there a difference between weight function/Balance node?

Is 50%/50% a good re-balance for most models?

Also, would you use the whole dataset (say 1M records of this 1% sparse data), or only use a sample? How big, if you choose a sample? I met an "all records in training data have the same value for target" error and I guess it's related to the size of the data.

Thanks in advance

hunterdong said...

The "all records in training data have the same value for target" error I met was caused by a manually specified range in the Type node for the weight field.

Tim Manns said...

Hi Hunterdong,

Most questions are the same for any data mining application :)

50/50% is mostly a threshold that is used in scoring to define the true/false binary outcome. The model itself usually 'works' with very imbalanced data. You just need to create your own threshold to decide true or false outcomes.

For decision trees I would strongly recommend using a weighting variable. This is because having imbalanced data will actually affect the tree splits (which are often also a source of data understanding) and performance of the generated model.

For Neural networks I prefer to use a decimal variable as the target and instead use the natural balance as the threshold for the yes/no or true/false distinction. For more information see a previous discussion I had with Dean Abbott a few years ago;
http://timmanns.blogspot.com.au/2009/11/building-neural-networks-on-unbalanced.html

Dean discusses the topic here;
http://abbottanalytics.blogspot.com.au/2009/11/stratified-sampling-vs-posterior.html


I haven't tested this approach for CART, but maybe you can try instead using a decimal number as the target (C5 will not support a numeric target variable).

Hope that helps

Tim

hunterdong said...

Hi Tim

When the T/F result is a 90/10 split, I found most trees tend to just predict one value (a simple way to get 90% accuracy!).

Using cost of mis-classification didn't help much.

Under what circumstances would you predict a numeric value? I just started playing with the KDD98 data, which may require numeric evaluation (value of responders).

One more question, for churn model or DM respond model, would you use CHAID or Logistic Regression?

Thanks

Tim Manns said...

Don't use T/F, instead use 0.0 or 1.0 as the target variable (and make sure the variable is a numeric type).

Then the predicted value generated by the model will be a raw number between 0.0 and 1.0.

Then simply use an expression to create "T" where predicted value > 0.9, else "F", to get your string binary outcome if it is needed.
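As a minimal Python sketch of that last step: the scores below are made-up model outputs, and the 0.9 threshold is the one mentioned above (it is a parameter, so it can be set to whatever the natural balance of the data suggests).

```python
def label_from_score(score: float, threshold: float = 0.9) -> str:
    """Convert a raw predicted value between 0.0 and 1.0 into a "T"/"F" outcome."""
    return "T" if score > threshold else "F"


# Hypothetical predicted values from a model trained on a 0.0/1.0 numeric target.
predicted_values = [0.97, 0.42, 0.91, 0.90]

labels = [label_from_score(score) for score in predicted_values]
print(labels)
```

Because the comparison is strict, a score exactly equal to the threshold is labelled "F".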

Ann Hathaway said...

Scarcely a day passes by where you don't see a headline about "Big Data" and how analysis of this big data is going to lead to huge efficiencies, targeted marketing and large profits.

Anonymous said...

I use daily time series for most of my variables.
But many other time series are at a different frequency (quarterly, monthly...)

How can I convert, for example, a quarterly time series to daily?

Many thanks.
jma
my email:
jma at gmail dot com

Anonymous said...

Sorry, my email:
manlop at mail dot pt

not the gmail in last message above.

fullstackanalytics said...

Hello Tim! Great article and very insightful as well. You’ve described everything clearly. Could you help me in finding the best big data online courses?

App Developers in Delhi said...

I want to say that this post is awesome, nice written.

gaga said...

Thanks for sharing

vcube said...

I appreciate you sharing this useful information. I hope a large number of people learn about this and find it useful. And kindly keep updating in this manner.