Monday, August 18, 2008

Stratified Sampling in SQL

If you use SPSS Clementine as I do, then you are probably familiar with the Balance node. It selectively and randomly samples your data based upon the values of one or more fields. Also known as stratified sampling!

If your data is managed by a data warehouse, then Clementine has this cool behaviour (called SQL pushback) of automatically converting stream operations into SQL, so the data processing can be performed by the database and less data needs to be extracted and duplicated on another file system.

Unfortunately the Balance node isn't one of the operations automatically converted into SQL. In order to perform stratified sampling you have to take a different approach: selectively pick each value of your target column/field and sample those subsets individually.

On KDKEYS.NET I attached a Clementine version 12.0.2 stream (balance node.str) as an example of how to do this. By using a select condition, followed by a random sample, followed by a union (append), it is possible to obtain a stratified sample from a huge dataset efficiently.

I have also pasted below an example of the type of simple SQL that gets processed:

SELECT *
FROM (
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugA')
        SAMPLE 0.5
    ) AS TimTemp1
    UNION ALL
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugX')
        SAMPLE 0.2
    ) AS TimTemp2
) AS TimTable
;


Cheers

Tim

3 comments:

Anonymous said...

Good tip. I think the "sample" keyword is DB-dependent, but all systems have a random function of some sort. I usually use an "order by" random clause with a row limit.
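
For example, a rough sketch of that approach (random() and LIMIT are PostgreSQL-style spellings; other databases use RAND(), TOP, etc., and I've just reused the table from your post):

-- shuffle the matching drugA rows, then keep the first 100
SELECT *
FROM IPSHARE.TMANNS_DRUG4n
WHERE Drug = 'drugA'
ORDER BY random()
LIMIT 100;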

John Aitchison said...

Hi Tim .. welcome to blogging

I looked at your SQL .. but I am a little confused. As far as I can see you are randomly selecting 50% of drugA users and 20% of drugX users, and appending the two datasets (so that the selected drugA users come before the selected drugX users on the final dataset, which may or may not be a good thing, if true).

The SQL seems a bit difficult to follow - all that nesting - but is it your understanding that the engine would optimize that query so that only a single pass of the original (and huge) dataset would be required?

regards

Tim Manns said...

Hello John,

Pop me a message next time you're near Sydney.

We use Teradata, so yes, the query would be optimised and distributed across the AMPs (Teradata's units of parallelism) of the data warehouse. This way it should be faster than running separate queries or creating separate tables. Because of the distributed nature of Teradata, the SAMPLE function simply returns random rows (from whichever AMP is fastest). By selecting different percentages and appending them together I end up with equal counts of both. For example, if drugA appeared in 2,000 rows and drugX in 5,000, sampling 0.5 and 0.2 respectively would return roughly 1,000 rows of each (counts invented for illustration). Obviously you can edit the percentages so that you get whichever distribution you want.

To avoid any issues of sort order you can sort by a random variable.
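
For example, something like this wrapped around the query from the post (Teradata's RANDOM(lower, upper) function, quoted from memory, so double-check it against your version):

-- same stratified sample as before, but shuffled so the drugA and
-- drugX rows are interleaved rather than appended in two blocks
SELECT TimTable.*, RANDOM(1, 999999999) AS SortKey
FROM (
    SELECT * FROM (
        SELECT * FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugA') SAMPLE 0.5
    ) AS TimTemp1
    UNION ALL
    SELECT * FROM (
        SELECT * FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugX') SAMPLE 0.2
    ) AS TimTemp2
) AS TimTable
ORDER BY SortKey
;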

I use this sampling method to sample my active customers down to a number similar to my churned customers when I'm building predictive models. I always reduce my data and never boost.
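
As a rough sketch (the table and column names here are made up, and SAMPLE 0.05 assumes roughly 20 active customers per churner):

-- keep every churner, plus a 5% random sample of active customers
SELECT * FROM MYDB.CUSTOMERS
WHERE Churned = 1
UNION ALL
SELECT *
FROM (
    SELECT * FROM MYDB.CUSTOMERS
    WHERE Churned = 0
    SAMPLE 0.05
) AS ActiveSample
;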

When I test/validate the predictive model I do not sample in any way and use the natural distribution of the data.

Cheers

Tim