Blog by Tim Manns (data mining blog)

Tim Manns

If you use SPSS Clementine as I do, then you are probably familiar with the Balance node. It randomly samples your data at different rates depending on the values of one or more fields. In other words, stratified sampling!
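To make the idea concrete, here is a minimal in-memory sketch in Python. It is not how Clementine implements the Balance node; the function name `balance`, the record layout and the keep probabilities are all assumptions chosen for illustration. Each record is kept with a probability that depends on its value of the chosen field, so the resulting counts are approximate rather than exact.

```python
import random
from collections import Counter

def balance(records, field, factors, seed=None):
    """Keep each record with the probability listed for its value of `field`.

    `factors` maps a field value to a keep probability between 0 and 1.
    Values that have no entry in `factors` are kept in full.
    """
    rng = random.Random(seed)
    kept = []
    for rec in records:
        p = factors.get(rec[field], 1.0)
        if rng.random() < p:
            kept.append(rec)
    return kept

# Hypothetical data: two common classes and one rare class.
rows = (
    [{"Drug": "drugA"}] * 1000
    + [{"Drug": "drugX"}] * 1000
    + [{"Drug": "drugY"}] * 100
)

sample = balance(rows, "Drug", {"drugA": 0.5, "drugX": 0.2}, seed=42)
counts = Counter(r["Drug"] for r in sample)
print(counts)  # roughly half of drugA, a fifth of drugX, all of drugY
```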

If your data is managed by a data warehouse, then Clementine has this cool behaviour of automatically converting functions into SQL, so the data processing can be performed by the database and less data needs to be extracted and duplicated on another file system.

Unfortunately, the Balance node isn't one of the functions that gets automatically converted into SQL. To perform stratified sampling in the database you have to take a different approach: select each value of your target column/field separately and sample each one individually.

On KDKEYS.NET I have attached a Clementine version 12.0.2 stream (balance node.str) as one example of how to do this. By using a select condition, followed by a random sample, followed by a union (append), you can obtain a stratified sample from a huge dataset easily and efficiently.

I have also pasted below an example of the type of simple SQL that gets processed:

SELECT *
FROM (
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugA')
        SAMPLE 0.5
    ) AS TimTemp1
    UNION ALL
    SELECT *
    FROM (
        SELECT *
        FROM IPSHARE.TMANNS_DRUG4n
        WHERE (Drug = 'drugX')
        SAMPLE 0.2
    ) AS TimTemp2
) AS TimTable
;
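If you have more than a couple of values to sample, writing each branch by hand gets tedious. The following is a rough Python sketch that builds the same select, sample and union pattern as the query above from a mapping of value to sampling fraction. The helper name `stratified_sample_sql` is my own invention, and it assumes your database accepts the same SAMPLE syntax as the query above. Table and field names are pasted straight into the text, so only use it with names you trust.

```python
def stratified_sample_sql(table, field, fractions):
    """Build one SELECT ... SAMPLE branch per value and join them with UNION ALL.

    `fractions` maps each value of `field` to the fraction of rows to keep.
    """
    branches = []
    for i, (value, fraction) in enumerate(fractions.items(), start=1):
        literal = value.replace("'", "''")  # double any single quotes in the value
        branches.append(
            "SELECT *\n"
            "FROM (\n"
            "SELECT *\n"
            f"FROM {table}\n"
            f"WHERE ({field} = '{literal}')\n"
            f"SAMPLE {fraction}\n"
            f") AS TimTemp{i}"
        )
    body = "\nUNION ALL\n".join(branches)
    return f"SELECT *\nFROM (\n{body}\n) AS TimTable\n;"

sql = stratified_sample_sql(
    "IPSHARE.TMANNS_DRUG4n",
    "Drug",
    {"drugA": 0.5, "drugX": 0.2},
)
print(sql)
```

Adding a third value is then just one more entry in the dictionary, which produces one more UNION ALL branch.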

Cheers

Tim


Monday, August 18, 2008

Stratified Sampling in SQL
