The following is a sample session of clustering using revenue and number of baskets in which the products appear. The session is captured as it progressed. It may include some of the commands that were wrong and were subsequently corrected. Extract the revenue and number of baskets from the database. create table productRevenueBasket as SELECT `ITEM_SK`,sum(`SELLING_RETAIL_AMT`) as revenue, count(distinct `TRANSACTION_RK`) as baskets from dataset01.sales219 group by `ITEM_SK` Choose top 2000 products. create table productcluster as SELECT * FROM `productRevenueBasket` ORDER BY `productRevenueBasket`.`revenue` DESC limit 0,2000 Download the productcluster.csv and cluster using the following R command: prod<-read.csv("productcluster.csv",header=F) prod summary(prod) km = kmeans(prod[,2:3],5,150) km$centers km$size km = kmeans(prod[2:nrow(prod),2:3],5,150) km$size summary(prod) summary(prod[2:2000,]) summary(prod[9:2000,]) summary(prod[10:2000,]) nprod=prod nprod[,2]=nprod[,2]/mean(nprod[,2]) nprod[,2]=nprod[,3]/mean(nprod[,3]) summary(nprod) nprod=prod nprod[,2]=nprod[,2]/mean(nprod[,2]) nprod[,3]=nprod[,3]/mean(nprod[,3]) summary(nprod) km = kmeans(nprod[,2:3],5,150) km$center km$size source("clusterAccessories.r") withinSSrange(nprod[,2:3], 3, 15, 150) Here, the clusterAccessories.r file is as follows: withinSSrange <- function(data,low,high,maxIter) { withinss = array(0, dim=c(high-low+1)); for(i in low:high) { withinss[i-low+1] <- kmeans(data, i, maxIter)$tot.withinss } withinss } # We have decided that the numbers of clusters is 6 nkm=kmeans(nprod[,2:3],6,150) nkm$centers nkm$size # Denormalize the centers realCenters = nkm$centers # Note that 1 in realCenters matches 2 in nprod realCenters[,1]=mean(prod[,2])*realCenters[,1] realCenters[,2]=mean(prod[,3])*realCenters[,2] realCenters #column bind the columns clusteredProd=cbind(prod,nkm$cluster) #Make sure that the binding is proper clusteredProd[1:20,] nkm$cluster[1:20] #write the clustered products out write(t(clusteredProd),file="clusteredProducts.csv",sep=',',ncolumns=4) Import clusteredProducts.csv into your database. Join it with dataset01.items. SELECT p . * , i.ITEM_ENG_DESC AS name FROM `clusteredProducts` AS p, dataset01.items AS i WHERE p.`ITEM_SK` = i.`ITEM_SK` Use the realCenters, clusteredProducts and the product names to write profiles for each cluster.