M.Sc. CDA - Data Mining Assignments


These assignments are for Dr. Lingras' part of the course. They must be submitted through Brightspace. Any assignments that are emailed will be deleted without being read.

The other instructors, Sreejata Chatterjee (text mining) and Chris Malloy (deep learning), will give you separate assignments and submission instructions.

Your assignments will not be marked until you have submitted the non-disclosure agreement (NDA). There is a submission link on Brightspace for the NDA. Please do not email your NDA.

Unless otherwise stated, please submit a single PDF or text file. Archives (zip, rar, tar.gz, etc.) and word processor files (doc, docx, odf, etc.) will not be accepted.


Assignment #1: Clustering of customers and products

Cluster the customers and products for the retail dataset.

We want to cluster customers based on:

We want to cluster products based on:

As part of the process, we may find outliers, which may lead to some data cleaning. You should submit a report of your findings that may be beneficial to the business owner in understanding the nature of customers. The report should also briefly outline the data cleaning and preparation.

Suggested detailed algorithm and report contents:

  1. Make sure that your data does not have values that are too high or too low. You can find these values with the help of the max and min functions in a spreadsheet. You could also look at the distribution of each attribute through R or a spreadsheet. If you choose to delete some records, please describe how you found the outliers and what their values were.
  2. Once all the values are in reasonable ranges, normalize them so that no values are larger than 100. You have to be careful with such a normalization.
  3. Experiment with different numbers of clusters using simple K-means in R. I would suggest using 5, 10, and 15 as the number of clusters. Clustering may help you further identify outlying objects. If you go further and find the optimal number of clusters, that will be even better, especially if you can justify your answer.
  4. When you are presenting the results, use the centroid data produced by R. You may also plot the cluster distribution by varying the X and Y values in R. Show only the graphs that are most meaningful.
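
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration in Python rather than R, and it is not the prescribed method: the 1.5 × IQR fence is one common convention for flagging extreme values, and the `revenue` numbers and the helpers `iqr_bounds` and `scale_to_100` are made up for the example. It also hints at why the normalization needs care: with min-max scaling, a single extreme value left in the data squeezes every other value toward zero.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences; values outside them are candidate outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def scale_to_100(values):
    """Min-max scale a list of numbers into the range 0..100."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to spread out
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]

# Hypothetical revenue column with one suspicious value.
revenue = [80.0, 95.0, 101.0, 110.0, 120.0, 88.0, 105.0, 9000.0]

low, high = iqr_bounds(revenue)
outliers = [v for v in revenue if v < low or v > high]
kept = [v for v in revenue if low <= v <= high]

# Scaling with the outlier still present compresses the ordinary values.
scaled_with_outlier = scale_to_100(revenue)
# Scaling after removing it uses the whole 0..100 range.
scaled = scale_to_100(kept)

print("range:", min(revenue), max(revenue))
print("candidate outliers:", outliers)
print("largest ordinary value, outlier kept:",
      round(max(v for v in scaled_with_outlier if v < 100.0), 2))
print("largest ordinary value, outlier removed:", max(scaled))
```

Whether a flagged value is really an error or a genuine large customer is a judgement call that the report should justify.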
Link to a sample session of clustering using revenue and the number of baskets in which the products appear. The session was captured as it progressed, so it may include some commands that were wrong and were subsequently corrected.
Please use this sample session as a guideline instead of following it literally. Using your judgement instead of blindly following instructions is an important trait. Creativity with justification is encouraged.
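
To illustrate the idea behind step 3, here is a small sketch that runs K-means for several values of k and compares the total within-cluster sum of squares. It is written in plain Python under stated assumptions, not taken from the sample session: the `kmeans` function is a bare-bones Lloyd's algorithm with a fixed random seed, and the points are synthetic blobs, not the retail data. In R, the built-in `kmeans` function reports a comparable quantity as `tot.withinss`.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, within-cluster SS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as starting centroids
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss

# Synthetic data: three blobs, already on a 0..100 scale.
rng = random.Random(42)
centres = [(10, 10), (50, 80), (90, 20)]
points = [(rng.gauss(cx, 5), rng.gauss(cy, 5)) for cx, cy in centres for _ in range(30)]

wss_by_k = {}
for k in (1, 2, 3, 4, 5):
    centroids, labels, wss_by_k[k] = kmeans(points, k)
    print(k, round(wss_by_k[k], 1))
```

The within-cluster sum of squares generally keeps falling as k grows, so the value alone does not pick k. One common heuristic is to look for the k after which the improvement flattens out, and to check that the resulting centroids are interpretable for the business owner.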

Note for submission: Submit a PDF file (no word processor files) describing your data processing efforts and analysis. Add the programs, R commands, and queries in an appendix.


Assignment #2: Classification with car data

Use different classification algorithms and tune their parameters to predict the acceptability of a car using the car data (this is a link to the car data).

Here is a sample classification session on sonar data.

Please use this sample session as a guideline instead of following it literally. Using your judgement instead of blindly following instructions is an important trait. Creativity with justification is encouraged.

If you just follow the minimal steps from the sample session and write a report that explains your understanding of each of the classifications using meaningful figures and tables, you will be marked out of 80 (since all the computing steps are already written down for you). Please note that the sample session may not have all the evaluations for all the models. The sample session is for a dataset with two classes. There are some notes for working with multiclass problems. To be marked out of 100, you may want to experiment with the various parameters of the rpart, randomForest, and caret packages and support your choices with a thorough analysis.
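
Because the sample session covers only two classes, it may help to see how evaluation extends to several classes. The sketch below, in plain Python, counts (actual, predicted) pairs and derives overall accuracy plus per-class precision and recall. The label names and the `actual` and `predicted` vectors are placeholders assumed for illustration; they are not results from any model on the car data.

```python
from collections import Counter

def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall)} from paired label lists."""
    counts = Counter(zip(actual, predicted))  # confusion-matrix cells
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for lab in labels:
        tp = counts[(lab, lab)]
        fp = sum(counts[(a, lab)] for a in labels if a != lab)
        fn = sum(counts[(lab, p)] for p in labels if p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[lab] = (precision, recall)
    return metrics

# Hypothetical labels for eight test cars.
actual    = ["unacc", "unacc", "unacc", "unacc", "acc", "acc",   "good", "vgood"]
predicted = ["unacc", "unacc", "unacc", "acc",   "acc", "unacc", "acc",  "vgood"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
metrics = per_class_metrics(actual, predicted)

print("accuracy:", accuracy)
for lab, (precision, recall) in metrics.items():
    print(lab, round(precision, 2), round(recall, 2))
```

When one class dominates the data, overall accuracy can look good even though the rare classes are never predicted correctly, which is why per-class figures are worth reporting alongside it.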

Note for submission: Submit a PDF file (no word processor files) describing your data processing efforts and analysis.


Assignment #3: Association

Run association mining on the SimplyCast.com data and write a report on meaningful associations at both the user level and the session level. Association mining can be a computationally intensive process.

  • Sample SQL and R code (follow the link).
  • Please use this sample session as a guideline instead of following it literally. Using your judgement instead of blindly following instructions is an important trait. Creativity with justification is encouraged.
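
The difference between user-level and session-level associations comes down to how the transactions are formed. The following sketch, in plain Python, groups the same event log once by session and once by user, then computes support, confidence, and lift for one rule. Everything in it is an assumption made for illustration: the `(user_id, session_id, feature)` layout, the feature names, and the helper functions do not come from the SimplyCast.com data or the linked sample code.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, session_id, feature_used).
events = [
    ("u1", "s1", "email"), ("u1", "s1", "survey"),
    ("u1", "s2", "sms"),
    ("u2", "s3", "email"), ("u2", "s3", "survey"),
    ("u2", "s4", "email"),
    ("u3", "s5", "sms"), ("u3", "s5", "email"),
]

def baskets(events, key_index):
    """Group features into one set per user (key_index=0) or session (key_index=1)."""
    grouped = defaultdict(set)
    for event in events:
        grouped[event[key_index]].add(event[2])
    return list(grouped.values())

def rule_stats(transactions, lhs, rhs):
    """Support, confidence, and lift for the single-item rule lhs -> rhs."""
    n = len(transactions)
    n_lhs = sum(lhs in t for t in transactions)
    n_rhs = sum(rhs in t for t in transactions)
    n_both = sum(lhs in t and rhs in t for t in transactions)
    support = n_both / n
    confidence = n_both / n_lhs if n_lhs else 0.0
    lift = confidence / (n_rhs / n) if n_rhs else 0.0
    return support, confidence, lift

session_stats = rule_stats(baskets(events, 1), "email", "survey")
user_stats = rule_stats(baskets(events, 0), "email", "survey")

print("session level:", session_stats)
print("user level:   ", user_stats)
```

In this toy log the rule has a lift above 1 at the session level but exactly 1 at the user level, so the same rule can look meaningful at one level and uninformative at the other. Counting every itemset this way grows quickly with the number of items, which is one reason association mining can be computationally intensive.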

Note for submission: Submit a PDF file (no word processor files) describing your data processing efforts and analysis.


Assignment #4: Prediction

This assignment is replaced by the Statistics Canada Competition.