M.Sc. CDA - Data Mining Assignments
These assignments are for Dr. Lingras' part of the course. They must be submitted through Brightspace. Any assignments that are emailed will be deleted without being read.
The other instructors, Sreejata Chatterjee (text mining) and Chris Malloy (deep learning), will give you separate assignments and submission instructions.
Your assignments will not be marked until you have submitted the non-disclosure agreement (NDA).
There is a submission link on Brightspace for the NDA. Please do not email your NDA.
Unless otherwise stated, please submit a single pdf or text file. Archives (zip, rar, tar.gz, etc.) or word processor files (doc, docx, odf, etc.) will not be accepted.
Assignment #1: Clustering of customers and products
Cluster the customers and products for the retail dataset.
We want to cluster customers based on:
- number of products bought
- number of distinct products bought
- revenues
- number of visits
- additional attributes that you think are appropriate
We want to cluster products based on:
- number of distinct customers who buy the product
- revenues
- number of visits in which the product is bought
- additional attributes that you think are appropriate
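As a rough illustration of how the attributes listed above could be derived, the following R sketch aggregates a small made-up transaction table into one row per customer and one row per product. The table `sales`, its column names, and the helper `per_group` are assumptions for illustration only; the real retail dataset will have its own schema. The sketch also assumes that "number of products bought" means the total quantity purchased.

```r
# Toy line-item data. Column names and values are hypothetical,
# not the schema of the actual retail dataset.
sales <- data.frame(
  customer_id = c("c1", "c1", "c1", "c2", "c2", "c3"),
  visit_id    = c("v1", "v1", "v2", "v3", "v3", "v4"),
  product_id  = c("p1", "p2", "p1", "p1", "p3", "p2"),
  quantity    = c(2, 1, 1, 5, 1, 3),
  revenue     = c(10, 4, 5, 25, 7, 12)
)

# Apply f to the values of each group; returns a vector named by group.
per_group  <- function(values, groups, f) sapply(split(values, groups), f)
n_distinct <- function(x) length(unique(x))

# One row per customer.
customers <- data.frame(
  n_products          = per_group(sales$quantity,   sales$customer_id, sum),
  n_distinct_products = per_group(sales$product_id, sales$customer_id, n_distinct),
  revenue             = per_group(sales$revenue,    sales$customer_id, sum),
  n_visits            = per_group(sales$visit_id,   sales$customer_id, n_distinct)
)

# One row per product.
products <- data.frame(
  n_customers = per_group(sales$customer_id, sales$product_id, n_distinct),
  revenue     = per_group(sales$revenue,     sales$product_id, sum),
  n_visits    = per_group(sales$visit_id,    sales$product_id, n_distinct)
)

print(customers)
print(products)
```

The same aggregation could equally be done with SQL GROUP BY queries before loading the data into R.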
As part of the process, we may find outliers, which may lead to some data cleaning.
You should submit a report of your findings that
may be beneficial to the business owner in understanding
the nature of customers.
The report should also briefly outline the data
cleaning and preparation.
Suggested detailed algorithm and report contents:
- Make sure that your data does not have values that are
too high or too low. You can find these values with the
help of the max and min functions in a spreadsheet. You could
also look at the distribution of each attribute in R
or a spreadsheet.
If you choose to delete some records, please describe
how you found the outliers and what their values were.
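- Once all the values are in reasonable ranges, normalize
them so that no values are larger than 100.
You have to be careful with such a normalization: for example, any extreme value
left in the data will compress the remaining values into a narrow range.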
- Experiment with different numbers of clusters using simple K-means
in R. I would suggest using 5, 10, and 15 clusters.
Clustering may help you further identify outlying objects.
If you go further and find the optimal number of clusters,
that will be even better, especially if you can justify your
answer.
- When you are presenting the results, use the centroid
data produced by R. You may also plot the cluster
distribution by varying the X and Y attributes in R.
Show only the most meaningful graphs.
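The steps above can be sketched in R roughly as follows. This is only an outline under stated assumptions, not the expected solution: the data frame is synthetic, the column names are made up, the "Q3 + 3 * IQR" outlier cut-off is just one rule of thumb, and min-max scaling is one of several ways to keep values at or below 100.

```r
set.seed(42)  # k-means uses random starting centres; fix the seed for repeatable runs

# Synthetic stand-in for the customer table (hypothetical columns and values).
# The last row is a deliberately planted outlier.
customers <- data.frame(
  revenue  = c(rnorm(50, 200, 30), rnorm(50, 800, 60), 50000),
  n_visits = c(rnorm(50, 5, 1), rnorm(50, 20, 3), 2)
)

# Step 1: inspect the ranges, then flag suspiciously large revenues.
# The cut-off below is an assumed rule of thumb, not a requirement.
print(summary(customers))
limit    <- quantile(customers$revenue, 0.75) + 3 * IQR(customers$revenue)
outliers <- customers[customers$revenue > limit, ]
clean    <- customers[customers$revenue <= limit, ]

# Step 2: min-max scale every attribute to the range 0..100.
scale_to_100 <- function(x) 100 * (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(clean, scale_to_100))

# Step 3: simple k-means with the suggested numbers of clusters.
fits <- lapply(c(5, 10, 15), function(k) {
  kmeans(scaled, centers = k, nstart = 20, iter.max = 50)
})
names(fits) <- c("k5", "k10", "k15")

# Total within-cluster sum of squares over a range of k ("elbow" plot),
# one common aid for arguing about the number of clusters.
ks  <- 2:10
wss <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})
plot(ks, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")

# Step 4: centroids (in scaled units), cluster sizes, and one scatter plot.
fit <- fits[["k5"]]
print(fit$centers)
print(fit$size)
plot(clean$revenue, clean$n_visits, col = fit$cluster,
     xlab = "Revenue", ylab = "Number of visits")
```

Running the scaling step before removing the planted outlier shows why care is needed: almost every scaled revenue would then fall close to 0.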
Link to a sample session of clustering using revenue and the number of baskets in which the products appear. The session is captured as it progressed, so it may include some commands that were wrong and were subsequently corrected.
Please use this sample session as a guideline instead of following it literally.
Using your judgement instead of blindly following instructions is
an important trait.
Creativity with justification is encouraged.
Note for submission: Submit a PDF file (no word processor files)
describing your data processing efforts and analysis.
Add the programs, R commands, and queries in an appendix.
Assignment #2: Classification with car data
Use different classification algorithms and tune the parameters to predict the acceptability of a car using the
car data (this is a link to the car data).
Please use this sample session as a guideline instead of following it literally.
Using your judgement instead of blindly following instructions is
an important trait.
Creativity with justification is encouraged.
If you just follow the minimal steps from the sample session and write a report that explains your understanding
of each classification method using meaningful figures and tables, for example:
- a drawing of part of the tree
- a list of some of the rules
- accuracy, recall, and AUC
you will be marked out of 80 (since all the computing steps are already written down for you).
Please note that the sample session may not have all the evaluations for all the models. The sample session is for a dataset
with two classes. There are some notes for working with multiclass problems.
In order to be marked out of 100, you should experiment with the various parameters of the rpart, randomForest, and caret packages
and provide a thorough analysis of the results.
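As one possible starting point for parameter experiments on a multiclass problem, the sketch below fits rpart trees over a small grid of `cp` and `minsplit` values and reports accuracy, a confusion matrix, and per-class recall. It uses the built-in iris data as a three-class stand-in, not the car data, and the grid values and the helper `evaluate` are arbitrary choices made for illustration.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(1)

# Stand-in data: built-in iris (3 classes), NOT the car data.
# Stratified split: 15 test rows drawn from each class.
test_idx <- unname(unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(rows) sample(rows, 15)
)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit one tree and evaluate it on the test rows.
evaluate <- function(cp, minsplit) {
  fit  <- rpart(Species ~ ., data = train, method = "class",
                control = rpart.control(cp = cp, minsplit = minsplit))
  pred <- predict(fit, newdata = test, type = "class")
  cm   <- table(actual = test$Species, predicted = pred)
  list(confusion = cm,
       accuracy  = sum(diag(cm)) / sum(cm),
       recall    = diag(cm) / rowSums(cm))  # one recall value per class
}

# Small, arbitrary grid of tree parameters.
grid <- expand.grid(cp = c(0.001, 0.01, 0.1), minsplit = c(5, 20))
grid$accuracy <- mapply(
  function(cp, minsplit) evaluate(cp, minsplit)$accuracy,
  grid$cp, grid$minsplit
)
print(grid)

# Details for one configuration from the grid.
result <- evaluate(cp = 0.01, minsplit = 20)
print(result$confusion)
print(round(result$recall, 2))
```

Comparing parameter settings on the test rows, as done here for brevity, gives optimistic results; a more careful analysis would compare them with cross-validation or a separate validation set.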
Note for submission: Submit a PDF file (no word processor files)
describing your data processing efforts and analysis.
Assignment #3: Association
Run association mining on the SimplyCast.com data and
write a report on meaningful associations at both the user level and the session level.
Association can be a computationally intensive process.
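To make the difference between the two levels concrete, here is a small base-R sketch that computes support, confidence, and lift for a single rule, first with one basket per session and then with one basket per user. The `events` table, its column names, the item names, and the helper `rule_stats` are all made up for illustration and do not reflect the SimplyCast.com schema. A dedicated package such as arules would normally do the actual mining over all candidate rules.

```r
# Toy usage log. Columns and values are hypothetical.
events <- data.frame(
  user_id    = c("u1", "u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"),
  session_id = c("s1", "s1", "s2", "s2", "s3", "s3", "s4", "s5", "s5"),
  item       = c("email", "survey", "email", "sms", "email", "survey",
                 "sms", "email", "survey")
)

# Support, confidence, and lift of the rule {lhs} => {rhs},
# where `baskets` is a list of character vectors (one per basket).
rule_stats <- function(baskets, lhs, rhs) {
  has_lhs <- sapply(baskets, function(b) lhs %in% b)
  has_rhs <- sapply(baskets, function(b) rhs %in% b)
  both    <- has_lhs & has_rhs
  support    <- sum(both) / length(baskets)
  confidence <- sum(both) / sum(has_lhs)
  lift       <- confidence / mean(has_rhs)
  c(support = support, confidence = confidence, lift = lift)
}

# Session level: one basket per session.
session_baskets <- split(events$item, events$session_id)
session_rule    <- rule_stats(session_baskets, "email", "survey")

# User level: one basket per user, pooling all of that user's sessions.
user_baskets <- split(events$item, events$user_id)
user_rule    <- rule_stats(user_baskets, "email", "survey")

print(round(session_rule, 2))
print(round(user_rule, 2))
```

On this toy data the rule has support 0.6, confidence 0.75, and lift 1.25 at the session level, but lift 1 at the user level, because every user has used both items at some point. The same rule can therefore look interesting at one level and uninformative at the other.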
Sample SQL and R code (follow the link).
Please use this sample session as a guideline instead of following it literally.
Using your judgement instead of blindly following instructions is
an important trait.
Creativity with justification is encouraged.
Note for submission: Submit a PDF file (no word processor files)
describing your data processing efforts and analysis.
Assignment #4: Prediction
This assignment is replaced by the Statistics Canada competition.