CISC873 Data Mining

This course is offered in Fall 2017. Classes are Wednesdays from 2:00-6:00 in Goodwin 521.

Notes from class 1

Notes from class 2

Data mining builds inductive models from data. Almost all organisations, and many individuals, accumulate data from their interactions, and can use this data to improve service, and sometimes profit. Some examples:

The algorithms used for data mining must be efficient, because of the huge volumes of data that have to be examined, and sophisticated, because the benefit of an extracted concept depends heavily on how subtle it is.

This course is a project course. We will examine a number of datasets, with each participant using a particular technique to investigate each dataset and see what structure the technique discovers. You will have a chance to try several different techniques during the course.

Good working knowledge of standard software environments is required, especially the ability to develop scripts and plot data (e.g. Excel, Matlab, Open GL + Perl, Python, Awk). Some elementary knowledge of statistics and probability is required.

The course is an applications token for the Ph.D. programme.

Each participant will use two techniques during the course: the first for datasets 1 and 2, and the second for dataset 3. You will choose your combination in the early weeks of the term. Each combination includes one supervised technique and one unsupervised technique. Here are some possible combinations:

Assessment will be based partly on performance in class (quality of results, and quality of presentation and discussion). Marks will be generated using input from all class participants. There may be a take-home exam at the end of term in which you will be given a dataset and asked to report on what you can find out about it. The exam would be worth 30% of your mark.

A possible reference for this course is: Hand, Mannila, Smyth, "Principles of Data Mining", MIT Press, ISBN 026208290X. You may also be interested in the text for the undergraduate data mining course: Tan, Steinbach, Kumar, "Introduction to Data Mining", Addison-Wesley, 2006, ISBN 0-321-32136-7.

Datasets

I will make the datasets available as we go.

Here is an approximation to the weekly schedule that I expect to follow, at least for the first few weeks:

Weeks 1 and 2
I will present introductory material on data mining, particularly about ways in which data needs to be prepared before mining, and how your results should be presented.
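As a small illustration of the kind of preparation meant here, the sketch below rescales each attribute to mean 0 and standard deviation 1, so that attributes measured in large units do not dominate those measured in small ones. This is only one common preparation step, chosen as an assumption for the example; the function name and the sample records are hypothetical and are not taken from the course datasets.

```python
import statistics


def standardize_columns(rows):
    """Rescale each column to mean 0 and population standard deviation 1.

    Illustrative helper only (hypothetical name). Constant columns carry
    no information, so they are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical records: (age in years, income in dollars).
records = [[25, 40000], [35, 60000], [45, 80000]]
scaled = standardize_columns(records)

for row in scaled:
    print(row)
```

After rescaling, the two attributes are on the same footing even though their raw ranges differ by three orders of magnitude.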

Week 2
You will choose your first data mining technique. I will explain the protocol for doing this in the first class. As soon as you have been allocated a method, you should look for software that will help you. I will be able to give some advice about this.

Weeks 2 and 3
You should spend these weeks finding out about your chosen technique. Be prepared to explain briefly in class during Week 3 what your technique does and how it works.

Weeks 4 to 11
You should be prepared to make some kind of presentation every week, probably a brief PowerPoint presentation. We will spend time on each dataset in turn. It's hard to estimate how long we will spend on each one, because it depends on how successful the modelling is.

Some questions about your technique

Week 12

What have you learned about each of the techniques that you've seen being used? Which technique would you use for problems of the following kind:

If you wanted to spend the next ten years using data mining as your source of livelihood, would you (a) develop a product and sell it, or (b) provide consulting to organisations wanting to use data mining themselves, or (c) something else? Why?

I will ask each person to evaluate the performance of each other member of the class based on their contribution to the course. This might be based on help with data manipulation and software, as well as the quality of their work and presentations. Note that it isn't fair to base this evaluation on the quality of the results obtained, since some techniques are intrinsically more powerful than others.

If we have an exam, I'll introduce the dataset to be used for it. Notice that the exam task is different from what you've been doing during the term: you may choose any technique, but you must justify your choice.

2012 exam

2011 final exam

2010 final exam

2008 final exam

2007 final exam

2003 final exam

Back to David Skillicorn's home page.