CISC251/3.0 Data Analytics
Introduction to data analytics; data preparation; assessing performance; prediction methods such as decision trees, random forests, support vector machines, neural networks and rules; ensemble methods such as bagging and boosting; clustering techniques such as expectation-maximization, matrix decompositions, and biclustering; attribute selection.
Recommended: Prior exposure to problem solving in any discipline.
Learning Hours: 120 (36L; 24Lab; 60P)
Exclusion: CISC/CMPE 333
Preliminaries (2 weeks)
- Data acquisition and preparation
- Inductive modelling as an epistemology
- Assessing model performance
- Simple predictors: decision trees, k-nearest neighbour, Naïve Bayes prediction
- Stronger predictors: random forests, support vector machines, neural networks
- Ensemble techniques (bagging, boosting)
- Similarity measures: distance-based, distribution-based, density-based
- Algorithms: k-means, expectation-maximization, DBSCAN
- Matrix decompositions (such as singular value decomposition) and projections
- Attribute selection techniques
- Applications selected from a variety of domains (natural language, bioinformatics, business)
Upon successful completion of this course, a student will be able to:
- Design inductive model-building algorithms appropriate for datasets of moderate size and complexity
- Evaluate the modelling performance of such algorithms, and the implications for the real-world system that the data describes
- Zaki and Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press.