Now onto class 2 with the Data Mining with Weka MOOC.
Be a classifier
Interesting, Weka has a feature to visually build a decision tree by selecting regions of the instance space. Much nicer than the decision tree process I taught a couple of years ago about in IPT.
The idea is that this tree still yet needs be tested – which is coming up.
Training and testing
Basic machine learning situation.
We’re developing a classifer on the basis of machine learning algorithm.
The test and training data needs to be different. But if there is one data set, then you need to divide – one suggestion 2/3 data for training, 1/3 for testing. Weka has the ability to do this randomly.
Basic assumption is that both sets are produced from independent sampling from an infinite population.
More training and testing
Using Weka’s percentage split to train/test the classifier multiple times.
After doing this calculate the sample mean, variance and standard deviation for each of the runs.
Interestingly the quiz questions assume some knowledge of statistics and actually require it to be used.
Ahh, is it good to get 76% correct. A way to answer this is to look at the class (outcome) data and identify what you would get if you simply guessed. In this case, 500 of the 768 outcomes are positive (yes). So if you simply guessed yes, then you would be correct 65% of the time.
There is a classifier that does this. zeroR classifier. Establishing the base line.
The standard way of testing the performance.
Data set is divided into x segments. Run x times the idea of holding one segment for testing and use the rest for training. Called x-fold (where x is number) cross validation.
Stratified cross-validation – tries to hold the proportion of each class value.
Lots of data – use percentage split. Otherwise stratified 10-fold cross-validation – rule of thumb.
How big is lots? Hard to say, depends on the number of classes (2 class data set with 100-1000 samples, is probably good enough). Meaning 1000 data points.
100 number of classes needs more – looking for fair representation of each class.
Not much to write about their.
9/10 on the mid-course assessment. 10 MCQs. That’ll do.