7.15 K-Means Example: Wine Train and Predict
Now that we have a dataset in the right form, we can train the K-Means model:
ml train kmeans 3 wine.csv
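Since the model is written to standard output, it can also be captured in a file for later reuse. A minimal sketch, assuming ml predict accepts the model on standard input just as it does in the pipeline further below (wine_model.csv is an arbitrary file name):

ml train kmeans 3 wine.csv > wine_model.csv
ml predict kmeans wine.csv < wine_model.csv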
The model itself consists of the three cluster centroids, one per row, identified by the trailing label column:
alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
13.11,1.01,1.70,15.00,78.00,2.98,3.18,0.26,2.28,5.30,1.12,3.18,502.00,0
14.19,1.59,2.48,16.50,108.00,3.30,3.93,0.32,1.86,8.70,1.23,2.82,1680.00,1
13.88,1.89,2.59,15.00,101.00,3.25,3.56,0.17,1.70,5.43,0.88,3.56,1095.00,2
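With thirteen feature columns the centroids are wide and hard to read across the page. Miller (mlr, used again below) can display records vertically; a sketch to view just the first centroid:

ml train kmeans 3 wine.csv | mlr --icsv --oxtab head -n 1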
We can train the model and then use it to place each observation into a cluster:
ml train kmeans 3 wine.csv |
ml predict kmeans wine.csv |
mlr --csv cut -f label > wine.pr
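As a quick sanity check we can tabulate how many observations land in each cluster, here using Miller's count-distinct verb on the saved labels:

mlr --icsv --opprint count-distinct -f label wine.pr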
Now compare the clusters with the wine classes:
cat wine.data |
cut -d"," -f 1 |
awk 'NR==1{print "class"} {print}' |
paste -d"," - wine.pr |
sort |
uniq -c
This gives us a pairwise count of the wine class and the cluster label. We can see it is not a great match, but there is some semblance of overlap. The first column is a frequency count, followed by the class and label separated by a comma. In this example the cluster labelled 0 covers much of wine classes 2 and 3, whilst the cluster labelled 2 covers most of wine class 1. The remaining counts are noise.
11 1,0
6 1,1
42 1,2
69 2,0
2 2,2
48 3,0
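As an aside, the same cross tabulation can be produced by replacing sort and uniq with Miller's count-distinct verb, a sketch reusing the commands above:

cut -d"," -f 1 wine.data |
awk 'NR==1{print "class"} {print}' |
paste -d"," - wine.pr |
mlr --icsv --opprint count-distinct -f class,label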