7.13 kmeans pipeline
As with all mlhub commands, a goal is to provide powerful combinations of commands through pipelines. We might process a csv file through a number of steps: for example, normalise the columns, then pipe the csv file into the train command followed by the predict command, to output a csv file with each observation labelled with a cluster number.
cat iris.csv | ml train kmeans 3 | ml predict kmeans iris.csv
The output will be something like:
sepal_length,sepal_width,petal_length,petal_width,label
5.0,3.6,1.4,0.2,1
7.7,3.8,6.7,2.2,0
6.1,3.0,4.9,1.8,2
5.4,3.7,1.5,0.2,2
...
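A quick way to check the cluster sizes in such labelled output is to tally the final column with standard tools. This is a minimal sketch, assuming the labelled csv has been saved to a file (the file name clustered.csv and its rows here are made up for illustration; label is the last field):

```shell
# Miniature stand-in for the labelled output above (hypothetical data).
printf 'sepal_length,label\n5.0,1\n7.7,0\n6.1,2\n5.4,2\n' > clustered.csv

# Skip the header, pull the last field (the label), and tally each cluster.
tail -n +2 clustered.csv | awk -F, '{print $NF}' | sort | uniq -c
```

The same tally works on the real pipeline output by piping it straight into the tail command rather than going through a file.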
The pipeline can go one step further to visualise the clustering:
cat iris.csv | ml train kmeans 3 | ml predict kmeans iris.csv | ml visualise kmeans
This will pop up a window displaying the clustering result.
TODO: Include the resulting plot here.
A pipeline including normalise can be illustrated with the wine.csv dataset from Section 7.14:
cat wine.csv |
ml normalise kmeans |
tee norm.csv |
ml train kmeans 4 |
ml predict kmeans norm.csv |
mlr --csv cut -f label |
paste -d"," wine.csv -
Here, after normalising the input dataset, the result is saved to the file norm.csv using tee, whilst the same data is piped on to the next command (to train a clustering). We save to file since we'd like to predict the clusters for each of the normalised observations and then map them back to the original observations. This is accomplished using mlr to cut the label column from the csv output of the predict command, and then paste to append that label column to the original wine.csv.
The output is something like:
alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,0
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,0
...
13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750,3
13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835,3
13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840,2
14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560,2
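The cut-and-paste step at the end of the pipeline can be sketched in miniature without the ml commands. This is a hedged sketch using coreutils cut in place of mlr; the file names and values are made up for illustration:

```shell
# Hypothetical miniature inputs: orig.csv stands in for wine.csv and
# pred.csv for the predict output with a trailing label column.
printf 'a,b\n1,2\n3,4\n' > orig.csv
printf 'a,b,label\n1,2,0\n3,4,1\n' > pred.csv

# Extract the third (label) column and append it to the original rows,
# mirroring the mlr cut | paste combination in the pipeline above.
cut -d, -f3 pred.csv | paste -d"," orig.csv -
```

Note that mlr selects the column by name rather than position, which is why it is preferred in the actual pipeline, where the label is one of many columns.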
Once again we can visualise the result as part of the pipeline, whilst using tee to also save the clustering to file:
cat wine.csv |
ml normalise kmeans |
tee norm.csv |
ml train kmeans 4 |
ml predict kmeans norm.csv |
mlr --csv cut -f label |
paste -d"," wine.csv - |
tee clustering.csv |
ml visualise kmeans
TODO: Include the resulting plot here.