7.13 kmeans pipeline
As with all mlhub commands, a goal is to provide powerful combinations of commands through pipelines. We might process a csv file through a number of steps: for example, to normalise the columns, and then to pipe the csv file into the train command followed by the predict command, outputting a csv file with each observation labelled with a cluster number.
cat iris.csv | ml train kmeans 3 | ml predict kmeans iris.csv
The output will be something like:
sepal_length,sepal_width,petal_length,petal_width,label
5.0,3.6,1.4,0.2,1
7.7,3.8,6.7,2.2,0
6.1,3.0,4.9,1.8,2
5.4,3.7,1.5,0.2,2
...
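Conceptually, the train step fits the cluster centroids and the predict step labels each observation with its nearest centroid. The following is a minimal sketch of that idea in pure Python (Lloyd's algorithm) — an illustration only, not the mlhub implementation; the four rows and the choice of two clusters are made up for the example:

```python
# Minimal k-means sketch: fit centroids (train), then label rows (predict).
# Illustrative only; not the mlhub implementation.
import random

def dist2(a, b):
    # Squared Euclidean distance between two rows.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(rows, k)           # initial centroids
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest centroid (the "predict" step).
        labels = [min(range(k), key=lambda c: dist2(row, centroids[c]))
                  for row in rows]
        # Recompute each centroid as the mean of its assigned rows.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels

# Four iris-like observations: two small-petal, two large-petal rows.
rows = [[5.0, 3.6, 1.4, 0.2],
        [7.7, 3.8, 6.7, 2.2],
        [6.1, 3.0, 4.9, 1.8],
        [5.4, 3.7, 1.5, 0.2]]
centroids, labels = kmeans(rows, 2)
print(labels)
```

With well-separated data like this, the two small-petal rows end up sharing one cluster and the two large-petal rows the other, whatever the initial centroid choice.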
The pipeline can go one step further to visualise the clustering:
cat iris.csv | ml train kmeans 3 | ml predict kmeans iris.csv | ml visualise kmeans
This will pop up a window to display the clustering result.
TODO: Include the resulting plot here.
A pipeline including normalise can be illustrated with the wine.csv dataset from Section 7.14:
cat wine.csv |
ml normalise kmeans |
tee norm.csv |
ml train kmeans 4 |
ml predict kmeans norm.csv |
mlr --csv cut -f label |
paste -d"," wine.csv -
Here, after normalising the input dataset, the result is saved to the file norm.csv using tee, whilst the same data is piped on to the next command (to train a clustering). We save to file since we'd like to predict the clusters for each of the normalised observations and then map them back to the original observations. This is accomplished using mlr to cut the label column from the csv output of the predict command, and then paste to append that label column to the original wine.csv.
The output is something like:
alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,0
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,0
...
13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750,3
13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835,3
13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840,2
14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560,2
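The normalise and recombine steps can be sketched in the same spirit. Assuming the normalisation is something like min-max scaling (the exact method `ml normalise` applies is not shown here), and mirroring the mlr cut plus paste recombination, a Python sketch with a made-up two-column dataset and hypothetical labels standing in for the predict output:

```python
# Sketch of the normalise-then-recombine logic of the pipeline above.
# Assumes min-max scaling; the actual `ml normalise` method may differ.
import csv, io

original = """alcohol,malic
14.23,1.71
13.2,1.78
13.17,2.59
"""

rows = list(csv.reader(io.StringIO(original)))
header, data = rows[0], [[float(v) for v in r] for r in rows[1:]]

# Min-max normalise each column to [0, 1] (the `ml normalise` step).
cols = list(zip(*data))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]
normed = [[(v - l) / (h - l) if h > l else 0.0
           for v, l, h in zip(row, lo, hi)] for row in data]

# Hypothetical cluster labels, standing in for `ml predict` output.
labels = [0, 0, 1]

# Like `mlr --csv cut -f label | paste -d"," wine.csv -`: append the
# label column from the prediction back onto the original rows.
merged = [header + ["label"]] + [r + [str(l)]
                                 for r, l in zip(rows[1:], labels)]
for line in merged:
    print(",".join(line))
```

The key point the sketch mirrors is that the labels are computed from the normalised rows but pasted onto the original, unnormalised rows, row by row.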
Once again we can visualise the result as part of the pipeline, whilst using tee to also save the clustering to file:
cat wine.csv |
ml normalise kmeans |
tee norm.csv |
ml train kmeans 4 |
ml predict kmeans norm.csv |
mlr --csv cut -f label |
paste -d"," wine.csv - |
tee clustering.csv |
ml visualise kmeans
TODO: Include the resulting plot here.