7.8 kmeans predict

20220104

Having performed a cluster analysis, we have effectively fit a model to the data or, as others may describe it, trained a model from the data. The model can now be used to “predict”, or in our case assign, each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the “model” saved as a csv file.

Common usage:

ml predict kmeans iris.csv model.csv

The output will be something like:

sepal_length,sepal_width,petal_length,petal_width,label
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
...
6.9,3.1,4.9,1.5,2
5.5,2.3,4.0,1.3,2
...
7.3,2.9,6.3,1.8,1
6.7,2.5,5.8,1.8,1
...

General usage:

ml predict kmeans DATAFILE [MODELFILE]

The input DATAFILE (a csv file such as iris.csv) is required and contains the observations to be labelled. Here “predicting” the label actually means finding the closest centroid to each observation.
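The idea of assigning a point to its closest centroid can be illustrated with a minimal Python sketch. It is not the code used by the kmeans package: the function name nearest_centre, the use of Euclidean distance, and the centre values are all assumptions made for illustration.

```python
import math

def nearest_centre(point, centres):
    """Return the label of the centre closest to point.

    centres maps a cluster label to a list of coordinates. Euclidean
    distance is an assumption made for this sketch.
    """
    return min(centres, key=lambda label: math.dist(point, centres[label]))

# Hypothetical centres for illustration only (not real model output).
centres = {
    0: [5.0, 3.4, 1.5, 0.2],
    1: [6.8, 3.1, 5.7, 2.1],
    2: [5.9, 2.7, 4.4, 1.4],
}

# The first iris observation is closest to the hypothetical centre 0.
print(nearest_centre([5.1, 3.5, 1.4, 0.2], centres))  # prints 0
```

With these made-up centres the remaining observations would be labelled the same way, one call per row of the data file.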

If no MODELFILE is supplied then the model is read from standard input. This allows the command to be part of a pipeline of commands, whereby the model could be piped from the train command. The model is a csv file with a header row, in which each row holds one centre together with its cluster label. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns are the coordinates of the centre.
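As a rough sketch of how a model in this layout might be parsed, the following Python separates the label column from the centre coordinates and falls back to standard input when no file is named. The function names read_model and open_model are hypothetical, and the model text is made up for illustration. This is not the package's own implementation.

```python
import csv
import io
import sys

def read_model(stream):
    """Parse model csv text into a {label: [centre coordinates]} dict."""
    model = {}
    for row in csv.DictReader(stream):
        label = row.pop("label")  # the cluster label column
        model[label] = [float(v) for v in row.values()]  # remaining columns
    return model

def open_model(path=None):
    """Read the model from path, or from standard input when path is None."""
    if path is None:
        return read_model(sys.stdin)
    with open(path, newline="") as f:
        return read_model(f)

# Made-up model text for illustration only.
example = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.0,3.4,1.5,0.2,0\n"
    "6.8,3.1,5.7,2.1,1\n"
)
print(read_model(example))
```

Looking the label up by column name rather than by position means the sketch works whether or not label is the last column.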

The output is in csv format, written to standard output. It has a header row and repeats the input columns, followed by a final column named label that identifies the nearest centre to each point.
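The output layout can be reproduced with a few lines of Python. This is a sketch only: write_labelled is a hypothetical helper, not part of the package, and the two rows are taken from the sample output shown earlier.

```python
import csv
import io

def write_labelled(header, rows, labels, out):
    """Write rows as csv with each row's cluster label appended last."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(list(header) + ["label"])
    for row, label in zip(rows, labels):
        writer.writerow(list(row) + [label])

header = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
rows = [["5.1", "3.5", "1.4", "0.2"], ["6.9", "3.1", "4.9", "1.5"]]
labels = ["0", "2"]

out = io.StringIO()
write_labelled(header, rows, labels, out)
print(out.getvalue(), end="")
```

Passing sys.stdout instead of the StringIO buffer would send the labelled rows to standard output, from where they can be redirected to a file.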

To save the output to file:

ml predict kmeans iris.csv model.csv > predict.csv


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0