8.4 apriori itemsets
Itemsets are a basic concept for association rules. An itemset is a set of items, and the itemsets of interest are those whose items frequently occur together across multiple baskets.
ml itemsets apriori [options] [file.csv]
    --id=<name>                  The id column name.
    -o <file> --output=<file>    Save itemsets to .csv or .rds file.
    -s <0-1> --support=<0-1>     Minimum support threshold.
The input file is a two-column CSV file. One column is the basket
id and the other is an item in that basket. The item column can have
any name. If no data file is supplied, the data is read from standard
input (stdin), often as part of a pipeline. An example input data file might be:
id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
...
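To make the input layout concrete, here is a minimal Python sketch that groups the rows of such a file into baskets. It is an illustration only, not how the ml command itself reads data: the function name read_baskets and its id_column parameter are hypothetical, and the sketch assumes exactly one column besides the identifier.

```python
import csv
import io
from collections import defaultdict


def read_baskets(source, id_column="id"):
    """Group items by basket id from a two-column CSV stream.

    Hypothetical helper for illustration: one column must be named
    id_column, and the single remaining column (any name) holds the item.
    """
    reader = csv.DictReader(source)
    fieldnames = reader.fieldnames or []
    # The item column may have any name: take whichever column is not the id.
    item_columns = [name for name in fieldnames if name != id_column]
    if id_column not in fieldnames or len(item_columns) != 1:
        raise ValueError(f"expected {id_column!r} plus exactly one item column")
    item_column = item_columns[0]

    baskets = defaultdict(set)
    for row in reader:
        baskets[row[id_column]].add(row[item_column])
    return dict(baskets)


# The sample rows shown above (without the trailing ellipsis).
sample = """id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
"""
baskets = read_baskets(io.StringIO(sample))
```

Passing sys.stdin as the source would mirror the pipeline use described above, and a different identifier column can be named through the id_column argument.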
By default the output is written to stdout, with one row for each itemset found in the baskets, giving its frequency count and support:
$ ml itemsets apriori mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
comp2345,123,0.45
...
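Because the output is plain CSV, it is easy to consume downstream. The following Python sketch parses rows in the pattern,freq,support layout shown above. It assumes, based only on the sample output, that items within a pattern are separated by a colon; the function name parse_itemsets is hypothetical.

```python
import csv
import io


def parse_itemsets(source):
    """Parse pattern,freq,support rows into (items, freq, support) tuples.

    Assumption: items inside a pattern are joined with ':' as in the
    sample output; this is inferred from the example, not documented.
    """
    results = []
    for row in csv.DictReader(source):
        items = frozenset(row["pattern"].split(":"))
        results.append((items, int(row["freq"]), float(row["support"])))
    return results


# The two sample rows from the example output above.
output = """pattern,freq,support
comp1234:comp4567,145,0.75
comp2345,123,0.45
"""
itemsets = parse_itemsets(io.StringIO(output))
```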
The output can be saved to a named CSV file with --output (or
-o), with the argument being a filename including the .csv
extension. If the filename extension is instead
.rds then the result
is saved as a single object in the named file.
$ ml itemsets apriori -o itemsets.csv mcomp.csv
$ ml itemsets apriori -o itemsets.rds mcomp.csv
Output can be filtered to include only those itemsets with at least a specified value for the support. The default support threshold is 10% (0.1). The support for an itemset is simply the proportion of baskets which contain all items in the itemset.
$ ml itemsets apriori --support=0.5 mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...
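The definition of support and the effect of the threshold can be sketched in a few lines of Python. This is a generic textbook version of the Apriori idea under stated assumptions, not the implementation behind the ml command, which is not shown here. The function names support and apriori are hypothetical, and the two baskets are the ones from the sample input file.

```python
from itertools import combinations


def support(itemset, baskets):
    """Proportion of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def apriori(baskets, min_support=0.1):
    """Return {itemset: support} for itemsets meeting min_support.

    Illustrative sketch: grow itemsets one item at a time, keeping only
    candidates whose smaller subsets were all frequent.
    """
    baskets = [set(b) for b in baskets]
    if not baskets:
        return {}
    items = sorted({item for basket in baskets for item in basket})
    frequent = {}
    level = [frozenset([item]) for item in items]
    while level:
        # Keep the candidates at this size that reach the threshold.
        kept = []
        for candidate in level:
            s = support(candidate, baskets)
            if s >= min_support:
                frequent[candidate] = s
                kept.append(candidate)
        # Join step: merge kept k-itemsets into (k+1)-item candidates.
        size = len(kept[0]) + 1 if kept else 0
        candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        # Prune step: every k-item subset of a candidate must be frequent.
        kept_set = set(kept)
        level = [
            c for c in candidates
            if all(frozenset(sub) in kept_set for sub in combinations(c, size - 1))
        ]
    return frequent


# The two baskets from the sample input file.
baskets = [
    {"comp1234", "comp2345", "comp3456", "comp4567"},
    {"comp1234", "comp4567"},
]
pair_support = support({"comp1234", "comp4567"}, baskets)
strict = apriori(baskets, min_support=1.0)
```

With only these two baskets, the pair comp1234 and comp4567 appears in both, so its support is 1.0, while comp2345 appears in one of two baskets and would be dropped by any threshold above 0.5.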
A column named id is expected. In general, though, the identifier
could be any column, like ID in this file:
ID,Course
u1234567,comp1234
u1234568,comp2345
...
To use a non-id column as the identifier use --id:
$ ml itemsets apriori --id=ID mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0