8.4 apriori itemsets
20220216
Itemsets are a basic concept for association rules. An itemset is a set of items that frequently occur together across multiple baskets.
ml itemsets apriori [options] [file.csv]
--id=<name> The id column name.
-o <file> --output=<file> Save itemsets to .csv or .rds file.
-s <0-1> --support=<0-1> Minimum support threshold.
The input file is a two-column csv file. One column is the basket id and the other is an item in that basket. The item column can have any name. If no data file is supplied the data is read from stdin, often as part of a pipeline. An example input data file might be:
id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
...
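To make the input format concrete, the following is a minimal Python sketch of how the two-column rows group into baskets. It is not part of the ml command: the helper name read_baskets and its id_column parameter are hypothetical, and the sample rows are the ones shown above with the elided rows left out.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the example input (elided rows are omitted).
SAMPLE = """id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
"""


def read_baskets(handle, id_column="id"):
    """Group items by basket id (hypothetical helper, not part of ml)."""
    reader = csv.DictReader(handle)
    # The item column can have any name: take the column that is not the id.
    item_column = next(name for name in reader.fieldnames if name != id_column)
    baskets = defaultdict(set)
    for row in reader:
        baskets[row[id_column]].add(row[item_column])
    return dict(baskets)


baskets = read_baskets(io.StringIO(SAMPLE))
for basket_id, items in sorted(baskets.items()):
    print(basket_id, sorted(items))
```

Each basket becomes a set, so an item repeated within the same basket is counted once.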
Output goes to stdout by default, with one row for each itemset found, giving its frequency count and support:
$ ml itemsets apriori mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
comp2345,123,0.45
...
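For downstream processing, the output rows can be read back with any csv reader. The sketch below assumes, based only on the sample output above, that items within a pattern are separated by a colon and that item names contain no colon themselves. The helper name parse_itemsets is hypothetical.

```python
import csv
import io

# Rows copied from the sample output (elided rows are omitted).
OUTPUT = """pattern,freq,support
comp1234:comp4567,145,0.75
comp2345,123,0.45
"""


def parse_itemsets(handle):
    """Turn each output row into a dict with typed fields (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            # Assumption: ":" separates the items in a pattern.
            "items": row["pattern"].split(":"),
            "freq": int(row["freq"]),
            "support": float(row["support"]),
        })
    return rows


itemsets = parse_itemsets(io.StringIO(OUTPUT))
for entry in itemsets:
    print(entry["items"], entry["freq"], entry["support"])
```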
The output can be saved to a named csv file with --output= (or -o), with the argument being a filename including the .csv extension. If the filename extension is instead .rds then the result is saved as a single object in the named file.
Output can be filtered to include only those itemsets with at least a specified value for the support. The default support threshold is 10% (0.1). The support for an itemset is simply the proportion of baskets which contain all items in the itemset.
$ ml itemsets apriori --support=0.5 mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...
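The definition of support can be illustrated with a short Python sketch. It uses the two complete baskets from the example input data earlier in this section and enumerates candidate itemsets by brute force, which is only practical for tiny data. It does not implement the apriori pruning strategy and is not how the ml command is implemented. The function names and the max_size parameter are hypothetical.

```python
from itertools import combinations

# The two baskets from the example input data (elided rows are omitted).
baskets = {
    "u1234567": {"comp1234", "comp2345", "comp3456", "comp4567"},
    "u1234568": {"comp1234", "comp4567"},
}


def support(itemset, baskets):
    """Proportion of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def frequent_itemsets(baskets, min_support=0.1, max_size=2):
    """Brute-force search over itemsets up to max_size (no apriori pruning)."""
    all_items = sorted(set().union(*baskets.values()))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(all_items, size):
            value = support(candidate, baskets)
            if value >= min_support:
                result[candidate] = value
    return result


print(support({"comp1234", "comp4567"}, baskets))  # both baskets contain the pair
print(support({"comp2345"}, baskets))              # only one of the two baskets
print(frequent_itemsets(baskets, min_support=0.75))
```

With only two baskets, the pair comp1234 and comp4567 has a support of 1.0 because both baskets contain both items, while comp2345 has a support of 0.5 and is dropped at a threshold of 0.75.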
A column named id is expected. In general, though, the identifier could be any column (like ID):
ID,Course
u1234567,comp1234
u1234568,comp2345
...
To use a non-id column as the identifier, use --id=:
$ ml itemsets apriori --id=ID mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0