This page describes how to use the machine learning (aka mining) bundle.
Operators
CLUSTERING
Description
This operator clusters a set of tuples.
Parameter
- ATTRIBUTES: The attributes of the incoming tuples that the distance/similarity function should consider for clustering. Note that not all attribute types are supported here
- CLUSTERER: The clustering algorithm that should be used
  - Currently implemented: kMeans, Weka (which in turn has further algorithms)
- ALGORITHM: A set of options to describe the algorithm
Example
This example uses the Weka clusterer with the "simplekmeans" algorithm. The arguments passed to Weka's SimpleKMeans are "-N 3", which sets the number of clusters to 3.
```
clustered = CLUSTERING({
    attributes=['age', 'income'],
    clusterer='weka',
    algorithm=[
      'model'='SimplekMeans',
      'arguments'='-N 3'
    ]
  }, inputoperator)
```
For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation:
- SIMPLEKMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/SimpleKMeans.html)
- EM (http://weka.sourceforge.net/doc.dev/weka/clusterers/EM.html)
- COBWEB (http://weka.sourceforge.net/doc.dev/weka/clusterers/Cobweb.html)
- FARTHESTFIRST (http://weka.sourceforge.net/doc.dev/weka/clusterers/FarthestFirst.html)
- DENSITY_KMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/MakeDensityBasedClusterer.html)
- HIERARCHICAL (http://weka.sourceforge.net/doc.dev/weka/clusterers/HierarchicalClusterer.html)
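To illustrate what the ATTRIBUTES parameter means for a distance-based clusterer, the following Python sketch assigns tuples to the nearest of a few fixed centroids, using only the selected attributes. It is not the bundle's implementation: the function names, the Euclidean distance, and the sample data are assumptions made purely for illustration.

```python
import math

def distance(tup, centroid, attributes):
    """Euclidean distance restricted to the selected attributes."""
    return math.sqrt(sum((tup[a] - centroid[a]) ** 2 for a in attributes))

def assign_clusters(tuples, centroids, attributes):
    """Return, for each tuple, the index of the nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda i: distance(t, centroids[i], attributes))
        for t in tuples
    ]

# Hypothetical sample data; 'name' is ignored because it is not selected.
tuples = [
    {"name": "a", "age": 25, "income": 30000},
    {"name": "b", "age": 47, "income": 82000},
    {"name": "c", "age": 31, "income": 35000},
]
centroids = [{"age": 30, "income": 32000}, {"age": 50, "income": 80000}]
labels = assign_clusters(tuples, centroids, ["age", "income"])
print(labels)
```

A string attribute such as 'name' could not be used in this distance function, which is one reason why not every attribute type is suitable for clustering.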
CLASSIFICATION_LEARN
Description
This operator creates a classifier. The result is therefore a stream of classifiers, which is a dedicated datatype.
Parameter
- CLASS: The attribute that should be used as the class (the
- NOMINALS: For nominal classifiers, this list provides the possible values, because some algorithms have to know them in advance
- LEARNER: The algorithm that is used to construct the classifier
  - Currently implemented: Weka (which in turn has further algorithms, see below)
- ALGORITHM: A set of options to set up the algorithm
Example
This example uses the Weka learner with the "J48" algorithm. It learns a classifier for the nominal class attribute "attack", whose possible values are 'back', 'smurf', and 'spy'.
```
learned = CLASSIFICATION_LEARN({
    class='attack',
    nominals=['attack'=['back', 'smurf', 'spy']],
    learner='weka',
    algorithm=[
      'model'='J48'
    ]
  }, inputoperator)
```
For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation.
Classification (nominal values):
- J48 (an adapted version of C4.5, a decision tree induction algorithm)
- NaiveBayes
- DecisionTable
- SMO (Sequential Minimal Optimization)
Regression (continuous values):
- LINEAR-REGRESSION
- SIMPLE_LINEAR-REGRESSION
- LOGISTIC
- SIMPLE-LOGISTIC
- GAUSSIAN-PROCESSES
- SMO-Regression (a regression version of SMO)
- MULTILAYER-PERCEPTRON
CLASSIFY
Description
This operator classifies a tuple by using a classifier. The operator needs two inputs: A stream of tuples that should be classified and a stream of classifiers (that normally comes from a CLASSIFICATION_LEARN operator).
It appends a new attribute called "clazz", which contains either the nominal class value or, for a regression, the continuous value.
For the CLASSIFY operator, the type of the classifier (tree, list, Bayes net, ...) doesn't matter. You may even mix them to classify the same tuple with different classifiers (see Ensembles). The left port is the input for the tuples that should be classified, and the right port is the input for the classifiers.
Parameter
- CLASSIFIER: The attribute where the classifier in the classifier-tuple can be found (normally, it's "classifier", which is given by the CLASSIFICATION_LEARN operator)
Example
```
classified = CLASSIFY({classifier='classifier'}, test, learned)
```
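The effect of CLASSIFY on a single tuple can be pictured with the following Python sketch. It only models the behaviour described above (look up the classifier in the classifier tuple, then append "clazz"); the function name `classify`, the attribute `count`, and the toy model are hypothetical and not part of Odysseus.

```python
def classify(tup, classifier_tuple, classifier="classifier"):
    """Return a copy of tup extended by a 'clazz' attribute."""
    model = classifier_tuple[classifier]  # the attribute named by CLASSIFIER
    result = dict(tup)                    # leave the input tuple unchanged
    result["clazz"] = model(tup)
    return result

# Hypothetical classifier: a plain function standing in for a learned model.
def toy_model(tup):
    return "smurf" if tup["count"] > 100 else "back"

learned = {"classifier": toy_model}
classified = classify({"count": 250}, learned)
print(classified)
```

Because the sketch only calls the model, it works the same way regardless of what kind of classifier sits behind it.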
FREQUENTITEMSET
Description
This operator creates frequent item sets from a given stream.
Each tuple in the result stream has three attributes:
- id: the number (a simple counter) of the pattern
- set: the frequent pattern, which is a list of tuples (a nested attribute ~ NF^2)
- support: the support of the pattern
Parameter
- SUPPORT: The minimal support that defines what counts as frequent. This can be either an absolute number greater than 1.0 or a fraction between 0.0 and 1.0. A fraction is interpreted relative to the number of transactions.
- TRANSACTIONS: The number of transactions that should be investigated
  - A transaction is a snapshot of a window, so each time a window changes, there is a new transaction
- LEARNER: The algorithm that is used
  - Currently implemented: fpgrowth, Weka (which in turn has further algorithms)
- ALGORITHM: A set of options to describe the algorithm
Example
This example uses FP-Growth to find frequent item sets. It does not need any options in ALGORITHM.
```
/// support is 3 out of 1000 transactions
fpm = FREQUENTITEMSET({support=3.0, transactions=1000, learner = 'fpgrowth'}, inputoperator)
/// support is 60% out of 1000 transactions, so it is equal to a support of 600.0
fpm = FREQUENTITEMSET({support=0.6, transactions=1000, learner = 'fpgrowth'}, inputoperator)
```
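The two ways of specifying SUPPORT and the shape of the result tuples (id, set, support) can be sketched in Python as follows. This is a brute-force enumeration for illustration, not FP-Growth and not the operator's actual code. The function names, the sample transactions, starting the id counter at 1, and treating a value of exactly 1.0 as a fraction are all assumptions.

```python
from itertools import combinations

def min_support_count(support, transactions):
    """Interpret SUPPORT as an absolute count (> 1.0) or as a fraction."""
    if support > 1.0:
        return support
    return support * transactions

def frequent_itemsets(transactions, support):
    """Brute-force search; returns (id, set, support) records."""
    threshold = min_support_count(support, len(transactions))
    items = sorted({item for t in transactions for item in t})
    result = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= threshold:
                result.append((len(result) + 1, frozenset(candidate), count))
    return result

# Hypothetical transactions (each one a snapshot of a window).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
patterns = frequent_itemsets(transactions, 0.5)  # 50% of 4 transactions = 2
for pattern in patterns:
    print(pattern)
```

Brute force checks every possible item combination and therefore only scales to a handful of items; algorithms such as FP-Growth exist precisely to avoid that enumeration.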
GENERATERULES
Description
This operator uses a list of tuples and creates rules like "x => y".
A rule is a special datatype called "AssociationRule", which is essentially a tuple of two patterns (one for the premise and one for the consequence of the rule).
Parameter
- ITEMSET: The attribute where to find the frequent itemset/pattern
- SUPPORT: The attribute where to find the support of the pattern (this is used for calculating the confidence!)
- CONFIDENCE: The minimal confidence of a rule, given as a fraction (a value between 0.0 and 1.0)
Example
This example generates only rules with a confidence of 60% or more.
```
rule = GENERATERULES({
    itemset='set',
    confidence=0.6,
    support='support'
  }, fpm)
```
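The reason the operator needs the support of each pattern is that the confidence of a rule "x => y" is usually defined as support(x and y together) / support(x). The Python sketch below applies that standard definition to a small table of supports. It is an illustration under assumptions, not the operator's implementation: the function name `generate_rules`, the sample supports, and the plain-tuple rule representation are hypothetical.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """Derive (premise, consequence, confidence) rules from itemset supports.

    supports maps each frequent itemset (a frozenset) to its support.
    """
    rules = []
    for itemset, support in supports.items():
        # Every non-empty proper subset is a candidate premise.
        for size in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), size):
                premise = frozenset(premise)
                if premise not in supports:
                    continue  # confidence cannot be computed without it
                confidence = support / supports[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, confidence))
    return rules

# Hypothetical supports, e.g. taken from a frequent item set stream.
supports = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 3,
    frozenset({"butter"}): 2,
    frozenset({"bread", "milk"}): 2,
    frozenset({"bread", "butter"}): 2,
}
rules = generate_rules(supports, 0.6)
for premise, consequence, confidence in rules:
    print(sorted(premise), "=>", sorted(consequence), round(confidence, 2))
```

With a minimal confidence of 0.6, the rule "butter => bread" (confidence 1.0) passes easily, while rules such as "bread => milk" pass only narrowly at 2/3.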
...
Ensembles
You can simply combine all operators with other operators in Odysseus to create ensembles.
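One common way to combine the outputs of several classifiers is a majority vote over their predicted class values. The Python sketch below shows that idea only. How the classified streams are actually joined and aggregated in Odysseus is not described here, so the function name and the vote strategy are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class label among the predictions."""
    # For ties, Counter keeps the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical "clazz" values produced by three different classifiers
# for the same tuple.
votes = ["smurf", "back", "smurf"]
print(majority_vote(votes))
```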
...