This page describes how to use the machine learning (aka mining) bundle.

# Operators

### CLUSTERING

#### Description

This operator clusters a set of tuples.

#### Parameter

- ATTRIBUTES: The attributes of the incoming tuples that the distance/similarity function should take into account for clustering. Note that not all attribute types are supported here.
- CLUSTERER: The clustering algorithm that should be used.
  - Currently implemented: kMeans, Weka (which in turn provides further algorithms)
- ALGORITHM: A set of options that configure the algorithm.

#### Example

This example uses the Weka clusterer with the "simplekmeans" algorithm. The arguments passed to Weka's SimpleKMeans are "-N 3".

```
clustered = CLUSTERING({
attributes=['age', 'income'],
clusterer='weka',
algorithm =
[
'model'='SimplekMeans'
, 'arguments'='-N 3'
]
}, inputoperator)
```

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation:

- SIMPLEKMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/SimpleKMeans.html)
- EM (http://weka.sourceforge.net/doc.dev/weka/clusterers/EM.html)
- COBWEB (http://weka.sourceforge.net/doc.dev/weka/clusterers/Cobweb.html)
- FARTHESTFIRST (http://weka.sourceforge.net/doc.dev/weka/clusterers/FarthestFirst.html)
- DENSITY_KMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/MakeDensityBasedClusterer.html)
- HIERARCHICAL (http://weka.sourceforge.net/doc.dev/weka/clusterers/HierarchicalClusterer.html)
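To make the clustering example more concrete, the following is a minimal plain-Python sketch of what a k-means clusterer with three clusters conceptually does with the `age` and `income` attributes (in Weka's SimpleKMeans, `-N` sets the number of clusters). It is not Odysseus or Weka code: the function name `kmeans`, the deterministic initialisation with the first `k` points, and the sample data are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

# Hypothetical (age, income) tuples that form three obvious groups.
tuples = [(20, 1000), (40, 5000), (60, 9000), (21, 1100), (41, 5100), (61, 9100)]
labels, centers = kmeans(tuples, k=3)
```

In this sketch `income` dominates the Euclidean distance because of its larger scale; a real setup would probably normalise the attributes first.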

### CLASSIFICATION_LEARN

#### Description

This operator is used to create a classifier. The result is therefore a stream of classifiers (a classifier is a datatype of its own).

#### Parameter

- CLASS: The attribute that should be used as the class (the
- NOMINALS: For nominal classifiers, the possible values of each nominal attribute. They must be listed because some algorithms need to know them in advance.
- LEARNER: The algorithm that is used to construct the classifier.
  - Currently implemented: Weka (which in turn provides further algorithms, see below)
- ALGORITHM: A set of options to set up the algorithm.

#### Example

This example uses the Weka learner with the "J48" model. The class attribute is "attack", whose possible nominal values are "back", "smurf" and "spy". No further arguments are passed to the algorithm.

```
learned = CLASSIFICATION_LEARN({
class='attack',
nominals = ['attack'=['back', 'smurf', 'spy']],
learner = 'weka',
algorithm =
[
'model'='J48'
]
}, inputoperator)
```

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation.

Classification (nominal values):

- J48 (an adapted version of C4.5, a decision tree induction algorithm)
- NaiveBayes
- DecisionTable
- SMO (Sequential Minimal Optimization)

Regression (continuous values):

- LINEAR-REGRESSION
- SIMPLE_LINEAR-REGRESSION
- LOGISTIC
- SIMPLE-LOGISTIC
- GAUSSIAN-PROCESSES
- SMO-Regression (a regression version of SMO)
- MULTILAYER-PERCEPTRON
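The idea that a learner turns tuples into a classifier, which is then itself a value in a stream, can be illustrated with a small plain-Python sketch. It fits a one-dimensional least-squares line as a stand-in for a regression model. The function name `learn_linear_regression`, the attribute names and the data are hypothetical; this is not how Odysseus or Weka implement it.

```python
def learn_linear_regression(tuples, feature, target):
    """Fit y = slope * x + intercept by ordinary least squares; return the model as a callable."""
    xs = [t[feature] for t in tuples]
    ys = [t[target] for t in tuples]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    # The returned function plays the role of the "classifier" value.
    return lambda t: slope * t[feature] + intercept

# Hypothetical training tuples (wind speed and measured power).
speedandpower = [
    {"windspeed": 1.0, "power": 10.0},
    {"windspeed": 2.0, "power": 20.0},
    {"windspeed": 3.0, "power": 30.0},
]
classifier = learn_linear_regression(speedandpower, "windspeed", "power")
forecast = classifier({"windspeed": 4.0})
```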

### CLASSIFY

#### Description

This operator classifies a tuple by using a classifier. The operator needs two inputs: A stream of tuples that should be classified and a stream of classifiers (that normally comes from a CLASSIFICATION_LEARN operator).

It appends a new attribute called "clazz", which contains either the nominal class value or, for a regression, the continuous value.

For the CLASSIFY operator, the type of the classifier (tree, list, Bayes net, ...) does not matter. You can even mix them to classify the same tuple with different classifiers (see Ensembles). The left port is the input for the tuples that should be classified, and the right port is the input for the classifiers.

#### Parameter

- CLASSIFIER: The attribute where the classifier in the classifier-tuple can be found (normally, it's "classifier", which is given by the CLASSIFICATION_LEARN operator)

#### Example

`classified = CLASSIFY({classifier='classifier'}, test, learned)`
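As a rough mental model of this operator, the following plain-Python sketch pairs every input tuple with every available classifier and appends the result as "clazz". The two rule-based classifiers and the attribute names `packets` and `duration` are invented for illustration, and the sketch ignores windows and time stamps, which a real stream operator would have to respect.

```python
def classify(tuples, classifiers):
    """Yield every input tuple once per classifier, extended by a 'clazz' attribute."""
    for t in tuples:
        for predict in classifiers:
            yield {**t, "clazz": predict(t)}

# Two hypothetical classifiers, stand-ins for models from CLASSIFICATION_LEARN.
def by_packets(t):
    return "smurf" if t["packets"] > 100 else "back"

def by_duration(t):
    return "spy" if t["duration"] > 60 else "back"

test = [{"packets": 500, "duration": 5}, {"packets": 3, "duration": 90}]
classified = list(classify(test, [by_packets, by_duration]))
```

Each of the two input tuples appears twice in `classified`, once per classifier, which is the behaviour the ensemble section relies on.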

### FREQUENTITEMSET

#### Description

This operator creates frequent item sets from a given stream.

Each tuple of the result stream has three attributes:

- id: the number (a simple counter) of the pattern
- set: the frequent pattern, which is a list of tuples (a nested attribute ~ NF^2)
- support: the support of the pattern
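The shape of these result tuples can be sketched in plain Python. The sketch below uses brute-force enumeration rather than FP-Growth, represents items as strings instead of nested tuples, and starts the counter at 1; all three choices, as well as the function name `frequent_itemsets` and the sample data, are assumptions for illustration only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Brute-force enumeration (not FP-Growth): count every itemset that occurs."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    # Keep itemsets that reach the minimal support, ordered by size and then by name.
    frequent = sorted(
        (entry for entry in counts.items() if entry[1] >= min_count),
        key=lambda entry: (len(entry[0]), entry[0]),
    )
    # Mirror the three output attributes: id, set, support.
    return [
        {"id": i, "set": list(itemset), "support": support}
        for i, (itemset, support) in enumerate(frequent, start=1)
    ]

transactions = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
patterns = frequent_itemsets(transactions, min_count=2)
```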

#### Parameter

- SUPPORT: The minimal support that defines what is frequent. This can be either an absolute number greater than 1.0 or a double between 0.0 and 1.0. A double in this range is interpreted as a fraction of the number of transactions.
- TRANSACTIONS: The number of transactions that should be investigated.
  - A transaction is a snapshot of a window, so each time a window changes, there is a new transaction.
- LEARNER: The algorithm that is used.
  - Currently implemented: fpgrowth, Weka (which in turn provides further algorithms)
- ALGORITHM: A set of options that configure the algorithm.

#### Example

This example uses FP-Growth to find frequent item sets. It does not need any parameters in `algorithm`.

```
/// support is 3 out of 1000 transactions
fpm = FREQUENTITEMSET({support=3.0, transactions=1000, learner = 'fpgrowth'}, inputoperator)
/// support is 60% out of 1000 transactions, so it is equal to a support of 600.0
fpm = FREQUENTITEMSET({support=0.6, transactions=1000, learner = 'fpgrowth'}, inputoperator)
```
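The two readings of the support value in the comments above can be written down as a small plain-Python helper. The function name `min_support_count` is hypothetical, and treating exactly 1.0 as a fraction (100%) is an assumption, since the parameter description leaves that boundary open.

```python
def min_support_count(support, transactions):
    """Translate the SUPPORT parameter into an absolute number of transactions."""
    if support <= 0:
        raise ValueError("support must be positive")
    if support <= 1.0:
        # Relative support: a fraction of the investigated transactions.
        # Assumption: exactly 1.0 is read as 100%, not as a count of one.
        return support * transactions
    # Absolute support: the value already is a transaction count.
    return float(support)

absolute = min_support_count(3.0, 1000)  # absolute count, as in the first query
relative = min_support_count(0.6, 1000)  # 60% of 1000, as in the second query
```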

### GENERATERULES

#### Description

This operator uses a list of tuples and creates rules like "x => y".

A rule is a special datatype called "AssociationRule", which is essentially a tuple of two patterns (one for the premise and one for the consequence of the rule).

#### Parameter

- ITEMSET: The attribute where to find the frequent itemset/pattern
- SUPPORT: The attribute where to find the support of the pattern (this is used for calculating the confidence!)
- CONFIDENCE: The minimal confidence of a rule, given as a fraction between 0.0 and 1.0

#### Example

This example generates only rules with a confidence of 60% or more.

```
rule = GENERATERULES({
itemset='set',
confidence=0.6,
support='support'
},
fpm)
```
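How the support of a pattern feeds into the confidence can be sketched in plain Python, using the usual definition confidence(X => Y) = support(X and Y together) / support(X). The function name `generate_rules`, the dictionary layout of the patterns and the sample data are assumptions for illustration; the actual operator may build its rules differently.

```python
from itertools import combinations

def generate_rules(patterns, min_confidence):
    """Derive rules premise => consequence with confidence >= min_confidence."""
    support = {frozenset(p["set"]): p["support"] for p in patterns}
    rules = []
    for itemset, itemset_support in support.items():
        # Every non-empty proper subset of the itemset is a candidate premise.
        for size in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), size):
                premise = frozenset(premise)
                if premise not in support:
                    continue  # no support known for the premise, so no confidence
                confidence = itemset_support / support[premise]
                if confidence >= min_confidence:
                    rules.append((sorted(premise), sorted(itemset - premise), confidence))
    return rules

# Hypothetical frequent patterns with the attributes id, set and support.
patterns = [
    {"id": 1, "set": ["bread"], "support": 3},
    {"id": 2, "set": ["milk"], "support": 2},
    {"id": 3, "set": ["bread", "milk"], "support": 2},
]
rules = generate_rules(patterns, min_confidence=0.6)
```

Here "bread => milk" has a confidence of 2/3 and "milk => bread" a confidence of 1.0, so both pass the 60% threshold.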

# Ensembles

You can combine all of these operators with other operators in Odysseus to create ensembles.

Suppose, for example, that there is one stream with wind speed and power (speedandpower) and another one that contains only the wind speed (speed, e.g. a forecast). Then it is possible to create different regression functions, apply them, and weight the regression results with an aggregation.

One example:

```
/// create the first classifier by using SMO
smo = CLASSIFICATION_LEARN({
class='power',
learner = 'weka',
algorithm = ['model'='SMO-REGRESSION']
},
speedandpower
)
/// create the second classifier by using gaussian processes
gaussian = CLASSIFICATION_LEARN({
class='power',
learner = 'weka',
algorithm = ['model'='GAUSSIAN-PROCESSES']
},
speedandpower
)
/// create the third classifier by using a linear regression
linear = CLASSIFICATION_LEARN({
class='power',
learner = 'weka',
algorithm = ['model'='LINEAR-REGRESSION']
},
speedandpower
)
/// union them all into one stream
unioned = UNION(smo, gaussian, linear)
/// then, classify them - each tuple will be classified by using all three classifiers
ensemble = CLASSIFY(speed, unioned)
/// aggregate the clazz using average - which allows a weighted kind of voting
agg = AGGREGATE({
aggregations=[
['AVG', 'clazz', 'powerForecast', 'DOUBLE']
]
},
ensemble
)
```
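The final aggregation step of this query can be pictured with a short plain-Python sketch: every wind speed tuple has been classified by all three models, and the "clazz" values that belong to the same input are averaged into one forecast. Grouping by the `windspeed` value is an assumption that stands in for the way the stream engine relates tuples to each other; the function name `average_forecast` and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean

def average_forecast(classified):
    """Average 'clazz' over all classifier outputs that belong to the same input tuple."""
    groups = defaultdict(list)
    for t in classified:
        # Assumption: tuples with the same wind speed stem from the same input.
        groups[t["windspeed"]].append(t["clazz"])
    return {speed: mean(values) for speed, values in groups.items()}

# Hypothetical output of CLASSIFY: each input appears once per classifier.
ensemble = [
    {"windspeed": 5.0, "clazz": 100.0},
    {"windspeed": 5.0, "clazz": 110.0},
    {"windspeed": 5.0, "clazz": 120.0},
    {"windspeed": 8.0, "clazz": 300.0},
    {"windspeed": 8.0, "clazz": 330.0},
    {"windspeed": 8.0, "clazz": 360.0},
]
power_forecast = average_forecast(ensemble)
```

A plain average gives every model the same weight; unequal weights would need a weighted mean instead.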

# Examples

to be added