This page describes how to use the machine learning (aka mining) bundle.

Operators

CLUSTERING

Description

This operator clusters a set of tuples.

Parameter

ATTRIBUTES: The attributes of the incoming tuples that the distance/similarity function should consider for clustering. Note that not all attribute types are supported here.
CLUSTERER: The clustering algorithm that should be used
- Currently implemented: kMeans, Weka (which in turn has further algorithms)
ALGORITHM: A set of options to describe the algorithm

Example

This example uses the Weka clusterer with the "simplekmeans" algorithm. The argument string passed to Weka's SimpleKMeans is "-N 3", which sets the number of clusters to 3.

clustered = CLUSTERING({
                attributes=['age', 'income'],
                clusterer='weka',
                algorithm = [
                  'model'='SimplekMeans',
                  'arguments'='-N 3'
                ]
              }, inputoperator)

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation linked for each algorithm:

SIMPLEKMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/SimpleKMeans.html)
EM (http://weka.sourceforge.net/doc.dev/weka/clusterers/EM.html)
COBWEB (http://weka.sourceforge.net/doc.dev/weka/clusterers/Cobweb.html)
FARTHESTFIRST (http://weka.sourceforge.net/doc.dev/weka/clusterers/FarthestFirst.html)
DENSITY_KMEANS (http://weka.sourceforge.net/doc.dev/weka/clusterers/MakeDensityBasedClusterer.html)
HIERARCHICAL (http://weka.sourceforge.net/doc.dev/weka/clusterers/HierarchicalClusterer.html)
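
To give an intuition for what a k-means style clusterer does with the selected attributes, here is a minimal Python sketch. It is not the Odysseus or Weka implementation: the sample tuples, the choice of the first k points as initial centroids, and the fixed iteration count are all assumptions made for illustration. Setting k=3 mirrors the "-N 3" argument from the example above.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    # Assumption: the first k points serve as the initial centroids
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (Euclidean distance)
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

# Hypothetical (age, income) tuples
tuples = [(25, 20000), (27, 21000), (45, 60000),
          (47, 62000), (65, 30000), (67, 31000)]
assignment, centroids = kmeans(tuples, k=3)
print(assignment)
```

Because the attributes are not normalized in this sketch, the income values dominate the distance. Which attribute types and scales work well in practice depends on the distance function of the chosen clusterer.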

CLASSIFICATION_LEARN

Description

This operator creates classifiers. Its result is a stream of classifiers, which are a datatype of their own.

Parameter

CLASS: The attribute that should be used as the class
NOMINALS: For nominal classifiers, this list provides the possible values, because some algorithms have to know them in advance
LEARNER: The algorithm that is used to construct the classifier
- Currently implemented: Weka (which in turn has further algorithms, see below)
ALGORITHM: A set of options to set up the algorithm

Example

This example uses the Weka learner with the "J48" model. It learns a classifier for the nominal attribute "attack", whose possible values are "back", "smurf" and "spy".

learned = CLASSIFICATION_LEARN({
              class='attack',
              nominals = ['attack'=['back', 'smurf', 'spy']],
              learner = 'weka',
              algorithm = [
                'model'='J48'
              ]
            }, inputoperator)

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka documentation.

Classification (nominal values):

J48 (Weka's implementation of C4.5, a decision tree induction algorithm)
NaiveBayes
DecisionTable
SMO (Sequential Minimal Optimization)

Regression (continuous values):

LINEAR-REGRESSION
SIMPLE_LINEAR-REGRESSION
LOGISTIC
SIMPLE-LOGISTIC
GAUSSIAN-PROCESSES
SMO-REGRESSION (a regression version of SMO)
MULTILAYER-PERCEPTRON

CLASSIFY

Description

This operator classifies a tuple by using a classifier. The operator needs two inputs: A stream of tuples that should be classified and a stream of classifiers (that normally comes from a CLASSIFICATION_LEARN operator).

It appends a new attribute called "clazz", which contains either the nominal class value or, for a regression, the continuous value.

For the CLASSIFY operator, the type of the classifier (tree, list, Bayes net, ...) doesn't matter. You can even mix them to classify the same tuple with different classifiers (see Ensembles). The left port is the input for the tuples to be classified, and the right port is the input for the classifiers.

Parameter

CLASSIFIER: The attribute where the classifier in the classifier-tuple can be found (normally, it's "classifier", which is given by the CLASSIFICATION_LEARN operator)

Example

classified = CLASSIFY({classifier='classifier'}, test, learned)
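
The behavior described above, where every incoming tuple is extended by a "clazz" attribute once per available classifier, can be sketched in Python as follows. This is only an illustration and ignores the stream and time semantics of Odysseus; the "packets" attribute and the hand-written rule standing in for a learned classifier are hypothetical.

```python
def classify(tuples, classifiers):
    """Append a 'clazz' attribute to each tuple, once per classifier."""
    return [{**t, "clazz": clf(t)} for t in tuples for clf in classifiers]

# Hypothetical classifier: a hand-written rule standing in for a learned model
def by_packet_count(t):
    return "smurf" if t["packets"] > 100 else "back"

# Hypothetical tuples to be classified (the left input)
test = [{"packets": 250}, {"packets": 10}]

# The list of classifiers plays the role of the right input
classified = classify(test, [by_packet_count])
print(classified)
```

With more than one classifier in the list, each input tuple yields one output tuple per classifier, which is what the Ensembles section relies on.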

FREQUENTITEMSET

Description

This operator creates frequent item sets from a given stream.

Each tuple in the result stream has three attributes:

id: the number (a simple counter) of the pattern
set: the frequent pattern, which is a list of tuples (a nested attribute, i.e. non-first normal form, NF²)
support: the support of the pattern

Parameter

SUPPORT: The minimal support that defines what is frequent. This can be either an absolute count greater than 1.0 or a fraction between 0.0 and 1.0. A fraction is interpreted relative to the number of transactions.
TRANSACTIONS: The number of transactions that should be investigated
- A transaction is a snapshot of a window, so each time a window changes, there is a new transaction
LEARNER: the algorithm that is used
- Currently implemented: fpgrowth, Weka (which in turn has further algorithms)
ALGORITHM: A set of options to describe the algorithm

Example

This example uses FP-Growth to find frequent item sets. It does not need any parameters in algorithm.

/// support is 3 out of 1000 transactions
fpm =  FREQUENTITEMSET({support=3.0, transactions=1000, learner = 'fpgrowth'}, inputoperator)

/// support is 60% out of 1000 transactions, so it is equal to a support of 600.0
fpm =  FREQUENTITEMSET({support=0.6, transactions=1000, learner = 'fpgrowth'}, inputoperator)
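
The relation between the two ways of writing the support can be sketched in a few lines of Python. The function name is hypothetical, and treating every value up to and including 1.0 as a fraction is an assumption based on the parameter description above, not a statement about the actual implementation.

```python
def min_support_count(support, transactions):
    """Return the minimal support as an absolute number of transactions."""
    if support > 1.0:
        # Values above 1.0 are already absolute counts
        return support
    # Assumption: values in [0.0, 1.0] are fractions of the transaction count
    return support * transactions

# The two settings from the example above
absolute = min_support_count(3.0, 1000)
relative = min_support_count(0.6, 1000)
print(absolute, relative)
```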

GENERATERULES

Description

This operator uses a list of tuples and creates rules like "x => y".

A rule is a special datatype called "AssociationRule", which is essentially a tuple of two patterns (one for the premise and one for the consequence of the rule).

Parameter

ITEMSET: The attribute where to find the frequent itemset/pattern
SUPPORT: The attribute where to find the support of the pattern (this is used for calculating the confidence!)
CONFIDENCE: The minimal confidence of a rule, given as a fraction between 0.0 and 1.0 (e.g. 0.6 for 60%)

Example

This example generates only rules with a confidence of 60% or more.

rule = GENERATERULES({
            itemset='set',
            confidence=0.6,
            support='support'
          },
          fpm)
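
To show why the support is needed for the confidence, here is a small Python sketch using the textbook definition confidence(X => Y) = support(X and Y together) / support(X). It is an assumption that GENERATERULES follows exactly this definition; the function name, the item names and the support counts are made up for illustration.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """supports maps frozenset item sets to their support counts."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a premise and a consequence
        for size in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), size):
                premise = frozenset(premise)
                if premise not in supports:
                    continue  # support of the premise is unknown
                confidence = support / supports[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, confidence))
    return rules

# Hypothetical frequent item sets with their supports
supports = {
    frozenset({"bread"}): 10,
    frozenset({"butter"}): 5,
    frozenset({"bread", "butter"}): 5,
}
rules = generate_rules(supports, 0.6)
print(rules)
```

Here "butter => bread" is kept because all 5 transactions with butter also contain bread, while "bread => butter" reaches only 5 of 10 and falls below the threshold of 0.6.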

Ensembles

You can simply combine all operators with other operators in Odysseus to create ensembles.

Suppose, for example, that we have one stream with wind speed and power (speedandpower) and another one containing only the wind speed (speed), e.g. a forecast.

Then it is possible to create different regression functions, apply them, and weight the regression results by an aggregation.

One example:

/// create the first classifier by using SMO
smo = CLASSIFICATION_LEARN({
          class='power',
          learner = 'weka',
          algorithm = ['model'='SMO-REGRESSION']
        },
        speedandpower
      )
/// create the second classifier by using gaussian processes
gaussian = CLASSIFICATION_LEARN({
                class='power',
                learner = 'weka',
                algorithm = ['model'='GAUSSIAN-PROCESSES']
              },
              speedandpower
            )
/// create the third classifier by using a linear regression
linear = CLASSIFICATION_LEARN({
              class='power',
              learner = 'weka',
              algorithm = ['model'='LINEAR-REGRESSION']
            },
            speedandpower
          )   
/// union them all into one stream
unioned = UNION(smo, gaussian, linear)

/// then, classify them - each tuple will be classified by using all three classifiers
ensemble = CLASSIFY(speed, unioned)

/// aggregate the clazz using average - which allows a weighted kind of voting
agg = AGGREGATE({
          aggregations=[            
            ['AVG', 'clazz', 'powerForecast', 'DOUBLE']
          ]
        },
        ensemble
      )
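
The effect of the final aggregation can be illustrated with a short Python sketch. The three functions below are hypothetical stand-ins for the learned regression functions, and the sketch assumes that the average is taken over the "clazz" values produced for one wind speed value; how AGGREGATE groups tuples in a real query depends on the windows in use.

```python
from statistics import mean

# Hypothetical stand-ins for the three learned regression functions
def smo(speed):
    return 2.0 * speed

def gaussian(speed):
    return 2.0 * speed + 1.0

def linear(speed):
    return 2.0 * speed - 1.0

def ensemble_forecast(speed, regressors):
    """Average the 'clazz' values of all regressors for one wind speed."""
    clazz_values = [regress(speed) for regress in regressors]
    return mean(clazz_values)

power_forecast = ensemble_forecast(5.0, [smo, gaussian, linear])
print(power_forecast)
```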

Examples

to be added