This page describes how to use the machine learning (aka mining) bundle.

Operators

CLUSTERING

Description

This operator clusters a set of tuples.

Parameter

Example

This example uses the Weka clusterer with the "SimpleKMeans" algorithm. The argument string "-N 3" is passed on to Weka's SimpleKMeans, where -N sets the number of clusters (here 3).

clustered = CLUSTERING({
                attributes=['age', 'income'],
                clusterer='weka',
                algorithm = [
                  'model'='SimplekMeans',
                  'arguments'='-N 3'
                ]
              }, inputoperator)
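
To illustrate what a clusterer configured for three clusters does with a batch of (age, income) tuples, here is a minimal plain-Python k-means sketch. It is neither the Weka implementation nor the way Odysseus invokes it; the function name `kmeans`, the sample tuples, and the naive initialisation with the first k points are assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Naive k-means returning (centroids, labels). Illustrative only."""
    # Assumption: start from the first k points (Weka chooses seeds differently).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical (age, income) tuples forming three obvious groups.
tuples = [(20, 1000), (40, 5000), (60, 9000), (21, 1100), (41, 5100), (61, 9100)]
centroids, labels = kmeans(tuples, k=3)
```

Each entry of `labels` is the cluster index assigned to the tuple at the same position, which is the kind of information a clustering operator attaches to its input.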

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka docs.

CLASSIFICATION_LEARN

Description

This operator creates classifiers. Its result is therefore a stream of classifiers (a classifier is a datatype of its own).

Parameter

Example

This example uses the Weka learner with the "J48" algorithm (a decision tree). The class attribute is "attack", and its possible nominal values are "back", "smurf" and "spy". No further arguments are passed to the algorithm.

learned = CLASSIFICATION_LEARN({
              class='attack',
              nominals = ['attack'=['back', 'smurf', 'spy']],
              learner = 'weka',
              algorithm = [
                'model'='J48'
              ]
            }, inputoperator)

For Weka, the following algorithms can currently be used as the "model". Further details and possible arguments can be found in the Weka docs.

Classification (nominal values):

Regression (continuous values):

CLASSIFY

Description

This operator classifies tuples by using a classifier. It needs two inputs: a stream of tuples to be classified and a stream of classifiers (which normally comes from a CLASSIFICATION_LEARN operator).

It appends a new attribute called "clazz", which contains either the nominal class value or, for a regression, the continuous value.

For the CLASSIFY operator, the type of the classifier (tree, list, Bayes net, ...) does not matter. You may even mix them to classify the same tuple with different classifiers (see Ensembles). The left port is the input for the tuples to be classified, and the right port is the input for the classifiers.

Parameter

Example

classified = CLASSIFY({classifier='classifier'}, test, learned) 
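
The two-input behaviour described above can be pictured with a small Python sketch: every incoming tuple is passed to every available classifier, and the result is the tuple extended by a "clazz" attribute. This only mimics the idea on in-memory lists; the function `classify`, the dict representation of tuples, the attribute `duration` and the threshold rule are hypothetical, and the real operator works on streams whose timing semantics are not modelled here.

```python
def classify(tuples, classifiers):
    """Yield each tuple once per classifier, extended by a 'clazz' attribute."""
    for t in tuples:
        for classifier in classifiers:
            # Copy the tuple and append the predicted value as 'clazz'.
            yield {**t, "clazz": classifier(t)}

# Hypothetical stand-in for a learned classifier: a fixed threshold rule.
def threshold_classifier(t):
    return "smurf" if t["duration"] > 10 else "back"

test_tuples = [{"duration": 5}, {"duration": 50}]
classified = list(classify(test_tuples, [threshold_classifier]))
```

Passing several classifiers yields several output tuples per input tuple, which is what the Ensembles section builds on.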

FREQUENTITEMSET

Description

This operator creates frequent item sets from a given stream.

Each tuple in the result stream has 3 attributes:

Parameter

Example

This example uses FP-Growth to find frequent item sets. FP-Growth does not need any parameters in "algorithm".

/// support is 3 out of 1000 transactions
fpm =  FREQUENTITEMSET({support=3.0, transactions=1000, learner = 'fpgrowth'}, inputoperator)

/// support is 60% out of 1000 transactions, so it is equal to a support of 600.0
fpm =  FREQUENTITEMSET({support=0.6, transactions=1000, learner = 'fpgrowth'}, inputoperator)
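
The arithmetic behind the two support settings can be sketched in Python. The helper `absolute_support` is hypothetical, and the rule that values up to 1.0 are read as a fraction while larger values are read as an absolute count is an assumption inferred from the two comments above, not a documented guarantee of the operator.

```python
def absolute_support(support, transactions):
    """Convert a support setting into an absolute number of transactions.

    Assumption: values up to 1.0 are a fraction of `transactions`,
    larger values are already an absolute count.
    """
    if support <= 1.0:
        return support * transactions
    return float(support)

absolute_from_count = absolute_support(3.0, 1000)     # absolute setting
absolute_from_fraction = absolute_support(0.6, 1000)  # relative setting (60%)
```

Under this assumption, a setting of 3.0 stays at 3 transactions and 0.6 corresponds to 600 of the 1000 transactions.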

GENERATERULES

Description

This operator uses a list of tuples and creates rules like "x => y".

A rule is a special datatype called "AssociationRule", which is essentially a tuple of two patterns (one for the premise and one for the consequence of the rule).

Parameter

Example

This example generates only rules with a confidence of 60% or more.

rule = GENERATERULES({
            itemset='set',
            confidence=0.6,
            support='support'
          },
          fpm)
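
As a rough illustration of what a confidence threshold means, the following Python sketch derives rules from frequent item sets using the usual definition confidence(X => Y) = support(X and Y together) / support(X). It is not the operator's implementation; the function `generate_rules`, the dict layout and the sample item sets are hypothetical.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """Derive (premise, consequence, confidence) rules from frequent item sets.

    `supports` maps each frequent item set (a frozenset) to its support count.
    """
    rules = []
    for itemset, support in supports.items():
        # Try every non-empty proper subset of the item set as premise.
        for size in range(1, len(itemset)):
            for premise in map(frozenset, combinations(sorted(itemset), size)):
                if premise not in supports:
                    continue  # support of the premise is unknown, skip it
                confidence = support / supports[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, confidence))
    return rules

# Hypothetical frequent item sets with their support counts.
supports = {
    frozenset({"bread"}): 10,
    frozenset({"butter"}): 5,
    frozenset({"bread", "butter"}): 4,
}
rules = generate_rules(supports, min_confidence=0.6)
```

With this sample data, only "butter => bread" passes the threshold (4 of 5 transactions), while "bread => butter" (4 of 10) is dropped.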

Ensembles

You can simply combine all operators with other operators in Odysseus to create ensembles.

Suppose, for example, that we have a stream with wind speed and power (speedandpower) and another one containing only the wind speed (speed, e.g. a forecast). It is then possible to learn different regression functions, apply all of them, and combine the regression results by an aggregation.

One example:

/// create the first classifier by using SMO
smo = CLASSIFICATION_LEARN({
          class='power',
          learner = 'weka',
          algorithm = ['model'='SMO-REGRESSION']
        },
        speedandpower
      )
/// create the second classifier by using gaussian processes
gaussian = CLASSIFICATION_LEARN({
                class='power',
                learner = 'weka',
                algorithm = ['model'='GAUSSIAN-PROCESSES']
              },
              speedandpower
            )
/// create the third classifier by using a linear regression
linear = CLASSIFICATION_LEARN({
              class='power',
              learner = 'weka',
              algorithm = ['model'='LINEAR-REGRESSION']
            },
            speedandpower
          )   
/// union them all into one stream
unioned = UNION(smo, gaussian, linear)

/// then, classify them - each tuple will be classified by using all three classifiers
ensemble = CLASSIFY(speed, unioned)

/// aggregate the clazz values using the average, which acts as a simple kind of voting
agg = AGGREGATE({
          aggregations=[            
            ['AVG', 'clazz', 'powerForecast', 'DOUBLE']
          ]
        },
        ensemble
      )
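
The effect of the query above can be pictured with a small Python sketch: several regression functions are applied to the same wind speed and their outputs are averaged into one forecast. The three lambdas are hypothetical stand-ins for the learned SMO, Gaussian process and linear regression models, and `ensemble_forecast` is a name chosen for illustration.

```python
from statistics import mean

def ensemble_forecast(regressors, windspeed):
    """Apply every regressor to the same input and average the results."""
    predictions = [predict(windspeed) for predict in regressors]
    return mean(predictions)

# Hypothetical stand-ins for the three learned regression functions.
regressors = [
    lambda speed: 2.0 * speed,
    lambda speed: 2.0 * speed + 3.0,
    lambda speed: 2.0 * speed - 3.0,
]
power_forecast = ensemble_forecast(regressors, windspeed=10.0)
```

A plain average gives every model the same weight; different weights would need a different aggregation.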

Examples

to be added