Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The field of use of Data Mining is hard to narrow and also the representation of the process is very abstract. The Predictive Model Markup Language (PMML) is a

standard exchange format by DMG for the process itself and the structure of the needed data, so that the compability of different software developer is maximized.

PMML follows a XML-Scheme which can be found on the  homepage of DMG. The standard (now in v4.3, state october 2017) has about 700 language elements.

PMML can be enhanced at a lot of (nearly all) elements. These enhancements are added as extensions to child-elements at the appropriate elements. Further information can be found on http://dmg.org/pmml/v4-3/GeneralStructure.html#xsdElement_Extensionfound here.

The XSD for an extension provides own elements:

...

General extensions should be defined for the header. An example for that is, that Kafka topics encapsulate KafkaTopics encapsulates data sources for each data schema. To bind to more than one topic, you could create unions with an optional number of topic elements, which serve as data sources.

...

Code Block
languagebash
<Header>
   <Extension>
      <KafkaTopics model="modelA">
         <Topic name="exampleTopic" />
      </KafkaTopics>
      <KafkaTopics model="modelB">
         <Topic name="exampleTopic2" />
         <Topic name="exampleTopicFoo" />
      </KafkaTopics>
   </Extension>
<!-- other Elements -->
</Header>

Structure

A possible access architecture might look like this:

Image Added

The PMML document is loaded by an ACCESS operator (e.g. by a Kafka Topictopic) and runs through all necessary extensions to get all essential information. A mining operator, which gets the model as an input, can then pass on current models (if a new PMML is send over a Topictopic), without starting the query again. If a UNION operator is the prefix of a mining operator, you could access further Topics topics over a COMMAND operator in the future, as long as they don't change during runtime (this type of overloads can be defined in extentions). 

Query of Mining Models (PMML)

...

Name

...

Query of Mining Models

...

...


PMML

...

  1. If the Http interface is protected through authentication, a PG EStream user with enough rights to access the needed queries exists.
  2. Default PMML models exist and don't have to be commissioned.
  3. There has to be a query available, which enables the user to request all accessible processes/models.

...

  1. PMML model is requested for appropriate PMML specification.
  2. Based on the PMML specification a hash is computed and and used to search through the Big Data Archive for the last appropriate PMML model.
  3. If the request was successful and there was at least one PMML model matching the declared PMML specification, it is returned as a response.

...

Model doesn't exist, because the query is not an arranged standard PMML model:

  1. If the PMML model is not (anymore) supported, then a matching error message should be returned.
  2. If the requested is supported, matching information should be returned, stating that currently no appropriate model exists but that a new one can be commissioned.

...

Execute PMML in Odysseus

Execute correct product:

  1. Either check out the Odysseus or EStream branch.

  2. Execute the product from de.pgestream.odysseus.studio (so all necessary bundles are automatically loaded).

...

Operators

ProtocolHandler:

To access a PMML document you can assign 'PMML' as protocol handler. There are no other parameters necessary and the PMML model has port 0 as an output. You can use e.g. Kafka as a transport handler to reload PMML documents during runtime of queries.

Parameters:

  • fireOnce: If the value is set to 'true', then only the very first document is read by the TransportHandler and the state of the operators is set to 'done' afterwards. If the InputStream ends somewhere else, the query is terminated. 

...

Code Block
languagejs
titleProtocolHandler
linenumberstrue
models = ACCESS({
    source='pmml-data',
    wrapper='GenericPull',
    transport='HTTP',
    protocol='PMML',
    dataHandler='Tuple',
    options=[
        ['uri', 'http://dmg.org/pmml/pmml_examples/rattle_pmml_examples/AuditKMeans.xml'], 
        ['method', 'get'],
        ['scheduler.delay', '100000'],
        ['fireOnce', 'true']
    ],
    schema=[
        ['models', 'Object'],
    ]
})

Operator:

A PMML document can potentially contain more than one mining models and is therefore outside of the TransporthandlerTransportHandler. After the procurement of models you can choose or run a model by name with the operator PMML_EVAL.

The input on port 1 (data) should contain data, which corresponds to the schema of the requested data of the PMML document. More attributes are ignored. Different models can be executed with the same data without integrating more operators in between for filtering, or transporting more meta-data with the analyzed data.

Parameters:

  • modelName: Every model within a PMML document has a unique name, so if 'default' is chosen the first model (in XML-order) is executed.
  • output (optional, default: NONE): Describes the output mode of the operator, where you can choose between none, model and input 
    • NONE: Output is only the result, the attribute is then called 'prediction'.
    • MODEL: All  fields used by the model are output in addition to 'prediction'. 'prediction' is put at the end of the column.
    • INPUT: All values from the data are forwarded to the output.
  • stackSize (optional, default: 100): Describes FIFO cache as a stack, which is used if the data returns elements, but no EvaluationModel was received. As soon as a model is received, all elements in the stack are executed before the next arriving element (until maximum size of the stack is reached). After each new model, the stack gets emptied.

Following example shows usage of thePMMLthe PMML_EVAL operator:

Code Block
languagejs
titlePMML_EVAL Operator
linenumberstrue
results = PMML_EVAL({
    modelName='KMeans_Model',
    output='input',
    stackSize=5
}, models, data)

Complete example:

In http://dmg.org/pmml/pmml_examples/index.html you can find examples with correlating data sets: 

...