autodeployai / pmml4s-spark   1.0.1

Apache License 2.0 GitHub

PMML scoring library for Spark as SparkML Transformer

Scala versions: 2.12 2.11

PMML4S-Spark

PMML4S-Spark is a PMML (Predictive Model Markup Language) scoring library for Spark, exposed as a SparkML Transformer.

Features

PMML4S-Spark is the Spark wrapper of PMML4S; see the PMML4S project for details of the supported features.

Prerequisites

  • Spark >= 2.0.0

Installation

SBT users
libraryDependencies += "org.pmml4s" %% "pmml4s-spark" % pmml4sSparkVersion
Maven users
<dependency>
  <groupId>org.pmml4s</groupId>
  <artifactId>pmml4s-spark_${scala.version}</artifactId>
  <version>${pmml4s-spark.version}</version>
</dependency>
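
As a rough sketch of how the SBT line above might sit in a build file: the version 1.0.1 is the one shown at the top of this page, while the Scala version, the Spark version, and the `provided` scope are assumptions chosen for illustration, not requirements stated by the project.

```scala
// build.sbt (illustrative sketch; adjust versions to your environment)

// Assumption: Scala 2.12, one of the Scala versions listed for this release.
scalaVersion := "2.12.10"

// Version shown at the top of this page.
val pmml4sSparkVersion = "1.0.1"

// Assumption: any Spark release that satisfies the ">= 2.0.0" prerequisite;
// 2.4.5 is only an example.
val sparkVersion = "2.4.5"

libraryDependencies ++= Seq(
  "org.pmml4s" %% "pmml4s-spark" % pmml4sSparkVersion,
  // "provided" assumes Spark itself is supplied by the cluster at runtime.
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```

For Maven, the `_${scala.version}` suffix of the artifact ID corresponds to the Scala binary version (for example 2.11 or 2.12), which is what `%%` appends automatically in SBT.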

Use PMML for Spark in Scala

  1. Load model.

    import scala.io.Source
    import org.pmml4s.model.Model
    import org.pmml4s.spark.ScoreModel
    
    // The main constructor accepts an object of org.pmml4s.model.Model
    val model = ScoreModel(Model(Source.fromURL(new java.net.URL("http://dmg.org/pmml/pmml_examples/KNIME_PMML_4.1_Examples/single_iris_dectree.xml"))))

    or

    import org.pmml4s.spark.ScoreModel
    
    // Load the model with one of the helper methods, which take a pathname, a file object, a string, an array of bytes, or an input stream.
    val model = ScoreModel.fromFile("single_iris_dectree.xml")

  2. Call transform(dataset) to run batch scoring against an input dataset.

    // The data is from http://dmg.org/pmml/pmml_examples/Iris.csv
    val df = spark.read.
      format("csv").
      options(Map("header" -> "true", "inferSchema" -> "true")).
      load("Iris.csv")
    
    val scoreDf = model.transform(df)
    scala> scoreDf.show(5)
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    |sepal_length|sepal_width|petal_length|petal_width|      class|predicted_class|probability|probability_Iris-setosa|probability_Iris-versicolor|probability_Iris-virginica|node_id|
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    |         5.1|        3.5|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.9|        3.0|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.7|        3.2|         1.3|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.6|        3.1|         1.5|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         5.0|        3.6|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    only showing top 5 rows
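
Because the scored DataFrame keeps the input columns next to the prediction columns, it can be queried like any other DataFrame. The following is a minimal sketch, not part of the library, that repeats the two steps above in a standalone program and compares the `class` column with `predicted_class`. The object name `IrisAccuracy` and the `local[*]` master are assumptions for illustration, and the column names are taken from the output shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.pmml4s.spark.ScoreModel

// Hypothetical driver object; the name is for illustration only.
object IrisAccuracy {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session; on a cluster the master is set externally.
    val spark = SparkSession.builder()
      .appName("pmml4s-spark-iris")
      .master("local[*]")
      .getOrCreate()

    // Same model file and data file as in the steps above.
    val model = ScoreModel.fromFile("single_iris_dectree.xml")
    val df = spark.read
      .format("csv")
      .options(Map("header" -> "true", "inferSchema" -> "true"))
      .load("Iris.csv")

    val scoreDf = model.transform(df)

    // Count the rows where the predicted label matches the actual label.
    val total = scoreDf.count()
    val correct = scoreDf.filter(col("class") === col("predicted_class")).count()
    println(f"accuracy: ${correct.toDouble / total}%.4f")

    spark.stop()
  }
}
```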

Use PMML for Spark in Java

  1. Load model.

    import org.pmml4s.spark.ScoreModel;
    
    // Load the model with one of the helper methods, which take a pathname, a file object, a string, an array of bytes, or an input stream.
    ScoreModel model = ScoreModel.fromFile("single_iris_dectree.xml");

  2. Call transform(dataset) to run batch scoring against an input dataset.

    import org.apache.spark.sql.Dataset;
    
    // The data is from http://dmg.org/pmml/pmml_examples/Iris.csv
    Dataset<?> df = spark.read().
       format("csv").
       option("header", "true").
       option("inferSchema", "true").
       load("Iris.csv"); 
    
    Dataset<?> scoreDf = model.transform(df);
    scoreDf.show(5);

Use PMML in PySpark

See the PyPMML-Spark project. PyPMML-Spark is a Python PMML scoring library for PySpark, exposed as a SparkML Transformer; it is the Python API for PMML4S-Spark.

Use PMML in Scala or Java

See the PMML4S project. PMML4S is a PMML scoring library for Scala. It provides Evaluator APIs for both Scala and Java.

Use PMML in Python

See the PyPMML project. PyPMML is a Python PMML scoring library; it is the Python API for PMML4S.

Deploy PMML as REST API

See the AI-Serving project. AI-Serving serves AI/ML models in the open standard formats PMML and ONNX over both HTTP (REST API) and gRPC endpoints.

Deploy and Manage AI/ML models at scale

See the DaaS system, which deploys AI/ML models in production at scale on Kubernetes.

Support

If you have any questions about the PMML4S-Spark library, please open an issue on this repository.

Feedback and contributions of any kind are always welcome.

License

PMML4S-Spark is licensed under the Apache License 2.0.