Statistical Machine Intelligence & Learning Engine

SMILE (Statistical Machine Intelligence & Learning Engine) is a comprehensive, high-performance machine learning framework for the JVM. SMILE v5+ requires Java 25; v4.x requires Java 21; all previous versions require Java 8. SMILE also provides idiomatic APIs for Scala and Kotlin. With advanced data structures and algorithms, SMILE delivers state-of-the-art performance across every aspect of machine learning.

SMILE Studio is an agentic IDE for data science using Python, Java, or Scala. See studio/README.md how to get your first project up and start interacting with your data with natural language in a few minutes.

Features
Module Map
Installation
Quick Start
SMILE Studio & Shell
Model Serialization
Visualization
License
Issues & Discussions
Contributing
Maintainers
Gallery

Features

Area	Highlights
LLM	LLaMA-3 inference, tiktoken BPE tokenizer, OpenAI-compatible REST server, SSE chat streaming
Deep Learning	LibTorch/GPU backend, EfficientNet-V2 image classification, custom layer API
Classification	SVM, Decision Trees, Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Neural Networks, RBF Networks, MaxEnt, KNN, Naïve Bayes, LDA/QDA/RDA
Regression	SVR, Gaussian Process, Regression Trees, GBDT, Random Forest, RBF, OLS, LASSO, ElasticNet, Ridge
Clustering	BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical, SIB, SOM, Spectral, Min-Entropy
Manifold Learning	IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA
Feature Engineering	Genetic Algorithm selection, Ensemble selection, TreeSHAP, SNR, Sum-Squares ratio, data transformations, formula API
NLP	Sentence / word tokenization, Bigram test, Phrase & Keyword extraction, Stemmer, POS tagging, Relevance ranking
Association Rules	FP-growth frequent itemset mining
Sequence Learning	Hidden Markov Model, Conditional Random Field
Nearest Neighbor	BK-Tree, Cover Tree, KD-Tree, SimHash, LSH
Numerical Methods	Linear algebra, numerical optimization (BFGS, L-BFGS), interpolation, wavelets, RBF, distributions, hypothesis tests
Visualization	Swing plots (scatter, line, bar, box, histogram, surface, heatmap, contour, …) and declarative Vega-Lite charts

Module Map

Each module has its own detailed user guide. Click the README link for the module overview, or drill into individual topic guides.

`base/` — Foundation

Data structures, math, linear algebra, statistical utilities, I/O

Document	Topics
README	Module overview and dependency setup
DATA_FRAME.md	DataFrame API — creation, selection, transformation
DATA_IO.md	CSV, JSON, Parquet, Arrow, JDBC, Avro readers/writers
DATA_TRANSFORMATION.md	Scalers, encoders, imputers, feature transforms
DATASET.md	Built-in benchmark and real-world datasets
FORMULA.md	R-style formula language for model matrices
DISTRIBUTIONS.md	Probability distributions (Normal, Poisson, Beta, …)
HYPOTHESIS_TESTING.md	t-test, chi-squared, ANOVA, KS-test, …
DISTANCES.md	Euclidean, Mahalanobis, Hamming, edit distance, …
NEAREST_NEIGHBOR.md	KD-Tree, Cover Tree, BK-Tree, LSH
KERNELS.md	Gaussian, polynomial, Laplacian, and other kernel functions
RBF.md	Radial basis function networks
INTERPOLATION.md	Linear, cubic spline, bilinear, bicubic
GRAPH.md	Adjacency list/matrix graph, BFS/DFS, spanning trees
SORT.md	Quick sort, heap sort, counting sort, index sort
HASH.md	Locality-sensitive hashing, SimHash
RNG.md	Random number generators, sampling, permutations
BFGS.md	L-BFGS and BFGS numerical optimizers
ICA.md	Independent Component Analysis
TENSOR.md	N-dimensional array (CPU tensor without LibTorch)
WAVELET.md	DWT, CWT, and wavelet families
GAP.md	GAP statistic for optimal cluster count estimation
COMPRESSED_SENSING.md	Compressed sensing and basis pursuit

`core/` — Machine Learning Algorithms

Classification, regression, clustering, manifold learning, and more

Document	Topics
README	Module overview
CLASSIFICATION.md	SVM, Random Forest, AdaBoost, GBDT, KNN, Naïve Bayes, LDA, …
REGRESSION.md	SVR, Gaussian Process, LASSO, Ridge, ElasticNet, GBDT, …
CLUSTERING.md	K-Means, DBSCAN, BIRCH, SOM, Spectral Clustering, …
FEATURE_ENGINEERING.md	Feature selection, PCA, ICA, projection, encoding
MANIFOLD.md	t-SNE, UMAP, IsoMap, LLE, Laplacian Eigenmap
ANOMALY_DETECTION.md	IsolationForest, one-class SVM, local outlier factor
ASSOCIATION_RULE_MINING.md	FP-growth, association rules, frequent itemsets
SEQUENCE.md	HMM (Baum-Welch, Viterbi), CRF
TIME_SERIES.md	ARIMA, box-plots, autocorrelation
REGRESSION.md	Full regression API reference
TRAINING.md	Cross-validation, bootstrap, hyper-parameter search
VALIDATION.md	Hold-out, k-fold, leave-one-out evaluation
VALIDATION_METRICS.md	Accuracy, AUC, F1, RMSE, MAE, confusion matrix
HYPER_PARAMETER_OPTIMIZATION.md	Grid search, random search, Bayesian optimization
VECTOR_QUANTIZATION.md	LVQ, Neural Gas, SOM as vector quantizers
ONNX.md	Exporting and importing models via ONNX

`deep/` — Deep Learning & LLMs

LibTorch-backed GPU/CPU tensor operations, neural network layers, LLaMA-3 inference, EfficientNet

Document	Topics
README	Full deep-learning & LLM user guide (tensors, layers, loss, optimizer, EfficientNet, LLaMA)

The deep/README.md covers:

smile.deep.tensor — Tensor factory, indexing, arithmetic, AutoScope memory management, dtype/device
smile.deep.layer — Linear, Conv2d, pooling, normalization (BN/GN/RMS), dropout, embedding, sequential blocks
smile.deep.activation — ReLU, GELU, SiLU, Tanh, Sigmoid, Softmax, GLU, HardShrink, …
smile.deep.Loss — MSE, cross-entropy, BCE, Huber, KL, hinge, and more
smile.deep.Optimizer — SGD, Adam, AdamW, RMSprop
smile.deep.Model — Abstract base class + training loop
smile.deep.metric — Accuracy, Precision, Recall, F1Score with macro/micro/weighted averaging
smile.llm — Message, Role, FinishReason, ChatCompletion records; sinusoidal & RoPE positional encodings
smile.llm.tokenizer — Tokenizer interface, Tiktoken BPE implementation (LLaMA-3 compatible)
smile.llm.llama — Full LLaMA-3 stack: Llama.build(), generate(), chat(), streaming via SubmissionPublisher
smile.vision — VisionModel, ImageDataset, EfficientNet.V2S/M/L() pretrained models, ImageNet labels
smile.vision.transform — Transform interface, ImageClassification pipeline, resize/crop/toTensor helpers

`nlp/` — Natural Language Processing

Text normalization, tokenization, POS tagging, stemming, relevance ranking

Document	Topics
README	Module overview
TOKENIZER.md	Sentence splitter, word tokenizer, regex tokenizer
POS.md	Part-of-speech tagging (Brill tagger, HMM tagger)
STEM.md	Porter, Lancaster, Lovins stemmers; lemmatization
COLLOCATION.md	Bigram/trigram statistical tests, phrase extraction
RELEVANCE.md	TF-IDF, BM25, keyword extraction
TAXONOMY.md	WordNet integration, synsets, hypernyms

`plot/` — Data Visualization

Swing-based interactive plots and declarative Vega-Lite charts

Document	Topics
README	Swing plotting API — scatter, line, bar, box, histogram, heatmap, surface, contour, wireframe
VEGA.md	Declarative `smile.plot.vega` (Vega-Lite) — JSON spec generation, web/Jupyter rendering

`serve/` — Inference Server

Quarkus-based REST inference service with OpenAI-compatible API and SSE streaming

Document	Topics
README	Building and running the server, `/chat/completions` endpoint, SSE streaming, configuration

`studio/` — Interactive Shell & Desktop IDE

An agentic IDE for data science using Python or SMILE

Document	Topics
README.md	Desktop Studio UX
CLI	CLI entry points (`smile`, `smile shell`, `smile scala`, `smile serve`)

`scala/` — Scala API

Idiomatic Scala shim — concise wrappers, symbolic operators, Scala collections integration

Document	Topics
README	API overview, `smile.classification`, `smile.regression`, `smile.clustering`, `smile.plot` in Scala

`kotlin/` — Kotlin API

Idiomatic Kotlin shim — extension functions, named parameters, builder DSLs

Document	Topics
README	API overview, extension functions, Kotlin-style builders
packages.md	Full package-by-package listing of all Kotlin extension functions

`json/` — JSON Library (Scala)

Lightweight zero-dependency JSON library for Scala with a clean DSL

Document	Topics
README	Parsing, building, pattern matching, path navigation, serialization

`spark/` — Apache Spark Integration

Use SMILE models inside Spark ML pipelines

Document	Topics
README	`SmileTransformer`, `SmileClassifier`, `SmileRegressor`; training and scoring in Spark DataFrames

Installation

Maven

<!-- Core ML algorithms -->
<dependency>
  <groupId>com.github.haifengl</groupId>
  <artifactId>smile-core</artifactId>
  <version>6.2.3</version>
</dependency>

<!-- Deep learning + LLMs (requires LibTorch) -->
<dependency>
  <groupId>com.github.haifengl</groupId>
  <artifactId>smile-deep</artifactId>
  <version>6.2.3</version>
</dependency>

<!-- Natural language processing -->
<dependency>
  <groupId>com.github.haifengl</groupId>
  <artifactId>smile-nlp</artifactId>
  <version>6.2.3</version>
</dependency>

<!-- Data visualization -->
<dependency>
  <groupId>com.github.haifengl</groupId>
  <artifactId>smile-plot</artifactId>
  <version>6.2.3</version>
</dependency>

SBT (Scala)

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "6.2.3"

Gradle (Kotlin)

dependencies {
    implementation("com.github.haifengl:smile-kotlin:6.2.3")
}

Native Libraries (BLAS / LAPACK)

Several algorithms (manifold learning, Gaussian Process, MLP, some clustering) require BLAS and LAPACK.

Linux (Ubuntu / Debian)

sudo apt update
sudo apt install libopenblas-dev libarpack2-dev

macOS (Homebrew)

brew install arpack
# If macOS SIP strips DYLD_LIBRARY_PATH, create a symlink to the dylib in your working dir:
ln -s /opt/homebrew/lib/libarpack.dylib .

Windows — pre-built DLLs are included in the bin/ directory of the release package. Add that directory to PATH.

GPU (CUDA) — make sure the LibTorch CUDA native libraries are on PATH (Windows) or LD_LIBRARY_PATH (Linux).

Quick Start

import smile.classification.RandomForest;
import smile.data.formula.Formula;
import smile.io.Read;

// Load data
var data = Read.csv("src/test/resources/iris.csv");

// Train a random forest
var forest = RandomForest.fit(Formula.lhs("species"), data);

// Predict
int label = forest.predict(data.get(0));
System.out.println("Predicted class: " + label);

For deep learning and LLM examples, see deep/README.md. For visualization examples, see plot/README.md.

SMILE Studio

SMILE Studio is an agentic IDE for data science using Python or SMILE on JVM. See studio/README.md for full documentation.

Download a pre-packaged release from the releases page, then:

path/to/smile/bin/setup      # install required native dependencies
path/to/smile/bin/smile      # launch SMILE Studio from your project directory

Other entry points:

Command	Description
`smile`	Desktop agentic IDE
`smile shell`	Java REPL with all SMILE packages pre-imported
`smile scala`	Scala REPL
`smile train`	Train a supervised learning model
`smile predict`	Predict on a file using a saved model
`smile serve`	Start the LLM inference server

To increase the JVM heap:

path/to/smile/bin/smile -J-Xmx30G

Model Serialization

Most SMILE models implement java.io.Serializable. You can serialize a trained model to disk and load it in a production environment or inside a Spark job:

// Save
try (var out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
    out.writeObject(forest);
}

// Load
try (var in = new ObjectInputStream(new FileInputStream("model.ser"))) {
    var loaded = (RandomForest) in.readObject();
}

Visualization

SMILE provides two visualization layers:

smile.plot.swing — Swing-based interactive 2D/3D plots. See plot/README.md.
smile.plot.vega — Declarative Vega-Lite charts for browsers and Jupyter. See plot/VEGA.md.

<dependency>
  <groupId>com.github.haifengl</groupId>
  <artifactId>smile-plot</artifactId>
  <version>6.2.3</version>
</dependency>

License

SMILE employs a dual license model designed to meet the development and distribution needs of both commercial distributors (OEMs, ISVs, VARs) and open source projects. For details, see LICENSE. To acquire a commercial license, contact smile.sales@outlook.com.

Issues & Discussions

Channel	Purpose
GitHub Discussions	Questions, ideas, show-and-tell
Stack Overflow `[smile]`	Technical Q&A
Issue Tracker	Bug reports and feature requests
Online Docs	Tutorials and programming guides
Java API · Scala API · Kotlin API · Clojure API	API Javadoc

Contributing

Please read CONTRIBUTING.md for build and test instructions.

Maintainers

Haifeng Li (@haifengl)
Karl Li (@kklioss)

Gallery

Scatterplot Matrix
Scatter Plot	Line Plot	Surface Plot
Bar Plot	Box Plot	Histogram Heatmap
Rolling Average	Geo Map	UMAP
Text Plot	Heatmap with Contour	Hexmap
IsoMap	LLE	Kernel PCA
Neural Network	SVM	Hierarchical Clustering
SOM	DBSCAN	Neural Gas
Wavelet	Exponential Family Mixture	Teapot Wireframe
Grid Interpolation