napalabs-ru / spark-hscan   0.1

Apache License 2.0 GitHub

hyperscan wrapper for spark

Scala versions: 2.11

spark-hscan

This project is hyperscan wrapper for spark to allow matching large numbers (up to tens of thousands) of regular expressions

Add it to your project

Spark shell

spark-shell --packages ru.napalabs.spark:spark-hscan_2.11:0.1

SBT

libraryDependencies += "ru.napalabs.spark" % "spark-hscan_2.11" % "0.1"

Usage examples

SparkSQL

import ru.napalabs.spark.hscan.implicits._

spark.registerHyperscanFuncs()

val df = spark.sql("""
select * from my_table 
    where hlike(text_field, array("pattern.*", "[a-Z]+other"))"""
)

Scala DSL:

import ru.napalabs.spark.hscan.functions._

val df = spark.read
         .format("parquet")
         .load("/path/to/files")
df.where(hlike($"text_col"), Array("pattern.*", "[a-Z]+other"))

Benchmark

As a benchmark we used hsbench (teakettle_2500 pattern set and alexa200.db dataset) benchmark results

Limitations

See limitations in hyperscan-java project.

Also, this project is only with spark 2.3.2. Compatibility with other versions of spark is not guaranteed.

Contributing

Feel free to raise issues or submit a pull request.

License

Apache License, Version 2.0