
See original announcements on:

For more information, see gallia-core documentation, in particular:

Description

This is the Spark RDD-powered counterpart to the genemania parent repo (which uses Gallia's "poor man's scaling" instead of Spark).

Test Run

You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli and don't mind the cost (see below).

The script does the following:

  • Creates an S3 bucket for the code and data
  • Retrieves code and uploads it to the bucket (source+binaries)
  • Retrieves the data (or a subset thereof) and uploads it to the bucket
  • Creates an EMR Spark cluster and runs the program as a single step
  • Waits for termination and logs the results
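The steps above can be sketched with aws-cli as follows. This is a hypothetical dry-run outline, not the actual testrun.sh: each command is printed rather than executed, and the bucket name, jar path, and cluster flags are illustrative placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

BUCKET="my-gallia-testrun"   # hypothetical bucket name
N_FILES="${1:-10}"           # how many input files to process
N_WORKERS="${2:-4}"          # number of Spark workers

# Dry-run helper: print each aws-cli command instead of executing it,
# so the sketch can be read without an AWS account.
run() { echo "+ $*"; }

# 1. Create an S3 bucket for the code and data
run aws s3 mb "s3://$BUCKET"

# 2. Upload the code (source + assembled jar)
run aws s3 cp target/scala-2.12/app-assembly.jar "s3://$BUCKET/code/"

# 3. Upload the input data (or the first N_FILES of it)
run aws s3 sync ./input "s3://$BUCKET/input/"

# 4. Create an EMR Spark cluster and submit the program as a single step
#    (one master node plus N_WORKERS core nodes)
run aws emr create-cluster \
  --applications Name=Spark \
  --instance-count "$((N_WORKERS + 1))" \
  --auto-terminate

# 5. Wait for termination, then fetch the logs (cluster id is a placeholder)
run aws emr wait cluster-terminated --cluster-id j-XXXXXXXX
```

The `run` wrapper is only there to make the sketch inert; the real script invokes aws-cli directly.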

To run it on a small subset (expect ~$3[2] in AWS charges), use:

./testrun.sh 10 4 # process first 10 files, using 4 workers

To run it in full (expect ~$18[2] in AWS charges), use:

./testrun.sh ALL <number-of-workers> # eg 60 workers
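For a rough sense of where these figures come from, here is a back-of-envelope cost model. The $0.15 blended per-instance-hour rate (EC2 plus EMR surcharge) is a hypothetical figure chosen only to illustrate the arithmetic; real prices vary by instance type and region.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rough cost: every worker is billed for the full wall-clock duration.
estimate_cost() {  # estimate_cost <workers> <hours> <rate-per-instance-hour>
  awk -v w="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", w * h * r }'
}

# Full run: 60 workers for ~120 minutes
estimate_cost 60 2 0.15    # -> 18.00
```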

The full EMR run will take about 120 minutes with 60 workers[1]. As one would expect, it follows the distribution below:

[distribution chart]

Input

Same input as the parent repo, except it is first uploaded to an S3 bucket: s3://<bucket>/input/

Output

Same output as the parent repo, except it is made available on the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files
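To retrieve and merge the shards, something like the following works. The aws-cli download is shown commented out; the merge step is demonstrated on two locally created stand-in shards, since gzip streams can simply be concatenated.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Real download (bucket name elided, as in the README):
# aws s3 cp "s3://<bucket>/output/" ./output/ --recursive

# Simulate two Spark output shards for illustration:
mkdir -p output
printf 'line-a\n' | gzip > output/part-00000.gz
printf 'line-b\n' | gzip > output/part-00001.gz

# Spark writes one part-NNNNN.gz file per partition; decompressing them
# in order yields the full output as a single stream.
gunzip -c output/part-*.gz > merged.txt
cat merged.txt
```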

Limitations

Notable limitations are:

  • Only available for Scala 2.12 because:
    • sbt-assembly does not seem to be available for 2.13
    • Spark support for 2.13 is still immature
  • The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment

See list of spark-related tasks for more limitations.

Footnotes

  • [1] Add ~1h to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
  • [2] The cost estimates provided are not guaranteed in any way; run at your own risk (but please let me know if yours differ significantly)

Contact

You may contact the author at [email protected]