
See original announcements on:

For more information, see gallia-core documentation, in particular:

Description

This is the Spark RDD-powered counterpart to the genemania parent repo (which uses Gallia's "poor man's scaling" instead of Spark).

Test Run

You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli and don't mind the cost (see below).

The script does the following:

  • Creates an S3 bucket for the code and data
  • Retrieves code and uploads it to the bucket (source+binaries)
  • Retrieves the data (or a subset thereof) and uploads it to the bucket
  • Creates an EMR Spark cluster and runs the program as a single step
  • Waits for termination and logs the results
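The steps above can be sketched with aws-cli as follows. This is a hypothetical dry-run outline, not the actual testrun.sh: each command is printed rather than executed, and the bucket name, jar path, and cluster flags are illustrative placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

BUCKET="my-gallia-testrun"   # hypothetical bucket name
N_FILES="${1:-10}"           # how many input files to process
N_WORKERS="${2:-4}"          # number of Spark workers

# Dry-run helper: print each aws-cli command instead of executing it,
# so the sketch can be read without an AWS account.
run() { echo "+ $*"; }

# 1. Create an S3 bucket for the code and data
run aws s3 mb "s3://$BUCKET"

# 2. Upload the code (source + assembled jar)
run aws s3 cp target/scala-2.12/app-assembly.jar "s3://$BUCKET/code/"

# 3. Upload the input data (or the first N_FILES of it)
run aws s3 sync ./input "s3://$BUCKET/input/"

# 4. Create an EMR Spark cluster and submit the program as a single step
#    (one master node plus N_WORKERS core nodes)
run aws emr create-cluster \
  --applications Name=Spark \
  --instance-count "$((N_WORKERS + 1))" \
  --auto-terminate

# 5. Wait for termination, then fetch the logs (cluster id is a placeholder)
run aws emr wait cluster-terminated --cluster-id j-XXXXXXXX
```

The `run` wrapper is only there to make the sketch inert; the real script invokes aws-cli directly.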

To run it on a small subset (expect ~$3[2] in AWS charges), use:

./testrun.sh 10 4 # process first 10 files, using 4 workers

To run it in full (expect ~$18[2] in AWS charges), use:

./testrun.sh ALL <number-of-workers> # eg 60 workers
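For a rough sense of where these figures come from, here is a back-of-envelope cost model. The $0.15 blended per-instance-hour rate (EC2 plus EMR surcharge) is a hypothetical figure chosen only to illustrate the arithmetic; real prices vary by instance type and region.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rough cost: every worker is billed for the full wall-clock duration.
estimate_cost() {  # estimate_cost <workers> <hours> <rate-per-instance-hour>
  awk -v w="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", w * h * r }'
}

# Full run: 60 workers for ~120 minutes
estimate_cost 60 2 0.15    # -> 18.00
```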

The full EMR run will take about 120 minutes with 60 workers[1]. As one would expect, it follows the distribution below:

[distribution chart]

Input

Same input as the parent repo, except it is first uploaded to an S3 bucket: s3://<bucket>/input/

Output

Same output as the parent repo, except it is made available on the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files
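To retrieve and merge the shards, something like the following works. The aws-cli download is shown commented out; the merge step is demonstrated on two locally created stand-in shards, since gzip streams can simply be concatenated.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Real download (bucket name elided, as in the README):
# aws s3 cp "s3://<bucket>/output/" ./output/ --recursive

# Simulate two Spark output shards for illustration:
mkdir -p output
printf 'line-a\n' | gzip > output/part-00000.gz
printf 'line-b\n' | gzip > output/part-00001.gz

# Spark writes one part-NNNNN.gz file per partition; decompressing them
# in order yields the full output as a single stream.
gunzip -c output/part-*.gz > merged.txt
cat merged.txt
```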

Limitations

Notable limitations are:

  • Only available for Scala 2.12 because:
    • sbt-assembly does not seem to be available for 2.13
    • Spark support for 2.13 is still immature
  • The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment

See list of spark-related tasks for more limitations.

Footnotes

  • [1] Add ~1h to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
  • [2] The cost estimates provided are not guaranteed in any way; run at your own risk (but please let me know if yours differ significantly)

Contact

You may contact the author at [email protected]