Parquet Java (formerly Parquet MR)

This repository contains a Java implementation of Apache Parquet.

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

The parquet-format repository contains the file format specification.

Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures. You can find additional details about the format and intended use cases in our Hadoop Summit 2013 presentation.
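As a rough illustration of record shredding, the sketch below flattens a single top-level repeated field into (repetition level, definition level, value) triples in the style of the Dremel paper, then reassembles the records. The class and method names are hypothetical and this is not Parquet-Java's column writer; it only covers one `repeated int32` field, for which both levels are at most 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shreds records of the schema
//   message Doc { repeated int32 values; }
// into (repetition level, definition level, value) triples.
class ShreddingSketch {

  // One entry of the shredded column; value is null when the list was empty.
  static final class Triple {
    final int repetitionLevel;
    final int definitionLevel;
    final Integer value;

    Triple(int repetitionLevel, int definitionLevel, Integer value) {
      this.repetitionLevel = repetitionLevel;
      this.definitionLevel = definitionLevel;
      this.value = value;
    }
  }

  static List<Triple> shred(List<List<Integer>> records) {
    List<Triple> column = new ArrayList<>();
    for (List<Integer> values : records) {
      if (values.isEmpty()) {
        // Empty list: a placeholder marks the record boundary (r = 0)
        // and records that the field is not defined (d = 0).
        column.add(new Triple(0, 0, null));
        continue;
      }
      for (int i = 0; i < values.size(); i++) {
        // r = 0 starts a new record, r = 1 continues the current list.
        int repetitionLevel = (i == 0) ? 0 : 1;
        column.add(new Triple(repetitionLevel, 1, values.get(i)));
      }
    }
    return column;
  }

  // Rebuilds the original records from the triples.
  static List<List<Integer>> assemble(List<Triple> column) {
    List<List<Integer>> records = new ArrayList<>();
    for (Triple t : column) {
      if (t.repetitionLevel == 0) {
        records.add(new ArrayList<>());
      }
      if (t.definitionLevel == 1) {
        records.get(records.size() - 1).add(t.value);
      }
    }
    return records;
  }
}
```

Because the levels alone carry the record boundaries and the empty lists, the values can be stored contiguously in a column and the nesting is still recoverable.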

Building

Parquet-Java is built with Maven and depends on the Thrift compiler (protoc is now managed by a Maven plugin).

Install Thrift

To build and install the Thrift compiler, run:

wget -nv https://archive.apache.org/dist/thrift/0.23.0/thrift-0.23.0.tar.gz
tar xzf thrift-0.23.0.tar.gz
cd thrift-0.23.0
chmod +x ./configure
./configure --disable-libs
sudo make install -j

Note: if you wish to verify the signature and checksum of a release:

The GPG signature and SHA checksum files can be found under https://archive.apache.org/dist/thrift/0.23.0/
Validate the signature of the artifact against the Thrift committer KEYS.

If you're on macOS and use Homebrew, you can instead install Thrift 0.23.0 with brew and ensure that it comes first in your PATH:

brew install thrift
export PATH="/usr/local/opt/thrift@0.23.0/bin:$PATH"

Build Parquet with Maven

Once Thrift is available in your PATH (protoc is handled by the Maven plugin), you can build the project by running:

LC_ALL=C ./mvnw clean install

Features

Parquet is an active project, and new features are being added quickly. Here are a few features:

Type-specific encoding
Hive integration (deprecated)
Pig integration (deprecated)
Cascading integration (deprecated)
Crunch integration
Apache Arrow integration
Scrooge integration (deprecated)
Impala integration (non-nested)
Java Map/Reduce API
Native Avro support
Native Thrift support
Native Protocol Buffers support
Complex structure support
Run-length encoding (RLE)
Bit Packing
Adaptive dictionary encoding
Predicate pushdown
Column stats
Delta encoding
Index pages
Scala DSL (deprecated)
Java Vector API support (experimental)
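To make one of the listed encodings concrete, here is a minimal run-length encoding sketch. It is illustrative only: Parquet's on-disk encoding is a hybrid of RLE and bit packing defined in the parquet-format specification, and the `RleSketch` class below is hypothetical, not part of Parquet-Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of plain run-length encoding (not Parquet's wire format).
class RleSketch {

  // Encodes values as (runLength, value) pairs, flattened into one list.
  static List<Integer> encode(int[] values) {
    List<Integer> pairs = new ArrayList<>();
    int i = 0;
    while (i < values.length) {
      int runStart = i;
      // Advance while the value repeats.
      while (i < values.length && values[i] == values[runStart]) {
        i++;
      }
      pairs.add(i - runStart);
      pairs.add(values[runStart]);
    }
    return pairs;
  }

  // Expands (runLength, value) pairs back into the original array.
  static int[] decode(List<Integer> pairs) {
    int total = 0;
    for (int p = 0; p < pairs.size(); p += 2) {
      total += pairs.get(p);
    }
    int[] values = new int[total];
    int pos = 0;
    for (int p = 0; p < pairs.size(); p += 2) {
      int runLength = pairs.get(p);
      Arrays.fill(values, pos, pos + runLength, pairs.get(p + 1));
      pos += runLength;
    }
    return values;
  }
}
```

Columns with long runs of repeated values, such as repetition and definition levels or dictionary indices, are where this style of encoding pays off.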

Java Vector API support

This feature is experimental and is currently not part of the Parquet distribution.

Parquet-Java supports the Java Vector API to speed up reading. To enable this feature:

Use Java 17+, 64-bit.
Use a CPU that supports the following instruction sets:
- avx512vbmi
- avx512_vbmi2
Build the jars: ./mvnw clean package -P vector-plugins
To enable the feature in Apache Spark:
- Build Parquet and replace parquet-encoding-{VERSION}.jar in the Spark jars folder.
- Build parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to the Spark jars folder.
- Edit the Spark class VectorizedRleValuesReader, method readNextGroup, following the Parquet class ParquetReadRouter, method readBatchUsing512Vector.
- Build Spark with Maven and replace spark-sql_2.12-{VERSION}.jar in the Spark jars folder.
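Before building with the vector-plugins profile, it can help to confirm the Java and CPU prerequisites. The following is a minimal sketch, assuming a Linux host that lists CPU flags in /proc/cpuinfo using the same spellings as above; the `VectorPreflight` class and its methods are hypothetical helpers, not part of Parquet-Java, and the 64-bit requirement is not checked.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight check for the Java Vector API prerequisites.
class VectorPreflight {

  // Instruction sets named in this README.
  static final List<String> REQUIRED_FLAGS = List.of("avx512vbmi", "avx512_vbmi2");

  // True when the running JVM reports feature version 17 or newer.
  static boolean javaVersionOk() {
    return Runtime.version().feature() >= 17;
  }

  // Checks one "flags : ..." line for all required instruction sets.
  static boolean hasRequiredFlags(String flagsLine) {
    int colon = flagsLine.indexOf(':');
    String flags = colon >= 0 ? flagsLine.substring(colon + 1) : flagsLine;
    Set<String> present = new HashSet<>(Arrays.asList(flags.trim().split("\\s+")));
    return present.containsAll(REQUIRED_FLAGS);
  }

  // Assumption: Linux only. Returns false when /proc/cpuinfo is unavailable.
  static boolean cpuOk() throws IOException {
    Path cpuinfo = Path.of("/proc/cpuinfo");
    if (!Files.isReadable(cpuinfo)) {
      return false;
    }
    for (String line : Files.readAllLines(cpuinfo)) {
      if (line.startsWith("flags")) {
        return hasRequiredFlags(line);
      }
    }
    return false;
  }
}
```

Calling `VectorPreflight.javaVersionOk()` and `VectorPreflight.cpuOk()` from a small main method gives a quick yes/no answer before spending time on the Spark rebuild.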

Map/Reduce integration

Parquet provides Input and Output formats for Map/Reduce. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which converts your objects to and from a Parquet schema.

We've implemented this for three popular data formats to provide a clean migration path as well:

Thrift

Thrift integration is provided by the parquet-thrift sub-project.

Avro

Avro conversion is implemented via the parquet-avro sub-project.

Protobuf

Protobuf conversion is implemented via the parquet-protobuf sub-project.

Create your own objects

The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer.
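The event-based model can be pictured with the sketch below. It is not the real API: `EventConsumer` is a hypothetical, cut-down stand-in for Parquet's RecordConsumer that keeps only a few similarly named callbacks, and `Point` is an invented example type. A real WriteSupport also declares a schema and receives its consumer from the framework.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of writing an object as a stream of events.
class EventModelSketch {

  // Cut-down stand-in for an event-based record consumer (assumption, not the real interface).
  interface EventConsumer {
    void startMessage();
    void startField(String name, int index);
    void addInteger(int value);
    void endField(String name, int index);
    void endMessage();
  }

  // Invented example object to be written.
  static final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
      this.x = x;
      this.y = y;
    }
  }

  // Plays the role of a write support: turns one object into events.
  static void write(Point p, EventConsumer consumer) {
    consumer.startMessage();
    consumer.startField("x", 0);
    consumer.addInteger(p.x);
    consumer.endField("x", 0);
    consumer.startField("y", 1);
    consumer.addInteger(p.y);
    consumer.endField("y", 1);
    consumer.endMessage();
  }

  // Records each event as text so the sequence is easy to inspect.
  static final class LoggingConsumer implements EventConsumer {
    final List<String> events = new ArrayList<>();

    public void startMessage() { events.add("startMessage"); }
    public void startField(String name, int index) { events.add("startField " + name + "@" + index); }
    public void addInteger(int value) { events.add("addInteger " + value); }
    public void endField(String name, int index) { events.add("endField " + name + "@" + index); }
    public void endMessage() { events.add("endMessage"); }
  }
}
```

The reading side mirrors this: a materializer receives the same kind of events and builds the object back up from them.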

See the APIs:

Hive integration

Hive integration is now deprecated within the Parquet project. It is now maintained by Apache Hive.

Build

To run the unit tests: ./mvnw test

To build the jars: ./mvnw package

The build runs in GitHub Actions.

Add Parquet as a dependency in Maven

The current release is version 1.17.0.

  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-common</artifactId>
      <version>1.17.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-encoding</artifactId>
      <version>1.17.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.17.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.17.0</version>
    </dependency>
  </dependencies>

How To Contribute

We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the parquet-java Git repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-java.git.

If you are looking for ideas on what to contribute, check out the GitHub issues labeled Good first issue. Comment on the issue and/or contact dev@parquet.apache.org with your questions and ideas.

If you’d like to report a bug but don’t have time to fix it, you can still raise an issue on GitHub, or email the mailing list dev@parquet.apache.org.

To contribute a patch:

Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
Create an issue for your patch in GitHub issues.
Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the issue number (e.g., #3260).
Make sure that your code passes the unit tests. You can run the tests with ./mvnw test in the root directory.
Add new unit tests for your code.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

Use 2 spaces for whitespace. Not tabs, not 4 spaces. The number of the spacing shall be 2.
Give your operators some room. Not a+b but a + b and not foo(int a,int b) but foo(int a, int b).
Generally speaking, stick to the Sun Java Code Conventions.
Make sure tests pass!
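As a small illustration of the first two points, the snippet below uses 2-space indentation, spaces around binary operators, and a space after each comma in parameter lists. `StyleExample` is an invented class for this purpose, not code from the repository.

```java
// Hypothetical example of the formatting described above.
class StyleExample {

  // Space after the comma in the parameter list, spaces around "+".
  static int sum(int a, int b) {
    return a + b;
  }

  // Every nesting level is indented by exactly 2 spaces.
  static int sumAll(int[] values) {
    int total = 0;
    for (int value : values) {
      total = sum(total, value);
    }
    return total;
  }
}
```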

Thank you for getting involved!

Authors and contributors

Code of Conduct

We hold ourselves and the Parquet developer community to two codes of conduct:

Discussions

Mailing list: dev@parquet.apache.org
GitHub issues: Issues
Discussions also take place in GitHub pull requests

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
