[reading review] An Empirical Evaluation of Columnar Storage Formats

In this paper, the authors revisit the most popular columnar storage formats (Parquet and ORC). They evaluate the two formats on encoding, compression, indexing and filtering, decoding, and so on. The authors also design a benchmark to fully explore the performance of both formats. Looking closely at the experimental results, we can draw some useful ideas on how to design a new columnar format.

  • Strengths: When designing columnar storage formats, we should keep in mind that adopting a simple encoding scheme is important for decoding speed.

  • Future work: New columnar formats should give more thought to using GPUs for acceleration and to exploiting the high-bandwidth, high-latency cloud environment.
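
To make the point about simple encodings concrete, here is a minimal sketch, assuming dictionary encoding and run-length encoding as two representative lightweight schemes. It is for illustration only: it is not the Parquet or ORC implementation, and the function names are hypothetical. It shows that decoding such schemes comes down to a table lookup or a repeat loop.

```python
from typing import Dict, List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer code into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dict_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Decoding is a single lookup per value."""
    return [dictionary[c] for c in codes]


def rle_encode(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Decoding expands each run back into repeated values."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


# Hypothetical column used only for this example.
column = ["us", "us", "us", "de", "de", "us"]
dictionary, codes = dict_encode(column)
runs = rle_encode(codes)
restored = dict_decode(dictionary, rle_decode(runs))
```

Both decoders are short loops with no data-dependent branching beyond the run length, which is the kind of simplicity the bullet above argues for.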
