This post introduces how the partitioning feature is implemented in JethroData. In a nutshell, we added a simple, automatic range partitioning mechanism that is very easy to work with.
Why Use Partitioning?
Generally, there are two reasons to partition a large table:
Ease of maintenance - partitioning makes it possible to implement a data retention policy and to purge old data efficiently once it is no longer needed (rolling window). It also allows removing part of the data if invalid data was accidentally loaded (for example, removing a specific day).
Performance and Scalability - regular parallel databases (like Impala or Hive on Tez in the Hadoop space) execute queries by doing a full scan of the local data on all nodes, in parallel. Partitioning lets each node skip the partitions a query cannot match and scan less data (partition pruning), improving performance. It also improves scalability - if a report accesses only one month of data, it will have the same performance even if we add many more months of data to the table.
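To make the two reasons above concrete, here is a minimal Python sketch of partition pruning and rolling-window purging over monthly range partitions. It is only an illustration of the general idea, not JethroData's implementation: the function names (month_partitions, prune, purge_before) and the monthly layout are assumptions made for this example.

```python
from datetime import date

def month_partitions(year, months):
    """Build hypothetical monthly range partitions as (start, end) pairs, end exclusive."""
    parts = []
    for m in months:
        start = date(year, m, 1)
        end = date(year + 1, 1, 1) if m == 12 else date(year, m + 1, 1)
        parts.append((start, end))
    return parts

def prune(parts, query_start, query_end):
    """Partition pruning: keep only partitions overlapping [query_start, query_end)."""
    return [(s, e) for (s, e) in parts if s < query_end and e > query_start]

def purge_before(parts, cutoff):
    """Rolling window: drop every partition that ends on or before the cutoff."""
    return [(s, e) for (s, e) in parts if e > cutoff]

parts = month_partitions(2015, range(1, 13))

# A report on March only needs to scan one of the twelve partitions.
march = prune(parts, date(2015, 3, 1), date(2015, 4, 1))
print(len(parts), len(march))

# Retention policy: keep only data from July onward by dropping whole partitions.
kept = purge_before(parts, date(2015, 7, 1))
print(len(kept))
```

Note that the March query touches a single partition no matter how many months the table holds, which is the scalability argument above, and that purging drops whole partitions instead of deleting individual rows.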
Partitioning in JethroData