Making Interactive BI for Big Data a Reality

JethroData Blog

What's new in JethroData 1.0

Posted by Ofir Manor on Apr 7, 2015 11:13:00 AM

As we announced earlier today, JethroData 1.0 was just released. Since launching the public beta six months ago, we have added numerous improvements and bug fixes across the board. I would like to share some highlights:


JethroData is Now in GA

Posted by Eli Singer on Apr 7, 2015 5:00:00 AM

I'm thrilled to announce that JethroData, the fastest SQL-on-Hadoop solution in the market, is now in GA. After more than two years in extensive development, we are proud to release this ready-for-prime-time, stable and highly performant version of the product.


Topics: Blog, Announcements, in the news, Product

Being "Creative" with TPC-DS Benchmark - Dynamic Partition Pruning

Posted by Ofir Manor on Nov 10, 2014 9:01:00 AM

In this post, I would like to present a common optimization challenge, how it is solved in JethroData, and how some other SQL-on-Hadoop products "overcame" that challenge by manually modifying their benchmark scripts and queries to avoid the situation (which was quite a surprise for us when we figured it out while running our own benchmarks).


Topics: Blog

Partitioning in Hive and Impala Versus JethroData (and some TPC-DS gossip)

Posted by Ofir Manor on Oct 30, 2014 5:40:00 PM

In my previous post, I explained how partitioning works in JethroData. In this post, I would like to explain how partitioning was implemented in Hive and Impala, why their design is very problematic, and how our implementation avoids those problems. Design matters!


Topics: Blog

Simple, Automatic Range Partitioning in JethroData

Posted by Ofir Manor on Oct 23, 2014 7:39:00 AM

This post will introduce how the partitioning feature is implemented in JethroData. In a nutshell, we added a simple, automatic range partitioning mechanism that is very easy to work with.

Why Use Partitioning?

Generally, there are two reasons to partition a large table:

Ease of maintenance - partitioning allows implementing a data retention policy and enables efficient purging of old data when it is no longer needed (rolling window). Also, it allows removing part of the data if invalid data was accidentally loaded (for example, removing a specific day).

Performance and Scalability - regular parallel databases (like Impala or Hive on Tez in the Hadoop space) execute queries by doing a full scan of the local data in all nodes, in parallel. Partitioning allows each node to scan less data (partition pruning), improving performance. Also, it improves scalability - if a report accesses only one month of data, it will have the same performance even if we add many more months of data to the table.
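
To make the two reasons above concrete, here is a minimal Python sketch of range partition pruning and rolling-window purging. It is only an illustration under assumed names: the monthly partition layout, the row counts, and the helpers `month_partitions`, `prune`, `rows_scanned` and `purge_before` are all hypothetical and do not represent how JethroData, Hive or Impala implement partitioning.

```python
from datetime import date

# Hypothetical model: each partition covers a half-open date range [start, end)
# and records how many rows it holds. Not any real product's storage format.
def month_partitions(year, months, rows_per_month=1_000_000):
    parts = []
    for m in months:
        start = date(year, m, 1)
        end = date(year + 1, 1, 1) if m == 12 else date(year, m + 1, 1)
        parts.append({"start": start, "end": end, "rows": rows_per_month})
    return parts

def prune(partitions, query_start, query_end):
    """Partition pruning: keep only partitions overlapping [query_start, query_end)."""
    return [p for p in partitions
            if p["start"] < query_end and query_start < p["end"]]

def rows_scanned(partitions, query_start, query_end):
    """Rows a query must read once irrelevant partitions are skipped."""
    return sum(p["rows"] for p in prune(partitions, query_start, query_end))

def purge_before(partitions, cutoff):
    """Rolling window: drop whole partitions that end on or before the cutoff."""
    return [p for p in partitions if p["end"] > cutoff]

# A report on February reads the same amount of data whether the table
# holds three months or a full year.
small = month_partitions(2014, range(1, 4))
large = month_partitions(2014, range(1, 13))
february = (date(2014, 2, 1), date(2014, 3, 1))
print(rows_scanned(small, *february), rows_scanned(large, *february))
```

In this sketch, purging old data is just dropping whole partitions from the list, which is why a retention policy is cheap on a partitioned table: no individual rows have to be located and deleted.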

Partitioning in JethroData

Topics: Blog