Vertica

Vertica Journal : Parquet and External tables

This is a continuation to what we Vertica External table topic mentioned in previous post here.

This post will show you on how vertica interprets parquet file partitions and how we can best use it to our use cases.

For more details on how to load a parquet file to Vertica table please refer to this blogpost.

Before digging ahead, you might want to have a look at partitioning concept of Parquet files here. This seems to exactly the information you want to know before going ahead.

Before researching on this, I was under the assumption that parquet partitions are beneficial for vertica external tables. But my understanding was incorrect, when I found the below results.
I tried a few scenarios mentioned below and was astonished to see the difference myself.

Setting up Env

Created 2 sample parquet files with sample dataset (2.7 Million records) with exact same data, but 1 being partitioned and other being non-partitioned.

Later, created 2 external tables on vertica mapping to these location of parquet files 


DROP TABLE IF EXISTS  TEST_PERFORMANCE_PARQUET_PARTITIONED;
CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_PARTITIONED
(
    
    THREATRE_ID varchar(512),
    MOVIE_NAME  varchar(512),
    SHOW_SLOT varchar(512),
    TICKETS_SOLD Integer, 
    GROSS_INCOME double precision,
    TICKET_DATE varchar(20),
    CITY_NAME varchar(512)
)
 AS 
COPY FROM '/path_to_parquet/*/*/*.parquet' PARQUET (hive_partition_cols='TICKET_DATE, CITY_NAME');

DROP TABLE IF EXISTS TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED;
CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
( 
    TICKET_DATE varchar(20),
    CITY_NAME varchar(512) ,
    THREATRE_ID varchar(512),
    MOVIE_NAME  varchar(512),
    SHOW_SLOT varchar(512),
    TICKETS_SOLD Integer, 
    GROSS_INCOME  double precision
    
)
 AS 
COPY FROM '/path_to_parquet/*.parquet' PARQUET ;

Once the above was done, I tried different queries on both the tables as below..
Count Aggregation

select count(*) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
-- Time Taken 0.82 seconds
select count(*) from TEST_PERFORMANCE_PARQUET_PARTITIONED
-- Time Taken 1.34 seconds

Aggregation on partitioned columns

select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
where TICKET_DATE = '7-Sep-2019'
group by 1,2
--Time Taken 0.6 seconds
select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_PARTITIONED
where TICKET_DATE = '7-Sep-2019'
group by 1, 2
-- Time taken 1.6 seconds

Explain Plans
Seems Vertica explain plan was not really verbose here, and cost and complete plan was exactly identical here, this means. Vertica engine does not try to interpret the partitions of Parquet files here (or does not show that in the plan). Seems the partition pruning feature of vertica does not respect the partitions of parquet files.
Somehow, vertica engine takes more time to traverse through the partition of parquet files.

So, I am assuming, when I am trying to add queries on partitioned parquet files, the engine is trying to traverse every partition folder and trying to match the where predicate. this hierarchical partition of Parquet files, might be confusing vertica engine and it maybe a reason why partitioned parquet files are not really great for vertica .
Below are the scenarios tested by me on the same lines and the performance for the same.

Performance of Vertica with Parquet files in different scenarios

Partitioned and huge dataPartitioned and Small dataNot partitioned and huge dataNot partitioned and small data
Queries with where Predicate on partitioned clauseperforms fasterperforms slowerperforms lot faster (10x-20x) performs lot faster
Queries without where predicate and without aggregationperforms slowerperforms fasterperforms lot faster (10x-20x) performs lot faster
Queries with distinct on partition columnsperforms faster performs slower performs lot faster (10x-20x) performs lot faster
Queries with aggregation on partitioned columns performs slower performs faster performs lot faster (10x-20x) performs lot faster

Conclusion, the 4th Column, “Not Partitioned and Huge data” is already a winner in every scenario.
This was an eye opener for me, hopefully, the Vertica team looks into this further and asserts on these observations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s