This is a continuation to what we Vertica External table topic mentioned in previous post here.
This post will show you on how vertica interprets parquet file partitions and how we can best use it to our use cases.
For more details on how to load a parquet file to Vertica table please refer to this blogpost.
Before digging ahead, you might want to have a look at partitioning concept of Parquet files here. This seems to exactly the information you want to know before going ahead.
Before researching on this, I was under the assumption that parquet partitions are beneficial for vertica external tables. But my understanding was incorrect, when I found the below results.
I tried a few scenarios mentioned below and was astonished to see the difference myself.
Setting up Env
Created 2 sample parquet files with sample dataset (2.7 Million records) with exact same data, but 1 being partitioned and other being non-partitioned.
Later, created 2 external tables on vertica mapping to these location of parquet files
DROP TABLE IF EXISTS TEST_PERFORMANCE_PARQUET_PARTITIONED; CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_PARTITIONED ( THREATRE_ID varchar(512), MOVIE_NAME varchar(512), SHOW_SLOT varchar(512), TICKETS_SOLD Integer, GROSS_INCOME double precision, TICKET_DATE varchar(20), CITY_NAME varchar(512) ) AS COPY FROM '/path_to_parquet/*/*/*.parquet' PARQUET (hive_partition_cols='TICKET_DATE, CITY_NAME'); DROP TABLE IF EXISTS TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED; CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED ( TICKET_DATE varchar(20), CITY_NAME varchar(512) , THREATRE_ID varchar(512), MOVIE_NAME varchar(512), SHOW_SLOT varchar(512), TICKETS_SOLD Integer, GROSS_INCOME double precision ) AS COPY FROM '/path_to_parquet/*.parquet' PARQUET ;
Once the above was done, I tried different queries on both the tables as below..
select count(*) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED -- Time Taken 0.82 seconds select count(*) from TEST_PERFORMANCE_PARQUET_PARTITIONED -- Time Taken 1.34 seconds
Aggregation on partitioned columns
select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED where TICKET_DATE = '7-Sep-2019' group by 1,2 --Time Taken 0.6 seconds select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_PARTITIONED where TICKET_DATE = '7-Sep-2019' group by 1, 2 -- Time taken 1.6 seconds
Seems Vertica explain plan was not really verbose here, and cost and complete plan was exactly identical here, this means. Vertica engine does not try to interpret the partitions of Parquet files here (or does not show that in the plan). Seems the partition pruning feature of vertica does not respect the partitions of parquet files.
Somehow, vertica engine takes more time to traverse through the partition of parquet files.
So, I am assuming, when I am trying to add queries on partitioned parquet files, the engine is trying to traverse every partition folder and trying to match the where predicate. this hierarchical partition of Parquet files, might be confusing vertica engine and it maybe a reason why partitioned parquet files are not really great for vertica .
Below are the scenarios tested by me on the same lines and the performance for the same.
Performance of Vertica with Parquet files in different scenarios
|Partitioned and huge data||Partitioned and Small data||Not partitioned and huge data||Not partitioned and small data|
|Queries with where Predicate on partitioned clause||performs faster||performs slower||performs lot faster (10x-20x)||performs lot faster|
|Queries without where predicate and without aggregation||performs slower||performs faster||performs lot faster (10x-20x)||performs lot faster|
|Queries with distinct on partition columns||performs faster||performs slower||performs lot faster (10x-20x)||performs lot faster|
|Queries with aggregation on partitioned columns||performs slower||performs faster||performs lot faster (10x-20x)||performs lot faster|
Conclusion, the 4th Column, “Not Partitioned and Huge data” is already a winner in every scenario.
This was an eye opener for me, hopefully, the Vertica team looks into this further and asserts on these observations.