Vertica

Vertica Journal : Parquet and External tables

This is a continuation of the Vertica external table topic mentioned in the previous post here.

This post will show you how Vertica interprets Parquet file partitions and how we can best use them for our use cases.

For more details on how to load a Parquet file into a Vertica table, please refer to this blog post.

Before digging in, you might want to have a look at the partitioning concept of Parquet files here. It covers exactly the information you will want to know before going ahead.

Before researching this, I was under the assumption that Parquet partitions are beneficial for Vertica external tables. That understanding turned out to be incorrect when I found the results below.
I tried a few scenarios, described below, and was astonished to see the difference myself.

Setting up the environment

I created two sample Parquet datasets from the exact same data (2.7 million records), one partitioned and the other non-partitioned.
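
For reference, the partitioned dataset uses the standard Hive-style key=value directory layout, one folder level per partition column, which is what the '/path_to_parquet/*/*/*.parquet' wildcard in the partitioned table definition below expects. The dates, city names and file names shown here are purely illustrative:

## Partitioned layout -- one directory level per column listed in hive_partition_cols
/path_to_parquet/TICKET_DATE=7-Sep-2019/CITY_NAME=Mumbai/part-00000.parquet
/path_to_parquet/TICKET_DATE=7-Sep-2019/CITY_NAME=Pune/part-00000.parquet
/path_to_parquet/TICKET_DATE=8-Sep-2019/CITY_NAME=Mumbai/part-00000.parquet
## Non-partitioned layout -- plain Parquet files in a single directory
/path_to_parquet/part-00000.parquet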

Later, I created two external tables in Vertica mapped to the locations of these Parquet files.


DROP TABLE IF EXISTS  TEST_PERFORMANCE_PARQUET_PARTITIONED;
CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_PARTITIONED
(
    
    THREATRE_ID varchar(512),
    MOVIE_NAME  varchar(512),
    SHOW_SLOT varchar(512),
    TICKETS_SOLD Integer, 
    GROSS_INCOME double precision,
    TICKET_DATE varchar(20),
    CITY_NAME varchar(512)
)
 AS 
COPY FROM '/path_to_parquet/*/*/*.parquet' PARQUET (hive_partition_cols='TICKET_DATE, CITY_NAME');

DROP TABLE IF EXISTS TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED;
CREATE EXTERNAL TABLE TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
( 
    TICKET_DATE varchar(20),
    CITY_NAME varchar(512) ,
    THREATRE_ID varchar(512),
    MOVIE_NAME  varchar(512),
    SHOW_SLOT varchar(512),
    TICKETS_SOLD Integer, 
    GROSS_INCOME  double precision
    
)
 AS 
COPY FROM '/path_to_parquet/*.parquet' PARQUET ;

Once the above was done, I tried different queries on both tables, as shown below.
Count Aggregation

select count(*) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
-- Time Taken 0.82 seconds
select count(*) from TEST_PERFORMANCE_PARQUET_PARTITIONED
-- Time Taken 1.34 seconds

Aggregation on partitioned columns

select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
where TICKET_DATE = '7-Sep-2019'
group by 1,2
--Time Taken 0.6 seconds
select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_PARTITIONED
where TICKET_DATE = '7-Sep-2019'
group by 1, 2
-- Time taken 1.6 seconds

Explain Plans
The Vertica explain plan was not very verbose here, and the cost and the complete plan were exactly identical for both tables. This means the Vertica engine does not try to interpret the partitions of the Parquet files (or at least does not show that in the plan). The partition pruning feature of Vertica does not seem to respect the partitions of Parquet files.
Somehow, the Vertica engine takes more time to traverse the partitions of the Parquet files.
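
If you want to reproduce the comparison yourself, prefix the same aggregation queries with EXPLAIN and compare the output (plans omitted here, since in my runs both came out identical):

EXPLAIN select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_PARTITIONED
where TICKET_DATE = '7-Sep-2019' group by 1, 2;

EXPLAIN select TICKET_DATE, CITY_NAME, sum(GROSS_INCOME) from TEST_PERFORMANCE_PARQUET_NOT_PARTITIONED
where TICKET_DATE = '7-Sep-2019' group by 1, 2;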

So I am assuming that when I run queries against the partitioned Parquet files, the engine traverses every partition folder and tries to match the WHERE predicate. This hierarchical partitioning of Parquet files might be confusing the Vertica engine, and it may be the reason why partitioned Parquet files are not really great for Vertica.
Below are the scenarios I tested along the same lines, and how each one performed.

Performance of Vertica with Parquet files in different scenarios

 Scenario                                                | Partitioned and huge data | Partitioned and small data | Not partitioned and huge data | Not partitioned and small data
---------------------------------------------------------+---------------------------+----------------------------+-------------------------------+-------------------------------
 Queries with WHERE predicate on partition columns       | performs faster           | performs slower            | performs lot faster (10x-20x) | performs lot faster
 Queries without WHERE predicate and without aggregation | performs slower           | performs faster            | performs lot faster (10x-20x) | performs lot faster
 Queries with DISTINCT on partition columns              | performs faster           | performs slower            | performs lot faster (10x-20x) | performs lot faster
 Queries with aggregation on partition columns           | performs slower           | performs faster            | performs lot faster (10x-20x) | performs lot faster

In conclusion, the "Not partitioned and huge data" column is already the winner in every scenario.
This was an eye opener for me; hopefully the Vertica team looks into this further and validates these observations.

Vertica

How to build a calendar using Vertica SQL (bit 5)

This post is a part of the Bit Series blog.

Every data warehouse project needs a calendar table as a dimension, one that hosts every date in the range of interest. This table can be used as a dimension or, in many cases, to derive dates.

In this post, we will see how to generate a calendar table directly using Vertica SQL, using a special time series clause that is built into Vertica.

Vertica's time series functionality enables us to perform time-series analytics and gap filling. You can get more information about it here.

For now, we will stick to the use case of generating the calendar with this clause, as below.

CREATE TABLE
    calendar AS
SELECT
    slice_time                       AS actual_timestamp ,
    CAST(slice_time AS DATE)         AS DATE ,
    date_part('WEEK', slice_time)    AS week,
    date_part('YEAR', slice_time)    AS YEAR,
    date_part('MONTH', slice_time)   AS MONTH,
    date_part('DAY', slice_time)     AS DAY,
    date_part('QUARTER', slice_time) AS quarter
FROM
    (
        SELECT
            '2019-01-01 00:00:00' AS sample_date
        UNION
        SELECT
            '2019-12-31 00:00:00' AS sample_date ) a TIMESERIES slice_time AS '1 DAY' OVER(ORDER BY
    sample_date::VARCHAR::TIMESTAMP)

So, what are we doing here?
1. We take two sample dates between which we need to fill the gap. If you want the calendar to cover the year 2019, use the dates shown above in the subquery.
2. We use one Vertica clause here, TIMESERIES; slice_time is just an alias. TIMESERIES generates the timestamps based on the interval specified, in this case '1 DAY'.
3. The TIMESERIES clause always requires an ORDER BY on a timestamp column. Here we simply order by the sample dates cast to TIMESTAMP.
4. Once we have a daily timestamp, we use standard date functions to derive the various date properties of that timestamp, such as day, week, quarter, year and month.
5. Finally, we store the output of the SELECT statement in the CALENDAR table.

You can check the results in this table; it can easily be used as a dimension table for the calendar use case.
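
As a quick sanity check (a minimal sketch, nothing more), the table should contain one row per day of 2019:

-- Expect 365 rows covering 2019-01-01 through 2019-12-31
SELECT count(*), min(actual_timestamp), max(actual_timestamp) FROM calendar;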

Go ahead and try generating this data at your end; there are many situations where you can use it.

Vertica

How to load filename while loading file in Vertica (bit 4)

This post is a part of the Bit Series blog.

This is going to be a quick trick on how to load the file name into a DB table while loading data from a file into Vertica using the COPY command.

Many people have this requirement: they need to record the file name as the source of the data when loading from a file, as part of their ETL operations.

Vertica's COPY command is a mini ETL tool in itself, if you know all the tricks for using it. It's like a little Pandora's box, holding many secrets of its own, ready to be explored.

So how do you load a file name while loading data from that file into Vertica?

This can be done using the function current_load_source(). This function can only be used within a COPY command when loading data from a file, and it lets you populate a column meant for storing the file names.

Please refer to the example below.

## Showing the content of the file
[navin~]$ cat test_file_name.data
1
2
3
4
5
## Getting into Vertica console
[navin~]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.

Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit

-- Creating a table for storing the incoming data.
navin=>  create table test_file_name_table (i int, filename varchar);
CREATE TABLE

-- Load the data from file using the function current_load_source()
navin=> copy test_file_name_table (i, filename as current_load_source()) from '/home/navin/test_file_name.data';
 Rows Loaded
-------------
           5
(1 row)

-- Validate the data in the file.
navin=> select * from test_file_name_table;
 i |             filename
---+-----------------------------------
 2 | /home/navin/test_file_name.data
 3 | /home/navin/test_file_name.data
 5 | /home/navin/test_file_name.data
 1 | /home/navin/test_file_name.data
 4 | /home/navin/test_file_name.data
(5 rows)

navin=>

Not only that, the function also supports loading data from multiple files in a single COPY command.

-- Truncating the table first
navin=> truncate table test_file_name_table;
TRUNCATE TABLE

-- Loading data from multiple files here using same function current_load_source()
navin=> copy test_file_name_table (i, filename as current_load_source()) from '/home/navin/test_file_name*.data';
 Rows Loaded
-------------
          10
(1 row)

-- Validating the records
navin=> select * from test_file_name_table;
 i  |              filename
----+------------------------------------
  2 | /home/navin/test_file_name.data
  3 | /home/navin/test_file_name.data
  5 | /home/navin/test_file_name.data
 67 | /home/navin/test_file_name2.data
  1 | /home/navin/test_file_name.data
  1 | /home/navin/test_file_name2.data
 32 | /home/navin/test_file_name2.data
  4 | /home/navin/test_file_name.data
 45 | /home/navin/test_file_name2.data
 89 | /home/navin/test_file_name2.data
(10 rows)

So will this work even from Windows? Yes, it will, even from a Windows machine, as below.

-- This command is executed from vsql in windows console
navin=> copy test_file_name_table (i, filename as current_load_source()) from
local 'D:\Navin\txt\copy_from_windows.txt'
navin-> ;
 Rows Loaded
-------------
           5
(1 row)


-- Validating the records
navin=> select * from test_file_name_table;
  i   |              filename
------+------------------------------------
    2 | /home/navin/test_file_name.data
    3 | /home/navin/test_file_name.data
    5 | /home/navin/test_file_name.data
   67 | /home/navin/test_file_name2.data
    4 | /home/navin/test_file_name.data
   45 | /home/navin/test_file_name2.data
   89 | /home/navin/test_file_name2.data
  345 | D:\Navin\txt\copy_from_windows.txt
    1 | /home/navin/test_file_name.data
    1 | /home/navin/test_file_name2.data
   32 | /home/navin/test_file_name2.data
  123 | D:\Navin\txt\copy_from_windows.txt
  678 | D:\Navin\txt\copy_from_windows.txt
  123 | D:\Navin\txt\copy_from_windows.txt
 3456 | D:\Navin\txt\copy_from_windows.txt
(15 rows)

This concludes the post on how to use the current_load_source() function to load the file name while loading data into Vertica.

Go ahead and explore the different possibilities with this function.

Uncategorized, Vertica

How to know the disk space consumption of Vertica Cluster (bit 3)

This post is a part of the Bit Series blog.

Have you ever wondered how much data size or catalog size is dedicated to Vertica on a single host or on the complete cluster?

This information is going to help you get stats on the following:

  1. How much disk space is dedicated to the catalog and to the actual Vertica data
  2. How much space is used by the catalog and the data on each host
  3. Stats in terms of percentage

Use the query below as a superuser on Vertica to get the stats.



navin => select node_name, storage_usage, disk_space_used_mb, disk_space_free_mb, disk_space_free_percent  
	from disk_storage order by node_name, storage_usage;

   node_name   | storage_usage | disk_space_used_mb | disk_space_free_mb | disk_space_free_percent
---------------+---------------+--------------------+--------------------+-------------------------
 v_db_node0001 | CATALOG       |               4540 |              46138 | 91%
 v_db_node0001 | DATA,TEMP     |             354548 |             270111 | 43%
 v_db_node0002 | CATALOG       |               9595 |              41083 | 81%
 v_db_node0002 | DATA,TEMP     |             362546 |             262113 | 41%
 v_db_node0003 | CATALOG       |               6957 |              43721 | 86%
 v_db_node0003 | DATA,TEMP     |             355516 |             269143 | 43%
(6 rows)

Note: This post only covers how to get stats for the disk space utilization of Vertica data and catalog. Besides this, Vertica also stores the catalog in memory on every host, and that can lead to performance issues with regular DML queries and Tuple Mover (TM) operations.

Before Vertica 8.0, catalog issues were common: a global catalog lock was held on every node of the cluster to update the catalog at very frequent intervals, and while such a lock was held by the Vertica engine, all DML operations and Tuple Mover operations would be blocked. The main reason for this was that Vertica kept a centralized global catalog, and syncing it across all nodes took time internally.

Later this was fixed by keeping a copy of the global catalog on every node, which reduced the time taken to sync to the centralized catalog. The size of the global catalog was also reduced by trimming the projection-related metadata inside each copy.

For more information on catalog size in memory and a checklist for identifying whether your cluster has catalog performance issues, please refer to the blogs below from Vertica Engineering.

  1. https://www.vertica.com/blog/catalog-size-debugging/
  2. https://www.vertica.com/blog/calculate-the-catalog-size-in-memory-on-each-node-quick-tip/
Vertica

How to close all Vertica sessions by a user (bit 2)

This post is a part of the Bit Series blog, and it's going to be a very quick one.

Have you ever faced a scenario, as a Vertica DBA, where you had to close all sessions on a Vertica cluster for one specific user only?

The traditional way to do this is:

  1. Query the sessions table and find all open sessions for the specific user
  2. Take the session ids one by one and run select close_session('session_id') to close each of those sessions (see the sketch below)
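
For reference, the manual approach looks roughly like this (the session id shown is purely illustrative):

-- Step 1: list the open sessions for the user
SELECT session_id FROM sessions WHERE user_name = 'dbadmin';

-- Step 2: close each session individually
SELECT close_session('v_db_node0001-12345:0x1a2b');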

What if there were a smarter way of achieving this tedious task?

Yes, you can do all of this with a single command, as below.

## First we get all the sessions for the specific user and construct the close_session SQL command for each session, then we pipe the output of the first vsql command as input to the second vsql command.

vsql -U dbadmin -w 'xxxxxx' -At -c "select 'select close_session(''' || session_id || ''');' from sessions where user_name = 'dbadmin';" | vsql  -U dbadmin -w 'xxxxxx'

Next time you have this task at hand, do it with a single command using vsql.

Stay tuned for more such upcoming tips and tricks.

HDInsight, Vertica

Load data from HDInsight Cluster to Vertica (part 3)

If you have landed on this page looking for "Load data from HDInsight Cluster to Vertica", I recommend you first visit post 1 and post 2, which are part of this blog series.

This blog post is the third and final one in the "Load data from HDInsight Cluster to Vertica" series. It focuses on how to leverage Vertica's ability to read data directly from BLOB storage.

Continue reading
Vertica

How to assign a resource pool to Vertica user

Recently I came across a use case which seemed to me like a very easy task, but when I started working on it, I found it to be rather complex.

The only reason was the confusing documentation from the Vertica team.

So the question was: "How to assign a resource pool to any Vertica user?"

Requirement –

We have a user who always needs to use a specific resource pool.

Solution –

Create a user

Here we create a password-authenticated user and check the default resource pool assigned to it in the system tables. We will see that every user is assigned to the 'general' resource pool by default.

navin=> create user navin identified by 'xxxxxxx';
CREATE USER


navin=> show resource_pool
navin-> ;
     name      | setting
---------------+---------
 resource_pool | general
(1 row)

navin=> select user_name, resource_pool from users where user_name like 'navin';
-[ RECORD 1 ]-+--------
user_name     | navin
resource_pool | general

Grant usage permission to the user for this resource pool.

The user needs to have USAGE privileges on a specific resource pool before that pool can be assigned as the default pool for the user. This can be done using the command below.

navin=> GRANT USAGE ON RESOURCE POOL recovery to navin;
GRANT PRIVILEGE


Assign a default resource pool to the user

By default, if you don't assign a resource pool to the user, the 'general' resource pool is applied by Vertica.

If we need a user to use a specific pool, we should assign that resource pool to the user using the command below. Here we have assigned the resource pool "recovery" to this user.

navin=> ALTER USER navin RESOURCE POOL recovery;
ALTER USER
navin=> select user_name, resource_pool from users where user_name like 'navin';
 user_name | resource_pool
-----------+---------------
 navin     | recovery

--Validating if user can use this pool

navin=> set resource_pool=recovery;
SET
navin=> show resource_pool;
     name      | setting
---------------+----------
 resource_pool | recovery
(1 row)


These are the three steps which need to be followed in order to assign a resource pool to a user.

Note :

  1. We have assigned this test user to the recovery resource pool here just for example purposes. recovery is a built-in resource pool, and we should ideally avoid any usage of this pool by any Vertica user.
  2. We can grant USAGE permission on multiple user resource pools to a user, but only one resource pool can be set as the default for a user (see the sketch below).
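
As a small illustration of note 2 (the pool names pool_a and pool_b are hypothetical and would need to exist, for example created via CREATE RESOURCE POOL):

-- A user may hold USAGE on several pools...
GRANT USAGE ON RESOURCE POOL pool_a TO navin;
GRANT USAGE ON RESOURCE POOL pool_b TO navin;
-- ...but ALTER USER sets exactly one default pool
ALTER USER navin RESOURCE POOL pool_a;
-- The user can still switch pools within a session:
SET resource_pool = pool_b;
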
HDInsight, Vertica

Load data from HDInsight Cluster to Vertica (part 1)

With the ever-growing necessity to use big data stacks like Spark and the cloud, leveraging a Spark cluster together with Vertica has become very important.

We came across a very common use case where we had to transfer data from an HDInsight (Spark) cluster to a Vertica cluster.

I will take you through the easiest way of doing it in this blog series.

Continue reading

Uncategorized

Bit Series Introduction: The small useful information

Hey folks,

Blogging after a long time... almost 4 years... but it feels like a decade (a hypothetical statement).

This series is going to shed light on small pieces of information and knowledge that you can grasp very quickly and that will definitely help you in your day-to-day combat with your technical stack. It is not restricted to any specific technology; it can relate to any technology I come across in my day-to-day life, and I am sure you will too.

I can certainly say this series will help you with all those little tricks and troubleshooting techniques that might otherwise take you some time to find on your own.

In short, this is just my experience that I am sharing with you all, so that you can quickly solve or recognize these issues.

So let's not waste time and start with the first bit of information in this series.