How to build a calendar using Vertica SQL (bit 5)

This post is a part of Bit Series blog.

Every data-warehouse project needs to have calendar table as a dimension, which hosts all possible combinations of dates. This table can be used as a dimension or in many cases to derive the dates.

With this post, we will be seeing on how we can generate a calendar table directly using Vertica SQL. The Vertica SQL will be using a special Time series function which is inbuilt inside vertica.

The time series function of vertica enables us to perform analytics based on time-series and gap filling. You can get more information about it here.

For now we will stick to this user of generating the calendar using this function as below

    calendar AS
    slice_time                       AS actual_timestamp ,
    CAST(slice_time AS DATE)         AS DATE ,
    date_part('WEEK', slice_time)    AS week,
    date_part('YEAR', slice_time)    AS YEAR,
    date_part('MONTH', slice_time)   AS MONTH,
    date_part('DAY', slice_time)     AS DAY,
    date_part('QUARTER', slice_time) AS quarter
            '2019-01-01 00:00:00' AS sample_date
            '2019-12-31 00:00:00' AS sample_date ) a TIMESERIES slice_time AS '1 DAY' OVER(ORDER BY

So, what are we doing here
1. We take 2 sample dates, between which we need to fill the gap, suppose you want the calendar to be for year 2019, take the date as shown above in subquery
2. We are using 1 vertica function here, TIMESERIES , slice_time is just an alias here. TIMESERIES is going to generate the dates based on interval specified, in this case ‘1 DAY’
3. Lastly the TIMESERIES function will always require a ORDER by operation on timestamp column. Here we are just sorting the data based on timestamp generated.
4. Once we get a daily timestamp, we are just using standard date functions to derive various date time properties of that timestamp, like day, week, quarter, year, month, etc.
5. Finally, we are just storing the data from the select statement in the CALENDAR table

You can check the results of this table, and it can be used easily as dimension table for Calendar use case.

Go ahead and try generating this data at your end, there are many possibilities, where you can use this data.


How to load filename while loading file in Vertica (bit 4)

This post is a part of Bit Series blog.

This is going to be a quick trick on how to load the file name to DB table, when loading the data from file to vertica using COPY command.

Many people have this requirement, where they need to load the file name as the source of data, when loading the data from file ,as a part of ETL operations.

Vertica’s COPY command is a mini ETL tool in itself, if your know all the tricks on using it. Its like a little Pandora’s box, which has many secrets of its own, ready to be explored.

So how do you load a file name , when loading data from that file in Vertica.

This can be done using the function “current_load_source()”, this function can only be used with a COPY command when loading the data from a file. this helps to load the data to a column which is meant for storing the file names.

Please refer the below example

## Showing the content of the file
[navin~]$ cat
## Getting into Vertica console
[navin~]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.

Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit

-- Creating a table for storing the incoming data.
navin=>  create table test_file_name_table (i int, filename varchar);

-- Load the data from file using the function current_load_source()
navin=> copy test_file_name_table (i, filename as current_load_source()) from '/home/navin/';
 Rows Loaded
(1 row)

-- Validate the data in the file.
navin=> select * from test_file_name_table;
 i |             filename
 2 | /home/navin/
 3 | /home/navin/
 5 | /home/navin/
 1 | /home/navin/
 4 | /home/navin/
(5 rows)


Not only this, the function supports data loading from multiple files in a single COPY command too.

-- Truncating the table first
navin=> truncate table test_file_name_table;

-- Loading data from multiple files here using same function current_load_source()
navin=> copy test_file_name_table (i, filename as current_load_source()) from '/home/navin/test_file_name*.data';
 Rows Loaded
(1 row)

-- Validating the records
navin=> select * from test_file_name_table;
 i  |              filename
  2 | /home/navin/
  3 | /home/navin/
  5 | /home/navin/
 67 | /home/navin/
  1 | /home/navin/
  1 | /home/navin/
 32 | /home/navin/
  4 | /home/navin/
 45 | /home/navin/
 89 | /home/navin/
(10 rows)

So will this work even from windows, Yes it will even from Windows machine as below

-- This command is executed from vsql in windows console
navin=> copy test_file_name_table (i, filename as current_load_source()) from
local 'D:\Navin\txt\copy_from_windows.txt'
navin-> ;
 Rows Loaded
(1 row)

-- Validating the records
navin=> select * from test_file_name_table;
  i   |              filename
    2 | /home/navin/
    3 | /home/navin/
    5 | /home/navin/
   67 | /home/navin/
    4 | /home/navin/
   45 | /home/navin/
   89 | /home/navin/
  345 | D:\Navin\txt\copy_from_windows.txt
    1 | /home/navin/
    1 | /home/navin/
   32 | /home/navin/
  123 | D:\Navin\txt\copy_from_windows.txt
  678 | D:\Navin\txt\copy_from_windows.txt
  123 | D:\Navin\txt\copy_from_windows.txt
 3456 | D:\Navin\txt\copy_from_windows.txt
(15 rows)

This concludes the post on how to use current_load_source() function to load file name , when loading data into vertica.

Go ahead an explore the different possibilities with this function.

Uncategorized, Vertica

How to know the disk space consumption of Vertica Cluster (bit 3)

This post is a part of Bit Series blog.

Have you ever wondered on how much data size or catalog size is dedicated for Vertica on a single host / complete cluster.

This information is going to help you get the stats on the below

  1. How much data size dedicated for catalog and actual vertica data
  2. How much is size used by catalog and data on each host
  3. Stats in terms of percentage

Use the below query as a super user on Vertica to get the stats.

navin => select node_name, storage_usage, disk_space_used_mb, disk_space_free_mb, disk_space_free_percent  
	from disk_storage order by node_name, storage_usage;

     node_name | storage_usage | disk_space_used_mb| disk_space_free_mb| disk_space_free_percent
 v_db_node0001 | CATALOG       |              4540 |             46138 | 91%
 v_db_node0001 | DATA,TEMP     |            354548 |            270111 | 43%
 v_db_node0002 | CATALOG       |              9595 |             41083 | 81%
 v_db_node0002 | DATA,TEMP     |            362546 |            262113 | 41%
 v_db_node0003 | CATALOG       |              6957 |             43721 | 86%
 v_db_node0003 | DATA,TEMP     |            355516 |            269143 | 43%
(6 rows)

Note : This post only covers how to get stats for disk space utilization of Vertica data / catalog, besides this Vertica also stores the catalog in memory on every host and that can lead to some issues in performance of regular DML queries and TM operations.

Before Vertica 8.0 version, Vertica catalog issues were common, where a global catalog lock was held on every node of vertica to update the catalog at very frequent intervals, when there was such a lock held by Vertica Engine all the DML operations and Tuple Mover operations would be blocked. The main reason for this was Vertica tried to keep a centralized global catalog and sync of this log across all nodes would take time internally

Later they fixed by having a copy of global catalog on every node and thus reducing the time taken to sync to centralized catalog. Also they did reduce the size of global catalog by reducing the metadata related to projections inside each copy.

For more information on Catalog size in memory and checklist of identifying if your cluster has catalog performance issues, please refer to the below blogs by Vertica Engineering


How to close all Vertica sessions by a user (bit 2)

This post is a part of Bit Series blog and Its going to be a very quick one.

Have you ever faced a scenario as Vertica DBA to close all sessions on a vertica cluster for a specific user only.

The traditional way to do this is

  1. Query sessions table and find all open session for a specific user
  2. Take the session id one by one and then use “select close_session(‘session_id’) command to close each of those sessions

What if there is a smart of way of achieving this tedious task.

Yes, you can do all this with a single command as below.

## First we get all the sessions for the specific users and construct the close_session SQL command for each of the sessions, then we just pass the output of first vsql command as an input to second vsql command.

vsql -U dbadmin -w 'xxxxxx' -At -c "select 'select close_session(''' || session_id || ''');' from sessions where user_name = 'dbadmin';" | vsql  -U dbadmin -w 'xxxxxx'

Next time, you have this task at hand, do it with a single command using VSQL.

Stay tuned for more such upcoming tips and tricks.

HDInsight, Vertica

Load data from HDInsight Cluster to Vertica (part 3)

If you have landed on this page looking for “Load data from HDInsight Cluster to Vertica”, I would recommend you to visit the previous to post 1 and post 2, which are a part of this blog series.

This blog post is the third and final in series of “Load data from HDInsight Cluster to Vertica”. This will be focusing on how to leverage Vertica’s ability to read data from BLOB directly.

Continue reading

How to assign a resource pool to Vertica user

Recently I came across a use case, which to me, was a very easy task, but when I started to work, I found it to be rather complex.

The only reason was confusing documentation done by Vertica team.

So the question was “How to assign a resource pool to any Vertica user”

Requirement –

We have a user who needs to be using a specific resource pool always.

Solution –

Create a user

Here we create a SQL authenticated user and just check the default resource pool assigned to it in system tables. We will see every user gets assigned to ‘general’ resource pool by default

navin=> create user navin identified by 'xxxxxxx';

navin=> show resource_pool
navin-> ;
     name      | setting
 resource_pool | general
(1 row)

navin=> select user_name, resource_pool from users where user_name like 'navin';
-[ RECORD 1 ]-+--------
user_name     | navin
resource_pool | general

Grant usage permission to the user for this resource pool.

User needs to have USAGE privileges on a specific resource pool, before assigning the resource pool as a default pool for that user. This can be done by using the below command,

navin=> GRANT USAGE ON RESOURCE POOL recovery to navin;

Assign a default resource pool to the user

By default, if you don’t assign a resource pool to the user, ‘general’ resource pool is applied by Vertica.

If we need a user to use a specific pool, we should assign this resource pool to the user , using the below command . Here we have assigned resource pool “recovery” to this user.

navin=> ALTER USER navin RESOURCE POOL recovery;
navin=> select user_name, resource_pool from users where user_name like 'navin';
 user_name | resource_pool
 navin     | recovery

--Validating if user can use this pool

navin=> set resource_pool=recovery;
navin=> show resource_pool;
     name      | setting
 resource_pool | recovery
(1 row)

These are the 3 steps which needs to be followed in order to assign a resource pool to the user.

Note :

  1. we have assigned this test user to recovery resource pool here, just for example purpose. recovery is a inbuilt resource pool, we should ideally avoid any kind of usage of this pool by any user of Vertica.
  2. We can assign USAGE permission on multiple user resource pools to a user, but only 1 resource pool can be set to default for a user.