
Spark Journal : String cell values should not be truncated

A very short article, focusing precisely on the problem of truncated string values in cells on the Zeppelin tool.

For those who don't know what Zeppelin is: it is an interactive web tool that helps you run Scala / SQL on a Spark cluster directly.

Recently, we found a simple issue: the team reported that long string values were getting truncated while querying in Zeppelin.
After a bit of R&D, we found it is just a simple config on the Livy interpreter that enables / disables this behavior.

The problem
If we try a simple query on Zeppelin where the string is long, the string in the result gets truncated, as shown below.
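For illustration, a paragraph like the hypothetical one below reproduces the problem; the interpreter binding name (%livy2.sql) and the sample value are my own assumptions, and with the default settings the displayed value comes back cut off (to roughly 20 characters, if I recall the default correctly).

%livy2.sql
SELECT 'a-sufficiently-long-string-value-for-this-demo' AS long_value
-- The stored data is intact; only the displayed value is truncated.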

The solution
After setting the property "zeppelin.livy.spark.sql.field.truncate" to false through the interpreter (Livy2) settings, this worked like a charm.

To set this:
1. Go to the Interpreter page.
2. Click EDIT for the Livy2 interpreter.
3. Add this property and save; it will ask for a restart while saving.

Next time you open your notebook and run the same query, it will work as shown below.

Hope you learned something from this small article!


Spark Journal : Adding Alias for columns in bulk with SELECT API

This is a continuation of my previous blog, which showed how to use an alias for a column name when using the SELECT API on dataframes in Spark.
Exploring the same, I found a good way to handle another scenario: when you are dealing with multiple columns (a good number of columns), it is not feasible to write a SELECT command with each column manually.

Instead, I would prefer a programmatic way to do it, so that it is easier, keeps the code clean, and stays readable.
In this approach:
1. First, we use a predefined Scala Map that stores column names as keys and column aliases as values. It is a default immutable Scala object.
2. Second, the Map defined above is used as a lookup: we traverse each column name of the dataframe and compare / match it against the existing keys in the Map.
3. Third, we pass the final columns identified after the comparison as a list to the SELECT API, using Scala type ascription for varargs (: _*).

import org.apache.spark.sql.functions.col
import spark.implicits._   // for toDF; already in scope inside spark-shell / Zeppelin

val dataList = List((1,"abc"),(2,"def"))
val df = dataList.toDF("id","Name")

// Lookup map: column name -> alias
val colalias : Map[String, String] = Map("id" -> "unique id", "Name" -> "Actual Name")

// Apply the alias where one is defined; keep the column as-is otherwise
val aliasedCols = df.columns.map(name => colalias.get(name) match {
  case Some(newname) => col(name).as(newname)
  case None => col(name)
})

// Expand the array of columns into varargs with ": _*"
df.select(aliasedCols: _*).show

+---------+-----------+
|unique id|Actual Name|
+---------+-----------+
|        1|        abc|
|        2|        def|
+---------+-----------+

Next time you have such a task at hand and don't want to use the traditional way, use this smarter way to replace column aliases dynamically.


How to know the disk space consumption of Vertica Cluster (bit 3)

This post is part of the Bit Series blog.

Have you ever wondered how much data size or catalog size is dedicated to Vertica on a single host or across the complete cluster?

This information is going to help you get stats on the below:

  1. How much space is dedicated to the catalog and the actual Vertica data
  2. How much space is used by the catalog and the data on each host
  3. The same stats in terms of percentage

Use the below query as a superuser on Vertica to get the stats.



navin => select node_name, storage_usage, disk_space_used_mb, disk_space_free_mb, disk_space_free_percent  
	from disk_storage order by node_name, storage_usage;

     node_name | storage_usage | disk_space_used_mb | disk_space_free_mb | disk_space_free_percent
-------------------+---------------+--------------------+--------------------+-------------------------
 v_db_node0001 | CATALOG       |              4540 |             46138 | 91%
 v_db_node0001 | DATA,TEMP     |            354548 |            270111 | 43%
 v_db_node0002 | CATALOG       |              9595 |             41083 | 81%
 v_db_node0002 | DATA,TEMP     |            362546 |            262113 | 41%
 v_db_node0003 | CATALOG       |              6957 |             43721 | 86%
 v_db_node0003 | DATA,TEMP     |            355516 |            269143 | 43%
(6 rows)
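If you need a cluster-wide rollup rather than per-node numbers, the same system table can simply be aggregated. The query below is my own sketch, not part of the original stats, so adapt it as needed.

select storage_usage, sum(disk_space_used_mb) as used_mb, sum(disk_space_free_mb) as free_mb
	from disk_storage group by storage_usage order by storage_usage;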

Note: This post only covers how to get stats for the disk space utilization of Vertica data / catalog. Besides this, Vertica also stores the catalog in memory on every host, and that can lead to performance issues with regular DML queries and TM operations.

Before Vertica 8.0, catalog issues were common: a global catalog lock was held on every node of Vertica to update the catalog at very frequent intervals, and while such a lock was held by the Vertica engine, all DML operations and Tuple Mover operations would be blocked. The main reason for this was that Vertica tried to keep a centralized global catalog, and syncing it across all nodes would take time internally.

Later they fixed this by keeping a copy of the global catalog on every node, thus reducing the time taken to sync to the centralized catalog. They also reduced the size of the global catalog by trimming the projection-related metadata inside each copy.

For more information on catalog size in memory and a checklist for identifying whether your cluster has catalog performance issues, please refer to the below blogs by Vertica Engineering:

  1. https://www.vertica.com/blog/catalog-size-debugging/
  2. https://www.vertica.com/blog/calculate-the-catalog-size-in-memory-on-each-node-quick-tip/

Bit Series Introduction: Small pieces of useful information

Hey Folks ,

Blogging after a long time... almost 4 years... but it feels like a decade (a hypothetical statement).

This series is going to shed light on very small pieces of information / knowledge that you can grasp very quickly and that will definitely help you in your day-to-day combat with your technical stack. It is not tied to any specific technology; it can relate to any technology I come across in my day-to-day life, and I am sure you will too.

I can certainly tell that this series will help you with all those little tricks and troubleshooting techniques that might otherwise take some time to find on your own.

In short, this is just my experience that I am sharing with you all, so that you can quickly solve or recognize these issues.

So let's not waste time and start with the first bit of information in this series.


User Defined SQL Function in Vertica

Today we will have a look at creating another Vertica SQL function, along with the clauses and restrictions that apply while creating them. Functions are a good method of reducing your SQL code and using it again and again (reusability). First we will list down the rules for writing a SQL function in Vertica and then explain them with an example.

BASIC SYNTAX

The basic syntax for writing a User Defined SQL Function in Vertica is:

CREATE [ OR REPLACE ] FUNCTION
[[db-name.]schema.]function-name ( [ argname argtype [, ...] ] )
RETURN rettype
AS
BEGIN
RETURN expression;
END;

Creating SQL functions in Vertica is a bit different from Oracle. Though it follows the same block structure (BEGIN and END), it comes with major restrictions:

  1. We cannot use a SELECT inside the block of a Vertica SQL function the way we can in Oracle.
  2. For now, there is no support for variables or loops in User Defined SQL Functions.

This limits the usage of SQL functions to a great extent.

EXAMPLE

A simple User Defined SQL function would look like this:
CREATE FUNCTION myzeroifnull(x INT) RETURN INT
AS BEGIN
RETURN (CASE WHEN (x IS NOT NULL) THEN x ELSE 0 END);
END;
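A couple of quick calls of my own to verify the behavior; the results follow directly from the CASE expression above.

SELECT myzeroifnull(NULL);  -- returns 0
SELECT myzeroifnull(5);     -- returns 5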
Best use of User Defined SQL Functions: user-defined SQL functions are best used when migrating SQL statements from other databases, where those statements reference functions that do not exist in Vertica.
A sample use case: Oracle has a string function REVERSE, which reverses the characters of a string.
Vertica does not have such a reverse function in its built-in library, so we create a User Defined SQL Function REVERSE like this:
CREATE FUNCTION PUBLIC.REVERSE(X VARCHAR) RETURN VARCHAR
AS
BEGIN RETURN CASE WHEN LENGTH(X) <= 15
THEN
(SUBSTR(X, 15, 1 ) || SUBSTR(X, 14, 1 ) || SUBSTR(X, 13, 1 ) || SUBSTR(X, 12, 1 ) ||
SUBSTR(X, 11, 1 ) || SUBSTR(X, 10, 1 ) || SUBSTR(X, 9, 1 ) || SUBSTR(X, 8, 1 ) ||
SUBSTR(X, 7, 1 ) || SUBSTR(X, 6, 1 ) || SUBSTR(X, 5, 1 ) || SUBSTR(X, 4, 1 ) ||
SUBSTR(X, 3, 1 )|| SUBSTR(X, 2, 1 )|| SUBSTR(X, 1, 1 ))
ELSE 'STRING OUT OF RANGE' END;
END;

The above function can be used with strings of length 15 or less. Now we can use this function in any of our SELECT queries, as shown below.
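For example (my own test string; the result follows from the SUBSTR chain above, since SUBSTR past the end of a string returns an empty string in Vertica):

SELECT PUBLIC.REVERSE('vertica');  -- returns 'acitrev'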

TRACKING UDF SQL FUNCTION

You can track your UDF SQL functions just by querying the system table v_catalog.user_functions, as sketched below.
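A sketch of such a lookup; the column names are from the user_functions table as I recall them, so verify them against your Vertica version.

SELECT schema_name, function_name, function_return_type
FROM v_catalog.user_functions
WHERE function_name ILIKE 'reverse';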


Hope this helps.

Estimating Database and Table Size in Vertica

Hello All,

Today we will be looking at how to calculate size of database / tables in a Vertica Cluster.
So how is Vertica different in terms of storage? It is columnar.
Vertica stores data in columnar format, and hence the compression ratio is much better than in other databases. I will not emphasize the columnar storage of Vertica in this article (that will surely get a completely different article).
Below are some snippets through which you can get to know how much data resides in your database, or how much space your data consumes in the database.

We will be calculating two different sizes in this article, with respect to raw data size and actual database size:
1. Database size
2. Table size

Let's start with the database size.
The function used for calculating the database size is AUDIT.
AUDIT: performs a full database audit and gives the actual database size in terms of raw data. The audit considers all the data in the database, including the data in your WOS (Write Optimized Store, the memory of your cluster).
Audit results are given in bytes. The function can be supplied with two optional parameters, error_tolerance and confidence_level.

Lowering the error_tolerance to zero (0) and raising the confidence_level to 100 provides the most accurate size estimate, but it also increases the performance impact of calling the AUDIT function. During a detailed, low error-tolerance audit, all of the data in the database is dumped to a raw format to calculate its size. Since performing such a stringent audit can significantly impact database performance, never perform a full audit of a production database.
The AUDIT function estimates the size of your database in terms of raw data. What is raw data? Raw data means the uncompressed, un-federated data stored in a single Vertica database.

What is excluded while calculating the raw database size:
1. The same data in multiple projections
2. Data stored in temp tables
3. Data marked for deletion (delete vectors)
4. Data stored in monitoring, Data Collector, and system tables
Usage of the AUDIT function

To calculate the complete database size:

NNANI=> select audit('');
   audit
-----------
 283448493

(1 row)
To calculate the complete database size with high accuracy (and high performance impact):

NNANI=> select audit('',0,100);
   audit
-----------
 282084688

(1 row)
Here the first parameter is error_tolerance and the second is confidence_level, so the difference between the two results shows the effect of the accuracy level.
To calculate the size of a table:

NNANI=> select audit('public.bank_holidays');
 audit
-------
 319

(1 row)

The size of the bank_holidays table is 319 bytes.
Calculating the actual database size:
The actual database size is the footprint of your database, i.e. it includes all the projections for the same table, the data in the WOS, and the data in system tables. In simple words, it is the space your data consumes in your database.
For calculating this we can use two monitoring tables:
1. column_storage
2. projection_storage
Using the column_storage table to calculate the database size:

NNANI=> select round((sum(column_used_bytes)/1024/1024/1024), 2.0) || ' GB' as database_size_in_GB from
NNANI-> (select anchor_table_schema, anchor_table_name, node_name, column_name, sum(used_bytes) as column_used_bytes
NNANI(> from column_storage where
NNANI(> node_name IN (select node_name from nodes)
NNANI(> group by 1,2,3,4) sub;
 database_size_in_GB
---------------------
 0.1 GB

(1 row)
Using the projection_storage table to calculate the database size:

NNANI=> select round(sum(used_bytes)/1024/1024/1024, 2.0) || ' GB' as database_size_in_GB from
NNANI-> (select anchor_table_name, anchor_table_schema, node_name, sum(ros_used_bytes) as used_bytes
NNANI(> from projection_storage
NNANI(> group by 1,2,3
NNANI(> order by 3) sub;
 database_size_in_GB
---------------------
 0.1 GB

(1 row)
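Since AUDIT reports the raw (uncompressed) size and projection_storage reports the compressed footprint, putting the two side by side gives a rough compression ratio. The query below is my own sketch, not part of the original article:

select (select audit('')) as raw_bytes,
       (select sum(ros_used_bytes) from projection_storage) as stored_bytes;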
Using the column_storage table to calculate the table size:

NNANI=> select CASE WHEN round((sum(column_used_bytes)), 2.0) > 0 THEN round((sum(column_used_bytes)), 2.0) || ' Bytes' END as table_size_in_Bytes,
NNANI-> CASE WHEN round((sum(column_used_bytes)/1024), 2.0) > 0 THEN round((sum(column_used_bytes)/1024), 2.0) || ' KB' END as table_size_in_KB,
NNANI-> CASE WHEN round((sum(column_used_bytes)/1024/1024), 2.0) > 0 THEN round((sum(column_used_bytes)/1024/1024), 2.0) || ' MB' END as table_size_in_MB,
NNANI-> CASE WHEN round((sum(column_used_bytes)/1024/1024/1024), 2.0) > 0 THEN round((sum(column_used_bytes)/1024/1024/1024), 2.0) || ' GB' END as table_size_in_GB
NNANI-> from
NNANI-> (select anchor_table_schema, anchor_table_name, node_name, column_name, sum(used_bytes) as column_used_bytes
NNANI(> from column_storage where anchor_table_schema='public' and anchor_table_name='bank_holidays' and
NNANI(> node_name IN (select node_name from nodes)
NNANI(> group by 1,2,3,4) sub;

 table_size_in_Bytes | table_size_in_KB | table_size_in_MB | table_size_in_GB
---------------------+------------------+------------------+------------------
 828 Bytes           | 0.81 KB          |                  |
(1 row)
You can customize the above query according to your needs. You can even use the projection_storage table in the same manner as column_storage; a sketch follows below.
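Here is that per-table calculation against projection_storage (my own example; adjust the schema and table names to your case):

select anchor_table_name,
       round(sum(ros_used_bytes)/1024, 2.0) || ' KB' as table_size_in_KB
from projection_storage
where anchor_table_schema = 'public' and anchor_table_name = 'bank_holidays'
group by anchor_table_name;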
This was a short article on calculating table size and database size in Vertica.
Please suggest any corrections needed to this article.
Your feedback is appreciated.


Vertica SQL function Words_to_number

Hi There,

Just a quick update from where we left off last time: Vertica SQL Functions.

Here is a good example of Vertica SQL Function

The name of this function is words_to_number.
This function takes as its argument a number written in words, up to two words, say 'seventy one', and gives the result in digits, i.e. 71.

Just observe the way we can use a SQL function in Vertica to package reusable code.

Function Definition

CREATE OR REPLACE FUNCTION WORDS_TO_NUMBER(x VARCHAR) RETURN VARCHAR
AS BEGIN
RETURN
-- First part: the tens word ('twenty' .. 'ninety'), recognized by its 'ty' suffix
CASE
WHEN REGEXP_COUNT(SPLIT_PART(lower(x), ' ', 1), 'ty') = 1 then
(CASE
WHEN SPLIT_PART(lower(x), ' ', 1) = 'twenty' THEN 2*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'thirty' THEN 3*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'forty' THEN 4*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'fifty' THEN 5*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'sixty' THEN 6*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'seventy' THEN 7*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'eighty' THEN 8*10
WHEN SPLIT_PART(lower(x), ' ', 1) = 'ninety' THEN 9*10 ELSE 0 END)
ELSE 0 END +
-- Second part: the units word, either as the second word of a two-word input ...
CASE
WHEN REGEXP_COUNT(lower(x), ' [a-z]') = 1 then
(CASE
WHEN SPLIT_PART(lower(x), ' ', 2) = 'one' THEN 1*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'two' THEN 2*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'three' THEN 3*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'four' THEN 4*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'five' THEN 5*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'six' THEN 6*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'seven' THEN 7*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'eight' THEN 8*1
WHEN SPLIT_PART(lower(x), ' ', 2) = 'nine' THEN 9*1
ELSE 0 END)
-- ... or as the only word, when the input contains no space at all
WHEN REGEXP_COUNT(x,' ') = 0 THEN
(CASE
WHEN lower(x) = 'one' THEN 1
WHEN lower(x) = 'two' THEN 2
WHEN lower(x) = 'three' THEN 3
WHEN lower(x) = 'four' THEN 4
WHEN lower(x) = 'five' THEN 5
WHEN lower(x) = 'six' THEN 6
WHEN lower(x) = 'seven' THEN 7
WHEN lower(x) = 'eight' THEN 8
WHEN lower(x) = 'nine' THEN 9
ELSE 0 END)
ELSE 0 END;

END;

Always remember to use the BEGIN and END statements as in the function definition above.
And the most important point, or you could say the limit of Vertica SQL functions: no SELECT statement is allowed inside the function body.

Some examples of using this function
Function Usage


nnani=> select WORDS_TO_NUMBER('twenty three');

 WORDS_TO_NUMBER
-----------------
 23
(1 row)

You can also try some more examples with this function:


select WORDS_TO_NUMBER('ninety nine');

select WORDS_TO_NUMBER('seventy one');

select WORDS_TO_NUMBER('one');

The limit when using this function is that the input can only extend to two words; you cannot pass a three-word number as the argument.

Go ahead and use this function. Let me know your thoughts on it.


Vertica SQL Functions

Hi all,

Unlike other databases, Vertica is not so flexible when it comes to user-defined functions.

Functions can be defined in two ways in Vertica:
1. Vertica SQL Function
2. Vertica UDF (User Defined Function)

Today we will be looking at an example of the first type, the Vertica SQL function.

Vertica SQL Function

They can be defined using the CREATE FUNCTION statement; a return type is mandatory.

Syntax for SQL Function


CREATE [ OR REPLACE ] FUNCTION
[[db-name.]schema.]function-name ( [ argname argtype [, ...] ] )
RETURN rettype
AS
BEGIN
RETURN expression;
END;


Now, an example to elaborate the above syntax:


CREATE FUNCTION ADD_DAYS(x TIMESTAMP, y INT) RETURN TIMESTAMP
AS BEGIN
RETURN ( CAST (x AS TIMESTAMP) + y );
END;

This example creates the ADD_DAYS function, which adds a number of days (an INT) to a date / timestamp value.

The arguments defined are of two types: TIMESTAMP and INT.

So, here are examples of how this function works.

EXAMPLE USAGE :

nnani=> select ADD_DAYS('2013-01-20',1);
      ADD_DAYS
---------------------
 2013-01-21 00:00:00
(1 row)

nnani=> select ADD_DAYS('2013-01-20',60);
      ADD_DAYS
---------------------
 2013-03-21 00:00:00
(1 row)


History Of Vertica

Hello All,

This is my first blog with Vertica-database blogs.
I think starting with the history of the Vertica database is the best option at this time.

For those who are new: I hope you all find this blog interesting.

Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer.
Vertica is one of the few pure columnar databases existing to date.

Did you know that, before being commercialized, Vertica was called the C-Store database?
Yes, C-Store (Column Store).
The C-Store project was a collaboration between MIT, Yale, Brandeis University, Brown University, and UMass Boston.

A short note on C-Store

C-Store is a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures. 



More detail about this project can be found here

C-Store project

The last release of the C-Store project, version 0.2, is publicly available. Released in October 2006, it is built with open source tools and runs on Linux x86 computers.

There is also a white paper released by Michael Stonebraker and team:

C-Store whitepaper

Enjoy reading.
Appreciate your comments.