
Load data from HDInsight Cluster to Vertica (part 3)

If you have landed on this page looking for “Load data from HDInsight Cluster to Vertica”, I recommend that you first visit post 1 and post 2, the earlier parts of this blog series.

This blog post is the third and final one in the “Load data from HDInsight Cluster to Vertica” series. It focuses on how to leverage Vertica’s ability to read data directly from BLOB storage.

How to use Vertica’s ability to read data from BLOB

After going through the previous posts, you should have a good idea of how we mounted the BLOB storage on the Vertica nodes. Now we will use Vertica’s external tables feature to read data directly from BLOB.

What are Vertica external tables (in this context)?

For now, I will explain Vertica external tables with reference to this use case only.
External tables exist in almost every commercial database; Vertica gained the feature from version 5 onwards. What it allows you to do is query a file directly without storing it in the DB engine: even though the data is not stored in Vertica’s own storage, we can still query it straight from the file using the Vertica SQL interface.
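
To make this concrete before we get to Parquet, here is a minimal sketch of an external table over a plain delimited file. The path, table, and columns are hypothetical, purely for illustration:

-- A minimal, hypothetical external table over a CSV file.
-- The file is parsed at query time, not when the table is created.
CREATE EXTERNAL TABLE ext_people
(
    name varchar(64),
    age  int
)
AS COPY FROM '/data/people.csv' DELIMITER ',';

-- Reads straight from the file; nothing is loaded into Vertica storage.
SELECT * FROM ext_people;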

That said, the feature comes with both shortcomings and advantages.
Shortcomings
1. Users need to know the file layout exactly before defining the external table.
2. Users cannot change the data in an external table, as it is just a linked file.
3. Users cannot run analyze_statistics on such tables, as the data is not in the Vertica storage engine.

Advantages
1. No extra storage footprint is consumed by the Vertica DB.
2. It supports querying data without loading it into the DB.
3. It supports S3 and Google Cloud Storage as well, so you can read data directly from cloud storage (a sketch follows after this list).
4. It supports loading Parquet and ORC format files as well.
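
To illustrate advantage 3, below is a hedged sketch of querying Parquet files directly from S3. The bucket and columns are hypothetical, and the cluster must already be configured with the appropriate AWS credentials and region for S3 access:

-- Hypothetical example: read Parquet data straight from S3.
-- Assumes AWS authentication is already configured for the cluster.
CREATE EXTERNAL TABLE sales_s3
(
    order_id int,
    amount   numeric(10,2)
)
AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET;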

What are the prerequisites when creating an external table in Vertica?

  1. Users need to know the layout of the file, along with the data types and the length of char columns.
  2. dbadmin needs read privileges on the file that the user is going to expose through the external table.

How to create an external table and link it to the BLOB storage mounted on the Vertica hosts

We start by creating a simple Parquet file on BLOB storage.
I have loaded a simple Parquet file to BLOB. I used a Zeppelin notebook to inspect its contents, which match the query output shown later in this post.


This is how the file looks in the BLOB storage that we mounted on the Vertica host:

# Checking this from one of the Vertica hosts
[navin sample_output]$ ls -lrth
total 0
-rwxrwxrwx 1 dbadmin verticadba   0 Aug 06:00 _SUCCESS
-rwxrwxrwx 1 dbadmin verticadba 839 Aug 06:00 part-00000-4bf.snappy.parquet
[navin sample_output]$


This is the layout/schema of the Parquet file along with its data types (Spark’s printSchema output, captured in the Zeppelin notebook):

root
 |-- Name: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- ZipCode: string (nullable = true)

Then we create a Vertica external table matching the layout of the file and query the data directly from the Vertica SQL interface.

-- Create table command for Vertica

CREATE EXTERNAL TABLE sample_output
(
     Name varchar(256),
     ID varchar(256),
     ZipCode varchar(256)
)
AS COPY FROM '/mnt/<container_name>/sample_output/*.parquet' PARQUET;
-- Note: the path here points into the mounted BLOB storage;
-- sample_output is the folder that holds the Parquet files.
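
If you want to double-check how Vertica recorded the table, the catalog keeps the COPY expression for external tables. This is a sketch; verify the catalog columns on your Vertica version:

-- Inspect the stored definition of the external table.
SELECT table_name, table_definition
FROM   v_catalog.tables
WHERE  table_name = 'sample_output';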



Finally, you can query the external table as shown below. All Vertica SQL should work on external tables except for DML commands (INSERT, UPDATE, DELETE); see the short example after the output.

-- After creating the table, we can query the table directly in Vertica

navin=> select * from sample_output;
      Name      | ID | ZipCode
----------------+----+----------
 Spartans       | 1  | 991271
 Avengers       | 2  | 981621
 Justice League | 34 | 12919271
 Transformers   | 44 | 9127912
(4 rows)
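
For instance, read-only SQL such as aggregates runs as usual over the external table, while DML is rejected because the data lives in the linked file (a small sketch, reusing the table above):

-- Aggregates and other read-only SQL work over external tables.
SELECT count(*) AS team_count,
       max(length(Name)) AS longest_name
FROM   sample_output;

-- DML does not work on external tables; a statement like the
-- following would fail:
-- DELETE FROM sample_output;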


Gotchas when using Vertica external tables for Parquet files

  1. The external table definition has to match the layout of the file exactly, so make sure you know the layout beforehand.
  2. The data types in the Parquet files need to be mapped to Vertica data types; I will shed some light on this mapping in an upcoming post.
  3. Avoid executing very complex SQL over external tables: the data is an external source and Vertica only provides a query engine over it, so the benefits Vertica reaps from its own columnar storage format cannot be applied to data in external tables.
  4. Make sure you add “*.parquet” in the COPY part of the external table definition, as the BLOB directory you point to can contain other folders/files that you don’t want to read. Without “*.parquet”, the Vertica query engine will fail if it finds Parquet files with a different layout, or any other readable file, in that location.
  5. If the location in the external table DDL is not valid, Vertica won’t flag it at creation time; instead, the error that the path cannot be found is thrown only when you query the table (see the sketch after this list).
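
Here is a quick sketch of gotcha 5 (the folder and table name are hypothetical): the CREATE succeeds even with a bad path, and the failure only surfaces when you query.

-- The DDL succeeds even though the folder does not exist:
CREATE EXTERNAL TABLE ext_bad
(
    Name varchar(256)
)
AS COPY FROM '/mnt/<container_name>/no_such_folder/*.parquet' PARQUET;

-- Only this query raises a "cannot find the path" style error:
SELECT * FROM ext_bad;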

Conclusion

This post concludes this mini blog series on “Load data from HDInsight Cluster to Vertica”. There are other approaches to load this data, but I picked this one specifically for the Azure cloud use case, where the requirement was only to read from the Azure HDInsight cluster, not to write back to it.

I hope you have enjoyed this series.

Please stay tuned for upcoming articles such as:

  1. Using Vertica external tables for Parquet wisely
  2. Optimization points for Vertica queries (maybe another blog series)
