HDInsight, Vertica

Load data from HDInsight Cluster to Vertica (part 2)

If you have landed on this page looking for an approach to load data from HDInsight Cluster to Vertica, I would recommend you to please check this page first.

This post is the second in series for “Load data from HDInsight Cluster to Vertica”.

This post is going to cover the part (highlighted below) where we link the BLOB storage account to Vertica nodes, this is going to be the most integral part of this series.

How to Link the BLOB storage account to Vertica Nodes.

We will be using a external library called BLOBFUSE to get this task done. 

Pre-Requisites

  1. BLOB account details
    1. Blob account name
    2. Blob account key
    3. container name in blob
  2. Make sure BLOB account in same network / allows network access to Vertica hosts.

Steps

  1. Follow the installation steps mentioned here according to your OS ( I will be using CentOS).

  2. Run the below commands on Vertica hosts 
#Login to Vertica host using dbadmin
sudo su dbadmin

# Create directory with container name 
sudo mkdir /mnt/<container_name>

# Change the owner of directory to dbadmin
sudo chown dbadmin /mnt/<container_name>

#Change the access mode / permissions on directory to be accessible by all
sudo chmod 777 /mnt/<container_name>

# Create another directory, this will act as a place for caching by BLOBFUSE library
sudo mkdir /mnt/<container_name>_blob

# Change the owner of directory to dbadmin, so that can be used by Vertica when reading data
sudo chown dbadmin /mnt/<container_name>_blob

3. create a new confgiration.cfg file which has format like this

4. Run the below command to mount the blobfuse

Note : –tmp_path=/mnt/<container_name>_blob is the path where the data will be kept in cache by blobfuse library when Vertica fetches the data from BLOB.

blobfuse /mnt/<container_name> --tmp-path=/mnt/ <container_name>_blob  -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --config-file=../connection.cfg --log-level=LOG_DEBUG --file-cache-timeout-in-seconds=120

5. Repeat the steps 2 and 4 on every host of Vertica to make sure the blob storage is mounted on each node of Vertica Cluster.

How to Validate

# Check the mounted drive on vertica host
[navin]$ df -h | grep blob
blobfuse              400G  9.4G  143G   4% /mnt/<container_name>

# Get into the mount location
[navin]$ cd /mnt/<container_name>

# Find the required file in blob storage directly using common linux commands
[navin container_name]$ ls -lrth | grep Test_file.txt
1 abc    abc         20 Feb 19 08:43 Test_file.txt

# Read the content of the file from blob directly
[navin container_name]$ cat Test_file.txt
Hello this is test

Make sure you validate the setting and mount on each host of vertica cluster.

Note : The user who has written the file does not matter here, as we are giving dbadmin all rights / permissions to have access to these files and read it. When Vertica query engine is going to query this data (this will be covered in next part), we will see the the tmp folder getting used, as blobfuse library uses that location to cache data it reads from BLOB storage.

Stay tuned for the the next part of this blob series, which will help you understand on how to query this data directly from Vertica.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s