If you have landed on this page looking for an approach to load data from HDInsight Cluster to Vertica, I would recommend you to please check this page first.
This post is the second in series for “Load data from HDInsight Cluster to Vertica”.
This post is going to cover the part (highlighted below) where we link the BLOB storage account to Vertica nodes, this is going to be the most integral part of this series.
How to Link the BLOB storage account to Vertica Nodes.
We will be using a external library called BLOBFUSE to get this task done.
- BLOB account details
- Blob account name
- Blob account key
- container name in blob
- Make sure BLOB account in same network / allows network access to Vertica hosts.
Follow the installation steps mentioned here according to your OS ( I will be using CentOS).
- Run the below commands on Vertica hosts
#Login to Vertica host using dbadmin sudo su dbadmin # Create directory with container name sudo mkdir /mnt/<container_name> # Change the owner of directory to dbadmin sudo chown dbadmin /mnt/<container_name> #Change the access mode / permissions on directory to be accessible by all sudo chmod 777 /mnt/<container_name> # Create another directory, this will act as a place for caching by BLOBFUSE library sudo mkdir /mnt/<container_name>_blob # Change the owner of directory to dbadmin, so that can be used by Vertica when reading data sudo chown dbadmin /mnt/<container_name>_blob
3. create a new confgiration.cfg file which has format like this
4. Run the below command to mount the blobfuse
Note : –tmp_path=/mnt/<container_name>_blob is the path where the data will be kept in cache by blobfuse library when Vertica fetches the data from BLOB.
blobfuse /mnt/<container_name> --tmp-path=/mnt/ <container_name>_blob -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 --config-file=../connection.cfg --log-level=LOG_DEBUG --file-cache-timeout-in-seconds=120
5. Repeat the steps 2 and 4 on every host of Vertica to make sure the blob storage is mounted on each node of Vertica Cluster.
How to Validate
# Check the mounted drive on vertica host [navin]$ df -h | grep blob blobfuse 400G 9.4G 143G 4% /mnt/<container_name> # Get into the mount location [navin]$ cd /mnt/<container_name> # Find the required file in blob storage directly using common linux commands [navin container_name]$ ls -lrth | grep Test_file.txt 1 abc abc 20 Feb 19 08:43 Test_file.txt # Read the content of the file from blob directly [navin container_name]$ cat Test_file.txt Hello this is test
Make sure you validate the setting and mount on each host of vertica cluster.
Note : The user who has written the file does not matter here, as we are giving dbadmin all rights / permissions to have access to these files and read it. When Vertica query engine is going to query this data (this will be covered in next part), we will see the the tmp folder getting used, as blobfuse library uses that location to cache data it reads from BLOB storage.
Stay tuned for the the next part of this blob series, which will help you understand on how to query this data directly from Vertica.