With the ever-growing adoption of big data stacks such as Spark and the cloud, enabling Vertica to consume data from a Spark cluster has become very important.
We came across a very common use case where we had to transfer data from HDInsight (a managed Spark cluster) to a Vertica cluster.
In this blog series, I will take you through the easiest way of doing it. Since this is a detailed topic, I will break it into multiple posts for better understanding:
- Introduction and detailing the approach (this one)
- Linking BLOB storage to Vertica Cluster directly (part 2)
- Loading the data from BLOB storage into Vertica tables directly (part 3)
There are different ways of loading data from a typical Spark cluster into Vertica:
- Using the Vertica Spark Connector
- Dumping the data from the Spark cluster to a common location from which Vertica can read the files
- Leveraging BLOB account storage so that Vertica can read it directly
I will focus on the third approach, which specifically targets the HDInsight service offered by Microsoft Azure.
To start with, it helps to have a fair idea of the following terms:
- Azure HDInsight Managed Service – an offering in the Microsoft Azure stack where Azure is responsible for providing a fully fledged Spark cluster. The cluster comes bundled with all the required internals such as Hadoop, ZooKeeper, the Spark framework, the Ambari UI, and so on. In short, this is the fastest way to set up a fully fledged Spark cluster in the Azure cloud. You will have the ability to:
- Select and deploy the infrastructure as required
- Scale the infrastructure up or down as required
- Select BLOB as the storage engine, a Hadoop-compatible storage layer
- Azure BLOB Storage – another Microsoft Azure offering, which provides a complete storage solution for all types of unstructured data. An important thing to note: while installing an Azure HDInsight cluster, you must select a default storage option for that cluster, and BLOB storage can serve as that default. Hadoop will use this storage to store and manage all of your intermediate files during processing.
- Vertica Cluster – a set of hosts, each running a single Vertica instance, bundled together to form a cluster. This is the basis of Vertica's MPP architecture. Every host has its own compute power and storage capacity, and all of them work together to store and process data at the database level.
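As an aside, HDInsight wires the default BLOB container into Hadoop and Spark through the WASB driver, so files in BLOB storage are addressed with `wasbs://` URIs. Here is a minimal sketch of that addressing convention; the container, storage account, and path names are hypothetical:

```python
def wasbs_uri(container: str, storage_account: str, path: str) -> str:
    # Build a wasbs:// URI of the form used by Hadoop/Spark on HDInsight
    # to address a file or directory in an Azure BLOB storage container.
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{path.lstrip('/')}"

# Hypothetical container/account names, for illustration only:
uri = wasbs_uri("data", "mystorageacct", "exports/events")
print(uri)  # wasbs://data@mystorageacct.blob.core.windows.net/exports/events
```

The same URI can then be used from Spark jobs or Hadoop tooling on the cluster to read and write files in that container.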
The diagram above shows the detailed approach for transferring data from the HDInsight cluster to the Vertica cluster:
- The data will be stored in Parquet format in BLOB storage
- The BLOB storage will be mounted on the Vertica nodes as an external drive
- Vertica external tables will read the data from the mounted drive
For now, we are going to use the above approach for loading the data.
We will go through the approach in detail in the upcoming posts, so stay tuned.