
Spark Journal : Using the select API on DataFrames

When working with Spark DataFrames, you will find many instances where you have to select columns from a DataFrame. Be aware that this is not about running SQL SELECT statements against the DataFrame, but about using the Spark select API on the DataFrame object directly.
I know many people prefer writing SQL SELECT statements directly over a DataFrame, which Spark also supports, but I started doing the same thing using the Spark API on DataFrame objects.
If you want to know more about all the APIs supported on DataFrame objects, please refer to the official documentation.

So, after spending almost a day trying out different combinations, I found that there are multiple ways of selecting columns from a DataFrame using the select API on the DataFrame object.
Somehow, I don't feel this is documented well with examples, but maybe it's just me facing this issue as a beginner.

Let's say we build a DataFrame like the one below.

// In spark-shell these implicits are in scope already; in an application, import spark.implicits._ first
val dataList = List((1, "abc"), (2, "def"))
val df = dataList.toDF("id", "Name")
df.show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 1 : Using quoted column names

df.select("id", "Name").show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 2 : Using $ with quoted column names
The $ shorthand turns a column name into a Column object, which can then be used to drive many more column transformations; I will try to cover those ahead, but a quick example follows the output below.

df.select($"id", $"Name").show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+
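
As a small taste of those transformations, here is a minimal sketch of mine (the derived column name id_plus_one is not from the original post): since $"id" is a Column object, arithmetic and renaming work on it directly.

df.select($"id", ($"id" + 1).alias("id_plus_one")).show

+---+-----------+
| id|id_plus_one|
+---+-----------+
|  1|          2|
|  2|          3|
+---+-----------+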

Approach 3 : Using the col function along with quoted column names
Again, the col function returns a Column object, which allows many more transformations ahead with ease. Note that col lives in org.apache.spark.sql.functions, so outside of spark-shell it needs an import; another example follows the output below.

import org.apache.spark.sql.functions.col

df.select(col("id"), col("Name")).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+
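
For instance, string functions can be applied to a col column directly; a minimal sketch (upper and the alias name_upper are my own illustrations, not from the original post):

import org.apache.spark.sql.functions.{col, upper}

df.select(col("id"), upper(col("Name")).alias("name_upper")).show

+---+----------+
| id|name_upper|
+---+----------+
|  1|       ABC|
|  2|       DEF|
+---+----------+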

Approach 4 : Using a List[String] which has all column names
This approach is very popular when you are dealing with a standard set of columns or a huge number of columns and don't want to keep writing the SELECT with all the column names repetitively.
We can define a List[String] with the column names, in the order we want to see them in the result, and then call the map method on that list; map iterates over the column names (Strings) and converts each one into a Column, and the resulting list is expanded into the select call (see the reordering sketch after the output below).

val colList = List("id", "Name")
// map each name to a Column, then expand the list as varargs into select with : _*
df.select(colList.map(col): _*).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+
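
To see the ordering aspect in action, changing the order of names in the list reorders the output columns (a minimal sketch):

df.select(List("Name", "id").map(col): _*).show

+----+---+
|Name| id|
+----+---+
| abc|  1|
| def|  2|
+----+---+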

There are a few more approaches; I will try to detail them as I learn them in depth.
The next article will be more of an extension to this one and will cover more APIs that get used very frequently.


Spark Journal : Building an Empty DataFrame

I recently started learning the Scala language, along with the Spark framework, while working on our big data stack. Having little experience with Java, it is a challenge to learn the fundamentals, but I am still learning and it's a long way to go.

I am publishing small bits of useful information so that beginners like me can find them useful.

So the task in Spark was to create empty DataFrames (I won't go into DataFrame details for now, as even I am still learning the stuff). If you have worked with the pandas library in Python, you should be acquainted with the term DataFrame.
If you still don't understand it, for a quick intuition, think of it as a two-dimensional table which stores its data in memory (and can spill to disk).

Creating a DataFrame is usually done by reading data files in whatever format. But we had to create an empty DataFrame, and for this we used the approach below.

Using a case class
A case class is a very frequently used construct in the Scala language; we will use one to define the schema of the DataFrame.
Here we use Seq.empty, which creates an empty sequence in Scala typed to the case class, and then convert it to a DataFrame using the toDF method, the Spark API for creating DataFrames from Sequences and Lists.

case class model(id: Int, Name: String, marks: Double)
val emptyDf = Seq.empty[model].toDF
emptyDf.show

+---+----+-----+
| id|Name|marks|
+---+----+-----+
+---+----+-----+

So you can define the schema you need and then create an empty DataFrame as above.
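
As an aside that is not from the original post: another common way of building an empty DataFrame, when you don't want to define a case class, is to pass an explicit StructType schema to spark.createDataFrame along with an empty RDD. A minimal sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Define the schema explicitly instead of deriving it from a case class
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("Name", StringType),
  StructField("marks", DoubleType)
))

val emptyDf2 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
emptyDf2.show

+---+----+-----+
| id|Name|marks|
+---+----+-----+
+---+----+-----+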


Powershell Journal – For Loop in a single statement

Recently I was given the task of writing a PowerShell script that does some deletion of blob objects, but the challenge was to keep the script minimal, mostly limited to a one-liner.

I know you wouldn't call a one-liner a script, but rather a command.
So I took up this challenge and started delving into how we can iterate over a list of objects in a single line with PowerShell.

Yes, I know it is possible by just writing the regular for loop syntax on a single line, as sketched below, but we are going to do it using the pipe operator, so the regular for loop syntax is not what we want here.
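
For reference, a single-line for loop would look something like this trivial sketch of mine (not from the original post):

D:\Navin\test> for ($i = 1; $i -le 3; $i++) { Write-Host $i }
1
2
3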

Using the pipeline, we can do it in two ways.

Let's suppose we are trying to list the files in a directory.

D:\Navin\test> Get-ChildItem .

    Directory: D:\Navin\test

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        8/27/2019   9:47 PM              0 test1 - Copy.txt
-a----        8/27/2019   9:47 PM              0 test1.txt
-a----        8/27/2019   9:47 PM              0 test2.txt
-a----        8/27/2019   9:47 PM              0 test3.txt

Based on this output, we can use ForEach-Object along with the pipeline operator to pass the result of the first command to the second command as input. Inside the loop, $_ refers to the current element of the list, so $_.Name accesses that element's Name property.

D:\Navin\test> Get-ChildItem  | ForEach-Object {write-host $_.Name}
test1 - Copy.txt
test1.txt
test2.txt
test3.txt

The other way of doing this is by using the % symbol instead of ForEach-Object; % is simply an alias for ForEach-Object, so it does exactly the same thing but is shorter.

D:\Navin\test> Get-ChildItem  | % {write-host $_.Name}
test1 - Copy.txt
test1.txt
test2.txt
test3.txt

That is how we can iterate over a result (list) in a single line of PowerShell.
Go ahead and explore this yourself; you will be amazed at how easy it is.


How to delete Azure blob snapshots using PowerShell

Recently we came across a scenario where an HDInsight cluster was connected to an Azure blob container, and Spark jobs were failing when deleting Parquet files for which snapshots were already present.

Azure blob snapshots are read-only, point-in-time copies of blob files.
We were unable to track down how these snapshots were created, but the important task at hand was to delete them, so that Spark could continue processing the regular Parquet files.

Identifying and deleting the snapshots manually would have taken ages, as it is a very cumbersome process when done through Azure Storage Explorer.
I did some research on automating this effort, as there are already client libraries available for Azure blob storage in Java, PowerShell, and Python.
As I am proficient with PowerShell, and Azure PowerShell is more native to the Microsoft cloud stack, I decided to use PowerShell to automate this effort.

If you are starting fresh with this, I would recommend reading this post first.

Pre-Requisites
1. PowerShell 5.0 or above
2. The Azure PowerShell module
3. The Azure storage account name and key, for access through PowerShell

We start by creating a context; this is essentially establishing a connection to the Azure storage account.

$StorageAccountName = "dummyaccount"
$StorageAccountKey = "xxxx"
$ContainerName = "containername"
$BlobName = "fact"
# Create the storage context that the later cmdlets will use ($Ctx is referenced below)
$Ctx = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $StorageAccountKey

The next cmdlet, Get-AzureStorageContainer, gets a connection object for the container. You can have many containers inside a storage account; we connect to a single container using the command below.

$blobObject = Get-AzureStorageContainer -Context $Ctx -Name $ContainerName

Now that we are connected to the container, we can query it for all the blob files as below. The ListBlobs method takes a blob name prefix (this can be a directory, a path to a directory, or the prefix of a blob name), a boolean flag that selects a flat listing rather than a hierarchical one, and a listing-details value. With the flag set to $true, it returns every blob under the given prefix in a single flat list, including blobs inside virtual subdirectories, and passing "Snapshots" makes the listing include blob snapshots as well.

For more information, refer to this page

$ListOfBlobs = $blobObject.CloudBlobContainer.ListBlobs($BlobName, $true, "Snapshots")

To delete the snapshots found by the above command, we loop over the result set and, for every blob that is actually a snapshot, call its Delete method.

foreach ($CloudBlockBlob in $ListOfBlobs)
{
  if ($CloudBlockBlob.IsSnapshot) {
    Write-Host "Deleting $($CloudBlockBlob.Name), Snapshot was created on $($CloudBlockBlob.SnapshotTime)"
    $CloudBlockBlob.Delete() 
  }
}

This is how you delete all the snapshots under a specific blob folder, or the snapshots of a single blob directly.