spark

Spark Journal : Using select api on dataframe

When working with spark dataframes, you will find many instances where in you have to use SELECT statements over Dataframes. Beware, this is not the SQL Select statements over the dataframe, but using the Spark API on the dataframe object directly.
I know many people will prefer using a SELECT statements (SQL) directly over a dataframe, which is even supported by Spark, but I started doing the same using the SPARK API on dataframe objects.
If you want to know more about all the supported API with dataframe objects, please refer to this official documentation.

So, after spending almost a day and trying out different combinations, I found that there are multiple ways of doing a SELECT of columns from dataframe using a SELECT API on dataframe object.
Somehow, I don’t feel, its really documented well with examples, but maybe, its just me facing this issue as a beginner.

Lets say we build a dataframe like below for usage.

val dataList = List((1,"abc"),(2,"def"))
val df = dataList.toDF("id","Name")
df.show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 1 : Using quoted column names

df.select("id", "Name").show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 2 : Using $ with Quoted column names
This approach can be used further to drive much more transformations of column, will try to cover ahead.

df.select($"id", $"Name").show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 3 : Using col keyword along with quoted column names.
Again, using col key word allows you to have much more transformations ahead with ease.

df.select(col("id"), col("Name")).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

Approach 4 : Using a List[String] which has all column names
This approach is very much popular, when you are dealing with a set of standard column or huge number of columns and don’t want to keep on writing the SELECT with all column names repetitively.
We can define a List[String] with the column names and the order of columns we want to view the result in and then use, map method on dataframe, the map method will iterate over a list of column names (String) and dynamically add column names in your select statement.

var colList = List("id","Name")
df.select(colList.map(col): _*).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
+---+----+

There are a few more of the approaches, I will try to detail them as, I learn them in depth.
I will cover the next article, which will be more of an extension to this and cover more api that get used very frequently.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s