Spark Journal: Using the Scala IF Construct with the DataFrame API

So far, we have used the DataFrame API on its own when working with DataFrames. But what if a requirement calls for checking a condition first and then choosing which API call to make?

Yes, that's what this post focuses on.
We are going to use the Scala IF construct to derive a value, which is then used directly in a DataFrame API call.
For some it may be a piece of cake, but believe me, I spent a considerable amount of time researching this.

Problem Statement: We want to add a new column to a DataFrame dynamically. If the flag for getting this column's data is true, show -1; otherwise show -2.

Approach 1
Assign the result of the IF to a variable, then use that variable directly in the API call.

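// Explicit imports; spark-shell already provides these, and `spark` is the shell's SparkSession.
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val getColorCode = true
val dataList1 = List((1,"red"),(2,"blue"))
val df1 = dataList1.toDF("id","Name")

// IF is an expression in Scala, so its result can be assigned to a val.
val colorcd = if (getColorCode) {
     -1
} else {
     -2
}
val df = df1.select(col("id"), col("Name"), lit(colorcd).as("colorcd"))
df.show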

+---+----+-------+
| id|Name|colorcd|
+---+----+-------+
|  1| red|     -1|
|  2|blue|     -1|
+---+----+-------+

Approach 2
Instead of using a variable, we put the IF expression directly inside the DataFrame API call. Yes, as I said, this is possible.

val getColorCode = true
val dataList1 = List((1,"red"),(2,"blue"))
val df1 = dataList1.toDF("id","Name")

// The IF expression is the argument to lit.
val df = df1.select(col("id"), col("Name"),
  lit(if (getColorCode) {
    -1
  } else {
    -2
  }).as("colorcd"))

df.show

+---+----+-------+
| id|Name|colorcd|
+---+----+-------+
|  1| red|     -1|
|  2|blue|     -1|
+---+----+-------+

Approach 3
This is more of an improvement on Approach 2. Here we move the lit function inside the IF branches. The advantage is that each branch can return any column expression, not just a literal.
Below, we have intentionally set the flag to false, and in the ELSE clause we use col("id") as a different transformation instead of the lit we have been using so far.
The resulting DataFrame shows 3 columns, where the 1st and 3rd columns hold exactly the same values.

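val getColorCode = false
val dataList1 = List((1,"red"),(2,"blue"))
val df1 = dataList1.toDF("id","Name")

// The parentheses around the IF matter: without them, .as("colorcd")
// would apply only to the ELSE branch, and the TRUE branch would lose the alias.
val df = df1.select(col("id"), col("Name"),
  (if (getColorCode) {
    lit(-1)
  } else {
    col("id")
  }).as("colorcd"))

df.show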
+---+----+-------+
| id|Name|colorcd|
+---+----+-------+
|  1| red|      1|
|  2|blue|      2|
+---+----+-------+
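
A detail worth knowing when writing the IF inline like this: in Scala, a method call placed right after the closing brace of the ELSE block binds to the ELSE block only, not to the whole IF/ELSE. So an alias such as .as("colorcd") reaches both branches only when the whole IF is wrapped in parentheses. Below is a minimal plain-Scala sketch of that rule. It needs no Spark; the strings stand in for columns, and the object and method names are made up for illustration.

```scala
// Hypothetical names; strings stand in for Spark Column values.
object IfElsePrecedence {
  // .toUpperCase attaches to the ELSE block only.
  def unparenthesized(flag: Boolean): String =
    if (flag) { "lit" } else { "col" }.toUpperCase

  // Parentheses make .toUpperCase apply to whichever branch is chosen.
  def parenthesized(flag: Boolean): String =
    (if (flag) { "lit" } else { "col" }).toUpperCase

  def main(args: Array[String]): Unit = {
    println(unparenthesized(true))   // lit  (method call skipped)
    println(unparenthesized(false))  // COL
    println(parenthesized(true))     // LIT
    println(parenthesized(false))    // COL
  }
}
```

The same rule applies to Column methods such as as, so it is safest to parenthesize the IF whenever something is chained after it.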

I know there are 1000 ways of doing this kind of thing in Spark, but as I am still learning it, this was something new I discovered and thought worth sharing with beginners.
Please note: I am not sure whether this will really perform better with bigger DataFrames. Stay tuned for some performance numbers on these approaches.
