
Spark Journal: Using the DROP API on dataframes

Until now, I was in the habit of using the SELECT API on my dataframes for most of my work. However, after a recent code review with our architects, I learned about the DROP API (the drop command on dataframes), which can be surprisingly easy to use and reduces the number of lines of code we need to write for certain use cases.

Don’t confuse this with the SQL DROP command, which deletes objects such as tables. The dataframe drop is a completely different command.

What does the DROP command do?
It simply removes the columns named in its parameters from the dataframe it is applied to, and returns the result as a new dataframe.
Suppose you have 3 columns in your dataframe and you want to show only 1 of them. You can do this by calling drop and passing the unwanted columns as parameters.
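
As a minimal sketch of that behaviour (assuming a local SparkSession; the names spark, df, onlyId and unchanged are made up for illustration), drop returns a new dataframe and leaves the original untouched, and a column name that is not in the schema is simply ignored:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("drop-basics").getOrCreate()
import spark.implicits._

// Illustrative data with 3 columns
val df = List((1, "abc", 99), (2, "def", 98)).toDF("id", "Name", "Marks")

// Keep only "id" by dropping the two unwanted columns
val onlyId = df.drop("Name", "Marks")

// A name that is not in the schema is a no-op, not an error
val unchanged = df.drop("NoSuchColumn")

println(onlyId.columns.mkString(", "))     // id
println(df.columns.mkString(", "))         // id, Name, Marks (original is untouched)
println(unchanged.columns.mkString(", "))  // id, Name, Marks
```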


What are the use cases where we can use it?
1. We need to eliminate any number of columns from a dataframe
2. We want to replace a column with a column of the same name but different values

How do we use it?

Eliminating a single column from a dataframe.

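// Assumes spark-shell; in an application, import spark.implicits._ first (needed for toDF)
val dataList1 = List((1,"abc"),(2,"def"),(2,"def"),(2,"def"),(2,"def"),(2,"def"),(2,"def"))
val df1 = dataList1.toDF("id","Name")

// drop returns a new dataframe without the "Name" column
df1.drop("Name").show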

+---+
| id|
+---+
|  1|
|  2|
|  2|
|  2|
|  2|
|  2|
|  2|
+---+

Eliminating multiple columns from a dataframe.

val dataList1 = List((1,"abc",99),(2,"def",99),(2,"def",99),(2,"def",99),(2,"def",99),(2,"def",99),(2,"def",99))
val df1 = dataList1.toDF("id","Name","Marks")

// drop accepts any number of column names
df1.drop("Name","Marks").show

+---+
| id|
+---+
|  1|
|  2|
|  2|
|  2|
|  2|
|  2|
|  2|
+---+

Replacing a column's data completely.

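import org.apache.spark.sql.functions.lit  // provides lit

val dataList1 = List((1,"abc",99),(2,"def",99),(2,"def",99),(2,"def",99),(2,"def",99))
val df1 = dataList1.toDF("id","Name","Marks")

// Drop "Name" (and "Marks"), then add "Name" back with the constant value "abc" in every row
df1.drop("Name","Marks").withColumn("Name",lit("abc")).show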

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| abc|
|  2| abc|
|  2| abc|
|  2| abc|
+---+----+
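
A side note on the replacement example, offered as a hedged sketch rather than a recommendation: withColumn on its own also overwrites a column that already exists, so the drop step mostly matters when other columns need to go in the same chain, or when the column order matters. The names dropped and inPlace are illustrative, and a local SparkSession is assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("drop-replace").getOrCreate()
import spark.implicits._

val df1 = List((1, "abc", 99), (2, "def", 99)).toDF("id", "Name", "Marks")

// drop + withColumn: "Name" is removed and re-added, so it moves to the end
val dropped = df1.drop("Name").withColumn("Name", lit("abc"))

// withColumn alone: "Name" is overwritten where it already sits
val inPlace = df1.withColumn("Name", lit("abc"))

println(dropped.columns.mkString(", "))  // id, Marks, Name
println(inPlace.columns.mkString(", "))  // id, Name, Marks
```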

Please note: you can do all of these operations with the SELECT API as well, by selecting only the columns that you need in the dataframe. But consider a scenario where you have 100 columns and just want to eliminate 1 of them. This is where you can take advantage of the DROP command.
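
To make the comparison concrete, here is a small sketch under stated assumptions: a local SparkSession, and a made-up 4-column dataframe called wide standing in for the 100-column case. With select, every column to keep has to be listed; with drop, only the unwanted one is named.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("drop-vs-select").getOrCreate()
import spark.implicits._

// Hypothetical "wide" dataframe; imagine 100 columns instead of 4
val wide = List((1, "abc", 99, "A"), (2, "def", 98, "B")).toDF("id", "Name", "Marks", "Grade")

// select: list every column you want to keep
val viaSelect = wide.select("id", "Name", "Grade")

// drop: name only the column you want to remove
val viaDrop = wide.drop("Marks")

// Both approaches end up with the same columns in the same order
println(viaSelect.columns.mkString(", "))
println(viaDrop.columns.mkString(", "))
```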
