spark

Spark Journal : Using UNION with SELECT API on dataframes

You will easily come across this use case, where you need to merge 2 separate Dataframes at one go. This is a very peculiar use case, when working with data and there are multiple ways of doing so.


Some crucial points to remember when using Spark UNION
1. Spark has no UNION ALL, it only has a UNION command
2. Spark UNION does not deduplicate the data
3. Spark does not care about the data type of columns when merging
4. Spark does not care about the sequence of columns when merging

The most easiest way of doing this, if you were from a SQL background was to use a UNION SQL command, which would merge data in both the dataframes using a SQL SELECT command itself.
However, here we are focusing on getting this same task done by UNION API on dataframes.

So, you are trying to merge 2 dataframes.
1. When the schema is exactly same

val dataList1 = List((1,"abc"),(2,"def"))
val df1 = dataList1.toDF("id","Name")

val dataList2 = List((3,"efg"),(4,"hij"))
val df2 = dataList2.toDF("id","Name")

val df3 = df1.union(df2).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
|  3| efg|
|  4| hij|
+---+----+

2. When the number of columns is same but datatype is different
Here the 2nd column in dataframe 1 is String, but second dataframe is Int.

val dataList1 = List((1,"abc"),(2,"def"))
val df1 = dataList1.toDF("id","Name")

val dataList2 = List((3,10),(4,11))
val df2 = dataList2.toDF("id","Name")

val df3 = df1.union(df2).show

+---+----+
| id|Name|
+---+----+
|  1| abc|
|  2| def|
|  3|  10|
|  4|  11|
+---+----+

3. When the number of columns is not same
Here the union fails as expected.

val dataList1 = List((1,"abc"),(2,"def"))
val df1 = dataList1.toDF("id","Name")

val dataList2 = List((3,"ghi","home"),(4,"jkl","ctrl"))
val df2 = dataList2.toDF("id","Name")

val df3 = df1.union(df2).show

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.

4. When you want to perform UNION based on values in list
This will be very useful, when you don’t know how many unions you want to have and it needs to be derived dynamically
Here we iterate over a list and then union based on elements in list.

case class emptyDfSchema (NameType:String)
val idTypeList  = List(List("Name"),List("Middle Name"),List("SurName"))
var df3 = Seq.empty[emptyDfSchema].toDF


idTypeList.foreach {
    element => {
        df3 = df3.union(element.toDF)
    }
}

df3.show

+-----------+
|   NameType|
+-----------+
|       Name|
|Middle Name|
|    SurName|
+-----------+

So, these are the different ways you can use the UNION API on dataframes directly, you can also use this command with conjunction on SELECT API.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s