Spark Journal : Change the data type of columns in a DataFrame.

Hey there! Recently we found a functional issue in our Parquet files, where the data type of some columns was not accurate. The data type was supposed to be DECIMAL with a given precision and scale, but instead it was a mix of String and Double.
The challenge at hand was to make sure the columns had the accurate data type, as needed.

How to make sure we assign the right data type to a column in a DataFrame
We will take the example of converting an integer column to DECIMAL here.

Approach 1
If going with the Spark SQL option, use CAST to convert to the required data type. The resulting DataFrame reflects the correct data type:

val df = spark.sql("""SELECT  CAST(1 as DECIMAL(14,4)) AS id union SELECT  CAST(2 as DECIMAL(14,4)) AS id""")

df: org.apache.spark.sql.DataFrame = [id: decimal(14,4)]
 |-- id: decimal(14,4) (nullable = false)

+------+
|    id|
+------+
|1.0000|
|2.0000|
+------+
Approach 2

If going with the Spark DataFrame API, we can use the cast method directly, as below.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
val dataList1 = List((1,"abc",99),(2,"def",99),(2,"def",99),(2,"def",99),(2,"def",99))
val df1 = dataList1.toDF("id","Name","Marks").select(col("Marks").cast(DecimalType(14,4)).alias("Marks_changed"), col("Marks"))

import org.apache.spark.sql.types.DecimalType
dataList1: List[(Int, String, Int)] = List((1,abc,99), (2,def,99), (2,def,99), (2,def,99), (2,def,99))
df1: org.apache.spark.sql.DataFrame = [Marks_changed: decimal(14,4), Marks: int]
 |-- Marks_changed: decimal(14,4) (nullable = false)
 |-- Marks: integer (nullable = false)

+-------------+-----+
|Marks_changed|Marks|
+-------------+-----+
|      99.0000|   99|
|      99.0000|   99|
|      99.0000|   99|
|      99.0000|   99|
|      99.0000|   99|
+-------------+-----+
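A closely related variant (my own sketch, not from the original post) is withColumn, which replaces the column in place instead of selecting a new one, so the rest of the DataFrame's columns stay untouched. The SparkSession setup below is only there to make the snippet self-contained:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder.master("local[1]").appName("cast-demo").getOrCreate()
import spark.implicits._

val df = List((1, "abc", 99), (2, "def", 99)).toDF("id", "Name", "Marks")

// Overwrite "Marks" with its DECIMAL(14,4) equivalent; "id" and "Name" are untouched.
val df2 = df.withColumn("Marks", $"Marks".cast(DecimalType(14, 4)))

df2.printSchema()
```

This is handy when the DataFrame has many columns and you only need to fix one, since you avoid listing every other column in a select.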

For the other data types available with the Spark API, refer to this link.


Scala Journal : Using tuples effectively

Hey Folks, back after a good amount of time, and no… I was not on vacation, but working on a very critical deliverable.

So what did I do the last one and a half months? I learned more Scala and Spark… it's getting more interesting day by day, and I have already learnt a lot of stuff that can be shared through my posts here… so stay tuned… there is good information coming along.
