Working with complex dataset types
In the real world, we often have to deal with data that doesn't fit a standard tabular format of one value per column in each record. We saw a little of this previously with the cast and director columns of our Netflix titles CSV file, but what happens when we run into more complex structures?
In this section, we'll show you how to manage nested data in semi-structured formats such as XML and JSON. Consider the following code:

val dfDevicesJson = spark.read
  .json("src/main/scala/com/packt/dewithscala/chapter6/data/devices.json")
dfDevicesJson.printSchema()

root
 |-- country: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- event_ts: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- id: long (nullable = true)
 |-- line: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- observations: array (nullable = true)
 |    |-- element...
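
The observations column is an array, which is exactly the kind of nested structure this section is about. As a minimal sketch of how such a column is typically unpacked (assuming nothing about the element type beyond what the truncated schema shows), Spark's explode function from org.apache.spark.sql.functions turns each array element into its own row:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical standalone setup; in the book's project the
// SparkSession is already available as spark.
val spark = SparkSession.builder()
  .appName("nested-json-sketch")
  .master("local[*]")
  .getOrCreate()

val dfDevicesJson = spark.read
  .json("src/main/scala/com/packt/dewithscala/chapter6/data/devices.json")

// explode() emits one output row per array element, paired with
// the other columns of the row it came from, so each observation
// keeps its device context.
val dfObservations = dfDevicesJson
  .select(col("device_id"), explode(col("observations")).as("observation"))

dfObservations.printSchema()
dfObservations.show(5, truncate = false)

If the array elements turn out to be structs, their fields can then be addressed with dot notation, for example col("observation.someField"), where someField stands in for whichever field names the real schema contains.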