Apache Spark 2: Data Processing and Real-Time Analytics

3 (1 reviews total)
By Romeo Kienzler , Md. Rezaul Karim , Sridhar Alla and 4 more
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. A First Taste and What's New in Apache Spark V2

About this book

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform.

You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools.

By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.

This Learning Path includes content from the following Packt products:

  • Mastering Apache Spark 2.x by Romeo Kienzler
  • Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
  • Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen MeiCookbook
Publication date:
December 2018


Chapter 1. A First Taste and What's New in Apache Spark V2

Apache Spark is a distributed and highly scalable in-memory data analytics system, providing you with the ability to develop applications in Java, Scala, and Python, as well as languages such as R. It has one of the highest contribution/involvement rates among the Apache top-level projects at this time. Apache systems, such as Mahout, now use it as a processing engine instead of MapReduce. It is also possible to use a Hive context to have the Spark applications process data directly to and from Apache Hive.

Initially, Apache Spark provided four main submodules--SQL, MLlib, GraphX, and Streaming. They will all be explained in their own chapters, but a simple overview would be useful here. The modules are interoperable, so data can be passed between them. For instance, streamed data can be passed to SQL and a temporary table can be created. Since version 1.6.0, MLlib has a sibling called SparkML with a different API, which we will cover in later chapters.

The following figure explains how this book will address Apache Spark and its modules:

The top two rows show Apache Spark and its submodules. Wherever possible, we will try to illustrate by giving an example of how the functionality may be extended using extra tools.


We infer that Spark is an in-memory processing system. When used at scale (it cannot exist alone), the data must reside somewhere. It will probably be used along with the Hadoop toolset and the associated ecosystem.

Luckily, Hadoop stack providers, such as IBM and Hortonworks, provide you with an open data platform, a Hadoop stack, and cluster manager, which integrates with Apache Spark, Hadoop, and most of the current stable toolset fully based on open source.

During this book, we will use the Hortonworks Data Platform (HDP®) Sandbox 2.6.

You can use an alternative configuration, but we find that the open data platform provides most of the tools that we need and automates the configuration, leaving us more time for development.

In the following sections, we will cover each of the components mentioned earlier in more detail before we dive into the material starting in the next chapter:

  • Spark Machine Learning
  • Spark Streaming
  • Spark SQL
  • Spark Graph Processing
  • Extended Ecosystem
  • Updates in Apache Spark
  • Cluster design
  • Cloud-based deployments
  • Performance parameters

Spark machine learning

Machine learning is the real reason for Apache Spark because, at the end of the day, you don't want to just ship and transform data from A to B (a process called ETL (Extract Transform Load)). You want to run advanced data analysis algorithms on top of your data, and you want to run these algorithms at scale. This is where Apache Spark kicks in.

Apache Spark, in its core, provides the runtime for massive parallel data processing, and different parallel machine learning libraries are running on top of it. This is because there is an abundance of machine learning algorithms for popular programming languages like R and Python but they are not scalable. As soon as you load more data to the available main memory of the system, they crash.

Apache Spark, in contrast, can make use of multiple computer nodes to form a cluster and even on a single node can spill data transparently to disk, therefore, avoiding the main memory bottleneck. Two interesting machine learning libraries are shipped with Apache Spark, but in this work, we'll also cover third-party machine learning libraries.

The Spark MLlib module, Classical MLlib, offers a growing but incomplete list of machine learning algorithms. Since the introduction of the DataFrame-based machine learning API called SparkML, the destiny of MLlib is clear. It is only kept for backward compatibility reasons.

In SparkML, we have a machine learning library in place that can take advantage of these improvements out of the box, using it as an underlying layer.


SparkML will eventually replace MLlib. Apache SystemML introduces the first library running on top of Apache Spark that is not shipped with the Apache Spark distribution. SystemML provides you with an execution environment of R-style syntax with a built-in cost-based optimizer. Massive parallel machine learning is an area of constant change at a high frequency. It is hard to say where that the journey goes, but it is the first time where advanced machine learning at scale is available to everyone using open source and cloud computing.

Deep learning on Apache Spark uses H20, Deeplearning4j, and Apache SystemML, which are other examples of very interesting third-party machine learning libraries that are not shipped with the Apache Spark distribution.

While H20 is somehow complementary to MLlib, Deeplearning4j only focuses on deep learning algorithms. Both use Apache Spark as a means for parallelization of data processing. You might wonder why we want to tackle different machine learning libraries.


The reality is that every library has advantages and disadvantages with the implementation of different algorithms. Therefore, it often depends on your data and Dataset size which implementation you choose for best performance.

However, it is nice that there is so much choice and you are not locked in a single library when using Apache Spark. Open source means openness, and this is just one example of how we are all benefiting from this approach in contrast to a single vendor, single product lock-in. Although recently Apache Spark integrated GraphX, another Apache Spark library into its distribution, we don't expect this will happen too soon. Therefore, it is most likely that Apache Spark as a central data processing platform and additional third-party libraries will co-exist, like Apache Spark being the big data operating system and the third-party libraries are the software you install and run on top of it.


Spark Streaming

Stream processing is another big and popular topic for Apache Spark. It involves the processing of data in Spark as streams and covers topics such as input and output operations, transformations, persistence, and checkpointing, among others.

Apache Spark Streamingwill cover the area of processing, and we will also see practical examples of different types of stream processing. This discusses batch and window stream configuration and provides a practical example of checkpointing. It alsocovers different examples of stream processing, including Kafka and Flume.

There are many ways in which stream data can be used. Other Spark module functionality (for example, SQL, MLlib, and GraphX) can be used to process the stream. You can use Spark Streaming with systems such as MQTT or ZeroMQ. You can even create custom receivers for your own user-defined data sources.


Spark SQL

From Spark version 1.3, data frames have been introduced in Apache Spark, so that Spark data can be processed in a tabular form and tabular functions (such as select, filter, and groupBy) can be used to process data. The Spark SQL module integrates with Parquet and JSON formats, to allow data to be stored in formats, that better represent the data. This also offers more options to integrate with external systems.

The idea of integrating Apache Spark into the Hadoop Hive big data database can also be introduced. Hive context-based Spark applications can be used to manipulate Hive-based table data. This brings Spark's fast in-memory distributed processing to Hive's big data storage capabilities. It effectively lets Hive use Spark as a processing engine.

Additionally, there is an abundance of additional connectors to access NoSQL databases outside the Hadoop ecosystem directly from Apache Spark.


Spark graph processing

Graph processing is another very important topic when it comes to data analysis. In fact, a majority of problems can be expressed as a graph.

A graph is basically, a network of items and their relationships to each other. Items are called nodes and relationships are called edges. Relationships can be directed or undirected. Relationships, as well as items, can have properties. So a map, for example, can be represented as a graph as well. Each city is a node and the streets between the cities are edges. The distance between the cities can be assigned as properties on the edge.

The Apache Spark GraphX module allows Apache Spark to offer fast big data in-memory graph processing. This allows you to run graph algorithms at scale.

One of the most famous algorithms, for example, is the traveling salesman problem. Consider the graph representation of the map mentioned earlier. A salesman has to visit all cities of a region but wants to minimize the distance that he has to travel. As the distances between all the nodes are stored on the edges, a graph algorithm can actually tell you the optimal route. GraphX is able to create, manipulate, and analyze graphs using a variety of built-in algorithms.

It introduces two new data types to support graph processing in Spark--VertexRDD and EdgeRDD--to represent graph nodes and edges. It also introduces graph processing algorithms, such as PageRank and triangle processing.


Extended ecosystem

When examining big data processing systems, we think it is important to look at, not just the system itself, but also how it can be extended and how it integrates with external systems so that greater levels of functionality can be offered. In a book of this size, we cannot cover every option, but by introducing a topic, we can hopefully stimulate the reader's interest so that they can investigate further.


What's new in Apache Spark V2?

Since Apache Spark V2, many things have changed. This doesn't mean that the API has been broken. In contrast, most of the V1.6 Apache Spark applications will run on Apache Spark V2 with or without very little changes, but under the hood, there have been a lot of changes.

Although the Java Virtual Machine (JVM) is a masterpiece on its own, it is a general-purpose bytecode execution engine. Therefore, there is a lot of JVM object management and garbage collection (GC) overhead. So, for example, to store a 4-byte string, 48 bytes on the JVM are needed. The GC optimizes on object lifetime estimation, but Apache Spark often knows this better than JVM. Therefore, Tungsten disables the JVM GC for a subset of privately managed data structures to make them L1/L2/L3 Cache-friendly.

In addition, code generation removed the boxing of primitive types polymorphic function dispatching. Finally, a new first-class citizen called Dataset unified the RDD and DataFrame APIs. Datasets are statically typed and avoid runtime type errors. Therefore, Datasets can be used only with Java and Scala. This means that Python and R users still have to stick to DataFrames, which are kept in Apache Spark V2 for backward compatibility reasons.


Cluster design

As we have already mentioned, Apache Spark is a distributed, in-memory, parallel processing system, which needs an associated storage system. So, when you build a big data cluster, you will probably use a distributed storage system such as Hadoop, as well as tools to move data such as Sqoop, Flume, and Kafka.

We wanted to introduce the idea of edge nodes in a big data cluster. These nodes in the cluster will be client-facing, on which reside the client-facing components such as Hadoop NameNode or perhaps the Spark master. Majority of the big data cluster might be behind a firewall. The edge nodes would then reduce the complexity caused by the firewall as they would be the only points of contact accessible from outside. The following figure shows a simplified big data cluster:

It shows five simplified cluster nodes with executor JVMs, one per CPU core, and the Spark Driver JVM sitting outside the cluster. In addition, you see the disk directly attached to the nodes. This is called the JBOD (just a bunch of disks) approach. Very large files are partitioned over the disks and a virtual filesystem such as HDFS makes these chunks available as one large virtual file. This is, of course, stylized and simplified, but you get the idea.

The following simplified component model shows the driver JVM sitting outside the cluster. It talks to the Cluster Manager in order to obtain permission to schedule tasks on the worker nodes, because the Cluster Manager keeps track of resource allocation of all processes running on the cluster.

As we will see later, there is a variety of different cluster managers, some of them also capable of managing other Hadoop workloads or even non-Hadoop applications in parallel to the Spark Executors. Note that the Executor and Driver have bidirectional communication all the time, so network-wise, they should also be sitting close together:

Figure source: https://spark.apache.org/docs/2.0.2/cluster-overview.html

Generally, firewalls, while adding security to the cluster, also increase the complexity. Ports between system components need to be opened up so that they can talk to each other. For instance, Zookeeper is used by many components for configuration. Apache Kafka, the publish/subscribe messaging system, uses Zookeeper to configure its topics, groups, consumers, and producers. So, client ports to Zookeeper, potentially across the firewall, need to be open.

Finally, the allocation of systems to cluster nodes needs to be considered. For instance, if Apache Spark uses Flume or Kafka, then in-memory channels will be used. The size of these channels, and the memory used, caused by the data flow, need to be considered. Apache Spark should not be competing with other Apache components for memory usage. Depending on your data flows and memory usage, it might be necessary to have Spark, Hadoop, Zookeeper, Flume, and other tools on distinct cluster nodes. Alternatively, resource managers such as YARN, Mesos, or Docker can be used to tackle this problem. In standard Hadoop environments, YARN is most likely.

Generally, the edge nodes that act as cluster NameNode servers or Spark master servers will need greater resources than the cluster processing nodes within the firewall. When many Hadoop ecosystem components are deployed on the cluster, all of them will need extra memory on the master server. You should monitor edge nodes for resource usage and adjust in terms of resources and/or application location as necessary. YARN, for instance, is taking care of this.

This section has briefly set the scene, for the big data cluster in terms of Apache Spark, Hadoop, and other tools. However, how might the Apache Spark cluster itself, within the big data cluster, be configured? For instance, it is possible to have many types of Spark cluster manager. The next section will examine this and describe each type of the Apache Spark cluster manager.


Cluster management

The Spark context, as you will see in many of the examples in this book, can be defined via a Spark configuration object and Spark URL. The Spark context connects to the Spark cluster manager, which then allocates resources across the worker nodes for the application. The cluster manager allocates executors across the cluster worker nodes. It copies the application JAR file to the workers and finally allocates tasks.

The following subsections describe the possible Apache Spark cluster manager options available at this time.


By specifying a Spark configuration local URL, it is possible to have the application run locally. By specifying local[n], it is possible to have Spark use n threads to run the application locally. This is a useful development and test option because you can also test some sort of parallelization scenarios but keep all log files on a single machine.


Standalone mode uses a basic cluster manager that is supplied with Apache Spark. The spark master URL will be as follows:


Here,<hostname>is the name of the host on which the Spark master is running. We have specified7077as the port, which is the default value, but this is configurable. This simple cluster manager currently supports only FIFO (first-in-first-out) scheduling. You can contrive to allow concurrent application scheduling by setting the resource configuration options for each application; for instance, usingspark.core.maxto share cores between applications.

Apache YARN

At a larger scale, when integrating with Hadoop YARN, the Apache Spark cluster manager can be YARN and the application can run in one of two modes. If the Spark master value is set as yarn-cluster, then the application can be submitted to the cluster and then terminated. The cluster will take care of allocating resources and running tasks. However, if the application master is submitted as yarn-client, then the application stays alive during the life cycle of processing, and requests resources from YARN.

Apache Mesos

Apache Mesos is an open source system, for resource sharing across a cluster. It allows multiple frameworks, to share a cluster by managing and scheduling resources. It is a cluster manager, that provides isolation using Linux containers and allowing multiple systems such as Hadoop, Spark, Kafka, Storm, and more to share a cluster safely. It is highly scalable to thousands of nodes. It is a master/slave-based system and is fault tolerant, using Zookeeper for configuration management.

For a single master node Mesos cluster, the Spark master URL will be in this form:


Here, <hostname> is the hostname of the Mesos master server; the port is defined as 5050, which is the default Mesos master port (this is configurable). If there are multiple Mesos master servers in a large-scale high availability Mesos cluster, then the Spark master URL would look as follows:


So, the election of the Mesos master server will be controlled by Zookeeper. The <hostname> will be the name of a host in the Zookeeper quorum. Also, the port number, 2181, is the default master port for Zookeeper.


Cloud-based deployments

There are three different abstraction levels of cloud systems--Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). We will see how to use and install Apache Spark on all of these.

The new way to do IaaS is Docker and Kubernetes as opposed to virtual machines, basically providing a way to automatically set up an Apache Spark cluster within minutes. The advantage of Kubernetes is that it can be used among multiple different cloud providers as it is an open standard and also based on open source.

You even can use Kubernetes, in a local data center and transparently and dynamically move workloads between local, dedicated, and public cloud data centers. PaaS, in contrast, takes away from you the burden of installing and operating an Apache Spark cluster because this is provided as a service.

There is an ongoing discussion, whether Docker is IaaS or PaaS but, in our opinion, this is just a form of a lightweight preinstalled virtual machine.This is particularly interesting because the offering is completely based on open source technologies, which enables you to replicate the system on any other data center.

One of the open source components, we'll introduce is Jupyter notebooks; a modern way to do data science, in a cloud-based collaborative environment.



Before moving on to the rest of the chapters, covering functional areas of Apache Spark and extensions, we will examine the area of performance. What issues and areas need to be considered? What might impact the Spark application performance, starting at the cluster level and finishing with actual Scala code? We don't want to just repeat, what the Spark website says, so take a look at this URL:http://spark.apache.org/docs/<version>/tuning.html.

Here, <version> relates to the version of Spark that you are using; that is, either the latest or something like 1.6.1 for a specific version. So, having looked at this page, we will briefly mention some of the topic areas. We will list some general points in this section without implying an order of importance.

The cluster structure

The size and structure of your big data cluster are going to affect performance. If you have a cloud-based cluster, your IO and latency will suffer, in comparison to an unshared hardware cluster. You will be sharing the underlying hardware, with multiple customers and the cluster hardware may be remote.There are some exceptions to this. The IBM cloud, for instance, offers dedicated bare metal high-performance cluster nodes, with an InfiniBand network connection, which can be rented on an hourly basis.

Additionally, the positioning of cluster components on servers may cause resource contention. For instance, think carefully about locating Hadoop NameNodes, Spark servers, Zookeeper, Flume, and Kafka servers in large clusters. With high workloads, you might consider segregating servers to individual systems. You might also consider using an Apache system such as Mesos thatprovides better distributions and assignment of resources to the individual processes.

Consider potential parallelism as well. The greater the number of workers in your Spark cluster for large Datasets, the greater the opportunity for parallelism. One rule of thumb is one worker per hyper-thread or virtual core respectively.

Hadoop Distributed File System

You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, IBM has the GPFS (General Purpose File System) for improved performance.

The reason why GPFS might be a better choice is that coming from the high-performance computing background, this filesystem has a full read-write capability, whereas HDFS is designed as a write once, read many filesystems. It offers an improvement in performance over HDFS because it runs at the kernel level as opposed to HDFS, which runs in a Java Virtual Machine (JVM) that in turn runs as an operating system process. It also integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.

Another commercial alternative is the MapR file system that, besides performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative to a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives like HDFS. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), Quantcast, GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store and is tightly integrated with Apache Spark and is therefore traded as a valid and powerful alternative to HDFS--or even to any other distributed filesystem--especially as it supports predicate push-down using ApacheSparkSQL and the Catalyst optimizer, which we will cover in the following chapters.

Data locality

The key for good data processing performance is avoidance of network transfers. This was very true a couple of years ago, but is less relevant for tasks with high demands on CPU and low I/O, but for low demand on CPU and high I/O demand data processing algorithms, this still holds.


We can conclude from this, that HDFS is one of the best ways to achieve data locality, as chunks of files are distributed on the cluster nodes, in most of the cases, using hard drives directly attached to the server systems. This means that those chunks can be processed in parallel using the CPUs on the machines where individual data chunks are located in order to avoid network transfer.

Another way to achieve data locality is using ApacheSparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. So, for example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before data is sent upstream to Apache Spark.


In order to avoid OOM (Out of Memory) messages for the tasks on your Apache Spark cluster, please consider a number of questions for the tuning:

  • Consider the level of physical memory available on your Spark worker nodes. Can it be increased? Check on the memory consumption of operating system processes during high workloads in order to get an idea of free memory. Make sure that the workers have enough memory.
  • Consider data partitioning. Can you increase the number of partitions? As a rule of thumb, you should have at least as many partitions as you have available CPU cores on the cluster. Use the repartition function on the RDD API.
  • Can you modify the storage fraction and the memory used by the JVM for storage and caching of RDDs? Workers are competing for memory against data storage. Use the Storage page on the Apache Spark user interface to see if this fraction is set to an optimal value. Then update the following properties:
  • spark.memory.fraction
  • spark.memory.storageFraction
  • spark.memory.offHeap.enabled=true
  • spark.memory.offHeap.size

In addition, the following two things can be done in order to improve performance:

  • Consider using Parquet as a storage format, which is much more storage effective than CSV or JSON
  • Consider using the DataFrame/Dataset API instead of the RDD API as it might resolve in more effective executions (more about this in the next three chapters)


Try to tune your code, to improve the Spark application performance. For instance, filter your application-based data early in your ETL cycle.One example is, when using raw HTML files, detag them and crop away unneeded parts at an early stage.Tune your degree of parallelism, try to find the resource-expensive parts of your code, and find alternatives.


ETL is one of the first things you are doing in an analytics project. So you are grabbing data, from third-party systems, either by directly accessing relational or NoSQL databases or by reading exports in various file formats such as, CSV, TSV, JSON or even more exotic ones from local or remote filesystems or from a staging area in HDFS: after some inspections and sanity checks on the files an ETL process in Apache Spark basically reads in the files and creates RDDs or DataFrames/Datasets out of them.

They are transformed, so that they fit the downstream analytics application, running on top of Apache Spark or other applications and then stored back into filesystems as either JSON, CSV or PARQUET files, or even back to relational or NoSQL databases.


Finally, I can recommend the following resource for any performance-related problems with Apache Spark: https://spark.apache.org/docs/latest/tuning.html.



Although parts of this book will concentrate on examples of Apache Spark installed on physically server-based clusters, we want to make a point, that there are multiple cloud-based options out there that imply many benefits. There are cloud-based systems, that use Apache Spark as an integrated component and cloud-based systems that offer Spark as a service. 



In closing this chapter, we invite you to work your way, through each of the Scala code-based examples in the following chapters. The rate at which Apache Spark has evolved is impressive, and important to note is the frequency of the releases. So even though, at the time of writing, Spark has reached 2.2, we are sure that you will be using a later version.

If you encounter problems, report them at www.stackoverflow.com and tag them accordingly; you'll receive feedback within minutes--the user community is very active. Another way of getting information and help is subscribing to the Apache Spark mailing list: [email protected].

By the end of this chapter, you should have a good idea what's waiting for you in this book. We've dedicated our effort to showing you practical examples that are, on the one hand, practical recipes to solve day-to-day problems, but on the other hand, also support you in understanding the details of things taking place behind the scenes. This is very important for writing good data products and a key differentiation from others.

About the Authors

  • Romeo Kienzler

    Romeo Kienzler works as the chief data scientist in the IBM Watson IoT worldwide team, helping clients to apply advanced machine learning at scale on their IoT sensor data. He holds a Master's degree in computer science from the Swiss Federal Institute of Technology, Zurich, with a specialization in information systems, bioinformatics, and applied statistics.

    Browse publications by this author
  • Md. Rezaul Karim

    Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

    Browse publications by this author
  • Sridhar Alla

    Sridhar Alla is a big data expert helping companies solve complex problems in distributed computing, large scale data science and analytics practice. He holds a bachelor's in computer science from JNTU, India. He loves writing code in Python, Scala, and Java. He also has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.

    Browse publications by this author
  • Siamak Amirghodsi

    Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

    Browse publications by this author
  • Meenakshi Rajendran

    Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.

    Browse publications by this author
  • Broderick Hall

    Broderick Hall is a hands-on big data analytics expert and holds a master’s degree in computer science with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.

    Browse publications by this author
  • Shuen Mei

    Shuen Mei is a big data analytic platforms expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in the Apache Spark, Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with emphasis on peta-byte range real-time data platform systems.

    Browse publications by this author

Latest Reviews

(1 reviews total)
Haven't finished it yet, but so far it is OK, not super but ok
Apache Spark 2: Data Processing and Real-Time Analytics
Unlock this book and the full library for FREE
Start free trial