Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Scala for Data Science

You're reading from  Scala for Data Science

Product type Book
Published in Jan 2016
Publisher
ISBN-13 9781785281372
Pages 416 pages
Edition 1st Edition
Languages
Author (1):
Pascal Bugnion Pascal Bugnion
Profile icon Pascal Bugnion

Table of Contents (22) Chapters

Scala for Data Science
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Scala and Data Science 2. Manipulating Data with Breeze 3. Plotting with breeze-viz 4. Parallel Collections and Futures 5. Scala and SQL through JDBC 6. Slick – A Functional Interface for SQL 7. Web APIs 8. Scala and MongoDB 9. Concurrency with Akka 10. Distributed Batch Processing with Spark 11. Spark SQL and DataFrames 12. Distributed Machine Learning with MLlib 13. Web APIs with Play 14. Visualization with D3 and the Play Framework Pattern Matching and Extractors Index

Programming in data science


This book is not a book about data science. It is a book about how to use Scala, a programming language, for data science. So, where does programming come in when processing data?

Computers are involved at every step of the data science pipeline, but not necessarily in the same manner. The style of programs that we build will be drastically different if we are just writing throwaway scripts to explore data or trying to build a scalable application that pushes data through a well-understood pipeline to continuously deliver business intelligence.

Let's imagine that we work for a company making games for mobile phones in which you can purchase in-game benefits. The majority of users never buy anything, but a small fraction is likely to spend a lot of money. We want to build a model that recognizes big spenders based on their play patterns.

The first step is to explore data, find the right features, and build a model based on a subset of the data. In this exploration phase, we have a clear goal in mind but little idea of how to get there. We want a light, flexible language with strong libraries to get us a working model as soon as possible.

Once we have a working model, we need to deploy it on our gaming platform to analyze the usage patterns of all the current users. This is a very different problem: we have a relatively clear understanding of the goals of the program and of how to get there. The challenge comes in designing software that will scale out to handle all the users and be robust to future changes in usage patterns.

In practice, the type of software that we write typically lies on a spectrum ranging from a single throwaway script to production-level code that must be proof against future expansion and load increases. Before writing any code, the data scientist must understand where their software lies on this spectrum. Let's call this the permanence spectrum.

You have been reading a chapter from
Scala for Data Science
Published in: Jan 2016 Publisher: ISBN-13: 9781785281372
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}