"Machine Learning can be useful in almost every problem domain:" An interview with Sebastian Raschka


September 04th, 2017

Sebastian Raschka is a machine learning expert. He is currently a researcher at Michigan State University, where he is working on computational biology. But he is also the author of Python Machine Learning, the most popular book ever published by Packt. It's a book that has helped to define the field, breaking it out of the purely theoretical and showing readers how machine learning algorithms can be applied to every day problems. 

Python Machine Learning was published in 2015, but Sebastian is back with a brand new edition, updated and improved for 2017, working alongside his colleague Vahid Mirjalili. We were lucky enough to catch Sebastian in between his research and working on the new edition to ask him a few questions about what's new in the second edition of Python Machine Learning, and to get his assessment of what the key challenges and opportunities in data science are today.

What's the most interesting takeaway from your book?

Sebastian Raschka: In my opinion, the key take away from my book is that machine learning can be useful in almost every problem domain.

I cover a lot of different subfields of machine learning in my book: classification, regression analysis, clustering, feature extraction, dimensionality reduction, and so forth. By providing hands-on examples for each one of those topics, my hope is that people can find inspiration for applying these fundamental techniques to drive their research or industrial applications. Also, by using well-developed and maintained open source software, makes machine learning very accessible to a broad audience of experienced programmers as well as people who are new to programming. And introducing the basic mathematics behind machine learning, we can appreciate machine learning being more than just black box algorithms, giving readers an intuition of the capabilities but also limitations of machine learning, and how to apply those algorithms wisely.

What's new in the second edition?

SR: As time and the software world moved on after the first edition was released in September 2015, we decided to replace the introduction to deep learning via Theano. No worries, we didn't remove it! But it got a substantial overhaulf and is now based on TensorFlow, which has become a major player in my research toolbox since its open source release by Google in November 2015.

Along with the new introduction to deep learning using TensorFlow, the biggest additions to this new edition are three brand new chapters focussing on deep learning applications: A more detailed overview of the TensorFlow mechanics, an introduction to convolutional neural networks for image classification, and an introduction to recurrent neural networks for natural language processing. Of course, and in a similar vein as the rest of the book, these new chapters do not only provide readers with practical instructions and examples but also introduce the fundamental mathematics behind those concepts, which are an essential building block for understanding how deep learning works.

What do you think is the most exciting trend in data science and machine learning?

SR: One interesting trend in data science and machine learning is the development of libraries that make machine learning even more accessible. Popular examples include TPOT and AutoML/auto-sklearn. Or, in other words, libraries that further automate the building of machine learning pipelines. While such tools do not aim to replace experts in the field, they may be able to make machine learning even more accessible to an even broader audience of non-programmers. However, being to interpret the outcomes of predictive modeling tasks and being to evaluate the results appropriately will always require a certain amount of knowledge. Thus, I see those tools not as replacements but rather as assistants for data scientists, to automate tedious tasks such as hyperparameter tuning.

Another interesting trend is the continued development of novel deep learning architectures and the large progress in deep learning research overall. We've seen many interesting ideas from generative adversarial neural networks (GANs), densely connected neural networks (DenseNets), and  ladder networks. Large profress has been made in this field thanks to those new ideas and the continued improvements of deep learning libraries (and our computing infrastructure) that accelerate the implementation of research ideas and the development of these technologies in industrial applications. 

How has the industry changed since you first started working?

SR: Over the years, I have noticed that more and more companies embrace open source, i.e., by sharing parts of their tool chain in GitHub, which is great. Also, data science and open source related conferences keep growing, which means more and more people are not only getting interested in data science but also consider working together, for example, as open source contributors in their free time, which is nice.

Another thing I noticed is that as deep learning becomes more and more popular, there seems to be an urge to apply deep learning to problems even if it doesn't necessarily make sense -- i.e., the urge to use deep learning just for the sake of using deep learning. Overall, the positive thing is that people get excited about new and creative approaches to problem-solving, which can drive the field forward. Also, I noticed that more and more people from other domains become more familiar with the techniques used in statistical modeling (thanks to "data science") and machine learning. This is nice, since good communication in collaborations and teams is important, and a given, common knowledge about the basics makes this communication indeed a bit easier.

What advice would you give to someone who wants to become a data scientist?

SR: I recommend starting with a practical, introductory book or course to get a brief overview of the field and the different techniques that exist. A selection of concrete examples would be beneficial for understanding the big picture and what data science and machine learning is capable of.

Next, I would start a passion project while trying to apply the newly learned techniques from statistics and machine learning to address and answer interesting questions related to this project. While working on an exciting project, I think the practitioner will naturally become motivated to read through the more advanced material and improve their skill.

What are the biggest misunderstandings and misconceptions people have about machine learning today?

Well, there's this whole debate on AI turning evil. As far as I can tell, the fear mongering is mostly driven by journalists who don't work in the field and are apparently looking for catchy headlines. Anyway, let me not iterate over this topic as readers can find plenty of information (from both viewpoints) in the news and all over the internet. To say it with one of the earlier comments, Andrew Ng's famous quote: “I don’t work on preventing AI from turning evil for the same reason that I don’t work on combating overpopulation on the planet Mars."

What's so great about Python? Why do you think it's used in data science and beyond?

SR: It is hard to tell which came first: Python becoming a popular language so that many people developed all the great open-source libraries for scientific computing, data science, and machine learning or Python becoming so popular due to the availability of these open-source libraries. One thing is obvious though: Python is a very versatile language that is easy to learn and easy to use.

While most algorithms for scientific computing are not implemented in pure Python, Python is an excellent language for interacting with very efficient implementations in Fortran, C/C++, and other languages under the hood. This, calling code from computationally efficient low-level languages but also providing users with a very natural and intuitive programming interface, is probably one of the big reasons behind Python's rise to popularity as a lingua franca in the data science and machine learning community.

What tools, frameworks and libraries do you think people should be paying attention to?

There are many interesting libraries being developed for Python. As a data scientist or machine learning practitioner, I'd especially want to highlight the well-maintained tools from Python core scientific stack:

  • -       NumPy and SciPy as efficient libraries for working with data arrays and scientific computing
  • -       Pandas to read in and manipulate data in a convenient data frame format
  • -       matplotlib for data visualization (and seaborn for additional plotting capabilities and more specialized plots)
  • -       scikit-learn for general machine learning

There are many, many more libraries that I find useful in my project. For example, Dask is an excellent library for working with data frames that are too large to fit into memory and to parallelize computations across multiple processors. Or take TensorFlow, Keras, and PyTorch, which are all excellent libraries for implementing deep learning models.

What does the future look like for Python?

In my opinion, Python's future looks very bright! For example, Python has just been ranked as top 1 programming language by IEEE Spectrum as of July 2017. While I mainly speak of Python from the data science/machine learning perspective, I heard from many people in other domains that they appreciate Python as a versatile language and its rich ecosystem of libraries. Of course, Python may not be the best tool for every problem, it is very well regarded as a "productive" language for programmers who want to "get things done."

Also, while the availability of plenty of libraries is one of the strengths of Python, I must also highlight that most packages that have been developed are still being exceptionally well maintained, and new features and improvements to the core data science and machine learning libraries are being added on a daily basis. For instance, the NumPy project, which has been around since 2006, just received a $645,000 grant to further support its continued developed as a core library for scientific computing in Python.

At this point, I also want to thank all the developers of Python and its open source libraries that have made Python to what it is today. It's an immensely useful tool to me, and as Python user, I also hope you will consider getting involved in open source -- every contribution is useful and appreciated, small documentation fixes, bug fixes in the code, new features, or entirely new libraries. Again, and with big thanks to the awesome community around it,  I think Python's future looks very bright.