
You're reading from  Building Data Science Solutions with Anaconda

Product type: Book
Published in: May 2022
Publisher: Packt
ISBN-13: 9781800568785
Edition: 1st Edition
Author: Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client-facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.

Chapter 2: Analyzing Open Source Software

You can't have a grasp of data science unless you understand open source. It is the oxygen that has fueled the explosive growth of artificial intelligence (AI) in the last two decades. You will be hard-pressed to find any software product or tool being used today that does not make use of open source or is not open source itself.

In this chapter, we will learn what it means for a tool to be open source and how that does (or does not) limit the ways you can use it. We will then walk through how to find and start using different open source tools in your projects today. Finally, we will put these skills to use by evaluating and using one of the most popular open source tools for data science, scikit-learn.

We will focus on the following topics:

  • Understanding open source
  • Understanding the top four OSS licenses
  • Evaluating a new tool or library
  • Importing packages using the Anaconda distribution and conda-forge
  • Evaluating...

Technical requirements

For this chapter, you will need to have the Anaconda distribution installed. This will come with conda, Navigator, and the most widely used tools for data science, including all the packages we will use later in the chapter.

You'll need to have a GitHub account set up before you begin. Head to https://github.com/join to do that.

You can find the code for this chapter and the rest of the book here: https://github.com/PacktPublishing/Building-Data-Science-Solutions-with-Anaconda/tree/main/Chapter02.

We'll also be writing and editing code. You can use an integrated development environment (IDE) of your choice, such as Visual Studio Code, but it's suggested that you use the very popular Jupyter Notebook for the tasks in this chapter. Jupyter is included in the Anaconda distribution.

Understanding open source

Open source is a term you have heard, or will hear, all the time, so getting a fundamental view of it is critical. It will help you keep a clear picture of the differences among what is currently out there and ensure that you can pick the tools you need to do your job. In this section, you'll learn what open source software (or OSS for short) means.

Free as in free speech, not free beer – this is a phrase used to describe the free part of open source. Free in this sense does not mean something that you can use and consume as much as you want with nothing in return; it's more the idea that there should be an open exchange of ideas. Just as with free speech, there are limits on what you can say and do if you want to avoid legal ramifications. This distinction will become clearer as we continue through the chapter.

In short, OSS is software that is free to be used, modified, and shared...

Understanding the top four OSS licenses

OSS is copyrighted and there are restrictions around it; they just tend to be much less strict than the IP restrictions that were far more common before the turn of the century. There is a good chance that, at some point, you have heard terms such as the MIT license, GPL, and others. While these might just sound like legal speak no one cares about, there is a real need to become familiar with them at a high level for the following reasons:

  • If you put an open source component into a code base that your company sells, you might have to make that software open source.
  • Your personal project might be something you would rather keep secret or proprietary.
  • You might want to make money off your project in the future.

You should keep these in mind when deciding to use certain packages because, once you are using those tools in your software, the only choice you will have in the future to not be held to those licenses' restrictions...
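One practical way to act on this is to look at the license each installed package declares before you build on it. The following is a minimal sketch, not a legal check: it reads the `License-Expression` or `License` metadata field through the standard library's `importlib.metadata`, and the helper names `declared_license` and `needs_review` are hypothetical. Flagging anything that mentions "GPL" is a deliberately crude assumption, since metadata is often missing or inconsistent and the license file itself is the authority.

```python
from importlib import metadata


def declared_license(dist_name):
    """Return the license string a distribution declares, or None.

    Hypothetical helper: it only reads packaging metadata, which may be
    empty, outdated, or less precise than the project's LICENSE file.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None  # the package is not installed in this environment
    return meta.get("License-Expression") or meta.get("License") or None


def needs_review(license_text):
    """Crude assumption: any mention of GPL deserves a closer look."""
    return license_text is not None and "GPL" in license_text.upper()


# Example with literal strings, so it does not depend on what is installed.
for text in ("MIT", "GPL-3.0-or-later", None):
    print(text, "-> review" if needs_review(text) else "-> no flag")
```

In a notebook you might call `declared_license("scikit-learn")` and pass the result to `needs_review`, treating the answer as a prompt to read the actual license rather than as a verdict.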

Evaluating a new tool or library

The only constant is change, and there is no doubt that, as I type this, someone is developing a new tool that "fixes" all the things that are broken with framework X and is simpler to use. This section will help you navigate a world where a constant stream of new software is available for free. You will learn what attributes and factors to look at to decide whether something is worth using or not.

There are a few heuristics that you could use when evaluating a new tool. Feel free to adjust which ones you use based on your specific needs:

  • The number of stars the tool has on GitHub
  • The tool's age
  • How long it's been since the tool has been updated
  • The number of maintainers
  • The number of open issues/PRs
  • The number of dependencies

I want to add a big asterisk to all of these. The answer to how important each of these is is the same as the answer to which architecture style is right for your code base...
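Several of the heuristics listed above can be read straight from repository metadata. The sketch below assumes you already have a dictionary shaped like the repository object returned by the GitHub REST API (fields `stargazers_count`, `created_at`, `pushed_at`, and `open_issues_count`); the function name `summarize_repo` and the sample values are made up for illustration. The number of maintainers and dependencies is not in that object, so those two heuristics are left out here.

```python
from datetime import datetime, timezone


def parse_github_time(value):
    """Parse a GitHub-style timestamp such as '2022-04-30T12:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_repo(repo, now=None):
    """Collect a few evaluation signals from repository metadata.

    Hypothetical helper: `repo` is assumed to be a dict with the GitHub
    repository fields used below.
    """
    now = now or datetime.now(timezone.utc)
    created = parse_github_time(repo["created_at"])
    pushed = parse_github_time(repo["pushed_at"])
    return {
        "stars": repo["stargazers_count"],
        "age_years": round((now - created).days / 365.25, 1),
        "days_since_update": (now - pushed).days,
        # GitHub counts open pull requests in this field as well.
        "open_issues_and_prs": repo["open_issues_count"],
    }


# Made-up sample values, not real numbers for any project.
sample_repo = {
    "stargazers_count": 50000,
    "created_at": "2010-08-17T09:43:38Z",
    "pushed_at": "2022-04-30T12:00:00Z",
    "open_issues_count": 2000,
}
fixed_now = datetime(2022, 5, 10, 12, 0, 0, tzinfo=timezone.utc)
summary = summarize_repo(sample_repo, now=fixed_now)
print(summary)
```

The function only reports the raw signals and does not combine them into a single score, because how much each one matters depends on your situation.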

Importing packages with Anaconda and conda-forge

This section might be one of the most valuable in the entire book as it's such a foundational part of the work you will do day to day as a data scientist (and as a developer). In any given project or even a small proof of concept, you will use many packages to accomplish what you need to, so let's look at how conda and conda-forge work together to get you what you need.

The conda package manager and Navigator are great tools, but they are useless without the packages themselves. For any given update to a package, there are things that might have changed with it or new dependencies brought in. For example, TensorFlow (https://github.com/tensorflow/tensorflow), the popular machine learning framework, is looking at releasing version 2.6.0. This release splits out a major part, Keras, so now there may be libraries that aren't needed and new ones that are. Some package updates are very minor, but some require a lot of manual...
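To make the relationship between conda and conda-forge concrete, here is a small sketch that assembles a `conda install` command pointing at the conda-forge channel. The helper name `conda_install_command` is hypothetical; `--yes` and `--channel` are standard `conda install` options. The sketch only builds and prints the command, so nothing is installed when you run it.

```python
import shlex


def conda_install_command(packages, channel=None):
    """Build the argument list for a `conda install` call.

    Hypothetical helper: it does not run conda, it only assembles the
    command so you can inspect it first.
    """
    cmd = ["conda", "install", "--yes"]
    if channel:
        # Ask conda to look for the packages in this channel.
        cmd += ["--channel", channel]
    cmd += list(packages)
    return cmd


cmd = conda_install_command(["scikit-learn"], channel="conda-forge")
print(shlex.join(cmd))
# prints: conda install --yes --channel conda-forge scikit-learn
```

If you want to execute the command from Python, passing the list to `subprocess.run(cmd, check=True)` is one option; typing the printed line into a terminal works just as well.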

Evaluating and using scikit-learn

Let's say you want to tackle a problem, such as estimating the price of houses in California. You want to evaluate the popular data science framework scikit-learn to see if it fits your needs. Scikit-learn is a very powerful ready-to-go solution that allows you to train and evaluate many different types of models.

One of the most powerful things a library or piece of software can do is provide abstraction so that you can do a lot with a little. Scikit-learn exposes just enough that, with a few readable and specific commands, you can create a model that years ago would have taken you a week to come up with on your own.

Think of it as being like giving your order to a waiter. You give them a high-level insight into what you would like to eat, and they relay that information to the chef who already knows how to prepare a fantastic meal. You don't need to worry about what exact temperature to set the...
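As a rough illustration of how little code that "order" takes, the sketch below trains and scores a regression model in a handful of lines. It is not the chapter's own walkthrough: the choice of `LinearRegression` is an assumption, and the data is synthetic, generated with `make_regression` so that the example runs without downloading anything. For the real housing problem, `sklearn.datasets.fetch_california_housing` can supply the features and prices instead (it downloads the dataset on first use).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 500 rows, 8 numeric features.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Hold back 20% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The "order to the waiter": pick a model, fit it, ask how well it did.
model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)

print(f"R^2 on held-out data: {r2:.3f}")
```

Swapping in a different estimator usually means changing only the line that creates `model`, which is the kind of abstraction the analogy describes.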

Summary

In this chapter, we covered a lot of ground. We looked at what it means to be open source by digging into how the OSI defines it as a common understanding that the source code should be accessible, open to change, and not limited by industry, among other things.

We found out about the major licenses that you'll come across on your journey and the differences between them. You saw that copyleft licenses such as GPL require you to share what you create with them, whereas permissive licenses such as MIT let you keep those things for yourself.

We then looked at the criteria that you can use to evaluate whether an open source tool might be for you by using things such as the number of GitHub stars, the number of maintainers, and how long it's been around. Looking at some of these things holistically lets us put together a better picture of whether we can count on our OSS tool to be maintained and reliable.

Finally, we saw how you can access the...
