You're reading from Building Data Science Solutions with Anaconda

Product type: Book

Published in: May 2022

Publisher: Packt

ISBN-13: 9781800568785

Edition: 1st Edition

Tools

Anaconda

Concepts

Data Science

Author (1)

Dan Meador

Chapter 2: Analyzing Open Source Software

You can't have a grasp of data science unless you understand open source. It is the oxygen that has fueled the explosion of artificial intelligence (AI) growth in the last two decades. You will be hard-pressed to find any software product or tool being used today that does not make use of open source or is not open source itself.

In this chapter, we will learn what it means for a tool to be open source and how that limits (or does not) how you can use it. We will then walk through how to find and start using different open source tools in your projects today. Finally, we will put these skills to use by evaluating and using one of the most popular open source tools for data science, scikit-learn.

We will focus on the following topics:

Understanding open source
Understanding the top four OSS licenses
Evaluating a new tool or library
Importing packages using the Anaconda distribution and conda-forge
Evaluating...

Technical requirements

For this chapter, you will need to have the Anaconda distribution installed. This will come with conda, Navigator, and the most widely used tools for data science, including all the packages we will use later in the chapter.

You'll need to have a GitHub account set up before you begin. Head to https://github.com/join to do that.

You can find the code for this chapter and the rest of the book here: https://github.com/PacktPublishing/Building-Data-Science-Solutions-with-Anaconda/tree/main/Chapter02.

We'll also be writing and editing code. You can use any integrated development environment (IDE) you like, such as Visual Studio Code, but it's suggested that you use the very popular Jupyter Notebook for the tasks in this chapter. Jupyter is included in the Anaconda distribution.

Understanding open source

You have heard, or will hear, the term open source all the time, so getting a fundamental view of it is critical. It will let you keep a clear picture in your head of how to navigate what is currently out there and ensure that you are able to pick and choose the tools you need to do your job. In this section, you'll learn what open source software (or OSS for short) means.

Free as in free speech, not free beer – this is a phrase used to describe the free part of open source. Free in this sense is used not to mean something that you can use and consume as much as you want with nothing in return; it's more the idea and concept that there should be an open exchange of ideas. Just like free speech, there are limits around what you can and can't say and do unless you want legal ramifications. This distinction will become clearer as we continue through the chapter.

In short, OSS is software that is free to be used, modified, and shared...

Understanding the top four OSS licenses

OSS is copyrighted and there are restrictions around it; they just tend to be much less strict than the IP laws that were so much more common before the turn of the century. There is a good chance that, at some point, you have heard terms such as MIT licenses, GPL, and others. While these might just sound like legal speak no one cares about, there is a real need to become familiar with them at a high level for the following reasons:

If you put something in your company's code base that is sold, you might have to make that software open source.
Your personal project might be something you would rather be kept secret or proprietary.
You might want to make money off your project in the future.

You should keep these in mind when deciding whether to use certain packages because, once you are using those tools in your software, the only choice you will have in the future to not be held to those licenses' restrictions...
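As a quick first check before you depend on a package, you can look at the license it declares in its own metadata. The following is a minimal sketch, not a legal review: it assumes the package is already installed in your current environment, and the helper name `declared_licenses` is made up for this illustration. Package metadata is self-reported and sometimes missing, so treat the result as a pointer to go and read the project's actual license file.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_licenses(package_name):
    """Return the license strings an installed package declares about itself.

    Returns None when the package is not installed in this environment.
    (Hypothetical helper written for this example.)
    """
    try:
        meta = metadata(package_name)
    except PackageNotFoundError:
        return None

    found = []
    # Free-text fields; "License" can hold the full license text,
    # so only its first line is kept.
    for field in ("License-Expression", "License"):
        value = (meta.get(field) or "").strip()
        if value and value.upper() != "UNKNOWN":
            found.append(value.splitlines()[0])
    # Trove classifiers such as "License :: OSI Approved :: MIT License".
    for classifier in meta.get_all("Classifier") or []:
        if classifier.startswith("License ::"):
            found.append(classifier.split("::")[-1].strip())
    return found


if __name__ == "__main__":
    for name in ("scikit-learn", "numpy", "pandas"):
        print(name, "->", declared_licenses(name))
```

An empty list means the package is installed but declares nothing useful, which is itself a signal to look more closely before you ship it.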

Evaluating a new tool or library

The only constant is change, and there is no doubt that, as I type this, a new tool that "fixes" all the things that are broken with framework X and is simpler to use is being developed. This section will help you navigate the new world where a constant stream of new software is available for free. You will learn what attributes and factors to look at to decide whether something is worth using or not.

There are a few heuristics that you could use when evaluating a new tool. Feel free to adjust which ones you use based on your specific needs:

The number of stars the tool has on GitHub
The tool's age
How long it's been since the tool has been updated
The number of maintainers
The number of open issues/PRs
The number of dependencies

I want to add a big asterisk to all of these. The answer to how important each of these is is the same as the answer to which architecture style is right for your code base...
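Several of the heuristics listed above can be computed from the metadata a repository host already publishes. The sketch below assumes you have a dictionary that uses the field names of GitHub's REST API repository object (`stargazers_count`, `created_at`, `pushed_at`, and `open_issues_count`, which counts open issues and pull requests together). The function name `summarize_repo` and the sample values are hypothetical and do not describe any real project. The number of maintainers and the number of dependencies are not part of that object, so they are left out here.

```python
from datetime import datetime, timezone


def summarize_repo(repo, now=None):
    """Turn raw repository metadata into a few evaluation heuristics.

    `repo` is a dict with GitHub-style field names; `now` can be fixed
    to make the result reproducible. (Hypothetical helper.)
    """
    now = now or datetime.now(timezone.utc)

    def parse(timestamp):
        # GitHub timestamps look like "2018-01-15T00:00:00Z" (UTC).
        parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        return parsed.replace(tzinfo=timezone.utc)

    return {
        "stars": repo["stargazers_count"],
        "age_years": round((now - parse(repo["created_at"])).days / 365.25, 1),
        "days_since_last_push": (now - parse(repo["pushed_at"])).days,
        "open_issues_and_prs": repo["open_issues_count"],
    }


# Made-up sample values, used only to exercise the function.
example = {
    "stargazers_count": 1200,
    "created_at": "2018-01-15T00:00:00Z",
    "pushed_at": "2021-12-01T00:00:00Z",
    "open_issues_count": 85,
}
fixed_now = datetime(2022, 1, 1, tzinfo=timezone.utc)
summary = summarize_repo(example, now=fixed_now)
print(summary)
```

The function deliberately returns the raw numbers rather than a single score, since how much weight each one deserves depends on your project.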

Importing packages with Anaconda and conda-forge

This section might be one of the most valuable in the entire book as it's such a foundational part of the work you will do day to day as a data scientist (and as a developer). In any given project or even a small proof of concept, you will use many packages to accomplish what you need to, so let's look at how conda and conda-forge work together to get you what you need.

The conda package manager and Navigator are great tools, but they are useless without the packages themselves. For any given update to a package, there are things that might have changed with it or new dependencies brought in. For example, TensorFlow (https://github.com/tensorflow/tensorflow), the popular machine learning framework, is looking at releasing version 2.6.0. This release splits out a major part, Keras, so now there may be libraries that aren't needed and new ones that are. Some package updates are very minor, but some require a lot of manual...
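To make the idea of asking conda for a package from conda-forge concrete, here is a small sketch that builds a `conda install` command from Python. It relies only on standard conda options (`--channel`, `--name`, `--yes`, and `--dry-run`); the helper names and the environment name `ds-sandbox` are assumptions made for this example. The command defaults to a dry run, which asks conda to report what it would install, including any dependencies it would bring in, without changing anything.

```python
import shutil
import subprocess


def conda_install_command(packages, channel="conda-forge", env_name=None, dry_run=True):
    """Build (but do not run) a `conda install` command as an argument list."""
    command = ["conda", "install", "--yes", "--channel", channel]
    if env_name:
        # Target a named environment instead of the active one.
        command += ["--name", env_name]
    if dry_run:
        # Report the planned changes without applying them.
        command.append("--dry-run")
    return command + list(packages)


def run_if_conda_available(command):
    """Run the command only when a `conda` executable is on PATH."""
    if shutil.which("conda") is None:
        print("conda not found on PATH; would have run:", " ".join(command))
        return None
    return subprocess.run(command, check=False)


# "ds-sandbox" is a hypothetical environment name.
cmd = conda_install_command(["scikit-learn"], env_name="ds-sandbox")
print(" ".join(cmd))
```

Passing `cmd` to `run_if_conda_available` executes it; set `dry_run=False` once you are happy with the plan conda reports.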

Evaluating and using scikit-learn

Let's say you want to tackle a problem you have, such as estimating the price of houses in California. You know you want to evaluate the popular data science framework scikit-learn to see if it fits your needs. Scikit-learn is a very powerful ready-to-go solution that allows you to train and evaluate many different types of models.

One of the most powerful things a library or piece of software can do is provide abstraction so that you can do a lot with a little. Scikit-learn exposes just enough that, with a few readable and specific commands, you can create a model that years ago would have taken you a week to come up with on your own.

Think of it as being like giving your order to a waiter. You give them a high-level insight into what you would like to eat, and they relay that information to the chef who already knows how to prepare a fantastic meal. You don't need to worry about what exact temperature to set the...
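To give a feel for how little code that abstraction asks of you, here is a minimal sketch of training and scoring a model with scikit-learn. The choice of `LinearRegression` and mean absolute error is an assumption made for illustration and is not necessarily the model the chapter goes on to build; the function names are also made up for this example. The California housing data comes from `fetch_california_housing`, which downloads the dataset the first time it is called and reports prices in units of $100,000.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def fit_and_score(features, prices, seed=0):
    """Hold out 20% of the rows, fit a linear model, and report its error."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, prices, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # Average absolute gap between predicted and actual held-out prices.
    error = mean_absolute_error(y_test, model.predict(X_test))
    return model, error


def california_housing_baseline():
    """Score a baseline model on the California housing dataset.

    Imported lazily because the first call downloads the data.
    """
    from sklearn.datasets import fetch_california_housing

    data = fetch_california_housing()
    return fit_and_score(data.data, data.target)
```

Calling `california_housing_baseline()` returns the fitted model and its error on the held-out rows. Like the waiter in the analogy, `fit` and `predict` take the order and leave the cooking to the library.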

Summary

In this chapter, we covered a lot of ground. We looked at what it means to be open source by digging into how the Open Source Initiative (OSI) defines it: a common understanding that the source code should be accessible, open to change, and not limited by industry, among other things.

We found out about the major licenses that you'll come across on your journey and the differences between them. You saw that copyleft licenses such as GPL require you to share what you build on them under the same terms, while permissive licenses such as MIT let you keep those things for yourself.

We then looked at the criteria that you can use to evaluate whether an open source tool might be for you by using things such as the number of GitHub stars, the number of maintainers, and how long it's been around. Looking at some of these things holistically lets us put together a better picture of whether we can count on our OSS tool to be maintained and reliable.

Finally, we saw how you can access the...

Author (1)

Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client-facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.