Exploring the data ingredients

Important note

If you have a background in data science, you may skip this section.

However, if you do not, this basic introduction is essential to understand the concepts and tools discussed throughout the book.

Data science is an interdisciplinary field that combines mathematics, statistics, programming, and machine learning with specific subject matter knowledge to extract meaningful insights.

Imagine you work at a top-tier bank that is considering making its first investment in a blockchain protocol, and you have been asked to present a shortlist of protocols to invest in based on relevant metrics. You may have some ideas about which metrics to consider, but how do you know which metrics and values are the most relevant for deciding which protocols should make it onto your list? And once you have chosen a metric, how do you find the data and calculate it?

This is where data science comes in. By analyzing transaction data (on-chain data) and data that is not on-chain (off-chain data), we can identify patterns and insights that will help us make informed decisions. For example, we might find that certain protocols are more active during business hours in a time zone different from where the bank is located. In this case, the bank can decide whether it is ready to invest in a product serving clients in a different time zone. We may also check the value locked in the protocol to assess general investor trust in that smart contract, among many other metrics.
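
As a minimal sketch of this kind of descriptive analysis, assuming we already have transaction records in a pandas DataFrame (the column names and values below are made up for illustration), we could count transactions per hour of the day:

```python
import pandas as pd

# Hypothetical on-chain transaction records: one row per transaction,
# with a UNIX timestamp and the value transferred (column names are assumptions).
txs = pd.DataFrame({
    "timestamp": [1700000000, 1700003600, 1700050000, 1700090000],
    "value_eth": [1.2, 0.5, 3.4, 0.8],
})

# Convert UNIX timestamps to datetimes and extract the hour of day (UTC).
txs["datetime"] = pd.to_datetime(txs["timestamp"], unit="s", utc=True)
txs["hour"] = txs["datetime"].dt.hour

# Count transactions per hour to see when the protocol's users are most active.
activity_by_hour = txs.groupby("hour").size().sort_values(ascending=False)
print(activity_by_hour)
```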

But data science is not just about analyzing past data. We can also use predictive modeling to forecast future trends and add those trends to our assessment. For instance, we could use machine learning algorithms to predict the price range of the token issued by the protocol based on its price history.
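
As an illustrative sketch only, and not a real trading model, we could fit a simple linear regression on lagged prices with scikit-learn; the price series below is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up daily closing prices of a token (illustrative only, not real data).
prices = np.array([1.00, 1.02, 0.98, 1.05, 1.10, 1.08, 1.15, 1.20, 1.18, 1.25])

# Build a supervised dataset: predict the next price from the last 3 prices.
window = 3
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

model = LinearRegression().fit(X, y)

# Forecast the next value from the most recent window of prices.
next_price = model.predict(prices[-window:].reshape(1, -1))
print(f"Predicted next price: {next_price[0]:.3f}")
```

A serious forecast would use much more history, a time-aware train/test split, and a model better suited to time series.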

For this data analysis, we require the right tools, skills, and business knowledge. We need to know how to collect and clean our data, how to analyze it using statistical techniques, how to separate what is business-relevant from what is not, and how to visualize our findings so that we can communicate them effectively. Making data-driven decisions is one of the most effective ways to improve the metrics that matter to a business, which is more valuable than ever in such a competitive environment.

Due to the fast pace of data creation and the shortage of data scientists in the market, data scientist has been referred to as “the sexiest job of the 21st century” by the Harvard Business Review. The data economy has opened the door to multiple roles, such as data analyst, data scientist, data engineer, data architect, Business Intelligence (BI) analyst, and machine learning engineer. Depending on the complexity of the problem and the size of the data, several of these roles may be involved in a typical data science project.

A typical Web3 data science project involves the following steps:

  1. Problem definition: At this stage, we try to answer the question of whether the problem can be solved with data and, if so, what data would be useful to answer it. Collaboration between data scientists and business users is crucial in defining the problem, as the latter are the domain specialists and the ones who will use what the data scientist produces. BI tools such as Tableau, Looker, and Power BI, or Python data visualization libraries such as Seaborn and Matplotlib, are useful in meetings with business stakeholders. It is worth noting that, while many BI tools currently provide optimization packages for commonly used data sources such as Facebook Ads or HubSpot, as of the time of writing I have not seen any comparable optimization for on-chain data. Therefore, it is preferable to choose highly flexible data visualization tools that can adapt to any visualization need.
  2. Investigation and data ingestion: At this stage, we try to answer the question: where can we find the necessary data to use for this project? Throughout this book, especially Chapters 2 and 3, we will list multiple data sources related to Web3 that will help answer this question. Once we find where the data is, we need to build an ingestion pipeline for consumption by the data scientist. This process is called ETL, which stands for extract, transform, and load. These steps are necessary to make clean and organized data available to the data analyst or data scientist.

    Data collection or extraction is the first step of the ETL process and can include manual entry, web scraping, live streaming from devices, or a connection to an API. Data can come in a structured format, meaning that it is stored in a predefined way, or an unstructured format, meaning that it has no predefined storage format and is simply kept in its native form. Transformation consists of modifying the raw data so that it can be stored or analyzed; this can involve activities such as data normalization, data deduplication, and data cleaning. Finally, loading is the act of moving the transformed data into data storage and making it available. There are a few additional aspects to consider regarding data availability, such as storing data in the correct format, including all related metadata, granting the correct access privileges to the right team members, and ensuring the data is up to date, accurate, and sufficient to fulfill the data scientist’s needs. The ETL process is generally led by the data engineer, but the business owner and the data scientist will have a say when identifying the data source. A minimal ETL sketch in Python follows this list.

  3. Analysis/modeling: In this stage, we analyze the data to extract conclusions and may need to model it to try to predict future outcomes. Once the data is available, we can perform the following:
    • Descriptive analysis: This uses data analysis methods to describe what the data shows, providing insights into trends, composition, distribution, and more. For example, a descriptive analysis of a Decentralized Finance (DeFi) protocol can reveal when its clients are most active, its Total Value Locked (TVL), and how that locked value has evolved over time.
    • Diagnostic analysis: This uses data analysis to explain why certain events occurred. Techniques such as data composition, correlations, and drill-down are used in these types of analyses. For example, a blockchain analyst may study the correlation between a peak in new addresses and the activity of certain addresses to identify how these users are using the chain.
    • Predictive analysis: This uses historical data to make forecasts about future trends or events. Techniques can include machine learning, cluster analysis, and time series forecasting. For example, a trader may try to predict the price evolution of a certain cryptocurrency based on its historical performance.
    • Prescriptive analysis: This uses the results of predictive analysis as an input to suggest the optimum response or best course of action. For example, a bot can suggest whether to buy or sell a certain cryptocurrency.
    • Generative AI: This uses machine learning techniques and huge amounts of data to learn patterns and generate new, original outputs. Artificial intelligence can create images, videos, audio, text, and more. Applications of generative models include ChatGPT, Leonardo AI, and Midjourney.
  4. Evaluation: In this stage, the result of our analysis or modeling is evaluated and tested to confirm it meets the project goals and provides value to the business. Any bias or weakness of our models is identified, and if necessary, the process starts again to address those errors.
  5. Presentation/deployment: The final stage of the process depends on the problem. If it is an analysis from which the company will make a decision, our job will probably conclude with a presentation and explanation of our findings. Alternatively, if we are working as part of a larger software pipeline, our model will most likely be deployed or integrated into the data pipeline.
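
As referenced in step 2, here is a minimal ETL sketch in Python. It is illustrative only: the API endpoint URL and the field names (tx_hash, value, timestamp) are hypothetical placeholders, not a real data source.

```python
import sqlite3

import pandas as pd
import requests

# Extract: pull raw records from a hypothetical REST API
# (the URL and the response schema are placeholders).
response = requests.get("https://api.example.com/v1/transfers", timeout=30)
records = response.json()  # expected: a list of dicts with tx_hash, value, timestamp

# Transform: normalize into a table, deduplicate, and clean.
df = pd.DataFrame(records)
df = df.drop_duplicates(subset="tx_hash")
df["value"] = pd.to_numeric(df["value"], errors="coerce")
df = df.dropna(subset=["value", "timestamp"])
df["datetime"] = pd.to_datetime(df["timestamp"], unit="s", utc=True).dt.strftime("%Y-%m-%dT%H:%M:%SZ")

# Load: store the cleaned table where analysts can query it (here, a local SQLite file).
with sqlite3.connect("web3_data.db") as conn:
    df.to_sql("transfers", conn, if_exists="replace", index=False)
```

In a production pipeline, the same three steps would typically run on a scheduler (for example, Airflow) and load into a data warehouse rather than a local file.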

This is an iterative process: often, especially at step 4, we will receive valuable feedback from the business team about our analysis and will revise our initial conclusions accordingly. What is true for traditional data science is reinforced in the Web3 industry, one of the industries where data plays a key role in building trust, guiding investments, and, in general, unlocking new value.

Although data science is not a programming career, it heavily relies on programming because of the large amount of data available. In this book, we will work with the Python language and some SQL to query databases. Python is a general-purpose programming language commonly used by the data science community, and it is easy to learn due to its simple syntax. An alternative to Python is R, which is a statistical programming language commonly used for data analysis, machine learning, scientific research, and data visualization. A simple way to access Python or R and their associated libraries and tools is to install the Anaconda distribution. It includes popular data science libraries (such as NumPy, pandas, and Matplotlib for Python) and simplifies the process of setting up an environment to start working on data analysis and machine learning projects.
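
For a quick taste of how Python and SQL work together, the following sketch creates a toy SQLite table in memory (the table and values are made up) and queries it with SQL from Python:

```python
import sqlite3

import pandas as pd

# Create a small in-memory SQLite database with a toy table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (tx_hash TEXT, value REAL)")
conn.executemany(
    "INSERT INTO transfers VALUES (?, ?)",
    [("0xaaa", 1.5), ("0xbbb", 0.3), ("0xccc", 2.1)],
)

# Run a SQL query from Python and get the result back as a pandas DataFrame.
df = pd.read_sql(
    "SELECT COUNT(*) AS n_transfers, SUM(value) AS total_value FROM transfers",
    conn,
)
print(df)
conn.close()
```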

The activities in this book will be carried out in three work environments:

  • Notebooks: For example, Anaconda Jupyter notebooks or Google Colaboratory (also frequently referred to as Colab). These files are saved in .ipynb format and are very useful for data analysis or for training models. We will use Colab notebooks in the machine learning chapters because of the access Colab provides to GPU resources in its free tier.
  • IDEs: PyCharm, Visual Studio Code, or any other IDE that supports Python. Their files are saved in .py format and are very useful for building applications. Most IDEs allow the user to download extensions to work with notebook files.
  • Query platforms: In Chapter 2, we will access on-chain data platforms that have built-in query systems. Examples of those platforms are Dune Analytics, Flipside, Footprint Analytics, and Increment.

Anaconda Jupyter notebooks and IDEs use our computer’s resources (e.g., RAM), while Google Colaboratory uses cloud services (more on resources can be found in Appendix 1).

Please refer to Appendix 1 to install any of the environments mentioned previously.

Once we have a clean notebook, we will warm up our Python skills with the Chapter01/Python_warm_up notebook, which follows the tutorial at https://learnxinyminutes.com/docs/python/. For a more thorough study of Python, we encourage you to check out Data Science with Python, by Packt Publishing, or the Python Data Science Handbook, both of which are listed in the Further reading section of this chapter.

Once we have completed the warm-up exercise, we will initiate the Web3 client using the Web3.py library. Let’s learn about these concepts in the following section.
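
As a brief preview of that step, a minimal connection sketch might look like the following (assuming web3.py version 6, where the connectivity check is named is_connected, and a placeholder RPC endpoint URL that you would replace with a real provider):

```python
from web3 import Web3

# Connect to an Ethereum node through an RPC endpoint
# (the URL below is a placeholder; use your own node or provider URL).
w3 = Web3(Web3.HTTPProvider("https://rpc.example.com"))

# Basic sanity checks: are we connected, and what is the latest block number?
print(w3.is_connected())
print(w3.eth.block_number)
```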
