Data Science for Web3

Where Data and Web3 Meet

As we assume no prior knowledge of data or blockchain, this chapter introduces the basic concepts of both topics. A good understanding of these concepts is essential to tackle Web3 data science projects, as we will refer to them throughout the book. A Web3 data science project tries to solve a business problem or unlock new value with data; it is an example of applied science. It has two main components, the data science ingredients and the blockchain ingredients, which we will cover in this chapter.

In the Exploring the data ingredients section, we will analyze the concept of data science, available data tools, and the general steps we will follow, and provide a gentle practical introduction to Python. In the Understanding the blockchain ingredients section, we will cover what blockchain is, its main characteristics, and why it is called the internet of value.

In the final part of this chapter, we will dive into some industry concepts and how to use them. We will also analyze challenges related to the quality and standardization of data and concepts, respectively. Lastly, we will briefly review the concept of APIs and describe the ones that we will be using throughout the book.

In this chapter, we will cover the following topics:

What is a business data science project?
What are data ingredients?
Introducing the blockchain ingredients
Approaching relevant industry metrics
The challenges with data quality and standards
Classifying the APIs

Exploring the data ingredients

Important note

If you have a background in data science, you may skip this section.

However, if you do not, this basic introduction is essential to understand the concepts and tools discussed throughout the book.

Data science is an interdisciplinary field that combines mathematics, statistics, programming, and machine learning with specific subject matter knowledge to extract meaningful insights.

Imagine you work at a top-tier bank that is considering making its first investment in a blockchain protocol, and they have asked you to present a shortlist of protocols to invest in based on relevant metrics. You may have some ideas about what metrics to consider, but how do you know which metric and value is the most relevant to determine which protocol should make it on your list? And once you know the metric, how do you find the data and calculate it?

This is where data science comes in. By analyzing transaction data (on-chain data) and data that is not on-chain (off-chain data), we can identify patterns and insights that will help us make informed decisions. For example, we might find that certain protocols are more active during business hours in a time zone different from where the bank is located. In this case, the bank can decide whether they are ready to make an investment in a product serving clients in a different time zone. We may also check the value locked in the protocol to assess the general investors’ trust in that smart contract, among many other metrics.

But data science is not just about analyzing past data. We can also use predictive modeling to forecast future trends and add those trends to our assessment. For instance, we could use machine learning algorithms to predict the price range of the token issued by the protocol based on its price history.

For this data analysis, we require the right tools, skills, and business knowledge. We’ll need to know how to collect and clean our data, how to analyze it using statistical techniques, how to separate what is business-relevant from what is not, and how to visualize our findings so we can communicate them effectively. Making data-driven decisions is the most effective way to improve all the relevant metrics for a business, which is more valuable than ever in this competitive world.

Due to the fast pace of data creation and the shortage of data scientists on the market, data scientist has been referred to as “the sexiest job of the 21st century” by the Harvard Business Review. The data economy has opened the door to multiple roles, such as data analyst, data scientist, data engineer, data architect, Business Intelligence (BI) analyst, and machine learning engineer. Depending on the complexity of the problem and the size of the data, we can see them playing a role in a typical data science project.

A typical Web3 data science project involves the following steps:

Problem definition: At this stage, we try to answer the question of whether the problem can be solved with data, and if so, what data would be useful to answer it. Collaboration between data scientists and business users is crucial in defining the problem, as the latter are the specialists and those who will use what the data scientist produces. BI tools such as Tableau, Looker, and Power BI, or Python data visualization libraries such as Seaborn and Matplotlib, are useful in meetings with business stakeholders. It is worth noting that while many BI tools currently provide optimization packages for commonly used data sources, such as Facebook Ads or HubSpot, as of the time of writing, I have not seen any optimization for on-chain data. Therefore, it is preferable to choose highly flexible data visualization tools that can adapt to any visualization needs.
Investigation and data ingestion: At this stage, we try to answer the question: where can we find the necessary data to use for this project? Throughout this book, especially Chapters 2 and 3, we will list multiple data sources related to Web3 that will help answer this question. Once we find where the data is, we need to build an ingestion pipeline for consumption by the data scientist. This process is called ETL, which stands for extract, transform, and load. These steps are necessary to make clean and organized data available to the data analyst or data scientist.
Data collection or extraction is the first step of the ETL process and can include manual entry, web scraping, live streaming from devices, or a connection to an API. Data can be presented in a structured format, meaning that it is stored in a predefined way, or an unstructured format, meaning that it has no predefined storage format and is simply stored in its native way. Transformation consists of modifying the raw data so that it can be stored or analyzed. Some of the activities that data transformation can involve include data normalization, data deduplication, and data cleaning. Finally, loading is the act of moving the transformed data into data storage and making it available. There are a few additional aspects to consider when referring to data availability, such as storing data in the correct format, including all related metadata, providing the correct access privileges to the right team members, and ensuring the data is up to date, accurate, and sufficient to fulfill the data scientist’s needs. The ETL process is generally led by the data engineer, but the business owner and the data scientist will have a say when identifying the data source.
Analysis/modeling: In this stage, we analyze the data to extract conclusions and may need to model it to try to predict future outcomes. Once the data is available, we can perform the following:
- Descriptive analysis: This uses data analysis and methods to describe what the data shows, gaining insights into trends, composition, distribution, and more. For example, a descriptive analysis of a Decentralized Finance (DeFi) protocol can reveal when its clients are most active, its Total Value Locked (TVL), and how the locked value has evolved over time.
- Diagnostic analysis: This uses data analysis to explain why certain events occurred. Techniques such as data composition, correlations, and drill-down are used in these types of analyses. For example, a blockchain analyst may try to understand the correlation between a peak in new addresses and the activity of certain addresses to identify how these users are using the chain.
- Predictive analysis: This uses historical data to make forecasts about trends or events in the future. Techniques can include machine learning, cluster analysis, and time series forecasting. For example, a trader may try to predict the evolution of a certain cryptocurrency based on its historical performance.
- Prescriptive analysis: This uses the result of predictive analysis as an input to suggest the optimum response or best course of action. For example, a bot can suggest whether to sell or buy a certain cryptocurrency.
- Generative AI: This uses machine learning techniques and huge amounts of data to learn patterns and generate new and original outputs. Artificial intelligence can create images, videos, audio, text, and more. Applications of generative models include ChatGPT, Leonardo AI, and Midjourney.
Evaluation: In this stage, the result of our analysis or modeling is evaluated and tested to confirm it meets the project goals and provides value to the business. Any bias or weakness of our models is identified, and if necessary, the process starts again to address those errors.
Presentation/deployment: The final stage of the process depends on the problem. If it is an analysis from which the company will make a decision, our job will probably conclude with a presentation and explanation of our findings. Alternatively, if we are working as part of a larger software pipeline, our model will most likely be deployed or integrated into the data pipeline.
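The ETL step and a first descriptive analysis can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a production pipeline: the raw records, the table name `transfers`, and the field names are hypothetical and stand in for whatever a real data source would return.

```python
import sqlite3

# Hypothetical raw records, standing in for the output of an API call.
raw_transfers = [
    {"tx_hash": "0xaa", "token": "dai", "amount": "100.5"},
    {"tx_hash": "0xaa", "token": "dai", "amount": "100.5"},  # duplicate row
    {"tx_hash": "0xbb", "token": "DAI ", "amount": "20"},    # inconsistent casing
    {"tx_hash": "0xcc", "token": "usdc", "amount": None},    # incomplete row
]

def extract():
    """Extract: return the raw records (a real pipeline would query a source)."""
    return list(raw_transfers)

def transform(records):
    """Transform: clean, normalize, and deduplicate the raw records."""
    seen, clean = set(), []
    for record in records:
        if record["amount"] is None:      # data cleaning: drop incomplete rows
            continue
        if record["tx_hash"] in seen:     # deduplication by transaction hash
            continue
        seen.add(record["tx_hash"])
        token = record["token"].strip().upper()  # normalization
        clean.append((record["tx_hash"], token, float(record["amount"])))
    return clean

def load(rows, conn):
    """Load: store the transformed rows so they can be queried."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transfers "
        "(tx_hash TEXT PRIMARY KEY, token TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)

# Descriptive analysis on the loaded data: total amount per token.
totals = conn.execute(
    "SELECT token, SUM(amount) FROM transfers GROUP BY token"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions mirrors the division of work described above: the data engineer can change the source or the storage without touching the cleaning rules agreed with the data scientist.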

This is an iterative process, meaning that many times, especially in step 4 (evaluation), we will receive valuable feedback from the business team about our analysis, and we will change the initial conclusions accordingly. What is true for traditional data science is reinforced for the Web3 industry, as this is one of the industries where data plays a key role in building trust, leading investments, and, in general, unlocking new value.

Although data science is not a programming career, it heavily relies on programming because of the large amount of data available. In this book, we will work with the Python language and some SQL to query databases. Python is a general-purpose programming language commonly used by the data science community, and it is easy to learn due to its simple syntax. An alternative to Python is R, which is a statistical programming language commonly used for data analysis, machine learning, scientific research, and data visualization. A simple way to access Python or R and their associated libraries and tools is to install the Anaconda distribution. It includes popular data science libraries (such as NumPy, pandas, and Matplotlib for Python) and simplifies the process of setting up an environment to start working on data analysis and machine learning projects.

The activities in this book will be carried out in three work environments:

Notebooks: For example, Anaconda Jupyter notebooks or Google Colaboratory (also frequently referred to as Colab). These files are saved in .ipynb format and are very useful for data analysis or training models. We will use Colab notebooks in the machine learning chapters due to the access it provides to GPU resources in its free tier.
IDEs: PyCharm, Visual Studio Code, or any other IDE that supports Python. Their files are saved in .py format and are very useful for building applications. Most IDEs allow the user to download extensions to work with notebook files.
Query platforms: In Chapter 2, we will access on-chain data platforms that have built-in query systems. Examples of those platforms are Dune Analytics, Flipside, Footprint Analytics, and Increment.

Anaconda Jupyter notebooks and IDEs use our computer resources (e.g., RAM), while Google Colaboratory uses cloud services (more on resources can be found in Appendix 1).

Please refer to Appendix 1 to install any of the environments mentioned previously.

Once we have a clean notebook, we will warm up our Python skills with the Chapter01/Python_warm_up notebook, which follows a tutorial by https://learnxinyminutes.com/docs/python/. For a more thorough study of Python, we encourage you to check out Data Science with Python, by Packt Publishing, or Python Data Science Handbook, both of which are listed in the Further reading section of this chapter.

Once we have completed the warm-up exercise, we will initiate the Web3 client using the Web3.py library. Let’s learn about these concepts in the following section.

Understanding the blockchain ingredients

If you have a background in blockchain development, you may skip this section. Web3 represents a new generation of the World Wide Web that is based on decentralized databases, permissionless and trustless interactions, and native payments. This new concept of the internet opens up various business possibilities, some of which are still in their early stages.

Figure 1.1 – Evolution of the web

Currently, we are in the Web2 stage, where centralized companies store significant amounts of data sourced from our interactions with apps. The promise of Web3 is that we will interact with Decentralized Apps (dApps) that store only the relevant information on the blockchain, accessible to everyone.

As of the time of writing, Web3 has some limitations recognized by the Ethereum organization:

Velocity: The speed at which the blockchain is updated poses a scalability challenge. Multiple initiatives are being tested to try to solve this issue.
Intuition: Interacting with Web3 is still difficult to understand. The logic and user experience are not as intuitive as in Web2 and a lot of education will be necessary before users can start utilizing it on a massive scale.
Cost: Recording an entire business process on the chain is expensive. Having multiple smart contracts as part of a dApp costs a lot for the developer and the user.

Blockchain technology is a foundational technology that underpins Web3. It is based on Distributed Ledger Technology (DLT), which stores information once it is cryptographically verified. Once reflected on the ledger, each transaction cannot be modified and multiple parties have a complete copy of it.

Two structural characteristics of the technology are the following:

It is structured as a set of blocks, where each block contains cryptographically hashed information about the previous block (we will learn more about this in this chapter), making it impossible to alter that block at a later stage. Each block is chained to the previous one by this cryptographic hashing mechanism.

Figure 1.2 – Representation of a set of blocks

It is decentralized. The copy of the entire ledger is distributed among several servers, which we will call nodes. Each node has a complete copy of the ledger and verifies consistency every time it adds a new block on top of the blockchain.
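The chaining of blocks can be illustrated with a toy ledger. The sketch below is an assumption-laden simplification: it uses SHA-256 from the standard library, a hypothetical block layout (`number`, `data`, `prev_hash`), and no network or consensus. Real chains use their own hash functions and block formats.

```python
import hashlib
import json

def block_hash(block):
    """Hash the full content of a block (illustrative layout, not a real format)."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def build_chain(payloads):
    """Create blocks where each one stores the hash of the previous block."""
    chain, prev = [], "0" * 64  # placeholder hash for the first block
    for number, data in enumerate(payloads):
        block = {"number": number, "data": data, "prev_hash": prev}
        chain.append(block)
        prev = block_hash(block)
    return chain

def is_consistent(chain):
    """A node's check: every block must point to the hash of its predecessor."""
    return all(
        curr["prev_hash"] == block_hash(prev)
        for prev, curr in zip(chain, chain[1:])
    )

chain = build_chain(["Alice pays Bob 5", "Bob pays Carol 2", "Carol pays Dan 1"])
print(is_consistent(chain))  # the chain as it was built

chain[0]["data"] = "Alice pays Bob 500"  # tamper with an old block
print(is_consistent(chain))  # the next block no longer matches
```

Because the stored hash depends on the whole content of the previous block, changing any old block breaks the link that follows it, which is what each node detects when it verifies its copy.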

This structure provides the solution to double spending, enabling for the first time the decentralized transfer of value through the internet. This is why Web3 is known as the internet of value.

Since the complete version of the ledger is distributed among all the participants of the blockchain, any new transaction that contradicts previously stored information will not be successfully processed (there will be no consensus to add it). This characteristic facilitates transactions among parties that do not know each other without the need for an intermediary acting as a guarantor between them, which is why this technology is known as trustless.

The decentralized storage also takes control away from any single server and, thus, there is no sole authority with sufficient power to change any data point once the transaction is added to the blockchain. Taking down one node will not affect the network, so a hacker who wants to attack the database would require such high computing power that the attempt would be economically unfeasible. This adds a security level that centralized servers do not have.

Three generations of blockchain

The first-generation blockchain is Bitcoin, which is based on Satoshi Nakamoto’s paper Bitcoin: A Peer-to-Peer Electronic Cash System. The primary use case of this blockchain is financial. Although the technology was initially seen as a way to bypass intermediaries such as banks, currently, traditional financial systems and the crypto world are starting to work together, especially with regard to Bitcoin because it is now considered a digital store of value, a sort of digital gold. Notwithstanding the preceding, there are still many regulatory and practical barriers to the integration of the two systems.

The second-generation blockchain added the concept of smart contracts to the database structure described previously, and Ethereum was the first to introduce this. With Ethereum, users can agree on terms and conditions before a transaction is carried out. This chain started the smart contracts era, and as Nick Szabo describes it, the smart contract logic is that of a vending machine that can execute code autonomously, including the management of digital assets, which is a real revolution. To achieve this, the network has an Ethereum Virtual Machine (EVM) that can execute arbitrary code.

Lastly, the third-generation blockchain builds upon the previous generations and aims to solve scalability and interoperability problems. When referring to on-chain data in this book, we will be talking about data generated by the second- and third-generation blockchains that are EVM compatible, as this is where most development is being carried out at the time of writing (e.g., Ethereum, BSC, or Rootstock). Consequently, Bitcoin data and non-EVM structures are not covered.

Introducing the blockchain ingredients

Now, let’s understand some important additional concepts regarding blockchain.

Gas

In order to make a car move forward, we use gas as fuel. This enables us to reach our desired destination, but it comes at a cost. The price of gas fluctuates based on various factors, such as oil prices and transportation costs. The same concept applies to blockchain technology. To save a transaction on a chain, it is necessary to pay for gas. In short, gas is the instruction cost paid to the network to carry out our transactions. The purpose of establishing a cost is twofold: the proceeds of the gas payment go to the miners/validators as a payment for their services and as an incentive to keep participating in the blockchain; it also sets a price for users to be mindful of when using resources, encouraging the use of the blockchain to record only what is worth more than the gas value paid. This concept is universal to all networks we will study in this book.

Gas has several cost implications. As gas is paid in the network’s native coin, if the price of that coin increases, using the network can become excessively expensive, discouraging adoption. This is what happened with Ethereum, which led to multiple changes to its internal rules to solve this issue.

As mentioned earlier, each interaction with the blockchain incurs a cost. Therefore, not everything needs to be stored in it and the adoption of such a database as blockchain needs to be validated by business requirements.

Cryptocurrencies can be divided into smaller units of that cryptocurrency, just like how a dollar can be divided into cents. The smallest unit of a Bitcoin is a Satoshi and the smallest denomination of an Ether is Wei. The following is a chart with the denominations, which will be useful for tracking gas costs.

Unit name	Wei	Ether
Wei	1	10^-18
Kwei	1,000	10^-15
Mwei	1,000,000	10^-12
Gwei	1,000,000,000	10^-9
Microether	1,000,000,000,000	10^-6
Milliether	1,000,000,000,000,000	10^-3
Ether	1,000,000,000,000,000,000	1

Table 1.1 – Unit denominations and their values
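The denominations in the table can be turned into a small helper for tracking gas costs. This is a sketch under assumptions: the fee is modeled simply as gas used multiplied by a gas price in Gwei, 21,000 is the gas consumed by a plain Ether transfer, and the price of 30 Gwei is a hypothetical value chosen for illustration.

```python
from decimal import Decimal

WEI_PER_GWEI = 10**9    # 1 Gwei  = 1,000,000,000 Wei
WEI_PER_ETHER = 10**18  # 1 Ether = 10^18 Wei

def gwei_to_wei(gwei):
    """Convert a Gwei amount to an integer number of Wei."""
    return int(Decimal(str(gwei)) * WEI_PER_GWEI)

def wei_to_ether(wei):
    """Convert Wei to Ether using Decimal to avoid float rounding."""
    return Decimal(wei) / Decimal(WEI_PER_ETHER)

def transaction_fee_wei(gas_used, gas_price_gwei):
    """Simplified fee model: gas used multiplied by the gas price."""
    return gas_used * gwei_to_wei(gas_price_gwei)

# Hypothetical example: a plain transfer at a gas price of 30 Gwei.
fee = transaction_fee_wei(gas_used=21_000, gas_price_gwei=30)
print(fee, "Wei")
print(wei_to_ether(fee), "Ether")
```

Working in integer Wei and converting to Ether only for display avoids the rounding errors that floating-point arithmetic would introduce at 18 decimal places.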

Address

When we use a payment method other than cash, we transmit a sequence of letters or numbers, or a combination of both, to transfer our funds. This sequence of characters allows the entity that holds our funds to identify the country, bank, and account of the recipient. Similarly, an address serves as an identification number on the blockchain. It is a string of letters and numbers that can send or receive cryptocurrency. For example, Ethereum addresses consist of 42 characters: the 0x prefix followed by 40 hexadecimal characters. An address is the public key hash of an asymmetric key pair, which is all the information required by a third party to transfer cryptocurrency. This public key is derived from a private key, but the reverse process (deriving the private key from a public key) cannot be performed. The private key is required to authorize/sign transactions or access the funds stored in the account.
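As a quick illustration of the address format, the following sketch checks whether a string has the shape of an Ethereum address, that is, the `0x` prefix followed by 40 hexadecimal characters. The helper name `looks_like_address` is hypothetical, and the check covers the format only: it does not verify the mixed-case checksum or whether the address has ever been used.

```python
import re

# "0x" followed by exactly 40 hexadecimal characters (42 characters in total).
ADDRESS_PATTERN = re.compile(r"0x[0-9a-fA-F]{40}")

def looks_like_address(value):
    """Format check only; says nothing about ownership or on-chain activity."""
    return isinstance(value, str) and ADDRESS_PATTERN.fullmatch(value) is not None

sample = "0xd5eAc5e5f45ddFCC698b0aD27C52Ba55b98F5653"
print(len(sample), looks_like_address(sample))
print(looks_like_address("0x1234"))  # too short to be an address
```

A check like this is a cheap first filter when cleaning datasets that mix addresses with other identifiers, such as transaction hashes, which are longer.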

Addresses can be classified into two categories: Externally Owned Accounts (EOAs) and contract accounts. Both of them can receive, hold, and send funds and interact with smart contracts. EOAs are owned by users who hold the private key, and users can create as many as they need. Contract accounts are those where smart contracts are deployed and are controlled by their contract code. Another difference between them is the cost of creating them. Creating an EOA does not cost gas, but deploying a contract account does. Only EOAs can initiate transactions.

There is another product in the market known as smart accounts, which leverages the account abstraction concept. The idea behind this development is to let users program more security and better user experiences into their accounts, such as setting rules on daily spending limits or selecting the token to pay for gas. These accounts are programmable smart contracts.

Although the terms “wallet” and “address” are often used interchangeably, there is a technical distinction between them. As mentioned before, an address is the public key hash of an asymmetric key pair. On the other hand, a wallet is the abstract location where the public and private keys are stored together. It is a software interface or application that simplifies interacting with the network and facilitates querying our accounts, transaction signing, and more.

Consensus protocols

When multiple parties work together, especially if they do not know each other, it is necessary to agree on a set of rules to work sustainably. In the blockchain case, it is necessary to determine how to add transactions to a block and alter its state. This is where the consensus protocol comes into play. Consensus refers to the agreement reached by all nodes of the blockchain to change the state of the chain by adding a new block to it. The protocol comprises a set of rules for participation, rewards/penalties to align incentives, and more. The more nodes participate, the more decentralized the network becomes, making it more secure.

Consensus can be reached in several ways, but two main concepts exist in open networks.

Proof of Work (PoW)

This is the consensus protocol used by Bitcoin. It involves solving mathematical puzzles whose difficulty is adjusted periodically according to the computing power competing on the network, so that blocks keep being produced at a roughly constant rate.

Solving these puzzles consumes a lot of energy, resulting in a hardware-intensive competition. Parties trying to solve the puzzle are known as miners.

The winning party finds a number (known as a nonce) that complies with the puzzle rules and informs the other nodes of the answer. The other parties verify that the answer is correct and add the block to their copy of the blockchain. The winning party gets the reward for solving the puzzle, which is a predefined amount of cryptocurrency. This is how the system issues new Bitcoin that has never been spent before; the transaction that creates it is known as a coinbase transaction.
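The search for a valid answer and its verification by other nodes can be sketched with a toy puzzle. This is an illustration under assumptions, not Bitcoin's actual algorithm: the rule used here (a SHA-256 hash that starts with a given number of zero hex digits) and the way the block data and nonce are combined are simplifications.

```python
import hashlib

def puzzle_hash(block_data, nonce):
    """Hash the block data together with a candidate nonce."""
    return hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()

def mine(block_data, difficulty):
    """Miner: try nonces one by one until the hash meets the rule."""
    target_prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = puzzle_hash(block_data, nonce)
        if digest.startswith(target_prefix):
            return nonce, digest
        nonce += 1

def verify(block_data, nonce, difficulty):
    """Other nodes: a single hash is enough to check the answer."""
    return puzzle_hash(block_data, nonce).startswith("0" * difficulty)

block_data = "block 1: Alice pays Bob 5"
nonce, digest = mine(block_data, difficulty=4)
print(nonce, digest)
print(verify(block_data, nonce, difficulty=4))
```

The asymmetry is the point of the design: finding the nonce takes many attempts, and each extra required zero multiplies the expected work by 16, while checking it takes one hash.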

In the Bitcoin protocol, the reward is halved every 210,000 blocks.

Proof of Stake (PoS)

This is the current protocol used by the Ethereum blockchain (up to September 15, 2022, the consensus protocol was PoW) and many others, such as Cardano.

The rationale behind PoS is that parties become validators in the blockchain by staking their own cryptocurrency in exchange for the chance to validate transactions, update the blockchain, and earn rewards. Generally, there is a minimum amount of cryptocurrency that must be staked to become a validator. It is “at stake” because the rules include potential penalizations or “slashing” of the deposited cryptocurrency if the validator (node that processes transactions and adds new blocks to the chain) goes offline or behaves poorly. Slashing means losing a percentage of the deposited cryptocurrency.
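The incentive logic of staking and slashing can be modeled in a few lines. The sketch below is a toy model and not the selection or penalty algorithm of Ethereum or any other network: the validator names, the stake-weighted random choice, and the 5% slashing fraction are all assumptions made for illustration.

```python
import random

# Hypothetical validators and the amount each one has staked.
stakes = {"validator_a": 32.0, "validator_b": 64.0, "validator_c": 32.0}

def pick_validator(stakes, rng):
    """Choose a validator with probability proportional to its stake."""
    names = list(stakes)
    weights = [stakes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def slash(stakes, validator, fraction):
    """Penalize a validator by removing a fraction of its deposit."""
    penalty = stakes[validator] * fraction
    stakes[validator] -= penalty
    return penalty

rng = random.Random(42)  # fixed seed so repeated runs pick the same validator
print("selected:", pick_validator(stakes, rng))

# Hypothetical penalty: validator_b misbehaves and loses 5% of its stake.
penalty = slash(stakes, "validator_b", fraction=0.05)
print("penalty:", penalty, "remaining stake:", stakes["validator_b"])
```

In this model, a larger stake means both a higher chance of being selected and more to lose when slashed, which is how rewards and penalties pull in the same direction.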

As we can see, there are rewards and penalties to align the incentives of all participants toward a single version of the blockchain.

The list of consensus protocols is continuously evolving, reflecting the ongoing search to solve some of the limitations identified in Web3, such as speed or cost. Some alternative consensus protocols include proof of authority – where a small number of nodes have the power to validate transactions and add blocks to the chain – and proof of space – which uses disk space to validate transactions.

Making the first transaction

With these concepts in mind, we will now carry out a transaction in our local environment, using a local Ethereum blockchain provided by Ganache.

To get started, let’s open a local Jupyter notebook and a quick-start version of Ganache.

Here is the information we need:

Figure 1.3 – Ganache main page and relevant information to connect

Let’s look at the code:

Import the Web3.py library:
```
from web3 import Web3
```
Connect to the blockchain running on the port described in our Ganache page (item 1):
```
# RPC server address shown on the Ganache page (item 1)
ganache_url = "http://127.0.0.1:8545"
web3 = Web3(Web3.HTTPProvider(ganache_url))
```

Define the receiving and sending addresses (item 2):

from_account = "0xd5eAc5e5f45ddFCC698b0aD27C52Ba55b98F5653"  # sender
to_account = "0xFfd597CE52103B287Efa55a6e6e0933dff314C63"  # receiver

Define the transaction. In this case, we are transferring 30 ether between the accounts defined previously:

# Note: web3.py v6 and later rename toWei to to_wei
transaction = web3.eth.send_transaction({
  'to': to_account,
  'from': from_account,
  'value': web3.toWei(30, "ether")  # amount expressed in Wei
})

We can review the account balances before and after the transaction with the following code snippet:

# Note: web3.py v6 and later rename these to from_wei and get_balance
print(web3.fromWei(web3.eth.getBalance(from_account), 'ether'))
print(web3.fromWei(web3.eth.getBalance(to_account), 'ether'))

Congratulations! If you have never before transferred value on a blockchain, you have achieved your first milestone. The complete code can be found in Chapter01/First_transaction.

A word on CBDC

What is CBDC? The acronym stands for Central Bank Digital Currency. It is a new form of electronic money issued by the central banks of countries.

Many countries are at different stages in this roadmap. On January 20, 2022, the Federal Reserve Board issued discussion papers about CBDC, and prior to the COVID-19 pandemic, it had already reported on ongoing research regarding the benefits that a CBDC could bring to its system. As of July 2022, there were 100 CBDCs in research and development. Countries are looking for the best infrastructure, studying the impact on their communities, and are mindful of a new range of risks that this new way of transferring value will pose to financial systems that may be reluctant to change.

Some of the concepts that we have covered in this chapter will be useful for the CBDC era, but depending on the project and its characteristics, not all of them will be present. It will be especially interesting to see how they solve centralization issues. A very informative tracker on the status of the projects is available at the following link: https://cbdctracker.org/.

In this section, we analyzed the fundamentals of blockchain technology, including key concepts such as gas, addresses, and consensus protocols, and explored the evolution of Web3. We also executed a transaction using Ganache and Web3.py.

With this basic understanding of the transaction flow, we will now shift our focus toward analyzing initial metrics and gaining a better understanding of the data challenges in this industry.

Approaching Web3 industry metrics

Some metrics are fairly standard on every Web3 dashboard, and we review them in this section. However, this is just a basic layer, and each player in the industry will add additional metrics relevant to them.

To extract information from the Ethereum blockchain, we need to establish a connection to the blockchain through a node that holds a copy of it. There are multiple ways to connect to the blockchain, which we will explore in more detail in Chapter 2. For the following metrics, we will make use of Infura. For a step-by-step guide to connect to Infura, refer to Appendix 1.

Block height

This refers to the current block on the blockchain. The Genesis block is commonly referred to as block 0 and subsequent blocks are numbered accordingly. To check the block height, use the following code snippet:

web3.eth.blockNumber  # web3.eth.block_number in web3.py v6 and later

The block number can be used as the ID of the block. Tracking it can be useful to determine the number of confirmations a transaction has, which is equivalent to the number of additional blocks that were mined or added after the block of interest. The deeper a transaction is in the chain, the safer and more irreversible it becomes.
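Following the definition above, the number of confirmations can be computed from two block numbers. This is a minimal sketch: the function names are hypothetical, the block numbers would come from a node in practice, and the threshold of 12 confirmations is an arbitrary value chosen for the example rather than a recommendation.

```python
def confirmations(tx_block_number, latest_block_number):
    """Blocks added after the block that contains the transaction."""
    if latest_block_number < tx_block_number:
        raise ValueError("transaction block is ahead of the latest block")
    return latest_block_number - tx_block_number

def has_enough_confirmations(tx_block_number, latest_block_number, required=12):
    """Compare the confirmations against a threshold chosen by the analyst."""
    return confirmations(tx_block_number, latest_block_number) >= required

# Hypothetical block numbers for illustration.
print(confirmations(tx_block_number=100, latest_block_number=112))
print(has_enough_confirmations(tx_block_number=100, latest_block_number=105))
```

Some tools count the block that includes the transaction as the first confirmation, so their figure is one higher than the one computed here.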

Time

When discussing time in the context of blockchain, two concepts need to be taken into account. The first is the time between blocks, which varies depending on the blockchain. In Ethereum, since the switch to PoS, there are 12-second slots. Each validator is given a slot to propose a block during that time, and if all validators are online, there will be no empty slots, resulting in a new block being added every 12 seconds. The second concept is the timestamp for when a block was added to the blockchain, which is typically stored in Unix timestamp format. The Unix timestamp is a way of tracking the time elapsed as a running total of seconds from January 1, 1970, in UTC.

To extract the block timestamps, use the following code snippet:

web3.eth.get_block('latest').timestamp
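Since block timestamps arrive as Unix timestamps, a typical first step is converting them to readable dates. The sketch below uses only the standard library; the two timestamp values are hypothetical examples picked 12 seconds apart to mirror consecutive slots.

```python
from datetime import datetime, timezone

def block_time(unix_timestamp):
    """Convert a Unix timestamp (seconds since January 1, 1970, UTC) to a datetime."""
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)

# Hypothetical timestamps of two consecutive blocks.
first = block_time(1663224179)
second = block_time(1663224191)

print(first.isoformat())                   # 2022-09-15T06:42:59+00:00
print((second - first).total_seconds())    # 12.0
```

Passing `tz=timezone.utc` explicitly keeps the result independent of the time zone of the machine running the notebook, which matters when comparing activity across regions.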

Tokenomics

Tokenomics refers to the characteristics of the internal economy of token projects on the blockchain, including supply, demand, and inflation. This involves determining how many digital assets will be issued, whether there is a cap on the total supply, the use cases of the token, and the burning scheme to control the number of assets in circulation.

The token white paper typically contains the official explanation for basic tokenomics questions.

Bitcoin tokenomics

The Bitcoin supply is capped at 21 million Bitcoins, and this amount cannot be exceeded. New Bitcoin enters circulation through mining, and miners are rewarded each time they successfully add a block to the chain.

A new block is mined approximately every 10 minutes. At that pace, and given the reward schedule described next, all 21 million Bitcoins are expected to be in circulation by 2140.

The number of Bitcoins rewarded is halved every time 210,000 blocks are mined, resulting in a halving approximately every four years. Once all 21 million Bitcoins have been mined, miners will no longer receive block rewards and will rely solely on transaction fees for revenue.
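The supply cap follows from the halving schedule and can be checked with a few lines of arithmetic. The sketch assumes an initial block reward of 50 Bitcoins, a figure not stated in the text above, and works in Satoshis (1 Bitcoin = 100,000,000 Satoshis) so that each halving is an integer division.

```python
SATOSHI_PER_BTC = 100_000_000
BLOCKS_PER_HALVING = 210_000
INITIAL_REWARD_SAT = 50 * SATOSHI_PER_BTC  # assumed initial reward: 50 BTC

def total_supply_satoshi():
    """Add up the rewards of every halving period until the reward reaches zero."""
    total, reward = 0, INITIAL_REWARD_SAT
    while reward > 0:
        total += reward * BLOCKS_PER_HALVING
        reward //= 2  # the reward is halved, rounding down to whole Satoshis
    return total

def years_per_halving(minutes_per_block=10):
    """Approximate duration of one halving period at a given block time."""
    return BLOCKS_PER_HALVING * minutes_per_block / (60 * 24 * 365)

supply_btc = total_supply_satoshi() / SATOSHI_PER_BTC
print(supply_btc)            # slightly below 21 million
print(years_per_halving())   # close to four years
```

Because the reward is rounded down at each halving, the sum ends up marginally under 21 million rather than exactly on it.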

Tokens, and therefore their tokenomics, play a fundamental role in the functioning and sustainability of DeFi platforms. One of the industries most impacted by this technology is the financial industry, which has given birth to a new concept known as Decentralized Finance, or DeFi. DeFi consists of peer-to-peer financial solutions built on public blockchains. These initiatives offer services that are similar to those offered by banks and other financial institutions, such as earning interest on deposits, lending, and trading assets, without the intervention of banks or other centralized financial institutions. This is achieved through a set of smart contracts (or protocols) that are open to anyone with an address to participate.

One concrete example of DeFi is Aave, a lending and borrowing platform that allows users to lend and borrow various cryptocurrencies without intermediaries such as banks. For instance, if Jane wants to borrow 10 ETH, she can go to Aave, deposit collateral, and borrow from a liquidity pool where lenders deposit their cryptocurrencies to earn interest on them. The borrowed ETH is lent with an interest rate percentage that reflects supply and demand levels in that pool. With Aave’s decentralized platform, borrowers and lenders transact through smart contracts without needing to go through a traditional financial institution.

We will dive deep into DeFi in Chapter 5.

Total Value Locked (TVL)

TVL refers to the total value of assets currently locked in a specific DeFi protocol. It gauges the health of a protocol by the amount of money users commit to it. The TVL increases when users deposit more assets in the protocol and decreases when they withdraw them. It is calculated by multiplying the amount of each asset locked in the protocol by its current price and summing the results.
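As a minimal sketch of that calculation, the following snippet multiplies each locked amount by its price and adds up the results. The asset names, balances, and prices are hypothetical values chosen for illustration, not data from any real protocol.

```python
def total_value_locked(balances, prices):
    """Sum amount * price over every asset locked in the protocol."""
    return sum(amount * prices[symbol] for symbol, amount in balances.items())


# Hypothetical snapshot of a protocol's locked assets and their USD prices.
locked_balances = {"ETH": 100.0, "USDC": 50_000.0}
usd_prices = {"ETH": 2_000.0, "USDC": 1.0}

tvl = total_value_locked(locked_balances, usd_prices)
print(f"TVL: ${tvl:,.2f}")
```

In a real workflow, the balances would come from the protocol's contracts and the prices from a price feed, so the result moves with both deposits and market prices.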

Different DeFi protocols may have specific ways of measuring their TVL, and accurately calculating it requires an understanding of how each protocol works. A website that specializes in measuring TVL is DefiLlama (available at https://defillama.com/).

TVL also helps traders judge whether a certain token is undervalued by dividing that number by the market cap (the circulating supply multiplied by the current price) of the token issued by said protocol.

This metric helps compare DeFi protocols with each other.

Total market cap

Market capitalization represents the size of the market for a certain token and is closely related to traditional financial concepts. It is calculated by multiplying the number of coins or tokens in circulation by their current market price. The circulating supply is the sum of tokens currently held by public holders. To get this number, we sum the tokens held in all addresses other than the minting and burning addresses and subtract the tokens held by addresses that we know are controlled by the protocol or are allocated to the development team, investors, and so on.

The max supply or total supply of tokens is the total number of tokens that will be issued by a certain smart contract. Multiplying the max supply by the current price will result in a fully diluted market cap. There are two ways to get the total supply: with state data or with transactional data.
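The difference between the market cap and the fully diluted market cap can be sketched as follows. The addresses, balances, price, and max supply are hypothetical placeholders; in practice, deciding which addresses to exclude is the hard part and depends on each project.

```python
ZERO_ADDRESS = "0x0000000000000000000000000000000000000000"


def circulating_supply(balances, excluded):
    """Add up balances of every address that is not explicitly excluded."""
    return sum(amount for address, amount in balances.items()
               if address not in excluded)


# Hypothetical token balances per address (labels instead of real addresses).
balances = {
    ZERO_ADDRESS: 1_000,    # burned tokens
    "0xTreasury": 20_000,   # controlled by the protocol
    "0xTeam": 10_000,       # allocated to the development team
    "0xAlice": 5_000,       # public holder
    "0xBob": 4_000,         # public holder
}
excluded = {ZERO_ADDRESS, "0xTreasury", "0xTeam"}

price_usd = 2.5       # hypothetical market price
max_supply = 100_000  # hypothetical cap defined by the contract

circulating = circulating_supply(balances, excluded)
market_cap = circulating * price_usd
fully_diluted_market_cap = max_supply * price_usd

print(f"Circulating supply: {circulating:,}")
print(f"Market cap: ${market_cap:,.2f}")
print(f"Fully diluted market cap: ${fully_diluted_market_cap:,.2f}")
```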

In Chapter 2, we will learn how to access state data, as token smart contracts have a function that can be queried with Web3.py. To do this, we will need the Application Binary Interface (ABI) of the smart contract and a connection to a node, such as Infura. Example code for this can be found in Chapter01/Relevant metrics II.ipynb.
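To show what such a state query looks like under the hood, the sketch below does not use Web3.py. Instead, it builds the raw JSON-RPC eth_call request with the standard library, so that it has no third-party dependencies. It hard-codes the 4-byte selector of the ERC-20 totalSupply() function rather than loading an ABI. The helper names are ours, and the node URL mentioned in the comment is a placeholder that must be replaced with your own endpoint.

```python
import json
import urllib.request

# First 4 bytes of keccak256("totalSupply()"), the ERC-20 function selector.
TOTAL_SUPPLY_SELECTOR = "0x18160ddd"


def build_total_supply_call(token_address, request_id=1):
    """Build the JSON-RPC payload for an eth_call to totalSupply()."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_call",
        "params": [{"to": token_address, "data": TOTAL_SUPPLY_SELECTOR}, "latest"],
    }


def decode_uint256(hex_result, decimals):
    """Convert the hex-encoded uint256 returned by the node into token units."""
    return int(hex_result, 16) / 10 ** decimals


def fetch_total_supply(node_url, token_address, decimals):
    """Send the request to a node and return the decoded total supply."""
    payload = json.dumps(build_total_supply_call(token_address)).encode("utf-8")
    request = urllib.request.Request(
        node_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    return decode_uint256(body["result"], decimals)


# Offline demonstration of the decoding step with a made-up node response.
sample_result = "0x" + "0" * 58 + "0f4240"  # 1,000,000 base units
print(decode_uint256(sample_result, decimals=6))

# With a real endpoint (placeholder URL, not executed here):
# fetch_total_supply("https://mainnet.infura.io/v3/<YOUR_PROJECT_ID>",
#                    "0xdac17f958d2ee523a2206206994597c13d831ec7", decimals=6)
```

Web3.py wraps these steps (encoding the call from the ABI, sending it, and decoding the answer), which is why it is the more convenient choice for day-to-day work.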

Another way is to access the transactional data and calculate the total supply with SQL by summing all the minting events of a smart contract and subtracting the burning events. We will learn how to do this in Chapter 5.
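A minimal sketch of the transactional approach follows, using an in-memory SQLite database driven from Python. It assumes the common ERC-20 convention in which a mint is a transfer from the zero address and a burn is a transfer to it; some tokens emit dedicated events instead, so this convention must be checked for each contract. The table layout and the rows are hypothetical.

```python
import sqlite3

ZERO = "0x0000000000000000000000000000000000000000"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfers (from_address TEXT, to_address TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO transfers VALUES (?, ?, ?)",
    [
        (ZERO, "0xAlice", 1_000),   # mint
        (ZERO, "0xBob", 500),       # mint
        ("0xAlice", "0xBob", 200),  # ordinary transfer, does not change supply
        ("0xBob", ZERO, 300),       # burn
    ],
)

# Total supply = everything minted minus everything burned.
query = """
SELECT
    COALESCE(SUM(CASE WHEN from_address = :zero THEN value ELSE 0 END), 0)
  - COALESCE(SUM(CASE WHEN to_address = :zero THEN value ELSE 0 END), 0)
FROM transfers
"""
(total_supply,) = conn.execute(query, {"zero": ZERO}).fetchone()
print(f"Total supply from events: {total_supply}")
conn.close()
```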

The market cap value is dynamic, as it can change as the market price and the supply of tokens fluctuate. A token market cap is widely used in the cryptocurrency industry as a benchmark for the performance of different tokens.

In Chapter01/Relevant metrics II.ipynb, we analyze the Wrapped BTC (WBTC) token, which is one of those cases where the TVL and total market cap coincide, as the token is pegged 1:1 with the collateral.

One of the biggest challenges data scientists will face is agreeing on common definitions and finding trustworthy data sources. We may have a good grasp of mathematical formulas to calculate complex financial indicators, but without reliable data and community consensus on standards, our ability to communicate our findings will be constrained. In the next section, we will explore these challenges and discuss ways to overcome them.

Data quality challenges

In this section, we will discuss the challenges of data quality, which are not unique to Web3 but relevant to all professionals who make decisions based on data. Data quality challenges range from access to incomplete, inaccurate, or inconsistent data to matters of data security, privacy, or governance. However, one of the most important challenges that a Web3 data analyst will face is the reliability of sources.

For instance, the market cap is the result of a simple multiplication of two data sources: the blockchain data that informs the total supply of tokens in circulation and the market price. However, the result of such multiplication varies depending on the source. Let’s take an example of the market cap for USDT. In one source, the following information appears:

Figure 1.4 – USDT market cap information (source: https://etherscan.io/token/0xdac17f958d2ee523a2206206994597c13d831ec7)

On the CoinMarketCap website, for the same token, the fully diluted market cap is $70,158,658,274 (https://coinmarketcap.com/currencies/tether/).

As we see from the example, the same concept is shown differently depending on the source we review. So, how do we choose when we have multiple sources of information?

The most trustworthy and comprehensive source of truth regarding blockchain activity is the full copy of a node. Accessing a node ensures that we will always have access to the latest version of the blockchain. Some services index the blockchain to facilitate access and querying, such as Google BigQuery, Covalent, or Dune, continuously updating their copies. These copies are controlled and centralized.

When it comes to prices, there are numerous sources. A common approach to sourcing prices is connecting to an online marketplace for cryptocurrencies, commonly known as an exchange, such as Binance or Kraken, to extract its market prices. However, trading in these markets can be halted for various reasons. For example, during the well-known TerraUSD (UST) de-peg incident, when the stablecoin lost its 1:1 peg to the US dollar, many exchanges halted trading, citing consumer protection concerns. If our workflow relies on such data, it can be disrupted or show inaccurate old prices. To solve this issue, it is advisable to source prices from providers that average the prices from multiple exchanges, providing more robust information.
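The idea of combining several exchanges can be sketched as follows. This version discards quotes that are older than a chosen age and then takes the median rather than a plain average, an assumption made here because the median is less sensitive to a single outlier. The exchange names, prices, timestamps, and the 60-second limit are all hypothetical.

```python
from statistics import median


def aggregate_price(quotes, now, max_age_seconds=60):
    """Return the median of all quotes that are recent enough to trust."""
    fresh = [q["price"] for q in quotes
             if now - q["timestamp"] <= max_age_seconds]
    if not fresh:
        raise ValueError("no fresh quotes available")
    return median(fresh)


# Hypothetical quotes; timestamps are seconds on an arbitrary clock.
quotes = [
    {"exchange": "exchange_a", "price": 1.001, "timestamp": 1_000},
    {"exchange": "exchange_b", "price": 0.999, "timestamp": 995},
    {"exchange": "exchange_c", "price": 1.000, "timestamp": 990},
    {"exchange": "exchange_d", "price": 0.350, "timestamp": 100},  # halted, stale
]

price = aggregate_price(quotes, now=1_000)
print(f"Aggregated price: {price}")
```

Raising an error when no fresh quote exists is a deliberate choice in this sketch: a pipeline that stops is usually easier to notice than one that silently keeps using an old price.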

At this stage, it is crucial to understand what constitutes quality for our company. Do we prioritize fast and readily available information updated by the second, or do we value highly precise information with relatively slower access? While it may not be necessary to consider this for every project, deciding on certain sources and standardizing processes will save us time in the future.

Once we have determined the quality of the information we will consume, we need to agree on the concepts we want to analyze.

Data standards challenges

As a young industry, there is still no complete consensus on the meaning of many concepts. Let’s examine a few examples.

Retail

Within the cryptocurrency space, there is a complete aquatic ecosystem used to categorize addresses based on the amount of cryptocurrencies they hold. Larger addresses are often referred to as “whales,” while smaller addresses have their own names. Please refer to the following illustration for reference:

Figure 1.5 – Sizes of crypto holdings and their aquatic equivalent

While there is a consensus on the aquatic equivalents used for categorization, there is no unified agreement on the specific number of Bitcoins that each category represents. A quick Google search will reveal varying criteria for what constitutes an address in one category or another.

Another classification, which is particularly valuable for analysts, distinguishes between retail addresses (small investors) and professional addresses. The challenge lies in determining the threshold that distinguishes one from the other. Various approaches are in use, and we can follow the aquatic equivalent definition as mentioned previously or opt for the definition proposed by a forensic company called Chainalysis, which states: “Retail traders (…) deposit less than $10,000 USD worth of Bitcoin at a time on exchanges.”
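Whichever threshold we adopt, it helps to keep it as an explicit parameter so that it can be changed when the definition does. The sketch below applies the $10,000 figure quoted above to a single deposit; the function name, the "professional" label, and the sample price are our own assumptions.

```python
RETAIL_THRESHOLD_USD = 10_000  # threshold from the definition quoted above


def classify_deposit(amount_btc, btc_price_usd,
                     threshold_usd=RETAIL_THRESHOLD_USD):
    """Label a deposit as retail or professional by its USD value."""
    value_usd = amount_btc * btc_price_usd
    return "retail" if value_usd < threshold_usd else "professional"


# Hypothetical deposits at a hypothetical price of $30,000 per Bitcoin.
print(classify_deposit(0.1, 30_000))
print(classify_deposit(1.0, 30_000))
```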

Confirmations

In a traditional bank or centralized organization, when a user sends a transaction, once received, it is confirmed and can be considered complete. In the blockchain space, the decentralized nature of the network introduces a different dynamic and, consequently, it is common to require a certain number of confirmations, which may vary with the amount transferred, before a transaction is considered valid.

Within a decentralized network, it is entirely possible for two blocks to be mined simultaneously in different parts of the world. The protocol waits for the next block to be mined and, depending on where it attaches, determines which chain is the longest (the longest chain is deemed the valid one). A block in Bitcoin that doesn’t become part of the longest chain is referred to as a stale block, while in Ethereum (prior to the merge), they were known as uncle blocks. Once a transaction is incorporated into a block, it is assigned one confirmation. If our transaction finds its way into a stale block, it will be reversed and return to the mempool in Bitcoin or be added to another block in Ethereum, resuming the count of confirmations.

Given the possibility of reversal, however slim, it has become customary for transaction counterparties to request a certain number of confirmations before accepting that their transactions are irreversible. The longer the chain grows after the block that included our transaction, the less likely it is to be reversed. Following the merge in Ethereum (which took place at block 15537394), uncle blocks ceased to be generated, but some of these practices persist among market participants.

There are no universal standards for the number of confirmations required. Recommendations can vary, with some suggesting six confirmations for Bitcoin and only two for small transfer amounts. For Ethereum, the range was typically between 20 and 40 confirmations. Notably, centralized exchanges such as Coinbase may still require two confirmations for Bitcoin and 35 for Ethereum.
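Counting confirmations is simple arithmetic on block heights: the block that includes the transaction counts as the first confirmation, and every block mined on top of it adds one more. The following sketch encodes that rule with a configurable policy. The default thresholds reuse two of the figures mentioned above, and the function names are ours.

```python
# Example policy built from figures mentioned in the text; adjust as needed.
REQUIRED_CONFIRMATIONS = {"bitcoin": 6, "ethereum": 35}


def confirmations(tx_block_height, chain_tip_height):
    """Number of confirmations for a transaction, or 0 if it is not mined yet."""
    if tx_block_height is None or tx_block_height > chain_tip_height:
        return 0
    return chain_tip_height - tx_block_height + 1


def is_settled(network, tx_block_height, chain_tip_height,
               policy=REQUIRED_CONFIRMATIONS):
    """Check whether a transaction meets the confirmation policy of a network."""
    return confirmations(tx_block_height, chain_tip_height) >= policy[network]


# A transaction mined in block 100 while the chain tip is at block 105.
print(confirmations(100, 105))
print(is_settled("bitcoin", 100, 105))
print(is_settled("ethereum", 100, 105))
```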

Figure 1.6 – Stale and valid transactions

NFT Floor Price

The NFT Floor Price serves as a metric for determining the minimum price at which any NFT within a collection can be sold, providing market participants with valuable insights into a project’s fair pricing.

There is no universally accepted method for its calculation. One basic approach involves finding the minimum price at which an NFT within a collection has previously been sold. However, due to the presence of multiple marketplaces, each with its unique pricing structure, an alternative approach is to consider prices from the most prominent art marketplaces or to aggregate prices from various sources, giving more weight to the significant marketplaces.

Furthermore, it is crucial to account for practices such as wash trading, which artificially inflate the metric under analysis. We will analyze this concept further in Chapter 4.
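Two of the approaches described in this section can be compared with a small sketch: the plain minimum across all sales, and a weighted aggregate of the per-marketplace minimums. The marketplaces, prices, and weights are hypothetical, and the sketch does not filter out wash trades, which a production metric would need to do.

```python
def floor_per_marketplace(sales):
    """Return the lowest sale price observed on each marketplace."""
    floors = {}
    for sale in sales:
        market = sale["marketplace"]
        floors[market] = min(sale["price_eth"],
                             floors.get(market, sale["price_eth"]))
    return floors


def weighted_floor(floors, weights):
    """Average the per-marketplace floors, weighting larger venues more."""
    total_weight = sum(weights[market] for market in floors)
    return sum(floors[m] * weights[m] for m in floors) / total_weight


# Hypothetical sales of one collection on two marketplaces.
sales = [
    {"marketplace": "marketplace_a", "price_eth": 10.0},
    {"marketplace": "marketplace_a", "price_eth": 12.0},
    {"marketplace": "marketplace_b", "price_eth": 8.0},
    {"marketplace": "marketplace_b", "price_eth": 9.0},
]
weights = {"marketplace_a": 0.75, "marketplace_b": 0.25}  # assumed importance

floors = floor_per_marketplace(sales)
simple_floor = min(floors.values())
aggregated_floor = weighted_floor(floors, weights)

print(f"Simple floor: {simple_floor} ETH")
print(f"Weighted floor: {aggregated_floor} ETH")
```

The two numbers differ, which illustrates why a floor price should always be reported together with the method used to compute it.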

The concept of “lost”

Suppose we need to calculate the circulating supply of Bitcoins for the next five years. For such a calculation, we must take into account not only how much will be mined but also how many Bitcoins can be considered “lost.” How do we determine that a certain amount of crypto is lost?

To move assets on the blockchain, we need to sign the transfer with our private key. If that private key is lost, we cannot access those assets, and therefore, those assets have to be counted as lost. With this information in mind, it is safe to assume that some of the Bitcoin supplied as of today is already lost or will be. When reading the blockchain, we can see those funds in possession of a certain address, but it is entirely possible that such an address is unable to dispose of them. Since this is a pseudo-anonymous system, we cannot contact the Bitcoin holders and ask them to verify whether they have access to their funds; there is no centralized way to do it.

The forensics company named Chainalysis proposed the definition that “Bitcoin that has not moved for five years now is considered lost.” The consequence of such a definition is that 20% of the mined bitcoins would be lost. This is a proposed concept, and it is yet to be seen whether it becomes a standard.
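Applying such a definition to data is a matter of comparing the last movement date of each holding with a cutoff. The sketch below uses hypothetical holdings and an arbitrary reference date, and keeps the number of years as a parameter because the definition is not yet a standard.

```python
from datetime import date


def lost_supply(holdings, as_of, years=5):
    """Sum the balances that have not moved for at least `years` years."""
    try:
        cutoff = as_of.replace(year=as_of.year - years)
    except ValueError:  # as_of is Feb 29 and the target year is not a leap year
        cutoff = as_of.replace(year=as_of.year - years, day=28)
    return sum(h["balance_btc"] for h in holdings if h["last_moved"] < cutoff)


# Hypothetical holdings with the date of their last outgoing movement.
holdings = [
    {"address": "addr_1", "balance_btc": 50.0, "last_moved": date(2010, 5, 1)},
    {"address": "addr_2", "balance_btc": 2.0, "last_moved": date(2023, 6, 1)},
    {"address": "addr_3", "balance_btc": 10.0, "last_moved": date(2018, 12, 31)},
]

reference_date = date(2024, 1, 1)
lost = lost_supply(holdings, as_of=reference_date)
total = sum(h["balance_btc"] for h in holdings)

print(f"Considered lost: {lost} BTC ({lost / total:.1%} of the sample)")
```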

In conclusion, we can agree on three ideal ways of approaching data in Web3:

Deep dive into the metrics that will be available in our dashboards or the data that will be consumed by our model. Read the concepts and documentation thoroughly.
Be open to finding different approaches to the same market subject. Since the industry is growing, there is no established way of doing some things.
Be prepared to witness concepts change as the industry matures and fine-tunes its best practices.

To understand the technical aspects of smart contracts, the OpenZeppelin documentation is a valuable reference. Similarly, for market-related concepts, as mentioned previously, Chainalysis defines many concepts and can help as a starting point.

A brief overview of APIs

APIs, or application programming interfaces, facilitate communication between two software services through a series of requests and responses. For instance, when we receive a notification about a token’s price drop in our telephone app, it means that our app is communicating with a price provider such as CoinMarketCap via an API. To structure a request for the desired response, we must always refer to the relevant API documentation.

For a more comprehensive understanding of APIs, we can find additional information in the book Python API Development Fundamentals by Packt Publishing. Since we will frequently interact with APIs to extract information, it’s beneficial to review the primary characteristics of different APIs. This will greatly assist us when we aim to programmatically access information.

For this purpose, we will focus on the following:

Remote Procedure Call (RPC) APIs: In RPC APIs, the client initiates a function on the server, and the server sends back the output. In practice, we include the method (endpoint) of the function in the URL and the arguments in the query string. In this case, the client needs to possess all the information about the endpoints, and sometimes it involves constructing a workflow with information queried from other URLs. An example of an RPC API encoded in JSON format is the Infura suite, which we have utilized in previous sections; with JSON-RPC, the method name and its parameters travel in the JSON body of the request rather than in the URL.
Representational State Transfer (REST) API: This API is stateless, meaning that it does not save the client’s data between requests. This is one of the most popular methods on the market because it is lightweight and easy to maintain and scale.
The client sends requests to the server in the form of a web URL, using methods such as GET, POST, PUT, or DELETE, and the server responds, typically in JSON format. A REST request typically comprises the HTTP method, endpoint, headers, and body. The endpoint identifies the resource online, the headers give the server additional information, such as authentication credentials, and the body contains the information the sender wishes to transmit to the server as a request.
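The four parts of a REST request can be seen in a short sketch that builds a request object with the standard library without sending it. The base URL, path, and API key are hypothetical placeholders, and bearer-token authentication is only one of several schemes a provider may use, so the real values must come from the documentation of the API in question.

```python
import json
import urllib.request


def build_rest_request(base_url, path, api_key, method="GET", body=None):
    """Assemble the method, endpoint, headers, and optional body of a request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {
        "Accept": "application/json",          # ask for a JSON response
        "Authorization": f"Bearer {api_key}",  # assumed authentication scheme
    }
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


# Hypothetical endpoint; nothing is sent over the network here.
request = build_rest_request(
    "https://api.example.com", "/v1/tokens/usdt/price", api_key="YOUR_API_KEY"
)

print(request.get_method(), request.full_url)
print(request.header_items())
```

Passing the resulting object to urllib.request.urlopen would send it; separating the construction from the sending makes it easy to inspect a request before spending API credits on it.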

An alternative approach was developed by Facebook as an open source query language named GraphQL. The key difference from the aforementioned APIs is that GraphQL is a query language, whereas REST is an architectural concept for software. GraphQL is a syntax for data retrieval that empowers the client to specify the required information, unlike the REST infrastructure, where queries return fixed datasets for each endpoint (sometimes including more information than necessary).

A noteworthy feature of GraphQL is its ability to construct requests that fetch data from multiple resources using a single API call. The Graph is an indexer and query protocol for the Ethereum network that is queried using GraphQL; we will delve into it further in Chapter 2.
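To illustrate how the client specifies exactly which fields it wants, the sketch below builds the JSON body that a GraphQL endpoint typically receives in a POST request. The entity and field names (swaps, amountUSD, pool, and so on) belong to a hypothetical schema invented for this example; a real subgraph publishes its own schema, which must be consulted before writing a query.

```python
import json

# Hypothetical schema: one call returns swaps plus nested pool and token data.
QUERY = """
query LatestSwaps($first: Int!) {
  swaps(first: $first, orderBy: timestamp, orderDirection: desc) {
    id
    amountUSD
    pool {
      id
      token0 { symbol }
      token1 { symbol }
    }
  }
}
"""


def build_graphql_payload(query, variables):
    """Serialize a GraphQL query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables})


payload = build_graphql_payload(QUERY, {"first": 5})
print(payload)
```

Note how a single request asks for swaps, their pools, and the tokens of each pool, data that a REST API would usually spread over several endpoints.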