You're reading from Developing Kaggle Notebooks
1st Edition, published in Dec 2023 by Packt (ISBN-13: 9781805128519)

Author: Gabriel Preda

Dr. Gabriel Preda is a Principal Data Scientist for Endava, a major software services company. He has worked on projects in various industries, including financial services, banking, portfolio management, telecom, and healthcare, developing machine learning solutions for various business problems, including risk prediction, churn analysis, anomaly detection, task recommendations, and document information extraction. In addition, he is very active in competitive machine learning, currently holding the title of a three-time Kaggle Grandmaster and is well-known for his Kaggle Notebooks.

Analyzing Acoustic Signals to Predict the Next Simulated Earthquake

In the previous chapters, we explored tabular data, covering categorical, ordinal, and numerical features, as well as text, geographical coordinates, and imagery. The current chapter shifts our focus to a different data category: simulated or experimental signal data. This data type often appears in a range of formats beyond the standard CSV file.

Our primary case study will be data from the LANL Earthquake Prediction Kaggle competition (see Reference 1). I contributed to this competition with a widely recognized and frequently forked notebook titled LANL Earthquake EDA and Prediction (see Reference 2), which will serve as the foundational resource for this chapter’s principal notebook. We’ll then delve into feature engineering, employing a variety of signal analysis techniques vital for developing a predictive model for the competition. Our goal will...

Introducing the LANL Earthquake Prediction competition

The LANL Earthquake Prediction competition centers on utilizing seismic signals to determine the precise timing of a laboratory-induced earthquake. Currently, predicting natural earthquakes remains beyond the reach of our scientific knowledge and technological capabilities. The ideal scenario for scientists is to predict the timing, location, and magnitude of such an event.

Simulated earthquakes, however, created in highly controlled laboratory environments, mimic real-world seismic activity. These simulations enable attempts to forecast lab-generated quakes using the same types of signals observed in natural settings. In this competition, participants use an acoustic input signal to estimate the time until the next artificial earthquake occurs, as detailed in Reference 3. The challenge is to predict the timing of the earthquake, addressing one of the three critical unknowns in earthquake forecasting: when it will happen...

Formats for signal data

Several Kaggle competitions have used sound data in addition to regular tabular features. The Cornell Lab of Ornithology organized three BirdCLEF (LifeCLEF Bird Recognition Challenge) competitions, in 2021, 2022, and 2023, for predicting bird species from samples of bird songs (see Reference 4 for an example of one of these competitions). The format used in these competitions was .ogg, which stores audio data using less bandwidth and is considered technically superior to the .mp3 format.

We can read these types of file formats using the librosa library (see Reference 5). The following code can be used to load an .ogg file and display the sound wave:

import matplotlib.pyplot as plt
import librosa

def display_sound_wave(sound_path=None,
                       text="Test",
                       color="green"):
    """
    Display a sound wave
    Args
        sound_path: path to the sound file
        text: plot title
        color: waveform color
    """
    # load the audio file; librosa resamples to 22,050 Hz by default
    data, sampling_rate = librosa.load(sound_path)
    # plot the amplitude of the signal over time
    plt.plot(data, color=color)
    plt.title(text)
    plt.show()

Exploring our competition data

The LANL Earthquake Prediction dataset consists of the following data:

  • A train.csv file, with two columns only:
    • acoustic_data: This is the amplitude of the acoustic signal.
    • time_to_failure: This is the time to failure corresponding to the current data segment.
  • A test folder with 2,624 files with small segments of acoustic data.
  • A sample_submission.csv file; for each test file, competitors need to provide an estimate of the time to failure.

The training data (9.56 GB) contains 692 million rows. The continuous variation of the time_to_failure values reflects the constant sampling rate of the training data. The acoustic data consists of integer values, ranging from -5,515 to 5,444, with an average of 4.52 and a standard deviation of 10.7 (values oscillating around 0). The time_to_failure values are real numbers, ranging from 0 to 16, with a mean of 5.68 and a standard deviation of 3.67...
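Given the 9.56 GB size of train.csv, it helps to pass explicit, compact dtypes to pandas when loading it. The column names come from the data description above; the ranges quoted there mean acoustic_data fits in an int16. A minimal sketch (using a tiny, hypothetical in-memory CSV as a stand-in for the real file):

```python
import io
import pandas as pd

# explicit dtypes roughly halve memory versus pandas' int64/float64 defaults;
# acoustic_data fits in int16 (the observed range is about -5,515 to 5,444)
dtypes = {"acoustic_data": "int16", "time_to_failure": "float32"}

# a tiny in-memory stand-in for train.csv, just to show the loading pattern
sample = io.StringIO("acoustic_data,time_to_failure\n12,1.4690999\n6,1.4690998\n")
train = pd.read_csv(sample, dtype=dtypes)
```

For the real file, the same dtype mapping is passed to pd.read_csv("train.csv", dtype=dtypes).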

Feature engineering

We will use several libraries specific to signal processing to generate most of the features. From SciPy (the Python scientific library), we use a few functions from the signal module. The hann function returns a Hann window, which smooths the values at the ends of the sampled signal toward 0 using a cosine “bell” function. The hilbert function computes the analytic signal using the Hilbert transform, a mathematical technique used in signal processing whose key property is that it shifts the phase of the original signal by 90 degrees.
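The two SciPy functions can be sketched as follows on a small, hypothetical test signal (in recent SciPy versions, hann lives in scipy.signal.windows; the magnitude of the analytic signal gives the signal's envelope):

```python
import numpy as np
from scipy.signal import hilbert
from scipy.signal.windows import hann

# a hypothetical test signal: an amplitude-modulated sine wave
n = 1000
t = np.linspace(0, 1, n)
signal = np.sin(2 * np.pi * 40 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

# Hann window: a cosine "bell" that tapers the segment ends toward 0
window = hann(n)
tapered = signal * window

# Hilbert transform: the analytic signal, whose magnitude is the envelope
analytic = hilbert(signal)
envelope = np.abs(analytic)
```

Tapering with the Hann window before an FFT reduces spectral leakage from the abrupt segment boundaries; the envelope captures how the signal's amplitude evolves over time.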

Other library functions used are from numpy: the Fast Fourier Transform (FFT), mean, min, max, std (standard deviation), abs (absolute value), diff (the difference between successive values in the signal), and quantile (which divides a sample into equal-sized, adjacent groups). We are also using a few statistical functions that are available from pandas: mad (median absolute...
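A small, hypothetical subset of such per-segment aggregations can be sketched with numpy and pandas (the function name and the particular feature selection here are illustrative, not the notebook's exact feature set; note that pandas' Series.mad was removed in pandas 2.0, so a deviation statistic would now be computed by hand):

```python
import numpy as np
import pandas as pd

def segment_features(x):
    """Aggregate one acoustic segment into a few summary statistics
    (an illustrative subset of the aggregations described above)."""
    x = pd.Series(x, dtype="float64")
    return {
        "mean": x.mean(),
        "std": x.std(),
        "min": x.min(),
        "max": x.max(),
        # mean absolute first difference between successive samples
        "mean_abs_diff": np.abs(np.diff(x)).mean(),
        # tail quantiles of the amplitude distribution
        "q05": np.quantile(x, 0.05),
        "q95": np.quantile(x, 0.95),
        # magnitude of the first non-DC FFT coefficient
        "fft_mag_1": np.abs(np.fft.fft(x.values))[1],
    }
```

The same function is then applied to every training segment and every test file, producing one feature row per segment.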

Building a baseline model

From the original temporal data, through feature engineering, we generated time-aggregated features for each time segment in the training data, equal in duration to one test segment. For the baseline model in this competition, we chose LGBMRegressor, one of the best-performing algorithms at the time of the competition, which in many cases performed similarly to XGBoost. The training data is split using KFold into five splits, and we run training and validation for each fold until we reach the final number of iterations or until the validation error stops improving for a specified number of steps (given by the patience parameter). For each split, we then also run the prediction for the test set with the best model, trained on the current fold's train split, that is, on 4/5 of the training set. At the end, we average the predictions obtained for each fold. We can use this cross-validation...
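The fold-averaging scheme can be sketched as follows. This is a minimal illustration with synthetic data and, as a stand-in where LightGBM is not installed, scikit-learn's GradientBoostingRegressor; in the notebook itself, LGBMRegressor with early stopping on the validation fold plays this role:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# hypothetical stand-in data: 200 training rows, 50 test rows, 5 features
rng = np.random.default_rng(42)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
X_test = rng.normal(size=(50, 5))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test))
for train_idx, valid_idx in kf.split(X):
    # fit on 4/5 of the training data; the held-out fold serves as validation
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # accumulate this fold's test predictions; the sum of the five
    # weighted terms is the average prediction across folds
    test_preds += model.predict(X_test) / kf.get_n_splits()
```

Averaging the five per-fold test predictions reduces the variance of the final submission compared with relying on any single fold's model.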

Summary

In this chapter, we delved into handling signal data, focusing particularly on audio signals. We explored various storage formats for such data and examined libraries for loading, transforming, and visualizing this data type. To develop powerful features, we applied a range of signal-processing techniques. Our feature engineering transformed the time-series data from each training segment into aggregated features, computed in the same way for each test segment.

We consolidated all feature engineering processes into a single function, applicable to all training segments and test sets. The transformed features underwent scaling. We then used this prepared data to train a baseline model utilizing the LGBMRegressor algorithm. This model employed cross-validation, and we generated predictions for the test set using the model trained in each fold. Subsequently, we aggregated these predictions to create the submission file. Additionally, we captured and visualized the feature importance for each fold.

...

References

  1. LANL Earthquake Prediction, Can you predict upcoming laboratory earthquakes?, Kaggle Competition: https://www.kaggle.com/competitions/LANL-Earthquake-Prediction
  2. Gabriel Preda, LANL Earthquake EDA and Prediction: https://www.kaggle.com/code/gpreda/lanl-earthquake-eda-and-prediction
  3. LANL Earthquake Prediction, dataset description: https://www.kaggle.com/competitions/LANL-Earthquake-Prediction/data
  4. BirdCLEF 2021 - Birdcall Identification, identify bird calls in soundscape recordings, Kaggle competition: https://www.kaggle.com/competitions/birdclef-2021
  5. McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis in Python.” In Proceedings of the 14th Python in Science Conference, pp. 18-25. 2015
  6. librosa load function: https://librosa.org/doc/main/generated/librosa.load.html
  7. Cornell Birdcall Identification, Kaggle competition: https...