You're reading from Developing Kaggle Notebooks
1st Edition, published in Dec 2023 by Packt (ISBN-13: 9781805128519)

Author: Gabriel Preda

Dr. Gabriel Preda is a Principal Data Scientist for Endava, a major software services company. He has worked on projects in various industries, including financial services, banking, portfolio management, telecom, and healthcare, developing machine learning solutions for various business problems, including risk prediction, churn analysis, anomaly detection, task recommendations, and document information extraction. In addition, he is very active in competitive machine learning, currently holding the title of a three-time Kaggle Grandmaster and is well-known for his Kaggle Notebooks.

Analyzing Acoustic Signals to Predict the Next Simulated Earthquake

In the previous chapters, we explored tabular data, covering categorical, ordinal, and numerical features, as well as text, geographical coordinates, and imagery. The current chapter shifts our focus to a different data category: simulated or experimental signal data. This data type often appears in a range of formats beyond the standard CSV file.

Our primary case study will be data from the LANL Earthquake Prediction Kaggle competition (see Reference 1). I contributed to this competition with a widely recognized and frequently forked notebook titled LANL Earthquake EDA and Prediction (see Reference 2), which will serve as the foundational resource for this chapter’s principal notebook. We’ll then delve into feature engineering, employing a variety of signal analysis techniques vital for developing a predictive model for the competition. Our goal will...

Introducing the LANL Earthquake Prediction competition

The LANL Earthquake Prediction competition centers on utilizing seismic signals to determine the precise timing of a laboratory-induced earthquake. Currently, predicting natural earthquakes remains beyond the reach of our scientific knowledge and technological capabilities. The ideal scenario for scientists is to predict the timing, location, and magnitude of such an event.

Simulated earthquakes, however, created in highly controlled laboratory environments, mimic real-world seismic activity. These simulations enable attempts to forecast lab-generated quakes using the same types of signals observed in natural settings. In this competition, participants use an acoustic input signal to estimate the time until the next artificial earthquake occurs, as detailed in Reference 3. The challenge is to predict the timing of the earthquake, addressing one of the three critical unknowns in earthquake forecasting: when it will happen...

Formats for signal data

Several Kaggle competitions have used sound data in addition to regular tabular features. The Cornell Lab of Ornithology organized three BirdCLEF (LifeCLEF Bird Recognition Challenge) competitions, in 2021, 2022, and 2023, for predicting bird species from samples of bird songs (see Reference 4 for an example of one of these competitions). The format used in these competitions was .ogg, which stores audio data using less bandwidth and is considered technically superior to the .mp3 format.

We can read these types of file formats using the librosa library (see Reference 5). The following code can be used to load an .ogg file and display the sound wave:

import matplotlib.pyplot as plt
import librosa

def display_sound_wave(sound_path=None,
                       text="Test",
                       color="green"):
    """
    Display a sound wave
    Args
        sound_path: path to the sound file
        text: plot title
        color: waveform color
    """
    # load the audio file; librosa resamples to 22,050 Hz by default
    data, sampling_rate = librosa.load(sound_path)
    # plot the amplitude of the signal over time
    plt.plot(data, color=color)
    plt.title(text)
    plt.show()

Exploring our competition data

The LANL Earthquake Prediction dataset consists of the following data:

  • A train.csv file, with two columns only:
    • acoustic_data: This is the amplitude of the acoustic signal.
    • time_to_failure: This is the time to failure corresponding to the current data segment.
  • A test folder with 2,624 files with small segments of acoustic data.
  • A sample_submission.csv file; for each test file, competitors need to provide an estimate of the time to failure.

The training data (9.56 GB) contains 692 million rows. The continuous variation of the time_to_failure values reflects the constant sampling rate of the training data. The acoustic data consists of integer values, ranging from -5,515 to 5,444, with an average of 4.52 and a standard deviation of 10.7 (values oscillating around 0). The time_to_failure values are real numbers, ranging from 0 to 16, with a mean of 5.68 and a standard deviation of 3.67...
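Given the 9.56 GB size of train.csv, it helps to pass explicit, compact dtypes to pandas when loading it. The column names come from the data description above; the ranges quoted there mean acoustic_data fits in an int16. A minimal sketch (using a tiny, hypothetical in-memory CSV as a stand-in for the real file):

```python
import io
import pandas as pd

# explicit dtypes roughly halve memory versus pandas' int64/float64 defaults;
# acoustic_data fits in int16 (the observed range is about -5,515 to 5,444)
dtypes = {"acoustic_data": "int16", "time_to_failure": "float32"}

# a tiny in-memory stand-in for train.csv, just to show the loading pattern
sample = io.StringIO("acoustic_data,time_to_failure\n12,1.4690999\n6,1.4690998\n")
train = pd.read_csv(sample, dtype=dtypes)
```

For the real file, the same dtype mapping is passed to pd.read_csv("train.csv", dtype=dtypes).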

Feature engineering

We will use several libraries specific to signal processing to generate most of the features. From SciPy (the Python scientific library), we use a few functions from the signal module. The hann function returns a Hann window, which smooths the values at the ends of the sampled signal toward 0 using a cosine “bell” function. The hilbert function computes the analytic signal using the Hilbert transform, a mathematical technique used in signal processing whose key property is that it shifts the phase of the original signal by 90 degrees.
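The two SciPy functions can be sketched as follows on a small, hypothetical test signal (in recent SciPy versions, hann lives in scipy.signal.windows; the magnitude of the analytic signal gives the signal's envelope):

```python
import numpy as np
from scipy.signal import hilbert
from scipy.signal.windows import hann

# a hypothetical test signal: an amplitude-modulated sine wave
n = 1000
t = np.linspace(0, 1, n)
signal = np.sin(2 * np.pi * 40 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))

# Hann window: a cosine "bell" that tapers the segment ends toward 0
window = hann(n)
tapered = signal * window

# Hilbert transform: the analytic signal, whose magnitude is the envelope
analytic = hilbert(signal)
envelope = np.abs(analytic)
```

Tapering with the Hann window before an FFT reduces spectral leakage from the abrupt segment boundaries; the envelope captures how the signal's amplitude evolves over time.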

Other library functions used are from numpy: the Fast Fourier Transform (FFT), mean, min, max, std (standard deviation), abs (absolute value), diff (the difference between successive values in the signal), and quantile (which divides a sample into equal-sized, adjacent groups). We are also using a few statistical functions that are available from pandas: mad (median absolute...
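A small, hypothetical subset of such per-segment aggregations can be sketched with numpy and pandas (the function name and the particular feature selection here are illustrative, not the notebook's exact feature set; note that pandas' Series.mad was removed in pandas 2.0, so a deviation statistic would now be computed by hand):

```python
import numpy as np
import pandas as pd

def segment_features(x):
    """Aggregate one acoustic segment into a few summary statistics
    (an illustrative subset of the aggregations described above)."""
    x = pd.Series(x, dtype="float64")
    return {
        "mean": x.mean(),
        "std": x.std(),
        "min": x.min(),
        "max": x.max(),
        # mean absolute first difference between successive samples
        "mean_abs_diff": np.abs(np.diff(x)).mean(),
        # tail quantiles of the amplitude distribution
        "q05": np.quantile(x, 0.05),
        "q95": np.quantile(x, 0.95),
        # magnitude of the first non-DC FFT coefficient
        "fft_mag_1": np.abs(np.fft.fft(x.values))[1],
    }
```

The same function is then applied to every training segment and every test file, producing one feature row per segment.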

Building a baseline model

From the original temporal data, through feature engineering, we generated time-aggregated features for each time segment in the training data, equal in duration to one test segment. For the baseline model in this competition, we chose LGBMRegressor, one of the best-performing algorithms at the time of the competition, which in many cases performed similarly to XGBoost. The training data is split using KFold into five splits, and we run training and validation for each fold until we reach the final number of iterations or until the validation error stops improving for a specified number of steps (given by the patience parameter). For each split, we then also run the prediction for the test set with the best model, trained on the current fold's train split, that is, on 4/5 of the training set. At the end, we average the predictions obtained for each fold. We can use this cross-validation...
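The fold-averaging scheme can be sketched as follows. This is a minimal illustration with synthetic data and, as a stand-in where LightGBM is not installed, scikit-learn's GradientBoostingRegressor; in the notebook itself, LGBMRegressor with early stopping on the validation fold plays this role:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# hypothetical stand-in data: 200 training rows, 50 test rows, 5 features
rng = np.random.default_rng(42)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
X_test = rng.normal(size=(50, 5))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test))
for train_idx, valid_idx in kf.split(X):
    # fit on 4/5 of the training data; the held-out fold serves as validation
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # accumulate this fold's test predictions; the sum of the five
    # weighted terms is the average prediction across folds
    test_preds += model.predict(X_test) / kf.get_n_splits()
```

Averaging the five per-fold test predictions reduces the variance of the final submission compared with relying on any single fold's model.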

Summary

In this chapter, we delved into handling signal data, focusing particularly on audio signals. We explored various storage formats for such data and examined libraries for loading, transforming, and visualizing this data type. To develop powerful features, we applied a range of signal-processing techniques. Our feature engineering transformed the time-series data from each training segment into aggregated features, computed in the same way for each test segment.

We consolidated all feature engineering processes into a single function, applicable to all training segments and test sets. The transformed features underwent scaling. We then used this prepared data to train a baseline model utilizing the LGBMRegressor algorithm. This model employed cross-validation, and we generated predictions for the test set using the model trained in each fold. Subsequently, we aggregated these predictions to create the submission file. Additionally, we captured and visualized the feature importance for each fold.

...

References

  1. LANL Earthquake Prediction, Can you predict upcoming laboratory earthquakes?, Kaggle Competition: https://www.kaggle.com/competitions/LANL-Earthquake-Prediction
  2. Gabriel Preda, LANL Earthquake EDA and Prediction: https://www.kaggle.com/code/gpreda/lanl-earthquake-eda-and-prediction
  3. LANL Earthquake Prediction, dataset description: https://www.kaggle.com/competitions/LANL-Earthquake-Prediction/data
  4. BirdCLEF 2021 - Birdcall Identification, identify bird calls in soundscape recordings, Kaggle competition: https://www.kaggle.com/competitions/birdclef-2021
  5. McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis in Python.” In Proceedings of the 14th Python in Science Conference, pp. 18-25. 2015
  6. librosa load function: https://librosa.org/doc/main/generated/librosa.load.html
  7. Cornell Birdcall Identification, Kaggle competition: https...