DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.

Introduction

Supervised anomaly detection helps turn labelled time series data into practical models that can identify abnormal patterns, such as stockouts, sensor failures, or fraud events. Unlike standard classification tasks, anomaly detection often involves severe class imbalance and imperfect labels, making accuracy a poor measure of success. In this article, you’ll learn how to build a more reliable supervised anomaly detection workflow using residual features, CatBoost, time-aware validation, class weighting, and threshold tuning to improve precision and recall.

Supervised anomaly detection

Supervised anomaly detection is simply classification, as any good academic will tell you. You have observations and labels, on which you train a model that distinguishes anomalous from non-anomalous observations. Chapters 14 and 15 covered some aspects of classification methodology, so we won’t repeat that here. What is worth discussing is what makes the anomaly detection variant of classification distinct from the general case, and what a practical pipeline looks like.

Two things distinguish supervised anomaly detection from general classification problems: severe class imbalance and label quality. Both are structural rather than incidental, and both directly affect how you build and evaluate supervised detectors.

The label quality problem

Labels in anomaly detection are rarely clean. They come from one of three sources, each with its own contamination issue.

Operational incident logs are accurate for large, impactful events (a week-long stockout, a confirmed sensor failure) but can miss subtle anomalies that don’t rise to the threshold of a ticket.
The positive class is high-precision but recall can be low.

Rules-based labelling pipelines are the most common source, and the most dangerous. Rules are written at a point in time, reflecting a data distribution at that moment. As the underlying process drifts, rules generate stale labels. Flagged observations are no longer anomalous (false positive labels) and new anomaly patterns are missed (false negatives).

Prior unsupervised detector output is used when no other labels exist. If your profiling model drifted before retraining, or was itself trained on contaminated data, the labels it generated carry that error forward into your supervised model. This can result in compounded contamination, which makes you wonder if foundational anomaly detection is ever possible. Regardless, a bad unsupervised model produces bad labels; a supervised model trained on bad labels produces confidently wrong predictions.

The practical response is to audit before training. “But I have three million time series,” you say. In that case, inspect a random sample of your positive-class (anomaly) labels and verify them against the raw data. If more than ~20% of your positive labels are wrong, a supervised model will likely underperform a well-tuned unsupervised approach. You’re better off fixing labels first.

Class imbalance

With anomalies comprising well under 5% of observations, often under 1%, a classifier that predicts “normal” everywhere achieves very high accuracy while being completely useless. This is why accuracy is a meaningless metric for anomaly classification, and why you should never use it as your optimization target. Some practical remedies are:

• class_weight='balanced' (or auto_class_weights='Balanced' in CatBoost). This reweights the loss function by inverse class frequency, making a missed anomaly proportionally more costly than a misclassified normal observation.
It is almost always worth doing and costs nothing.
• Threshold adjustment: most classifiers produce a probability score, and a default threshold of 0.5 implicitly assumes balanced classes. Lowering it (say, to 0.2) increases recall at the cost of precision. Use a precision-recall curve on a validation set to find the threshold that reflects your actual cost tradeoff.
• Oversampling the minority class: DO NOT DO THIS. Once widely used (and researched!) but now considered poor practice. SMOTE creates synthetic points by interpolating in feature space, producing examples that are statistically plausible but temporally incoherent. Better alternatives are to augment using domain knowledge (jitter known anomaly windows, vary their duration) or simply collect more real labels.

Gradient boosting on residual features

The most effective supervised approach in our experience, and the one that connects most naturally to the profiling work earlier in this chapter, is to build a feature matrix from profile model residuals and train a gradient boosting classifier on it. A profile model removes known structure. What remains in the residuals is the unexplained variation. A supervised classifier trained on residual features learns to distinguish residual patterns associated with confirmed anomalies from patterns that are just normal noise: things like seasonal peaks that happened to breach a threshold, measurement jitter, or genuine promotions that the model underestimated.

Below we have made a few specific design choices.

• We use TimeSeriesSplit rather than random cross-validation. This is because our rolling features (resid_roll_std, sales_roll_mean) are computed from past observations and shifted by one step to avoid look-ahead.
Random shuffling would be fine if we didn’t include these kinds of features; as it stands, shuffling would leak information about the temporal neighborhood of test points into training.
• We use CatBoost with auto_class_weights='Balanced', which automatically reweights the loss function inversely proportional to class frequency. With anomalies representing roughly 4% of observations, an unweighted classifier would achieve 96% accuracy by predicting normal for every row. Balanced weighting forces the model to treat a missed anomaly as roughly 24x more costly than a false alarm during training, which prevents it from ignoring the minority class (outliers). The threshold adjustment step after training lets us fine-tune the actual precision-recall tradeoff.

CatBoost was selected for our classifier, but you could easily use another algorithm, e.g. XGBoost, LightGBM, or a regularized logistic regression. What is important here is having good labels, a strong temporal structure and, optionally, a profile model that explains the data well and leaves us with good residual signal.

from catboost import CatBoostClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import classification_report, precision_recall_curve
import pandas as pd
import numpy as np

def build_supervised_features(df, window=14):
    """
    Feature matrix for supervised residual profiling.
    All rolling features are shifted by 1 to avoid lookahead bias.
    """
    feats = pd.DataFrame(index=df.index)

    # Core residual signal
    feats['residual'] = df['residual']
    feats['residual_z'] = df['residual_z']
    feats['residual_abs'] = df['residual'].abs()

    # Local residual statistics
    for w in [7, 14, 28]:
        feats[f'resid_roll_mean_{w}'] = (
            df['residual'].rolling(w, min_periods=3).mean().shift(1)
        )
        feats[f'resid_roll_std_{w}'] = (
            df['residual'].rolling(w, min_periods=3).std().shift(1)
        )
        feats[f'resid_roll_max_{w}'] = (
            df['residual'].abs().rolling(w, min_periods=3).max().shift(1)
        )

    # Scores from unsupervised detectors as features
    for col in ['if_unsup_score', 'if_score', 'eif_score', 'lof_score']:
        if col in df.columns:
            feats[col] = df[col]

    # Context
    feats['price'] = df['price']
    feats['promotion'] = df['promotion']
    feats['sales_roll_mean'] = (
        df['sales'].rolling(14, min_periods=3).mean().shift(1)
    )
    feats['day_of_week'] = df.index.dayofweek
    feats['month'] = df.index.month

    # Early rows have no rolling history yet; fill those gaps with 0
    return feats.fillna(0)


feature_df = build_supervised_features(stockouts_data)
labels = stockouts_data['is_anomaly'].fillna(0).astype(int)

# Time-series split
tscv = TimeSeriesSplit(n_splits=4)

# Out-of-fold predictions; test_mask records which rows were ever held out
all_preds = np.zeros(len(labels))
all_probs = np.zeros(len(labels))
test_mask = np.zeros(len(labels), dtype=bool)

for train_idx, test_idx in tscv.split(feature_df):
    X_tr = feature_df.iloc[train_idx].values
    X_te = feature_df.iloc[test_idx].values
    y_tr = labels.iloc[train_idx].values
    y_te = labels.iloc[test_idx].values

    clf = CatBoostClassifier(
        iterations=300,
        learning_rate=0.05,
        depth=4,
        auto_class_weights='Balanced',
        eval_metric='F1',
        random_seed=42,
        verbose=0
    )
    # Note: the held-out fold doubles as eval_set here. CatBoost keeps the
    # best iteration on the eval_set by default, so scores may be slightly optimistic.
    clf.fit(X_tr, y_tr, eval_set=(X_te, y_te))

    all_preds[test_idx] = clf.predict(X_te)
    all_probs[test_idx] = clf.predict_proba(X_te)[:, 1]
    test_mask[test_idx] = True

# Evaluate only on rows that appeared in a test fold
y_test_all = labels.values[test_mask]
y_pred_all = all_preds[test_mask].astype(int)
y_prob_all = all_probs[test_mask]

print(classification_report(y_test_all, y_pred_all, target_names=['Normal', 'Anomaly']))

# Output:
              precision    recall  f1-score   support

      Normal       1.00      0.99      1.00       562
     Anomaly       0.81      1.00      0.90        22

    accuracy                           0.99       584
   macro avg       0.91      1.00      0.95       584
weighted avg       0.99      0.99      0.99       584

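As a quick sanity check on the two design choices above, here is a minimal sketch that uses synthetic stand-ins rather than the stockout data. It assumes a made-up label vector with exactly 4% anomalies and a hypothetical series length of 730 rows, chosen only because four expanding folds then hold out 584 rows, the support shown in the report. The balanced_weights helper is our own illustration, not a CatBoost function; it mirrors the largest-class-count over class-count ratio that, as far as we understand it, auto_class_weights='Balanced' applies.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def balanced_weights(y):
    """Hypothetical helper: weight each class by largest class count / its own count."""
    counts = np.bincount(y)
    return counts.max() / counts


# Made-up labels: 1,000 rows with exactly 4% anomalies (40 ones, 960 zeros).
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1
weights = balanced_weights(y_demo)
print(f"normal weight: {weights[0]:.1f}, anomaly weight: {weights[1]:.1f}")

# Assumed series length of 730 rows; TimeSeriesSplit only needs the row count.
X_demo = np.zeros((730, 1))
fold_sizes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_demo):
    # Every training index precedes every test index, so no future rows leak in.
    assert train_idx.max() < test_idx.min()
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"train rows: {len(train_idx):3d}  test rows: {len(test_idx)}")

held_out = sum(n_test for _, n_test in fold_sizes)
print(f"rows scored out-of-fold: {held_out} of {len(X_demo)}")
```

Under these assumptions the anomaly class is weighted 24 times the normal class, matching the 96/4 arithmetic, and the earliest block of rows is only ever used for training. That is why the pipeline tracks test_mask and reports on the held-out rows alone.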
The four blocks in build_supervised_features each give our classifier a different view on the same observation. The core residual signal (residual, residual_z, residual_abs) is the raw ‘what the profile couldn’t explain’; this is our primary anomaly evidence. The local residual statistics (rolling mean, std, and max at 7, 14, and 28 days, all shifted by one) provide context, because a residual of +20 means very different things depending on whether the surrounding fortnight has been quiet or noisy. The unsupervised detector scores (if_unsup_score, if_score, eif_score, lof_score) are second opinions from earlier in the chapter; our classifier learns when to trust them and when to override them. Finally, the context features (price, promotion, sales_roll_mean, day_of_week, month) let our model learn that a large negative residual in late December under a deep promotion is structurally different from the same residual on a quiet Tuesday in February. Every rolling feature is shifted by one step so that no rolling feature at time $t$ contains $y_t$ itself.

Figure 17.19: CatBoost precision-recall curve and feature importance

The feature importance panel tells us which signals the model is actually using. If if_unsup_score or if_score dominate, the supervised model is largely deferring to unsupervised detectors and the supervised wrapper adds little beyond threshold calibration. Some influence from resid_roll_std suggests the classifier has learned temporal patterns from residuals that no single unsupervised method captures on its own. We could instead build more features, and labels, from the raw data, which would avoid the need for ensembled output.

Threshold adjustment

CatBoost’s default threshold of 0.5 assumes equal misclassification costs. In anomaly detection this is rarely appropriate, as the cost of missing real stockouts often exceeds the cost of false alarms.
The precision-recall curve in the figure above shows the full tradeoff surface; here we select the threshold that maximizes F1 as a reasonable default.

from sklearn.metrics import f1_score

# Candidate thresholds from 0.10 up to (but not including) 0.90
# Note: the threshold is picked on the same out-of-fold predictions it is
# scored on, so the tuned metrics below are optimistic.
thresholds = np.arange(0.1, 0.9, 0.01)
f1_scores = [
    f1_score(y_test_all, (y_prob_all >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_thresh = thresholds[np.argmax(f1_scores)]
best_f1 = max(f1_scores)

# Confusion counts at the tuned threshold
y_pred_tuned = (y_prob_all >= best_thresh).astype(int)
tp_sup = ((y_pred_tuned == 1) & (y_test_all == 1)).sum()
fp_sup = ((y_pred_tuned == 1) & (y_test_all == 0)).sum()
fn_sup = ((y_pred_tuned == 0) & (y_test_all == 1)).sum()

prec_sup = tp_sup / (tp_sup + fp_sup) if (tp_sup + fp_sup) else 0.0
rec_sup = tp_sup / (tp_sup + fn_sup) if (tp_sup + fn_sup) else 0.0

print(f"Best threshold: {best_thresh:.2f}")
print(f"Supervised (tuned): Precision: {prec_sup:.2f} Recall: {rec_sup:.2f} "
      f"F1: {best_f1:.2f}")

# Output:
Best threshold: 0.55
Supervised (tuned): Precision: 0.91 Recall: 0.95 F1: 0.93

If false negatives are expensive (missed equipment failures, missed fraud) you want a lower threshold to maximize recall. If false positives are expensive (wasted analyst time, unnecessary operational interventions) you want a higher threshold to maximize precision. There is no universally correct answer; the threshold is a business decision, not a model decision.

17.19: CatBoost supervised detections with tuned threshold

Method                                    Type           Precision   Recall   F1
Z-score Profile                           Unsupervised   0.74        1.00     0.85
IF: Raw Features                          Unsupervised   0.51        0.97     0.67
IF: Residuals                             Unsupervised   0.25        0.48     0.33
INNE: Residuals                           Unsupervised   0.05        0.10     0.07
LOF: Residuals                            Unsupervised   0.83        0.97     0.90
CatBoost (supervised, tuned threshold)    Supervised     0.91        0.95     0.93

Table 17.4: Comparing all approaches

CatBoost wins clearly here, but this result is an artifact of the task. Our anomalies were generated programmatically and we have certainty in them. We have strong temporal structure and relationships in our data and, even where they don’t repeat in calendar position, noise is low. With clean labels, repeated anomaly patterns, and a profile model that already explains most of the variance, a supervised classifier on residual features has very little left to do. It is an easy task, and it shows us how critical structure and labelling are.

We should think of this as a calibration layer that sits on top of unsupervised detection. In production, labels come from unsupervised detectors and human-in-the-loop review (the contamination problem returns); novel anomalies, by definition, will not have appeared in your training data. We will return to this self-supervised detection in Chapter 18.

Some practical considerations:

• Hyperparameter sensitivity: contamination has the largest effect in isolation-based methods; it sets the decision threshold on anomaly scores, not the model structure.
Halving max_samples and max_features to 0.5 while raising contamination to 0.08 lifted the unsupervised IF’s recall from 0.61 to 0.97. For supervised models, threshold adjustment on the probability output typically matters more than tree hyperparameters themselves (Soenen, Wolputte, and Perini 2021).
• The contamination-imputation loop: Start with a generous contamination rate (high recall, accept false positives), review flagged observations, then impute confirmed anomalies with interpolated values. Some values were already wrong; replacing them should improve signal rather than fabricate it. Normal data is abundant enough to anchor any reasonable interpolation. Retrain a profile on the imputed series and re-score. Each iteration sharpens the residuals and tightens detection boundaries. Human confirmation at each step is essential; uncertain flags stay in the data untouched. Depending on your case, you could pair this with forecast accuracy, being sure to leave validation sets clean; otherwise, you’re marking your own homework.

Conclusion

Supervised anomaly detection works best when labels are trustworthy, temporal structure is preserved, and model outputs are tuned around real business costs. By using residual-based features, balanced class weighting, TimeSeriesSplit, and precision-recall-based threshold adjustment, you can build detectors that perform better than simple unsupervised baselines in well-labelled scenarios. However, the real challenge in production is not just model selection; it is maintaining label quality, managing class imbalance, and deciding the right trade-off between missed anomalies and false alarms.

This article is an excerpt from Time Series with PyTorch: Modern Deep Learning Toolkit for Real-World Forecasting Challenges, published by Packt.

Author Bio

Graeme Davidson is a Lead Data Scientist at Retail Express, where he redesigned the company's demand forecasting framework in line with contemporary statistical learning practices.
His background spans cognitive neuroscience, researching implicit reward processing and human decision-making, through advertising analytics to research-focused demand forecasting. He is an active contributor to several data science Slack and Discord communities, an occasional competitor in forecasting competitions, and was approached by Packt in late 2022 to write the book he wished had existed when he first fell down an ARIMA rabbit hole chasing answers about how supermarkets actually forecast demand, and how a quantitative researcher models financial markets.

Lei Ma is a physicist-turned-data-scientist specializing in time series forecasting. He is a theorist but has tackled real-world forecasting challenges across a variety of industries like housing, logistics, ecommerce, and manufacturing. Lei has led and delivered numerous forecasting projects where he combines deep expertise in building advanced time series models with a strategic approach to delivering holistic business insights. Lei creates time series forecasting tutorials online and joined the venture when Graeme approached him to collaborate on this book.