
Hands-On Attention Mechanism for Time Series Classification, with Python

June 2, 2025


The attention mechanism is a game changer in Machine Learning. In fact, in the recent history of Deep Learning, the idea of allowing models to focus on the most relevant parts of an input sequence when making a prediction has completely revolutionized the way we look at Neural Networks.

That being said, there is one controversial take that I have about the attention mechanism:

The best way to learn the attention mechanism is not through Natural Language Processing (NLP)

It is (technically) a controversial take for two reasons.

  1. People naturally use NLP cases (e.g., translation or next-sentence prediction) because NLP is the reason why the attention mechanism was developed in the first place. The original goal was to overcome the limitations of RNNs and CNNs in handling long-range dependencies in language (if you haven’t already, you should really read the paper Attention Is All You Need).
  2. The general idea of putting the “attention” on a specific word to perform translation tasks is also very intuitive.

That being said, if we want to understand how attention REALLY works in a hands-on example, I believe that Time Series is the best framework to use. There are many reasons why I say that.

  1. Computers are not really “made” to work with strings; they work with ones and zeros. All the embedding steps that are necessary to convert the text into vectors add an extra layer of complexity that is not strictly related to the attention idea.
  2. The attention mechanism, though it was first developed for text, has many other applications (for example, in computer vision), so I like the idea of exploring attention from another angle as well.
  3. With time series specifically, we can create very small datasets and run our attention models in minutes (yes, including the training) without any fancy GPUs.

In this blog post, we will see how we can build an attention mechanism for time series, specifically in a classification setup. We will work with sine waves, and we will try to distinguish a normal sine wave from a “modified” sine wave. The “modified” sine wave is created by flattening a portion of the original signal. That is, at a certain location in the wave, we simply remove the oscillation and replace it with a flat line, as if the signal had temporarily stopped or become corrupted.

To make things spicier, we will assume that the sine can have any frequency or amplitude, and that the location and extension (we call it length) of the “rectified” part are also parameters. In other words, the sine can be whatever sine, and we can put our “straight line” wherever we like on the sine wave.

Well, ok, but why should we even bother with the attention mechanism? Why are we not using something simpler, like Feed Forward Neural Networks (FFNs) or Convolutional Neural Networks (CNNs)?

Well, because we are again assuming that the “modified” signal can be “flattened” anywhere (at whatever location in the time series) and for whatever length. This means that a standard Neural Network is not that efficient, because the anomalous “part” of the time series is not always in the same portion of the signal. In other words, if you just try to deal with this with a linear weight matrix plus a non-linear function, you will get suboptimal results, because index 300 of time series 1 can be completely different from index 300 of time series 14. What we need instead is a dynamic approach that puts the attention on the anomalous part of the series. This is why (and where) the attention mechanism shines.

This blog post will be divided into these 4 steps:

  1. Code Setup. Before getting into the code, I will display the setup, with all the libraries we will need.
  2. Data Generation. I will provide the code that we will need for the data generation part.
  3. Model Implementation. I will provide the implementation of the attention model.
  4. Exploration of the results. The benefits of the attention model will be displayed through the attention scores, and classification metrics will assess the performance of our approach.

It seems like we have a lot of ground to cover. Let’s get started! 🚀


1. Code Setup

Before delving into the code, let’s invoke some friends that we will need for the rest of the implementation.
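The import cell itself is not embedded in this extract; here is a minimal sketch of the imports used in the rest of the post (the exact list is an assumption based on the libraries that appear later):

    # Standard scientific stack
    import json
    import numpy as np
    import matplotlib.pyplot as plt

    # PyTorch for the model and the data loaders
    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader

    # scikit-learn for the splits and the classification metrics
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, confusion_matrix)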

These are just the standard libraries that we will use throughout the project. What you see below is the short and sweet requirements.txt file.
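The original file is not reproduced in this extract; an indicative requirements.txt covering the imports above would be:

    numpy
    matplotlib
    torch
    scikit-learn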

I like it when things are easy to change and modular. For this reason, I created a .json file where we can change everything about the setup. Some of these parameters are:

  1. The number of normal vs abnormal time series (the ratio between the two)
  2. The number of time series steps (how long your timeseries is)
  3. The size of the generated dataset
  4. The min and max locations and lengths of the linearized part
  5. Much more.

The .json file looks like this.
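The original file is not shown in this extract, so here is a hypothetical config.json with the parameters listed above; every field name and value is an assumption:

    {
      "num_samples": 10000,
      "seq_len": 500,
      "anomaly_ratio": 0.5,
      "min_amplitude": 0.5,
      "max_amplitude": 2.0,
      "min_frequency": 1.0,
      "max_frequency": 5.0,
      "min_flat_location": 50,
      "max_flat_location": 400,
      "min_flat_length": 20,
      "max_flat_length": 100,
      "train_ratio": 0.7,
      "val_ratio": 0.15,
      "test_ratio": 0.15
    }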

So, before going to the next step, make sure you have:

  1. The constants.py file in your work folder
  2. The .json file in your work folder or in a path that you remember
  3. The libraries in the requirements.txt file installed

2. Data Generation

Two simple functions build the normal sine wave and the modified (rectified) one. The code for this is found in data_utils.py:
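The original gist is not embedded here; a minimal sketch of what the two generators in data_utils.py could look like (function and parameter names are assumptions) is:

    import numpy as np

    def generate_normal_sine(seq_len, amplitude, frequency):
        """Plain sine wave with the given amplitude and frequency."""
        t = np.linspace(0, 1, seq_len)
        return amplitude * np.sin(2 * np.pi * frequency * t)

    def generate_rectified_sine(seq_len, amplitude, frequency, flat_start, flat_length):
        """Sine wave where the window [flat_start, flat_start + flat_length)
        is replaced by a flat line (the value at the start of the window)."""
        signal = generate_normal_sine(seq_len, amplitude, frequency)
        flat_end = min(flat_start + flat_length, seq_len)
        signal[flat_start:flat_end] = signal[flat_start]
        return signal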

Now that we have the basics, we can do all the backend work in data.py. This is intended to be the function that does it all:

  1. Receives the setup information from the .json file (that’s why you need it!)
  2. Builds the modified and normal sine waves
  3. Does the train/val/test split that we will use for model validation

The data.py script is the following:
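The full script is not reproduced in this extract; a compact sketch of the three steps above, using the hypothetical config keys from the JSON shown earlier, could look like this:

    import json
    import numpy as np
    from sklearn.model_selection import train_test_split
    from data_utils import generate_normal_sine, generate_rectified_sine

    def build_dataset(config_path):
        """Read the .json setup, build normal/rectified sines, and split the data."""
        with open(config_path) as f:
            cfg = json.load(f)

        rng = np.random.default_rng(0)
        X, y = [], []
        for _ in range(cfg["num_samples"]):
            amp = rng.uniform(cfg["min_amplitude"], cfg["max_amplitude"])
            freq = rng.uniform(cfg["min_frequency"], cfg["max_frequency"])
            if rng.random() < cfg["anomaly_ratio"]:
                start = rng.integers(cfg["min_flat_location"], cfg["max_flat_location"])
                length = rng.integers(cfg["min_flat_length"], cfg["max_flat_length"])
                X.append(generate_rectified_sine(cfg["seq_len"], amp, freq, start, length))
                y.append(1)
            else:
                X.append(generate_normal_sine(cfg["seq_len"], amp, freq))
                y.append(0)
        X, y = np.array(X), np.array(y)

        # Train/val/test split for model validation
        X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, stratify=y)
        X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp)
        return (X_train, y_train), (X_val, y_val), (X_test, y_test)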

The additional data script is the one that prepares the data for Torch (SineWaveTorchDataset), and it looks like this:
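That script is not included in this extract either; a minimal version of SineWaveTorchDataset wrapping the arrays produced above could be:

    import torch
    from torch.utils.data import Dataset

    class SineWaveTorchDataset(Dataset):
        """Wraps the (num_samples, seq_len) signal array and its labels for PyTorch."""
        def __init__(self, signals, labels):
            # Shape (N, seq_len, 1) so the LSTM sees one feature per time step
            self.signals = torch.tensor(signals, dtype=torch.float32).unsqueeze(-1)
            self.labels = torch.tensor(labels, dtype=torch.float32)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            return self.signals[idx], self.labels[idx]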

If you want to take a look, this is a random anomalous time series:

Image generated by author

And this is a non-anomalous time series:

Image generated by author

Now that we have our dataset, we can worry about the model implementation.


3. Model Implementation

The implementation of the model, the training, and the loader can be found in the model.py code:
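The file itself is not embedded in this extract, so what follows is a condensed sketch of a bidirectional LSTM with an additive attention layer plus a simple training loop with early stopping; class, function, and argument names are assumptions, not the author's exact code:

    import torch
    import torch.nn as nn

    class AttentionLSTMClassifier(nn.Module):
        def __init__(self, hidden_size=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                                batch_first=True, bidirectional=True)
            # One attention score per time step
            self.attention = nn.Linear(2 * hidden_size, 1)
            self.classifier = nn.Linear(2 * hidden_size, 1)

        def forward(self, x):
            # x: (batch, seq_len, 1) -> h: (batch, seq_len, 2 * hidden_size)
            h, _ = self.lstm(x)
            # alpha: (batch, seq_len, 1), softmax over the time dimension
            alpha = torch.softmax(self.attention(h), dim=1)
            # Context vector: attention-weighted sum of the LSTM outputs
            context = (alpha * h).sum(dim=1)
            logits = self.classifier(context).squeeze(-1)
            return logits, alpha.squeeze(-1)

    def train_model(model, train_loader, val_loader, epochs=20, lr=1e-3, patience=3):
        """Binary cross-entropy training with simple early stopping on the val loss."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()
        best_val, patience_left = float("inf"), patience
        for epoch in range(epochs):
            model.train()
            for xb, yb in train_loader:
                optimizer.zero_grad()
                logits, _ = model(xb)
                loss = loss_fn(logits, yb)
                loss.backward()
                optimizer.step()
            # Validation pass used only for early stopping
            model.eval()
            with torch.no_grad():
                val_loss = sum(loss_fn(model(xb)[0], yb).item() for xb, yb in val_loader)
            if val_loss < best_val:
                best_val, patience_left = val_loss, patience
            else:
                patience_left -= 1
                if patience_left == 0:
                    break
        return model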

Now, let me take some time to explain why the attention mechanism is a game-changer here. Unlike FFNN or CNN, which would treat all time steps equally, attention dynamically highlights the parts of the sequence that matter most for classification. This allows the model to “zoom in” on the anomalous section (regardless of where it appears), making it especially powerful for irregular or unpredictable time series patterns.

Let me be more precise here and talk about the Neural Network.
In our model, we use a bidirectional LSTM to process the time series, capturing both past and future context at each time step. Then, instead of feeding the LSTM output directly into a classifier, we compute attention scores over the entire sequence. These scores determine how much weight each time step should have when forming the final context vector used for classification. This means the model learns to focus only on the meaningful parts of the signal (i.e., the flat anomaly), no matter where they occur.

Now let’s connect the model and the data to see the performance of our approach.


4. A practical example

4.1 Training the Model

Given the big backend part that we developed, we can train the model with this super simple block of code.
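The block is not shown in this extract; under the hypothetical module and class names used in the sketches above, it could be as simple as:

    from torch.utils.data import DataLoader

    # Module names are assumptions based on the files mentioned in this post
    from data import build_dataset
    from model import AttentionLSTMClassifier, train_model
    from torch_data import SineWaveTorchDataset

    (X_train, y_train), (X_val, y_val), (X_test, y_test) = build_dataset("config.json")

    train_loader = DataLoader(SineWaveTorchDataset(X_train, y_train), batch_size=32, shuffle=True)
    val_loader = DataLoader(SineWaveTorchDataset(X_val, y_val), batch_size=32)

    model = AttentionLSTMClassifier()
    model = train_model(model, train_loader, val_loader)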

This took around 5 minutes to complete on the CPU.
Notice that we implemented (in the backend) early stopping and a train/val/test split to avoid overfitting. We are responsible kids.

4.2 Attention Mechanism

Let’s use the following function here to display the attention mechanism together with the sine function.
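The helper itself is not reproduced in this extract; a minimal version that overlays the attention scores on the raw signal could look like this (the function name is an assumption):

    import matplotlib.pyplot as plt
    import torch

    def plot_attention(model, signal):
        """Plot the raw signal together with the attention scores alpha."""
        model.eval()
        with torch.no_grad():
            x = torch.tensor(signal, dtype=torch.float32).view(1, -1, 1)
            _, alpha = model(x)
        fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 5))
        ax1.plot(signal)
        ax1.set_ylabel("Signal")
        ax2.plot(alpha.squeeze(0).numpy())
        ax2.set_ylabel("Attention score")
        ax2.set_xlabel("Time step")
        plt.show()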

Let’s show the attention scores for a normal time series.

Image generated by author using the code above

As we can see, the attention scores are localized (with a sort of time shift) on the areas where the signal is naturally flat, that is, near the peaks of the sine. Nonetheless, these are only localized spikes.

Now let’s look at an anomalous time series.

Image generated by author using the code above

As we can see here, the model recognizes (with the same time shift) the area where the function flattens out. Nonetheless, this time, it is not a localized peak. It is a whole section of the signal where we have higher than usual scores. Bingo.

4.3 Classification Performance

Ok, this is nice and all, but does this work? Let’s implement the function to generate the classification report.
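The original function is not included in this extract; a possible implementation using scikit-learn's metrics (the function name is assumed) is sketched below.

    import torch
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, confusion_matrix)

    def classification_report_sines(model, X_test, y_test):
        """Print the standard binary classification metrics on the test set."""
        model.eval()
        with torch.no_grad():
            x = torch.tensor(X_test, dtype=torch.float32).unsqueeze(-1)
            logits, _ = model(x)
            probs = torch.sigmoid(logits).numpy()
        preds = (probs > 0.5).astype(int)
        print(f"Accuracy  : {accuracy_score(y_test, preds):.4f}")
        print(f"Precision : {precision_score(y_test, preds):.4f}")
        print(f"Recall    : {recall_score(y_test, preds):.4f}")
        print(f"F1 Score  : {f1_score(y_test, preds):.4f}")
        print(f"ROC AUC   : {roc_auc_score(y_test, probs):.4f}")
        print("Confusion Matrix:")
        print(confusion_matrix(y_test, preds))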

The results are the following:

Accuracy  : 0.9775
Precision : 0.9855
Recall    : 0.9685
F1 Score  : 0.9769
ROC AUC   : 0.9774

Confusion Matrix:
[[1002   14]
 [  31  953]]

Very high performance in terms of all the metrics. Works like a charm. 🙃


5. Conclusions

Thank you very much for reading through this article ❤️. It means a lot. Let’s summarize what we found in this journey and why this was helpful. In this blog post, we applied the attention mechanism in a classification task for time series. The classification was between normal time series and “modified” ones. By “modified” we mean that a part (a random part, with random length) has been rectified (substituted with a straight line). We found that:

  1. Attention mechanisms were originally developed for NLP, but they also excel at identifying anomalies in time series data, especially when the location of the anomaly varies across samples. This flexibility is difficult to achieve with traditional CNNs or FFNNs.
  2. By using a bidirectional LSTM combined with an attention layer, our model learns what parts of the signal matter most. We saw that a posteriori through the attention scores (alpha), which reveal which time steps were most relevant for classification. This framework provides a transparent and interpretable approach: we can visualize the attention weights to understand why the model made a certain prediction.
  3. With minimal data and no GPU, we trained a highly accurate model (F1 score ≈ 0.98) in just a few minutes, proving that attention is accessible and powerful even for small projects.

6. About me!

Thank you again for your time. It means a lot ❤️

My name is Piero Paialunga, and I’m this guy here:

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department. I talk about AI and Machine Learning in my blog posts and on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at [email protected]

Ciao!
