
Neural Networks and LLMs for Time Series Forecasting

Harnessing Deep Learning and LLMs to Predict the Future with Time Series Data


Time series forecasting represents one of the oldest and most practical applications of predictive analytics. From predicting stock prices and electricity demand to forecasting weather patterns and pandemic trajectories, the ability to make informed predictions based on historical time-ordered data drives critical decision-making across industries. While classical statistical methods have served practitioners well for decades, recent advances in neural networks and large language models have revolutionized the field, enabling unprecedented accuracy and flexibility.

This article provides a comprehensive journey through the evolution of time series forecasting techniques — from traditional statistical approaches to cutting-edge deep learning architectures. We’ll explore how each innovation addresses the limitations of previous methods and delve into practical implementation details to help you build effective forecasting solutions.

Table of Contents

  1. Understanding Time Series Forecasting
    1.1 What Is Time Series Forecasting?
    1.2 How Time Series Forecasting Differs from Classification and Regression
  2. Classical Approaches to Time Series Forecasting
    2.1 Exponential Smoothing Methods
    2.2 ARIMA Models
    2.3 Global Forecasting Models
  3. Common Libraries for Time Series Forecasting
  4. Neural Networks for Time Series Forecasting
    4.1 Recurrent Neural Networks (RNNs)
    4.2 Long Short-Term Memory Networks (LSTMs)
    4.3 Gated Recurrent Units (GRUs)
    4.4 Convolutional Neural Networks (CNNs)
    4.5 Feature Engineering for Neural Forecasting
  5. Transformers for Time Series Forecasting
    5.1 Transformer Architecture Basics
    5.2 Adapting Transformers for Time Series
    5.3 Feature Engineering for Transformers
  6. Global Deep Learning Forecasting Models
    6.1 Shared Representation Learning
    6.2 Specialized Deep Learning Architectures
    — N-BEATS / N-BEATSx
    — N-HiTS
    — Autoformer Family (Autoformer, Informer, Reformer)
    — LTSF-Linear
    — PatchTST
    — iTransformer
    — Temporal Fusion Transformer (TFT)
    — TSMixer
    — Time Series Dense Encoder (TiDE)
  7. Common Mistakes in Time Series Forecasting
    7.1 Data Leakage
    7.2 Ignoring Data Quality Issues
    7.3 Improper Evaluation
    7.4 Model Complexity Mismatches
    7.5 Deployment Challenges
  8. Conclusion

Understanding Time Series Forecasting

Source: AI Algorithms for Time Series Analysis

What Is Time Series Forecasting?

At its core, time series forecasting involves predicting future values based on previously observed values in a chronologically ordered dataset. A time series is a sequence of data points indexed in time order, typically collected at uniform time intervals. Examples include:

  • Daily stock prices
  • Hourly temperature readings
  • Monthly retail sales figures
  • Quarterly GDP measurements

The fundamental goal of time series forecasting is to build a mathematical model that captures patterns in historical data and uses them to predict future values with reasonable accuracy.

How Time Series Forecasting Differs from Classification and Regression

While standard machine learning tasks typically assume independent and identically distributed (i.i.d.) data points, time series data violates this fundamental assumption. Several key differences set forecasting apart:

  1. Temporal Dependency: Unlike standard regression where observations are assumed to be independent, time series data exhibits temporal dependency — current values depend on past values. This autocorrelation is a defining characteristic of time series data.
  2. Non-Stationarity: Many time series exhibit changing statistical properties over time (non-stationarity), such as trends, seasonality, and cyclical patterns, making them more complex to model than typical regression problems.
  3. Ordered Data: The sequential nature of time series data means that the order of observations matters fundamentally, unlike in many classification or regression tasks where observations can be shuffled.
  4. Evaluation Metrics: Time series forecasting uses specialized metrics such as MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Squared Error), evaluated with time-based validation splits that respect the temporal structure.
  5. Multi-step Forecasting: Time series often requires predicting multiple future time steps, introducing complexities around error propagation that aren’t present in traditional predictive modeling.

Unlike classification, which assigns discrete labels or categories to observations, time series forecasting predicts continuous values along a time dimension, often with the added complexity of capturing multiple interdependent variables evolving simultaneously.

Classical Approaches to Time Series Forecasting

Before diving into neural network approaches, it’s important to understand the classical methods that formed the foundation of the field.

Source: ARIMA vs ETS

Exponential Smoothing Methods

Exponential smoothing represents one of the simplest yet effective approaches to time series forecasting. These methods assign exponentially decreasing weights to past observations, giving more importance to recent data points.

The family of exponential smoothing methods includes:

  • Simple Exponential Smoothing (SES): Suitable for data without trend or seasonality
  • Holt’s Linear Method: Extends SES to handle trends
  • Holt-Winters Method: Further extends to incorporate seasonality

The basic formula for simple exponential smoothing is:

S_t = α × Y_t + (1 - α) × S_(t-1)

Where:

  • S_t is the smoothed value at time t
  • Y_t is the observed value at time t
  • α is the smoothing parameter (0 < α < 1)

Exponential smoothing methods are computationally efficient and work well for stable time series with clear patterns, but they struggle with complex, nonlinear relationships and external variables.
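
As a minimal sketch of the recursion above (the series and α = 0.3 are purely illustrative), the same forecast can be produced by hand or with statsmodels; small differences can arise from how the initial level is set:

```python
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Illustrative monthly series -- replace with your own data
y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
    index=pd.date_range("2024-01-01", periods=10, freq="MS"),
)

# Manual recursion: S_t = alpha * Y_t + (1 - alpha) * S_(t-1)
alpha = 0.3
level = y.iloc[0]
for value in y.iloc[1:]:
    level = alpha * value + (1 - alpha) * level
print("Manual one-step-ahead forecast:", round(level, 2))

# Equivalent fit with statsmodels, holding alpha fixed
fit = SimpleExpSmoothing(y, initialization_method="heuristic").fit(
    smoothing_level=alpha, optimized=False
)
print("statsmodels forecast:", round(fit.forecast(1).iloc[0], 2))
```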

ARIMA Models

Autoregressive Integrated Moving Average (ARIMA) models, popularized by Box and Jenkins in the 1970s, provide a more sophisticated statistical approach to time series forecasting. ARIMA combines three components:

  1. Autoregressive (AR): Uses the dependency between an observation and a number of lagged observations
  2. Integrated (I): Makes the time series stationary by differencing
  3. Moving Average (MA): Uses the dependency between an observation and a residual error from a moving average model applied to lagged observations

ARIMA models are represented as ARIMA(p,d,q), where:

  • p is the order of the autoregressive component
  • d is the degree of differencing required for stationarity
  • q is the order of the moving average component

Variations include:

  • SARIMA: Incorporates seasonality
  • ARIMAX: Includes exogenous variables
  • VARIMA: Extends to multivariate time series
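
As a hedged sketch (the (1, 1, 1) order, the monthly seasonal order, and the input file are illustrative assumptions, not recommendations), fitting a seasonal ARIMA with statsmodels looks roughly like this:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series; in practice choose orders from ACF/PACF plots
# or an auto-ARIMA search (e.g., pmdarima.auto_arima)
y = pd.read_csv("monthly_sales.csv", index_col="date", parse_dates=True)["sales"]

model = ARIMA(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fit = model.fit()
print(fit.summary())

# Forecast the next 12 periods with confidence intervals
forecast = fit.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int())
```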

ARIMA models excel at capturing linear relationships in stationary data and can model complex seasonal patterns. However, they have notable limitations:

  • Assume linear relationships between variables
  • Require manual intervention for parameter selection
  • Struggle with long-term dependencies
  • Limited ability to incorporate external factors
  • Difficulty handling multiple seasonalities or irregular patterns

While still valuable for many forecasting tasks, these limitations spurred the development of more flexible approaches.

Global Forecasting Models


Traditional forecasting methods typically build separate models for each time series. In contrast, global forecasting models leverage information across multiple related time series to improve prediction accuracy, especially for series with limited historical data.

When Global Models Excel

Global models show their strength in several scenarios:

  1. Cold-start problems: When new products or services have limited historical data
  2. Hierarchical forecasting: When forecasting needs to be consistent across different levels of aggregation
  3. Cross-series knowledge transfer: When patterns learned from data-rich series can improve forecasts for data-poor ones
  4. Handling exogenous variables: When incorporating external factors that affect multiple series

A notable example is Facebook’s Prophet model, which, while not strictly a global model, introduced concepts that influenced the development of global forecasting approaches. Prophet decomposes time series into trend, seasonality, and holiday components, making it effective for business time series with multiple seasonal patterns and holiday effects.
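
To make that decomposition concrete, here is a minimal Prophet sketch; the input file is hypothetical, and Prophet expects a dataframe with columns named ds (timestamp) and y (value):

```python
import pandas as pd
from prophet import Prophet

# Hypothetical daily sales data in Prophet's expected format (ds, y)
df = pd.read_csv("daily_sales.csv", parse_dates=["ds"])

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.add_country_holidays(country_name="US")  # optional built-in holiday effects
m.fit(df)

future = m.make_future_dataframe(periods=90)  # extend 90 days beyond the history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```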

Global models overcome several limitations of traditional approaches:

  • They reduce the need for series-specific parameter tuning
  • They leverage cross-series information to improve accuracy
  • They handle cold-start problems more effectively
  • They can maintain hierarchical consistency

However, global models may sometimes underperform highly-tuned local models for individual series with unique patterns. This trade-off between generalization and specialization laid the groundwork for deep learning approaches that could potentially offer the best of both worlds.

Common Libraries for Time Series Forecasting

Several Python libraries have become indispensable tools for time series analysis and forecasting:

Statistical Modeling Libraries

  • statsmodels: Implements ARIMA, SARIMA, exponential smoothing, and other statistical models
  • prophet: Facebook’s decomposable forecasting tool designed for business time series
  • pmdarima: Auto-ARIMA implementation for automated model selection

Machine Learning Libraries

  • scikit-learn: While not time series specific, provides tools for feature engineering and ML models
  • sktime: A unified framework for machine learning with time series
  • tslearn: Dedicated to time series machine learning tasks

Deep Learning Libraries

  • TensorFlow and Keras: General-purpose deep learning frameworks with time series capabilities
  • PyTorch: Flexible deep learning framework popular for research implementations
  • GluonTS: Amazon’s toolkit for probabilistic time series modeling
  • Darts: Library specializing in time series forecasting and anomaly detection
  • Neuralforecast: Modern deep learning models for time series forecasting

Specialized Forecasting Libraries

  • tsfeatures: Extracts features from time series data
  • tsfresh: Automated extraction of relevant features from time series
  • kats: Facebook’s toolkit for time series analysis
  • Merlion: A machine learning library for time series intelligence

These libraries simplify implementation and experimentation with different forecasting approaches, from classical methods to cutting-edge deep learning architectures.

Neural Networks for Time Series Forecasting

Source: How Neural Networks Can Think Like Humans And Why It Matters

Deep learning approaches have revolutionized time series forecasting by addressing many limitations of classical methods. Neural networks excel at capturing complex non-linear relationships and can automatically learn features from raw time series data.

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks were among the first deep learning architectures applied to sequence modeling. They process sequences by maintaining a hidden state that captures information about previous inputs.

The basic RNN update equation is:

h_t = tanh(W_hx * x_t + W_hh * h_(t-1) + b_h)

Where:

  • h_t is the hidden state at time t
  • x_t is the input at time t
  • W_hx and W_hh are trainable weight matrices, and b_h is a bias vector
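
A small NumPy sketch of this recurrence (the dimensions and random parameters are illustrative) makes the role of the hidden state explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 8, 20

# Randomly initialized parameters, for illustration only
W_hx = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))  # one input sequence
h = np.zeros(hidden_size)                   # initial hidden state

for t in range(seq_len):
    # h_t = tanh(W_hx * x_t + W_hh * h_(t-1) + b_h)
    h = np.tanh(W_hx @ x[t] + W_hh @ h + b_h)

print(h)  # final hidden state summarizing the whole sequence
```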

While conceptually elegant, vanilla RNNs suffer from the vanishing gradient problem when learning long-term dependencies, limiting their effectiveness for many time series applications.

Long Short-Term Memory (LSTM) Networks

LSTMs address the vanishing gradient problem by introducing a more complex cell structure with gates that control information flow:

  1. Forget gate: Decides what information to discard from the cell state
  2. Input gate: Updates the cell state with new information
  3. Output gate: Controls what information from the cell state is used for output

This architecture allows LSTMs to capture both short and long-term dependencies, making them particularly effective for time series with complex temporal patterns. LSTMs have been successfully applied to stock price prediction, energy load forecasting, and weather prediction.
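
As a rough sketch (the layer sizes and the one-step-ahead setup are assumptions), an LSTM forecaster in PyTorch can be as small as this:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Predict the next value from a window of past observations."""

    def __init__(self, n_features: int = 1, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window_length, n_features)
        output, _ = self.lstm(x)
        return self.head(output[:, -1, :])  # use the final hidden state

model = LSTMForecaster()
windows = torch.randn(32, 48, 1)  # 32 windows of 48 past observations each
print(model(windows).shape)       # torch.Size([32, 1])
```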

Gated Recurrent Units (GRUs)

GRUs simplify the LSTM architecture while maintaining similar capabilities:

  1. Update gate: Combines forget and input gates
  2. Reset gate: Controls how much past information to forget

With fewer parameters than LSTMs, GRUs are computationally more efficient while still capturing long-term dependencies effectively. They often serve as a good default choice for many time series forecasting problems.

Convolutional Neural Networks (CNNs) for Time Series

While primarily associated with image processing, CNNs have proven remarkably effective for time series forecasting:

  1. 1D Convolutions: Extract local patterns across time
  2. Dilated Convolutions: Capture longer-range dependencies without increasing parameter count
  3. Causal Convolutions: Ensure that predictions only use past information

CNNs offer several advantages for time series forecasting:

  • Parallelizable training (unlike RNNs)
  • Efficient extraction of local patterns
  • Ability to capture multi-scale temporal features
  • Lower risk of vanishing/exploding gradients

WaveNet and Temporal Convolutional Networks (TCNs) demonstrate how specialized CNN architectures can outperform recurrent models on many time series tasks.
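
A hedged sketch of a single causal, dilated convolution (channel counts, kernel size, and dilation are illustrative) shows how left-only padding keeps predictions from peeking at the future:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees the past, in the spirit of WaveNet/TCNs."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad left so output[t] depends only on inputs <= t
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

block = CausalConv1d(in_ch=1, out_ch=16)
series = torch.randn(8, 1, 100)  # 8 univariate series of length 100
print(block(series).shape)       # torch.Size([8, 16, 100])
```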

Feature Engineering for Neural Network-based Forecasting

Effective feature engineering remains crucial even with deep learning models:

Time-based Features

  • Calendar features: Day of week, month, quarter, year
  • Holiday indicators: Binary flags for holidays and special events
  • Cyclical encoding: Sin/cos transformations for cyclical variables
  • Time since event: Features capturing elapsed time since important events

Lag Features

  • Direct lags: Previous values of the target variable
  • Rolling statistics: Moving averages, standard deviations, minimums, maximums
  • Differencing features: First and seasonal differences to capture trends
  • Autocorrelation features: ACF and PACF values at different lags

External Features

  • Related time series: Other variables that might influence the target
  • Economic indicators: For business and financial forecasting
  • Weather data: For energy, retail, and transportation forecasting

The sketch below shows a minimal time-based and lag feature preparation with pandas; the file and column names are illustrative:
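
```python
import numpy as np
import pandas as pd

# Hypothetical daily data with a datetime index and a 'sales' column
df = pd.read_csv("sales.csv", index_col="date", parse_dates=True)

# Calendar features
df["day_of_week"] = df.index.dayofweek
df["month"] = df.index.month
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

# Cyclical encoding so that December sits next to January
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

# Lag and rolling features -- shifting by one step keeps them free of future information
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
df["rolling_mean_7"] = df["sales"].shift(1).rolling(window=7).mean()
df["rolling_std_7"] = df["sales"].shift(1).rolling(window=7).std()

df = df.dropna()  # drop the initial rows made incomplete by lagging
```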

Common Feature Engineering Mistakes to Avoid

  1. Data leakage: Inadvertently using future information
  2. Improper scaling: Not normalizing inputs appropriately
  3. Ignoring seasonality: Failing to account for seasonal patterns
  4. Feature explosion: Creating too many features leading to overfitting
  5. Overlooking stationarity: Not addressing non-stationary behavior

When properly engineered, these additional metadata features can dramatically improve neural network performance on time series forecasting tasks.

Transformers for Time Series Forecasting


Originally designed for natural language processing, transformer architectures have emerged as powerful tools for time series forecasting, addressing several limitations of RNN and CNN-based approaches.

Transformer Architecture Basics

The transformer architecture replaces recurrence with attention mechanisms:

  1. Self-attention: Allows the model to weigh the importance of different time steps
  2. Multi-head attention: Captures different types of dependencies in parallel
  3. Positional encoding: Preserves temporal order information
  4. Feed-forward networks: Processes the attention output

This architecture offers several advantages for time series forecasting:

  • Parallelizable computation
  • Direct modeling of long-range dependencies
  • Flexible attention to different parts of the input sequence

Adapting Transformers for Time Series

Several modifications make transformers more suitable for time series:

  1. Causal attention masks: Ensure predictions only use past information
  2. Time-based positional encodings: Better capture the specific nature of time
  3. Feature-wise attention: Handle multivariate inputs more effectively
  4. Specialized embedding layers: Process numerical time series data
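
An illustrative sketch of a causal attention mask and a standard sinusoidal positional encoding applied to a stock PyTorch transformer encoder (the sequence length, model width, and encoder settings are arbitrary):

```python
import math
import torch
import torch.nn as nn

seq_len, d_model = 96, 64

# Causal mask: position t may only attend to positions <= t (True = blocked)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Standard sinusoidal positional encoding
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_enc = torch.zeros(seq_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(16, seq_len, d_model)         # embedded time series windows
out = encoder(x + pos_enc, mask=causal_mask)  # causal self-attention over time
print(out.shape)                              # torch.Size([16, 96, 64])
```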

Feature Engineering for Transformers

Time series transformers require specific feature engineering approaches:

  1. Temporal Embeddings: Encode time information (hour, day, month) as embeddings
  2. Positional Encodings: Help model understand sequence order
  3. Attention Masks: Handle variable-length sequences efficiently
  4. Input Normalization: Feature-wise standardization
  5. Time-based Features: Aggregate statistics over different windows

Global Deep Learning Forecasting Models

Global deep learning models extend the concept of global forecasting to neural network architectures, enabling powerful cross-series learning.

Shared Representation Learning

These models learn shared patterns across multiple time series through several approaches:

  1. Entity embeddings: Learning representations for categorical identifiers
  2. Cross-series feature extraction: Identifying common patterns across series
  3. Meta-learning: Learning how to learn from related series

Specialized Deep Learning Architectures for Forecasting

Recent years have seen the development of architectures specifically designed for time series forecasting, each addressing particular challenges.

Numerous open-source libraries offer implementations of these architectures; Nixtla’s neuralforecast and PyTorch Forecasting are popular choices that combine ready-made models with the flexibility to customize them.

N-BEATS and N-BEATSx

N-BEATS (Neural Basis Expansion Analysis for Time Series) introduced a deep neural architecture with several innovative features:

  1. Interpretable decomposition: Separates trend and seasonality
  2. Basis functions: Models time series as combinations of basis functions
  3. Double residual stacking: Processes information hierarchically
  4. No architectural priors: Learns patterns directly from data

N-BEATSx extends this approach with exogenous variables, addressing a key limitation of the original model.

N-HiTS

N-HiTS (Neural Hierarchical Interpolation for Time Series) builds on N-BEATS by adding:

  1. Multi-rate data processing: Handles different frequencies simultaneously
  2. Hierarchical interpolation: Processes different time scales efficiently
  3. Improved computational efficiency: Faster training and inference

This architecture excels at multi-horizon forecasting and efficiently captures patterns at multiple time scales.
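
A hedged usage sketch with the neuralforecast library follows; the exact constructor arguments can vary between versions, and the input file is hypothetical (the library expects a long-format dataframe with unique_id, ds, and y columns):

```python
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS

# Hypothetical long-format data: unique_id (series name), ds (timestamp), y (target)
df = pd.read_csv("demand_long.csv", parse_dates=["ds"])

horizon = 12
nf = NeuralForecast(
    models=[
        NBEATS(h=horizon, input_size=2 * horizon, max_steps=500),
        NHITS(h=horizon, input_size=2 * horizon, max_steps=500),
    ],
    freq="M",
)
nf.fit(df=df)
forecasts = nf.predict()  # one forecast column per model
print(forecasts.head())
```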

Autoformer Family

Autoformer introduces a specialized transformer architecture for time series:

  1. Auto-correlation mechanism: Replaces standard attention with a similarity measure based on series periodicity
  2. Series decomposition: Separates seasonal and trend components inside the network
  3. Frequency-domain computation: Uses FFT-based correlations to capture periodic patterns efficiently

Related models such as Informer and Reformer address the cost of attention on long sequences with sparse and hashing-based attention mechanisms.

LTSF-Linear Models

LTSF-Linear (Long-term Time Series Forecasting with Linear models) demonstrates that surprisingly simple linear models with appropriate preprocessing can outperform complex architectures:

  1. Direct linear mapping: A single linear layer maps the look-back window straight to the forecast horizon
  2. DLinear: Applies seasonal-trend decomposition before the linear mapping
  3. NLinear: Normalizes each window by its last value to cope with distribution shift

This family challenges the assumption that more complex architectures are always better for time series forecasting.

PatchTST

PatchTST (Patched Time Series Transformer) combines:

  1. Patch embedding: Groups consecutive time points
  2. Channel-independence: Processes variables separately
  3. Efficient transformer design: Specialized for time series

By focusing on local patterns within patches while maintaining the transformer’s ability to capture long-range dependencies, PatchTST achieves state-of-the-art performance with lower computational requirements.

iTransformer

iTransformer inverts the typical transformer approach:

  1. Variate tokens: Embeds each variable’s entire series as a single token
  2. Attention across variables: Applies self-attention over variates to capture multivariate correlations
  3. Feed-forward along time: Learns temporal representations within each variate token

This architecture particularly excels at multivariate forecasting where inter-variable relationships are crucial.

Temporal Fusion Transformer (TFT)

TFT addresses the complexity of real-world forecasting by combining:

  1. Variable selection networks: Identify relevant inputs dynamically
  2. Gated residual networks: Handle different information types
  3. Multi-horizon attention: Specialized for long forecast horizons
  4. Interpretable attention weights: Provide insights into model decisions

TFT has proven particularly effective for business forecasting with multiple exogenous variables.

TSMixer

TSMixer applies the MLP-Mixer concept to time series:

  1. Time mixing: Captures temporal patterns with MLPs
  2. Channel mixing: Models relationships between variables
  3. Simple architecture: Relies on feed-forward networks only

This architecture demonstrates that self-attention isn’t always necessary for effective time series modeling.

Time Series Dense Encoder (TiDE)

Time Series Dense Encoder (TiDE) is an MLP-based encoder-decoder that relies on neither attention nor recurrence:

  1. Dense residual blocks: Stacks simple feed-forward blocks with residual connections
  2. Covariate projection: Encodes past observations together with known future covariates
  3. Linear complexity: Scales linearly with the look-back and horizon lengths, keeping training fast

This approach shows particular promise for memory-constrained applications while maintaining competitive accuracy.

Common Mistakes in Time Series Forecasting

Even experienced practitioners encounter pitfalls when building forecasting models. Awareness of these common mistakes can significantly improve model performance.

Data Leakage

Data leakage (or feature leakage) occurs when future information inadvertently influences predictions. For example, a model trained to predict belt sales at time t+1 using shoe sales at t+1 will look accurate in training, but at prediction time the shoe sales for t+1 are not yet known.

  1. Look-ahead bias: Using future values to create features
  2. Inappropriate normalization: Scaling using statistics from the entire dataset
  3. Incorrect validation splits: Not respecting temporal order

Prevention: Always maintain strict time-based train/validation/test splits, fit preprocessing steps on past information only, and lag feature values by at least one step so every row contains only data available at prediction time.
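
For example, a leakage-safe evaluation sketch with scikit-learn’s TimeSeriesSplit (the synthetic features and Ridge model are stand-ins for your own pipeline):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Illustrative, time-ordered feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Fit the scaler on the training window only -- never on the full dataset
    scaler = StandardScaler().fit(X_train)
    model = Ridge().fit(scaler.transform(X_train), y_train)

    preds = model.predict(scaler.transform(X_test))
    print(f"fold {fold}: MAE = {mean_absolute_error(y_test, preds):.3f}")
```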

Ignoring Data Quality Issues

Time series often contain problematic data:

  1. Missing values: Gaps in historical data
  2. Outliers: Extreme values due to errors or rare events
  3. Distribution shifts: Changes in data patterns over time
  4. Inconsistent frequencies: Irregular or mixed sampling rates

Prevention: Implement robust preprocessing pipelines and monitor data quality over time.

Improper Evaluation

Evaluation errors lead to overoptimistic performance estimates:

  1. Using inappropriate metrics: Not matching metrics to business objectives
  2. Ignoring uncertainty: Focusing solely on point forecasts
  3. Limited backtesting: Not testing across different time periods
  4. Disregarding baseline comparisons: Not comparing against simple models

Prevention: Implement rigorous evaluation frameworks with multiple metrics and extensive backtesting.
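
As an illustrative sketch (the synthetic series and forecast origins are arbitrary), a simple backtest of a seasonal-naive baseline across several origins might look like this; any candidate model should beat this number before it earns its complexity:

```python
import numpy as np
import pandas as pd

def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# Illustrative monthly series with yearly seasonality plus noise
rng = np.random.default_rng(0)
y = pd.Series(100 + 10 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 2, 60))

horizon, season = 12, 12
scores = []
for origin in (36, 42, 48):  # backtest over three forecast origins
    train, test = y[:origin], y[origin:origin + horizon]
    # Seasonal-naive baseline: repeat the last observed season
    baseline = train.iloc[-season:].to_numpy()[: len(test)]
    scores.append(mape(test.to_numpy(), baseline))

print(f"Seasonal-naive MAPE across origins: {np.mean(scores):.1f}%")
```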

Model Complexity Mismatches

Choosing inappropriate model complexity causes either underfitting or overfitting:

  1. Overparameterization: Using unnecessarily complex models
  2. Failure to capture seasonality: Using models that can’t represent periodic patterns
  3. Ignoring hierarchy: Not accounting for hierarchical relationships
  4. Overlooking ensemble opportunities: Relying on single models

Prevention: Start simple and increase complexity incrementally, with proper validation at each step.

Deployment Challenges

Models that work well in development may fail in production:

  1. Feature drift: Changes in input feature distributions
  2. Concept drift: Changes in the underlying relationships
  3. Delayed data availability: Assumptions about when data becomes available
  4. Computational constraints: Resource limitations in production environments

Prevention: Design robust deployment pipelines with monitoring, retraining strategies, and fallback options.

Conclusion

Time series forecasting has evolved dramatically from classical statistical methods to sophisticated deep learning architectures. While traditional approaches like ARIMA and exponential smoothing continue to provide robust baselines, neural networks and large language models have unlocked new capabilities for handling complex patterns, multiple series, and external factors.

The journey through forecasting methodologies reveals several key insights:

  1. No Silver Bullet: Different problems require different architectures. Know when to use simpler models versus complex ones.
  2. Domain Knowledge Matters: Understanding the underlying data generation process helps in feature engineering and model selection.
  3. Rigorous Evaluation: Use appropriate time series cross-validation techniques and metrics.
  4. Avoid Common Pitfalls: Particularly data leakage and improper train-test splits.
  5. Consider Uncertainty: Point forecasts alone can be misleading; probabilistic approaches often provide more value.

By understanding the strengths and limitations of different approaches and avoiding common pitfalls, you can develop time series forecasting solutions that provide reliable insights for critical decision-making.

References

  1. Joseph, M., & Tackes, J. (2023). Modern Time Series Forecasting with Python. Packt Publishing.
  2. Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice. OTexts.
  3. Benidis, K., et al. (2022). “Neural forecasting: Introduction and literature review.” arXiv preprint arXiv:2004.10240.
  4. Lim, B., et al. (2021). “Temporal fusion transformers for interpretable multi-horizon time series forecasting.” International Journal of Forecasting.
  5. Oreshkin, B. N., et al. (2019). “N-BEATS: Neural basis expansion analysis for interpretable time series forecasting.” International Conference on Learning Representations.
  6. Zhou, H., et al. (2021). “Informer: Beyond efficient transformer for long sequence time-series forecasting.” AAAI Conference on Artificial Intelligence.
  7. Zeng, A., et al. (2022). “Are transformers effective for time series forecasting?” arXiv preprint arXiv:2205.13504.
  8. Nie, Y., et al. (2022). “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations.
  9. Wen, Q., et al. (2023). “Transformers in time series: A survey.” arXiv preprint arXiv:2202.07125.
