Missing data can ruin your demand forecasts. It leads to stock problems, financial losses, and unhappy customers. But with the right techniques, you can fix these issues and make your forecasts more reliable. Here’s how to handle missing data effectively:
- Identify Missing Data Types: Learn about MCAR, MAR, and MNAR to choose the right strategy.
- Fill Data Gaps: Use imputation methods like mean, median, regression, or machine learning.
- Handle Time Series Data: Apply interpolation, moving averages, or seasonal adjustments.
- Check Data Quality: Detect and fix outliers, scale data, and validate imputation results.
- Add Features: Create new data points like lag features, rolling averages, or seasonal indicators.
Why it matters: Proper data preprocessing reduces errors, improves forecast accuracy, and helps you make smarter business decisions. Follow these steps to ensure your data is ready for demand forecasting.
Types of Missing Data
Missing data is generally classified into three categories – MCAR, MAR, and MNAR. Understanding these helps you pick the right preprocessing strategy.
Missing Completely at Random (MCAR)
MCAR happens when missing data shows no discernible pattern or connection to other variables. For instance, a random point-of-sale system glitch causing occasional data loss is a classic MCAR example. For such cases, you can use simple strategies like:
- Mean or median imputation
- Complete case analysis
- Random sampling
These methods work because the missing data doesn’t depend on any specific factor.
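As a minimal sketch (assuming a pandas DataFrame df with a numeric 'sales' column, which is illustrative), the three options above could look like this:
# df is assumed to be a pandas DataFrame with a numeric 'sales' column
# Option 1 - mean or median imputation (median is more robust to outliers)
median_filled = df['sales'].fillna(df['sales'].median())
# Option 2 - complete case analysis: keep only fully observed rows
df_complete = df.dropna()
# Option 3 - random sampling: draw replacements from the observed distribution
observed = df['sales'].dropna()
missing_mask = df['sales'].isnull()
random_filled = df['sales'].copy()
random_filled[missing_mask] = observed.sample(missing_mask.sum(), replace=True).values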
Missing at Random (MAR)
MAR occurs when the missing data can be explained by other variables in your dataset. For example, if sales data is missing during holiday periods, the time of year explains the gaps, even though it doesn’t directly relate to the sales figures themselves.
Method | Best Scenario | Advantage |
---|---|---|
Regression Imputation | Large datasets with clear variable links | Maintains relationships between variables |
Multiple Imputation | Complex datasets with various missing patterns | Handles uncertainty in predictions |
These approaches are better suited for MAR since they take related variables into account.
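One way to put these into practice is scikit-learn's IterativeImputer, which fits a regression model per column from the other variables; the column names below are illustrative, and the repeated-runs averaging is only a rough stand-in for full multiple imputation:
# Regression-style (MAR-aware) imputation; 'price' and 'foot_traffic' are illustrative columns
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer
import pandas as pd
features = df[['sales', 'price', 'foot_traffic']]
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns, index=features.index)
# Rough stand-in for multiple imputation: repeat with posterior sampling and average the runs
runs = [IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(features)
        for seed in range(5)]
imputed_mi = pd.DataFrame(sum(runs) / len(runs), columns=features.columns, index=features.index)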
Missing Not at Random (MNAR)
MNAR arises when the missing data is directly related to the values themselves. For example, high-value transactions might not be recorded because of system limitations. Addressing MNAR is more complex and often requires:
- Modeling the missing data mechanism
- Conducting sensitivity analyses
- Consulting domain experts for deeper insights
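There is no purely statistical fix for MNAR, but a simple sensitivity check like the sketch below (the 'transaction_value' column name is illustrative) shows how much your conclusions depend on what you assume about the gaps:
# Sensitivity analysis sketch; 'transaction_value' is an illustrative column name
missing_flag = df['transaction_value'].isnull()  # explicit indicator of where data is missing
scenarios = {
    'assume_typical': df['transaction_value'].fillna(df['transaction_value'].median()),
    'assume_high': df['transaction_value'].fillna(df['transaction_value'].quantile(0.95)),
}
for name, filled in scenarios.items():
    # If these summaries diverge sharply, your results hinge on the MNAR assumption
    print(name, round(filled.mean(), 2))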
Data Preprocessing Steps
The steps below walk through detecting and handling missing data so it doesn't undermine forecast accuracy.
1. Identify Missing Values
Python's pandas library makes it simple to detect missing data patterns:
# Check total missing values
df.isnull().sum()
# Calculate percentage of missing values
(df.isnull().sum() / len(df)) * 100
# Visualize missing patterns
import missingno as msno
msno.matrix(df)
These commands help you understand the distribution of missing data in your dataset.
2. Handle Missing Data
Remove Missing Data
If the percentage of missing data is low (less than 5%), you can remove rows with missing values. However, this approach might introduce bias if the data isn’t missing completely at random (MCAR).
# Remove rows with any missing values
df_clean = df.dropna()
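A small guard like the one below (using the 5% rule of thumb mentioned above) keeps removal honest by checking how many rows would actually be lost:
# Drop rows only when the share of incomplete rows is small
missing_share = df.isnull().any(axis=1).mean()
if missing_share < 0.05:   # the 5% rule of thumb mentioned above
    df_clean = df.dropna()
else:
    df_clean = df.copy()   # too much data would be lost; impute instead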
Replace Missing Values
For simple imputation, choose a method based on your data type and distribution:
Method | Best Use Case | Implementation |
---|---|---|
Mean | Numerical data with normal distribution | df['column'].fillna(df['column'].mean()) |
Median | Numerical data with outliers | df['column'].fillna(df['column'].median()) |
Mode | Categorical data | df['column'].fillna(df['column'].mode()[0]) |
Forward Fill | Time series with consistent trends | df['column'].ffill() |
For more complex datasets, advanced methods might be necessary.
Predict Missing Values
Machine learning can be a powerful tool for filling in missing data in larger datasets:
from sklearn.impute import KNNImputer
# KNN imputation
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
3. Time Series-Specific Techniques
After basic imputation, enhance time series data using specialized methods like interpolation or moving averages:
# Time series imputation options (assumes a DatetimeIndex)
df['sales'] = df['sales'].interpolate(method='time')  # Time-based interpolation
# Alternative: centered moving average (min_periods=1 lets partial windows fill gaps)
df['sales'] = df['sales'].fillna(df['sales'].rolling(window=3, center=True, min_periods=1).mean())
# Alternative: low-order polynomial interpolation for smooth, curved trends
df['sales'] = df['sales'].interpolate(method='polynomial', order=2)
These techniques help maintain the integrity of time-dependent patterns in your data.
Data Quality Checks
After imputation, it’s important to confirm data quality to ensure accurate forecasting. Here’s a look at some key quality control steps.
Find and Fix Outliers
Outliers can throw off your forecasts. Use statistical techniques to identify and address them:
# Detect outliers using Z-scores
from scipy import stats
z_scores = stats.zscore(df['demand'])
outliers = abs(z_scores) > 3
# Identify outliers using the IQR method
Q1 = df['demand'].quantile(0.25)
Q3 = df['demand'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
Here are some ways to handle outliers:
Method | Best Use Case | How It Works |
---|---|---|
Capping | For extreme but valid values | Limit values to a range, e.g., 5th to 95th percentile |
Removal | For obvious data errors | Exclude values outside 3 standard deviations |
Investigation | For unexpected patterns | Collaborate with domain experts to analyze anomalies |
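For the capping option, a quick sketch reusing the percentile idea from the table (the new column name is illustrative):
# Cap extreme but valid values to the 5th-95th percentile range (winsorizing)
lower_cap = df['demand'].quantile(0.05)
upper_cap = df['demand'].quantile(0.95)
df['demand_capped'] = df['demand'].clip(lower=lower_cap, upper=upper_cap)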
Scale and Normalize Data
Scaling numerical data ensures consistency across features, which is particularly useful for machine learning models:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardize data (mean=0, std=1)
scaler = StandardScaler()
df['demand_scaled'] = scaler.fit_transform(df[['demand']])
# Normalize data to a 0-1 range
normalizer = MinMaxScaler()
df['demand_normalized'] = normalizer.fit_transform(df[['demand']])
Once scaled or normalized, you can extract more insights from your data.
Create New Data Features
Adding new features can highlight important trends and patterns:
# Generate lag features
df['demand_lag1'] = df['demand'].shift(1)
df['demand_lag7'] = df['demand'].shift(7) # Weekly lag
# Add seasonal indicators
df['month'] = df.index.month
df['day_of_week'] = df.index.dayofweek
# Calculate rolling statistics
df['rolling_mean'] = df['demand'].rolling(window=7).mean()
df['rolling_std'] = df['demand'].rolling(window=7).std()
Focus on features that align with your forecasting needs:
Feature Type | Purpose | Examples |
---|---|---|
Temporal | Highlight time-based trends | Month, day of week, season |
Statistical | Show historical patterns | Moving averages, volatility |
Domain-specific | Add business context | Promotions, holidays |
Interaction | Combine related variables | Price × season, demand × day |
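Domain-specific and interaction features depend on the data you actually track; assuming hypothetical 'price' and 'on_promotion' columns alongside the features created above, they might look like this:
# Domain-specific flags; 'on_promotion' and 'price' are illustrative column names
df['is_holiday_season'] = df.index.month.isin([11, 12]).astype(int)
df['promo_flag'] = df['on_promotion'].fillna(0).astype(int)
# Interaction features combine related variables
df['price_x_month'] = df['price'] * df['month']
df['lag1_x_dayofweek'] = df['demand_lag1'] * df['day_of_week']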
Quality Control Methods
After preprocessing, thorough quality checks ensure your data is ready for accurate forecasting.
Work with Industry Experts
Collaborate with domain experts to validate assumptions, spot anomalies, and align preprocessing with business needs. Regular consultations can help you:
- Review preprocessing assumptions
- Detect seasonal trends and irregularities
- Confirm business rules for data handling
- Understand the context behind unusual data patterns
Stage | Action | Expected Outcome |
---|---|---|
Initial Review | Share methodology | Validate assumptions |
Check-ins | Review imputation | Identify potential biases |
Feedback Loop | Document insights | Refine preprocessing methods |
Performance Analysis | Compare with KPIs | Ensure alignment with business goals |
Check Imputation Results
Testing the accuracy of imputed data is critical for reliable forecasting. One approach is time series cross-validation: hide a share of known values in each test fold, re-impute them, and measure the error. In the example below, data is assumed to be the observed demand series with a DatetimeIndex:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import numpy as np
tscv = TimeSeriesSplit(n_splits=5)
mse_scores = []
for train_idx, test_idx in tscv.split(data):
    # train_idx would be used here if the imputer needed fitting (e.g. KNN or regression)
    test_fold = data.iloc[test_idx].copy()
    # Hide roughly 20% of the known values in the test fold
    test_mask = np.random.rand(len(test_fold)) < 0.2
    masked_fold = test_fold.copy()
    masked_fold[test_mask] = np.nan
    # Re-impute the masked values (time interpolation stands in for your chosen imputer)
    imputed_fold = masked_fold.interpolate(method='time').ffill().bfill()
    mse = mean_squared_error(test_fold[test_mask], imputed_fold[test_mask])
    mse_scores.append(mse)
Key metrics for validation include:
Metric | Purpose |
---|---|
RMSE | Evaluate prediction accuracy |
MAE | Measure absolute differences |
Distribution Tests | Check statistical properties |
Visual Inspection | Confirm consistency in data patterns |
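Another lightweight check, independent of the cross-validation loop above, is to compute RMSE and MAE on values you deliberately hide and re-impute; the sketch below assumes a 'demand' series with a DatetimeIndex, and time interpolation stands in for whichever imputer you are testing:
# Hide 10% of the known values, re-impute, then score the reconstruction
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
known = df['demand'].dropna()
holdout_idx = known.sample(frac=0.1, random_state=42).index
masked = df['demand'].copy()
masked.loc[holdout_idx] = np.nan
reconstructed = masked.interpolate(method='time').ffill().bfill()  # stand-in for your imputer
rmse = np.sqrt(mean_squared_error(known.loc[holdout_idx], reconstructed.loc[holdout_idx]))
mae = mean_absolute_error(known.loc[holdout_idx], reconstructed.loc[holdout_idx])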
These steps help bridge preprocessing efforts with model evaluation, ensuring reliable results.
Update Methods Regularly
Maintaining high-quality data requires ongoing updates to adapt to changing business needs and data trends.
- Monthly Performance Review: Regularly assess pipeline metrics, including imputation accuracy, outlier detection, and model performance.
- Quarterly Method Updates: Update methods based on new data trends, changes in business requirements, emerging techniques, and performance insights.
- Documentation and Version Control: Keep detailed records of updates and use version control for tracking changes.
Example configuration for version control:
preprocessing_config = {
    'version': '2.1.0',
    'last_updated': '2025-04-09',
    'changes': {
        'imputation_method': 'advanced_knn',
        'outlier_threshold': 3.5,
        'scaling_technique': 'robust_scaler'
    }
}
By regularly reviewing and updating your methods, you can ensure your data remains reliable and aligned with business objectives.
Preprocessing Missing Data: Why It Matters
Handling missing data effectively is critical for accurate demand forecasting. By carefully identifying, addressing, and validating missing values, you can improve forecasts and make smarter business decisions.
Here’s how proper preprocessing impacts your business:
Aspect | Business Impact |
---|---|
Data Quality | Reduces bias and ensures models are more reliable. |
Forecast Accuracy | Leads to more precise demand predictions. |
Decision Making | Improves inventory management and resource allocation. |
Cost Efficiency | Helps avoid stockouts and overstocking issues. |
To maintain these benefits, it’s important to regularly review your preprocessing methods. Steps like keeping thorough documentation, consulting experts, and monitoring performance ensure your approach stays effective as trends and data evolve.
For advanced solutions, consider working with data analytics professionals. Companies like Growth-onomics specialize in turning raw data into actionable insights, helping you create accurate demand forecasts.
Adopting these practices helps you build forecasting models you can trust, supporting better decisions and driving business success.