Testing Imputation Methods on Solar Activity Data

Why Solar Activity Data?

I chose the sunspot dataset because solar activity directly impacts technology we use daily, like GPS accuracy, satellite communications, and power grids.

This dataset contains monthly sunspot numbers starting in 1749 - well over 200 years of observations. Over a record that long, gaps caused by cloudy weather or equipment problems were common, which makes it a useful test case for imputation methods.

Loading and Exploring the Data

library(imputeTestbench)

data("sunspot.month")
sunspot_data <- as.numeric(sunspot.month)

cat("Dataset: Monthly sunspot numbers (1749 onward)\n")

## Dataset: Monthly sunspot numbers (1749 onward)

cat("Total observations:", length(sunspot_data), "\n")

## Total observations: 3310

cat("Activity range:", round(min(sunspot_data), 1), "to", round(max(sunspot_data), 1), "\n")

## Activity range: 0 to 398.2
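The time span is easier to read off the series itself than to hard-code. The sketch below uses only base R accessors on the built-in `sunspot.month` object; the variable names are just for illustration. Note that `as.numeric()` drops these time-series attributes, so `sunspot_data` is a plain vector of values with no dates attached.

```r
# Read the time-series attributes of the built-in dataset (base R only)
start_ym   <- start(sunspot.month)       # c(year, month) of the first observation
end_ym     <- end(sunspot.month)         # c(year, month) of the last observation
freq       <- frequency(sunspot.month)   # observations per year (12 for monthly data)
span_years <- length(sunspot.month) / freq

cat("First observation:", start_ym[1], "month", start_ym[2], "\n")
cat("Last observation: ", end_ym[1], "month", end_ym[2], "\n")
cat("Span in years:    ", round(span_years, 1), "\n")
```

The end date depends on the R version installed, since the dataset has been extended over time, so it is worth checking before quoting a range.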

Let’s plot this:

plot(sunspot.month, 
     main="Solar Activity: Monthly Sunspot Numbers Since 1749",
     ylab="Number of Sunspots",
     xlab="Year",
     col="darkorange", lwd=1.5)

The data shows clear ~11-year solar cycles: activity climbs to a maximum (lots of sunspots), falls to a minimum (few sunspots), and climbs again, roughly every 11 years. This cyclical pattern is what makes the series interesting for testing imputation methods, since a good method should follow the cycle rather than flatten it.
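The ~11-year figure can be checked against the data with a quick periodogram. This is a rough sketch using base R's `aggregate()` and `spectrum()`, not a careful spectral analysis; averaging to yearly values first is my own simplification so that frequencies come out in cycles per year.

```r
# Average the monthly series to yearly means (incomplete trailing years are dropped)
yearly <- aggregate(sunspot.month, nfrequency = 1, FUN = mean)

# Raw periodogram; frequencies are in cycles per year because frequency(yearly) == 1
sp <- spectrum(yearly, plot = FALSE)

# The dominant period is the inverse of the frequency with the largest spectral density
dominant_period <- 1 / sp$freq[which.max(sp$spec)]
cat("Dominant period (years):", round(dominant_period, 1), "\n")
```

The estimate should land near 11 years, though the exact value depends on the series length and on the default tapering and detrending in `spectrum()`.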

Running the Benchmark

Let me test how different imputation methods handle missing data in this cyclical time series:

results <- impute_errors(dataIn = sunspot_data)
print(results)

## $Parameter
## [1] "rmse"
## 
## $MissingPercent
## [1] 10 20 30 40 50 60 70 80 90
## 
## $na.approx
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na.interp
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na_interpolation
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na.locf
## [1]  8.651962 12.529590 15.886170 18.993992 22.037891 24.840293
## [7] 28.839534 34.645176 48.130099
## 
## $na_mean
## [1] 21.07200 30.36478 37.13709 42.76722 47.75549 52.57667 56.80614
## [8] 60.50147 64.32008
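To make one cell of that table concrete, here is a minimal base-R sketch of the idea behind it: hide 10% of the values, fill them by linear interpolation, and score the result with RMSE. The seed, the choice to keep the endpoints, and the two ways of scoring are my assumptions for illustration; imputeTestbench's own sampling and scoring may differ, so the numbers are not expected to match the table exactly.

```r
set.seed(42)                              # arbitrary seed, for reproducibility only
x <- as.numeric(sunspot.month)
n <- length(x)

# Hide 10% of the interior points at random (endpoints kept so interpolation is defined)
miss_idx <- sample(2:(n - 1), size = round(0.10 * n))
x_miss <- x
x_miss[miss_idx] <- NA

# Fill the gaps by linear interpolation between the nearest observed neighbours
obs <- which(!is.na(x_miss))
x_filled <- approx(x = obs, y = x_miss[obs], xout = seq_len(n))$y

# RMSE on the hidden points only, and over the whole series
rmse_hidden <- sqrt(mean((x[miss_idx] - x_filled[miss_idx])^2))
rmse_all    <- sqrt(mean((x - x_filled)^2))

cat("RMSE on hidden points:", round(rmse_hidden, 2), "\n")
cat("RMSE on full series:  ", round(rmse_all, 2), "\n")
```

The full-series RMSE is always the smaller of the two, because the observed points contribute zero error; which convention a benchmark uses matters when comparing numbers across tools.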

Comparing Methods Visually

plot_errors(results, plotType='line')

What stands out:

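na_mean is consistently the worst performer - makes sense, since it fills every gap with the overall average and completely ignores the 11-year cycles.
na.approx, na.interp, and na_interpolation perform best, with identical RMSE at every missing percentage, which suggests all three are doing plain linear interpolation on this input.
na.locf stays within about 5 RMSE points of the interpolation methods up to 60% missing, then degrades sharply: its RMSE jumps from 34.6 at 80% to 48.1 at 90%, although that is still below na_mean (64.3).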
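One way to see why carrying the last value forward degrades so quickly at high missing rates is to look at how long the gaps get. The sketch below assumes values are removed completely at random, which I believe is the package default but have not verified here, and uses a synthetic mask of the same length as the series (3310); `mean_gap` is a hypothetical helper, not part of any package.

```r
set.seed(1)     # arbitrary seed
n <- 3310       # length of the series reported earlier

# Average length of a run of consecutive missing values when a fraction p is removed
mean_gap <- function(p) {
  is_missing <- rep(FALSE, n)
  is_missing[sample(n, size = round(p * n))] <- TRUE
  runs <- rle(is_missing)
  mean(runs$lengths[runs$values])
}

gaps <- sapply(c(0.5, 0.8, 0.9), mean_gap)
names(gaps) <- c("50%", "80%", "90%")
print(round(gaps, 1))
```

Under random removal the average gap is roughly 1 / (1 - p) months, so about 5 months at 80% and about 10 months at 90%. A value carried forward across a longer gap drifts further from a series that keeps rising or falling, while interpolation at least heads toward the next observed point.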

How Methods Handle 40% Missing Data

Let me see the actual imputation results when 40% of observations are missing:

plot_impute(dataIn = sunspot_data, missPercent = 40)

Pink dots = imputed values, Blue dots = actual observations

My observations:

na.approx, na.interp, and na_interpolation: These three are hard to distinguish from each other - the pink imputed points flow naturally with the blue actual data. They’re capturing the ~11-year solar cycles pretty well.
na_mean: This one’s obviously broken for this data. See that flat pink line? It’s completely ignoring more than two centuries of solar variation and just using the average. Useless for any real analysis of a cyclical series.
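The flattening effect of mean imputation can be shown numerically as well as visually. This is a minimal base-R sketch, assuming 40% of values are removed completely at random; the seed and variable names are mine, and it fills the gaps by hand rather than calling na_mean.

```r
set.seed(7)                               # arbitrary seed
x <- as.numeric(sunspot.month)
n <- length(x)

# Hide 40% of the values at random
miss_idx <- sample(n, size = round(0.40 * n))
x_miss <- x
x_miss[miss_idx] <- NA

# Mean imputation: every gap gets the same value, the mean of what is still observed
fill_value <- mean(x_miss, na.rm = TRUE)
x_mean <- x_miss
x_mean[miss_idx] <- fill_value

cat("Fill value:               ", round(fill_value, 1), "\n")
cat("SD of original series:    ", round(sd(x), 1), "\n")
cat("SD after mean imputation: ", round(sd(x_mean), 1), "\n")
```

Because every imputed point sits exactly at the mean, the filled series has noticeably less spread than the original, which is the numerical counterpart of the flat line in the plot.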