Why Solar Activity Data?

I chose the sunspot dataset because solar activity directly affects technology we rely on every day, such as GPS, satellite communications, and power grids.

This dataset contains monthly sunspot numbers from 1749 onward. The 3310 monthly values loaded below cover about 276 years, through 2024. Over such a long record, gaps caused by cloudy weather or equipment problems were common, which makes it a useful test case for imputation methods.

Loading and Exploring the Data

library(imputeTestbench)

data("sunspot.month")
sunspot_data <- as.numeric(sunspot.month)

cat("Dataset: Monthly sunspot numbers (1749-2024)\n")
## Dataset: Monthly sunspot numbers (1749-2024)
cat("Total observations:", length(sunspot_data), "\n")
## Total observations: 3310
cat("Activity range:", round(min(sunspot_data), 1), "to", round(max(sunspot_data), 1), "\n")
## Activity range: 0 to 398.2
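As a quick sanity check on the time span, the start, end, and frequency can be read straight from the time series object. This is a minimal base R sketch; the variable names are only illustrative, and the exact end date depends on the version of `sunspot.month` shipped with your R installation.

```r
data("sunspot.month")

# start() and end() return c(year, month) for a monthly ts object
span_start <- start(sunspot.month)
span_end   <- end(sunspot.month)
n_months   <- length(sunspot.month)

cat("First observation:", span_start[1], "month", span_start[2], "\n")
cat("Last observation: ", span_end[1], "month", span_end[2], "\n")
cat("Years covered:", round(n_months / frequency(sunspot.month), 1), "\n")

# Check whether the series itself already contains gaps
cat("Any missing values:", anyNA(sunspot.month), "\n")
```

With 3310 monthly values starting in January 1749, the years covered come out at roughly 275.8.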

Let’s plot this:

plot(sunspot.month, 
     main="Solar Activity: Monthly Sunspot Numbers Since 1749",
     ylab="Number of Sunspots",
     xlab="Year",
     col="darkorange", lwd=1.5)

The data shows clear ~11-year solar cycles: the Sun moves from high activity (lots of sunspots) to low activity (few sunspots) and back to high again roughly every 11 years. This cyclical pattern is what makes the series interesting for testing imputation methods.
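The cycle length can also be estimated from the data rather than read off the plot. The sketch below uses the autocorrelation function from base R and looks for the lag with the strongest correlation between 5 and 16 years; that search window is my own choice, made to skip the trivial peak near lag 0 and stay below the second cycle.

```r
data("sunspot.month")
x <- as.numeric(sunspot.month)

# Autocorrelation for lags up to 16 years of monthly data
ac <- acf(x, lag.max = 192, plot = FALSE)

# Search window in months (5 to 16 years); ac$acf[1] is lag 0, hence the + 1
lags <- 60:192
vals <- as.numeric(ac$acf)[lags + 1]
peak_lag <- lags[which.max(vals)]

cat("Autocorrelation peaks at a lag of about",
    round(peak_lag / 12, 1), "years\n")
```

A peak near 11 years would agree with the cycles visible in the plot.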

Running the Benchmark

Let me test how different imputation methods handle missing data in this cyclical time series:

results <- impute_errors(dataIn = sunspot_data)
print(results)
## $Parameter
## [1] "rmse"
## 
## $MissingPercent
## [1] 10 20 30 40 50 60 70 80 90
## 
## $na.approx
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na.interp
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na_interpolation
## [1]  7.055947 10.375468 12.837354 15.386366 17.691068 20.097133
## [7] 22.597475 24.894691 30.350494
## 
## $na.locf
## [1]  8.651962 12.529590 15.886170 18.993992 22.037891 24.840293
## [7] 28.839534 34.645176 48.130099
## 
## $na_mean
## [1] 21.07200 30.36478 37.13709 42.76722 47.75549 52.57667 56.80614
## [8] 60.50147 64.32008
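To make these numbers less of a black box, here is a rough base R sketch of a single benchmark round: hide 10% of the values at random, fill them in, and compute the RMSE against the original series. It is not the package's internal code. It uses one random draw instead of repeated runs, `approx()` stands in for the interpolation methods, and scoring the RMSE over the full series (not only the hidden points) is my assumption about how the errors are calculated.

```r
set.seed(42)
data("sunspot.month")
x <- as.numeric(sunspot.month)
n <- length(x)

# Hide 10% of the points at random; endpoints are kept so interpolation is defined
miss_idx <- sample(2:(n - 1), size = round(0.10 * n))
x_miss <- x
x_miss[miss_idx] <- NA

# Fill 1: linear interpolation between the remaining observations
obs <- which(!is.na(x_miss))
x_lin <- approx(obs, x_miss[obs], xout = seq_len(n))$y

# Fill 2: replace every gap with the mean of the remaining observations
x_mean <- x_miss
x_mean[miss_idx] <- mean(x_miss, na.rm = TRUE)

# RMSE over the full series (assumption, see note above)
rmse <- function(actual, imputed) sqrt(mean((actual - imputed)^2))
rmse_lin  <- rmse(x, x_lin)
rmse_mean <- rmse(x, x_mean)

cat("RMSE, linear interpolation:", round(rmse_lin, 2), "\n")
cat("RMSE, mean fill:           ", round(rmse_mean, 2), "\n")
```

The exact values depend on the random draw, but linear interpolation should come out well ahead of mean fill, as it does in the table above.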

Comparing Methods Visually

plot_errors(results, plotType='line')

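What stands out: the three interpolation methods (na.approx, na.interp, na_interpolation) produce identical errors, rising from an RMSE of about 7.1 at 10% missing data to 30.4 at 90%. na.locf is slightly worse at low missingness and degrades faster (48.1 at 90%), while na_mean has by far the highest error at every level (21.1 to 64.3).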

How Methods Handle 40% Missing Data

Let me see the actual imputation results when 40% of observations are missing:

plot_impute(dataIn = sunspot_data, missPercent = 40)

Pink dots = imputed values, Blue dots = actual observations
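A toy series helps show why the methods fill gaps so differently. The sketch below uses base R stand-ins for the three kinds of method, not the package functions themselves, and the `toy` vector is made up for illustration.

```r
toy <- c(10, NA, NA, 40, 70)
obs <- which(!is.na(toy))

# Linear interpolation: draw a straight line across the gap
lin <- approx(obs, toy[obs], xout = seq_along(toy))$y

# Last observation carried forward: repeat the previous known value
locf <- toy
for (i in seq_along(locf)[-1]) {
  if (is.na(locf[i])) locf[i] <- locf[i - 1]
}

# Mean fill: replace every gap with the mean of the known values
mean_fill <- ifelse(is.na(toy), mean(toy, na.rm = TRUE), toy)

print(rbind(linear = lin, locf = locf, mean = mean_fill))
```

Linear interpolation fills the gap with 20 and 30, carrying the last observation forward repeats 10, and mean fill inserts 40 in both positions. On a cyclical series, a flat fill like the last two drifts further from the true curve as the gaps get longer.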

My observations: