# forecastML **Repository Path**: himawork/forecastML ## Basic Information - **Project Name**: forecastML - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-08-25 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

[![CRAN](https://www.r-pkg.org/badges/version/forecastML)](https://cran.r-project.org/package=forecastML) [![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) [![Travis Build Status](https://travis-ci.org/nredell/forecastML.svg?branch=master)](https://travis-ci.org/nredell/forecastML) [![codecov](https://codecov.io/github/nredell/forecastML/branch/master/graphs/badge.svg)](https://codecov.io/github/nredell/forecastML) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nredell/forecastML/master?urlpath=https%3A%2F%2Fgithub.com%2Fnredell%2FforecastML%2Ftree%2Fmaster%2Fnotebooks%2F) # package::forecastML forecastML logo The purpose of `forecastML` is to provide a series of functions and visualizations that simplify the process of **multi-step-ahead forecasting with standard machine learning algorithms**. It's a wrapper package aimed at providing maximum flexibility in model-building--**choose any machine learning algorithm from any `R` or `Python` package**--while helping the user quickly assess the (a) accuracy, (b) stability, and (c) generalizability of grouped (i.e., multiple related time series) and ungrouped forecasts produced from potentially high-dimensional modeling datasets. This package is inspired by Bergmeir, Hyndman, and Koo's 2018 paper [A note on the validity of cross-validation for evaluating autoregressive time series prediction](https://doi.org/10.1016/j.csda.2017.11.003). which supports--under certain conditions--forecasting with high-dimensional ML models **without having to use methods that are time series specific**. The following quote from Bergmeir et al.'s article nicely sums up the aim of this package: > "When purely (non-linear, nonparametric) autoregressive methods are applied to forecasting problems, as is often the case > (e.g., when using Machine Learning methods), the aforementioned problems of CV are largely > irrelevant, and CV can and should be used without modification, as in the independent case." ## Featured Notebooks * **[Forecasting with big data - Spark and H2O](https://github.com/nredell/forecastML/blob/master/notebooks/Forecasting%20with%20big%20data%20-%20Spark%20and%20H2O.ipynb)** * **[Forecasting with Python - scikit-learn in parallel](https://github.com/nredell/forecastML/blob/master/notebooks/python_sklearn_and_r_in_parallel/Forecasting%20with%20Python%20-%20scikit%20learn%20in%20parallel.ipynb)** * **[Forecast reconciliation across planning horizons - coherent weekly ML and monthly ARIMA forecasts](https://github.com/nredell/forecastML/blob/master/notebooks/forecast_reconciliation/Forecast%20reconciliation%20across%20planning%20horizons%20-%20coherent%20weekly%20ML%20and%20monthly%20ARIMA%20forecasts.ipynb)** User-contributed notebooks welcome! ## Lightning Example * Requires `packageVersion("forecastML")` >= v0.9.1 ``` r library(glmnet) library(forecastML) data("data_seatbelts", package = "forecastML") data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", method = "direct", outcome_col = 1, lookback = 1:15, horizons = 1:12) windows <- forecastML::create_windows(data_train, window_length = 0) model_fn <- function(data) { x <- as.matrix(data[, -1, drop = FALSE]) y <- as.matrix(data[, 1, drop = FALSE]) model <- glmnet::cv.glmnet(x, y) } model_results <- forecastML::train_model(data_train, windows, model_name = "LASSO", model_function = model_fn) predict_fn <- function(model, data) { data_pred <- as.data.frame(predict(model, as.matrix(data))) } data_fit <- predict(model_results, prediction_function = list(predict_fn), data = data_train) residuals <- residuals(data_fit) data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", method = "direct", outcome_col = 1, lookback = 1:15, horizons = 1:12) data_forecasts <- predict(model_results, prediction_function = list(predict_fn), data = data_forecast) data_forecasts <- forecastML::combine_forecasts(data_forecasts) set.seed(224) data_forecasts <- forecastML::calculate_intervals(data_forecasts, residuals, levels = seq(.5, .95, .05), times = 200) plot(data_forecasts, data_seatbelts[-(1:160), ], (1:nrow(data_seatbelts))[-(1:160)], interval_alpha = seq(.1, .2, length.out = 10)) ``` ![](./tools/lightning_example.png) ## README Contents * **[Install](#install)** * **[Approach to forecasting](#approach-to-forecasting)** * **[Vignettes](#vignettes)** * **[Cheat sheets](#cheat-sheets)** * **[FAQ](#faq)** * **Examples** + **[Forecasting numeric outcomes](#examples---numeric-outcomes-with-r-and-python)** + **[Direct forecasting](#direct-forecast-in-r)** + **[Multi-output forecasting](#multi-output-forecast-in-r)** + **[Forecasting factor outcomes (forecasting sequences)](#examples---factor-outcomes-with-r-and-python)** ## Install * CRAN ``` r install.packages("forecastML") library(forecastML) ``` * Development ``` r remotes::install_github("nredell/forecastML") library(forecastML) ``` ## Approach to Forecasting ### Direct forecasting The direct forecasting approach used in `forecastML` involves the following steps: **1.** Build a series of horizon-specific short-, medium-, and long-term forecast models. **2.** Assess model generalization performance across a variety of heldout datasets through time. **3.** Select those models that consistently performed the best at each forecast horizon and combine them to produce a single ensemble forecast. * Below is a plot of 5 forecast models used to produce a single 12-step-ahead forecast where each color represents a distinct horizon-specific ML model. From left to right these models are: * **1**: A feed-forward neural network (purple); **2**: An ensemble of ML models; **3**: A boosted tree model; **4**: A LASSO regression model; **5**: A LASSO regression model (yellow). ![](./tools/forecastML_plot.png) * Below is a similar combination of horizon-specific models with a factor outcome and forecasting factor probabilities 12 steps ahead. ![](./tools/forecastML_factor_plot.png) ### Multi-output forecasting The multi-output forecasting approach used in `forecastML` involves the following steps: **1.** Build a single multi-output model that simultaneously forecasts over both short- and long-term forecast horizons. **2.** Assess model generalization performance across a variety of heldout datasets through time. **3.** Select the hyperparamters that minimize forecast error over all the relevant forecast horizons and re-train. ## Vignettes The main functions covered in each vignette are shown below as `function()`. * Detailed **[forecastML overview vignette](https://nredell.github.io/forecastML/doc/package_overview.html)**. `create_lagged_df()`, `create_windows()`, `train_model()`, `return_error()`, `return_hyper()`, `combine_forecasts()` * **[Creating custom feature lags for model training](https://nredell.github.io/forecastML/doc/lagged_features.html)**. `create_lagged_df(lookback_control = ...)` * **[Direct Forecasting with multiple or grouped time series](https://nredell.github.io/forecastML/doc/grouped_forecast.html)**. `fill_gaps()`, `create_lagged_df(dates = ..., dynamic_features = ..., groups = ..., static_features = ...)`, `create_windows()`, `train_model()`, `combine_forecasts()` * **[Direct Forecasting with multiple or grouped time series - Sequences](https://nredell.github.io/forecastML/doc/grouped_forecast_sequences.html)**. `fill_gaps()`, `create_lagged_df(dates = ..., dynamic_features = ..., groups = ..., static_features = ...)`, `create_windows()`, `train_model()`, `combine_forecasts()` * **[Customizing the user-defined wrapper functions](https://nredell.github.io/forecastML/doc/custom_functions.html)**. `train()` and `predict()` * **[Forecast combinations](https://nredell.github.io/forecastML/doc/combine_forecasts)**. `combine_forecasts()` ## Cheat Sheets ![](./tools/forecastML_cheat_sheet.PNG) 1. **`fill_gaps`:** Optional if no temporal gaps/missing rows in data collection. Fill gaps in data collection and prepare a dataset of evenly-spaced time series for modeling with lagged features. Returns a 'data.frame' with missing rows added in so that you can either (a) impute, remove, or ignore `NA`s prior to the `forecastML` pipeline or (b) impute, remove, or ignore them in the user-defined modeling function--depending on the `NA` handling capabilities of the user-specified model. 2. **`create_lagged_df`:** Create model training and forecasting datasets with lagged, grouped, dynamic, and static features. 3. **`create_windows`:** Create time-contiguous validation datasets for model evaluation. 4. **`train_model`:** Train the user-defined model across forecast horizons and validation datasets. 5. **`return_error`:** Compute forecast error across forecast horizons and validation datasets. 6. **`return_hyper`:** Return user-defined model hyperparameters across validation datasets. 7. **`combine_forecasts`:** Combine multiple horizon-specific forecast models to produce one forecast. ![](./tools/forecastML_cheat_sheet_data.PNG)
![](./tools/forecastML_cheat_sheet_model.PNG) ## FAQ * **Q:** Where does `forecastML` fit in with respect to popular `R` machine learning packages like [mlr3](https://mlr3.mlr-org.com/) and [caret](https://github.com/topepo/caret)? * **A:** The idea is that `forecastML` takes care of the tedious parts of forecasting with ML methods: creating training and forecasting datasets with different types of features--grouped, static, and dynamic--as well as simplifying validation dataset creation to assess model performance at specific points in time. That said, the workflow for packages like `mlr3` and `caret` would mostly occur inside of the user-supplied modeling function which is passed into `forecastML::train_model()`. Refer to the wrapper function customization vignette for more details. * **Q:** How do I get the model training and forecasting datasets as well as the trained models out of the `forecastML` pipeline? * **A:** After running `forecastML::create_lagged_df()` with either `type = "train"` or `type = "forecast"`, the `data.frame`s can be accessed with `my_lagged_df$horizon_h` where "h" is an integer marking the horizon-specific dataset (e.g., the value(s) passed in `horizons = ...`). The trained models from `forecastML::train_model()` can be accessed with `my_trained_model$horizon_h$window_w$model` where "w" is the validation window number from `forecastML::create_windows()`. ## Examples - Numeric Outcomes with R and Python ### Direct forecast in R Below is an example of how to create 12 horizon-specific ML models to forecast the number of `DriversKilled` 12 time periods into the future using the `Seatbelts` dataset. Notice in the last plot that there are multiple forecasts; these are from the slightly different LASSO models trained in the nested cross-validation. An example of selecting optimal hyperparameters and retraining to create a single forecast model (i.e., `create_windows(..., window_length = 0)`) can be found in the overview vignette. ``` r library(glmnet) library(forecastML) # Sampled Seatbelts data from the R package datasets. data("data_seatbelts", package = "forecastML") # Example - Training data for 12 horizon-specific models w/ common lags per feature. The data do # not have any missing rows or temporal gaps in data collection; if there were gaps, # we would need to use fill_gaps() first. horizons <- 1:12 # 12 models that forecast 1, 1:2, 1:3, ..., and 1:12 time steps ahead. lookback <- 1:15 # A lookback of 1 to 15 dataset rows (1:15 * 'date frequency' if dates are given). #------------------------------------------------------------------------------ # Create a dataset of lagged features for modeling. data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", outcome_col = 1, lookback = lookback, horizon = horizons) #------------------------------------------------------------------------------ # Create validation datasets for outer-loop nested cross-validation. windows <- forecastML::create_windows(data_train, window_length = 12) #------------------------------------------------------------------------------ # User-define model - LASSO # A user-defined wrapper function for model training that takes the following # arguments: (1) a horizon-specific data.frame made with create_lagged_df(..., type = "train") # (e.g., my_lagged_df$horizon_h) and, optionally, (2) any number of additional named arguments # which can also be passed in '...' in train_model(). The function returns a model object suitable for # the user-defined predict function. The returned model may also be a list that holds meta-data such # as hyperparameter settings. model_function <- function(data, my_outcome_col) { # my_outcome_col = 1 could be defined here. x <- data[, -(my_outcome_col), drop = FALSE] y <- data[, my_outcome_col, drop = FALSE] x <- as.matrix(x, ncol = ncol(x)) y <- as.matrix(y, ncol = ncol(y)) model <- glmnet::cv.glmnet(x, y) return(model) # This model is the first argument in the user-defined predict() function below. } #------------------------------------------------------------------------------ # Train a model across forecast horizons and validation datasets. # my_outcome_col = 1 is passed in ... but could have been defined in the user-defined model function. model_results <- forecastML::train_model(data_train, windows = windows, model_name = "LASSO", model_function = model_function, my_outcome_col = 1, # ... use_future = FALSE) #------------------------------------------------------------------------------ # User-defined prediction function - LASSO # The predict() wrapper function takes 2 positional arguments. First, # the returned model from the user-defined modeling function (model_function() above). # Second, a data.frame of model features. If predicting on validation data, expect the input data to be # passed in the same format as returned by create_lagged_df(type = 'train') but with the outcome column # removed. If forecasting, expect the input data to be in the same format as returned by # create_lagged_df(type = 'forecast') but with the 'index' and 'horizon' columns removed. The function # can return a 1- or 3-column data.frame with either (a) point # forecasts or (b) point forecasts plus lower and upper forecast bounds (column order and names do not matter). prediction_function <- function(model, data_features) { x <- as.matrix(data_features, ncol = ncol(data_features)) data_pred <- data.frame("y_pred" = predict(model, x, s = "lambda.min"), # 1 column is required. "y_pred_lower" = predict(model, x, s = "lambda.min") - 50, # optional. "y_pred_upper" = predict(model, x, s = "lambda.min") + 50) # optional. return(data_pred) } # Predict on the validation datasets. data_valid <- predict(model_results, prediction_function = list(prediction_function), data = data_train) #------------------------------------------------------------------------------ # Plot forecasts for each validation dataset. plot(data_valid, horizons = c(1, 6, 12)) #------------------------------------------------------------------------------ # Forecast. # Forward-looking forecast data.frame. data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", outcome_col = 1, lookback = lookback, horizons = horizons) # Forecasts. data_forecasts <- predict(model_results, prediction_function = list(prediction_function), data = data_forecast) # We'll plot a background dataset of actuals as well. plot(data_forecasts, data_actual = data_seatbelts[-(1:150), ], actual_indices = as.numeric(row.names(data_seatbelts[-(1:150), ])), horizons = c(1, 6, 12), windows = c(5, 10, 15)) ``` ![](./tools/validation_data_forecasts.png) ![](./tools/forecasts.png) *** ### Direct forecast in R & Python Now we'll look at an example similar to above. The main difference is that our user-defined modeling and prediction functions are now written in `Python`. Thanks to the [reticulate](https://github.com/rstudio/reticulate) `R` package, entire ML workflows already written in `Python` can be imported into `forecastML` with the simple addition of 2 lines of `R` code. * The `reticulate::source_python()` function will run a .py file and import any objects into your `R` environment. As we'll see below, we'll only be importing library calls and functions to keep our `R` environment clean. ``` r library(forecastML) library(reticulate) # Move Python objects in and out of R. See the reticulate package for setup info. reticulate::source_python("modeling_script.py") # Run a Python file and import objects into R. ```
* Below is a simple, slightly different `forecastML` setup for the seatbelt forecasting problem from the previous example. ``` r data("data_seatbelts", package = "forecastML") horizons <- c(1, 12) # 2 models that forecast 1 and 1:12 time steps ahead. # A lookback across select time steps in the past. Feature lags 1 through 9 will be silently dropped from the 12-step-ahead model. lookback <- c(1, 3, 6, 9, 12, 15) date_frequency <- "1 month" # Time step frequency. # The date indices, which don't come with the stock dataset, should not be included in the modeling data.frame. dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = date_frequency) # Create a dataset of features for modeling. data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", outcome_col = 1, lookback = lookback, horizon = horizons, dates = dates, frequency = date_frequency) # Create 2 custom validation datasets for outer-loop nested cross-validation. The purpose of # the multiple validation windows is to assess expected forecast accuracy for specific # time periods while supporting an investigation of the hyperparameter stability for # models trained on different time periods. Validation windows can overlap. window_start <- c(as.Date("1983-01-01"), as.Date("1984-01-01")) window_stop <- c(as.Date("1983-12-01"), as.Date("1984-12-01")) windows <- forecastML::create_windows(data_train, window_start = window_start, window_stop = window_stop) ```
#### modeling_script.py * Let's look at the content of our `Python` modeling file that we source()'d above. The `Python` wrapper function inputs and returns for `py_model_function()` and `py_prediction_function()` are the same as their `R` counterparts. Just be sure to expect and return `pandas` `DataFrame`s as conversion from `numpy` arrays has not been tested. ``` python import pandas as pd from sklearn import linear_model from sklearn.preprocessing import StandardScaler # User-defined model. # A user-defined wrapper function for model training that takes the following # arguments: (1) a horizon-specific pandas DataFrame made with create_lagged_df(..., type = "train") # (e.g., my_lagged_df$horizon_h) def py_model_function(data): X = data.iloc[:, 1:] y = data.iloc[:, 0] scaler = StandardScaler() X = scaler.fit_transform(X) model_lasso = linear_model.Lasso(alpha = 0.1) model_lasso.fit(X = X, y = y) return({'model': model_lasso, 'scaler': scaler}) # User-defined prediction function. # The predict() wrapper function takes 2 positional arguments. First, # the returned model from the user-defined modeling function (py_model_function() above). # Second, a pandas DataFrame of model features. For numeric outcomes, the function # can return a 1- or 3-column pandas DataFrame with either (a) point # forecasts or (b) point forecasts plus lower and upper forecast bounds (column order and names do not matter). def py_prediction_function(model_list, data_x): data_x = model_list['scaler'].transform(data_x) data_pred = pd.DataFrame({'y_pred': model_list['model'].predict(data_x)}) return(data_pred) ```
* Train and predict on historical validation data with the imported `Python` wrapper functions. ``` r # Train a model across forecast horizons and validation datasets. model_results <- forecastML::train_model(data_train, windows = windows, model_name = "LASSO", model_function = py_model_function, use_future = FALSE) # Predict on the validation datasets. data_valid <- predict(model_results, prediction_function = list(py_prediction_function), data = data_train) # Plot forecasts for each validation dataset. plot(data_valid, horizons = c(1, 12)) ``` ![](./tools/validation_data_forecasts_python.png)
* Forecast with the same imported `Python` wrapper functions. The final wrapper functions may eventually have fixed hyperparameters or complicated model ensembles based on repeated model training and investigation. ``` r # Forward-looking forecast data.frame. data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", outcome_col = 1, lookback = lookback, horizon = horizons, dates = dates, frequency = date_frequency) # Forecasts. data_forecasts <- predict(model_results, prediction_function = list(py_prediction_function), data = data_forecast) # We'll plot a background dataset of actuals as well. plot(data_forecasts, data_actual = data_seatbelts[-(1:150), ], actual_indices = dates[-(1:150)], horizons = c(1, 12)) ``` ![](./tools/forecasts_python.png) *** ### Multi-output forecast in R * This is the same seatbelt dataset example except now, instead of 1 model for each forecast horizon, we'll build 1 multi-output neural network model that forecasts 12 steps into the future. * Given that this is a small dataset, the multi-output approach would require a decent amount of tuning to produce accurate results. An alternative would be to forecast, say, horizons 6 through 12 if longer term forecasts were of interest to reduce the number of parameters; the output neurons do not have to start at a horizon of 1 or even be contiguous. ``` r library(forecastML) library(keras) # Using the TensorFlow 2.0 backend. data("data_seatbelts", package = "forecastML") data_seatbelts[] <- lapply(data_seatbelts, function(x) { (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) }) date_frequency <- "1 month" dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = date_frequency) data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", method = "multi_output", outcome_col = 1, lookback = 1:15, horizons = 1:12, dates = dates, frequency = date_frequency, dynamic_features = "law") # 'window_length = 0' creates 1 historical training dataset with no external validation datasets. # Set it to, say, 24 to see the model and forecast stability when trained across different slices # of historical data. windows <- forecastML::create_windows(data_train, window_length = 0) #------------------------------------------------------------------------------ # 'data_y' consists of 1 column for each forecast horizon--here, 12. model_fun <- function(data, horizons) { # 'horizons' is passed in train_model(). data_x <- apply(as.matrix(data[, -(1:length(horizons))]), 2, function(x){ifelse(is.na(x), 0, x)}) data_y <- apply(as.matrix(data[, 1:length(horizons)]), 2, function(x){ifelse(is.na(x), 0, x)}) layers_x_input <- keras::layer_input(shape = ncol(data_x)) layers_x_output <- layers_x_input %>% keras::layer_dense(ncol(data_x), activation = "relu") %>% keras::layer_dense(ncol(data_x), activation = "relu") %>% keras::layer_dense(length(horizons)) model <- keras::keras_model(inputs = layers_x_input, outputs = layers_x_output) %>% keras::compile(optimizer = 'adam', loss = 'mean_absolute_error') early_stopping <- callback_early_stopping(monitor = 'val_loss', patience = 2) tensorflow::tf$random$set_seed(224) model_results <- model %>% keras::fit(x = list(as.matrix(data_x)), y = list(as.matrix(data_y)), validation_split = 0.2, callbacks = c(early_stopping), verbose = FALSE) return(list("model" = model, "model_results" = model_results)) } #------------------------------------------------------------------------------ # The predict() wrapper function will return a data.frame with a number of columns # equaling the number of forecast horizons. prediction_fun <- function(model, data_features) { data_features[] <- lapply(data_features, function(x){ifelse(is.na(x), 0, x)}) data_features <- list(as.matrix(data_features, ncol = ncol(data_features))) data_pred <- data.frame(predict(model$model, data_features)) names(data_pred) <- paste0("y_pred_", 1:ncol(data_pred)) return(data_pred) } #------------------------------------------------------------------------------ model_results <- forecastML::train_model(data_train, windows, model_name = "Multi-Output NN", model_function = model_fun, horizons = 1:12) data_valid <- predict(model_results, prediction_function = list(prediction_fun), data = data_train) # We'll plot select forecast horizons to reduce visual clutter. plot(data_valid, facet = ~ model, horizons = c(1, 3, 6, 12)) ``` ![](./tools/multi_outcome_train_plot.png) * Forecast combinations from `combine_forecasts()` aren't necessary as we've trained only 1 model. ``` r data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", method = "multi_output", outcome_col = 1, lookback = 1:15, horizons = 1:12, dates = dates, frequency = date_frequency, dynamic_features = "law") data_forecasts <- predict(model_results, prediction_function = list(prediction_fun), data = data_forecast) plot(data_forecasts, facet = NULL, data_actual = data_seatbelts[-(1:100), ], actual_indices = dates[-(1:100)]) ``` ![](./tools/multi_outcome_forecast_plot.png) ## Examples - Factor Outcomes with R and Python ### R * This example is similar to the numeric outcome examples with the exception that the outcome has been factorized to illustrate how factors or sequences are forecasted. ``` r data("data_seatbelts", package = "forecastML") # Create an artifical factor outcome for illustration' sake. data_seatbelts$DriversKilled <- cut(data_seatbelts$DriversKilled, 3) horizons <- c(1, 12) # 2 models that forecast 1 and 1:12 time steps ahead. # A lookback across select time steps in the past. Feature lag 1 will be silently dropped from the 12-step-ahead model. lookback <- c(1, 12, 18) date_frequency <- "1 month" # Time step frequency. # The date indices, which don't come with the stock dataset, should not be included in the modeling data.frame. dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = date_frequency) # Create a dataset of features for modeling. data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", outcome_col = 1, lookback = lookback, horizon = horizons, dates = dates, frequency = date_frequency) # We won't use nested cross-validation; rather, we'll train a model over the entire training dataset. windows <- forecastML::create_windows(data_train, window_length = 0) # This is the model-training dataset. plot(windows, data_train) ``` ![](./tools/sequence_windows.png) * Model training and historical fit. ``` r model_function <- function(data, my_outcome_col) { # my_outcome_col = 1 could be defined here. outcome_names <- names(data)[1] model_formula <- formula(paste0(outcome_names, "~ .")) set.seed(224) model <- randomForest::randomForest(formula = model_formula, data = data, ntree = 3) return(model) # This model is the first argument in the user-defined predict() function below. } #------------------------------------------------------------------------------ # Train a model across forecast horizons and validation datasets. # my_outcome_col = 1 is passed in ... but could have been defined in the user-defined model function. model_results <- forecastML::train_model(data_train, windows = windows, model_name = "RF", model_function = model_function, my_outcome_col = 1, # ... use_future = FALSE) #------------------------------------------------------------------------------ # User-defined prediction function. # # The predict() wrapper function takes 2 positional arguments. First, # the returned model from the user-defined modeling function (model_function() above). # Second, a data.frame of model features. If predicting on validation data, expect the input data to be # passed in the same format as returned by create_lagged_df(type = 'train') but with the outcome column # removed. If forecasting, expect the input data to be in the same format as returned by # create_lagged_df(type = 'forecast') but with the 'index' and 'horizon' columns removed. # # For factor outcomes, the function can return either (a) a 1-column data.frame with factor level # predictions or (b) an L-column data.frame of predicted class probabilities where 'L' equals the # number of levels in the outcome; the order of the return()'d columns should match the order of the # outcome factor levels from left to right which is the default behavior of most predict() functions. # Predict/forecast a single factor level. prediction_function_level <- function(model, data_features) { data_pred <- data.frame("y_pred" = predict(model, data_features, type = "response")) return(data_pred) } # Predict/forecast outcome class probabilities. prediction_function_prob <- function(model, data_features) { data_pred <- data.frame("y_pred" = predict(model, data_features, type = "prob")) return(data_pred) } # Predict on the validation datasets. data_valid_level <- predict(model_results, prediction_function = list(prediction_function_level), data = data_train) data_valid_prob <- predict(model_results, prediction_function = list(prediction_function_prob), data = data_train) ``` * Predict historical factor levels. * With `window_length = 0` these are essentially plots of model fit. ``` r plot(data_valid_level, horizons = c(1, 12)) ``` ![](./tools/sequence_valid_level.png) * Predict historical class probabilities. ``` r plot(data_valid_prob, horizons = c(1, 12)) ``` ![](./tools/sequence_valid_prob.png) * Forecast ``` r # Forward-looking forecast data.frame. data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", outcome_col = 1, lookback = lookback, horizons = horizons) # Forecasts. data_forecasts_level <- predict(model_results, prediction_function = list(prediction_function_level), data = data_forecast) data_forecasts_prob <- predict(model_results, prediction_function = list(prediction_function_prob), data = data_forecast) ``` * Forecast factor levels ``` r plot(data_forecasts_level) ``` ![](./tools/sequence_forecast_level.png) * Forecast class probabilities ``` r plot(data_forecasts_prob) ``` ![](./tools/sequence_forecast_prob.png)