Forecaster

This is the main object used for making predictions on the test set, forecasting, evaluating models, differencing data, adding regressors, and saving, visualizing, and exporting results.

from scalecast.Forecaster import Forecaster
array_of_dates = ['2021-01-01','2021-01-02','2021-01-03']
array_of_values = [1,2,3]
f = Forecaster(
    y=array_of_values,
    current_dates=array_of_dates,
    # defaults below
    require_future_dates=True,
    future_dates=None,
    test_length=0,
    cis=False,
    metrics=['rmse','mape','mae','r2'],
)
class src.scalecast.Forecaster.Forecaster(y, current_dates, future_dates=None, test_length=0, cis=False, metrics=['rmse', 'mape', 'mae', 'r2'], carry_fit_models=True, **kwargs)
__init__(y, current_dates, future_dates=None, test_length=0, cis=False, metrics=['rmse', 'mape', 'mae', 'r2'], carry_fit_models=True, **kwargs)
Parameters:
  • y (collection) – An array of all observed values.

  • current_dates (collection) – An array of all observed dates. Must be the same length as y and in the same sequence. Can pass any numerical index if dates are unknown; in this case, it will act as if dates are in nanosecond frequency.

  • future_dates (int) – Optional. The number of future dates to add to the model upon initialization. If not added when the object is initialized, they can be added later.

  • test_length (int or float) – Default 0. The length of the test set that all models will be tested on out of sample. If float, must be between 0 and 1 and will be treated as a fractional split. By default, models will not be tested.

  • cis (bool) – Default False. Whether to evaluate naive conformal confidence intervals for every model evaluated. If setting to True, ensure you also set a test_length of at least 20 observations for 95% confidence intervals. See eval_cis() and set_cilevel() methods and docstrings for more information.

  • metrics (list) – Default [‘rmse’,’mape’,’mae’,’r2’]. The metrics to evaluate when validating and testing models. Each element must either be the name of a function in utils.metrics that takes only two arguments, a and f (see https://scalecast.readthedocs.io/en/latest/Forecaster/Util.html#metrics), or be a function that accepts two arguments and will be referenced later by its name. The first element of this list will be set as the default validation metric, but that can be changed. For each metric and model that is tested, the test-set and in-sample metrics will be evaluated and can be exported.

  • carry_fit_models (bool) – Default True. Whether to store the regression model for each fitted model in history. Setting this to False can save memory.

  • **kwargs – Become attributes.
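
To make the two accepted forms of test_length concrete, the following is a minimal sketch of how an integer or a fractional value could be resolved into a number of held-out observations. The helper name resolve_test_length and the truncating rule for fractions are assumptions made for illustration; the library's own rounding may differ.

```python
def resolve_test_length(n_obs, test_length):
    """Hypothetical helper: turn an int or fractional test_length into a count."""
    if isinstance(test_length, float):
        if not 0 <= test_length <= 1:
            raise ValueError("a float test_length must be between 0 and 1")
        # assumption: truncate toward zero; the library may round differently
        return int(n_obs * test_length)
    if test_length < 0:
        raise ValueError("test_length must be non-negative")
    return int(test_length)


y = list(range(100))
n_test = resolve_test_length(len(y), 0.2)

# the most recent observations form the out-of-sample test set
train, test = y[: len(y) - n_test], y[len(y) - n_test:]
```

With test_length left at 0, the same split leaves the test list empty, which matches the statement that models are not tested by default.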

Methods:

STL([diffy, train_only])

Returns a Seasonal-Trend decomposition using LOESS of the y values.

add_AR_terms(N)

Adds seasonal auto-regressive terms.

add_ar_terms(n)

Adds auto-regressive terms.

add_combo_regressors(*args[, sep])

Combines all passed variables by multiplying their values together.

add_covid19_regressor([called, start, end])

Adds a dummy variable that is 1 during the time period that COVID19 effects are present for the series, 0 otherwise.

add_cycle(cycle_length[, fourier_order, called])

Adds a regressor that acts as a seasonal cycle.

add_exp_terms(*args, pwr[, sep, cutoff, drop])

Raises all passed variables (no AR terms) to exponential powers (ints or floats).

add_lagged_terms(*args[, lags, upto, sep, drop])

Lags all passed variables (no AR terms) 1 or more times.

add_logged_terms(*args[, base, sep, drop])

Logs all passed variables (no AR terms).

add_metric(func[, called])

Add a metric to be evaluated when validating and testing models.

add_other_regressor(called, start, end)

Adds a dummy variable that is 1 during the specified time period, 0 otherwise.

add_poly_terms(*args[, pwr, sep])

Raises all passed variables (no AR terms) to exponential powers (ints only).

add_pt_terms(*args[, method, sep, drop])

Applies a box-cox or yeo-johnson power transformation to all passed variables (no AR terms).

add_seasonal_regressors(*args[, raw, ...])

Adds seasonal regressors.

add_series(series, called[, first_date, ...])

Adds other series to the object as regressors.

add_signals(model_nicknames[, ...])

Adds the predictions from already-evaluated models as covariates that can be used for future evaluated models.

add_sklearn_estimator(imported_module, called)

Adds a new estimator from scikit-learn not built-in to the forecaster object that can be called using set_estimator().

add_time_trend([called])

Adds a time trend from 1 to length of the series + the forecast horizon as a current and future Xvar.

adf_test([critical_pval, full_res, ...])

Tests the stationarity of the y series using augmented dickey fuller.

all_feature_info_to_excel([out_path, excel_name])

Saves all feature importance and summary stats to excel.

all_validation_grids_to_excel([out_path, ...])

Saves all validation grids to excel.

auto_Xvar_select([estimator, try_trend, ...])

Attempts to find the ideal trend, seasonality, and look-back representations for the stored series by systematically adding regressors to the object and monitoring a passed metric value.

auto_forecast([call_me, dynamic_testing, ...])

Auto forecasts with the best parameters indicated from the tuning process.

chop_from_back(n)

Cuts y observations in the object from the back by counting forward from the beginning.

chop_from_front(n[, fcst_length])

Cuts the amount of y observations in the object from the front counting backwards.

copy()

Creates an object copy.

cross_validate([k, test_length, ...])

Tunes a model's hyperparameters using time-series cross validation.

deepcopy()

Creates an object deepcopy.

determine_best_series_length([estimator, ...])

Attempts to find the optimal length for the series to produce accurate forecasts by systematically shortening the series, running estimations, and monitoring a passed metric value.

drop_Xvars(*args[, error])

Drops regressors.

drop_all_Xvars()

Drops all regressors.

drop_regressors(*args[, error])

Drops regressors.

eval_cis([mode, cilevel])

Call this function to change whether or not the Forecaster sets confidence intervals on all evaluated models.

export([dfs, models, best_model, ...])

Exports between one and all three of the available pandas DataFrames.

export_Xvars_df([dropna])

Gets all utilized regressors and values.

export_feature_importance(model)

Exports the feature importance from a model.

export_fitted_vals(model)

Exports a single dataframe with dates, fitted values, actuals, and residuals for one model.

export_summary_stats(model)

Exports the summary stats from a model.

export_validation_grid(model)

Exports the validation grid from a model, converted to a pandas dataframe.

generate_future_dates(n)

Generates a certain amount of future dates in same frequency as current_dates.

get_freq()

Gets the pandas inferred date frequency.

get_regressor_names()

Gets the regressor names stored in the object.

infer_freq()

Uses the pandas library to infer the frequency of the loaded dates.

ingest_Xvars_df(df[, date_col, drop_first, ...])

Ingests a dataframe of regressors and saves its Xvars to the object.

ingest_grid(grid)

Ingests a grid to tune the estimator.

keep_smaller_history(n)

Cuts y observations in the object by counting back from the beginning.

limit_grid_size(n[, min_grid_size, random_seed])

Makes a grid smaller randomly.

load_tf_model([name])

Loads a fitted tensorflow (RNN/LSTM) model and attaches it to the Forecaster object in the tf_model attribute.

manual_forecast([call_me, dynamic_testing, ...])

Manually forecasts with the hyperparameters, Xvars, and normalizer selection passed as keywords.

normality_test([train_only])

Runs D'Agostino and Pearson's test for normality ported from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html.

order_fcsts([models, determine_best_by])

Gets estimated forecasts ordered from best-to-worst.

plot([models, exclude, order_by, ci, ax, ...])

Plots all forecasts with the actuals, or just actuals if no forecasts have been evaluated or are selected.

plot_acf([diffy, train_only])

Plots an autocorrelation function of the y values.

plot_fitted([models, exclude, order_by, ax, ...])

Plots all fitted values with the actuals.

plot_pacf([diffy, train_only])

Plots a partial autocorrelation function of the y values.

plot_periodogram([diffy, train_only])

Plots a periodogram of the y values (comes from scipy.signal).

plot_test_set([models, exclude, order_by, ...])

Plots all test-set predictions with the actuals.

pop(*args)

Deletes evaluated forecasts from the object's memory.

reduce_Xvars([method, estimator, ...])

Reduces the regressor variables stored in the object.

restore_series_length()

Restores the original y values and current dates in the object from before keep_smaller_history() or determine_best_series_length() were called.

round([decimals])

Rounds the values saved to Forecaster.y.

save_feature_importance([method, on_error, ...])

Saves feature info for models that offer it (sklearn models).

save_summary_stats()

Saves summary stats for models that offer it and will not raise errors if not available.

save_tf_model([name])

Saves a fitted tensorflow (RNN/LSTM) model as a file.

seasonal_decompose([diffy, train_only])

Returns a signal/seasonal decomposition of the y values.

set_cilevel(n)

Sets the level for the resulting confidence intervals (95% default).

set_estimator(estimator)

Sets the estimator to forecast with.

set_grids_file([name])

Sets the name of the file where the object will automatically look for grids when calling tune(), cross_validate(), tune_test_forecast(), or a similar function.

set_last_future_date(date)

Generates future dates in the same frequency as current_dates that ends on a specified date.

set_metrics(metrics)

Set or change the evaluated metrics for all model testing and validation.

set_test_length([n])

Sets the length of the test set.

set_validation_length([n])

Sets the length of the validation set.

set_validation_metric(metric)

Sets the metric that will be used to tune all subsequent models.

test([dynamic_testing, call_me])

Tests the forecast estimator out-of-sample.

transfer_cis(transfer_from, model[, ...])

Transfers the confidence intervals from a model forecast in a passed Forecaster or MVForecaster object.

transfer_predict(transfer_from, model[, ...])

Makes predictions using an already-trained model over any given forecast horizon.

tune([dynamic_tuning, set_aside_test_set])

Tunes the specified estimator using an ingested grid (ingests a grid from Grids.py with same name as the estimator by default).

tune_test_forecast(models[, cross_validate, ...])

Iterates through a list of models, tunes them using grids in a grids file, forecasts them, and can save feature information.

validate_regressor_names()

Validates that all regressor names exist in both current_xregs and future_xregs.

STL(diffy=False, train_only=False, **kwargs)

Returns a Seasonal-Trend decomposition using LOESS of the y values.

Parameters:
  • diffy (bool) – Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • train_only (bool) – Default False. If True, will exclude the test set from the decomposition (a measure added to avoid leakage).

  • **kwargs – Passed to STL() function from statsmodels. See https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html.

Returns:

An object with seasonal, trend, and resid attributes.

Return type:

(DecomposeResult)

>>> import matplotlib.pyplot as plt
>>> f.STL(train_only=True).plot()
>>> plt.show()
add_AR_terms(N)

Adds seasonal auto-regressive terms.

Parameters:

N (tuple) – First element is the number of lags to add and the second element is the space between lags.

Returns:

None

>>> f.add_AR_terms((2,12)) # adds 12th and 24th lags called 'AR12', 'AR24'
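
As a rough illustration of how the tuple N maps to lag numbers and variable names, consider the sketch below. The helper seasonal_lag_names is hypothetical and only reproduces the naming shown in the example above; it does not touch a Forecaster object.

```python
def seasonal_lag_names(N):
    """Hypothetical helper: list the lag numbers and names implied by N."""
    n_lags, spacing = N
    # the i-th seasonal lag sits i * spacing steps back
    lags = [spacing * i for i in range(1, n_lags + 1)]
    return lags, [f"AR{lag}" for lag in lags]


lags, names = seasonal_lag_names((2, 12))
```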
add_ar_terms(n)

Adds auto-regressive terms.

Parameters:

n (int or collection[int]) – If int, the number of lags to add to the object (1 to this number will be added by default). If collection, will add the lags specified in the collection ([2,4] will add lags 2 and 4). To add only lag 10, pass [10]. To add 10 lags, pass 10.

Returns:

None

>>> f.add_ar_terms(4) # adds four lags of y called 'AR1' - 'AR4' to predict with
>>> f.add_ar_terms([4]) # adds the fourth lag called 'AR4' to predict with
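
The difference between passing an int and passing a collection can be sketched in plain Python. The helpers resolve_lags and lag are assumptions for illustration, and the None padding simply marks positions where no lagged value exists yet.

```python
def resolve_lags(n):
    """Hypothetical helper: an int means lags 1..n, a collection means exactly those lags."""
    if isinstance(n, int):
        return list(range(1, n + 1))
    return list(n)


def lag(y, k):
    """Shift y back by k steps; the first k positions have no lagged value."""
    return ([None] * k + list(y))[: len(y)]


y = [10, 20, 30, 40, 50]
ar_terms = {f"AR{k}": lag(y, k) for k in resolve_lags(2)}
```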
add_combo_regressors(*args, sep='_')

Combines all passed variables by multiplying their values together.

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • sep (str) – Default ‘_’. The separator between each term in arg to create the final variable name.

Returns:

None

>>> f.add_combo_regressors('t','monthsin') # multiplies these two together (called 't_monthsin')
>>> f.add_combo_regressors('t','monthcos') # multiplies these two together (called 't_monthcos')
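
A minimal sketch of the element-wise multiplication and the naming rule, assuming the regressors are held as equal-length lists in a dictionary. The function combine and the dictionary xvars are illustrative names, not part of the library.

```python
from math import prod


def combine(xvars, *names, sep="_"):
    """Hypothetical helper: multiply the named columns element-wise."""
    columns = [xvars[name] for name in names]
    # the new name joins the input names with the separator
    return sep.join(names), [prod(values) for values in zip(*columns)]


xvars = {"t": [1, 2, 3], "monthsin": [0.5, 1.0, 0.5]}
name, values = combine(xvars, "t", "monthsin")
```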
add_covid19_regressor(called='COVID19', start=datetime.datetime(2020, 3, 15, 0, 0), end=datetime.datetime(2021, 5, 13, 0, 0))

Adds a dummy variable that is 1 during the time period that COVID19 effects are present for the series, 0 otherwise. The default dates are selected to be optimized for the time-span where the economy was most impacted by COVID.

Parameters:
  • called (str) – Default ‘COVID19’. What to call the resulting variable.

  • start (str, datetime.datetime, or pd.Timestamp) – Default datetime.datetime(2020,3,15). The start date (default is day Walt Disney World closed in the U.S.). Must be parsable by pandas’ Timestamp function.

  • end (str, datetime.datetime, or pd.Timestamp) – Default datetime.datetime(2021,5,13). The end date (default is day the U.S. CDC first dropped the mask mandate/recommendation for vaccinated people). Must be parsable by pandas’ Timestamp function.

Returns:

None

add_cycle(cycle_length, fourier_order=2.0, called=None)

Adds a regressor that acts as a seasonal cycle. Use this function to capture non-normal seasonality.

Parameters:
  • cycle_length (int) – How many time steps make one complete cycle.

  • fourier_order (float) – Default 2.0. The fourier order to apply. This number is the number of complete cycles in that given seasonal period. 2 captures the fundamental frequency and its first harmonic. Higher orders will capture more complex seasonality, but may lead to overfitting.

  • called (str) – Optional. What to call the resulting variable. Two variables will be created, one for a sin transformation and the other for cos, and the resulting variable names will have “sin” or “cos” at the end. For example, called = ‘cycle5’ will become ‘cycle5sin’, ‘cycle5cos’. If left unspecified, ‘cycle{cycle_length}’ will be used as the name.

Returns:

None

>>> f.add_cycle(13) # adds a seasonal effect that cycles every 13 observations called 'cycle13sin' and 'cycle13cos'
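
The idea of a sin/cos pair that repeats every cycle_length steps can be sketched with the textbook form below. This is an assumption-laden illustration: cycle_terms is a hypothetical helper, and it deliberately leaves out fourier_order because how the library applies that parameter internally is not described here.

```python
import math


def cycle_terms(n_steps, cycle_length, called=None):
    """Hypothetical helper: one sin and one cos regressor that repeat every cycle_length steps."""
    called = called or f"cycle{cycle_length}"
    t = range(1, n_steps + 1)
    return {
        f"{called}sin": [math.sin(2 * math.pi * i / cycle_length) for i in t],
        f"{called}cos": [math.cos(2 * math.pi * i / cycle_length) for i in t],
    }


terms = cycle_terms(26, 13)
```

Using both a sin and a cos term lets a linear model fit a cycle with any phase shift, which is why two variables are created rather than one.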
add_exp_terms(*args, pwr, sep='^', cutoff=2, drop=False)

Raises all passed variables (no AR terms) to exponential powers (ints or floats).

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • pwr (float) – The power to raise each term to in args. Can use values like 0.5 to perform square roots, etc.

  • sep (str) – Default ‘^’. The separator between each term in arg to create the final variable name.

  • cutoff (int) – Default 2. The power shown in the resulting variable name will be rounded to this many decimal places. For instance, if pwr = 0.33333333333 and ‘t’ is passed as an arg to *args, the resulting name will be t^0.33 by default.

  • drop (bool) – Default False. Whether to drop the regressors passed to *args.

Returns:

None

>>> f.add_exp_terms('t',pwr=.5) # adds square root t called 't^0.5'
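
The interaction between pwr, sep, and cutoff in the variable name can be sketched as follows. The helper exp_term is hypothetical and assumes the regressors are stored as lists in a dictionary.

```python
def exp_term(xvars, name, pwr, sep="^", cutoff=2):
    """Hypothetical helper: raise one regressor to a power and build its name."""
    # only the name is rounded; the values use the full-precision power
    new_name = f"{name}{sep}{round(pwr, cutoff)}"
    return new_name, [x ** pwr for x in xvars[name]]


xvars = {"t": [1, 4, 9]}
name, values = exp_term(xvars, "t", pwr=0.5)
cube_root_name, _ = exp_term(xvars, "t", pwr=0.33333333333)
```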
add_lagged_terms(*args, lags=1, upto=True, sep='_', drop=False)

Lags all passed variables (no AR terms) 1 or more times.

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • lags (int) – Greater than 0, default 1. The number of times to lag each passed variable.

  • upto (bool) – Default True. Whether to add all lags up to the number passed to lags. If you pass 6 to lags and upto is True, lags 1, 2, 3, 4, 5, 6 will all be added. If you pass 6 to lags and upto is False, lag 6 only will be added.

  • sep (str) – Default ‘_’. The separator between each term in arg to create the final variable name. Resulting variable names will be like “tlag_1” or “tlag_2” by default.

  • drop (bool) – Default False. Whether to drop the regressors passed to *args.

Returns:

None

>>> f.add_lagged_terms('t',lags=3) # adds first, second, and third lag of t called 'tlag_1' - 'tlag_3'
>>> f.add_lagged_terms('t',lags=6,upto=False) # adds 6th lag of t only called 'tlag_6'
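
How upto changes which lags are created can be sketched in a few lines. The function lagged_terms below is a hypothetical stand-in that follows the naming pattern described above and pads unavailable positions with None.

```python
def lagged_terms(xvars, name, lags=1, upto=True, sep="_"):
    """Hypothetical helper: build lagged copies of one regressor."""
    # upto=True adds every lag from 1 to lags; upto=False adds only the last one
    selected = range(1, lags + 1) if upto else [lags]
    values = list(xvars[name])
    return {
        f"{name}lag{sep}{k}": ([None] * k + values)[: len(values)]
        for k in selected
    }


xvars = {"t": [1, 2, 3, 4]}
all_three = lagged_terms(xvars, "t", lags=3)
only_sixth = lagged_terms(xvars, "t", lags=6, upto=False)
```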
add_logged_terms(*args, base=2.718281828459045, sep='', drop=False)

Logs all passed variables (no AR terms).

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • base (float) – Default math.e (natural log). The log base. Must be math.e or int greater than 1.

  • sep (str) – Default ‘’. The separator between each term in arg to create the final variable name. Resulting variable names will be like “log2t” or “lnt” by default.

  • drop (bool) – Default False. Whether to drop the regressors passed to *args.

Returns:

None

>>> f.add_logged_terms('t') # adds natural log t called 'lnt'
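
The naming pattern (“lnt” for the natural log, “log2t” for base 2) and the transformation itself can be sketched as below. The helper logged_term is hypothetical, and placing sep between the prefix and the variable name is an assumption.

```python
import math


def logged_term(xvars, name, base=math.e, sep=""):
    """Hypothetical helper: log-transform one regressor and build its name."""
    prefix = "ln" if base == math.e else f"log{base}"
    # assumption: the separator sits between the prefix and the variable name
    return f"{prefix}{sep}{name}", [math.log(x, base) for x in xvars[name]]


xvars = {"t": [1, 2, 4]}
ln_name, ln_values = logged_term(xvars, "t")
log2_name, log2_values = logged_term(xvars, "t", base=2)
```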
add_metric(func, called=None)

Add a metric to be evaluated when validating and testing models. The function should accept two arguments where the first argument is an array of actual values and the second is an array of predicted values. The function returns a float.

Parameters:
  • func (function) – The function used to calculate the metric.

  • called (str) – Optional. The name that can be used to reference the metric function within the object. If not specified, will use the function’s name.

>>> from scalecast.util import metrics
>>> def rmse_mae(a,f):
>>>     # average of rmse and mae
>>>     return (metrics.rmse(a,f) + metrics.mae(a,f)) / 2
>>> f.add_metric(rmse_mae)
>>> f.set_validation_metric('rmse_mae') # optimize models using this metric
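
For a metric that does not depend on scalecast's own utilities, a function only needs to follow the contract described above: actuals first, predictions second, a float returned. The symmetric MAPE below is one possible sketch; the name smape and the choice to skip points where both values are zero are assumptions of this example.

```python
def smape(a, f):
    """Symmetric MAPE: a holds actuals, f holds predictions; returns a float."""
    terms = [
        abs(fi - ai) / ((abs(ai) + abs(fi)) / 2)
        for ai, fi in zip(a, f)
        if abs(ai) + abs(fi) > 0  # skip points where both values are zero
    ]
    return sum(terms) / len(terms) if terms else 0.0


score = smape([100, 200], [110, 180])
```

A function like this could then be registered with f.add_metric(smape) and referenced by the name 'smape'.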
add_other_regressor(called, start, end)

Adds a dummy variable that is 1 during the specified time period, 0 otherwise.

Parameters:
  • called (str) – What to call the resulting variable.

  • start (str, datetime.datetime, or pd.Timestamp) – Start date. Must be parsable by pandas’ Timestamp function.

  • end (str, datetime.datetime, or pd.Timestamp) – End date. Must be parsable by pandas’ Timestamp function.

Returns:

None

>>> f.add_other_regressor('january_2021','2021-01-01','2021-01-31')
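
What such a dummy variable looks like over a run of dates can be sketched with the standard library. The helper dummy_regressor is hypothetical, and treating both endpoints as part of the period is an assumption.

```python
from datetime import date, timedelta


def dummy_regressor(dates, start, end):
    """Hypothetical helper: 1 inside the period, 0 outside."""
    # assumption: both endpoints count as inside the period
    return [1 if start <= d <= end else 0 for d in dates]


# five consecutive days starting 2020-12-30
dates = [date(2020, 12, 30) + timedelta(days=i) for i in range(5)]
january_2021 = dummy_regressor(dates, date(2021, 1, 1), date(2021, 1, 31))
```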
add_poly_terms(*args, pwr=2, sep='^')

Raises all passed variables (no AR terms) to exponential powers (ints only).

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • pwr (int) – Default 2. The max power to add to each term in args (2 to this number will be added).

  • sep (str) – Default ‘^’. The separator between each term in arg to create the final variable name.

Returns:

None

>>> f.add_poly_terms('t','year',pwr=3) # raises t and year to 2nd and 3rd powers (called 't^2', 't^3', 'year^2', 'year^3')
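
Because pwr is a maximum, every integer power from 2 up to pwr is added for each variable. A minimal sketch of that expansion, using a hypothetical poly_terms helper over a dictionary of lists:

```python
def poly_terms(xvars, *names, pwr=2, sep="^"):
    """Hypothetical helper: add powers 2..pwr of each named regressor."""
    return {
        f"{name}{sep}{p}": [x ** p for x in xvars[name]]
        for name in names
        for p in range(2, pwr + 1)
    }


xvars = {"t": [1, 2, 3], "year": [2020, 2021, 2022]}
added = poly_terms(xvars, "t", "year", pwr=3)
```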
add_pt_terms(*args, method='box-cox', sep='_', drop=False)

Applies a box-cox or yeo-johnson power transformation to all passed variables (no AR terms).

Parameters:
  • *args (str) – Names of Xvars that already exist in the object.

  • method (str) – One of {‘box-cox’,’yeo-johnson’}, default ‘box-cox’. The type of transformation. box-cox works for positive values only. yeo-johnson is like a box-cox but can be used with 0s or negatives. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html.

  • sep (str) – Default ‘_’. The separator between each term in arg to create the final variable name. Resulting variable names will be like “box-cox_t” or “yeo-johnson_t” by default.

  • drop (bool) – Default False. Whether to drop the regressors passed to *args.

Returns:

None

>>> f.add_pt_terms('t') # adds box cox of t called 'box-cox_t'
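
To show why box-cox is limited to positive values while yeo-johnson is not, here are the two transformations written out for a fixed lambda. This is only a sketch of the formulas: the linked PowerTransformer estimates lambda from the data, whereas the functions below (hypothetical names) take it as an argument.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("box-cox requires strictly positive values")
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value; accepts zeros and negatives."""
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)


positive_only = [box_cox(x, 0.5) for x in [1, 4, 9]]
any_sign = [yeo_johnson(x, 0.5) for x in [-3, 0, 3]]
```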
add_seasonal_regressors(*args, raw=True, sincos=False, dummy=False, drop_first=False, cycle_lens=None, fourier_order=2.0)

Adds seasonal regressors. Can be in the form of Fourier transformed, dummy, or integer values.

Parameters:
  • *args (str) – Values that return a series of int type from pandas.dt or pandas.dt.isocalendar(). See https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.year.html.

  • raw (bool) – Default True. Whether to use the raw integer values.

  • sincos (bool) – Default False. Whether to use a Fourier transformation of the raw integer values. The length of the cycle is derived from the max observed value unless cycle_lens is specified.

  • dummy (bool) – Default False. Whether to use dummy variables from the raw int values.

  • drop_first (bool) – Default False. Whether to drop the first observed dummy level. Not relevant when dummy = False.

  • cycle_lens (dict) – Optional. A dictionary that specifies a cycle length for each selected seasonality. If this is not specified or a selected seasonality is not added to the dictionary as a key, the cycle length will be selected automatically as the maximum value observed for the given seasonality. Not relevant when sincos = False.

  • fourier_order (float) – Default 2.0. The fourier order to apply to terms that are added using sincos = True. This number is the number of complete cycles in that given seasonal period. 2 captures the fundamental frequency and its first harmonic. Higher orders will capture more complex seasonality, but may lead to overfitting.

Returns:

None

>>> f.add_seasonal_regressors('year')
>>> f.add_seasonal_regressors(
>>>     'dayofyear',
>>>     'month',
>>>     'week',
>>>     'quarter',
>>>     raw=False,
>>>     sincos=True,
>>>     cycle_lens={'dayofyear':365.25},
>>> )
>>> f.add_seasonal_regressors('dayofweek',raw=False,dummy=True,drop_first=True)
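
The three representations (raw integers, a sin/cos pair, and dummy variables) can be compared on a monthly example. The sketch below is an illustration under assumptions: month_regressors is a hypothetical helper, the dummy names such as 'month_2' are invented for the example, the cycle length falls back to the maximum observed value as described above, and fourier_order is not modeled.

```python
import math
from datetime import date


def month_regressors(dates, raw=True, sincos=False, dummy=False, drop_first=False):
    """Hypothetical helper: three ways to represent month-of-year seasonality."""
    months = [d.month for d in dates]
    out = {}
    if raw:
        out["month"] = months
    if sincos:
        cycle_len = max(months)  # stated default when cycle_lens is not given
        out["monthsin"] = [math.sin(2 * math.pi * m / cycle_len) for m in months]
        out["monthcos"] = [math.cos(2 * math.pi * m / cycle_len) for m in months]
    if dummy:
        levels = sorted(set(months))
        if drop_first:
            levels = levels[1:]  # drop the first observed level
        for level in levels:
            out[f"month_{level}"] = [int(m == level) for m in months]
    return out


dates = [date(2021, m, 1) for m in range(1, 13)]
fourier = month_regressors(dates, raw=False, sincos=True)
dummies = month_regressors(dates, raw=False, dummy=True, drop_first=True)
```

Dropping the first dummy level avoids perfect collinearity with an intercept in linear models, which is the usual reason to set drop_first=True.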
add_series(series, called, first_date=None, forward_pad=True, back_pad=True)

Adds other series to the object as regressors. If the added series is shorter than the length of Forecaster.y + len(Forecaster.future_dates), it will be padded with 0s by default.

Parameters:
  • series (list-like) – The series to add as a regressor to the object.

  • called (str) – Required. What to call the resulting regressor in the Forecaster object.

  • first_date (Datetime) – Optional. The first date that corresponds with the added series. If left unspecified, will assume its first date is the same as the first date in the Forecaster object. Must be datetime or otherwise able to be parsed by the pandas.Timestamp() function.

  • forward_pad, back_pad (bool) – Default True for both. Whether to put 0s before and/or after the series if the series is too short.

>>> x = [1,2,3,4,5,6]
>>> f.add_series(series = x,called='x') # assumes first date is same as what is in f.current_dates
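
The padding behaviour can be pictured with a short sketch. Everything here is an assumption for illustration: pad_series is a hypothetical helper, offset stands for the number of time steps between the object's first date and the series' first date, and trimming an overly long series is a choice made for this example only.

```python
def pad_series(series, total_length, offset=0):
    """Hypothetical helper: align a shorter series to a target length with 0s."""
    # 0s before the series cover dates earlier than its first date
    padded = [0] * offset + list(series)
    # 0s after the series cover the remaining observed and future dates
    padded += [0] * max(0, total_length - len(padded))
    return padded[:total_length]


same_start = pad_series([1, 2, 3], total_length=6)
later_start = pad_series([1, 2, 3], total_length=6, offset=2)
```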
add_signals(model_nicknames, fill_strategy='actuals', train_only=False)

Adds the predictions from already-evaluated models as covariates that can be used for future evaluated models. The names of the added variables will all begin with “signal_” and end with the given model nickname.

Parameters:
  • model_nicknames (list) – The names of already-evaluated models with information stored in the history attribute.

  • fill_strategy (str or None) – The strategy to fill NA values that are present at the beginning of a given model’s fitted values. Available options are: ‘actuals’ (default) which will replace nulls with actuals; ‘bfill’ which will backfill null values; or None which will leave null values alone, which can cause errors in future evaluated models.

  • train_only (bool) – Default False. Whether to add fitted values from the training set only. The test-set predictions will be out-of-sample if this is True. The future unknown values are always out-of-sample. Even when this is True, the future unknown values are taken from a model trained on the full set of known observations.

>>> f.set_estimator('lstm')
>>> f.manual_forecast(call_me='lstm')
>>> f.add_signals(model_nicknames = ['lstm']) # adds a regressor called 'signal_lstm'
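
The three fill_strategy options differ only in how they treat the null values at the start of a model's fitted values. The following sketch shows that difference on plain lists; fill_leading_nulls is a hypothetical helper and does not reproduce the library's code.

```python
def fill_leading_nulls(fitted, actuals, fill_strategy="actuals"):
    """Hypothetical helper: fill the None values at the start of fitted values."""
    if fill_strategy is None:
        return list(fitted)  # nulls are left alone
    if fill_strategy not in ("actuals", "bfill"):
        raise ValueError("fill_strategy must be 'actuals', 'bfill', or None")
    # index of the first non-null fitted value
    first = next((i for i, v in enumerate(fitted) if v is not None), len(fitted))
    if fill_strategy == "actuals":
        head = list(actuals[:first])
    elif first < len(fitted):
        head = [fitted[first]] * first  # backfill with the first known value
    else:
        head = [None] * first  # nothing to backfill from
    return head + list(fitted[first:])


fitted = [None, None, 3.1, 4.2]
actuals = [1, 2, 3, 4]
with_actuals = fill_leading_nulls(fitted, actuals)
with_bfill = fill_leading_nulls(fitted, actuals, fill_strategy="bfill")
```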
add_sklearn_estimator(imported_module, called)

Adds a new estimator from scikit-learn not built-in to the forecaster object that can be called using set_estimator(). Only regression models are accepted.

Parameters:
  • imported_module (scikit-learn regression model) – The model from scikit-learn to add. Must have already been imported locally. Supports models from sklearn and sklearn APIs.

  • called (str) – The name of the estimator that can be called using set_estimator().

Returns:

None

>>> from sklearn.ensemble import StackingRegressor
>>> f.add_sklearn_estimator(StackingRegressor,called='stacking')
>>> f.set_estimator('stacking')
>>> f.manual_forecast(...)
add_time_trend(called='t')

Adds a time trend from 1 to length of the series + the forecast horizon as a current and future Xvar.

Parameters:

called (str) – Default ‘t’. What to call the resulting variable.

Returns:

None

>>> f.add_time_trend() # adds time trend called 't'
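
The phrase “as a current and future Xvar” means one continuous counter is split between observed and forecast periods. A minimal sketch, with time_trend as a hypothetical helper:

```python
def time_trend(n_obs, horizon):
    """Hypothetical helper: a counter from 1 to n_obs + horizon, split in two."""
    t = list(range(1, n_obs + horizon + 1))
    # the first n_obs values align with observed dates, the rest with future dates
    return t[:n_obs], t[n_obs:]


current_t, future_t = time_trend(n_obs=5, horizon=2)
```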
adf_test(critical_pval=0.05, full_res=True, train_only=False, diffy=False, **kwargs)

Tests the stationarity of the y series using augmented dickey fuller. Ports from statsmodels: https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html.

Parameters:
  • critical_pval (float) – Default 0.05. The p-value threshold in the statistical test to accept the alternative hypothesis.

  • full_res (bool) – Default True. If True, returns a tuple with the p-value, evaluated statistic, and other statistical information (returns what the adfuller() function from statsmodels does). If False, returns a bool that matches whether the test indicates stationarity.

  • train_only (bool) – Default False. If True, will exclude the test set from the test (to avoid leakage).

  • diffy (bool or int) – One of {True,False,0,1}. Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • **kwargs – Passed to the adfuller() function from statsmodels. See https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html.

Returns:

If bool (full_res = False), returns whether the test suggests stationarity. Otherwise, returns the full results (stat, pval, etc.) of the test.

Return type:

(bool or tuple)

>>> stat, pval, _, _, _, _ = f.adf_test(full_res=True)
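
Two pieces of this method are simple enough to sketch without statsmodels: the single differencing applied when diffy is True or 1, and the comparison against critical_pval used when full_res is False. Both helpers below are hypothetical, and the strict inequality is an assumption.

```python
def difference(y, diffy=False):
    """Hypothetical helper: difference the series once if diffy is True or 1."""
    if not diffy:
        return list(y)
    return [later - earlier for earlier, later in zip(y[:-1], y[1:])]


def suggests_stationarity(pval, critical_pval=0.05):
    """Hypothetical helper: a small p-value favors the stationary alternative."""
    # assumption: strict inequality at the threshold
    return pval < critical_pval


diffed = difference([1, 3, 6, 10], diffy=True)
is_stationary = suggests_stationarity(0.01)
```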
all_feature_info_to_excel(out_path='./', excel_name='feature_info.xlsx')

Saves all feature importance and summary stats to excel. Each model for which such info is available gets its own tab. Be sure to have called save_summary_stats() and/or save_feature_importance() before using this function.

Parameters:
  • out_path (str) – Default ‘./’. The path to export to.

  • excel_name (str) – Default ‘feature_info.xlsx’. The name of the resulting excel file.

Returns:

None

all_validation_grids_to_excel(out_path='./', excel_name='validation_grids.xlsx', sort_by_metric_value=False, ascending=True)

Saves all validation grids to excel. Each model for which such info is available gets its own tab. Be sure to have tuned at least one model before calling this.

Parameters:
  • out_path (str) – Default ‘./’. The path to export to.

  • excel_name (str) – Default ‘validation_grids.xlsx’. The name of the resulting excel file.

  • sort_by_metric_value (bool) – Default False. Whether to sort the output by performance on validation set.

  • ascending (bool) – Default True. Whether to sort least-to-greatest. Ignored if sort_by_metric_value is False.

Returns:

None

auto_Xvar_select(estimator='mlr', try_trend=True, trend_estimator='mlr', trend_estimator_kwargs={}, decomp_trend=True, decomp_method='additive', try_ln_trend=True, max_trend_poly_order=2, try_seasonalities=True, seasonality_repr=['sincos'], exclude_seasonalities=[], irr_cycles=None, max_ar='auto', test_already_added=True, must_keep=[], monitor='ValidationMetricValue', cross_validate=False, dynamic_tuning=False, cvkwargs={}, **kwargs)

Attempts to find the ideal trend, seasonality, and look-back representations for the stored series by systematically adding regressors to the object and monitoring a passed metric value. Searches for trend first, then seasonalities, then optimal lag order, then the best combination of all of the above, along with irregular cycles (if specified) and any regressors already added to the object. The function offers flexibility around setting Xvars it must add to the object by letting the user add these regressors before calling the function, telling the function not to re-search for them, and telling the function not to drop them when considering the optimal combination of regressors. The final optimal combination of regressors is determined by grouping all extracted regressors into trends, seasonalities, irregular cycles, AR terms, and regressors already added, and trying all combinations of these groups. See the example: https://scalecast-examples.readthedocs.io/en/latest/misc/auto_Xvar/auto_Xvar.html.

Parameters:
  • estimator (str) – One of Forecaster.sklearn_estimators. Default ‘mlr’. The estimator to use to determine the best seasonal and lag regressors.

  • try_trend (bool) – Default True. Whether to search for trend representations of the series.

  • trend_estimator (str) – One of Forecaster.sklearn_estimators. Default ‘mlr’. Ignored if try_trend is False. The estimator to use to determine the best trend representation.

  • trend_estimator_kwargs (dict) – Default {}. The model parameters to pass to the trend_estimator model.

  • decomp_trend (bool) – Default True. Whether to decompose the series to estimate the trend. Ignored if try_trend is False. The idea is there can be many seasonalities represented by scalecast, but only one trend, so using a decomposition method for trend could lead to finding a better trend representation.

  • decomp_method (str) – One of ‘additive’,’multiplicative’. Default ‘additive’. The decomp method used to represent the trend. Ignored if try_trend is False. Ignored if decomp_trend is False.

  • try_ln_trend (bool) – Default True. Ignored if try_trend is False. Whether to search logged trend representations using a natural log.

  • max_trend_poly_order (int) – Default 2. The highest order trend representation that will be searched.

  • try_seasonalities (bool) – Default True. Whether to search for seasonal representations. This function uses a hierarchical approach from secondly to quarterly representations. Secondly will search all seasonal representations up to quarterly to find the best hierarchy of seasonalities. Anything lower than second and higher than quarter will not receive a seasonality with this method. Day seasonality and lower will try ‘day’ (of month), ‘dayofweek’, and ‘dayofyear’ seasonalities. Everything else will try cycles that reset yearly, so to search for intermittent seasonal fluctuations, use the irr_cycles argument.

  • seasonality_repr (list or dict[str,list]) – Default [‘sincos’]. How to represent the extracted seasonalities. The default will use Fourier representations only. Ignored if try_seasonalities is False. Other elements to add to the list: ‘dummy’,’raw’,’drop_first’. Can add multiple or one of these. If dict, the key needs to be the seasonal representation (‘quarter’ for quarterly, ‘month’ for monthly) and the value a list. If a seasonal representation is not found in this dictionary, it will default to [‘sincos’], i.e. a Fourier representation. ‘drop_first’ is ignored when ‘dummy’ is not present.

  • exclude_seasonalities (list) – Default []. Ignored if try_seasonalities is False. Add in this list any seasonal representations to skip searching. If you have day frequency and you only want to search dayofweek, you should specify this as: [‘dayofweek’,’week’,’month’,’quarter’].

  • irr_cycles (list[int]) – Optional. Add any irregular cycle lengths to a list as integers to search for using this method.

  • max_ar ('auto' or int) – The highest lag order to search for. If ‘auto’, will use the greater of the forecast length or the test-set length as the lag order. If a larger number than available observations is placed here, the AR search will stop early. Set to 0 to skip searching for lag terms.

  • test_already_added (bool) – Default True. Whether regressors already added to the object take part in the search. If True, they may be dropped when looking for the optimal combination of regressors. Set this to False to always keep them in the object.

  • must_keep (list-like) – Default []. The names of any regressors that must be kept in the object. All regressors here must already be added to the Forecaster object before calling the function. This is ignored if test_already_added is False since it becomes redundant.

  • monitor (str) – One of Forecaster.determine_best_by. Default ‘ValidationMetricValue’. The metric to be monitored when making reduction decisions.

  • cross_validate (bool) – Default False. Whether to tune the model with cross validation. If False, uses the validation slice of data to tune. If not monitoring ValidationMetricValue, you will want to leave this False.

  • dynamic_tuning (bool or int) – Default False. Whether to dynamically tune the model or, if int, how many forecast steps to dynamically tune it.

  • cvkwargs (dict) – Default {}. Passed to the cross_validate() method.

  • **kwargs – Passed to manual_forecast() method and can include arguments related to a given model’s hyperparameters or dynamic_testing. Do not pass Xvars.

Returns:

A dictionary where each key is a tuple of variable combinations and the value is the derived metric (based on value passed to monitor argument).

Return type:

(dict[tuple,float])

>>> f.add_covid19_regressor()
>>> f.auto_Xvar_select(cross_validate=True)
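
The seasonality_repr rules described above (dictionary lookup with a [‘sincos’] fallback, and ‘drop_first’ being ignored without ‘dummy’) can be sketched in plain Python. This is only an illustration of the documented behavior; resolve_repr is a hypothetical helper and not part of scalecast.

```python
DEFAULT_REPR = ['sincos']  # documented fallback: a fourier representation

def resolve_repr(seasonality_repr, seasonality):
    """Hypothetical helper: representations to try for one seasonality."""
    if isinstance(seasonality_repr, dict):
        # seasonalities missing from the dict fall back to ['sincos']
        reprs = list(seasonality_repr.get(seasonality, DEFAULT_REPR))
    else:
        # a plain list applies to every seasonality
        reprs = list(seasonality_repr)
    # 'drop_first' only modifies dummy variables, so it is ignored without 'dummy'
    if 'dummy' not in reprs:
        reprs = [r for r in reprs if r != 'drop_first']
    return reprs

spec = {'month': ['dummy', 'drop_first'], 'quarter': ['raw', 'drop_first']}
month_repr = resolve_repr(spec, 'month')      # both entries kept
quarter_repr = resolve_repr(spec, 'quarter')  # 'drop_first' ignored
week_repr = resolve_repr(spec, 'week')        # falls back to the default
```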
auto_forecast(call_me=None, dynamic_testing=True, test_again=True)

Auto forecasts with the best parameters indicated from the tuning process.

Parameters:
  • call_me (str) – Optional. What to call the model when storing it in the object’s history dictionary. If not specified, the model’s nickname will be assigned the estimator value (‘mlp’ will be ‘mlp’, etc.). Duplicated names will be overwritten with the most recently called model.

  • dynamic_testing (bool or int) – Default True. Whether to dynamically/recursively test the forecast (meaning AR terms will be propagated with predicted values). If True, evaluates dynamically over the entire out-of-sample slice of data. If int, evaluates over windows of that many steps (2 for 2-step dynamic forecasting, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out. The model will skip testing if the test_length attribute is set to 0.

  • test_again (bool) – Default True. Whether to test the model before forecasting to a future horizon. If test_length is 0, this is ignored. Set this to False if you tested the model manually by calling f.test() and don’t want to waste resources testing it again.

>>> f.set_estimator('xgboost')
>>> f.tune()
>>> f.auto_forecast()
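
To illustrate what dynamic_testing changes, here is a minimal sketch using a toy AR(1) rule (prediction = phi * previous value). The function names and the model are assumptions made for illustration and do not reflect scalecast internals. In non-dynamic testing each lag is refreshed with the observed value; in dynamic testing the lag is replaced by the previous prediction.

```python
def one_step_predictions(phi, last_train, test_actuals):
    """Non-dynamic testing: every prediction uses the true previous value."""
    preds, prev = [], last_train
    for actual in test_actuals:
        preds.append(phi * prev)
        prev = actual  # the lag is refreshed with the observed value
    return preds

def dynamic_predictions(phi, last_train, n_steps):
    """Dynamic testing: predictions are fed back in as the next lag."""
    preds, prev = [], last_train
    for _ in range(n_steps):
        pred = phi * prev
        preds.append(pred)
        prev = pred  # the lag is propagated with the predicted value
    return preds

test_actuals = [6.0, 4.0, 2.0]
static = one_step_predictions(0.5, 8.0, test_actuals)
dynamic = dynamic_predictions(0.5, 8.0, len(test_actuals))
```

The two lists agree on the first step only. After that, errors in the dynamic predictions compound, which is why dynamic testing gives a better picture of multi-step performance.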
chop_from_back(n)

Cuts y observations in the object from the back by counting forward from the beginning.

Parameters:

n (int) – The number of observations to cut from the back.

>>> f.chop_from_back(10) # chops 10 observations off the back
chop_from_front(n, fcst_length=None)

Cuts y observations in the object from the front, counting backwards. The current length of the forecast horizon will be maintained and all future regressors will be rewritten to the appropriate attributes.

Parameters:
  • n (int) – The number of observations to cut from the front.

  • fcst_length (int) – Optional. The new forecast length. By default, maintains the same forecast length currently in the object.

>>> f.chop_from_front(10) # keeps all observations before the last 10
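
The difference between the two chop methods can be pictured with list slicing. This sketch assumes that the "back" is the oldest end of the series and the "front" is the most recent end, which is how the examples above read. The functions below are stand-ins operating on a plain list, not the Forecaster methods themselves.

```python
y = list(range(1, 21))  # 20 observations, oldest first

def chop_from_back(values, n):
    """Assumed behaviour: drop the n oldest observations (n > 0)."""
    return values[n:]

def chop_from_front(values, n):
    """Assumed behaviour: drop the n most recent observations (n > 0)."""
    return values[:-n]

without_oldest = chop_from_back(y, 5)   # keeps observations 6..20
without_newest = chop_from_front(y, 5)  # keeps observations 1..15
```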
copy()

Creates an object copy.

cross_validate(k=5, test_length=None, train_length=None, space_between_sets=None, rolling=False, dynamic_tuning=False, set_aside_test_set=True, verbose=False)

Tunes a model’s hyperparameters using time-series cross validation. Monitors the metric specified in the validation_metric attribute. Set an estimator before calling. Reads a grid for the estimator from a grids file unless a grid is ingested manually. The chosen parameters are stored in the best_params attribute. All metrics from each iteration are stored in grid_evaluated. The rows in this matrix correspond to the element index in f.grid (a hyperparameter combo) and the columns are the derived metrics across the k folds. Any hyperparameters that ever failed to evaluate will return N/A and are not considered. The best parameter combo is determined by the best average derived metric across all folds. The temporal order of the series is always maintained in this process. If a test_length is specified in the object, it will be set aside by default. (Default) Normal cv diagram: https://scalecast-examples.readthedocs.io/en/latest/misc/validation/validation.html#5-Fold-Time-Series-Cross-Validation. Rolling cv diagram: https://scalecast-examples.readthedocs.io/en/latest/misc/validation/validation.html#5-Fold-Rolling-Time-Series-Cross-Validation.

Parameters:
  • k (int) – Default 5. The number of folds. If 1, behaves as if the model were being tuned on a single held out set.

  • test_length (int) – Optional. The size of each held-out sample. By default, determined such that the last test set and train set are the same size.

  • train_length (int) – Optional. The size of each training set. By default, all available observations before each test set are used.

  • space_between_sets (int) – Optional. The space between each training set. By default, uses the test_length.

  • rolling (bool) – Default False. Whether to use a rolling method, meaning every train and test size is the same. This is ignored when either of train_length or test_length is specified.

  • dynamic_tuning (bool or int) – Default False. Whether to dynamically/recursively test the forecast during the tuning process (meaning AR terms will be propagated with predicted values). If True, evaluates recursively over the entire out-of-sample slice of data. If int, evaluates over windows of that many steps (2 for 2-step recursive testing, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out.

  • set_aside_test_set (bool) – Default True. Whether to separate the test set specified in f.test_length during this process.

  • verbose (bool) – Default False. Whether to print out information about the test size, train size, and date ranges for each fold.

Returns:

None

>>> f.set_estimator('xgboost')
>>> f.cross_validate() # tunes hyperparam values
>>> f.auto_forecast() # forecasts with the best params
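
The fold layout described above can be sketched with index arithmetic. This is a simplified illustration, not the library's implementation: ts_cv_folds is a hypothetical helper, the default test_length formula is an assumption chosen so that the oldest fold's train and test sets are equal in size, and the rolling train window is assumed to equal the test window.

```python
def ts_cv_folds(n_obs, k=5, test_length=None, rolling=False):
    """Return k folds as (train_start, train_end, test_start, test_end).

    Index ranges are half-open and the most recent fold comes first.
    """
    if test_length is None:
        # Assumption: split so the oldest fold's train and test sets match in size.
        test_length = n_obs // (k + 1)
    folds = []
    for i in range(k):
        test_end = n_obs - i * test_length
        test_start = test_end - test_length
        # Assumption: rolling uses a train window as long as the test window;
        # otherwise every observation before the test set is used for training.
        train_start = max(test_start - test_length, 0) if rolling else 0
        folds.append((train_start, test_start, test_start, test_end))
    return folds

expanding = ts_cv_folds(60, k=5)
rolling = ts_cv_folds(60, k=5, rolling=True)
```

In every fold the training range ends exactly where the test range starts, so the temporal order of the series is preserved.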
deepcopy()

Creates an object deepcopy.

determine_best_series_length(estimator='mlr', min_obs=100, max_obs=None, step=25, monitor='ValidationMetricValue', cross_validate=False, dynamic_tuning=False, cvkwargs={}, chop=True, **kwargs)

Attempts to find the optimal length for the series to produce accurate forecasts by systematically shortening the series, running estimations, and monitoring a passed metric value. This should be run after Xvars have already been added to the object and all Xvars will be used in the iterative estimations.

Parameters:
  • estimator (str) – One of Forecaster.estimators. Default ‘mlr’. The estimator to use to determine the best series length.

  • min_obs (int) – Default 100. The shortest representation of the series to search.

  • max_obs (int) – Optional. The longest representation of the series to search. By default, the last estimation will be run on all available observations.

  • step (int) – Default 25. How big a step to take between iterations.

  • monitor (str) – One of Forecaster.determine_best_by. Default ‘ValidationMetricValue’. The metric to be monitored when making reduction decisions.

  • cross_validate (bool) – Default False. Whether to tune the model with cross validation. If False, uses the validation slice of data to tune. If not monitoring ValidationMetricValue, you will want to leave this False.

  • dynamic_tuning (bool or int) – Default False. Whether to dynamically tune the model or, if int, how many forecast steps to dynamically tune it.

  • cvkwargs (dict) – Default {}. Passed to the cross_validate() method.

  • chop (bool) – Default True. Whether to shorten the series if a shorter length is found to be best.

  • **kwargs – Passed to manual_forecast() method and can include arguments related to a given model’s hyperparameters, dynamic_testing, or Xvars.

Returns:

A dictionary where each key is a series length and the value is the derived metric (based on what was passed to the monitor argument).

Return type:

(dict[int,float])

>>> f.auto_Xvar_select()
>>> f.determine_best_series_length()
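
A rough sketch of the search this method describes: build candidate series lengths from min_obs upward in increments of step, finish with all available observations, then keep the length with the best monitored metric. The exact iteration order inside scalecast is an assumption here, candidate_lengths is a hypothetical helper, and the metric values are invented for illustration.

```python
def candidate_lengths(n_obs, min_obs=100, max_obs=None, step=25):
    """Hypothetical helper: series lengths to evaluate, shortest first."""
    max_obs = n_obs if max_obs is None else min(max_obs, n_obs)
    lengths = list(range(min_obs, max_obs, step))
    lengths.append(max_obs)  # the last estimation uses all allowed observations
    return lengths

lengths = candidate_lengths(180)

# Invented error values keyed by series length (lower is better here).
results = dict(zip(lengths, [12.4, 11.1, 11.8, 12.0, 12.3]))
best_length = min(results, key=results.get)

# With chop=True, the series would be shortened to its most recent best_length values.
y = list(range(180))
y_chopped = y[-best_length:]
```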
drop_Xvars(*args, error='raise')

Drops regressors.

Parameters:
  • *args (str) – The names of regressors to drop.

  • error (str) – One of ‘ignore’,’raise’. Default ‘raise’. What to do with the error if the Xvar is not found in the object.

Returns:

None

>>> f.add_time_trend()
>>> f.add_exp_terms('t',pwr=.5)
>>> f.drop_Xvars('t','t^0.5')
drop_all_Xvars()

Drops all regressors.

drop_regressors(*args, error='raise')

Drops regressors.

Parameters:
  • *args (str) – The names of regressors to drop.

  • error (str) – One of ‘ignore’,’raise’. Default ‘raise’. What to do with the error if the Xvar is not found in the object.

Returns:

None

>>> f.add_time_trend()
>>> f.add_exp_terms('t',pwr=.5)
>>> f.drop_regressors('t','t^0.5')
eval_cis(mode=True, cilevel=0.95)

Call this function to change whether or not the Forecaster sets confidence intervals on all evaluated models. Beginning with version 0.17.0, only conformal confidence intervals are supported. Conformal intervals need a test set to be configured soundly. Confidence intervals cannot be evaluated when there aren’t at least 1/(1-cilevel) observations in the test set.

Parameters:
  • mode (bool) – Default True. Whether to set confidence intervals on or off for models.

  • cilevel (float) – Default .95. Must be greater than 0, less than 1. The confidence level to use to set intervals.
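
The 1/(1-cilevel) rule above translates to a minimum test-set size. The helper below is a hypothetical illustration of that arithmetic only; the rounding step is there to absorb floating-point noise (for example, 1/(1-0.95) does not evaluate to exactly 20 in binary floating point).

```python
import math

def min_test_obs(cilevel):
    """Smallest test-set size that satisfies 1 / (1 - cilevel)."""
    if not 0 < cilevel < 1:
        raise ValueError('cilevel must be greater than 0 and less than 1')
    # round first so float noise does not push the ceiling up by one
    return math.ceil(round(1 / (1 - cilevel), 8))

needed_95 = min_test_obs(0.95)  # 95% intervals
needed_90 = min_test_obs(0.90)  # 90% intervals
needed_99 = min_test_obs(0.99)  # 99% intervals
```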

export(dfs=['model_summaries', 'lvl_test_set_predictions', 'lvl_fcsts'], models='all', best_model='auto', determine_best_by=None, cis=False, to_excel=False, out_path='./', excel_name='results.xlsx') Dict[str, DataFrame] | DataFrame

Exports between one and all three pandas DataFrames. Can write to Excel with each DataFrame on a separate sheet. Returns either a dictionary with DataFrames as values (df str arguments as keys) or a single DataFrame if only one df is specified.

Parameters:
  • dfs (list-like or str) – Default [‘model_summaries’, ‘lvl_test_set_predictions’, ‘lvl_fcsts’]. A list or name of the specific dataframe(s) you want returned and/or written to excel. Must be one of or multiple of the elements in default. Exporting test set predictions only works if all exported models were tested using the same test length.

  • models (list-like or str) – Default ‘all’. The models to write information for. Can start with “top_” and the metric specified in determine_best_by will be used to order the models appropriately.

  • best_model (str) – Default ‘auto’. The name of the best model, if “auto”, will determine this by the metric in determine_best_by. If not “auto”, must match a model nickname of an already-evaluated model.

  • determine_best_by (str) – One of Forecaster.determine_best_by or None. Default None. If None and best_model is ‘auto’, the best model will be designated as the first-evaluated model.

  • to_excel (bool) – Default False. Whether to save to excel.

  • out_path (str) – Default ‘./’. The path to save the excel file to (ignored when to_excel=False).

  • cis (bool) – Default False. Whether to export confidence intervals for models in “lvl_test_set_predictions”, “lvl_fcsts” dataframes.

  • excel_name (str) – Default ‘results.xlsx’. The name to call the excel file (ignored when to_excel=False).

Returns:

Either a single pandas dataframe if one element is passed to dfs or a dictionary where the keys match what was passed to dfs and the values are dataframes.

Return type:

(DataFrame or Dict[str,DataFrame])

>>> results = f.export(dfs=['model_summaries','lvl_fcsts'],to_excel=True) # returns a dict
>>> model_summaries = results['model_summaries'] # returns a dataframe
>>> lvl_fcsts = results['lvl_fcsts'] # returns a dataframe
>>> ts_preds = f.export('lvl_test_set_predictions') # returns a dataframe
export_Xvars_df(dropna=False)

Gets all utilized regressors and values.

Parameters:

dropna (bool) – Default False. Whether to drop null values from the resulting dataframe.

Returns:

A dataframe of Xvars and names/values stored in the object.

Return type:

(DataFrame)

export_feature_importance(model) DataFrame

Exports the feature importance from a model. Raises an error if you never saved the model’s feature importance.

Parameters:

model (str) – The name of the model to export for. Matches what was passed to call_me when evaluating the model.

Returns:

The resulting feature importances of the evaluated model passed to model.

Return type:

(DataFrame)

>>> fi = f.export_feature_importance('mlr')
export_fitted_vals(model)

Exports a single dataframe with dates, fitted values, actuals, and residuals for one model.

Parameters:

model (str) – The model nickname.

Returns:

A dataframe with dates, fitted values, actuals, and residuals.

Return type:

(DataFrame)

export_summary_stats(model) DataFrame

Exports the summary stats from a model. Raises an error if you never saved the model’s summary stats.

Parameters:

model (str) – The name of the model to export for. Matches what was passed to call_me when evaluating the model.

Returns:

The resulting summary stats of the evaluated model passed to model.

Return type:

(DataFrame)

>>> ss = f.export_summary_stats('arima')
export_validation_grid(model) DataFrame

Exports the validation grid from a model, converted to a pandas dataframe. Raises an error if the model was not tuned.

Parameters:

model (str) – The name of the model to export for. Matches what was passed to call_me when evaluating the model.

Returns:

The resulting validation grid of the evaluated model passed to model arg.

Return type:

(DataFrame)

generate_future_dates(n)

Generates a specified number of future dates in the same frequency as current_dates.

Parameters:

n (int) – Greater than 0. Number of future dates to produce. This will also be the forecast length.

Returns:

None

>>> f.generate_future_dates(12) # 12 future dates to forecast out to
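
Conceptually, generating future dates means extending the observed dates by n steps at the same spacing. The sketch below assumes a fixed-width frequency (such as daily) and infers the step from the last two observed dates. The library relies on the pandas-inferred frequency instead, which also covers calendar frequencies such as monthly, so treat future_dates here as a simplified, hypothetical stand-in.

```python
from datetime import date, timedelta

def future_dates(current_dates, n):
    """Extend a fixed-width date sequence by n steps (n must be > 0)."""
    if n <= 0:
        raise ValueError('n must be greater than 0')
    step = current_dates[-1] - current_dates[-2]  # spacing of the observed dates
    return [current_dates[-1] + step * i for i in range(1, n + 1)]

current = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 3)]
horizon = future_dates(current, 3)  # also defines a forecast length of 3
```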
get_freq()

Gets the pandas inferred date frequency.

Returns:

The inferred frequency of the current_dates array.

Return type:

(str)

>>> f.get_freq()
get_regressor_names()

Gets the regressor names stored in the object.

Returns:

Regressor names that have been added to the object.

Return type:

(list)

>>> f.add_time_trend()
>>> f.get_regressor_names()
infer_freq()

Uses the pandas library to infer the frequency of the loaded dates.

ingest_Xvars_df(df, date_col='Date', drop_first=False, use_future_dates=False, pad=False)

Ingests a dataframe of regressors and saves its Xvars to the object. The user must specify a date column name in the dataframe being ingested. All non-numeric values are dummied. The dataframe should cover the entire future horizon stored within the Forecaster object, but can be padded with 0s if testing only is desired. Any columns in the dataframe that begin with “AR” will be confused with autoregressive terms and could cause errors.

Parameters:
  • df (DataFrame) – The dataframe that is at least the length of the y array stored in the object plus the forecast horizon.

  • date_col (str) – Default ‘Date’. The name of the date column in the dataframe. This column must have the same frequency as the dates stored in the Forecaster object.

  • drop_first (bool) – Default False. Whether to drop the first observation of any dummied variables. Irrelevant if passing all numeric values.

  • use_future_dates (bool) – Default False. Whether to use the future dates in the dataframe as the resulting future_dates attribute in the Forecaster object.

  • pad (bool) – Default False. Whether to pad any missing values with 0s.

Returns:

None

ingest_grid(grid)

Ingests a grid to tune the estimator.

Parameters:

grid (dict or str) – If dict, must be a user-created grid. If str, must match the name of a dict grid stored in a grids file.

Returns:

None

>>> f.set_estimator('mlr')
>>> f.ingest_grid({'normalizer':['scale','minmax']})
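
A grid dictionary maps each hyperparameter name to the values to try, and tuning evaluates combinations of those values. The expansion can be sketched with itertools.product. The expand_grid helper and the 'alpha' key are illustrative assumptions, and how scalecast stores the expanded grid internally may differ.

```python
from itertools import product

def expand_grid(grid):
    """Hypothetical helper: list every hyperparameter combination in a grid dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# 'normalizer' comes from the example above; 'alpha' is added for illustration.
grid = {'normalizer': ['scale', 'minmax'], 'alpha': [0.1, 0.5, 1.0]}
combos = expand_grid(grid)  # 2 normalizers x 3 alphas = 6 combinations
```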
keep_smaller_history(n)

Keeps only the most recent y observations in the object by counting back from the end of the series and cutting everything earlier.

Parameters:

n (int, str, or datetime.datetime) – If int, the number of observations to keep. Otherwise, the earliest date to keep (observations on or after it are retained). Must be parsable by pandas’ Timestamp function.

Returns:

None

>>> f.keep_smaller_history(500) # keeps last 500 observations
>>> f.keep_smaller_history('2020-01-01') # keeps only observations on or later than 1/1/2020
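
Both forms of the n argument can be pictured on a plain list of dates. This is a stdlib-only sketch of the behavior shown in the example above, not the library's code; the real method accepts anything pandas can parse as a timestamp, whereas this stand-in only handles ints, ISO date strings, and date objects.

```python
from datetime import date, timedelta

dates = [date(2019, 12, 28) + timedelta(days=i) for i in range(10)]
y = list(range(10))

def keep_smaller_history(dates, y, n):
    """Stand-in: keep the last n observations, or everything on/after a date."""
    if isinstance(n, int):
        return dates[-n:], y[-n:]
    cutoff = date.fromisoformat(n) if isinstance(n, str) else n
    kept = [(d, v) for d, v in zip(dates, y) if d >= cutoff]
    return [d for d, _ in kept], [v for _, v in kept]

last_dates, last_y = keep_smaller_history(dates, y, 3)
since_dates, since_y = keep_smaller_history(dates, y, '2020-01-01')
```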
limit_grid_size(n, min_grid_size=1, random_seed=None)

Makes a grid smaller randomly.

Parameters:
  • n (int or float) – If int, randomly selects that many parameter combinations. If float, must be less than 1 and greater than 0, and randomly selects that percentage of parameter combinations.

  • min_grid_size (int) – Default 1. The min number of hyperparameters to keep from the original grid if a float is passed to n.

  • random_seed (int) – Optional. Set a seed to make results consistent.

Returns:

None

>>> from scalecast import GridGenerator
>>> GridGenerator.get_example_grids()
>>> f.set_estimator('mlp')
>>> f.ingest_grid('mlp')
>>> f.limit_grid_size(10,random_seed=20) # limits grid to 10 iterations
>>> f.limit_grid_size(.5,random_seed=20) # limits grid to half its original size
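
The int and float forms of n, along with min_grid_size and random_seed, can be sketched as random sampling without replacement. The limit_grid function below is a hypothetical stand-in working on a list of combinations. The sampling routine and rounding used by scalecast are not specified in this documentation, so the choices here are assumptions.

```python
import random

def limit_grid(combos, n, min_grid_size=1, random_seed=None):
    """Stand-in: randomly keep n combos (int) or a share of them (float)."""
    rng = random.Random(random_seed)  # a seed makes the selection repeatable
    if isinstance(n, float):
        if not 0 < n < 1:
            raise ValueError('a float n must be greater than 0 and less than 1')
        # min_grid_size only applies when a fraction is requested
        size = max(int(len(combos) * n), min_grid_size)
    else:
        size = n
    size = min(size, len(combos))  # cannot keep more than what exists
    return rng.sample(combos, size)

combos = list(range(40))  # placeholder for 40 hyperparameter combinations
ten = limit_grid(combos, 10, random_seed=20)
half = limit_grid(combos, .5, random_seed=20)
tiny = limit_grid(combos, .01, min_grid_size=3, random_seed=20)
```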
load_tf_model(name='model.h5')

Loads a fitted tensorflow (RNN/LSTM) model and attaches it to the Forecaster object in the tf_model attribute.

Parameters:

name (str) – Default ‘model.h5’. The name of the file to load. A file directory with a file name is also accepted here.

>>> f.set_estimator('rnn')
>>> f.manual_forecast()
>>> f.save_tf_model('path/to/model.h5')
>>> del f.tf_model # deletes the attribute to save memory
>>> f.load_tf_model('path/to/model.h5')
manual_forecast(call_me=None, dynamic_testing=True, test_again=True, bank_history=True, **kwargs)

Manually forecasts with the hyperparameters, Xvars, and normalizer selection passed as keywords. See https://scalecast.readthedocs.io/en/latest/Forecaster/_forecast.html.

Parameters:
  • call_me (str) – Optional. What to call the model when storing it in the object’s history. If not specified, the model’s nickname will be assigned the estimator value (‘mlp’ will be ‘mlp’, etc.). Duplicated names will be overwritten with the most recently called model.

  • dynamic_testing (bool or int) – Default True. Whether to dynamically/recursively test the forecast (meaning AR terms will be propagated with predicted values). If True, evaluates dynamically over the entire out-of-sample slice of data. If int, evaluates over windows of that many steps (2 for 2-step dynamic forecasting, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out. The model will skip testing if the test_length attribute is set to 0.

  • test_again (bool) – Default True. Whether to test the model before forecasting to a future horizon. If test_length is 0, this is ignored. Set this to False if you tested the model manually by calling f.test() and don’t want to waste resources testing it again.

  • **kwargs – Passed to the _forecast_{estimator}() method and can include such parameters as Xvars, normalizer, cap, and floor, in addition to any given model’s specific hyperparameters. See https://scalecast.readthedocs.io/en/latest/Forecaster/_forecast.html.

>>> f.set_estimator('lasso')
>>> f.manual_forecast(alpha=.5)
normality_test(train_only=False)

Runs D’Agostino and Pearson’s test for normality ported from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html. Holds the null hypothesis that the series is normally distributed.

Parameters:

train_only (bool) – Default False. If True, will exclude the test set from the test (to avoid leakage).

Returns:

The derived statistic and pvalue.

Return type:

(float, float)

order_fcsts(models='all', determine_best_by='TestSetRMSE')

Gets estimated forecasts ordered from best-to-worst.

Parameters:
  • models (str or list-like) – Default ‘all’. If not ‘all’, each element must match an evaluated model’s nickname. ‘all’ will only consider models that have a non-null determine_best_by value in history.

  • determine_best_by (str) – Default ‘TestSetRMSE’. One of Forecaster.determine_best_by.

Returns:

The ordered models.

Return type:

(list)

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
>>> ordered_models = f.order_fcsts(models,"TestSetRMSE")
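
Ordering from best to worst depends on the metric's direction: error metrics are sorted ascending, while a fit metric such as R2 is sorted descending. The sketch below assumes that rule and skips models with a null value, as described for models='all'. The order_models helper, the 'InSampleR2' key, and all metric values are invented for illustration.

```python
def order_models(history, determine_best_by='TestSetRMSE'):
    """Stand-in: model names ordered best-to-worst on one metric."""
    # Assumption: metrics ending in 'R2' are better when higher.
    higher_is_better = determine_best_by.endswith('R2')
    scored = {
        name: metrics[determine_best_by]
        for name, metrics in history.items()
        if metrics.get(determine_best_by) is not None  # skip null values
    }
    return sorted(scored, key=scored.get, reverse=higher_is_better)

# Invented metric values keyed by model nickname.
history = {
    'mlr': {'TestSetRMSE': 10.2, 'InSampleR2': 0.88},
    'mlp': {'TestSetRMSE': 12.5, 'InSampleR2': 0.74},
    'lightgbm': {'TestSetRMSE': 9.7, 'InSampleR2': 0.85},
    'untested': {'TestSetRMSE': None, 'InSampleR2': 0.60},
}
by_rmse = order_models(history)
by_r2 = order_models(history, 'InSampleR2')
```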
plot(models='all', exclude=[], order_by=None, ci=False, ax=None, figsize=(12, 6))

Plots all forecasts with the actuals, or just actuals if no forecasts have been evaluated or are selected.

Parameters:
  • models (list-like, str, or None) – Default ‘all’. The forecasted models to plot. Can start with “top_” and the metric specified in order_by will be used to order the models appropriately. If None or models/order_by combo invalid, will plot only actual values.

  • exclude (collection) – Default []. Pass any models here that you don’t want displayed. Good to use in conjunction with models = ‘top_{n}’.

  • order_by (str) – Optional. One of Forecaster.determine_best_by. How to order the display of forecasts on the plots (from best-to-worst according to the selected metric).

  • ci (bool) – Default False. Whether to display the confidence intervals.

  • ax (Axis) – Optional. The existing axis to write the resulting figure to.

  • figsize (tuple) – Default (12,6). The size of the resulting figure. Ignored when ax is not None.

Returns:

The figure’s axis.

Return type:

(Axis)

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
>>> f.plot(order_by='TestSetRMSE') # plots all forecasts
>>> plt.show()
plot_acf(diffy=False, train_only=False, **kwargs)

Plots an autocorrelation function of the y values.

Parameters:
  • diffy (bool or int) – One of {True,False,0,1}. Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • train_only (bool) – Default False. If True, will exclude the test set from the test (a measure added to avoid leakage).

  • **kwargs – Passed to plot_acf() function from statsmodels.

Returns:

If ax is None, the created figure. Otherwise the figure to which ax is connected.

Return type:

(Figure)

>>> import matplotlib.pyplot as plt
>>> f.plot_acf(train_only=True)
>>> plt.show()
plot_fitted(models='all', exclude=[], order_by=None, ax=None, figsize=(12, 6))

Plots all fitted values with the actuals. Does not support level fitted values (for now).

Parameters:
  • models (list-like or str) – Default ‘all’. The forecasted models to plot. Can start with “top_” and the metric specified in order_by will be used to order the models appropriately.

  • exclude (collection) – Default []. Pass any models here that you don’t want displayed. Good to use in conjunction with models = ‘top_{n}’.

  • order_by (str) – Optional. One of Forecaster.determine_best_by. How to order the display of forecasts on the plots (from best-to-worst according to the selected metric).

  • ax (Axis) – Optional. The existing axis to write the resulting figure to.

  • figsize (tuple) – Default (12,6). Size of the resulting figure. Ignored when ax is not None.

Returns:

The figure’s axis.

Return type:

(Axis)

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
>>> f.plot_fitted(order_by='TestSetRMSE') # plots all fitted values
>>> plt.show()
plot_pacf(diffy=False, train_only=False, **kwargs)

Plots a partial autocorrelation function of the y values.

Parameters:
  • diffy (bool or int) – One of {True,False,0,1}. Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • train_only (bool) – Default False. If True, will exclude the test set from the test (a measure added to avoid leakage).

  • **kwargs – Passed to plot_pacf() function from statsmodels.

Returns:

If ax is None, the created figure. Otherwise the figure to which ax is connected.

Return type:

(Figure)

>>> import matplotlib.pyplot as plt
>>> f.plot_pacf(train_only=True)
>>> plt.show()
plot_periodogram(diffy=False, train_only=False)

Plots a periodogram of the y values (comes from scipy.signal).

Parameters:
  • diffy (bool or int) – One of {True,False,0,1}. Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • train_only (bool) – Default False. If True, will exclude the test set from the test (a measure added to avoid leakage).

Returns:

Element 1: Array of sample frequencies. Element 2: Power spectral density or power spectrum of x.

Return type:

(ndarray,ndarray)

>>> import matplotlib.pyplot as plt
>>> a, b = f.plot_periodogram(diffy=True,train_only=True)
>>> plt.semilogy(a, b)
>>> plt.show()
plot_test_set(models='all', exclude=[], order_by=None, include_train=True, ci=False, ax=None, figsize=(12, 6))

Plots all test-set predictions with the actuals.

Parameters:
  • models (list-like or str) – Default ‘all’. The forecasted models to plot. Can start with “top_” and the metric specified in order_by will be used to order the models appropriately.

  • exclude (collection) – Default []. Pass any models here that you don’t want displayed. Good to use in conjunction with models = ‘top_{n}’.

  • order_by (str) – Optional. One of Forecaster.determine_best_by. How to order the display of forecasts on the plots (from best-to-worst according to the selected metric).

  • include_train (bool or int) – Default True. Use to zoom into testing results. If True, plots the test results with the entire history in y. If False, matches y history to test results and only plots this. If int, plots that length of y to match to test results.

  • ci (bool) – Default False. Whether to display the confidence intervals. Default is 100 bootstrapped samples and a 95% confidence interval.

  • ax (Axis) – Optional. The existing axis to write the resulting figure to.

  • figsize (tuple) – Default (12,6). Size of the resulting figure. Ignored when ax is not None.

Returns:

The figure’s axis.

Return type:

(Axis)

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
>>> f.plot_test_set(order_by='TestSetRMSE') # plots all test-set results
>>> plt.show()
pop(*args)

Deletes evaluated forecasts from the object’s memory.

Parameters:

*args (str) – Names of models matching what was passed to call_me when model was evaluated.

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
>>> f.pop('mlr')
reduce_Xvars(method='PermutationExplainer', estimator='lasso', keep_at_least=1, keep_this_many='auto', grid_search=True, use_loaded_grid=False, dynamic_tuning=False, monitor='ValidationMetricValue', overwrite=True, cross_validate=False, masker=None, cvkwargs={}, **kwargs)

Reduces the regressor variables stored in the object. Any feature importance type available with f.save_feature_importance() can be used to rank features in this process. Features are reduced one-at-a-time, according to which one ranked the lowest. After each variable reduction, the model is re-run and feature importance re-evaluated. By default, the validation-set error is used to avoid leakage and the variable set that most reduced the error is selected. The following attributes: pfi_dropped_vars and pfi_error_values, which are lists representing the error change with the corresponding dropped variable, are created and stored in the Forecaster object. See the example: https://scalecast-examples.readthedocs.io/en/latest/misc/feature-selection/feature_selection.html.

Parameters:
  • method (str) – One of try_order defaults in Forecaster.save_feature_importance(). Default ‘PermutationExplainer’. The method for scoring features. Method ‘shap’ creates attributes in the object called pfi_dropped_vars and pfi_error_values that are lists representing the error change with the corresponding dropped variable. The pfi_error_values attr is one greater in length than pfi_dropped_vars attr because the first error is the initial error before any variables were dropped.

  • estimator (str) – One of Forecaster.sklearn_estimators. Default ‘lasso’. The estimator to use to determine the best set of variables.

  • keep_at_least (str or int) – Default 1. The fewest number of Xvars to keep. ‘sqrt’ keeps at least the square root of the number of Xvars rounded down. This exists so that the keep_this_many keyword can use ‘auto’ as an argument.

  • keep_this_many (str or int) – Default ‘auto’. The number of Xvars to keep if method == ‘pfi’ or ‘shap’. “auto” keeps the number of xvars that returned the best error using the metric passed to monitor, but it is the most computationally expensive. “sqrt” keeps the square root of the total number of observations rounded down.

  • grid_search (bool) – Default True. Whether to run a grid search for optimal hyperparams on the validation set. If use_loaded_grid is False, uses a grids file currently available in the working directory or creates a new grids file called Grids.py with default values if none available to determine the grid to use. The grid search is only run once and then those hyperparameters are used for all subsequent pfi runs when method == ‘pfi’. In any utilized grid, do not include ‘Xvars’ as a key. If you want to access the chosen hyperparams after the fact, they are stored in the reduction_hyperparams attribute.

  • use_loaded_grid (bool) – Default False. Whether to use the currently loaded grid in the object instead of using a grid from a file. In any utilized grid, do not include ‘Xvars’ as a key.

  • dynamic_tuning (bool or int) – Default False. Whether to dynamically tune the model or, if int, how many forecast steps to dynamically tune it.

  • monitor (str) – One of Forecaster.determine_best_by. Default ‘ValidationMetricValue’. The metric to be monitored when making reduction decisions.

  • overwrite (bool) – Default True. If False, the list of selected Xvars are stored in an attribute called reduced_Xvars. If True, this list of regressors overwrites the current Xvars in the object.

  • cross_validate (bool) – Default False. Whether to tune the model with cross validation. If False, uses the validation slice of data to tune. If not monitoring ValidationMetricValue, you will want to leave this False.

  • masker (shap.maskers) – Optional. Pass your own masker to this function if desired. Default will use shap.maskers.Independent with default arguments.

  • cvkwargs (dict) – Default {}. Passed to the cross_validate() method.

  • **kwargs – Passed to the manual_forecast() method and can include arguments related to a given model’s hyperparameters or dynamic_testing. Do not pass hyperparameters if grid_search is True. Do not pass Xvars.

Returns:

None

>>> f.add_ar_terms(24)
>>> f.add_seasonal_regressors('month',raw=False,sincos=True,dummy=True)
>>> f.add_seasonal_regressors('year')
>>> f.add_time_trend()
>>> f.set_validation_length(12)
>>> f.reduce_Xvars(overwrite=False) # reduce with lasso (but don't overwrite Xvars)
>>> print(f.reduced_Xvars) # view results
>>> f.reduce_Xvars(
>>>     method='TreeExplainer',
>>>     estimator='gbt',
>>>     keep_at_least=10,
>>>     keep_this_many='auto',
>>>     dynamic_testing=False,
>>>     dynamic_tuning=True,
>>>     cross_validate=True,
>>>     cvkwargs={'rolling':True},
>>> ) # reduce with gradient boosted tree estimator and overwrite with result
>>> print(f.reduced_Xvars) # view results
restore_series_length()

Restores the original y values and current dates in the object from before keep_smaller_history() or determine_best_series_length() were called. If those methods were never called, this function does nothing. Restoring a series’ length automatically drops all stored regressors in the object.

>>> # write a pipeline
round(decimals=0)

Rounds the values saved to Forecaster.y.

Parameters:

decimals (int) – The number of digits to round off to. Passed to np.round(decimals).

Returns:

A copy of the object.

Return type:

(Forecaster)

save_feature_importance(method='shap', on_error='warn', try_order=['PermutationExplainer', 'TreeExplainer', 'LinearExplainer', 'KernelExplainer', 'SamplingExplainer'], masker=None, verbose=False)

Saves feature info for models that offer it (sklearn models). Call after evaluating the model you want it for and before changing the estimator. This method saves a dataframe listing the feature as the index and its score. This dataframe can be recalled using the export_feature_importance() method. The importance scores are determined as the average shap score applied to each feature in each observation.

Parameters:
  • method (str) – Default ‘shap’. As of scalecast 0.19.4, shap is the only method available, as pfi is deprecated.

  • on_error (str) – One of {‘warn’,’raise’,’ignore’}. Default ‘warn’. If the last model called doesn’t support feature importance, ‘warn’ will log a warning. ‘raise’ will raise an error.

  • try_order (list) – The order of explainers to try. If one fails, will try setting with the next one. This should be able to set feature importance on any sklearn model. What each Explainer does can be found in the shap documentation: https://shap-lrjball.readthedocs.io/en/latest/index.html

  • masker (shap.maskers) – Optional. Pass your own masker if desired and you are using the PermutationExplainer or LinearExplainer. Default will use shap.maskers.Independent masker with default arguments.

  • verbose (bool) – Default False. Whether to print out information about which explainers were tried/chosen. The chosen explainer is saved in Forecaster.history[estimator][‘feature_importance_explainer’].

>>> f.set_estimator('xgboost')
>>> f.manual_forecast()
>>> f.save_feature_importance()
>>> fi = f.export_feature_importance('xgboost') # returns a dataframe
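
The fallback behavior described for try_order and on_error can be pictured with a small sketch. This is not scalecast's implementation: the two explainer functions and the first_working_explainer helper are hypothetical stand-ins, and only the control flow (try each explainer in order, then warn, raise, or ignore if all of them fail) mirrors the description above.

```python
import warnings

def tree_explainer(model):
    # Hypothetical stand-in: pretend this explainer only supports tree models.
    if model != "tree":
        raise TypeError("model is not tree-based")
    return {"lag1": 0.7, "lag2": 0.3}

def kernel_explainer(model):
    # Hypothetical stand-in: a model-agnostic explainer that always works.
    return {"lag1": 0.6, "lag2": 0.4}

EXPLAINERS = {"TreeExplainer": tree_explainer, "KernelExplainer": kernel_explainer}

def first_working_explainer(model, try_order, on_error="warn"):
    """Return (explainer_name, scores) from the first explainer that succeeds."""
    for name in try_order:
        try:
            return name, EXPLAINERS[name](model)
        except Exception:
            continue  # this explainer failed, so move on to the next one
    message = f"no explainer in {try_order} worked for this model"
    if on_error == "raise":
        raise ValueError(message)
    if on_error == "warn":
        warnings.warn(message)
    return None, None  # reached for 'warn' and 'ignore'

chosen, scores = first_working_explainer("linear", ["TreeExplainer", "KernelExplainer"])
```

Here the tree-only stand-in rejects the model, so the second entry in the list is the one that ends up being chosen.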
save_summary_stats()

Saves summary stats for models that offer it and will not raise errors if not available. Call after evaluating the model you want it for and before changing the estimator.

>>> f.set_estimator('arima')
>>> f.manual_forecast(order=(1,1,1))
>>> f.save_summary_stats()
save_tf_model(name='model.h5')

Saves a fitted tensorflow (RNN/LSTM) model as a file. Call this after fitting a tensorflow model and before changing the estimator.

Parameters:

name (str) – Default ‘model.h5’. The name of the resulting file. A file directory with a file name is also accepted here.

>>> f.set_estimator('rnn')
>>> f.manual_forecast()
>>> f.save_tf_model('path/to/model.h5')
seasonal_decompose(diffy=False, train_only=False, **kwargs)

Returns a signal/seasonal decomposition of the y values.

Parameters:
  • diffy (bool) – Default False. Whether to difference the data before passing the values to the function. If False or 0, does not difference. If True or 1, differences 1 time.

  • train_only (bool) – Default False. If True, will exclude the test set from the decomposition (a measure added to avoid leakage).

  • **kwargs – Passed to seasonal_decompose() function from statsmodels. See https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.seasonal_decompose.html.

Returns:

An object with seasonal, trend, and resid attributes.

Return type:

(DecomposeResult)

>>> import matplotlib.pyplot as plt
>>> f.seasonal_decompose(train_only=True).plot()
>>> plt.show()
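
What diffy and train_only do to the values before they are decomposed can be sketched in plain Python. The prepare_for_decomposition helper below is hypothetical and only illustrates the two options; the order in which scalecast applies them internally is an assumption.

```python
def prepare_for_decomposition(y, test_length=0, diffy=False, train_only=False):
    """Hypothetical helper: select and optionally difference the values to decompose."""
    values = list(y)
    if train_only and test_length > 0:
        values = values[:-test_length]  # drop the held-out test observations
    if diffy:
        # first difference: y[t] - y[t-1], which shortens the series by one
        values = [b - a for a, b in zip(values, values[1:])]
    return values

y = [10, 12, 15, 19, 24, 30]
train_diffed = prepare_for_decomposition(y, test_length=2, diffy=True, train_only=True)
```

With a test length of 2, the last two observations never reach the decomposition, which is the leakage protection that train_only describes.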
set_cilevel(n)

Sets the level for the resulting confidence intervals (95% default).

Parameters:

n (float) – Greater than 0 and less than 1.

Returns:

None

>>> f.set_cilevel(.80) # next forecast will get 80% confidence intervals
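
To show how the level changes the resulting interval, here is a minimal sketch of one common way to build a naive conformal interval: widen each forecast by a quantile of the absolute test-set errors. The helper names and the interpolation rule are assumptions made for illustration and may differ from what eval_cis() does.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, with q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def naive_conformal_interval(actuals, test_preds, forecast, cilevel=0.95):
    """Hypothetical helper: widen each forecast by a quantile of absolute test-set errors."""
    abs_errors = sorted(abs(a - p) for a, p in zip(actuals, test_preds))
    half_width = percentile(abs_errors, cilevel)
    lower = [f - half_width for f in forecast]
    upper = [f + half_width for f in forecast]
    return lower, upper

actuals = [10, 12, 11, 13, 12]
test_preds = [9, 12, 13, 12, 12]
lower80, upper80 = naive_conformal_interval(actuals, test_preds, [14, 15], cilevel=0.80)
lower95, upper95 = naive_conformal_interval(actuals, test_preds, [14, 15], cilevel=0.95)
```

A higher level picks a larger error quantile, so the 95% interval is wider than the 80% one for the same forecast.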
set_estimator(estimator)

Sets the estimator to forecast with.

Parameters:

estimator (str) – One of Forecaster.estimators.

Returns:

None

>>> f.set_estimator('lasso')
>>> f.manual_forecast(alpha = .5)
set_grids_file(name='Grids')

Sets the name of the file where the object will automatically look for grids when calling tune(), cross_validate(), tune_test_forecast(), or a similar function. If the grids file does not exist in the working directory, the error will only be raised once tuning is called.

Parameters:

name (str) – Default ‘Grids’. The name of the file to look for. This file must exist in the working directory. The default will look for a file called “Grids.py”.

>>> f.set_grids_file('ModGrids') # expects to find a file called ModGrids.py in working directory.
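
A grids file is an ordinary Python module in which each grid is a dictionary named after its estimator. The sketch below writes such a file to a temporary directory and reads one grid back. The load_grid helper, the estimator names, and the grid values are illustrative assumptions, not scalecast internals.

```python
import importlib
import os
import sys
import tempfile

# Write a small grids module; the estimator names and values are illustrative only.
grids_source = (
    "lasso = {'alpha': [0.1, 0.5, 1.0]}\n"
    "ridge = {'alpha': [0.1, 1.0, 10.0]}\n"
)
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "ModGrids.py"), "w") as fh:
    fh.write(grids_source)

def load_grid(grids_file, estimator, directory):
    """Hypothetical helper: import the grids module and return the estimator's grid."""
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()  # make sure the freshly written file is visible
        module = importlib.import_module(grids_file)
    finally:
        sys.path.remove(directory)
    return getattr(module, estimator)

lasso_grid = load_grid("ModGrids", "lasso", workdir)
```

Note that the name is passed without the .py extension, matching the set_grids_file('ModGrids') call above.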
set_last_future_date(date)

Generates future dates in the same frequency as current_dates that end on a specified date.

Parameters:

date (datetime.datetime, pd.Timestamp, or str) – The date to end on. Must be parsable by pandas’ Timestamp() function.

Returns:

None

>>> f.set_last_future_date('2021-06-01') # creates future dates up to this one in the expected frequency
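
The idea of extending the observed dates at the same frequency until a given end date can be sketched with the standard library. The future_dates_until helper is hypothetical, and it assumes a fixed spacing (daily or weekly, for example); calendar frequencies such as month-start need pandas-style offsets instead.

```python
from datetime import date

def future_dates_until(current_dates, last_date):
    """Hypothetical helper: extend evenly spaced dates until last_date (inclusive)."""
    step = current_dates[-1] - current_dates[-2]  # infer the spacing from the last two dates
    future, nxt = [], current_dates[-1] + step
    while nxt <= last_date:
        future.append(nxt)
        nxt += step
    return future

current = [date(2021, 5, 10), date(2021, 5, 17), date(2021, 5, 24)]
future = future_dates_until(current, date(2021, 6, 14))
```

The weekly spacing is inferred from the observed dates, so the generated dates are 2021-05-31, 2021-06-07, and 2021-06-14.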
set_metrics(metrics)

Set or change the evaluated metrics for all model testing and validation.

Parameters:

metrics (list) – The metrics to evaluate when validating and testing models. Each element must exist in utils.metrics and take only two arguments: a and f. See https://scalecast.readthedocs.io/en/latest/Forecaster/Util.html#metrics. For each metric and model that is tested, the test-set and in-sample metrics will be evaluated and can be exported. Level test-set and in-sample metrics are also currently available, but will be removed in a future version.
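
The two-argument shape that every metric must follow (a for actuals, f for forecasts) looks like this in plain Python. The functions below are simplified stand-ins written for illustration, not the implementations in utils.metrics, and the evaluate helper is hypothetical.

```python
from math import sqrt

def rmse(a, f):
    """Root mean squared error between actuals a and forecasts f."""
    return sqrt(sum((ai - fi) ** 2 for ai, fi in zip(a, f)) / len(a))

def mae(a, f):
    """Mean absolute error between actuals a and forecasts f."""
    return sum(abs(ai - fi) for ai, fi in zip(a, f)) / len(a)

def evaluate(metrics, a, f):
    """Hypothetical helper: score one set of predictions with every metric, keyed by name."""
    return {metric.__name__: metric(a, f) for metric in metrics}

scores = evaluate([rmse, mae], a=[1, 2, 3, 4], f=[1, 2, 3, 8])
```

A single miss of 4 on the last observation produces an RMSE of 2.0 and an MAE of 1.0, which shows how the squared metric weights large errors more heavily.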

set_test_length(n=1)

Sets the length of the test set. As of version 0.16.0, 0-length test sets are supported.

Parameters:

n (int or float) – Default 1. The length of the resulting test set. Pass 0 to skip testing models. Fractional splits are supported by passing a float less than 1 and greater than 0.

Returns:

None

>>> f.set_test_length(12) # test set of 12
>>> f.set_test_length(.2) # 20% test split
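
How an int or a fractional n maps to a number of held-out observations can be sketched as follows. The resolve_test_length helper is hypothetical, and truncating the fractional result is an assumption about the rounding rule.

```python
def resolve_test_length(n, n_obs):
    """Hypothetical helper: turn an int or fractional n into a number of observations."""
    if isinstance(n, float) and 0 < n < 1:
        return int(n_obs * n)  # assumption: fractional results are truncated
    if isinstance(n, int) and n >= 0:
        return n
    raise ValueError(f"n must be a non-negative int or a float between 0 and 1, got {n!r}")

y = list(range(100))
test_length = resolve_test_length(0.2, len(y))
# The test set is always the most recent slice, so temporal order is preserved.
train, test = y[:-test_length], y[-test_length:]
```

With 100 observations, a 0.2 split holds out the last 20 values and trains on the first 80.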
set_validation_length(n=1)

Sets the length of the validation set. This will never matter for models that are not tuned.

Parameters:

n (int) – Default 1. The length of the resulting validation set.

Returns:

None

>>> f.set_validation_length(6) # validation length of 6
set_validation_metric(metric)

Sets the metric that will be used to tune all subsequent models.

Parameters:

metric – One of Forecaster.metrics. The metric to optimize the models with using the validation set. Although model testing will evaluate all metrics in Forecaster.metrics, model optimization with tuning and cross validation only uses one of these.

Returns:

None

>>> f.set_validation_metric('mae')
test(dynamic_testing=True, call_me=None, **kwargs)

Tests the forecast estimator out-of-sample. Uses the test_length attribute to determine how many observations to test on. All test-set splits maintain temporal order.

Parameters:
  • dynamic_testing (bool or int) – Default True. Whether to dynamically/recursively test the forecast (meaning AR terms will be propagated with predicted values). If True, evaluates dynamically over the entire out-of-sample slice of data. If int, window evaluates over that many steps (2 for 2-step dynamic forecasting, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out. This method will fail if the test_length attribute is 0.

  • call_me (str) – Optional. What to call the model when storing it in the object’s history. If not specified, the model’s nickname will be assigned the estimator value (‘mlp’ will be ‘mlp’, etc.). Duplicated names will be overwritten with the most recently called model.

  • **kwargs – passed to the _forecast_{estimator}() method and can include such parameters as Xvars, normalizer, cap, and floor, in addition to any given model’s specific hyperparameters. See https://scalecast.readthedocs.io/en/latest/Forecaster/_forecast.html.

>>> f.set_estimator('lasso')
>>> f.test(alpha=.5)
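
The difference between the three dynamic_testing settings is easiest to see on a toy autoregressive model. The sketch below uses a hypothetical AR(1) tester in which each prediction is phi times the previous value. Re-anchoring on actual values at the start of every window is one plausible reading of the int option, not a statement of how scalecast implements it.

```python
def ar1_test_predictions(history, actuals, phi, dynamic_testing=True):
    """Hypothetical AR(1) tester: predict each test step as phi * (previous value).

    dynamic_testing=True feeds predictions back in for the whole test set;
    an int k re-anchors on the actual value every k steps; False or 1 always
    uses the actual previous value (one-step-ahead testing).
    """
    if dynamic_testing is True:
        window = len(actuals)
    elif dynamic_testing is False:
        window = 1
    else:
        window = int(dynamic_testing)
    preds = []
    for step in range(len(actuals)):
        if step % window == 0:
            # start of a window: the lag is an observed value
            prev = history[-1] if step == 0 else actuals[step - 1]
        else:
            # inside a window: the lag is the model's own previous prediction
            prev = preds[-1]
        preds.append(phi * prev)
    return preds

history, actuals = [100.0], [60.0, 30.0, 20.0, 10.0]
one_step = ar1_test_predictions(history, actuals, phi=0.5, dynamic_testing=False)
two_step = ar1_test_predictions(history, actuals, phi=0.5, dynamic_testing=2)
full = ar1_test_predictions(history, actuals, phi=0.5, dynamic_testing=True)
```

The one-step predictions stay close to the actuals because every lag is observed, while the fully dynamic predictions drift as errors compound. That drift is why dynamic testing is the better guide to multi-period performance.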
transfer_cis(transfer_from, model, transfer_to_model=None, transfer_test_set_cis='infer')

Transfers the confidence intervals from a model forecast in a passed Forecaster or MVForecaster object.

Parameters:
  • transfer_from (Forecaster or MVForecaster) – The object that contains the model from which intervals should be transferred.

  • model (str) – The model nickname of the already-evaluated model stored in transfer_from.

  • transfer_to_model (str) – Optional. The nickname of the model to which the intervals should be transferred. If not specified, inherits the name passed to model.

  • transfer_test_set_cis (bool or str) – Default ‘infer’. Whether to pass intervals for test-set predictions. If ‘infer’, the decision is made based on whether the inheriting Forecaster object has test-set predictions evaluated.

Returns:

None

>>> f.manual_forecast(call_me='mlr')
>>> f_new.transfer_predict(transfer_from=f,model='mlr')
>>> f_new.transfer_cis(transfer_from=f,model='mlr')
transfer_predict(transfer_from, model, model_type='sklearn', return_series=False, dates=[], save_to_history=True, call_me=None, regr=None)

Makes predictions using an already-trained model over any given forecast horizon. Uses the already-trained model from a passed Forecaster object to create a new model in the Forecaster object from which the method is called. Alternatively, the predictions can be returned as a pandas Series without saving a new model. Confidence intervals cannot be transferred with this method, but they can be with the transfer_cis() method.

Parameters:
  • transfer_from (Forecaster) – The Forecaster object that contains the already-fitted model.

  • model (str) – The model nickname of the already-evaluated model stored in the Forecaster object passed to transfer_from.

  • model_type (str) – Default ‘sklearn’. The type of model that needs to be predicted. Right now, only ‘sklearn’ and ‘tf’ are supported but others will be added.

  • return_series (bool) – Default False. Whether to return a pandas Series with the date as an index of the values. If the dates argument is not specified, this will include all dates in the Forecaster instance that the method is called from.

  • dates (collection) – Optional. The dates to limit the predictions for. Ignored if return_series is False. If the passed dates are not in the same frequency as the dates stored in the Forecaster object, an IndexError is raised.

  • save_to_history (bool) – Default True. Whether to save the transferred predictions as if they were a model being run using a _forecast() method.

  • call_me (str) – Optional. What to call the resulting model. If save_to_history is False, this is ignored. If not specified, inherits the name passed to model.

  • regr – Optional. The model to make predictions with. If not supplied, the model will be searched for in the Forecaster passed to transfer_from.

Returns:

The date-indexed series if return_series is True.

Return type:

(Pandas Series or None)

>>> f.manual_forecast(call_me='mlr')
>>> f_new.transfer_predict(transfer_from=f,model='mlr')
tune(dynamic_tuning=False, set_aside_test_set=True)

Tunes the specified estimator using an ingested grid (ingests a grid from Grids.py with same name as the estimator by default). This is akin to cross-validation with one fold and a test_length equal to f.validation_length. Any parameters that can be passed as arguments to manual_forecast() can be tuned with this process. The chosen parameters are stored in the best_params attribute. The evaluated validation grid can be exported to a dataframe using f.export_validation_grid().

Parameters:
  • dynamic_tuning (bool or int) – Default False. Whether to dynamically/recursively test the forecast during the tuning process (meaning AR terms will be propagated with predicted values). If True, evaluates recursively over the entire out-of-sample slice of data. If int, window evaluates over that many steps (2 for 2-step recursive testing, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out.

  • set_aside_test_set (bool) – Default True. Whether to separate the test set specified in f.test_length during this process.

Returns:

None

>>> f.set_estimator('xgboost')
>>> f.tune()
>>> f.auto_forecast()
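
The one-fold tuning process described above (set the test set aside, hold out the last validation_length observations, score every grid combination on them, keep the best) can be sketched in plain Python. Everything here is a hypothetical illustration: tune_on_validation_slice, expand_grid, and the toy drift_forecast estimator are not part of scalecast.

```python
from itertools import product

def expand_grid(grid):
    """Turn {'param': [values, ...]} into a list of {'param': value} combinations."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

def tune_on_validation_slice(y, grid, fit_predict, metric, validation_length, test_length=0):
    """Hypothetical one-fold tuner: score every combination on the validation slice."""
    usable = y[:-test_length] if test_length else y  # set the test set aside first
    train, valid = usable[:-validation_length], usable[-validation_length:]
    results = [(metric(valid, fit_predict(train, len(valid), **params)), params)
               for params in expand_grid(grid)]
    return min(results, key=lambda r: r[0])[1]  # lowest error wins

def drift_forecast(train, horizon, drift):
    # Toy estimator: extend the last training value by a fixed drift per step.
    return [train[-1] + drift * (h + 1) for h in range(horizon)]

def mae(a, f):
    return sum(abs(ai - fi) for ai, fi in zip(a, f)) / len(a)

y = [1, 3, 5, 7, 9, 11, 13, 15]
best_params = tune_on_validation_slice(
    y, {"drift": [1, 2, 3]}, drift_forecast, mae, validation_length=2, test_length=2
)
```

Because the series rises by 2 each period, a drift of 2 reproduces the validation slice exactly and is selected. The last two observations are never seen during tuning.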
tune_test_forecast(models, cross_validate=False, dynamic_tuning=False, dynamic_testing=True, summary_stats=False, feature_importance=False, fi_try_order=None, limit_grid_size=None, min_grid_size=1, suffix=None, error='raise', **cvkwargs)

Iterates through a list of models, tunes them using grids in a grids file, forecasts them, and can save feature information.

Parameters:
  • models (list-like) – Each element must be in Forecaster.can_be_tuned.

  • cross_validate (bool) – Default False. Whether to tune the model with cross validation. If False, uses the validation slice of data to tune.

  • dynamic_tuning (bool or int) – Default False. Whether to dynamically tune the forecast (meaning AR terms will be propagated with predicted values). If True, evaluates dynamically over the entire out-of-sample slice of data. If int, window evaluates over that many steps (2 for 2-step dynamic forecasting, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out.

  • dynamic_testing (bool or int) – Default True. Whether to dynamically test the forecast (meaning AR terms will be propagated with predicted values). If True, evaluates dynamically over the entire out-of-sample slice of data. If int, window evaluates over that many steps (2 for 2-step dynamic forecasting, 12 for 12-step, etc.). Setting this to False or 1 means faster performance, but gives a weaker indication of how well the forecast will perform more than one period out.

  • summary_stats (bool) – Default False. Whether to save summary stats for the models that offer those.

  • feature_importance (bool) – Default False. Whether to save feature importance information for the models that offer it.

  • fi_try_order (list) – Optional. If the feature_importance argument is True, which feature importance methods to try. If using a combination of tree-based and linear models, for example, it might be good to pass [‘TreeExplainer’,’LinearExplainer’]. The default will use whatever is specified by default in Forecaster.save_feature_importance(), which usually ends up being the PermutationExplainer.

  • limit_grid_size (int or float) – Optional. Pass an argument here to limit each of the grids being read. See https://scalecast.readthedocs.io/en/latest/Forecaster/Forecaster.html#src.scalecast.Forecaster.Forecaster.limit_grid_size.

  • min_grid_size (int) – Default 1. The smallest grid size to keep. Ignored if limit_grid_size is None.

  • suffix (str) – Optional. A suffix to add to each model as it is evaluated to differentiate them when called later. If unspecified, each model can be called by its estimator name.

  • error (str) – One of ‘ignore’,’raise’,’warn’; default ‘raise’. What to do with the error if a given model fails. ‘warn’ prints a warning that the model could not be evaluated.

  • **cvkwargs – Passed to the cross_validate() method.

Returns:

None

>>> models = ('mlr','mlp','lightgbm')
>>> f.tune_test_forecast(models,dynamic_testing=False,feature_importance=True)
validate_regressor_names()

Validates that all regressor names exist in both current_xregs and future_xregs. Raises an error if this is not the case.
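
The check amounts to comparing two sets of names. In the sketch below, current_xregs and future_xregs are assumed to be dictionaries keyed by regressor name, and both the check_regressor_names helper and the exception type are hypothetical.

```python
def check_regressor_names(current_xregs, future_xregs):
    """Hypothetical check: both mappings must hold exactly the same regressor names."""
    mismatched = set(current_xregs) ^ set(future_xregs)  # names present in only one mapping
    if mismatched:
        raise ValueError(f"regressors missing from one side: {sorted(mismatched)}")

current_xregs = {"t": [1, 2, 3], "AR1": [None, 10, 12]}
future_xregs = {"t": [4, 5], "AR1": [15, None]}
check_regressor_names(current_xregs, future_xregs)  # passes silently when the names match
```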