util

Miscellaneous utility functions native to scalecast.

metrics

class src.scalecast.util.metrics

Methods:

abias(a, f)

Returns the total bias over a given forecast horizon in terms of absolute values.

bias(a, f)

Returns the total bias over a given forecast horizon.

mae(a, f)

Mean absolute error (MAE).

mape(a, f)

Mean absolute percentage error (MAPE).

mase(a, f, obs, m)

Mean absolute scaled error (MASE).

mse(a, f)

Mean squared error (MSE).

msis(a, uf, lf, obs, m[, alpha])

Mean scaled interval score (MSIS) for evaluating confidence intervals.

r2(a, f)

R-squared (R2).

rmse(a, f)

Root mean squared error (RMSE).

smape(a, f)

Symmetric mean absolute percentage error (sMAPE).

static abias(a, f)

Returns the total bias over a given forecast horizon in terms of absolute values. Divide by the length of the forecast horizon to get average bias. This is a good metric to minimize when testing/tuning models.

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived bias.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.abias(a,f) # returns 1
static bias(a, f)

Returns the total bias over a given forecast horizon. A value larger than 0 means that the predicted points, in aggregate, are higher than the actuals. Divide by the length of the forecast horizon to get average bias.

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived bias.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.bias(a,f) # returns 1
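
As a rough illustration of the two bias metrics, the sketch below recomputes the example in plain Python. It assumes bias is the sum of (prediction - actual) over the horizon, which matches the sign convention described above, and that abias is the absolute value of that sum. The helper names are hypothetical and are not part of scalecast.

```python
def total_bias(a, f):
    # Sum of (prediction - actual); positive when predictions run high overall.
    return sum(fi - ai for ai, fi in zip(a, f))


def total_abs_bias(a, f):
    # Assumption: the absolute value of the total bias.
    return abs(total_bias(a, f))


a = [1, 2, 3, 4, 5]
f = [1, 2, 3, 4, 6]

print(total_bias(a, f))           # 1
print(total_abs_bias(a, f))       # 1
print(total_bias(a, f) / len(a))  # 0.2, the average bias per step
```
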
static mae(a, f)

Mean absolute error (MAE).

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived MAE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.mae(a,f)
static mape(a, f)

Mean absolute percentage error (MAPE).

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived MAPE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.mape(a,f)
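
The mae and mape examples above do not show their results. As a hedged cross-check, the following plain-Python sketch applies the textbook definitions of MAE and MAPE to the same inputs. It assumes MAPE is reported as a fraction rather than multiplied by 100, and the helper names are illustrative only.

```python
def mae_sketch(a, f):
    # Mean of the absolute differences between actuals and predictions.
    return sum(abs(ai - fi) for ai, fi in zip(a, f)) / len(a)


def mape_sketch(a, f):
    # Mean of the absolute errors relative to each actual.
    # Undefined when any actual equals zero.
    return sum(abs((ai - fi) / ai) for ai, fi in zip(a, f)) / len(a)


a = [1, 2, 3, 4, 5]
f = [1, 2, 3, 4, 6]

print(mae_sketch(a, f))             # 0.2
print(round(mape_sketch(a, f), 4))  # 0.04
```
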
static mase(a, f, obs, m)

Mean absolute scaled error (MASE). Uses the same definition as used in the M4 competition. See https://ideas.repec.org/a/eee/intfor/v36y2020i1p54-74.html.

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

  • obs (list-like) – The actual observations used to create the forecast.

  • m (int) – The seasonal period.

Returns:

The derived MASE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> obs = [-5,-4,-3,-2,-1,0]
>>> metrics.mase(a,f,obs,1)
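
To make the roles of obs and m concrete, here is a minimal sketch of the MASE definition commonly attributed to the M4 competition: the out-of-sample MAE divided by the in-sample MAE of a seasonal naive forecast with period m. This is an illustration under that assumption, not scalecast's own code, and mase_sketch is a hypothetical name.

```python
def mase_sketch(a, f, obs, m):
    # Scale: in-sample mean absolute error of the seasonal naive forecast,
    # that is, the mean of |obs[t] - obs[t - m]| over the history.
    scale = sum(abs(obs[t] - obs[t - m]) for t in range(m, len(obs))) / (len(obs) - m)
    # Out-of-sample mean absolute error over the forecast horizon.
    horizon_mae = sum(abs(ai - fi) for ai, fi in zip(a, f)) / len(a)
    return horizon_mae / scale


a = [1, 2, 3, 4, 5]
f = [1, 2, 3, 4, 6]
obs = [-5, -4, -3, -2, -1, 0]

print(mase_sketch(a, f, obs, 1))  # 0.2
```

A value below 1 indicates the forecast beat the seasonal naive benchmark on this scale.
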
static mse(a, f)

Mean squared error (MSE).

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived MSE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.mse(a,f)
static msis(a, uf, lf, obs, m, alpha=0.05)

Mean scaled interval score (MSIS) for evaluating confidence intervals. Uses the same definition as used in the M4 competition. Lower values are better. See https://ideas.repec.org/a/eee/intfor/v36y2020i1p54-74.html.

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • uf (list-like) – The upper-forecast bound according to the confidence interval.

  • lf (list-like) – The lower-forecast bound according to the confidence interval.

  • obs (list-like) – The actual observations used to create the forecast.

  • m (int) – The seasonal period.

  • alpha (float) – Default 0.05. 0.05 for 95% confidence intervals, etc.

Returns:

The derived MSIS.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> uf = [1.5,2.5,3.5,4.5,6.5]
>>> lf = [.5,1.5,2.5,3.5,5.5]
>>> obs = [-5,-4,-3,-2,-1,0]
>>> metrics.msis(a,uf,lf,obs,1) # returns a value of 5.0
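
The documented result of 5.0 can be reproduced by hand. The sketch below follows the MSIS formula as usually stated for the M4 competition: the interval width plus a 2/alpha penalty for every actual that falls outside the interval, averaged over the horizon and divided by the seasonal naive scale. The function name msis_sketch is hypothetical, and this is a plain-Python illustration rather than the library's implementation.

```python
def msis_sketch(a, uf, lf, obs, m, alpha=0.05):
    # Same scaling term as MASE: in-sample seasonal naive absolute error.
    scale = sum(abs(obs[t] - obs[t - m]) for t in range(m, len(obs))) / (len(obs) - m)
    total = 0.0
    for y, upper, lower in zip(a, uf, lf):
        total += upper - lower                    # interval width
        if y < lower:
            total += (2 / alpha) * (lower - y)    # actual fell below the interval
        if y > upper:
            total += (2 / alpha) * (y - upper)    # actual fell above the interval
    return total / len(a) / scale


a = [1, 2, 3, 4, 5]
uf = [1.5, 2.5, 3.5, 4.5, 6.5]
lf = [0.5, 1.5, 2.5, 3.5, 5.5]
obs = [-5, -4, -3, -2, -1, 0]

print(msis_sketch(a, uf, lf, obs, 1))  # 5.0
```

Here every interval has width 1, and the last actual (5) falls 0.5 below its lower bound, adding a penalty of 40 * 0.5 = 20, so the mean is (5 + 20) / 5 = 5.
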
static r2(a, f)

R-squared (R2).

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived R2.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.r2(a,f)
static rmse(a, f)

Root mean squared error (RMSE).

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived RMSE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.rmse(a,f)
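
For reference, the textbook definitions of MSE and RMSE applied to the same inputs give the values below. This is a plain-Python sketch with hypothetical helper names, shown only to indicate what the mse and rmse examples would be expected to return under those definitions.

```python
import math


def mse_sketch(a, f):
    # Mean of the squared differences between actuals and predictions.
    return sum((ai - fi) ** 2 for ai, fi in zip(a, f)) / len(a)


def rmse_sketch(a, f):
    # Square root of the MSE, expressed in the units of the series.
    return math.sqrt(mse_sketch(a, f))


a = [1, 2, 3, 4, 5]
f = [1, 2, 3, 4, 6]

print(mse_sketch(a, f))             # 0.2
print(round(rmse_sketch(a, f), 4))  # 0.4472
```
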
static smape(a, f)

Symmetric mean absolute percentage error (sMAPE). Uses the same definition as used in the M4 competition. Does not multiply by 100. See https://ideas.repec.org/a/eee/intfor/v36y2020i1p54-74.html.

Parameters:
  • a (list-like) – The actuals over the forecast horizon.

  • f (list-like) – The predictions over the forecast horizon.

Returns:

The derived sMAPE.

Return type:

(float)

>>> from scalecast.util import metrics
>>> a = [1,2,3,4,5]
>>> f = [1,2,3,4,6]
>>> metrics.smape(a,f)
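
The following sketch shows the sMAPE form generally used in the M4 competition, without the multiplication by 100 noted above. It is a plain-Python illustration with a hypothetical function name, not the library's source.

```python
def smape_sketch(a, f):
    # Mean of 2 * |a - f| / (|a| + |f|); reported as a fraction, not a percentage.
    return sum(2 * abs(ai - fi) / (abs(ai) + abs(fi)) for ai, fi in zip(a, f)) / len(a)


a = [1, 2, 3, 4, 5]
f = [1, 2, 3, 4, 6]

print(round(smape_sketch(a, f), 4))  # 0.0364
```

Only the last step contributes an error here: 2 * 1 / (5 + 6) = 2/11, which averages to roughly 0.0364 over five steps.
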

Forecaster_with_missing_vals()

src.scalecast.util.Forecaster_with_missing_vals(y, current_dates, desired_frequency=None, fill_strategy=0.0, impute_value_pool=None, m=None, impute_lookback=None, add_noise=False, noise_value_pool=None, noise_std=None, noise_lookback=None, cannot_be_below=None, cannot_be_above=None, first_ob_strategy='drop', random_seed=None, **kwargs)

Imputes missing values in a given time series such that the result has a user-specified date frequency and/or no remaining null values. If you pass no missing values through this function, it will not raise errors.

Parameters:
  • y (collection) – An array of all observed values. Can include NAs for dates in which the values are unknown.

  • current_dates (collection) – An array of all observed dates. Must be same length as y and in the same sequence.

  • desired_frequency (str) – The desired frequency of the resulting Forecaster object. If this is left unspecified and a frequency cannot be inferred, the resulting object will not have a logical frequency. See available values here: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases

  • fill_strategy (float or str) – Default 0.0. If str, must be one of {‘linear_interp’, ‘moving_average’, ‘moving_seasonal_average’, ‘impute_pool’}. If not one of those values, will be passed to the df.fillna() method from pandas (valid values include ‘ffill’ and ‘bfill’). Therefore, the default fills with 0.

  • m (int) – Optional. The number of steps that count one seasonal cycle if using a seasonal fill strategy. If left unspecified, an attempt is made to infer it. If it cannot be inferred, an error is raised.

  • impute_value_pool (collection) – Optional. The pool of values to use when fill_strategy = ‘impute_pool’.

  • impute_lookback (int) – Required when fill_strategy in (‘moving_average’, ‘moving_seasonal_average’). The lookback to use when imputing a moving average to missing values. If using ‘moving_seasonal_average’, make sure to include at least one full seasonal cycle in the lookback. Must be 1 or greater. If there are not enough observations to create a seasonal fill, will use all available observations for a normal moving average and raise a warning.

  • add_noise (bool) – Default False. Whether to add random noise to the imputed values.

  • noise_value_pool (collection) – Optional. The pool of values to randomly choose from when adding noise. The noise will add the imputed value with a random draw from this pool to come up with the final value. Specifying this argument overrides any of the subsequent noise-related arguments (noise_std, noise_lookback, etc.).

  • noise_std (float) – Optional. The standard deviation to use when adding a noise to the values. Assumes a normal distribution where the mean is the value imputed.

  • noise_lookback (int) – Optional. Must be 2 or greater. If adding noise, the lookback period before the missing obs to use to add the noise, assuming a normal distribution with the standard deviation from the lookback. If this is larger than the number of observations before a given missing observation, will use all observations before the missing one. If this and all the other noise-related arguments are left unspecified, uses all observations before each missing one to find the standard deviation. If the first observation(s) is missing, no noise is added to it.

  • cannot_be_below (float) – Optional. A minimum value that the final imputation cannot drop below.

  • cannot_be_above (float) – Optional. A maximum value that the final imputation cannot be above.

  • first_ob_strategy (str) – Default ‘drop’. What to do if the first observation(s) is null. Default will drop. Other options include ‘ignore’, which could cause unexpected results depending on the employed strategy. Can also start with ‘fill_’, where the next digits will be used to create a static fill (‘fill_0’ fills with 0, for example).

  • random_seed (int) – Optional. A random seed to set for reproducible results.

  • **kwargs – Passed to the Forecaster object (https://scalecast.readthedocs.io/en/latest/Forecaster/Forecaster.html#src.scalecast.Forecaster.Forecaster.__init__)

Returns:

A Forecaster object with missing dates/values filled in.

Return type:

(Forecaster)

>>> import numpy as np
>>> from scalecast.util import Forecaster_with_missing_vals
>>> # using the function with null values in y
>>> f = Forecaster_with_missing_vals(
>>>    y = [1,2,np.nan,4],
>>>    current_dates=['2020-01-01','2020-01-02','2020-01-03','2020-01-04'],
>>>    fill_strategy = 'linear_interp',
>>> ) # replaces missing val with 3
>>> # using the function with missing dates
>>> f = Forecaster_with_missing_vals(
>>>    y = [1,2,4],
>>>    current_dates=['2020-01-01','2020-01-02','2020-01-04'], # missing '2020-01-03'
>>>    desired_frequency = 'D', # tell it to use daily frequency
>>>    fill_strategy = 'linear_interp',
>>> ) # adds 3 to the 2nd index position in y and adds '2020-01-03' to 2nd index position in current_dates
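
To show what the two examples above amount to, here is a standalone sketch of the idea behind desired_frequency = 'D' combined with fill_strategy = 'linear_interp': expand the dates to a full daily calendar, then interpolate linearly between the nearest known neighbors. It uses only the standard library, represents missing values as None rather than NaN, and assumes the first and last observations are known. The function fill_daily_linear is hypothetical and does not reflect how scalecast implements the fill internally.

```python
from datetime import date, timedelta


def fill_daily_linear(y, dates):
    # Keep only the observations that are actually known.
    observed = {d: v for d, v in zip(dates, y) if v is not None}
    start, end = min(dates), max(dates)
    # Build the complete daily calendar between the first and last date.
    full_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    known = sorted(observed)
    filled = []
    for d in full_dates:
        if d in observed:
            filled.append(float(observed[d]))
            continue
        # Nearest known dates on either side of the gap.
        left = max(k for k in known if k < d)
        right = min(k for k in known if k > d)
        weight = (d - left).days / (right - left).days
        filled.append(observed[left] + weight * (observed[right] - observed[left]))
    return full_dates, filled


dates = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4)]  # 2020-01-03 is missing
new_dates, new_y = fill_daily_linear([1, 2, 4], dates)

print(new_y)  # [1.0, 2.0, 3.0, 4.0]
```
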

backtest_for_resid_matrix()

src.scalecast.util.backtest_for_resid_matrix(*fs, pipeline, alpha=0.05, bt_n_iter=None, jump_back=1, **kwargs)

Performs a backtest on one or more Forecaster objects using pipelines, so that a residual matrix for building dynamic intervals can easily be obtained (see util.get_backtest_resid_matrix() and util.overwrite_forecast_intervals()).

Parameters:
  • *fs (Forecaster) – The objects that contain the evaluated forecasts. Send one if univariate forecasting with the Pipeline class, more than one if multivariate forecasting with the MVPipeline class.

  • pipeline (Pipeline or MVPipeline) – The pipeline to send *fs through.

  • alpha (float) – Default 0.05. The level that confidence intervals need to be evaluated at. 0.05 = 95%.

  • bt_n_iter (int) – Optional. The number of iterations to backtest. If left unspecified, chooses 1/alpha, the minimum needed to set reliable conformal intervals.

  • jump_back (int) – Default 1. The space between consecutive training sets in the backtest.

  • **kwargs – Passed to Pipeline.backtest().

Returns:

The results from each model and backtest iteration. Each dict in the resulting list corresponds to one Forecaster object, in the order they were passed (the list has length 1 if univariate forecasting). Each key of each dict is either ‘Actuals’, ‘Obs’, or the name of a model that got backtested. Each value is a DataFrame with the iteration values. The ‘Actuals’ frame has the date information and holds the actuals over each forecast horizon. The ‘Obs’ frame has the actual historical observations used to make each forecast, back padded with NA values to make each array the same length.

Return type:

(List[Dict[str,pd.DataFrame]])

backtest_metrics()

src.scalecast.util.backtest_metrics(backtest_results: list, models=None, mets=['rmse'], mase=False, msis=False, msis_alpha=0.05, m=1, names=None) → DataFrame

Ingests the results output from Pipeline.backtest() and converts results to metrics.

Parameters:
  • backtest_results (list) – The output returned from Pipeline.backtest() or MVPipeline.backtest().

  • models (collection) – The names of the models to display metrics for. Default displays all models.

  • mets (list[str or callable]) – Default [‘rmse’]. A list of metrics to calculate. If the element is str type, must be taken from the util.metrics class where the only two accepted arguments are a and f. If the element in the list is callable, must be a function that only accepts two arguments (first actuals second forecast) and returns a float.

  • mase (bool) – Default False. Whether to also calculate mase. Must specify seasonality in m.

  • msis (bool) – Default False. Whether to also calculate msis. Must specify seasonality in m. This will fail if confidence intervals were not evaluated.

  • msis_alpha (float) – Default 0.05. The level that confidence intervals were evaluated at. Ignored if msis is False.

  • m (int) – Default 1. The number of steps that count one seasonal cycle. Ignored if both msis and mase are False.

  • names (list) – Optional. The names to assign each passed series. Ignored if there is only one passed series.

Returns:

The metrics dataframe that gives info about each backtested series, model, and selected metric.

Return type:

(DataFrame)

>>> f1, f2, f3 = pipeline.fit_predict(f1,f2,f3)
>>> backtest_results = pipeline.backtest(f1,f2,f3,n_iter=2,jump_back=12)
>>> backtest_mets = backtest_metrics(
>>>     backtest_results,
>>>     mets = ['rmse','smape','r2','mae'],
>>>     names=['UTUR','UNRATE','SAHMREALTIME'],
>>>     mase = True,
>>>     msis = True,
>>>     m = 12,
>>> )
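
Because mets also accepts callables, a custom metric only needs to take the actuals and the forecasts, in that order, and return a float. The sketch below defines one such function, median absolute error. The name median_abs_error is hypothetical and chosen purely for illustration.

```python
from statistics import median


def median_abs_error(a, f):
    # Custom metric: median of the absolute errors, returned as a float.
    return float(median(abs(ai - fi) for ai, fi in zip(a, f)))


print(median_abs_error([1, 2, 3, 4, 5], [1, 2, 3, 4, 6]))  # 0.0
```

Under the parameter description above, a function like this could be mixed with string names, for example mets = ['rmse', median_abs_error].
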

break_mv_forecaster()

src.scalecast.util.break_mv_forecaster(mvf, drop_all_Xvars=True)

Breaks apart an MVForecaster object and returns as many Forecaster objects as series loaded into the object.

Parameters:
  • mvf (MVForecaster) – The object to break apart.

  • drop_all_Xvars (bool) – Default True. Whether to drop all Xvars during the conversion. It’s a good idea to leave this True because length mismatches can cause future univariate models to error out.

Returns:

A sequence of at least two Forecaster objects.

Return type:

(tuple[Forecaster])

>>> from scalecast.MVForecaster import MVForecaster
>>> from scalecast.util import break_mv_forecaster
>>>
>>> f1, f2 = break_mv_forecaster(mvf)

find_optimal_coint_rank()

src.scalecast.util.find_optimal_coint_rank(mvf, det_order, k_ar_diff, train_only=False, **kwargs)

Returns the optimal cointegration rank for a multivariate process using the function from statsmodels: https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.vecm.select_coint_rank.html

Parameters:
  • mvf (MVForecaster) – The MVForecaster object with series loaded to find the optimal rank for.

  • train_only (bool) – Default False. Whether to use the training data only in the test.

  • **kwargs – Passed to the referenced statsmodels function.

Returns:

Object containing the cointegration rank suggested by the test and allowing a summary to be printed.

Return type:

(CointRankResults)

>>> from scalecast.Forecaster import Forecaster
>>> from scalecast.MVForecaster import MVForecaster
>>> from scalecast.util import find_optimal_coint_rank
>>> import pandas_datareader as pdr
>>>
>>> s1 = pdr.get_data_fred('UTUR',start='2000-01-01',end='2022-01-01')
>>> s2 = pdr.get_data_fred('UNRATE',start='2000-01-01',end='2022-01-01')
>>>
>>> f1 = Forecaster(y=s1['UTUR'],current_dates=s1.index)
>>> f2 = Forecaster(y=s2['UNRATE'],current_dates=s2.index)
>>>
>>> mvf = MVForecaster(f1,f2,names=['UTUR','UNRATE'])
>>> coint_res = find_optimal_coint_rank(mvf,det_order=-1,k_ar_diff=8,train_only=True)
>>> print(coint_res) # prints a report
>>> rank = coint_res.rank # best rank

find_optimal_lag_order()

src.scalecast.util.find_optimal_lag_order(mvf, train_only=False, **kwargs)

Returns the optimal lag order for a multivariate process using the statsmodels function: https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.var_model.VAR.select_order.html. The exogenous regressors are set based on Xvars loaded in the MVForecaster object.

Parameters:
  • mvf (MVForecaster) – The MVForecaster object with series loaded to find the optimal order for

  • train_only (bool) – Default False. Whether to use the training data only in the test.

  • **kwargs – Passed to the referenced statsmodels function

Returns:

Lag selections.

Return type:

(LagOrderResults)

>>> from scalecast.Forecaster import Forecaster
>>> from scalecast.MVForecaster import MVForecaster
>>> from scalecast.util import find_optimal_lag_order
>>> import pandas_datareader as pdr
>>>
>>> s1 = pdr.get_data_fred('UTUR',start='2000-01-01',end='2022-01-01')
>>> s2 = pdr.get_data_fred('UNRATE',start='2000-01-01',end='2022-01-01')
>>>
>>> f1 = Forecaster(y=s1['UTUR'],current_dates=s1.index)
>>> f2 = Forecaster(y=s2['UNRATE'],current_dates=s2.index)
>>>
>>> mvf = MVForecaster(f1,f2,names=['UTUR','UNRATE'])
>>> lag_order_res = find_optimal_lag_order(mvf,train_only=True)
>>> lag_order_aic = lag_order_res.aic # picks the best lag order according to aic

find_optimal_transformation()

src.scalecast.util.find_optimal_transformation(f, estimator=None, monitor='rmse', test_length=None, train_length=None, num_test_sets=1, space_between_sets=1, lags='auto', try_order=['detrend', 'seasonal_adj', 'boxcox', 'first_diff', 'first_seasonal_diff', 'scale'], boxcox_lambdas=[-0.5, 0, 0.5], detrend_kwargs=[{'loess': True}, {'poly_order': 1}, {'poly_order': 2}], scale_type=['Scale', 'MinMax', 'RobustScale'], m='auto', model='add', set_aside_test_set=False, return_train_only=False, verbose=False, **kwargs)

Finds the set of transformations that maximizes forecast accuracy according to an out-of-sample metric. Works by comparing each transformation individually and stacking the set of transformations that leads to the best performance. The estimator only uses series lags as inputs. When an attempted transformation fails, a warning is logged. The function uses Pipeline.backtest() to ensure that the selected set of transformations is truly tested out-of-sample.

Parameters:
  • f (Forecaster) – The Forecaster object that contains the series that will be transformed.

  • estimator (str) – One of Forecaster.can_be_tuned. The estimator to use to choose the best transformations with. The default will read whatever is set to f.estimator.

  • monitor (str or callable) – Default ‘rmse’. The error metric to minimize. If str, must exist in util.metrics and accept only two arguments. If callable, must accept only two arguments (an array of actuals and an array of forecasts) and return a float. If ‘r2’ is passed, this will monitor a negative r2 value.

  • test_length (int) – The number of observations to hold out-of-sample. By default reads the number of dates in f.future_dates.

  • train_length (int) – The number of observations to train the model in each iteration. By default, uses all available observations that come before each test set.

  • num_test_sets (int) – Default 1. The number of test sets to iterate through. The final metric will be an average across all test sets.

  • space_between_sets (int) – Default 1. The space between consecutive training sets. Not applicable when num_test_sets is 1.

  • lags (str or int) – Default ‘auto’. The number of lags that will be used as inputs for the estimator. If ‘auto’, uses the value passed or assigned to m (one seasonal cycle). If multiple values passed to m, uses the first.

  • try_order (list-like) – Default [‘detrend’, ‘seasonal_adj’, ‘boxcox’, ‘first_diff’, ‘first_seasonal_diff’, ‘scale’]. The transformations to try and also the order to try them in. Changing the order here can change the final transformations derived: the untransformed (level) series is compared with the first transformation, and if that transformation performs better than level, it carries over to be tried in conjunction with the next transformation, and so forth. The default list contains all possible transformations for this function.

  • boxcox_lambdas (list-like) – Default [-0.5,0,0.5]. The lambda values to try for a boxcox transformation. 0 means natural log. Only up to one boxcox transformation will be selected.

  • detrend_kwargs (list-like[dict]) – Default [{‘loess’:True},{‘poly_order’:1},{‘poly_order’:2}]. The types of detrending to try. Only up to one detrender will be selected.

  • scale_type (list-like) – Default [‘Scale’, ‘MinMax’, ‘RobustScale’]. The type of scaling to try. Only up to one scaler will be selected. A SeriesTransformer.{scale_type}Transform() function must exist for this to work.

  • m (int or str) – Default ‘auto’. The number of observations that counts one seasonal step. Ignored when seasonal_lags = 0. When ‘auto’, uses the M4 competition values: for Hourly: 24, Monthly: 12, Quarterly: 4. Everything else gets inferred if possible. If list, multiple adjustments will be tried and up to that many adjustments can be selected.

  • model (str) – Default ‘add’. One of {“additive”, “add”, “multiplicative”, “mul”}. The type of seasonal component. Only relevant for the ‘seasonal_adj’ option in try_order.

  • set_aside_test_set (bool) – Default False. Whether to separate the test set specified in f.test_length during this process. Setting this to True prevents leakage when testing the forecasts out-of-sample.

  • return_train_only (bool) – Default False. Whether the returned selections should be set to train_only. All tries are completely out-of-sample but the returned transformations will not hold out the test-set in the Forecaster object when detrending, deseasoning, and scaling, so setting this to True can prevent leakage.

  • verbose (bool) – Default False. Whether to print info about the transformers/reverters being tried.

  • **kwargs – Passed to the Forecaster.manual_forecast() function and possible values change based on which estimator is used.

Returns:

A Transformer object with the identified transforming functions and the Reverter object with the Transformer counterpart functions.

Return type:

(Transformer, Reverter)

>>> from scalecast.Forecaster import Forecaster
>>> from scalecast.Pipeline import Pipeline, Transformer, Reverter
>>> from scalecast.util import find_optimal_transformation
>>> import pandas_datareader as pdr
>>>
>>> def forecaster(f):
>>>     f.add_covid19_regressor()
>>>     f.auto_Xvar_select(cross_validate=True)
>>>     f.set_estimator('mlr')
>>>     f.manual_forecast()
>>> df = pdr.get_data_fred(
>>>     'HOUSTNSA',
>>>     start='1959-01-01',
>>>     end='2022-08-01'
>>> )
>>> f = Forecaster(
>>>     y=df['HOUSTNSA'],
>>>     current_dates=df.index,
>>>     future_dates=24,
>>>     test_length = .2, # this will be monitored for performance
>>> )
>>> f.set_validation_length(24)
>>> transformer, reverter = find_optimal_transformation(f)
>>> print(reverter) # see what transformers and reverters were chosen
>>> pipeline = Pipeline(
>>>   steps = [
>>>       ('Transform',transformer),
>>>       ('Forecast',forecaster),
>>>       ('Revert',reverter),
>>>   ],
>>> )
>>> f = pipeline.fit_predict(f)
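
The stacking behavior described for try_order can be pictured as a greedy search: start from the untransformed series, try each candidate in order, and keep a candidate only if adding it to the current stack lowers the monitored error. The sketch below is one reading of that description, using a toy error table in place of real backtested metrics. The names greedy_select, toy_errors, and toy_score are hypothetical, and the actual function also compares several variants of each transformation (for example the different boxcox_lambdas).

```python
def greedy_select(candidates, score):
    # Baseline: an empty stack stands for the untransformed ("level") series.
    selected = []
    best = score(selected)
    for cand in candidates:
        trial = selected + [cand]
        trial_score = score(trial)
        # Keep the candidate only if it improves on the best score so far.
        if trial_score < best:
            selected, best = trial, trial_score
    return selected, best


# Toy out-of-sample errors standing in for a backtested metric (made-up numbers).
toy_errors = {
    (): 10.0,
    ("detrend",): 8.0,
    ("detrend", "boxcox"): 8.5,
    ("detrend", "first_diff"): 7.0,
}


def toy_score(stack):
    # Unknown stacks are treated as failures and never selected.
    return toy_errors.get(tuple(stack), float("inf"))


chosen, err = greedy_select(["detrend", "boxcox", "first_diff"], toy_score)
print(chosen, err)  # ['detrend', 'first_diff'] 7.0
```

Because each decision depends on what was kept before it, reordering the candidates can lead to a different final stack, which is why the order in try_order matters.
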

find_statistical_transformation()

src.scalecast.util.find_statistical_transformation(f, goal=['stationary'], train_only=False, critical_pval=0.05, log=True, m='auto', adf_kwargs={}, **kwargs)

Finds a set of transformations to achieve stationarity or seasonal adjustment, based on results from statistical tests.

Parameters:
  • f (Forecaster) – The object that stores the series to test.

  • goal (list-like) – Default [‘stationary’]. One or multiple of ‘stationary’, ‘seasonally_adj’. Other options may be coming in the future. If more than one goal is passed, will try to satisfy all goals in the order passed. For stationary: uses an Augmented Dickey-Fuller test to determine if the series is stationary. If not stationary, returns a diff transformation and log transformation if log is True. For seasonally_adj: uses seasonal auto_arima to find the optimal seasonal diff.

  • train_only (bool) – Default False. Whether to use train set only in all statistical tests.

  • log (bool) – Default True. Whether to log and difference the series if it is found to be non-stationary or just difference. This will set itself to False if the lowest observed series value is 0 or lower.

  • critical_pval (float) – Default 0.05. The cutoff p-value to use to determine statistical significance in the Augmented Dickey-Fuller test and to run the auto_arima selection (substitutes for alpha arg).

  • m (str or int) – Default ‘auto’. The number of observations that counts one seasonal step. When ‘auto’, uses the M4 competition values: for Hourly: 24, Monthly: 12, Quarterly: 4. Everything else gets 1 (no seasonality assumed), so pass your own values for other frequencies.

  • adf_kwargs (dict) – Default {}. Keyword args to pass to the Augmented Dickey-Fuller test function. See the maxlag, regression, and autolag arguments from https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html.

  • **kwargs – Passed to the auto_arima() function when searching for optimal seasonal diff.

Returns:

A Transformer object with the identified transforming functions and the Reverter object with the Transformer counterpart functions.

Return type:

(Transformer, Reverter)

>>> from scalecast.Forecaster import Forecaster
>>> from scalecast.Pipeline import Pipeline, Transformer, Reverter
>>> from scalecast.util import find_statistical_transformation
>>> import pandas_datareader as pdr
>>>
>>> def forecaster(f):
>>>     f.add_covid19_regressor()
>>>     f.auto_Xvar_select(cross_validate=True)
>>>     f.set_estimator('mlr')
>>>     f.manual_forecast()
>>> df = pdr.get_data_fred(
>>>     'HOUSTNSA',
>>>     start='1959-01-01',
>>>     end='2022-08-01'
>>> )
>>> f = Forecaster(
>>>     y=df['HOUSTNSA'],
>>>     current_dates=df.index,
>>>     future_dates=24,
>>>     test_length = .2,
>>> )
>>> f.set_validation_length(24)
>>> transformer, reverter = find_statistical_transformation(
>>>     f,
>>>     goal=['stationary','seasonally_adj'],
>>>     train_only=True,
>>>     critical_pval = .01,
>>> )
>>> print(reverter) # see what transformers and reverters were chosen
>>> pipeline = Pipeline(
>>>   steps = [
>>>       ('Transform',transformer),
>>>       ('Forecast',forecaster),
>>>       ('Revert',reverter),
>>>   ],
>>> )
>>> f = pipeline.fit_predict(f)
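
The decision rule described for the ‘stationary’ goal can be summarized in a few lines. The sketch below takes an Augmented Dickey-Fuller p-value as given and returns the names of the steps that would follow. It assumes a strict comparison against critical_pval and that the log is applied before the difference. The function stationarity_steps and the step labels are hypothetical, not scalecast identifiers.

```python
def stationarity_steps(adf_pval, series_min, critical_pval=0.05, log=True):
    # The ADF null hypothesis is a unit root, so a small p-value
    # suggests the series is already stationary.
    if adf_pval < critical_pval:
        return []
    steps = []
    # A log is only possible when every observation is strictly positive.
    if log and series_min > 0:
        steps.append("log")
    steps.append("diff")
    return steps


print(stationarity_steps(0.30, series_min=12.0))  # ['log', 'diff']
print(stationarity_steps(0.30, series_min=-1.0))  # ['diff']
print(stationarity_steps(0.01, series_min=12.0))  # []
```
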

gen_rnn_grid()

src.scalecast.util.gen_rnn_grid(layer_tries=5, min_layer_size=1, max_layer_size=3, layer_cell_pool=['LSTM'], units_pool=[8, 16, 32, 64], activation_pool=['relu', 'tanh'], dropout_pool=[0], uniform_layer_cells=True, uniform_units=True, uniform_activations=True, uniform_dropout=True, verbose=0, random_seed=None, **kwargs)

Randomly generates an RNN grid for tuning hyperparameters. The resulting grid may be very large, so it is generally a good idea to use Forecaster.limit_grid_size to make it feasible to use with tuning/cross validation.

Parameters:
  • layer_tries (int) – Default 5. How many layer tries to place into the grid.

  • min_layer_size (int) – Default 1. The smallest possible hidden layer structure will be this size.

  • max_layer_size (int) – Default 3. The largest possible hidden layer structure will be this size.

  • layer_cell_pool (collection['LSTM'|'SimpleRNN']) – Default [‘LSTM’]. The possible cell types to try.

  • units_pool (collection[int]) – Default [8,16,32,64]. The possible unit size for each cell.

  • activation_pool (collection[str|tf.Activation]) – Default [‘relu’,’tanh’]. The possible activation functions to try.

  • dropout_pool (collection) – Default [0]. The possible dropout values to add to each layer.

  • uniform_layer_cells (bool) – Default True. Whether each cell should be the same in all layers.

  • uniform_units (bool) – Default True. Whether each layer should have the same unit sizes.

  • uniform_activations (bool) – Default True. Whether each layer should have the same activation function.

  • uniform_dropout (bool) – Default True. Whether each layer should have the same dropout value.

  • verbose (int) – Default 0. The verbosity of the resulting models. 1 for fully verbose, 2 for medium verbose, 0 for no verbosity.

  • random_seed (int) – Optional. Set a random seed for consistent results.

  • **kwargs – Other keyword arguments to add to each hyperparameter. Can include callbacks for early stopping, optimizer, lags, etc. If wanting to try multiple values for a given keyword, pass a list/collection type. See: https://scalecast.readthedocs.io/en/latest/Forecaster/_forecast.html#rnn

Returns:

The resulting hyperparameter grid.

Return type:

(Dict)
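
To give a feel for what random generation of layer structures means here, the following standalone sketch draws layer_tries structures, each with a random depth between min_layer_size and max_layer_size. It mimics the case where all uniform_* options are True, so one draw of cell type, units, and activation is shared by every layer in a structure. Dropout is omitted for brevity. The function sketch_layer_structs and the (cell, settings) tuple layout are assumptions made for illustration and may not match the grid format that gen_rnn_grid() actually returns.

```python
import random


def sketch_layer_structs(
    layer_tries=5,
    min_layer_size=1,
    max_layer_size=3,
    cell_pool=("LSTM",),
    units_pool=(8, 16, 32, 64),
    activation_pool=("relu", "tanh"),
    random_seed=None,
):
    rng = random.Random(random_seed)  # local generator for reproducible draws
    structs = []
    for _ in range(layer_tries):
        n_layers = rng.randint(min_layer_size, max_layer_size)
        # Uniform case: one draw per structure, reused for every layer.
        cell = rng.choice(cell_pool)
        units = rng.choice(units_pool)
        activation = rng.choice(activation_pool)
        structs.append(
            [(cell, {"units": units, "activation": activation}) for _ in range(n_layers)]
        )
    return structs


for struct in sketch_layer_structs(layer_tries=3, random_seed=20):
    print(struct)
```
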

get_backtest_resid_matrix()

src.scalecast.util.get_backtest_resid_matrix(backtest_results)

Converts results from a backtest pipeline into a matrix of residuals. Each row of the matrix corresponds to a backtest iteration and each column to a forecast step.

Parameters:

backtest_results (list) – The output returned from Pipeline.backtest() or MVPipeline.backtest(). It is recommended to obtain this by running util.backtest_for_resid_matrix() and to pass the results to util.overwrite_forecast_intervals().

Returns:

A list where each element corresponds to the given Forecaster object in a backtest. The elements of the list are dictionaries where each key is an evaluated model name and each value is a numpy matrix of the appropriate dimensions that can be used to determine a dynamic prediction interval.

Return type:

(list[dict[str,numpy.ndarray]])
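
The shape of the residual matrix can be illustrated with plain lists. In the sketch below, each inner list is one backtest iteration and each position is one forecast step. The residual is taken as actual minus prediction, which is an assumed sign convention. The real function returns numpy arrays keyed by model name, and resid_matrix_sketch is a hypothetical stand-in.

```python
def resid_matrix_sketch(actuals, predictions):
    # One row per backtest iteration, one column per forecast step.
    return [
        [a - p for a, p in zip(actual_row, pred_row)]
        for actual_row, pred_row in zip(actuals, predictions)
    ]


# Two backtest iterations over a three-step horizon (made-up numbers).
actuals = [[10, 12, 14], [11, 13, 15]]
preds = [[9, 12, 16], [11, 14, 13]]

print(resid_matrix_sketch(actuals, preds))  # [[1, 0, -2], [0, -1, 2]]
```
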

infer_apply_Xvar_selection()

src.scalecast.util.infer_apply_Xvar_selection(infer_from, apply_to, return_copy=False)

Attempts to infer what Xvars have been added to one Forecaster object and applies the guess to another Forecaster object. If using default fourier seasonal terms, linear or log trend terms, and autoregressive terms only, with defaults, this will guess all variables successfully. Other variables (such as those added through Forecaster.add_Xvars_df()) will not be added. Any variables that cannot be inferred will be reported in a warning.

Parameters:
  • infer_from (Forecaster or MVForecaster) – The object to infer the Xvars from.

  • apply_to (Forecaster or MVForecaster) – The object to apply the guess to.

  • return_copy (bool) – Default False. Whether to create a copy of the object passed to apply_to. Default will add Xvars to the instance passed to apply_to.

Returns:

The Forecaster object with the inferred variables added to it.

Return type:

(Forecaster)

>>> f2 = infer_apply_Xvar_selection(infer_from=f1,apply_to=f2)

overwrite_forecast_intervals()

src.scalecast.util.overwrite_forecast_intervals(*fs, backtest_resid_matrix, models=None, alpha=0.05)

Overwrites naive forecast intervals stored in passed Forecaster objects with dynamic intervals. Overwrites intervals for future predictions only; does not overwrite test-set prediction intervals.

Parameters:
  • *fs (Forecaster) – The objects that contain the evaluated forecasts to overwrite confidence intervals.

  • backtest_resid_matrix (list) – The output returned from util.get_backtest_resid_matrix().

  • models (list) – Optional. The models to overwrite intervals for. By default, overwrites all models found in backtest_resid_matrix.

  • alpha (float) – Default 0.05. The level that confidence intervals need to be evaluated at. 0.05 = 95%. Use a value equal to or larger than the one passed to backtest_for_resid_matrix(); otherwise this will fail.
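
One generic way to turn a backtest residual matrix into step-specific intervals is a conformal-style quantile: for each forecast step, sort the absolute residuals across iterations and use the value at rank ceil((1 - alpha) * n_iter) as the half-width around the point forecast. The sketch below illustrates that idea only. It is not claimed to be the calculation scalecast performs, and dynamic_intervals is a hypothetical name.

```python
import math


def dynamic_intervals(point_forecast, resid_matrix, alpha=0.05):
    n_iter = len(resid_matrix)
    # Rank of the absolute residual used as the half-width (1-indexed, clipped).
    k = min(n_iter, math.ceil((1 - alpha) * n_iter))
    lower, upper = [], []
    for step, yhat in enumerate(point_forecast):
        abs_resids = sorted(abs(row[step]) for row in resid_matrix)
        width = abs_resids[k - 1]
        lower.append(yhat - width)
        upper.append(yhat + width)
    return lower, upper


# Four backtest iterations over a two-step horizon (made-up numbers).
resids = [[1, -2], [-3, 4], [2, 1], [0, -5]]
lo, hi = dynamic_intervals([10, 20], resids, alpha=0.5)

print(lo, hi)  # [9, 18] [11, 22]
```

With only a handful of iterations, a small alpha cannot be resolved to a distinct rank, which is consistent with the guidance that roughly 1/alpha backtest iterations are needed for reliable intervals.
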

plot_reduction_errors()

src.scalecast.util.plot_reduction_errors(f, ax=None, figsize=(12, 6))

Plots the resulting error/accuracy of a Forecaster object where reduce_Xvars() method has been called with method = ‘pfi’ or method = ‘shap’.

Parameters:
  • f (Forecaster) – An object that has called the reduce_Xvars() method with method = ‘pfi’ or method = ‘shap’.

  • ax (Axis) – Optional. The existing axis to write the resulting figure to.

  • figsize (tuple) – Default (12,6). The size of the resulting figure. Ignored when ax is not None.

Returns:

(Axis) The figure’s axis.

>>> from scalecast.Forecaster import Forecaster
>>> from scalecast.util import plot_reduction_errors
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>>
>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-06-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.set_test_length(.2)
>>> f.generate_future_dates(24)
>>> f.add_ar_terms(24)
>>> f.add_seasonal_regressors('month',raw=False,sincos=True,dummy=True)
>>> f.add_seasonal_regressors('year')
>>> f.add_time_trend()
>>> f.reduce_Xvars(method='pfi')
>>> plot_reduction_errors(f)
>>> plt.show()