Regime-dependent commodity price dynamics: A predictive analysis

We develop an econometric modelling framework to forecast commodity prices taking into account potentially different dynamics and linkages existing at different states of the world and using different performance measures to validate the predictions. We assess the extent to which the quality of the forecasts can be improved by entertaining different regime-dependent threshold models considering different threshold variables. We evaluate prediction quality using both loss minimization and profit maximization measures based on directional accuracy, directional value, the ability to predict turning points, and the returns implied by a simple trading strategy. Our analysis provides overwhelming evidence that allowing for regime-dependent dynamics leads to improvements in predictive ability for the Goldman Sachs Commodity Index, as well as for its five sub-indices (energy, industrial metals, precious metals, agriculture, and livestock). Our results suggest the existence of a trade-off between predictive ability based on loss and profit measures, which implies that the particular aim of the prediction exercise carried out plays a very important role in terms of defining which set of models is the best to use.


1 | INTRODUCTION
This study aims at creating an econometric modelling framework to forecast commodity prices, taking explicitly into account the potentially different dynamics and linkages existing in different states of the world and using different performance measures to validate the predictions. The literature on commodity price forecasts can be categorized into two broad groups depending on the approach taken. While some studies use asset prices as predictors of commodity prices, a more agnostic approach exploits statistical methods to search for the most effective set of predictors of commodity price changes. The more common approach based on asset prices, routinely used by central banks, creates predictions of commodity prices using futures prices. Recently, some authors have argued that such a forecasting method provides rather noisy signals about future spot prices (see Gorton & Rouwenhorst, 2006; Groen & Pesenti, 2011; Hong & Yogo, 2012).
The early literature on commodity price modelling and forecasting builds upon large macroeconometric specifications (Just & Rausser, 1981), while modern methods rely on univariate and multivariate time series modelling which jointly assess the dynamics of macroeconomic variables and commodity prices (see, e.g., Ahumada & Cornejo, 2015, 2016). Groen and Pesenti (2011) and Gargano and Timmermann (2014) provide relevant examples of the more agnostic and flexible approach to model building in the context of commodity price forecasting. In both studies, the authors assess whether forecasts of commodity prices based on a large pool of macroeconomic predictors can systematically improve upon naive benchmarks. Groen and Pesenti (2011) study the predictability of 10 commodity indices in an out-of-sample forecasting experiment. They conclude that neither commodity exchange rates nor a broad cross-section of macroeconomic variables produce overwhelmingly strong evidence of spot price predictability when compared with random walk or autoregressive benchmarks. Gargano and Timmermann (2014), on the other hand, examine the out-of-sample predictability of seven commodity indices over the period 1947-2010, using macroeconomic and financial variables. They find that commodity currencies have some predictive power at short (monthly and quarterly) forecast horizons, while growth in industrial production and the investment-capital ratio have some predictive power at longer (yearly) horizons, a result that resembles that of Chen et al. (2010). Other modelling frameworks aimed at forecasting short-term changes in agricultural commodity prices are employed in more recent contributions to the literature, such as those by Xu (2017, 2018, 2020). In parallel, efforts to improve forecasts of commodity prices by explicitly modelling their volatility have also been carried out (see, e.g., Bernard et al., 2008; Ramirez & Fadiga, 2003; or the recent contribution by Degiannakis et al., 2020).
In striving for modelling frameworks with good predictive accuracy for commodity prices, in this contribution, we assess the extent to which the quality of the forecasts depends on the state of the economy. Issues related to optimizing out-of-sample prediction in the presence of structural breaks and parameter instability have been particularly prevalent in the modern forecasting literature (see, e.g., Giacomini & Rossi, 2010). We aim at assessing whether, for example, models tend to provide more accurate predictions of commodity prices in calm than in turbulent times. First findings in this direction were provided by Gargano and Timmermann (2014), who observe that commodity price predictability is better during recessions than during expansions. In stock and bond markets, the importance of models that account for regime-dependent parameters has often been acknowledged.
Recent studies (e.g., Guidolin & Timmermann, 2005, for excess stock and bond returns, or Guidolin & Timmermann, 2009, for short-term interest rates) have found that regime-switching models may prove extremely useful to forecast over intermediate horizons, using monthly data. Guidolin and Ono () find overwhelming evidence of regime switching in the joint process for asset prices and macroeconomic variables. They also find that explicitly modelling the presence of such regimes considerably improves the out-of-sample performance of a model of the linkages between asset prices and the macroeconomy. Guidolin and Pedio (2021) forecast commodity futures returns using a Markov-switching model that identifies different volatility regimes and maps the observations into high-volatility and low-volatility states. In addition, they find that the models that outperform under a statistical loss function are not necessarily the best when an economic loss function is used to evaluate the predictive performance of the different models. Jacobsen et al. (2019) investigate stock return predictability and find a strong positive relation between industrial metals and equity returns in times of recessions and a negative relation during expansions. In this study, we entertain different regime-dependent models (threshold models), considering different threshold variables to capture states of the world.
In addition, we assess the quality of commodity forecasts not only with the mean squared error (MSE), the traditional forecast performance measure used in many studies including Gargano and Timmermann (2014), but also with measures that evaluate directional accuracy (DA), directional value (DV), the ability to predict price movements when large swings take place, and returns implied by a trading strategy based on commodity price forecasts. These additional measures (profit measures, as opposed to loss measures like the MSE or the mean absolute error [MAE]) do not directly assess forecast accuracy but relate to other dimensions of forecasting quality and may be more relevant than accuracy for particular applications in policy and applied work.
We create models to predict commodity price dynamics as captured by the changes in an overall commodity price index, as well as in five subindices (energy, industrial metals, precious metals, agriculture, and livestock), for short- and long-term forecast horizons, using monthly observations in the period 1980-2018. Our forecast models include threshold models that are based on different threshold variables, and we consider the various performance measures discussed above. For the multivariate threshold models, we use the following variables: composite leading indicator for the USA and real effective exchange rate of the US dollar (macroeconomic variables), world stock market index (financial variable), and stock-to-use ratio (fundamental variable). Based on an extensive empirical analysis, we find overwhelming evidence that allowing for regime-dependent dynamics leads to improvements in predictive ability for commodity prices. This is the case because the characteristics of the dynamics and the interactions with other variables are not constant over time but differ depending on particular phenomena (e.g., periods of high and low volatility, good and bad economic times, times of high/low interest rates or inflation). To the extent that the estimated models lead to stable dynamics, modelling the interactions in a regime-specific fashion allows for better predictions of commodity price changes. However, the nature of these improvements also differs across predictive measures and sectors.
Our results show that an interesting trade-off appears between loss and profit measures, which implies that the particular aim of the prediction exercise carried out plays a very important role in terms of defining which set of models is the best to use. Our results indicate a systematic correlation between loss-based and profit-based predictive error measures that suggests that correctly predicted directions of change tend to happen in periods where MSEs are particularly large. The optimal specifications for applications where the metrics for success are related to systematically predicting the direction of change of commodity prices accurately may thus be systematically different from those aimed at providing point predictions with an absolute minimal distance to the realized values. In the context of the existing literature, we employ a relatively large model space in terms of potential covariates and threshold variables, which can explain the differences in results as compared to other studies where the predictive performance of nonlinear models is humble compared with that of simpler linear specifications.
The paper is structured as follows. In Section 2, we present the forecast models, where we describe the class of threshold models, which are our main focus, in more detail. In Section 3, we introduce the commodity price data and the explanatory and threshold variables. We present forecast performance measures, including traditional and new measures, in Section 4. The following section presents and discusses the empirical results, and Section 6 concludes the study.

2 | METHODOLOGY
In order to address our research question, which deals with how different states of the economy (like recessions/expansions, high/low volatility, high/low inflation, high/low interest rates, market sentiment, …) affect the price forecasting performance of different commodity classes, we assess threshold models (both univariate and multivariate). These types of models allow their parameters to change in different regimes (states of the world), whose occurrence depends on the value of a given threshold variable. In principle, there is a large universe of potential threshold variables that could be used as a trigger quantity which determines the regime where the process resides at a given moment. It has often been suggested, for example, that variables may behave differently in booming and declining markets. Hence, indicators describing different stages of the business cycle (e.g., business cycle indicators, economic sentiment indicators, inflation, spreads between long- and short-term interest rates) may prove useful in defining the corresponding states of the economy. On the other hand, the behaviour of economic variables may vary in periods of high and low risk, which are usually identified by a high or low volatility in the equity markets. The level of oil price inflation may also induce different types of dynamics in commodity prices. We also examine whether the use of threshold variables based on the rolling correlation between stock and government bond markets, as well as the correlation between stock and oil markets (which are both relevant in portfolio diversification), may lead to differences in the quality of commodity price forecasting models. Finally, we are interested in whether the level of the target variable itself, that is, the commodity index, may be useful to define different states of the world.
In our application, the set of variables that are assessed as potential drivers of the threshold nonlinearity and thus define the states of the economy is given by the following: the composite leading indicator for the USA (CLI), the consumer confidence indicator for the USA (CCI), the USA inflation rate (INF), the spread between long-term and short-term US interest rates (spread), the volatility of the US stock market (VOLA), oil price inflation (Δoil), the correlation between the US stock and government bond markets based on a 6-month rolling window (COR), the correlation between the world stock market and the oil price based on a 6-month rolling window (COR-oil), the S&P Goldman Sachs commodity index (GSCI), and its subindices. For more details, see Table A2 in Appendix A.
As the set of potential specifications aimed at forecasting commodity prices, we consider a large battery of model classes, including autoregressive models, Bayesian vector autoregressive models, GARCH models, and vector error correction models. In addition to these specifications, which do not allow for threshold effects, we consider univariate and multivariate two-regime threshold models. In a preliminary analysis, we recursively tested for the optimal number of regimes in different threshold specifications making use of the test by Bai and Perron (2003). The data appear to strongly support two-regime models against threshold specifications with a higher number of regimes, which leads us to fix the number of thresholds to one throughout the study, thus reducing the computational costs involved in the analysis. All the models employed are listed in Table 1. The simplest threshold model is the threshold autoregression in levels with p lags and with k lags in the threshold variable, TAR(p, k), where $y_t$ is the log of the Goldman Sachs commodity index (or its subindex) at time t and $z \in Z$, with Z being the set of above-mentioned threshold variables, namely, $Z = \{y, \mathrm{CLI}, \mathrm{CCI}, \mathrm{INF}, \mathrm{spread}, \mathrm{VOLA}, \Delta\mathrm{oil}, \mathrm{COR}, \mathrm{COR\text{-}oil}\}$. Finally, $\varepsilon_t \sim \mathrm{NID}(0, \sigma_\varepsilon^2)$. The estimator of $\gamma_\phi$ is the value of z that minimizes the sum of squared residuals in the nonlinear regression (1), that is, $\hat{\gamma}_\phi = \arg\min_z \{\sum \hat{\varepsilon}(z)^2\}$. Once the estimator of $\gamma_\phi$ is found, (1) can be estimated in a straightforward manner making use of OLS.
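The threshold search described above can be illustrated with a minimal sketch. It assumes a two-regime TAR(1, k) with a single autoregressive lag (the study allows up to three), closed-form OLS within each regime, and a 15% trimming of candidate thresholds so that both regimes keep enough observations. The trimming fraction, the function names `ols_ssr` and `fit_tar1`, and the simulated data are illustrative assumptions, not taken from the study.

```python
import random

def ols_ssr(x, y):
    """Fit y = a + b*x by OLS and return (a, b, sum of squared residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, ssr

def fit_tar1(y, z, k=1, trim=0.15):
    """Grid-search the threshold of a two-regime TAR(1, k).

    Each observed value of z[t-k] is a candidate threshold; the estimate is
    the candidate minimizing the pooled sum of squared residuals.
    The trimming fraction is an assumption of this sketch.
    """
    start = max(1, k)
    rows = [(y[t], y[t - 1], z[t - k]) for t in range(start, len(y))]
    candidates = sorted(r[2] for r in rows)
    lo, hi = int(trim * len(rows)), int((1 - trim) * len(rows))
    best = None
    for gamma in candidates[lo:hi]:
        low = [r for r in rows if r[2] <= gamma]
        high = [r for r in rows if r[2] > gamma]
        if len(low) < 3 or len(high) < 3:
            continue
        a1, b1, s1 = ols_ssr([r[1] for r in low], [r[0] for r in low])
        a2, b2, s2 = ols_ssr([r[1] for r in high], [r[0] for r in high])
        if best is None or s1 + s2 < best["ssr"]:
            best = {"gamma": gamma, "low": (a1, b1), "high": (a2, b2),
                    "ssr": s1 + s2}
    return best

# Simulated example: the intercept shifts from 0 to 2 when z[t-1] exceeds 0.
random.seed(0)
z = [random.gauss(0, 1) for _ in range(400)]
y = [0.0]
for t in range(1, 400):
    intercept = 0.0 if z[t - 1] <= 0.0 else 2.0
    y.append(intercept + 0.5 * y[-1] + random.gauss(0, 0.2))

fit = fit_tar1(y, z, k=1)
print(fit["gamma"], fit["low"], fit["high"])
```

Conditional on the selected threshold, each regime is an ordinary linear regression, which is why the remaining parameters can be obtained by OLS once the threshold has been fixed.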
Given that the objective of the analysis is to assess the relative performance exclusively in terms of out-of-sample predictive power and exploiting a large space of specifications, we entertain both models with variables in first differences and models where the variables are included in levels. We also consider threshold autoregressions in first differences with p lags and with a k-th lag in the threshold variable, TDAR(p, k), where $\epsilon_t \sim \mathrm{NID}(0, \sigma_\epsilon^2)$, $\hat{\gamma}_\theta = \arg\min_z \{\sum \hat{\epsilon}(z)^2\}$, and $z \in Z$. In addition to univariate threshold models, we entertain multivariate threshold models, which generalize the class of threshold vector autoregression in levels with p lags and with a k-th lag threshold variable, TVAR(p, k). Let $x_t$ be an N-dimensional vector, then the model under consideration is

where $\Psi_{01}$ and $\Psi_{02}$ are N-dimensional column vectors, $\Psi_{l1}$ and $\Psi_{l2}$ are $N \times N$ matrices, and $\mu_t \sim \mathrm{NID}(0, \Sigma_\mu)$. The S&P GS commodity index (or its subindex) is the first element of $x_t$; thus, the estimator of $\gamma_\Psi$ is the value of z that minimizes the sum of squared residuals corresponding to the first equation in (4), that is, the residuals corresponding to the commodity index. Vector $x_t$ consists of the following macroeconomic and financial variables: the US composite leading indicator (CLI), the real effective exchange rate with respect to the US dollar (REER), the world stock market index (stock), and stock-to-use ratios. Finally, we also consider a variation of the threshold vector autoregression in first differences with p lags and with a k-th lag in the threshold variable, TDVAR(p, k), with parameter vectors and matrices defined analogously to those in the model above. Thus, the estimator of $\gamma_\chi$ is the value of z that minimizes the sum of squared residuals corresponding to the first equation in (6), that is, the residuals corresponding to the commodity index in first differences, ΔGSCI. As in (4), the regimes are implied by the first equation and taken as given for the remaining equations in (6). With the choice of a threshold value that minimizes the sum of squared residuals of the commodity price regression equation, we aim at optimizing predictive ability for our objective variable and ensure that the nonlinearities identified are related to the dynamics of commodity prices.
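For orientation, the two-regime structure described in this section can be written in the standard threshold form below. This is a sketch under assumptions: the coefficient names $\phi_{0j}$, $\phi_{lj}$ for the univariate case, the direction of the inequalities, and the treatment of the boundary case are illustrative choices and need not coincide with the original equations.

```latex
% Assumed standard two-regime forms; \phi_{0j}, \phi_{lj} are illustrative names.
\begin{align*}
y_t &= \Big(\phi_{01} + \sum_{l=1}^{p} \phi_{l1}\, y_{t-l}\Big) I(z_{t-k} \le \gamma_\phi)
     + \Big(\phi_{02} + \sum_{l=1}^{p} \phi_{l2}\, y_{t-l}\Big) I(z_{t-k} > \gamma_\phi)
     + \varepsilon_t, \\
x_t &= \Big(\Psi_{01} + \sum_{l=1}^{p} \Psi_{l1}\, x_{t-l}\Big) I(z_{t-k} \le \gamma_\Psi)
     + \Big(\Psi_{02} + \sum_{l=1}^{p} \Psi_{l2}\, x_{t-l}\Big) I(z_{t-k} > \gamma_\Psi)
     + \mu_t .
\end{align*}
```

Under the same assumptions, the specifications in first differences (TDAR and TDVAR) would follow by replacing $y_t$ and $x_t$ with $\Delta y_t$ and $\Delta x_t$ throughout, with their own coefficients and threshold values.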
In our empirical analysis, when we compare threshold and linear models, we consider up to three lags of the variables (with p = 3 being the maximum lag length) and up to 12 lags for the threshold variable under consideration (with k = 12 being the maximum lag length). Models are compared and selected according to out-of-sample performance measures. We explicitly consider all combinations of explanatory variables and all lags of the explanatory and threshold variables up to the specified maximal lag lengths and choose the best model according to the given forecast performance measure. It should be noted that the space of models we address implies that we are agnostic about the time series properties of commodity prices, with particular specifications building upon the assumption of mean reversion, while others assume nonstationary behaviour of the commodity price variable. Since we address different predictive measures and use a rolling window design for the forecast validation, exploiting short-term mean-reverting dynamics may actually lead to satisfactory predictions in particular periods. Such an approach makes it particularly difficult for nonlinear models to achieve superior predictive ability in a systematic manner.
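To give a sense of the size of such a model space, the following sketch enumerates candidate specifications for one index. It assumes that every subset of the explanatory variables is admissible (including the empty subset, which corresponds to a univariate model) and that lags start at one; the helper name `model_space` and the label "y" for the self-exciting case are hypothetical.

```python
from itertools import combinations

def model_space(covariates, thresholds, max_p=3, max_k=12):
    """Yield every candidate specification as a dictionary.

    Assumptions of this sketch: all covariate subsets are allowed and
    lag orders run from 1 to the stated maximum.
    """
    for size in range(len(covariates) + 1):
        for subset in combinations(covariates, size):
            for p in range(1, max_p + 1):
                for z in thresholds:
                    for k in range(1, max_k + 1):
                        yield {"covariates": subset, "p": p,
                               "threshold": z, "k": k}

covariates = ["CLI", "REER", "stock", "stu"]
thresholds = ["y", "CLI", "CCI", "INF", "spread", "VOLA",
              "d_oil", "COR", "COR-oil"]
models = list(model_space(covariates, thresholds))
print(len(models))  # 16 subsets x 3 lag orders x 9 thresholds x 12 lags
```

Selecting the best model then amounts to evaluating each candidate with the chosen out-of-sample measure and keeping the one with the smallest loss or largest profit.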

3 | DATA
We use the family of S&P GSCI (Standard & Poor's Goldman Sachs Commodity Index) indices to measure commodity prices. We employ both the total aggregate commodity index (S&P GSCI) and five subindices that reflect the developments of certain components of the index, namely energy (with a weight in the total commodity index of 63%), industrial metals (with a weight of 11%), precious metals (with a weight of 4%), agriculture (with a weight of 15%), and livestock (with a weight of 7%). The S&P GSCI is regarded as a benchmark for investment in commodity markets and is designed to be a tradable index. It is calculated using a world production-weighted basis and includes physical commodities that are traded in liquid futures markets. The criteria for inclusion into the index are based on trading volume. In addition, the contracts must be denominated in US dollars and traded in an OECD country or on a trading facility that has its principal place of business in an OECD country. The current S&P GSCI comprises 24 commodities from all commodity sectors with a high exposure to energy. These energy contracts include crude oil, heating oil, and gasoline traded in the US, as well as crude oil and gasoil traded in Europe. Table A1 in Appendix A lists all contracts included in the S&P GSCI and their respective weights and trading places. We consider the class of total return indices. For more information on the S&P GSCI, see S&P Dow Jones (2019). Some descriptive statistics related to the indices are given in Table 2.
Price developments are quite heterogeneous across indices, with only the overall and the energy indices displaying rather similar dynamics. The volatility in returns varies considerably as well, which has a direct impact on the forecasting accuracy of econometric models. The monthly returns of the energy index, for example, show a standard deviation of 7.7% over the total data sample (1980-2018), while the corresponding value for the livestock index is only 3.5%. Overall, the correlations between different commodity sector returns are low (with the exception of the overall index and the energy index), which reinforces the need to analyze the different sectors separately.
As macroeconomic and financial variables in our models, we take the composite leading indicator for the USA (CLI), the real effective exchange rate related to the US dollar (REER), and the world stock market indicator (stock). In addition, we employ fundamental variables summarizing the forces in the commodity market: stock-to-use ratios (stu) for oil (worldwide), wheat (USA), and meat (USA). More precisely, we use the worldwide oil stock-to-use ratio for the aggregate index and for the subindices energy, industrial metals, and precious metals; we use the US wheat stock-to-use ratio for the subindex agriculture; and we use the US meat stock-to-use ratio for the subindex livestock. In those cases where we model commodity subindices, we also use the total commodity index as an additional variable. As threshold variables, in addition to lagged values of the modelled index itself, we use the composite leading indicator for the USA (CLI), the consumer sentiment indicator for the US (CCI), the US inflation measured by the consumer price index (INF), the spread between long-term and short-term US interest rates (spread), the volatility of the S&P 500 (VOLA), the oil price inflation (Δoil), the correlation between the US stock and government bond markets (COR), and the correlation between the global stock market and the oil market (COR-oil). The correlations are calculated between daily returns in the respective markets, over a rolling window of 130 trading days (i.e., approximately 6 months), recorded at the end of a given month. For details on all the data we use, see Table A2.
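The construction of the correlation-based threshold variables can be sketched as follows: a rolling Pearson correlation of daily returns, sampled at the last available day of each month. The function names, the synthetic series, and the short window used in the example are illustrative assumptions; the example also uses calendar days rather than trading days for brevity.

```python
import math
from datetime import date, timedelta

def pearson(a, b):
    """Pearson correlation of two equally long sequences (None if degenerate)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return None
    return cov / math.sqrt(va * vb)

def rolling_corr(x, y, window=130):
    """Correlation over the trailing `window` observations; None until filled."""
    out = [None] * min(window - 1, len(x))
    for end in range(window, len(x) + 1):
        out.append(pearson(x[end - window:end], y[end - window:end]))
    return out

def month_end(dates, values):
    """Keep the last value observed in each (year, month); dates ascending."""
    last = {}
    for d, v in zip(dates, values):
        last[(d.year, d.month)] = v
    return last

# Synthetic daily "returns": the second series mirrors the first.
dates = [date(2018, 1, 1) + timedelta(days=i) for i in range(60)]
stock = [math.sin(i / 3.0) for i in range(60)]
bond = [-s for s in stock]
cor = rolling_corr(stock, bond, window=20)
monthly = month_end(dates, cor)
print(monthly)
```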
The data sample covers monthly observations for the period ranging from January 1980 through December 2018. We consider rolling-window estimation for our analysis, that is, we keep the size of the estimation sample constant and equal to 20 years and move the sample forward by one month while re-estimating the model parameters. The use of a rolling window for the predictive assessment of the models allows our class of threshold models to better identify changes in regimes if they happen at the end of the in-sample period, which could prove important to preserve predictive ability. The rolling-window design should thus avoid that the thresholds identified are exclusively driven by nonlinear behaviour at the beginning of the available sample.
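A rolling-window design of this kind can be sketched with a small index generator. The function name `rolling_windows` and its index convention are assumptions of this sketch, and it enumerates every feasible window in the sample rather than reproducing the exact evaluation period used in the study.

```python
def rolling_windows(n_obs, window, horizon):
    """Yield (start, end, target) index triples.

    The model is estimated on observations start..end-1 (a fixed number of
    `window` points) and evaluated on the observation at index `target`,
    which lies `horizon` steps after the last in-sample point.
    """
    for end in range(window, n_obs - horizon + 1):
        yield end - window, end, end + horizon - 1

# Monthly data from January 1980 to December 2018: 39 years x 12 = 468 points.
# A 20-year estimation window has 240 points.
windows = list(rolling_windows(n_obs=468, window=240, horizon=1))
print(len(windows), windows[0], windows[-1])
```

Because the window length is fixed, the oldest observation is dropped each time a new one enters, so recent regime changes carry more weight than they would in an expanding-window design.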
The out-of-sample period used to evaluate the forecast performance spans from January 2005 to December 2018. Note that "best" models are chosen based on the forecast performance of the individual models for all lags (up to specified maximum lags) and all combinations of variables under consideration.

4 | FORECAST EVALUATION
The evaluation of different commodity price forecasts is carried out employing not only traditional loss measures, like MAE and MSE, but also profit-based measures like DA, DV, and DV of turning points (TP). The latter might be more relevant in situations where getting the right future value of commodity prices may be of lesser importance than predicting their direction of change, in particular if the change in prices is large. The DA indicator, or hit rate, is a binary variable measuring whether the direction of a price change was correctly forecast. The DV incorporates the economic value of directional forecasts by assigning to each correctly predicted change its magnitude. The DA of TPs describes the ability to predict TPs in commodity price dynamics. The loss-based and profit-based performance measures are formally defined using the following notation: $P_t$ is the price of the commodity index at time t, $\hat{P}_{t+h|t}$ is the forecast of the price of the commodity index for time t + h conditional on the information available at time t, h is the forecast horizon, and $I(\cdot)$ is the indicator function.
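The verbal definitions above can be turned into a small per-forecast sketch. It assumes the conventional forms of these measures: squared and absolute forecast errors for the loss side, DA as an indicator that the predicted and realized changes relative to $P_t$ share the same sign, and DV as the absolute realized change counted only when the direction is correct. These forms and the function name `forecast_measures` are assumptions and may differ in detail from the original definitions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def forecast_measures(p_t, p_future, p_forecast):
    """Per-forecast loss and profit measures (assumed conventional forms)."""
    error = p_future - p_forecast
    actual_change = p_future - p_t
    predicted_change = p_forecast - p_t
    da = 1 if sign(actual_change) == sign(predicted_change) else 0
    return {
        "SE": error ** 2,                # squared error, averaged into MSE
        "AE": abs(error),                # absolute error, averaged into MAE
        "DA": da,                        # hit indicator
        "DV": abs(actual_change) * da,   # magnitude of correctly called moves
    }

hit = forecast_measures(p_t=100.0, p_future=105.0, p_forecast=102.0)
miss = forecast_measures(p_t=100.0, p_future=105.0, p_forecast=98.0)
print(hit)
print(miss)
```

The two calls illustrate why loss and profit measures can disagree: a forecast can be close in level yet wrong in direction, or correct in direction while far from the realized price.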
In addition, we consider forecast ability measures based on the returns implied by predicting commodity prices and using a simple "buy low, sell high" trading strategy. This strategy is based on buying the commodity index if its price is forecast to rise and selling it when its price is forecast to fall. This strategy is described (for exchange rates), for example, in Gençay (1998) and will be used under the assumption of no transaction costs. Predicted upward movements of the commodity index with respect to the actual value (positive returns) are executed as long positions, while predicted downward movements (negative returns) are executed as short positions. The "buy low, sell high" trading strategy implies a discrete return $r_{t+h,h}$. Later on, we will sometimes refer to the return implied by this trading strategy simply as the return.
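The return implied by such a strategy can be sketched as follows, assuming a discrete return of the form position × $(P_{t+h} - P_t)/P_t$ with a long position (+1) when the forecast lies above the current price and a short position (−1) when it lies below. Staying flat when the forecast equals the current price, and the function name `strategy_return`, are assumptions of this sketch.

```python
def strategy_return(p_t, p_future, p_forecast):
    """Discrete h-period return of a long/short rule, no transaction costs."""
    if p_forecast > p_t:
        position = 1      # forecast rise: go long
    elif p_forecast < p_t:
        position = -1     # forecast fall: go short
    else:
        position = 0      # assumption: no trade when no change is forecast
    return position * (p_future - p_t) / p_t

long_gain = strategy_return(100.0, 105.0, 102.0)    # correct call, long
short_loss = strategy_return(100.0, 105.0, 98.0)    # wrong call, short
print(long_gain, short_loss)
```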
The aggregate performance measures for each model are calculated over the out-of-sample period for a given forecasting horizon, with the sample dates $T_0$ = January 1980, $T_1$ = January 2005, and $T_2$ = December 2018. The aim of our analysis is to evaluate the potential improvement in out-of-sample predictive ability for commodity prices that can be obtained by entertaining different regime-dependent threshold models (i.e., models where threshold effects are triggered by different variables). In this respect, the linear alternative plays the role of a general benchmark, so as to answer the question: Can threshold models improve predictions compared to models that do not include regime dependence? In addition, individual threshold models also appear as a benchmark reference in our comparisons when we look for an answer to the question of whether particular threshold variables lead to better predictive performance than others.

5 | RESULTS
When analyzing the precision of threshold models in commodity price forecasting, we focus on different metrics. First, we compare threshold models with linear models, which are the most natural benchmark for assessing the value of threshold models as a predictive instrument. In this context, we also analyze the differences across threshold models implied by the use of different threshold variables. We employ different performance metrics to evaluate the forecasting performance and consider both total and regime-specific accuracy measures. In addition to assessing the models in terms of predictive power, we also examine the nature of the threshold variables and selected explanatory variables in the best threshold models and discuss the pattern of forecasting performance for the two regimes. Furthermore, we look at the sector-specific performance of the best threshold models. Finally, we compare threshold models with a larger set of models to find out whether threshold models tend to outperform specifications created out of this expanded set of covariates and consider the additional performance measure related to forecasting TPs.

5.1 | Threshold models and linear models
Our primary focus is to compare the performance of best threshold models (for a given threshold variable) with the performance of linear models. The threshold models entertained contain (vector) autoregression threshold models in levels and differences (TAR, TDAR, TVAR, and TDVAR), including self-exciting threshold autoregressive models, and linear models, that is, (vector) autoregressive specifications in levels and differences (AR, DAR, VAR, and DVAR), as described in Table 1. We examine threshold models where the threshold variable presents stationary behaviour and thus restrict the following variables to act as threshold variables: the lagged value of the dependent variable, the composite leading indicator for the US (CLI), the consumer sentiment indicator for the US (CCI), the US inflation measured by the consumer price index (INF), the spread between long-term and short-term US interest rates (spread), the volatility of the S&P 500 (VOLA), oil price inflation (Δoil), the correlation between the US stock and government bond markets (COR), and the correlation between the global stock market and the oil market (COR-oil).
Before we examine the relative performance of threshold versus linear models, we investigate the performance of threshold variables other than the dependent variable itself and examine whether different threshold variables imply large differences in the forecasting performance of their corresponding specifications. We therefore compare the performance of the best self-exciting threshold model with that of the best threshold model when the threshold variable is one of the other eight threshold variables listed in Table A2. With this exercise, we assess whether states of the world defined by the commodity price itself are informative enough to capture the economic environment implied by various other threshold variables. Figure 1 shows how many threshold models (from the maximum number of eight threshold variables) outperform the self-exciting model, for different performance measures, different forecast horizons, and the various commodity sectors. Our results suggest that the use of threshold variables other than the overall commodity price index adds predictive information to our models. The self-exciting model is only better than any other threshold model for the indices corresponding to precious metals, agriculture, and livestock when considering profit measures (DV and return). For the overall GSCI index, at least half of the threshold models outperform the self-exciting specification for all forecast horizons, irrespective of the performance measure used. This implies that explicitly acknowledging information like economic sentiment, uncertainty, interest rate spread, oil prices, or correlation can help to improve commodity price forecasting. Results are somewhat less clear-cut for energy, industrial metals, and agriculture, and they are the least strong for precious metals and livestock. Even in these two sectors, however, in most cases, the best threshold models in terms of forecasting ability are not the self-exciting ones.
We turn to comparing the performance of the best threshold model with the performance of the linear counterpart that uses the same variables and lag structure. We compare the predictive performance over the whole out-of-sample period, as well as in the two regimes implied by the threshold model separately. The best threshold models with respect to specific threshold variables mostly outperform the corresponding linear models and also the best linear models. In addition, threshold models outperform the corresponding linear specifications in at least one regime and, in most cases, in both regimes. In Table 3, we show the performance of the best threshold model and the performance of the corresponding linear model for the aggregate GSCI, for the threshold variable "spread" (difference between long-term and short-term US interest rates), as a representative example of the results obtained. This particular class of models was chosen based on the best short-term forecasting performance (MSE, 1-month ahead) for the overall GSCI index. For horizons of 1, 3, 6, and 12 months ahead, the total performance of the best threshold model is better than that of the corresponding linear model. When comparing the regime-based performance of the best threshold model and the regime-based performance of the corresponding linear model, the best threshold model outperforms the corresponding linear model in both regimes in most of the cases (17 out of 20 cases), and the threshold model is never outperformed by the corresponding linear specification in both regimes. In addition, we also present the results of the Diebold-Mariano test of equal forecasting accuracy of the best threshold model against the corresponding linear model (Diebold & Mariano, 1995), which indicate statistically significant differences in predictive performance for many of the forecast error measures, in particular, for longer-term forecasting.
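As an illustration of the kind of comparison involved, the following is a minimal sketch of a Diebold-Mariano statistic for two sequences of forecast losses. It assumes the textbook version with a rectangular-kernel long-run variance using h − 1 autocovariances and a standard normal reference distribution; the exact variant, loss functions, and any small-sample corrections used in the study are not specified here, and the loss values in the example are made up.

```python
import math

def diebold_mariano(loss_a, loss_b, horizon=1):
    """Return (statistic, two-sided p-value) for equal predictive accuracy.

    A positive statistic indicates that model B has lower average loss.
    """
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(lag):
        return sum((d[t] - mean_d) * (d[t - lag] - mean_d)
                   for t in range(lag, n)) / n

    # Long-run variance with h-1 autocovariances (rectangular kernel).
    lrv = autocov(0) + 2 * sum(autocov(lag) for lag in range(1, horizon))
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = mean_d / math.sqrt(lrv / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value

# Hypothetical squared errors of a linear and a threshold model.
loss_linear = [4.0, 3.0, 5.0, 4.5, 3.5, 5.5]
loss_threshold = [3.0, 2.5, 3.5, 4.0, 2.0, 4.5]
stat, p_value = diebold_mariano(loss_linear, loss_threshold, horizon=1)
print(stat, p_value)
```

With overlapping multi-step forecasts the loss differentials are serially correlated, which is why the variance term includes autocovariances for horizons above one.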
As a next step, we evaluate whether the total performance of the best threshold model is better than the total performance of the best linear model (out of all possible linear models, not just those including similar variables). The best threshold model (across all threshold variables) always outperforms the best linear model if we consider mean values of the performance criteria over the full out-of-sample period. Furthermore, in virtually all cases, the best threshold model for any given threshold variable outperforms the best linear model. Figure 2 presents the results of the analysis by showing the number of threshold models that outperform the best linear models for the different error measures, horizons, and commodity price subindices. The superiority of threshold models is systematic across all dimensions and can be observed when considering the distribution of the difference in squared prediction errors between the best linear and the best threshold models. Figure 3 presents boxplots of these differences across all commodity sectors for forecast horizons of one and twelve months. The average difference is always positive, and the support of the distribution varies substantially across the different commodity sectors. The pattern observed is relatively similar for forecast horizons of one and twelve months: The interquartile range is largest in the energy sector and smallest in the livestock and precious metals sectors for both forecast horizons.

| Threshold and explanatory variables
Analyzing the best performing threshold models with respect to threshold variables across commodity sectors, a pattern can be extracted (see Figure 4, which presents the ranking of models by threshold variable). For most of the commodity price indices, as well as for the general index, the threshold variables which tend to systematically appear in the best forecasting models in terms of MSE are the spread, the correlation between the US stock and government bond markets, the composite leading indicator, and inflation. The results indicate that capturing the dynamics of particular commodity markets requires different threshold variables. For example, while the correlation between stock and bond markets appears as a good predictor of regime changes in industrial metals, precious metals, agriculture, and livestock, it performs weakly in the energy sector. The differences between loss and profit measures of predictive error are remarkable: While using the correlation between stock and bond markets as a threshold variable also leads to clear predictive gains in terms of return in industrial metals, precious metals, agriculture, and livestock, in other sectors, the best performing threshold variable changes depending on the predictive error measure used.
In general, the forecasting performance of different best threshold models (implied by the different threshold variables) does not vary substantially. Table 4 provides some information on the variability of predictive performance across best threshold models, as compared with that of the best linear models. In particular, the table reports the average deviation of the predictive error of the best threshold models for a certain threshold variable from the best overall threshold model ("average deviation"), in proportion to the deviation of the best linear model from the best threshold model ("linear deviation"). Note that the best threshold model is always better than the best linear model; the "average" threshold model, however, may be worse than the best linear model (implied by a ratio larger than one in the table). The latter is rarely the case. In almost all cases (111 out of 120), the average deviation is smaller than the linear deviation (reflected by a number in the table that is smaller than one) and often to a very large extent. In a clear majority of all cases, the average deviation is less than half the linear deviation, implying that, in general, threshold models seem to perform (similarly) well and considerably better than the best linear model.
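The ratio reported in Table 4 can be made concrete with a small sketch. It assumes that the average is taken over the best model for every threshold variable, including the overall best one (whose deviation is zero). The text does not state this detail, and the function name and the numbers in the usage note are purely hypothetical.

```python
def deviation_ratio(best_by_threshold_var, best_linear, higher_is_better=False):
    """Average deviation of threshold models relative to the linear deviation.

    best_by_threshold_var: dict mapping a threshold variable to the
        performance value of its best threshold model.
    best_linear: performance value of the best linear model.
    higher_is_better: False for loss measures (MAE, MSE), True for
        profit measures (DA, DV, return).
    """
    values = list(best_by_threshold_var.values())
    best_overall = max(values) if higher_is_better else min(values)
    # Deviations are taken in absolute value, so the ratio is non-negative.
    average_deviation = sum(abs(v - best_overall) for v in values) / len(values)
    linear_deviation = abs(best_linear - best_overall)
    return average_deviation / linear_deviation
```

For example, with hypothetical MSE values `{"spread": 1.0, "COR": 1.2, "CLI": 1.4}` and a best linear MSE of 2.0, the average deviation is 0.2 and the linear deviation is 1.0, so the ratio is 0.2. A value below one means the average threshold model sits closer to the best threshold model than the best linear model does.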
FIGURE 3 Difference between squared errors for best linear and best threshold models. Note: The graphs show boxplots of the differences between the squared errors for the best linear model and the squared errors for the best threshold model, for forecast horizons of one (left) and twelve (right) months. The differences are taken such that a higher mass in the positive region (or a positive mean) indicates a better performance of the threshold model.

We turn to examining the nature of the variables included in the set of best threshold models, so as to assess the relative importance of different theoretical drivers of commodity price dynamics. Within the group of best linear models, one group of commodity indices can be found whose explanatory factors are similar among themselves but different from those of other indices. This group includes the aggregate sector, the energy subsector, and the industrial metals subsector. Best models in the remaining indices (precious metals, agriculture, livestock) tend to contain determinants different from this group and also different from each other. In this (first) group, the CLI indicator appears particularly important for prediction, while information on the oil stock-to-use ratio does not seem to systematically improve forecasting. By contrast, the importance of the real effective exchange rate (REER), the world stock market index, and the aggregate GSCI index (for the subsectors) depends on the forecast horizon and performance criterion used. For the best threshold models, the pattern is relatively similar to that for best linear models. For the aggregate sector, energy, and industrial metals, the CLI is an important predictor, the oil stock-to-use ratio is not particularly important, and the real effective exchange rate, the world stock market index, and the GSCI aggregate index (for subsectors) are sometimes included in the best predictive specifications but not systematically so.
Figure 5 shows how often a given explanatory variable is included in the best threshold model, considering the total of nine best threshold models (one for each threshold variable under consideration: commodity price, VOLA, CLI, CCI, INF, Δoil, spread, COR, COR-oil). For the precious metals sector, the most important variable appears to be the REER, while for the agriculture and livestock sectors, the most important variable is the world stock market index, followed by the CLI. These results emphasize the need to assess sectoral dynamics differently in commodity markets in order to optimize the predictive power of multivariate time series models.

| Threshold models and performance criteria
In a next step, we investigate patterns concerning the performance of threshold models across predictive criteria. In some situations, loss measures (MAE, MSE) and profit-based measures (DA, DV, return) behave differently when comparing predictive accuracy between regimes. For instance, threshold models with stock market volatility as the threshold variable perform systematically better in times of low volatility than in times of high volatility in terms of loss measures (MAE, MSE), while they perform better in times of high volatility than in times of low volatility in terms of profit-based measures (DA, DV, return). Table 5 presents the forecasting results of the aggregate GSCI with the threshold variable being the US stock market volatility. In the table, shading indicates better performance across the two regimes implied by the threshold model. The results suggest that, for all forecast horizons, commodity prices can be forecast more accurately in times of low volatility than in times of high volatility, but DA, DV, and the returns of a simple trading strategy (i.e., all profit measures) are higher in times of high volatility. While the first observation can probably be explained through lower price variability and thus better forecasting ability in times of low uncertainty, the second observation may be related to the chances of making more profits in high-volatility markets when the direction of price change is forecast correctly. An analysis of the forecast errors over the period confirms that loss and profit measures tend to be positively correlated over the out-of-sample period for threshold models, with high forecast errors occurring in times when the direction of change was nevertheless correctly predicted. Such a behavior can be observed by comparing the MSE with profit measures over the out-of-sample period. Figure 6 shows the two-year rolling average of MSEs, returns, and DVs measured over the out-of-sample period for the threshold model using the stock market volatility as a threshold variable, when forecasting aggregate GSCI one month ahead. These results indicate that the financial instability in the aftermath of the financial crisis of 2008, which led to large increases in commodity prices, caused large prediction errors in terms of MSEs. However, threshold models based on stock market volatility (and other threshold variables) were able to predict the direction of change in such times of high uncertainty and large price changes systematically better than their linear counterparts. The same phenomenon can be observed (albeit in smaller magnitude) for the generalized drop in commodity prices that started in 2015. The correlation between the two-year rolling-averaged MSE and the return is 0.88, and for DV, it is 0.77. These results give evidence that threshold models, if specified efficiently, show a high degree of flexibility in adapting to structural changes in the dynamics of commodity prices and are able to achieve systematic gains in predictive ability for directional change. In this respect, our results add to a growing literature that compares statistical and economic approaches to measuring predictive loss and profit and that finds conflicting evidence of predictive power depending on the measure employed (see Dal Pra et al., 2018, for instance).
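The contrast between loss and profit measures can be sketched in code. The definitions below are common textbook versions and are assumptions, since the exact formulas used in this study are given elsewhere in the paper: DA is the share of periods in which the sign of the forecast change matches the sign of the realized change, DV weights correct directions by the size of the realized change, and the return is that of a strategy that goes long when a price increase is forecast and short otherwise. The helper names `evaluate` and `rolling_mean` are hypothetical.

```python
import numpy as np

def evaluate(actual_change, forecast_change):
    """Loss and profit measures for forecasts of price changes (illustrative)."""
    a = np.asarray(actual_change, dtype=float)
    f = np.asarray(forecast_change, dtype=float)
    err = a - f
    hit = np.sign(a) == np.sign(f)  # direction of change predicted correctly
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(np.mean(err ** 2)),
        "DA": float(np.mean(hit)),
        # Share of total absolute price movement captured by correct calls.
        "DV": float(np.sum(np.abs(a)[hit]) / np.sum(np.abs(a))),
        # Long if an increase is forecast, short if a decrease is forecast.
        "return": float(np.mean(np.sign(f) * a)),
    }

def rolling_mean(x, window=24):
    """Rolling average; window=24 corresponds to two years of monthly data."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(window) / window, mode="valid")
```

A rolling correlation of the kind reported above could then be obtained, under these assumptions, as `np.corrcoef(rolling_mean(squared_errors), rolling_mean(strategy_returns))[0, 1]`, where both inputs are hypothetical monthly out-of-sample series.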

| Threshold models across sectors
Figure 7 presents MSEs and returns of the set of best threshold models for the different commodity sectors. Prices of livestock, precious metals, and agricultural commodities can be predicted comparatively well relative to the rest of the sectors and the aggregate index, while they tend to lead to low returns. On the other hand, prices of energy and industrial metals lead to the highest prediction errors but yield the largest returns.¹³ This observation holds over all forecast horizons and can be explained by the fact that larger deviations of the forecasts from their realizations are needed in order to increase the implied profit. To a lesser extent, this pattern also persists for the other profit-based measures. DA and DV appear higher for commodity sectors which are harder to predict in terms of MSE. An overview of all performance measures across all sectors and forecast horizons is presented in Figure 8.
Considering best threshold models, both loss measures, MAE and MSE, and the return display a clear structure relating to the forecast horizon. The loss measures increase, that is, forecast accuracy decreases, with an increasing forecast horizon. For example, the MSE when forecasting aggregate commodity prices increases from 0.28% when forecasting 1 month ahead to 6.83% when forecasting 12 months ahead. Using the return as a predictive ability measure, we observe the best performance for the shortest forecast horizon, with decreasing performance for increasing forecast horizons. While the return implied by a simple trading strategy for the aggregate commodity index is 31.46% when forecasting 1 month ahead, the corresponding return is only 12.41% when forecasting 12 months ahead.¹⁴ The observed patterns (for MAE, MSE, and return) with respect to the forecast horizon hold for all commodity sectors (Table 6). For the other two profit-based measures (DA, DV), the behavior with respect to the forecast horizon is not similar across sectors. While for precious metals and agriculture, the DA and DV grow with increasing forecast horizons, the picture is mixed for energy, industrial metals, livestock, and for the aggregate sector. In most cases, however, the DA and DV statistics are largest when forecasting twelve months ahead. See Table 7 and Figure 8.
Table 6 indicates that the commodity sector whose returns dominate those of the others is most of the time the industrial metals sector. Exceptions are the energy sector for return and DV for h = 1 and the agriculture sector for DA and DV for h = 12. The sector with the best loss-based performance is livestock. The smallest losses occur for livestock at the 1-month forecast horizon, namely, 2.42% for MAE and 0.1% for MSE. The largest profit-based performance occurs for agriculture at the 12-month forecast horizon, namely, 80.95% for DA and 89.41% for DV, and for the energy sector, where a return of 46.74% occurs at the 1-month forecast horizon.

| Threshold models and larger class of models
In addition to standard linear vector autoregressive models, we also consider a much larger class of models in order to find out whether threshold models also outperform other specifications. This class includes different univariate GARCH models, vector error correction models, and Bayesian VAR models (see Table 1). For this larger class of models, we choose the lag structure by in-sample model selection, optimizing the Akaike information criterion.¹⁵ We also use an additional performance measure, namely, the proportion of correctly forecast TPs.¹⁶ Our results show that threshold models have the best predictive performance in the vast majority of cases (see Table 8). In only 25 out of a total of 144 cases (six commodity sectors, six performance measures, and four forecast horizons), the threshold model is outperformed by a different specification. None of the best models in this expanded specification set can keep up with the best prediction models found in the smaller set used before. In all cases, without exception, the best model determined in our previous analysis, which is always a threshold model, outperforms the best model found now, including the cases when the best model now is not a threshold model (see Table 8).¹⁷ As best threshold models for different threshold variables often perform similarly (well), as found in our previous analysis, not only the best threshold model but often also other threshold models (with different threshold variables) outperform the corresponding best model found now.
The performance of best models with respect to both loss measures and the return shows a clear pattern with respect to the forecast horizon: Forecast accuracy decreases with an increasing forecast horizon and so does the return. The proportion of correctly forecast TPs, which was not analysed before, does not show a uniform pattern with respect to the forecast horizon. However, it is largest for the 12-month forecast horizon for the total commodity index, for energy, and industrial metals, while it is largest for the 1-month forecast horizon for the remaining sectors (precious metals, agriculture, livestock). When forecasting 12 months ahead, the overall index and energy are actually among the best (ranking third and second) according to TP (see Table 8).
The vast majority (all but one) of the best performing threshold models with respect to correctly forecasting TPs for the aggregate index and the energy sector rely on a threshold variable that is connected to oil (Δoil or COR-oil). All models for the aggregate index and for energy, except for one case, contain the oil stock-to-use ratio as a determinant. Best models for precious metals according to TP are either based on a threshold variable related to oil or have the oil stock-to-use ratio among the explanatory variables. The same holds for industrial metals and livestock. For all indices, best models according to TP rely on an oil-related threshold variable for a forecast horizon of 12 months. For all indices (but agriculture), the REER is included in the best model (according to TP) for a 12-month forecast horizon.¹⁸

Notes (Table 8). The table shows the performance criteria of best models in "Smaller class of models" and of best models in "Larger class of models" for different GSCI sectors and different forecast horizons. The best model in the smaller class of models (left panel) is always better than, or at least as good as, the best model in the larger class of models (right panel). In the smaller class of models, best models are always threshold models; in the larger class of models, in 25 out of the total of 144 cases, the best model is not a threshold model. Light petrol shading indicates the cases when the best model is not a threshold model.

| CONCLUSIONS
In this paper, we present overwhelming evidence that allowing for regime-dependent dynamics in models for commodity prices leads to improvements in predictive ability. This follows from the fact that the characteristics of the dynamics of commodity prices and their interactions with other variables are not constant over time but differ depending on particular phenomena (e.g., periods of high and low volatility in the equity markets, good and bad economic times, or the level of inflation). If these regimes can be properly defined from the data, the stability of dynamics and interactions within particular regimes allows for better predictions. However, the nature of these improvements also differs across predictive measures and sectors.
We assess the quality of commodity forecasts with a variety of different performance measures. In addition to the MSE, the traditional forecast performance measure used in many studies, we also consider measures that evaluate DA, DV, the ability to predict TPs, and the returns implied by a simple trading strategy based on commodity price forecasts. These additional profit-based measures do not directly assess forecast accuracy but relate to other dimensions of forecasting quality and may be more relevant for particular applications in policy and applied work. We create an econometric modeling framework to predict commodity price dynamics as captured by the changes in an overall commodity price index, the S&P Goldman Sachs Commodity Index, as well as in five sub-indices (energy, industrial metals, precious metals, agriculture, livestock). We consider short-term and long-term forecast horizons (ranging from one month to twelve months) and use monthly observations in the period 1980-2018. Our forecast models include threshold models that are based on different threshold variables.
We provide a rich set of empirical results. In addition to the forecast performance comparison of threshold and linear models, we investigate the threshold variables and explanatory variables that imply "best" models, the structural pattern of evaluation criteria across different regimes, and best sector-specific forecast performance. We observe that threshold models with volatility in equity markets defining the states of the economy seem to perform better in times of low volatility than in times of high volatility with respect to loss measures, while, on the other hand, they seem to perform better in times of high volatility than in times of low volatility with respect to profit-based measures. Our results suggest that an interesting trade-off appears between loss and profit measures, which implies that the particular aim of the prediction exercise carried out plays a very important role in terms of defining which set of models is the best to use. The optimal specifications for applications where the metrics for success are related to systematically predicting the direction of change of commodity prices accurately may thus be systematically different from those aimed at providing point predictions with an absolute minimal distance to the realized values. In addition, the positive results found in the paper for threshold models (as compared to part of the literature) are also related to the fact that we exploit a large specification space as compared with other studies, both in terms of potential covariates and threshold variables.
The importance of the oil market as a determinant of commodity price dynamics is reflected in the results of our analysis, with oil-related variables appearing in the best forecasting models for TPs (either as a covariate or a threshold variable) in the aggregate GSCI, energy, and precious metals models. This result indicates that particular oil price dynamics may act as a leading indicator of changes in trends in commodity prices, and its inclusion in econometric specifications aimed at predicting TP probabilities may lead to significant improvements in forecasting ability.
Exploiting the potential for improving predictive ability in order to refine the specification and estimation of models may be a fruitful avenue of future research. In particular, entertaining estimation methods that differ from least squares (and thus do not build on the minimization of in-sample squared errors) or Bayesian methods with suitable prior specifications could lead to further improvements in the prediction of commodity prices. Enlarging the set of possible models that account for nonlinearities to include smooth transition in the parameters also appears as a natural next step that builds upon the results presented in this study, as does the implementation of threshold dynamic factor models in the spirit of the specifications in Massacci (2017).
best models when a larger class of models is included can be obtained from the authors upon request.

FIGURE 2 Number of threshold models outperforming the best linear model. Note: The graphs show a comparison of the best threshold model (for a given threshold variable, including the dependent variable) with the best linear model. The numbers indicate how many of the best threshold models outperform the best linear model. The maximum possible number is nine.

FIGURE 4 Best threshold variables according to mean squared error (MSE) and return. Note: The graph indicates which threshold variables yield the best (1), second best (2), …, to the worst (8) performance according to MSE and return.

FIGURE 5 Inclusion of explanatory variables in best threshold model. Note: The graph shows the number of times a given explanatory variable (CLI, REER, stock, stu, GSCI aggregate) is included in the best threshold model (aggregated over the nine different threshold variables). The maximum number possible is nine.

FIGURE 6 Mean squared error, return, and directional value, 2-year rolling average (threshold model with stock market volatility as the threshold variable, aggregate GSCI, 1-month forecast horizon).

FIGURE 7 Returns and mean squared error (MSE) of best threshold models for different GSCI sectors. Note: The graph shows the returns (left) and MSE (right) of best threshold models for different GSCI sectors and different forecast horizons.

FIGURE 8 Loss and profit measures for different S&P Goldman Sachs Commodity Index (GSCI) sectors. Note: The graphs show MAE, MSE, DA, DV, and return for different GSCI sectors and different forecast horizons.
TABLE 2 Summary statistics for commodity returns. Note: The table reports the mean, standard deviation, skewness, and kurtosis for monthly commodity returns over the sample period from January 1980 to December 2018. Commodity returns are computed from the S&P GSCI commodity indices. The last column shows returns of the world stock market index.

FIGURE 1 Number of threshold models outperforming the self-exciting threshold specification. Note: The heatmap shows the result of comparing the best threshold model for a given threshold variable other than the dependent variable with the best threshold model for the threshold variable being the dependent variable. The numbers indicate the number of threshold variables where the best model outperforms the best self-exciting threshold model. Eight different threshold models are employed.
TABLE 3 Performance of best threshold model for the spread as a threshold variable against the corresponding linear model for the aggregate commodity price index. Note: * (**/***) indicates rejection of the null hypothesis of equal forecasting accuracy between the best threshold model and the corresponding linear model at 10% (5%/1%). The four-digit combination of ones and zeros below the model shows the inclusion (1) or exclusion (0) of the explanatory variables CLI, REER, stock market index, and oil stock-to-use ratio. Petrol shading indicates that the best threshold model outperforms the best linear model. Light petrol shading shows better total performance between best threshold model and corresponding linear model. Red shading indicates better regime-based performance between best threshold model and corresponding linear model. Regime 1 is defined by $\text{spread}_{t-k} < \gamma$, while regime 2 is defined by $\text{spread}_{t-k} > \gamma$.
TABLE 4 Deviation of average performance of threshold models from performance of the best threshold model divided by deviation of best linear model from best threshold model. Note: Each figure is calculated as the average deviation of performance of a threshold model (across different threshold variables) from the performance of the best threshold model divided by the deviation of the best linear model from the best threshold model. Deviations are taken in absolute values, so the numbers are always positive. Note that the best threshold model is always better than the best linear model; the average threshold model, however, may be worse than the best linear model (implied by a ratio larger than one). The smaller the ratio, the better the average threshold model compared with the best linear model. Light petrol shading indicates smallest deviation of average threshold model compared with best linear model; red shading indicates largest deviation, across commodity sectors. Abbreviations: DA, directional accuracy; DV, directional value; MAE, mean absolute error; MSE, mean squared error.
TABLE 5 Performance of best threshold model (threshold variable = volatility) and of corresponding linear model for the aggregate index. Note: * (**/***) indicates rejection of the null hypothesis of equal forecasting accuracy between the best threshold model and the corresponding linear model at 10% (5%/1%). The four-digit combination of ones and zeros below the model shows the inclusion (1) or the exclusion (0) of the explanatory variables CLI, REER, stock market index, and oil stock-to-use ratio. Petrol shading indicates that the best threshold model outperforms the best linear model. Light petrol shading shows better total performance between best threshold model and corresponding linear model. Grey shading indicates better performance between the two regimes for the best threshold model. Regime 1 is defined by $\text{VOLA}_{t-k} \le \gamma$, while regime 2 is defined by $\text{VOLA}_{t-k} > \gamma$. Abbreviations: DA, directional accuracy; DV, directional value; MAE, mean absolute error; MSE, mean squared error.
TABLE 6 Performance of best threshold model across GSCI sectors.
Notes: The table shows the forecast performance of best threshold models for different GSCI sectors and different forecast horizons. Light petrol shading indicates best performance across GSCI sectors; red shading indicates worst performance. Abbreviations: DA, directional accuracy; DV, directional value; MAE, mean absolute error; MSE, mean squared error.

TABLE 8 Best models in smaller and larger class of models.