How well do we do?

To evaluate the performance of ViEWS, we predict out-of-sample for the January 2014 - December 2016 period, for which observed conflict data are available (see the evaluation section for a discussion of the suite of evaluation metrics used in ViEWS).

We estimated the models on data from January 1990 - December 2010 and calibrated the predictions using data for the January 2011 - December 2013 period. The calibrated models are then used to generate forecasts for the January 2014 - December 2016 evaluation (test) period, and these forecasts are compared to the observed values for that period. For detailed results, see the ViEWS paper (pdf).
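The three-way temporal partition described above can be sketched as follows. This is a minimal illustration, not the ViEWS code: the names `PERIODS`, `assign_period`, and `months_between` are hypothetical, and months are represented by the first day of each month purely for convenience.

```python
from datetime import date

# Period boundaries as stated in the text (inclusive, monthly resolution).
# The dictionary layout is an assumption made for this sketch.
PERIODS = {
    "train": (date(1990, 1, 1), date(2010, 12, 1)),
    "calibrate": (date(2011, 1, 1), date(2013, 12, 1)),
    "test": (date(2014, 1, 1), date(2016, 12, 1)),
}


def assign_period(month):
    """Return the partition name for a month (first day of month), or None."""
    for name, (start, end) in PERIODS.items():
        if start <= month <= end:
            return name
    return None


def months_between(start, end):
    """List the first day of every month from start to end, inclusive."""
    out = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        out.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


# Count how many months fall into each partition.
counts = {}
for m in months_between(date(1990, 1, 1), date(2016, 12, 1)):
    name = assign_period(m)
    counts[name] = counts.get(name, 0) + 1

print(counts)
```

Keeping the partitions strictly ordered in time means that neither estimation nor calibration sees any data from the test period.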

Overall performance

Table 1 shows summary statistics for the predictive performance of the ViEWS ensembles for state-based conflict (SB), one-sided violence (OS), and non-state violence (NS). Since the performance of the ensembles cannot be interpreted in isolation, we include the predictive performance of a baseline model for each outcome-level combination. Each baseline model predicts the probability of conflict in each group (country or PRIO-GRID cell) in the test period to equal the mean probability of conflict in the same group in the training period.
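The baseline described above amounts to a per-group historical mean. A minimal sketch, assuming binary outcomes and using hypothetical names (`fit_group_mean_baseline`, `predict_baseline`) and toy data that are not from ViEWS; the `default` used for groups absent from the training data is also an assumption:

```python
from collections import defaultdict


def fit_group_mean_baseline(train_rows):
    """Map each group id to its mean outcome in the training period.

    train_rows: iterable of (group_id, outcome) pairs, outcome in {0, 1}.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of outcomes, count]
    for group, outcome in train_rows:
        totals[group][0] += outcome
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


def predict_baseline(means, group_ids, default=0.0):
    """Predict the training-period mean for each group in the test period."""
    return [means.get(group, default) for group in group_ids]


# Toy training data: group "A" had conflict in 3 of 4 months, "B" in none.
train = [("A", 1), ("A", 0), ("A", 1), ("A", 1),
         ("B", 0), ("B", 0), ("B", 0), ("B", 0)]

means = fit_group_mean_baseline(train)
preds = predict_baseline(means, ["A", "B", "A"])
print(means, preds)
```

Because the prediction for a group never changes over the test period, such a baseline can only separate groups from each other, not months within a group.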

The table shows that the ensembles improve predictive performance considerably relative to these baseline models. For SB at the country-month (CM) level, for example, AUROC increases from 0.595 to 0.931, the Brier score decreases from 0.175 to 0.092, and AUPR increases from 0.345 to 0.797. The OS and NS ensembles show similar improvements. At the PRIO-GRID-month (PGM) level, the differences between the baseline models and the ensembles are also large. For SB, AUROC increases from 0.534 to 0.952 and AUPR from 0.005 to 0.247. The change in Brier score is smaller, from 0.00632 to 0.00589.

The discrepancy between the improvements in AUROC and AUPR on the one hand and in the Brier score on the other shows that the ensembles successfully sort observations into higher and lower probability, which is what AUROC and AUPR reward. At the same time, the modest improvement in the Brier score shows that the ensemble predicted probabilities for true positives are still quite far from 1. In other words, the ensembles are not making sharp predictions that clearly separate the classes on the probability scale. The pattern for NS and OS is consistent with that for SB.
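The gap between ranking metrics and the Brier score for rare events can be reproduced with a small synthetic example. The sketch below uses made-up data (1000 observations, 5 positives) and hypothetical helper names; it is not the ViEWS evaluation code. The "model" ranks every positive above every negative but assigns positives a probability of only 0.10, while the "baseline" predicts the base rate everywhere. AUPR is omitted to keep the sketch short.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def auroc(y_true, probs):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win (pairwise Mann-Whitney formulation).
    """
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic rare-event data: 5 positives among 1000 observations.
y = [1] * 5 + [0] * 995

# Baseline: the same base-rate probability for every observation.
baseline = [0.005] * 1000

# Model: perfect ranking, but positives only receive probability 0.10.
model = [0.10] * 5 + [0.004] * 995

for name, probs in [("baseline", baseline), ("model", model)]:
    print(name, "AUROC:", auroc(y, probs), "Brier:", brier_score(y, probs))
```

In this toy case AUROC moves from 0.5 to 1.0, while the Brier score only falls from roughly 0.0050 to roughly 0.0041, because the squared error on the few positives stays large as long as their predicted probabilities are far from 1.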

Level  Outcome  Model     AUROC    Brier    AUPR
CM     SB       baseline  0.59465  0.17544  0.34507
CM     SB       ensemble  0.93066  0.09248  0.79742
CM     OS       baseline  0.62301  0.13593  0.31912
CM     OS       ensemble  0.92341  0.07582  0.78521
CM     NS       baseline  0.65748  0.09701  0.32645
CM     NS       ensemble  0.92338  0.06157  0.67972
PGM    SB       baseline  0.53370  0.00632  0.00513
PGM    SB       ensemble  0.95198  0.00589  0.24656
PGM    OS       baseline  0.56548  0.00523  0.00468
PGM    OS       ensemble  0.95484  0.00503  0.19899
PGM    NS       baseline  0.53694  0.00394  0.00250
PGM    NS       ensemble  0.89649  0.00393  0.04929

Table 1: Summary statistics for predictive performance across all months in the test window.