Are you using test log-likelihood correctly?

Abstract: Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
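Point (ii) can be illustrated with a toy sketch (all data and model settings below are invented for illustration, not taken from the paper): a model with accurate point predictions but overconfident uncertainty can win on RMSE while losing badly on average test log-likelihood to a model with worse point predictions but well-calibrated uncertainty.

```python
import math

# Hypothetical held-out observations (invented for this sketch).
y_test = [0.0, 1.0, -1.0, 2.0, -2.0]

def gaussian_logpdf(y, mu, sigma):
    """Log density of y under Normal(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

def avg_test_loglik(y, mus, sigma):
    """Average predictive log-likelihood over the test set."""
    return sum(gaussian_logpdf(yi, mi, sigma) for yi, mi in zip(y, mus)) / len(y)

def rmse(y, mus):
    """Root mean squared error of the point predictions."""
    return math.sqrt(sum((yi - mi)**2 for yi, mi in zip(y, mus)) / len(y))

# Model A: small point-prediction error (0.1) but badly overconfident (sigma = 0.01).
mus_a, sigma_a = [yi + 0.1 for yi in y_test], 0.01
# Model B: larger point-prediction error (0.5) but well-calibrated spread (sigma = 1.0).
mus_b, sigma_b = [yi + 0.5 for yi in y_test], 1.0

print(rmse(y_test, mus_a), avg_test_loglik(y_test, mus_a, sigma_a))
print(rmse(y_test, mus_b), avg_test_loglik(y_test, mus_b, sigma_b))
```

Here Model A attains the lower RMSE (0.1 versus 0.5), yet Model B attains the far higher average test log-likelihood, because the overconfident predictive density of Model A assigns vanishing probability to its own residuals. The two metrics thus rank the models in opposite orders.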


An earlier version of this paper was presented as a poster at the I Can’t Believe It’s Not Better workshop on empirical falsification at NeurIPS 2022. The workshop paper is available at this link.

Download the journal version of the paper here.

A pre-print is available here.

Recommended citation: Deshpande, S.K., Ghosh, S., Nguyen, T.D., and Broderick, T. (2024). "Are you using test log-likelihood correctly?" Transactions on Machine Learning Research.