This posting was inspired by a conversation I had with one of my students a few weeks ago about the Wikipedia page for SSIM, the Structural Similarity Image Metric, and by a discussion on the topic today in my lab.
First some background
Image Quality Assessment (IQA) is about trying to estimate how good an image would seem to a human. This is especially important for video, image, and texture compression. In IQA terms, if you are doing a full reference comparison, you have the reference (perfect) image and a distorted image (e.g. after lossy compression), and you want to know how good or bad the distorted image is.
Many texture compression papers have used Mean Square Error (MSE), or one of the other measures derived from it: Root Mean Square Error (RMS) or Peak Signal to Noise Ratio (PSNR). All of these are easy to compute, but they are sensitive to a variety of things that people don’t notice: global shifts in intensity, slight pixel shifts, etc. IQA algorithms try to come up with a measurement that’s more connected with the differences humans notice.
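To make the three measures concrete, here is a minimal sketch in plain Python. It assumes each image is a flat list of 8-bit pixel values (so the peak is 255); the function names and the tiny four-pixel "image" are made up for illustration. It shows how a uniform intensity shift, which a viewer would barely register, still produces a sizeable error.

```python
import math


def mse(reference, distorted):
    """Mean square error between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("images must have the same number of pixels")
    total = sum((r - d) ** 2 for r, d in zip(reference, distorted))
    return total / len(reference)


def rmse(reference, distorted):
    """Root mean square error: the square root of MSE, in pixel-value units."""
    return math.sqrt(mse(reference, distorted))


def psnr(reference, distorted, peak=255.0):
    """Peak signal to noise ratio in decibels; infinite for identical images."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical four-pixel image and a copy with a global +10 intensity shift.
reference = [50, 100, 150, 200]
shifted = [p + 10 for p in reference]

print(mse(reference, shifted))   # every pixel is off by 10, so MSE is 10^2
print(rmse(reference, shifted))
print(psnr(reference, shifted))
```

All three numbers are driven by the same per-pixel differences, which is why they share the same blind spots.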
How does an IQA researcher know if their algorithm is doing well? Well, there are several databases of original and distorted images together with the results of human experiments comparing or rating them (e.g. Ponomarenko et al.). You want the results of your algorithm to correlate well with the human data. Most of the popular algorithms have been shown to match the human experiments better than MSE alone, and this difference is statistically significant (p=0.05).
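As a rough sketch of what "correlate well with the human data" can mean, the snippet below computes a Spearman rank correlation between human ratings and a metric's scores. The choice of Spearman is an assumption here (the studies may use other correlation measures), all the numbers are invented for illustration, and the significance testing is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: mean human ratings for five distorted images,
# and the scores two made-up metrics assign to the same images.
human_scores = [4.8, 4.1, 3.0, 2.2, 1.5]
metric_a = [0.98, 0.95, 0.80, 0.70, 0.40]  # same ordering as the humans
metric_b = [0.98, 0.70, 0.95, 0.40, 0.80]  # partly scrambled ordering

print(spearman(human_scores, metric_a))
print(spearman(human_scores, metric_b))
```

A metric that ranks the images the way people do scores near 1; one that shuffles the ranking scores lower, even if its raw values look plausible.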
What is wrong with IQA?
There are a couple of problems I see that crop up from this method of evaluating image quality assessment algorithms.
First, the data sets are primarily photographs with certain types of distortions (different amounts of JPEG compression, blurring, added noise, etc.). If your images are not photographs, or if your distortions are not well matched by the ones used in the human studies, there’s no guarantee that any of the IQA results will actually match what a human would say. In fact, Cadik et al. didn’t find any statistically significant differences between metrics when applied to their database of rendered image artifacts, even though there were statistically significant differences between these same algorithms on the photographic data.
Second, even just considering those photographic datasets, there is a statistically significant difference between the user study data and every existing IQA algorithm. Ideally, there would be no significant difference between image comparisons from an IQA algorithm and how a human would judge those same images, but we’re not there yet. We can confidently say that one algorithm is better than another, but none are as good as people.
Is SSIM any good?
SSIM (Wang et al.) is one of the most popular image quality assessment algorithms. Some IQA algorithms try to mimic what we know about the human visual system, but SSIM just combines computational measures of luminance, contrast, and structure to come up with a rating. It is easy to compute and correlates pretty well with the human experiments.
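Here is a minimal sketch of the SSIM computation on a single patch, using the combined form of the luminance, contrast, and structure terms with the commonly used constants K1 = 0.01 and K2 = 0.03. It is a simplification: the full algorithm computes this over small sliding windows (Gaussian-weighted in the original paper) and averages the results, whereas this treats the whole input as one unweighted patch. The function name and pixel data are made up for illustration.

```python
def ssim_patch(x, y, peak=255.0, k1=0.01, k2=0.03):
    """SSIM of two equal-length patches given as flat lists of pixel values.

    Simplified sketch: one unweighted window, no sliding or averaging.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("patches must have the same length, at least 2")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Sample variances and covariance (divide by n - 1).
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Small constants that keep the ratios stable when means or variances are near zero.
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure


# Hypothetical four-pixel patch and a copy with a global +10 intensity shift.
patch = [50, 100, 150, 200]
brighter = [p + 10 for p in patch]

print(ssim_patch(patch, patch))     # identical patches score 1
print(ssim_patch(patch, brighter))  # a uniform shift barely lowers the score
```

The uniform shift leaves the variances and covariance untouched, so only the luminance term drops slightly. The same shift gives an MSE of 100, which is the kind of mismatch with perception that SSIM is meant to avoid.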
The Wikipedia article (at least as of this writing) mentions a paper by Dosselmann and Yang that questions SSIM’s effectiveness. In fact, the Wikipedia one-sentence summary of the Dosselmann and Yang paper is egregiously wrong (“show […] that SSIM provides quality scores which are no more correlated to human judgment than MSE (Mean Square Error) values.”). That’s not at all what that paper claims. The SSIM correlation to the human experiments is 0.9393, while MSE is 0.8709 (where correlations go from -1 for complete inverse correlation, to 0 for completely uncorrelated, to 1 for completely correlated). Further, the difference is statistically significant (p=0.05). The paper does “question whether the structural similarity index is ready for widespread adoption”, but definitely doesn’t claim that it is equivalent to MSE. They do point out that SSIM and MSE are algebraically related (that the structure measure is just a form of MSE on image patches), but that’s not the same as equivalent. That MSE in image patches does better than MSE over the whole image is the whole point!
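One way to see the algebraic relationship (a standard identity, not necessarily the derivation used in the paper) is to write the MSE of a patch in terms of the same means, variances, and covariance that SSIM uses. The identity is exact when the variances and covariance are computed with a 1/N normalization; the symbols follow the usual SSIM notation and are assumptions here, since the post itself gives no formulas.

```latex
\mathrm{MSE}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2
                   = (\mu_x - \mu_y)^2 + \sigma_x^2 + \sigma_y^2 - 2\sigma_{xy}

\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)\,(2\sigma_{xy} + C_2)}
                           {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

Both are built from the same five patch statistics, but MSE adds them up while SSIM takes ratios, so the two can rank the same pair of images quite differently.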
Overall, when it comes to evaluating image quality, I’m probably going to stick with SSIM, at least for now. There are some better-performing metrics, but SSIM is far easier to compute than anything else comparable that I’ve found (yet). It definitely does better than the simpler MSE or PSNR on at least some types of images, and is statistically similar on others. In other words, if the question is “will people notice this error”, SSIM isn’t perfect, but it’s also not bad.
Extending to other areas?
We had some interesting discussion about whether this kind of approach could apply in other places. For example, you could set up a user study to compare a bunch of cloth simulations, maybe changing grid, step size, etc. From that data alone, you directly have a ranking of those simulations. However, if you use that dataset to evaluate some model that could measure the simulation and estimate the quality, you might then be able to use that assessment model to say whether a new simulation was good enough or not. Like the image datasets, the results would likely be limited to the types of cloth in the initial study. If you only tested cotton but not silk, any quality assessment measure you built wouldn’t be able to tell you much useful about how well your simulation matches on silk. I’m not likely to try doing these tests, but it’d be pretty interesting if someone did!
Ponomarenko, Lukin, Egiazarian, Astola, Carli, and Battisti, “Color Image Database for Evaluation of Image Quality Metrics”, MMSP 2008.
Sheikh, Sabir, and Bovik, “A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms”, IEEE Transactions on Image Processing, v15n11, 2006.
Cadik, Herzog, Mantiuk, Myszkowski, and Seidel, “New Measurements Reveal Weaknesses of Image Quality Metrics in Evaluating Graphics Artifacts”, ACM SIGGRAPH Asia 2012.
Wang, Bovik, Sheikh, and Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity”, IEEE Transactions on Image Processing, v13n4, 2004.
Dosselmann and Yang, “A Comprehensive Assessment of the Structural Similarity Index”, Signal, Image and Video Processing, v5n1, 2011.