Thursday, July 28, 2016

The Judgment of Denver

At last week's 2016 Colorado Governor's Cup wine competition I, in my capacity with the Colorado Wine Industry Development Board, organized a wine tasting I'm calling "The Judgment of Denver." For those who do not know, in 1976 British wine merchant Steven Spurrier organized a blind tasting with French wine judges (wine journalists, critics, sommeliers, merchants and winemakers). The wines were broken into two flights: in the first, the judges rated 10 Chardonnays, 6 from California and 4 from Burgundy; in the second, they rated 10 Cabernet Sauvignon-based red wines, 6 from California and 4 from Bordeaux, France. In each flight, a wine from California – then a relatively little-known wine region – was declared the winner. Stag's Leap Wine Cellars' 1973 S.L.V. Cabernet Sauvignon was the top red, and Chateau Montelena's 1973 Chardonnay bested some of France's best – and most expensive – wines. The results were published to the world in TIME magazine and forever changed the American – and global – wine industry.

Each year at the Governor's Cup we do a calibration tasting so the judges can calibrate their palates/scores to benchmark wines (that benchmark isn't always high). This year, I decided to model the calibration portion of the competition after the 1976 "Judgment of Paris" because it was the 40th anniversary of that original blind tasting and Warren Winiarski, founder of Stag's Leap Wine Cellars, was once again one of the judges. These two facts seemed like reason enough to reenact the tasting once more – multiple retastings of the original wines, and a New Jersey vs. French wine tasting, have been reported on many times.

At the Denver tasting, 16 wine judges1 (wine journalists, critics, sommeliers, merchants and winemakers) from around the U.S. tasted Colorado wines against French and California wines in a blind setting. The French and California wines were selected from the same producers as in 1976, including the winning producers: Chateau Montelena and Stag's Leap Wine Cellars. Unlike the Judgment of Princeton, no First Growth Bordeaux were in the mix; hundred-dollar French and California wines are worthy enough competition! Prices of the French and California wines were $30–$110/bottle. I selected Colorado wines that would not appear in the Governor's Cup competition later that day; prices of the Colorado wines were $15–$50/bottle. The results were similarly surprising to those of the original tasting. Although the winner in each category was a California wine (Chalone Vineyard for the whites and Ridge Vineyards Estate Cabernet Sauvignon for the reds), the Colorado wines were qualitatively at the same level.

To prepare for the competition (which only included wines produced by licensed Colorado wineries), judges were asked to score the wines on a 10-point scale. Scores equate to the various medals handed out at wine competitions. Scores of 10 and 9 represent Gold Medal quality wines that exemplify their varietal character, terroir or stylistic expression in such a way that makes them memorable and desirable. Scores of 8 or 7 equate to Silver Medal wines with distinctive/unique qualities and characteristics that would cause one to recommend or purchase the wine. Scores of 6 or 5 indicate a well-made wine of good, solid quality that is above general commercial viability and that one would recommend to others to drink. A score of 4 or below results in no medal because, though the wine may be drinkable, it has little distinction beyond being sound wine – or it may be flawed. Judges were not told to aim for medal quotas, but to use their expert judgment to award – or not award – medals to worthy wines, and they could award as many medals in each category as they thought the quality merited.

None of the judges, except Warren Winiarski (I thought I should check with him on the idea), had any idea what they were tasting apart from a flight of Chardonnay and a flight of red Bordeaux blends/varietals. All sixteen judges were in the sensory lab in the Hospitality Learning Center at Metropolitan State University in Denver. The eight Chardonnays were served in Riedel Chardonnay stems and placed at the judges' seats before they entered the room. After spending approximately 20 minutes evaluating the wines, the judges verbally submitted their scores to my colleague at the CWIDB, who recorded them in a spreadsheet. The Chardonnay glasses were removed by a team of awesome volunteers, and eight Riedel Bordeaux stems with the eight red wines were placed at each judge's station. The judges evaluated the wines and again verbally submitted their scores.

After both flights were evaluated, the judges were provided with the identity of the 16 wines and the total summed score given to each wine by all the judges. The judges were genuinely shocked when the names of the famous French and California wines were revealed. They said that they really couldn't tell that there was a mix of California, Colorado and French wines in their glasses.

In addition to summing the judges' raw scores, I performed statistical analyses similar to those conducted for the Judgments of Paris and Princeton. In his statistical analyses of the Paris and Princeton tastings, Richard Quandt showed that evaluating the rank order of the tasting results was most statistically meaningful. To do so, each judge's scores are converted into ranking positions; 1 equals that judge's highest-scoring wine and (in our case) 8 is the lowest-scoring wine. The ranked sums are termed "points against." For this tasting, 16 points against would be the lowest (best) possible ranked sum and 128 points against would be the highest (worst). As Quandt identified in the analyses of the previous "Judgment" tastings, ranking the scores within each judge overcomes some problems caused by using the raw scores: some judges are strict with their scores, some use the entire range liberally and some cluster their scores very closely. Rank order provides comparative information about the judges' preferences, but information about the distance between the scores and the overall assessed quality is lost.
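For readers curious how the scores-to-ranks conversion works in practice, here is a minimal sketch in Python. The score matrix is made up for illustration (three judges, eight wines) – it is not the actual judge data:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical scores on the 10-point scale: one row per judge,
# one column per wine (higher score = better wine).
scores = np.array([
    [9, 7, 8, 6, 5, 4, 6, 3],
    [8, 8, 7, 7, 6, 5, 4, 4],
    [7, 9, 6, 8, 5, 6, 5, 4],
])

# Within each judge's row, rank the wines: 1 = that judge's top wine.
# Ties receive the average of the tied ranks, which is why half-points
# show up in the ranked sums.
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)

# "Points against" is the column sum of ranks: lower is better.
points_against = ranks.sum(axis=0)
```

With 16 judges and 8 wines, the best possible column sum is 16 × 1 = 16 and the worst is 16 × 8 = 128, matching the bounds described above.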

Using a rank-order analysis (similar to Quandt's analysis of both the Judgment of Paris and the Judgment of Princeton), the results of the "Judgment of Denver" are as follows in order of ranked sum (it just so happens the raw scores follow the same order):


Chardonnay

1. Chalone Vineyard 2011 (55 points against – 110 total score)a
1. Guy Drew Vineyard 2014* (55 points against – 108.5 total score)a
1. Freemark Abbey 2013 (55 points against – 107.5 total score)a
4. Joseph Drouhin 2008 (59.5 points against – 105.5 total score)a
5. Plum Creek 2014 Reserve* (61.5 points against – 104 total score)ab
6. Settembre Cellars 2011* (79.5 points against – 80.5 total score)bc
7. Stone Cottage Cellars 2014* (89.5 points against – 70 total score)c
8. Chateau Montelena 2013 CORKEDd

Red Bordeaux Blends/Varietals

1. Ridge Vineyards 2011 Estate Cabernet Sauvignon (39.5 points against – 123.5 total score)a
2. Chateau Montrose 2012 St-Estephe (59 points against – 112 total score)ab
3. Winery at Holy Cross Abbey 2012 Merlot Reserve* (60.5 points against – 109.5 total score)b
4. Bookcliff Vineyards 2010 Cabernet Franc Reserve* (69 points against – 105.5 total score)b
5. Freemark Abbey 2013 Cabernet Sauvignon (77.5 points against – 99 total score)bc
6. Stag's Leap Wine Cellars 2011 Fay Vineyard Cabernet Sauvignon (82.5 points against – 96 total score)c
7. Sutcliffe Vineyards 2010 Cabernet Sauvignon* (91 points against – 82 total score)c
8. Creekside Cellars 2010 Robusto* (97 points against – 79.5 total score)c

*denotes Colorado wine
a, b, c Wines with letters in common are not significantly different at α=0.05 according to multiple comparisons using the Tukey HSD test. Wines with no letters in common are significantly different.

Pretty interesting just looking at the list, but does it mean anything? Using Kendall's W coefficient of concordance for the judges' rankings, it actually does. The coefficient ranges between 0.0 (no correlation) and 1.0 (perfect correlation); for the Chardonnay data, W=0.387 with a p-value of essentially zero. The concordance was not as strong for the red wines, but W=0.244 with a p-value also essentially zero. These data suggest that the judges' rankings within each flight were similar enough that, if the tasting were conducted again, the results would be very much the same rather than arising by random chance. In other words, the probability that random chance could be responsible for this correlation is basically zero, indicating that the judges' preferences are strongly related.
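Kendall's W is simple enough to compute by hand. A minimal sketch (omitting the tie correction a full implementation would apply when judges submit tied ranks) might look like this:

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's coefficient of concordance, without tie correction.

    ranks: m x n array, each row one judge's ranking of n wines (1 = best).
    Returns W in [0, 1]; 1 means the judges agree perfectly.
    """
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    # S: squared deviations of the wines' rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m**2 * (n**3 - n))

# Three judges in perfect agreement over eight wines give W = 1.0;
# two judges with exactly opposite rankings give W = 0.0.
w_perfect = kendalls_w(np.tile(np.arange(1, 9), (3, 1)))
w_opposite = kendalls_w(np.vstack([np.arange(1, 9), np.arange(8, 0, -1)]))
```

The p-value comes from the related statistic χ² = m(n−1)W, which is approximately chi-square distributed with n−1 degrees of freedom.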

Taking this information, I conducted an analysis of variance test to confirm there were significant differences among the wines' rankings. For both the red and white flights, there were statistically significant differences among the rankings of the wines, as suggested by the concordance analysis. One drawback of ANOVA is that the test does not specify what the differences are, only that they exist.

To determine which wines' rankings were significantly different from the others, I used Tukey's Honest Significant Difference test. This method identified three groupings of wines in each flight that were indistinguishable within each group but significantly different from the wines in other groups. In each flight there was a group of significantly higher-ranked wines, a group in the middle and a few wines bringing up the rear.
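The ANOVA-then-Tukey sequence can be sketched with SciPy. The three "wines" below are synthetic rank-style data – two of similar quality and one clearly weaker – not the actual Denver results:

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(42)
# Synthetic per-judge rankings for three wines (lower = ranked better).
wine_a = rng.normal(2.0, 1.0, size=16)
wine_b = rng.normal(2.5, 1.0, size=16)
wine_c = rng.normal(6.5, 1.0, size=16)

# One-way ANOVA: is there any difference among the groups at all?
f_stat, p_anova = f_oneway(wine_a, wine_b, wine_c)

# Tukey's HSD: which specific pairs of wines actually differ?
res = tukey_hsd(wine_a, wine_b, wine_c)
# res.pvalue[i, j] < 0.05 means wines i and j differ significantly;
# pairs with p >= 0.05 would share a letter in a listing like the one above.
```

Tukey's HSD adjusts for the multiple pairwise comparisons, which is why it, rather than repeated t-tests, is the standard follow-up to a significant ANOVA.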

In the Chardonnay flight, it is easy to see that the top three wines were ranked essentially as equals. In fact, the top five Chardonnays were not statistically significantly different from each other. That is to say, if the tasting were conducted again, those top five wines would almost certainly finish as the top five in some order. The bottom two wines (not including the corked bottle of Montelena) were ranked significantly differently by the judges than the other five. In conclusion, two of the four Colorado wines were statistically and qualitatively indistinguishable from a Premier Cru white Burgundy and two fabled California producers.

I also conducted Tukey's HSD test for the red wines – with similar conclusions. The Ridge and the Montrose were not statistically different from each other qualitatively. In fact, the Ridge was statistically different from all the wines except the Montrose. However, the Chateau Montrose was also not significantly different from the Holy Cross Merlot, the Bookcliff Cabernet Franc or the Freemark Abbey Cabernet Sauvignon. That is pretty good company for Colorado to be in! Finally, the bottom four wines were all grouped together without any meaningful differentiation (yes, the Freemark Abbey appears in both the middle and bottom groupings – that is just how the statistics came out).

I had no idea what the results of the tasting would look like, but I can safely say that the Colorado wines more than held their own, standing shoulder to shoulder with some of the most famous wine producers in the world. I would not pretend to say that this tasting means Colorado wines are better than wines from California or France. However, I can say that the results pass both the eye test and statistical analyses, suggesting that the quality of wines produced in Colorado can be just as good as that of top French and California wines. I know this tasting and this report will have nowhere near the effect of the original 1976 Paris tasting, but I do hope it will open a few minds as to what Colorado wine is currently and what it can be in the future.

1The website is still being updated with the judges for the 2016 competition. They are: Alder Yarrow, Andrew Stover, Becca Yeaman-Irwin, Cindy Onkenglimm, Dave Buchanan, Denise Clarke, Gary Awdey, Glenn Exline, Jay Bileti, Jeff Siegel, Michael Wray, Mike Dunne, Roberta Backlund, Sarah Latham-Moore, Shawn Carney, Warren Winiarski


  1. You could probably have highlighted which were the Colorado wines for those of us less familiar with them. That asterisk is easy to miss.

  2. Chris, yes that asterisk is easy to miss. I've italicized the CO wines to help identify them.

  3. I hope this becomes a regular tradition!

  4. Why no second bottle for the "corked" Chateau Montelena 2013? Any professional tasting has several bottles of each wine (unless it is some ancient library wine) for just such a moment. Perhaps the "backup bottles" had somehow "moved on"?

    1. As a simple calibration tasting for the actual CO Governor's Cup competition, "backup" bottles for the non-CO wines were not purchased due to financial restrictions. Multiple bottles of wines in the competition were submitted and used when needed.

  5. Kyle, as a former economist I enjoyed this article on statistics and wine - although it's been so long I need to refresh my memory on the variance tests. Cheers

