Sunday, February 22, 2009

Crunching Numbers for Fun

wine quality = 12.145 + 0.00117 winter rainfall + 0.0614 average growing season temperature - 0.00386 harvest rainfall
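
To make the formula concrete, here is a minimal Python sketch that simply evaluates it. The function and argument names are my own, the units (millimeters for rainfall, degrees Celsius for temperature) are an assumption because the formula above does not state them, and the sample inputs are made-up values for illustration, not real Bordeaux data.

```python
def wine_quality(winter_rain, growing_temp, harvest_rain):
    """Evaluate the linear formula quoted above.

    Units are assumed (rainfall in mm, temperature in deg C);
    the post itself does not state them.
    """
    return (12.145
            + 0.00117 * winter_rain
            + 0.0614 * growing_temp
            - 0.00386 * harvest_rain)


# Hypothetical weather for two imaginary vintages (not real data).
hot_dry = wine_quality(winter_rain=600, growing_temp=18.0, harvest_rain=50)
cool_wet = wine_quality(winter_rain=600, growing_temp=16.0, harvest_rain=250)

# Compare the two scores: the signs of the coefficients mean a hotter
# growing season and a drier harvest both push the score up.
print(hot_dry > cool_wet)
```

The signs of the coefficients carry the whole story: winter rain and warmth add to the score, harvest rain subtracts from it.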

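In the first class of the semester, the instructor for Data Mining and Machine Learning handed out a reading: "Super Crunchers" by Ian Ayres. That is where I saw this formula. The person who wrote it is Orley Ashenfelter, an economics professor at Princeton and a wine lover. The article explains the formula like this: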

... ...

"It's really a no-brainer," he (Ashenfelter) said. "Wine is an agricultural product dramatically affected by the weather from year to year." Using decades of weather data from France's Bordeaux region, Orley found that low levels of harvest rain and high average summer temperatures produce the greatest wines. The statistical fit on data from 1952 through 1980 was remarkably tight for the red wines of Burgundy as well as Bordeaux.

Bordeaux are best when the grapes are ripe and their juice is concentrated. In years when the summer is particularly hot, grapes get ripe, which lowers their acidity. And, in years when there is below-average rainfall, the fruit gets concentrated. So it's in the hot and dry years that you tend to get the legendary vintages. Ripe grapes make supple (low-acid) wines. Concentrated grapes make full-bodied wines.

... ...

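Because red wine has to be aged for anywhere from a few months to several years before it can be drunk, Ashenfelter wanted a way to predict a wine's quality before anyone could actually taste it. There are professional experts who do this kind of work, and when they saw Ashenfelter's formula they were furious. The sacred, mysterious craft of wine tasting had been squeezed into a few dry numbers by someone with no sense of romance; the reaction really was "somewhere between violent and hysterical". What annoyed the wine experts, with their many years of wine-spitting experience, even more was that after a few years the market showed Ashenfelter's predictions to be more accurate than theirs.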

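The vineyards of Bordeaux are surely technically mature and consistent in their output. Weather is one of the few factors that is hard to control. As an enthusiastic amateur homebrewer, I think Ashenfelter's formula makes a lot of sense. If I wrote a formula for my own homebrew, the choice of variables would be exactly the opposite. My home brewery is still in a phase of diligent R&D, and every variable that can change has changed every single time: primary fermentation time, secondary fermentation time, carbonation time, primary fermentation temperature, secondary fermentation temperature, carbonation temperature, the amount of priming sugar, the size, color, and placement of the bottles, the bottle-washing procedure, even whether the bottles got any sun. Weather, by contrast, matters little, because I buy ready-made wort. Quality surely varies from batch to batch, but next to the wild swings in the variables above, those small fluctuations can mostly be ignored. To keep the temperature constant during secondary fermentation and later storage, my brewery bought an extra-large refrigerator just for beer, so that variable can probably be dropped from the formula. To keep the temperature constant during primary fermentation and carbonation, my brewery ought to buy another extra-large refrigerator. The trouble is that our factory has no room for a third refrigerator.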

The same Data Mining instructor has a friend who used past years' award data and logistic regression to predict this year's Oscar for Best Picture:

Probability(Slumdog Millionaire = winner | X's) = .855

This one is easy to check: in a few hours we'll know whether it's right. (By the way, I love this movie and hope it wins.)
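
Logistic regression turns a weighted sum of inputs into a probability through the logistic (sigmoid) function. The friend's variables and coefficients are not given here, so the sketch below shows only that final mapping and its inverse; the function names are my own.

```python
import math


def sigmoid(score):
    """Map a linear score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def logit(p):
    """Inverse of sigmoid: the log-odds for a probability p."""
    return math.log(p / (1.0 - p))


# The log-odds score that corresponds to the reported probability of .855.
score = logit(0.855)

# Feeding that score back through the sigmoid recovers the probability.
print(round(score, 3), round(sigmoid(score), 3))
```

A score of zero corresponds to a 50% chance, so any probability above .5 means the model's weighted sum came out positive.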

============ A few hours later ============

Heh, it won :)

7 comments:

Anonymous said...

Does your factory need an accountant or a logistics person? Free beer and lodging would be pay enough.

Anonymous said...

Is a larger value better, or is it best when the value falls within some range?

bing said...

anonymous,

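Let me talk it over with Officer Pig. But if we had room to put a bed for you, our factory would most likely have put the third refrigerator there already :)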

anonymous,

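A probability lies between 0 and 1. This value says that, given all the conditions, Slumdog's chance of winning is 85.5%. That is very high, especially since the same formula gives the other films only about 5% each :)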

Anonymous said...

I am talking about the wine quality value, not that dog.

bing said...

If the value is wine quality, then higher should be better, I suppose :)

Anonymous said...

Interesting post, I love statistics, they can be so fun... I hope he tasted each wine while developing that regression model and is able to verify his formula pretty much on the spot, we call that model training....

bing said...

the model was used to "predict", so i guess he couldn't taste before the wine became available. but with the outcome variable being the tested wine quality data over 38 years, it says something about experience too, maybe as good as the experience of someone who actually tasted each and every bottle over those years?