Baseball Over-Under Estimation
Dave Zes
May 14, 2006
1 Introduction
A baseball (or football, hockey, cricket...) over-under estimate is a “median” value; it is that value
which evenly divides the frequency population of total run outcomes which has been augured for
some game. That is, the estimate is indifferent to the torque of outliers. From this follows another,
perhaps more arcane fact: a frequentist approach to optimizing some system of predicting total
runs will rely on the L1 metric, the absolute difference between the predicted and actual value.
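The link between the median and the L1 metric can be checked numerically. The following is a minimal sketch on made-up run totals (the data and the helper name `mean_abs_error` are illustrative assumptions, not part of the study): it scans a grid of candidate predictions and compares the one with the smallest mean absolute error against the sample median.

```python
import statistics

def mean_abs_error(prediction, outcomes):
    """Mean L1 distance between one fixed prediction and each observed outcome."""
    return sum(abs(prediction - y) for y in outcomes) / len(outcomes)

# Hypothetical total-run outcomes, including one outlier (31).
outcomes = [5, 7, 8, 9, 9, 10, 12, 13, 31]

# Candidate predictions on a 0.1 grid: 0.0, 0.1, ..., 40.0
candidates = [c / 10 for c in range(0, 401)]
best = min(candidates, key=lambda c: mean_abs_error(c, outcomes))

print("L1-optimal prediction:", best)
print("sample median:        ", statistics.median(outcomes))
print("sample mean:          ", statistics.mean(outcomes))
```

On this toy list the L1-optimal prediction coincides with the median, while the mean is pulled upward by the single outlier.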
By definition, the median value from a list of natural numbers will be some natural number (or, for
a list of even length, possibly the midpoint between two of them). In our context, this is very
dissatisfying, as we are dealing with a coarse frequency, and this integer middle value is not very
descriptive of the frequency of total runs near the middle. Put another way, if our best guess at
total runs for some game is 9.21, then we have violated the definition of median value. What is
needed is a modified, real-valued median, one optimally descriptive of near-central occupancy.
Passing our discrete distribution into a continuous one is hindered slightly because, in baseball,
ties are not permitted, so even totals are disfavored. The following method is a facile solution
whereby the discrete histogram is interpolated linearly. The resulting middle value is called a
“zedian”, and is denoted by $\zeta$.
2 Study
[Figure: Histogram of MLB Tot Runs, 1984−2005. x-axis: Total Runs Scored; y-axis: Frequency.]
[Figure: Interpolated Distribution of MLB Tot Runs, 1984−2005. x-axis: Total Runs Scored; y-axis: xcnts.]
To construct our interpolated distribution, we connect the counts at each pair of adjacent x values
(Tot Runs) with a straight line. The line connecting $(x_0, y_0)$ to $(x_1, y_1)$ will be given by
$a_1 x + b_1$.
Let us call the area under our interpolated frequency distribution $A$.
Let the particular gap within which $A$ is evenly divided be called $I$, defined on the domain by
$x_1, x_2$, to which $y_1, y_2$ correspond, and let $A_I$ be its area.
Let $A_W$ be the area west of $I$, $A_E$ the area east of $I$.
Let the function over $I$ be called $a_I x + b_I$. (Do not confuse the $I$ subscript with the number 1
subscripts above.)
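The bookkeeping above (gap areas, the gap I, and the areas to its west and east) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `locate_gap` and the toy `counts` list are hypothetical, counts are indexed by total runs starting at 0, and every gap has width 1, so the area under each line segment is a trapezoid.

```python
def locate_gap(counts):
    """Find the unit gap I containing the half-area point of the
    linearly interpolated histogram.

    counts[k] is the number of games with k total runs (k = 0, 1, 2, ...).
    Returns (x1, a_I, b_I, A_W, A_I, A_E).
    """
    # Trapezoid area under the line joining (k, counts[k]) to (k+1, counts[k+1]).
    gap_areas = [(counts[k] + counts[k + 1]) / 2 for k in range(len(counts) - 1)]
    total = sum(gap_areas)
    if total <= 0:
        raise ValueError("need at least two counts and a positive total area")

    west = 0.0
    for x1, area in enumerate(gap_areas):
        if west + area >= total / 2:
            a = counts[x1 + 1] - counts[x1]  # slope over a gap of width 1
            b = counts[x1] - a * x1          # intercept, so a*x1 + b == counts[x1]
            east = total - west - area
            return x1, a, b, west, area, east
        west += area


# Hypothetical toy histogram (not the MLB data).
counts = [0, 2, 6, 4, 8, 2, 0]
print(locate_gap(counts))
```

For the toy histogram the half-area point falls in the gap from 3 to 4, with equal areas on either side of where the split must land once the gap itself is divided.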
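Let $\zeta$ be our “zedian.”
So, we seek that
\[ A_W + \int_{x_1}^{\zeta} (a_I s + b_I)\,ds = A_E + \int_{\zeta}^{x_2} (a_I s + b_I)\,ds \]
The integral on the RHS is equal to $A_I$ minus the integral on the LHS, so we have
\[ A_W + 2\int_{x_1}^{\zeta} (a_I s + b_I)\,ds = A_E + A_I \]
\[ \iff 2\left[\frac{a_I}{2}\zeta^2 + b_I\zeta - \frac{a_I}{2}x_1^2 - b_I x_1\right] + A_W - A_E - A_I = 0 \]
\[ \iff a_I\zeta^2 + 2b_I\zeta + \left[A_W - A_E - A_I - a_I x_1^2 - 2b_I x_1\right] = 0 \]
So, when we have $A = a_I$, $B = 2b_I$, $C = A_W - A_E - A_I - a_I x_1^2 - 2b_I x_1$, we find our zedian by
\[ \zeta = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A} \]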
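The quadratic has two roots, and only one of them can lie inside the gap I. A minimal sketch of that last step follows; the function name `solve_zedian`, the tolerance `eps`, and the toy inputs are assumptions made for illustration, and the flat-segment branch (slope zero) is an added safeguard rather than something the derivation discusses.

```python
import math

def solve_zedian(a_w, a_e, a_i, a, b, x1, x2, eps=1e-9):
    """Solve A*z**2 + B*z + C = 0 and return the root lying in [x1, x2].

    a_w, a_e, a_i : areas west of, east of, and over the gap I
    a, b          : slope and intercept of the line over I
    """
    A = a
    B = 2 * b
    C = a_w - a_e - a_i - a * x1**2 - 2 * b * x1

    if A == 0:
        if B == 0:
            raise ValueError("the segment over the gap is identically zero")
        # Flat segment: the quadratic degenerates to B*z + C = 0.
        roots = [-C / B]
    else:
        root_disc = math.sqrt(B * B - 4 * A * C)
        roots = [(-B + root_disc) / (2 * A), (-B - root_disc) / (2 * A)]

    # Keep the root that actually falls inside the gap.
    for z in roots:
        if x1 - eps <= z <= x2 + eps:
            return z
    raise ValueError("no root inside the gap; check the inputs")


# Hypothetical toy inputs: gap [3, 4], line 4x - 8 over it, areas 10 / 6 / 6.
print(solve_zedian(a_w=10, a_e=6, a_i=6, a=4, b=-8, x1=3, x2=4))
```

Filtering on the interval avoids having to reason about which sign of the square root applies, since that depends on the sign of the slope over the gap.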
So, for example, looking at the whole of MLB games from 1984 to 2005 inclusive, we readily find
our values of interest:
$A_W = 21565$
$A_E = 22967$
$A_I = 4255$
$a_I = 1222$
$b_I = -6132$
$x_1 = 8$
And so the zedian is $\zeta = 8.6952$.
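As a check on the arithmetic, the reported values can be substituted into the quadratic step by step. This is a worked sketch only: $\zeta$ stands for the zedian, $x_2 = 9$ is assumed as the right end of the gap, and the second root is shown merely to make clear why it is discarded.

```latex
\begin{aligned}
A &= a_I = 1222, \qquad B = 2 b_I = -12264, \\
C &= A_W - A_E - A_I - a_I x_1^2 - 2 b_I x_1 \\
  &= 21565 - 22967 - 4255 - 1222 \cdot 64 + 2 \cdot 6132 \cdot 8 = 14247, \\
B^2 - 4AC &= 150405696 - 69639336 = 80766360, \\
\zeta &= \frac{12264 \pm \sqrt{80766360}}{2444} \approx \frac{12264 \pm 8987.01}{2444}.
\end{aligned}
```

The plus root gives $\zeta \approx 8.6952$, which lies in the gap $[8, 9]$; the minus root, roughly $1.3408$, falls outside it and is rejected.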