baseball series probabilities
Mood: curious and mathy
Posted on 2006-06-30 09:36:00
Tags: baseball math
Words: 790

So when Rice was in the NCAA baseball tournament and the College World Series, I got to thinking that one baseball game, or even a three-game series, isn't really enough to determine which team is better. Presumably that's why MLB playoff series are 5 or 7 games, which certainly seems like enough to tell for sure which team is better. To figure out how true that is, I modeled it mathematically:

So, let's be formal here: the Astros are playing the Blue Jays, and the Astros have probability x of beating the Blue Jays in any given game (for 0 <= x <= 1). Assume that we have no prior knowledge of x - i.e. x is uniformly distributed from 0 to 1. Given that the Astros beat the Blue Jays in a single game, what is the probability that x >= .5 (that is, that the Astros are the "better" team)?

Since the probability that the Astros beat the Blue Jays in one game is simply x, and every value of x was equally likely beforehand, Bayes' rule says that after the win each value of x is weighted in proportion to x. So we're looking at the area under the curve y = x, which is of course the integral of x. The probability that x >= .5 is just (the integral of x from .5 to 1)/(the integral of x from 0 to 1) = (1/2 - 1/8)/(1/2) = 3/4.

(hmm, this would look a lot nicer in LaTeX)
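
Taking that hint, here is the one-game calculation typeset, with the Bayes step written out. The notation is my own addition for illustration: W is the event that the Astros win the game, and f(x) = 1 is the uniform prior density on [0, 1].

```latex
\[
P\left(x \ge \tfrac12 \mid W\right)
  = \frac{\int_{1/2}^{1} P(W \mid x)\, f(x)\, dx}
         {\int_{0}^{1} P(W \mid x)\, f(x)\, dx}
  = \frac{\int_{1/2}^{1} x \, dx}{\int_{0}^{1} x \, dx}
  = \frac{\tfrac12 - \tfrac18}{\tfrac12}
  = \frac34 .
\]
```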

The nice thing about this method is that it easily generalizes - let's say that the Astros beat the Blue Jays two times. Since the probability that the Astros win both games is x^2, the probability that x >= .5 is (the integral of x^2 from .5 to 1)/(the integral of x^2 from 0 to 1) = (1/3 - 1/24)/(1/3) = 7/8.
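
The same pattern gives a closed form for any number of straight wins. This is a sketch under the same uniform-prior assumption, where k (my notation, not used elsewhere in the post) is the number of consecutive games the Astros win; it assumes amsmath for \text.

```latex
\[
P\left(x \ge \tfrac12 \mid k \text{ straight wins}\right)
  = \frac{\int_{1/2}^{1} x^{k} \, dx}{\int_{0}^{1} x^{k} \, dx}
  = \frac{\frac{1}{k+1}\left(1 - 2^{-(k+1)}\right)}{\frac{1}{k+1}}
  = 1 - \frac{1}{2^{k+1}} .
\]
```

Plugging in k = 1 gives 3/4 and k = 2 gives 7/8, matching the two results above.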

And let's say the Astros beat the Blue Jays in a best-of-3 series. The probability that the Astros win the series is x^2 + 2*(1-x)*x^2 = 3*x^2 - 2*x^3, so the probability that x >= .5 is (the integral of 3*x^2 - 2*x^3 from .5 to 1)/(the integral of 3*x^2 - 2*x^3 from 0 to 1) = (1/2 - 3/32)/(1/2) = 13/16, which is .8125. (That's less than 7/8, which smells wrong to me at first, but it sort of makes sense: a best-of-3 win includes the 2-1 series, where the Astros also lost a game, while "winning 2 games" means winning both games played. Feel free to start checking my math at this point, though :-) )

If the Astros beat the Blue Jays in a best out of 5 series, the probability that the Astros win the series is x^3 + 3*(1-x)*x^3 + 6*(1-x)^2*x^3 = 10*x^3 - 15*x^4 + 6*x^5. So the probability that x >= .5 is (1/2 - 5/64)/(1/2) = 27/32, which is about .844.

Finally, for a best out of 7 series, the probability that the Astros win the series is x^4 + 4*(1-x)*x^4 + 10*(1-x)^2*x^4 + 20*(1-x)^3*x^4 = 35*x^4 - 84*x^5 + 70*x^6 - 20*x^7. So the probability that x >= .5 is (1/2 - 35/512)/(1/2) = 221/256, which is about .863.
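
For anyone who does want to check the math, here is a minimal sketch that redoes the series calculations with exact fractions. The function names (series_win_poly, integrate_poly, prob_better) are made up for this example, and it assumes the same model as above: independent games, each won with probability x, and x uniform on the given range.

```python
from fractions import Fraction
from math import comb


def series_win_poly(n_games):
    """Coefficients c[k] of x**k in P(win a best-of-n_games series)."""
    need = n_games // 2 + 1          # wins needed to clinch the series
    coeffs = [Fraction(0)] * (n_games + 1)
    for losses in range(need):       # the series ends need wins to `losses` losses
        # The clinching win comes last; the losses fall among the earlier games.
        ways = comb(need - 1 + losses, losses)
        # Expand ways * x**need * (1 - x)**losses with the binomial theorem.
        for j in range(losses + 1):
            coeffs[need + j] += ways * comb(losses, j) * (-1) ** j
    return coeffs


def integrate_poly(coeffs, lo, hi):
    """Exact integral of sum(c[k] * x**k) from lo to hi."""
    return sum(c * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))


def prob_better(n_games, lo=Fraction(0), hi=Fraction(1)):
    """P(x >= 1/2 | series won), with x uniform on [lo, hi]."""
    coeffs = series_win_poly(n_games)
    half = Fraction(1, 2)
    return integrate_poly(coeffs, half, hi) / integrate_poly(coeffs, lo, hi)


for n in (1, 3, 5, 7):
    p = prob_better(n)
    print(n, p, float(p))
```

For 1, 3, 5 and 7 games this gives 3/4, 13/16, 27/32 and 221/256, the same values as the hand calculations. The lo and hi arguments are there so the same function can be reused with a narrower range for x.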

Probabilities with 0 <= x <= 1
# games in series	Probability winner is "better" team
1	.75
3	.8125
5	.844
7	.863
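
A different way to sanity-check this table is to skip the integrals and just simulate the model. The sketch below is only an illustration: the function names, the trial count and the seed are all my own choices. It draws x uniformly, plays out a series, and among the series the Astros win, counts how often x was at least .5.

```python
import random


def astros_win_series(x, n_games, rng):
    """Play a best-of-n_games series; True if the Astros clinch it."""
    need = n_games // 2 + 1
    wins = losses = 0
    while wins < need and losses < need:
        if rng.random() < x:         # Astros win this game with probability x
            wins += 1
        else:
            losses += 1
    return wins == need


def estimate_prob_better(n_games, trials=100_000, seed=0):
    """Monte Carlo estimate of P(x >= 0.5 | Astros won the series)."""
    rng = random.Random(seed)
    won = won_and_better = 0
    for _ in range(trials):
        x = rng.random()             # uniform prior on [0, 1)
        if astros_win_series(x, n_games, rng):
            won += 1
            if x >= 0.5:
                won_and_better += 1
    return won_and_better / won


for n in (1, 3, 5, 7):
    print(n, round(estimate_prob_better(n), 3))
```

The estimates are noisy, so they should only agree with the table to about two decimal places at this trial count.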

So, this is all well and good, but the results seem a little unrealistic - I find it hard to believe that the better team wins 3 out of 4 times in just a single game. Let's try to remove some of the simplifying assumptions.

In the real world, the Astros beating the Blue Jays 100% of the time is just not going to happen. If we look at the final MLB standings from 2005, no team had a winning percentage below .3 or above .7, so let's try using .3 <= x <= .7 instead of 0 <= x <= 1. So we just have to recalculate all the integrals. This leads to the following results, assuming my math is correct (thanks to this numerical integrator):

Probabilities with .3 <= x <= .7
# games in series	Probability winner is "better" team
1	.6
3	.646
5	.678
7	.702

These probabilities seem a bit more realistic.
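
Here is a rough sketch of how the restricted-range numbers can be recomputed numerically. It is not the integrator used for the table; it uses a hand-rolled composite Simpson's rule, and the names series_win_prob, simpson and prob_better are invented for this example.

```python
from math import comb


def series_win_prob(x, n_games):
    """P(win a best-of-n_games series) when each game is won with probability x."""
    need = n_games // 2 + 1
    return sum(comb(need - 1 + losses, losses) * x ** need * (1 - x) ** losses
               for losses in range(need))


def simpson(f, lo, hi, steps=1000):
    """Composite Simpson's rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def prob_better(n_games, lo, hi):
    """P(x >= 0.5 | series won), with x uniform on [lo, hi]."""
    f = lambda x: series_win_prob(x, n_games)
    return simpson(f, 0.5, hi) / simpson(f, lo, hi)


for n in (1, 3, 5, 7):
    print(n, round(prob_better(n, 0.3, 0.7), 3))
```

Calling prob_better(n, 0.0, 1.0) instead should reproduce the first table, which is a handy check that the integrator is behaving.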

Finally, we've been assuming that x is distributed uniformly, but I'd say most teams are fairly close to each other in ability, so values of x near .5 should be more likely than values near the extremes. So let's weight x with a probability function over .3 to .7. We can use a "tent" function that peaks at .5 - something like 1 - 5 * abs(x - .5). (This hits the points (.3, 0), (.5, 1), and (.7, 0). It isn't normalized to have area 1, but the constant cancels when we take the ratio.) Again, this is relatively easy to calculate - we just multiply the expression we were integrating by (1 - 5 * abs(x - .5)). Doing this leads to our final results:

Probabilities with .3 <= x <= .7 and x weighted
# games in series	Probability winner is "better" team
1	.567
3	.598
5	.621
7	.639
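
The weighted version can be sketched the same way, multiplying the series-win probability by the tent function before integrating. As before, this is an illustrative sketch with made-up function names rather than the exact tool used for the table. The integral is split at .5 because the tent has a kink there, and the tent is left unnormalized since the constant cancels in the ratio.

```python
from math import comb


def series_win_prob(x, n_games):
    """P(win a best-of-n_games series) when each game is won with probability x."""
    need = n_games // 2 + 1
    return sum(comb(need - 1 + losses, losses) * x ** need * (1 - x) ** losses
               for losses in range(need))


def tent(x):
    """Unnormalized prior weight: 1 at x = 0.5, falling to 0 at 0.3 and 0.7."""
    return max(0.0, 1 - 5 * abs(x - 0.5))


def simpson(f, lo, hi, steps=1000):
    """Composite Simpson's rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def prob_better_weighted(n_games, lo=0.3, hi=0.7, weight=tent):
    """P(x >= 0.5 | series won) when the prior on x is proportional to weight."""
    f = lambda x: weight(x) * series_win_prob(x, n_games)
    upper = simpson(f, 0.5, hi)      # prior mass times likelihood for x >= 0.5
    lower = simpson(f, lo, 0.5)      # same for x < 0.5, split at the kink
    return upper / (upper + lower)


for n in (1, 3, 5, 7):
    print(n, round(prob_better_weighted(n), 3))
```

Passing a different weight function (for example a narrower or flatter tent) is an easy way to see how sensitive the numbers are to the choice of prior.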

So there you have it - under this model, even a 7 game series will only pick the better team 64% of the time. The probability function may have been a little too harsh here, so the 70% in table 2 might be a better guideline.

To make this more accurate, we should recognize that even if the Astros have an x chance of beating the Blue Jays, on any particular night the two starting pitchers add a bit of variance to that figure. This might be interesting to explore at some point.

Thanks for reading this far! Comments (especially pointing at mistakes) are most welcome.
