>This is the second half of a two part series dealing with Wins Above Replacement (WAR), specifically the differences between Fangraphs’ version of WAR and Baseball-Reference’s version. Part one can be found here, which explains the idea of replacement level and the difference between the version of WAR for position players. If you don’t know anything about WAR, you should read that first. This post deals with pitchers. I’m still talking to myself, partially because it’s the easiest way to do this, and partially to combat crippling loneliness.
So. Pitchers and WAR. Go on.
Yeah. We want to know how many runs a pitcher saved over a replacement level pitcher, and how many wins that was worth. This is where it gets weird.
When we looked at position players, the biggest difference in WAR came from the way defense was handled. Baseball-Reference’s WAR for position players generally agrees with Fangraphs’ version of WAR for position players. Albert Pujols was the WAR leader for 2009 on both Fangraphs and B-R, and the same players appear on the top of both lists — the order is different, but the same guys are on top. It’s almost impossible to find a position player whose value differs much more than 1.5 WAR between the two sites, and a gap that wide only happens when the fielding statistics don’t agree. A good player is seen as a good player by both versions of WAR.
But the pitchers . . . oh dear, the pitchers. Here’s where you get the funny stuff. We’ll use Ricky Nolasco of the Florida Marlins as our guinea pig.
Wait, why Ricky Nolasco? Isn’t this a Mets blog? Why did you mislead me with a picture of Johan Santana?
Sorry. Yes, this is nominally a Mets blog, but I couldn’t find a useful enough Mets example. You’ll see why we’re going with Ricky Nolasco in a second. Here’s what Nolasco’s 2009 looks like in traditional statistics:
And here are the two WAR totals that generated:
Fangraphs: 4.2 WAR
Baseball-Ref: -0.3 WAR
Oh, well look at that. I’m all set to embrace Sabermetrics now. Where do I sign up?
Yeah, it’s a 4.5 win gap. He was either a top twenty pitcher or a AAA-level starter last season, depending on who you ask. That’s a problem if you want people to buy into WAR for pitchers.
How is a gap that large between the two versions of WAR even possible?
The problem is still the same thing: defense. We understand the offensive side of the game far better, because we can isolate the batter from his teammates. We know that Albert Pujols is responsible for hitting Albert Pujols’ home runs, and we credit Albert Pujols for them. That’s easy enough.
The problem is on the other side of the ball. It’s really, really difficult to separate a pitcher’s performance from the eight other players fielding the ball in support of him, and then separate the performances of the fielders from one another. We know which teams are good at preventing runs, because they’re the teams allowing the fewest runs . . .
Anyway, we have trouble figuring out how to assign individual credit for that run prevention.
Fangraphs and B-R both try to separate pitchers from their defense when determining their WAR — they just do it in vastly different ways. This difference can sometimes become comically large. As it is in the case of Nolasco.
So trying to figure out defense/pitching is like trying to figure out which Backstreet Boy really makes a vocal harmony work? You know when it’s working, but it’s hard to pinpoint who’s really carrying the load?
Uh. I guess, if you want to look at it like that . . .
I certainly do. Anyway, how does Fangraphs separate a pitcher from his fielders? More importantly, how did they decide that the guy with the 5.06 ERA was worth 4.2 WAR?
Fangraphs basically says that they’re not even going to try to separate a pitcher from his defense fully, because it’s something we can’t do yet. Instead, they only choose to look at the things that we know the fielders don’t directly influence, namely:
Strikeouts
Walks + Hit Batters
Home Runs
Those are three events that involve ONLY the pitcher and batter. Ricky Nolasco walks a batter; that’s on Nolasco. Nolasco strikes someone out; that’s also on Nolasco. Nolasco surrenders a home run; it’s on Nolasco. The fielders don’t get involved in any of those plays.
That seems like a weird way to look at pitching . . .
It can seem like it, yeah. The reasoning behind looking at it this way is that a pitcher’s strikeouts, walks, and home runs remain relatively steady from year to year. This suggests that pitchers have an amount of control over those things. It suggests that it’s a repeatable skill.
However, the number of hits a pitcher surrenders varies wildly from season to season. This implies that hits allowed have far more to do with his defense, the ballpark he plays in, and plain old chance than with the pitcher himself. (Greg Maddux in the late 90s is a good example of someone whose hits jump up and down year to year.) A pitcher with a good defense behind him is going to allow fewer hits. Someone pitching in the Oakland Coliseum is going to have more foul outs (and therefore fewer hits allowed) than someone pitching in Citi Field, because Oakland has roughly 197 million more square miles of foul territory. Fangraphs doesn’t want to unfairly credit a pitcher for something he doesn’t have much control over, so they JUST look at strikeouts, walks, and home runs.
How do you turn strikeouts, walks, and home runs into WAR?
There’s a simple formula using those three events that gives you an ERA-like number. It’s called FIP (Fielding Independent Pitching). You use FIP to get the number of runs a pitcher is credited with allowing based on those three things (though “walks” here really means walks minus intentional walks, plus HBP).
In 2009, in 185 innings pitched, our test dummy Ricky Nolasco
struck out 195 batters
walked 39 (44 walks minus 7 intentional, plus 2 HBP)
and allowed 23 home runs.
Those are good numbers, and it comes out to a FIP of 3.35. That’s a good FIP (keeping in mind that FIP is supposed to look like ERA). In the 185 innings Nolasco worked, Fangraphs charges him about 75 runs allowed, based on that FIP.
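If you want to see the arithmetic, here is a rough sketch in Python. The 13/3/2 weights are the standard FIP weights. The constant of 3.20 is my assumption, picked only because it lands near the 3.35 quoted above (the real constant is set by league and season). The 0.92 factor for scaling earned runs up to all runs is also an assumption, and the function names are made up for this example.

```python
def fip(strikeouts, walks, intentional_walks, hit_batters, home_runs,
        innings, constant=3.20):
    """Fielding Independent Pitching on an ERA-like scale.

    `constant` is an assumed league constant, not an official value.
    """
    # "Walks" for FIP purposes: walks minus intentional walks, plus HBP.
    free_passes = walks - intentional_walks + hit_batters
    weighted_events = 13 * home_runs + 3 * free_passes - 2 * strikeouts
    return weighted_events / innings + constant


def fip_runs(fip_value, innings, earned_run_share=0.92):
    """Convert FIP into total runs charged over the given innings.

    `earned_run_share` is an assumed ratio of earned runs to all runs.
    """
    earned_scale_runs = fip_value * innings / 9
    return earned_scale_runs / earned_run_share


# Ricky Nolasco, 2009: 185 IP, 195 K, 44 BB, 7 IBB, 2 HBP, 23 HR
nolasco_fip = fip(195, 44, 7, 2, 23, 185)
nolasco_fip_runs = fip_runs(nolasco_fip, 185)
print(round(nolasco_fip, 2), round(nolasco_fip_runs))
```

Under those assumptions the numbers come out close to the 3.35 FIP and roughly 75 runs quoted above, but treat it as an illustration, not as Fangraphs’ exact calculation.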
Because Fangraphs’ version of WAR is based on FIP runs — and just FIP runs — and FIP runs are based on just strikeouts, walks, and home runs, Nolasco wound up with a high WAR despite an ERA above 5.00.
Um . . .
Yeah. I got nothing.
Alrighty. So Fangraphs uses FIP runs for the number of runs a pitcher allowed in their version of WAR. Or something. What complicated formula does Baseball-Reference use?
The pitcher’s actual runs allowed.
Oh, great . . . wait, really? Just that?
Yup. Our test subject Nolasco actually allowed 111 runs in 2009, and that’s all that B-R uses.
Why all runs? Why not just earned runs?
Earned runs are an attempt to correct a pitcher’s record for defense — he isn’t charged for runs caused by the errors of his fielders. While ERA is not the best way to correct for defense, its heart is in the right place. Baseball-Reference doesn’t use earned runs because it corrects for defense in a better way.
It splits up the total defensive contribution among the pitchers. Baseball-Reference takes the Total Zone rating of the entire team, then divides the runs saved based on the cut of balls in play allowed by each pitcher. Strikeout pitchers rely on their defense less than sinkerballers. This accounts for that.
If a good defensive team is +50 runs above average by Total Zone, and one starter allowed 18% of his team’s balls in play, Baseball-Reference credits nine (nine being 18% of 50) of the runs saved while that pitcher was on the mound to the defense. So it basically adds those runs back on. If another pitcher allowed 10% of the team’s balls in play, five (five being 10% of 50) of his runs saved are attributed to the defense. And so on.
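The split described above is just a proportion, so it can be sketched in a few lines of Python. The function names and the 90-runs-allowed pitcher are hypothetical, and this only mirrors the description here, not necessarily every detail of what Baseball-Reference does.

```python
def defensive_runs_credited(team_defense_runs, balls_in_play_share):
    """Runs saved by the defense while this pitcher was on the mound.

    team_defense_runs: team Total Zone runs above average (+50 = good defense)
    balls_in_play_share: fraction of the team's balls in play this pitcher allowed
    """
    return team_defense_runs * balls_in_play_share


def defense_adjusted_runs(runs_allowed, team_defense_runs, balls_in_play_share):
    """Add the defense's share back onto the pitcher's actual runs allowed."""
    return runs_allowed + defensive_runs_credited(team_defense_runs,
                                                  balls_in_play_share)


# The two pitchers from the example: 18% and 10% of a +50 defense.
print(defensive_runs_credited(50, 0.18))   # about 9 runs
print(defensive_runs_credited(50, 0.10))   # about 5 runs

# A hypothetical starter who allowed 90 runs in front of that defense.
print(defense_adjusted_runs(90, 50, 0.18))
```

A bad defense works the same way with a negative team total, which takes runs off the pitcher’s ledger instead of adding them.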
There are some problems doing defense like this. If a team has a good infield and a slow outfield, they might appear to be an AVERAGE defensive team overall, but a flyball pitcher is going to be hurt more and a groundball pitcher helped more. B-R is going to credit them with the same defensive assistance regardless. It’s not perfect.
The Marlins, in 2009, were rated as an exactly average defensive team by Total Zone. According to that, Nolasco received no help or harm from his defense. He still gets charged with all 111 runs.
So Fangraphs says 75 runs belong to Nolasco, and Baseball-Reference says 111 runs. I think I see a problem here. Let’s pretend I don’t and move on. What are these numbers of runs being compared to?
It’s still Wins Above Replacement, so the pitcher’s runs allowed are compared to the runs a replacement level pitcher would allow in his innings by both sites.
And right here is where we need to separate relief pitchers from starting pitchers.
Because . . .
Because it’s easier to relieve than it is to start.
A “replacement level” pitcher (such as, say, Luis Ayala) is likely to pitch better out of the bullpen than as a starting pitcher because of the shorter outings. He can pump up his velocity and doesn’t have to face the same lineup multiple times. Most pitchers are more effective out of the pen. Bobby Parnell was good out of the pen last season and this one; he was miserable as a starter. Basically, a reliever with an ERA of 4.25 is going to be less valuable than a starter with an ERA of 4.25.
The number of runs a pitcher is compared to for WAR depends on the role he was used in.
So how many runs is Ricky Nolasco being compared to for WAR?
Well, Nolasco is compared by both sites to the number of runs a barely-serviceable pitcher in his league, in his ballpark, in his role (as a starter), would be expected to allow in the same number of innings. It’s adjusted for certain factors. American League pitchers are expected to surrender more runs because they face the DH. A pitcher in Colorado would be expected to surrender more runs because of the thin air. A pitcher in San Diego would be expected to surrender fewer runs because he works in a pitcher’s park.
Baseball-Reference also adjusts for the pitcher’s opponents: it takes into account whether someone playing for Toronto was unlucky enough to start against the Yankees, Rays, and Red Sox every time out. I don’t know if Fangraphs does the same thing for the quality of opponents; if they do, I missed where they say so.
Despite similar methods, each site comes up with a slightly different number for replacement. Fangraphs says that a replacement level pitcher throwing Nolasco’s 185 innings would allow around 114 runs; Baseball-Reference sets it at 108 runs. They’re close, but there are some little differences.
All you do is subtract the number of runs the pitcher is charged with from the number of runs our imaginary replacement pitcher would allow. That gives you how many runs he saved over a replacement pitcher.
So for Nolasco:
Fangraphs: 114 replacement runs – 75 FIP runs = 39 runs above replacement.
Baseball-Reference: 108 replacement runs – 111 real runs + 0 defensive runs = 3 runs BELOW replacement.
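The same subtraction as a tiny Python sketch, using only the numbers above. The function name is made up for illustration.

```python
def runs_above_replacement(replacement_runs, runs_charged):
    """Positive means the pitcher beat replacement level; negative means he didn't."""
    return replacement_runs - runs_charged


# Fangraphs charges Nolasco his FIP runs; Baseball-Reference charges his
# actual runs (the Marlins' defensive adjustment was zero in 2009).
fangraphs_rar = runs_above_replacement(114, 75)
bref_rar = runs_above_replacement(108, 111)
print(fangraphs_rar, bref_rar)
```

Same pitcher, same innings, and the two ledgers disagree by 42 runs before anything gets converted to wins.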
And then to get from runs above replacement to wins above replacement, we do what?
Well, we need to take two more things into account. The first is the importance of the innings pitched.
For example: A closer is only going to pitch 60 or 70 innings in a season. That’s not many. However, a majority of those innings are going to be in critical situations. The importance of these situations is measured by something called “leverage index,” sometimes shortened to LI. The average situation gets a leverage index of around 1.0. Mop up time is going to get a leverage index below 1.0, something like 0.5 or lower. Clutch spots get high leverage indexes, usually 2.0 or higher. Here’s what the Mets pitching staff’s leverage index looks like this season:
You really like these Baseball-Reference tables, don’t you?
I love everything Baseball-Reference related.
Weird. Continue . . .
You can see that Frankie Rodriguez and Pedro Feliciano have the highest leverage index, because they work primarily in big spots; Raul Valdes gets garbage time and has a low leverage index. Most of the starters are around 1.0 — Mike Pelfrey’s LI is raised slightly because of his save in the 20-inning game against the Cardinals.
Anyway, Baseball-Reference takes every pitcher’s runs above replacement and adjusts it based on the leverage index. The runs saved by pitchers working in big spots are more valuable than the runs saved by Raul Valdes when the Mets are down 10 runs in the seventh inning.
Francisco Rodriguez has saved 10 runs this season, but is credited with saving 14 runs; Bobby Parnell has saved 4 runs, but is only credited with saving 3 runs because of some mop up work.
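You can back out the size of that adjustment from the totals above. This Python sketch assumes the adjustment is a single multiplier applied to runs saved, which is my simplification and not something Baseball-Reference spells out here. The function names are hypothetical.

```python
def implied_leverage_multiplier(raw_runs_saved, credited_runs_saved):
    """How much each raw run saved was scaled up or down."""
    return credited_runs_saved / raw_runs_saved


def leverage_adjusted_runs(raw_runs_saved, multiplier):
    """Apply an assumed single multiplier to a pitcher's raw runs saved."""
    return raw_runs_saved * multiplier


# Rodriguez: 10 runs saved, credited with 14 (mostly big spots).
k_rod = implied_leverage_multiplier(10, 14)
# Parnell: 4 runs saved, credited with 3 (some mop-up work).
parnell = implied_leverage_multiplier(4, 3)
print(k_rod, parnell)
```

A multiplier above 1 means the pitcher’s innings mattered more than average; below 1 means they mattered less.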
Fangraphs makes the same adjustment, but I believe they only use leverage index for relievers. Most starters see leverage indexes around 1.0 anyway, so it’s not a big deal if it’s taken into account for them or if it’s not.
So that’s the first thing for runs to wins. What’s the second thing?
Runs also suffer from inflation. Because a good starting pitcher will allow fewer runs, the runs in his games become even more valuable. Runs scored in a Johan Santana-Josh Johnson matchup are worth more than runs in a John Maine-Livan Hernandez matchup. This is taken into account when turning runs into wins.
So, now can we do runs to wins?
Yup. Like the batters, it’s still close to “10 runs equals a win,” but it’s actually slightly different for every pitcher for the above reasons.
For our subject Ricky Nolasco:
Fangraphs: 39 runs above replacement = 4.2 WAR
Baseball-Reference: 3 runs BELOW replacement = – 0.3 WAR
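To see how close to “10 runs equals a win” each site landed, you can divide the runs by the wins. This Python sketch only works backward from the rounded totals above; neither site publishes its conversion this way, and the function names are made up.

```python
def implied_runs_per_win(runs_above_replacement, wins_above_replacement):
    """Back out the runs-to-wins rate from a pitcher's published totals."""
    return runs_above_replacement / wins_above_replacement


def runs_to_wins(runs_above_replacement, runs_per_win=10.0):
    """Convert runs to wins at a given rate (10 is the usual rule of thumb)."""
    return runs_above_replacement / runs_per_win


fangraphs_rate = implied_runs_per_win(39, 4.2)   # a bit over 9 runs per win
bref_rate = implied_runs_per_win(-3, -0.3)       # about 10 runs per win
print(round(fangraphs_rate, 1), round(bref_rate, 1))
```

Both rates sit near 10, so almost all of the 4.5-win gap comes from the runs each site charges to Nolasco, not from the conversion.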
That’s still a weird gap.
Yeah, it is. It’s two different approaches to the same problem, but you get two totally different answers. Basically, we still don’t know which things are the pitcher and which are the defense. We’re better at it now than we were when we were just using errors and unearned runs, but we’re still not there yet.
To solve this problem, Fangraphs says, “We know that a pitcher is responsible for his walks, strikeouts, and home runs, but we’re not sure about anything else. Let’s build a system based on that, one which ignores everything else. We don’t want to give a pitcher more credit than we should.”
For their part, Baseball-Reference says, “We’re going to correct for defense a bit, but that’s about it. We’re not sure what is the pitcher’s responsibility and what isn’t. We don’t want to take any more credit away from the pitcher than we should.”
B-R’s WAR is a bit more grounded in reality, so it tends to avoid uncomfortable things like Ricky Nolasco having a 5.06 ERA and a high WAR. It might be giving pitchers too much credit for their successes and failures, but Fangraphs might not be giving enough. Fangraphs’ WAR, being based on FIP, is better for predicting how a pitcher will perform in the future, but it’s not great for saying who pitched well in the past, at least according to how we normally define pitching well.
Any other problems with WAR for pitchers?
Cute. Seriously though.
It doesn’t include batting and fielding. Baseball-Reference does thankfully include both on the career WAR leaderboards, but that’s mostly so Babe Ruth sits comfortably in first place. If you take his pitching out, he’s barely ahead of Barry Bonds — less than a full win — but who, other than Barry Bonds, wants to see Barry Bonds even close to first place?
You can look up a pitcher’s WAR as a hitter on the player pages, but you generally have to add it to his pitching value on your own.
Is that a big deal? Most pitchers are miserable hitters.
And most center fielders are good defensive outfielders. Does that mean we shouldn’t bother quantifying how good they are defensively when we compare them? Every run counts, right?
That being said, for most pitchers, batting doesn’t make much of a difference. For some, however, it’s a huge deal. Don Newcombe hit .271 with a .338 on-base percentage and 15 home runs in his career; one-fourth of Newcombe’s career WAR value came from his batting. The same goes for Mike Hampton. Sandy Koufax’s .097 career average loses him about 5 wins over his career, undoing about a full season of pitching value. It’s something worth looking at for many pitchers, even if the American League is using that silly rule about pitchers not hitting.
Also, WAR for pitchers isn’t as great for comparing pitchers across history as it is for comparing position players. Batters throughout history have received a similar number of plate appearances — it changes with offensive levels and the number of games in a season, but it’s close enough to be comparable. Babe Ruth came to the plate 691 times in 1927; Albert Pujols came to the plate 700 times in 2009.
It doesn’t work like that for pitchers. WAR doesn’t adjust for the number of innings thrown, and pitchers in the deadball era and in the early 1970s worked far more innings than pitchers do today. Johan Santana’s run prevention abilities are similar to Cy Young’s, but Young worked about twice as many innings a season as Santana. Young’s WAR per year is much higher for that reason. So it’s not great for that.
Is that sort of it?
Because it’s based on FIP or runs allowed, WAR runs into some of the same problems as those stats — punishing pitchers for meltdowns more than it should. There’s not a huge difference between allowing eight runs in an appearance and allowing twelve, and that’s not accounted for. And it doesn’t take into account clutch pitching and the like.
And is that it?
Yeah, I’m done.