# Baseball SeRies: Computing Park Factors.

As pointed out in Practicing Sabermetrics, one of the major characteristics of baseball is that the dimensions of a field may vary from one ballpark to the next. Furthermore, these dimensions (including the size of the foul territory, the height of the outfield walls, the distance from the infield to the fences, and others) may affect whether a park is more conducive to offensive or defensive play.

Park Factors (PFs) measure the effect the dimensions of a field have on the performance of a particular team by comparing the team’s stats at home vs. its stats on the road.
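As a sketch, the basic run-based version of this comparison, which is what the f_park_factors code later in this post computes, can be written as follows. The symbols are shorthand introduced here for illustration: RS and RA are runs scored and runs allowed, G is games played, and the subscripts H and A mark home and away games.

$$
PF = \frac{(RS_H + RA_H) / G_H}{(RS_A + RA_A) / G_A}
$$

In words: total runs per game in the team's home games, divided by total runs per game in its road games.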

While there are multiple formulas out there to compute PFs (some of them involving several complex metrics), here I’ll only be showing you how to code what I believe is the most basic formula for getting this metric. Please note that a PF higher than 1 favors the hitter, a PF below 1 favors the pitcher, and a PF equal to 1 means a park is neutral.

The first step for calculating Park Factors with this formula is collecting the runs scored by the MLB teams. To that end, the f_runs function loads the 1974-2014 Retrosheet game files and gets the runs scored and allowed by every team at home and on the road.

    library( data.table )  # fread
    library( dplyr )       # %>%, mutate, select, group_by, summarise, joins

    f_runs <- function( p_file ) {
      # Read only the columns needed from the game file; 'NULL' drops a column
      g_file <- fread( input = p_file
                     , header = TRUE
                     , sep = ','
                     , na.strings = ''
                     , stringsAsFactors = FALSE
                     , colClasses = c( 'character'            # GAME_ID 0
                                     , rep( x = 'NULL', 6 )   # 6
                                     , 'character'            # AWAY_TEAM_ID 7
                                     , 'character'            # HOME_TEAM_ID 8
                                     , 'character'            # PARK_ID 9
                                     , rep( x = 'NULL', 24 )  # 34
                                     , 'integer'              # AWAY_SCORE_CT 35
                                     , 'integer'              # HOME_SCORE_CT 36
                                     , rep( 'NULL', 143 )     # 178
                                     ) )

      # The season is stored in characters 4 to 7 of GAME_ID
      mlb_data <- g_file %>%
        mutate( YEAR = as.integer( x = substr( x = GAME_ID, start = 4, stop = 7 ) ) ) %>%
        select( YEAR, AWAY_TEAM_ID, HOME_TEAM_ID, PARK_ID, AWAY_SCORE_CT, HOME_SCORE_CT )

      # Runs scored and allowed at home, per season, park and team
      r_home <- mlb_data %>%
        group_by( YEAR, PARK_ID, HOME_TEAM_ID ) %>%
        summarise( H_RS = sum( x = HOME_SCORE_CT )
                 , H_RA = sum( x = AWAY_SCORE_CT )
                 , H_G = n() ) %>%
        rename( TEAM_ID = HOME_TEAM_ID )

      # Runs scored and allowed on the road, per season and team
      r_away <- mlb_data %>%
        group_by( YEAR, AWAY_TEAM_ID ) %>%
        summarise( A_RA = sum( x = HOME_SCORE_CT )
                 , A_RS = sum( x = AWAY_SCORE_CT )
                 , A_G = n() ) %>%
        rename( TEAM_ID = AWAY_TEAM_ID )

      inner_join( x = r_home
                , y = r_away
                , by = c("YEAR", "TEAM_ID") )
    }
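To make the shape of the f_runs output concrete, here is a minimal sketch of the same home/road aggregation on a handful of made-up games. The team codes (AAA, BBB), park codes (P01, P02) and scores are hypothetical and not taken from Retrosheet; only the column names follow the function above. The `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy games (not real Retrosheet data): team AAA hosts two
# games at park P01 and team BBB hosts two games at park P02.
games <- data.frame(
  YEAR          = c(2014L, 2014L, 2014L, 2014L),
  AWAY_TEAM_ID  = c("BBB", "BBB", "AAA", "AAA"),
  HOME_TEAM_ID  = c("AAA", "AAA", "BBB", "BBB"),
  PARK_ID       = c("P01", "P01", "P02", "P02"),
  AWAY_SCORE_CT = c(3L, 5L, 2L, 4L),
  HOME_SCORE_CT = c(6L, 4L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Home totals: one row per season, park and home team.
r_home <- games %>%
  group_by(YEAR, PARK_ID, HOME_TEAM_ID) %>%
  summarise(H_RS = sum(HOME_SCORE_CT),
            H_RA = sum(AWAY_SCORE_CT),
            H_G  = n(),
            .groups = "drop") %>%
  rename(TEAM_ID = HOME_TEAM_ID)

# Road totals: one row per season and visiting team.
r_away <- games %>%
  group_by(YEAR, AWAY_TEAM_ID) %>%
  summarise(A_RS = sum(AWAY_SCORE_CT),
            A_RA = sum(HOME_SCORE_CT),
            A_G  = n(),
            .groups = "drop") %>%
  rename(TEAM_ID = AWAY_TEAM_ID)

# One row per season and team, with home and road totals side by side.
runs <- inner_join(r_home, r_away, by = c("YEAR", "TEAM_ID"))
print(runs)
```

Each team ends up with a single row per season that carries both its home totals (H_RS, H_RA, H_G) and its road totals (A_RS, A_RA, A_G), which is the input the park factor step expects.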

The f_park_factors function, in turn, computes the park factors for every field using p_hist historical seasons. For example, if p_hist is equal to 5 and p_year is equal to 2014, f_park_factors will aggregate the runs scored in a field during the 2010, 2011, 2012, 2013, and 2014 seasons before computing the PFs.

    f_park_factors <- function( p_year, p_data, p_hist ) {
      p_data %>%
        filter( YEAR <= p_year & YEAR >= p_year - p_hist + 1 ) %>%
        group_by( TEAM_ID, PARK_ID ) %>%
        summarise( H_RS = sum( x = H_RS )
                 , H_RA = sum( x = H_RA )
                 , H_G = sum( x = H_G )
                 , A_RS = sum( x = A_RS )
                 , A_RA = sum( x = A_RA )
                 , A_G = sum( x = A_G )
                 , YEARS = n() ) %>%
        mutate( YEAR = p_year
              , PK_FACTOR = ((H_RS + H_RA)/H_G)/((A_RS + A_RA)/A_G) )
    }

The function also computes a YEARS column that tells the user how many seasons were actually used to compute a given PF, since some fields do not have complete historical data for the requested p_hist window.
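As a rough illustration of both the window and the YEARS column, the sketch below applies the same filter, aggregation and ratio to hypothetical per-season totals. Every value here is made up: AAA/P01 has three seasons inside the five-year window and BBB/P02 only two. The `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical per-season totals in the shape returned by f_runs
# (team codes, park codes and run totals are invented for illustration).
runs <- data.frame(
  YEAR    = c(2012L, 2013L, 2014L, 2013L, 2014L),
  TEAM_ID = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  PARK_ID = c("P01", "P01", "P01", "P02", "P02"),
  H_RS = c(400, 420, 380, 300, 320),
  H_RA = c(380, 400, 420, 310, 330),
  H_G  = c(81, 81, 81, 81, 81),
  A_RS = c(350, 360, 340, 330, 350),
  A_RA = c(340, 350, 360, 340, 360),
  A_G  = c(81, 81, 81, 81, 81),
  stringsAsFactors = FALSE
)

p_year <- 2014
p_hist <- 5   # window covers 2010 through 2014

pf <- runs %>%
  # Keep only seasons inside the p_hist-year window ending at p_year.
  filter(YEAR <= p_year & YEAR >= p_year - p_hist + 1) %>%
  group_by(TEAM_ID, PARK_ID) %>%
  # Pool runs and games over the window; YEARS counts the seasons found.
  summarise(H_RS = sum(H_RS), H_RA = sum(H_RA), H_G = sum(H_G),
            A_RS = sum(A_RS), A_RA = sum(A_RA), A_G = sum(A_G),
            YEARS = n(),
            .groups = "drop") %>%
  # Runs per game at home divided by runs per game on the road.
  mutate(YEAR = p_year,
         PK_FACTOR = ((H_RS + H_RA) / H_G) / ((A_RS + A_RA) / A_G))

print(pf)
```

With these toy numbers, P01 comes out at 2400/2100 (about 1.14, hitter friendly) with YEARS equal to 3, and P02 at 1260/1380 (about 0.91, pitcher friendly) with YEARS equal to 2, even though p_hist asked for five seasons.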

The f_names function simply reads the parks.csv file and adds the name of each field to the resulting dataset. All files can be downloaded from here. The whole code can be accessed from here as well.

    f_names <- function( p_pk_factors ) {
      p_file <- fread( input = "./parks.csv"
                     , header = TRUE
                     , sep = ','
                     , na.strings = ''
                     , stringsAsFactors = FALSE
                     , colClasses = c( 'character'           # BALL_PARK_ID
                                     , 'character'           # BALL_PARK_NAME
                                     , rep( x = 'NULL', 7 )  # 9
                                     )
                     , col.names = c( 'PARK_ID', 'PARK_NAME' ) )

      inner_join( x = p_file
                , y = p_pk_factors
                , by = c("PARK_ID") ) %>%
        select( YEAR, PARK_ID, PARK_NAME, TEAM_ID, YEARS, H_RS, H_RA, H_G, A_RS, A_RA, A_G, PK_FACTOR )
    }
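One thing worth keeping in mind about the inner join: any park factor whose PARK_ID has no row in the lookup table is dropped from the result. The sketch below shows this on made-up data; the park names, IDs and factors are hypothetical, and the anti_join check is an optional extra, not part of the original code.

```r
library(dplyr)

# Hypothetical lookup table and park factors (all values invented).
parks <- data.frame(PARK_ID   = c("P01", "P02"),
                    PARK_NAME = c("Example Field", "Sample Stadium"),
                    stringsAsFactors = FALSE)

pf <- data.frame(PARK_ID   = c("P01", "P03"),
                 TEAM_ID   = c("AAA", "CCC"),
                 PK_FACTOR = c(1.08, 0.95),
                 stringsAsFactors = FALSE)

# Same kind of join as f_names: only parks present in both tables survive.
named <- inner_join(parks, pf, by = "PARK_ID")

# Optional check: park factors that would be lost for lack of a name.
unmatched <- anti_join(pf, parks, by = "PARK_ID")

print(named)
print(unmatched)
```

Here only P01 gets a name, while P03 shows up in the unmatched table, so a quick look at that table can confirm that no park was silently lost.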

Now that you are able to compute park factors, you are ready to do some crazy plots like this, this and even this! The code for all of these plots can be accessed from here.