The MLBAM Pitch by Pitch Files
MLBAM (MLB Advanced Media) is the technology provider for Major League Baseball, and it’s probably the biggest media company you’ve never heard of. Based in New York City, MLBAM develops and maintains live streaming platforms, designs products for several types of devices, builds digital marketing solutions, supports ticketing strategies and, most importantly, produces data files that record every event that took place in every game of a season.
Some of these files contain the (x,y) locations of the batted balls put in play, while others hold general information about the game itself. In this post I’ll show you how to extract data from the hit and game files and how to plot it in R. You can download the MLBAM files I used for this post from here.
Scraping the Data from MLBAM
So just for fun I decided to get the data using Python and lxml. The code is pretty straightforward; assuming the script is stored in the same folder as the hit and game files, it:
- Stores the paths to the game and hip files in lists.
- Loops through the files and scrapes the desired data from them via XPath.
- Saves the scraped data into a CSV file.
from lxml import etree
import glob
import csv
# Input files. glob returns paths in arbitrary order, so both lists are
# sorted; this assumes the file names sort so that the i-th hip file and
# the i-th game file belong to the same game.
hipFiles = sorted( glob.glob('*hip.xml') )
gameFiles = sorted( glob.glob('*data.xml') )
# Output file (text mode with newline='' as the csv module expects in Python 3).
csvFile = open('hitsPerGame.csv', 'w', newline='')
# CSV writer.
writer = csv.writer( csvFile )
# XML parser.
parser = etree.XMLParser( ns_clean = True )
# Lists where data will be stored.
h_desc = [] # Hit description.
x_cord = [] # Hit x coordinate.
y_cord = [] # Hit y coordinate.
g_stad = [] # Stadium the game was played in.
b_team = [] # Batting team.
g_num = [] # Game number.
for i in range( len( gameFiles ) ):
    # Game data: playing teams, stadium.
    gameTree = etree.parse( gameFiles[i], parser )
    # hip data: hip description, x & y coordinates, batting team.
    hipTree = etree.parse( hipFiles[i], parser )
    h_desc.extend( hipTree.xpath( '//hip/@des' ) )
    x_cord.extend( hipTree.xpath( '//hip/@x' ) )
    y_cord.extend( hipTree.xpath( '//hip/@y' ) )
    # Check if team is guest or home, then set team name based on that.
    t_goh = hipTree.xpath( '//hip/@team' )
    t_name = gameTree.xpath( '//team/@name_brief' )
    # Repeat the stadium name once per hit.
    g_stad.extend( gameTree.xpath( '//stadium/@name' ) * len( t_goh ) )
    b_team.extend( [ t_name[0] if j == 'H' else t_name[1] for j in t_goh ] )
    # Game number, repeated once per hit.
    g_num.extend( [i + 1] * len( t_goh ) )
data = zip( g_num, g_stad, b_team, x_cord, y_cord, h_desc )
# Add data to the csv file, one row per batted ball.
for row in data:
    writer.writerow( row )
csvFile.close()
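To make the extraction step easier to follow, here is a minimal, dependency-free sketch of the same idea on two tiny in-memory documents. It swaps lxml and XPath for the standard library's xml.etree.ElementTree, and the XML snippets are hypothetical stand-ins: only the attribute names (des, x, y, team, name_brief, name) come from the script above, while the root element names and every value are invented. Like the script, it assumes the first team element in the game file is the home team.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-ins for one hip file and one game file.
# Attribute names follow the script above; all values are made up.
HIP_XML = """
<hitchart>
  <hip des="Groundout" x="110.5" y="150.2" team="A"/>
  <hip des="Single" x="140.0" y="120.7" team="H"/>
  <hip des="Error" x="0" y="0" team="A"/>
</hitchart>
""".strip()

GAME_XML = """
<game>
  <team name_brief="Home Team"/>
  <team name_brief="Away Team"/>
  <stadium name="Sample Park"/>
</game>
""".strip()


def extract_rows(hip_xml, game_xml, game_no):
    """Return one (game_no, stadium, team, x, y, desc) tuple per hip element."""
    hip_root = ET.fromstring(hip_xml)
    game_root = ET.fromstring(game_xml)

    # Assumption carried over from the script: index 0 is home, index 1 is away.
    team_names = [t.get("name_brief") for t in game_root.iter("team")]
    stadium = game_root.find(".//stadium").get("name")

    rows = []
    for hip in hip_root.iter("hip"):
        team = team_names[0] if hip.get("team") == "H" else team_names[1]
        rows.append(
            (game_no, stadium, team, hip.get("x"), hip.get("y"), hip.get("des"))
        )
    return rows


rows = extract_rows(HIP_XML, GAME_XML, 1)
for row in rows:
    print(row)
```

Building each row from a single hip element, rather than filling six parallel lists and zipping them at the end, keeps the stadium, team and coordinates of a batted ball together, so they cannot drift out of alignment.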
The Dataset
Once the data has been scraped from the MLBAM files, we are ready to do some quick analysis on it. Since the goal of this research is to plot the balls put in play by both teams, it’s probably a good idea to see what the batted balls in play look like on a Cartesian plane:
So at first glance we are able to see a couple of issues:
- There are some batted balls that carry an Error description and a (0,0) position in the plane.
- Balls that travelled farther have a lower y value. In other words, home runs have a y value close to zero and bunted balls have a y value far from the x axis. This is a problem because, when the values get plotted, they give the impression that the baseball field is turned upside down (just like in the above plot).
You can get the code I used to create this plot from here. Moreover, you can download the dataset created with the Python script from here.
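As a rough illustration of the two issues listed above, the sketch below drops the Error rows and flips the y coordinate on a few invented rows laid out in the same column order as hitsPerGame.csv. The upper bound of 250 is the y_max value used later in the plotting code; the row values and the clean helper are hypothetical.

```python
# Hypothetical rows in the column order of hitsPerGame.csv:
# (game_no, stadium, team, x_cord, y_cord, desc). All values are invented.
rows = [
    (1, "Sample Park", "Home Team", 125.0, 40.0, "Home Run"),
    (1, "Sample Park", "Away Team", 120.0, 195.0, "Bunt Groundout"),
    (1, "Sample Park", "Away Team", 0.0, 0.0, "Error"),
]

Y_MAX = 250  # upper bound of the y axis, taken from the plotting code


def clean(rows, y_max=Y_MAX):
    """Drop rows described as 'Error' and mirror y so deep balls plot high."""
    cleaned = []
    for game_no, stadium, team, x, y, desc in rows:
        if desc == "Error":
            continue  # these rows sit at (0,0) and carry no location
        cleaned.append((game_no, stadium, team, x, y_max - y, desc))
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

After the flip, the invented home run moves from y = 40 to y = 210 and the bunt from y = 195 to y = 55, so longer balls end up farther from the x axis.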
Creating the Plot
So as I explained in the last section, the error and coordinate issues need to be solved before we can draw any graphical representation of the batted balls in play. As you can see in the R code below, I fixed these problems under the "Remove errors" and "Flip y coordinate" comments respectively. Please note that I set y_max to 250 because this value fit the graph dimensions perfectly.
Furthermore, I drew the foul lines and the bases using the Pythagorean Theorem, knowing that the distance between consecutive bases is 90 ft and that each foul line runs at a 45 degree angle from home plate. Please note that, just like the y_max variable, the f_len and b_len variables were also adjusted to fit the graph.
library('ggplot2')
# Dataset column names and classes.
l_colnames = c( 'game_no','stadium', 'team', 'x_cord', 'y_cord', 'desc' )
l_colClasses = c( 'numeric', 'character', 'character', 'numeric', 'numeric', 'character' )
# Load the dataset.
hip_data <- read.csv( file = 'hitsPerGame.csv'
                    , header = FALSE
                    , col.names = l_colnames
                    , colClasses = l_colClasses
                    , na.strings = ''
                    , stringsAsFactors = TRUE
                    )
# Stadium specs (in plot units, not feet).
x_hp = 125 # X coordinate of home plate.
y_hp = 43 # Y coordinate of home plate.
f_len = 150 # Length of the foul line.
b_len = 45 # Length between bases.
y_max = 250 # Max y value.
f_angl = sqrt(2)/2 # sin(45°) = cos(45°), used for foul lines and bases.
# Remove errors.
hip_data <- hip_data[ hip_data$desc != 'Error', ]
# Flip y coordinate of the batted balls.
hip_data$y_cord <- -( hip_data$y_cord - y_max )
# Starting and ending coordinates of foul lines.
d_bounds <- data.frame( x = rep( x = x_hp, 2 )
                      , y = rep( x = y_hp, 2 )
                      , xend = c( x_hp + f_angl * f_len, x_hp - f_angl * f_len )
                      , yend = c( y_hp + f_angl * f_len, y_hp + f_angl * f_len )
                      )
# Coordinates of the bases: home, first, second, third and back to home.
d_bases <- data.frame( x = c( x_hp, x_hp + f_angl * b_len
                            , x_hp, x_hp - f_angl * b_len
                            , x_hp
                            )
                     , y = c( y_hp, y_hp + f_angl * b_len
                            , y_hp + sqrt( 2 * b_len^2 )
                            , y_hp + f_angl * b_len
                            , y_hp
                            )
                     )
# Create plot.
( ggplot()
+ geom_segment( data = d_bounds
              , aes( x = x
                   , y = y
                   , xend = xend
                   , yend = yend
                   )
              , color = 'white'
              , size = 1.1
              )
+ geom_path( data = d_bases
           , aes( x = x
                , y = y
                )
           , color = 'white'
           , size = 1.1
           )
+ geom_point( data = hip_data
            , aes( x = x_cord
                 , y = y_cord
                 , color = team
                 )
            , size = 1.2
            )
+ facet_wrap( ~stadium )
+ coord_equal()
+ labs( x = 'X'
      , y = 'Y'
      , title = 'Batted Balls in Play - WS 2016'
      )
+ scale_color_manual( values = c( 'royalblue3', 'firebrick1' )
                    , guide = guide_legend( title = 'Team' )
                    )
+ theme( panel.grid = element_blank()
       , panel.background = element_rect( fill = '#ace456' )
       , strip.background = element_rect( fill = '#5d6c93' )
       , strip.text = element_text( color = 'white' )
       , axis.text = element_text( size = 8 )
       , axis.title = element_text( size = 8 )
       , plot.title = element_text( size = 11 )
       , legend.title = element_text( size = 10 )
       )
)
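The geometry behind the foul lines and the bases in the R code above can be checked with a short Python sketch. It reuses the same constants (home plate at (125, 43), a foul-line length of 150 and a base-to-base length of 45, all in plot units rather than feet); the function names are hypothetical and exist only for this illustration.

```python
import math

# Constants copied from the R code above (plot units, not feet).
X_HP, Y_HP = 125, 43   # home plate
F_LEN = 150            # length of each foul line
B_LEN = 45             # distance between consecutive bases
UNIT = math.sqrt(2) / 2  # sin(45 degrees) == cos(45 degrees)


def foul_line_ends():
    """End points of the right and left foul lines, 45 degrees off the y axis."""
    dx = dy = UNIT * F_LEN
    return (X_HP + dx, Y_HP + dy), (X_HP - dx, Y_HP + dy)


def base_path():
    """Home, first, second, third and home again, forming a rotated square."""
    side = UNIT * B_LEN
    diagonal = math.sqrt(2 * B_LEN ** 2)  # home to second, by Pythagoras
    return [
        (X_HP, Y_HP),
        (X_HP + side, Y_HP + side),
        (X_HP, Y_HP + diagonal),
        (X_HP - side, Y_HP + side),
        (X_HP, Y_HP),
    ]


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


bases = base_path()
sides = [distance(bases[i], bases[i + 1]) for i in range(4)]
right_end, left_end = foul_line_ends()
print(sides)
print(right_end, left_end)
```

Every side of the base path comes out equal to B_LEN (up to floating-point rounding), which confirms that placing second base at a height of sqrt(2 * b_len^2) above home plate yields a square standing on one corner.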
World Series 2016
Here’s what the batted balls in play looked like for this year’s World Series. Is there any pattern you can see in the balls batted by the Chicago Cubs at Progressive Field? Fly the W!
Great article! Can this be used to improve player’s position when defending against a particular opponent?