baseball
code
Author

Matthew Spence

Published

April 9, 2015

Modified

February 24, 2023

In honor of Opening Day (and the Nats’ loss), I wanted to see which teams have had the most success on opening day.

It’s…the Mets (?)

Code
import os, requests, zipfile, io
from IPython.display import HTML#, Markdown
#from itables import show
import pandas as pd

DATADIR = os.path.join(os.getenv("MY_PROJECT_DIR"), "data", "gl") # Using Quarto's _environment.local file (not saved to GitHub) to set the directory where this project lives

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

DOWNLOADFROMRS = True

if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/CurrentNames.csv")
    with open(os.path.join(DATADIR,"CurrentNames.csv"),"w") as f:
        f.write("current_team_id,hist_franchise_id,league,division,location,nickname,alt_nickname,first_used,last_used,city,state\n")
        f.write(r.text)

"""
From https://www.retrosheet.org/Nickname.htm:
The fields are:

Current franchise ID
Franchise ID (may the same as current one) for earlier data
League
Division (blank before divisional play)
Location name (e.g. Brooklyn, Texas)
Nickname
Alternate nicknames (blank if none for the period)
Date of the first game with this combination
Date of the last game with combination (blank if current one)
City
State
"""

if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/gamelogs/glfields.txt")
    with open(os.path.join(DATADIR, "glfields.txt"),"w") as f:
        f.write(r.text)

if DOWNLOADFROMRS:
    for y in range(1980,2015):
        zip_file_url = f"https://www.retrosheet.org/gamelogs/gl{y}.zip"
        r = requests.get(zip_file_url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(path=DATADIR)

fields = ['date','visit_team_id','visit_team_league','visit_game_number','home_team_id','home_team_league','home_game_number','visit_final_score','home_final_score']
cols = [0,3,4,5,6,7,8,9,10]

namefields = ['current_team_id','hist_franchise_id','league','division','location','nickname','alt_nickname','first_used','last_used','city','state']

def assign_outcome(row):
    if row.final_score>row.opponent_final_score:
        return 'W'
    elif row.final_score<row.opponent_final_score:
        return 'L'

def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,names=fields, usecols=cols, parse_dates=['date'])
    v = dat.loc[dat.visit_game_number==1,].rename(columns=lambda x: x.replace('visit_','')).rename(columns=lambda x: x.replace('home_','opponent_'))
    h = dat.loc[dat.home_game_number==1,].rename(columns=lambda x: x.replace('home_','')).rename(columns=lambda x: x.replace('visit_','opponent_'))
    games = pd.concat([v,h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome,axis=1)
    games['season'] = year
    return games

teamnames = pd.read_csv(os.path.join(DATADIR,'CurrentNames.csv'), parse_dates=['first_used','last_used']) 

df = pd.DataFrame()
for year in range(1980,2015):
    g = read_in_data(year)
    df = pd.concat([df,g], ignore_index=True)

teamsgrouped = df.groupby('team_id')

pct = teamsgrouped.apply(lambda x: x.outcome.value_counts()/x.outcome.size).round(3)*100
pct = pct.unstack(1)
pct = pct.sort_values('W', ascending=False) ## decent answer here, but let's merge in team names

pctteamids = list(pct.index.values)
pctteams = teamnames[teamnames.hist_franchise_id.isin(pctteamids)].sort_values('last_used').drop_duplicates(subset='hist_franchise_id',keep="last")

finaltable = pct.merge(pctteams, left_index=True, right_on='hist_franchise_id', how="left").reset_index().assign(Team = lambda x: x['location'] + " " + x['nickname']).rename(columns={'W':'Wins','L':'Losses'})[['Team','Wins','Losses']]

HTML(finaltable.to_html(index=False))  # alternative to using the itables package
# with pd.option_context("display.float_format", "{:,.1f}%".format):
#     show(finaltable, dom="lrtp", lengthMenu = [5,10,len(finaltable)]) #paging=False to show full table
Table 1: Opening Day Records, 1980 – 2014
Team Wins Losses
New York Mets 71.4 28.6
Baltimore Orioles 65.7 34.3
Chicago White Sox 65.7 34.3
Los Angeles Angels 61.1 38.9
Seattle Mariners 60.0 40.0
Arizona Diamondbacks 58.8 41.2
Atlanta Braves 57.1 42.9
Los Angeles Dodgers 54.3 45.7
San Francisco Giants 54.3 45.7
New York Yankees 54.3 45.7
Cincinnati Reds 54.3 42.9
Milwaukee Brewers 54.3 42.9
Detroit Tigers 54.3 45.7
Tampa Bay Rays 52.9 47.1
Florida Marlins 52.6 47.4
Toronto Blue Jays 51.4 48.6
St. Louis Cardinals 51.4 48.6
Colorado Rockies 50.0 50.0
Texas Rangers 48.6 51.4
Pittsburgh Pirates 48.6 51.4
Montreal Expos 48.0 52.0
California Angels 47.1 52.9
Chicago Cubs 45.7 54.3
Boston Red Sox 45.7 54.3
Houston Astros 42.9 57.1
Washington Nationals 40.0 60.0
San Diego Padres 40.0 60.0
Minnesota Twins 40.0 60.0
Cleveland Indians 40.0 60.0
Philadelphia Phillies 37.1 62.9
Oakland Athletics 37.1 62.9
Miami Marlins 33.3 66.7
Kansas City Royals 28.6 71.4

Here’s how I did it:

I downloaded1 game logs from 1980 – 2014 from RetroSheet and wrote the following script to analyze them. The key was changing the “dyadic” structure of the raw game logs (one row contains information about the two teams playing the game) into one that had one row for each team. I did this in the read_in_data function:

def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,names=fields, usecols=cols, parse_dates=['date'])
    v = dat.loc[dat.visit_game_number==1,].rename(columns=lambda x: x.replace('visit_','')).rename(columns=lambda x: x.replace('home_','opponent_'))
    h = dat.loc[dat.home_game_number==1,].rename(columns=lambda x: x.replace('home_','')).rename(columns=lambda x: x.replace('visit_','opponent_'))
    games = pd.concat([v,h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome,axis=1)
    games['season'] = year
    return games

Here’s an example of one year’s worth of gamelog data once it’s been run through read_in_data. 2014 was a bit of an odd season, since the Dodgers and the Diamondbacks started their seasons on March 22 by playing three exhibition games in Syndey, Australia. Each team went on to be another team’s opening day opponent – the Dodgers faced the Padres and the Diamondbacks went up against the Giants.

date team_id game_number opponent_team_id opponent_game_number final_score opponent_final_score outcome
2014-03-22 LAN 1 ARI 1 3 1 W
2014-03-22 ARI 1 LAN 1 1 3 L
2014-03-30 SDN 1 LAN 3 3 1 W
2014-03-31 MIA 1 COL 1 10 1 W
2014-03-31 CIN 1 SLN 1 0 1 L
2014-03-31 TEX 1 PHI 1 10 14 L
2014-03-31 TBA 1 TOR 1 9 2 W
2014-03-31 OAK 1 CLE 1 0 2 L
2014-03-31 DET 1 KCA 1 4 3 W
2014-03-31 CHA 1 MIN 1 5 3 W
2014-03-31 BAL 1 BOS 1 2 1 W
2014-03-31 ANA 1 SEA 1 3 10 L
2014-03-31 PIT 1 CHN 1 1 0 W
2014-03-31 CHN 1 PIT 1 0 1 L
2014-03-31 WAS 1 NYN 1 9 7 W
2014-03-31 ATL 1 MIL 1 0 2 L
2014-03-31 COL 1 MIA 1 1 10 L
2014-03-31 SLN 1 CIN 1 1 0 W
2014-03-31 SFN 1 ARI 3 9 8 W
2014-03-31 PHI 1 TEX 1 14 10 W
2014-03-31 TOR 1 TBA 1 2 9 L
2014-03-31 CLE 1 OAK 1 2 0 W
2014-03-31 KCA 1 DET 1 3 4 L
2014-03-31 MIN 1 CHA 1 3 5 L
2014-03-31 BOS 1 BAL 1 1 2 L
2014-03-31 SEA 1 ANA 1 10 3 W
2014-03-31 MIL 1 ATL 1 2 0 W
2014-03-31 NYN 1 WAS 1 7 9 L
2014-04-01 NYA 1 HOU 1 2 6 L
2014-04-01 HOU 1 NYA 1 6 2 W

The results table includes two teams that have renamed themselves (the Angels and the Marlins), and one that moved and changed names (Expos). I’d like to update my script to deal with these, but should I really lump the Expos in with the Nats?

Also, I was relieved to see the Dodgers and the Giants were tied, at least.

Footnotes

  1. I did this manually back in 2015, but as I was porting this to the new site, I wrote a script to automatically download and unzip the files.↩︎