Scheduling Python Scripts with Github Actions
Fantasy Football season has sadly come to an end, so the next best thing I can do is…write about Fantasy Football! Indeed, this past season I decided to manage my team differently. While I’ve always used a “data-driven” approach to making roster decisions, I’ve predominantly relied on the forecasts and opinions of the internet. So this year, I decided to build a custom model to make my own weekly player point forecasts.
My goal wasn’t to beat the “professional” projections from the likes of DraftKings or Yahoo. Instead, I wanted to see if I could come close to their accuracy by creating a simple model with a few essential inputs: past player performance, betting lines, weather forecasts, and playing environment. These inputs were gathered by a broader data-scraping process that ran every Wednesday. Upon completion, the scraped data was fed into a predictive model that forecast how many points each player on my roster would score in the upcoming week of games. Players with higher projections would “play,” while those with lower projections would stay “on the bench.”
After the season ended, I compared my model’s performance against Yahoo’s projections; mine were, as expected, close but less accurate (but that’s for a different discussion). Along the way, though, I discovered a simple way to automate and schedule the execution of `.py` scripts. The process is identical to the all-too-common Extract-Transform-Load (ETL) pattern we observe in data science (a minimal sketch follows the list below):
- Query a database or scrape data from a website.
- Do some data cleaning.
- Save the resulting output to a database (e.g., Snowflake) or object storage (e.g., AWS S3).
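For concreteness, here’s a minimal sketch of that pattern in Python (the DataFrame contents and bucket path are placeholders, not code from this project):

# a minimal, hypothetical ETL sketch: extract, transform, load
import awswrangler as wr
import pandas as pd


def run_weekly_refresh(s3_bucket: str) -> None:
    # Extract: stand-in for a database query or web scrape
    raw_df = pd.DataFrame({"player": ["A", "B"], "points": ["12", "7"]})
    # Transform: light cleaning, e.g., fixing column types
    raw_df["points"] = raw_df["points"].astype(int)
    # Load: write the result to object storage
    wr.s3.to_csv(raw_df, f"s3://{s3_bucket}/data/weekly_refresh.csv", index=False)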
Despite the ubiquity of this process, I’ve noticed it is also one of the more challenging aspects for beginners to grasp. The task of moving code developed locally and executed manually to a remote location that runs on a schedule can seem daunting at first.
A standard recommendation for creating workflows is to learn Airflow. While Airflow is the industry standard for building workflows, there is a lot to know before starting. The learning curve can deter data scientists from taking on this type of work, which translates into a few (less than optimal) outcomes: You ask a separate team to do the job, or you just run it manually every time data is needed. Thus, this post outlines a different process for scheduling and executing “simple” workflows. Please note that the proposed approach does not replace Airflow. Instead, it can serve as a guide for getting a project up and running quickly with minimal overhead.
With that in mind, here are the main things we’ll cover:
- Managing dependencies with `poetry`
- Setting up a Github Actions workflow
- Connecting to Amazon Web Services (AWS) via Github Actions
A Brief Tangent
If you’ve made it this far, you might be thinking to yourself, “Data scientists don’t typically concern themselves with these topics.” Five years ago, I would’ve agreed with that statement. But the longer I spend working in this field, the more I realize that the best data scientists can own a product end-to-end. This is because they understand not only their models and inference process but also everything required to make those models run, including databases, schedulers, monitoring systems, and serving infrastructure.
Indeed, one of the hallmarks of a “senior” data scientist is someone who can take a systems-level view of a project and not just focus on a single component (e.g., the model) but instead on how all the parts fit together. They understand that most of the work is not getting something to run but rather ensuring it can run reliably. So my advice to folks just getting started in this field is not to shy away from the “plumbing” or “glue” work. You don’t have to be an expert on everything needed to run an ML model in production or even a simple weekly data refresh. But learning the fundamentals – things like dependency management, Version Control, CI/CD, and some of the more popular services of the big cloud providers (e.g., AWS S3) – can set you apart from data scientists who can’t move beyond a local Jupyter notebook. With that in mind, let’s get back to the actual post!
Project Setup
I find learning by example to be best. All source code is here if you’d like to follow along. We’ll start by copying the repository to our local machine.
Step 1: Clone the repository.
git clone https://github.com/thecodeforest/stats-scrape.git
Step 2: Create a virtual environment.
conda create --name statsscrape python=3.8.3 -y
Step 3: Activate the virtual environment.
conda activate statsscrape
If you want to verify that you’ve successfully created and activated the `statsscrape` environment, run the following command, which will list all installed libraries:
pip list
You’ll notice that no third-party libraries appear (e.g., numpy, scipy, pandas, requests) because we have created a brand new environment and thus will need to install all of the required packages.
Step 4: Install `poetry`. Poetry helps to install and manage dependencies.
pip install poetry
Step 5: Install the dependencies listed in `poetry.lock`. The lock file (included in the repository) pins all the dependencies we’ll need for this walkthrough.
poetry install
Step 6: Run our main script in “debug” mode. When the `--debug` argument is passed, the output is written to your local machine instead of S3.
python statsscrape/statsscrape.py --s3_bucket "test" --year 2021 --debug True
There should be two new files in your repository: a `.csv` file with a single player’s game statistics for the 2021 season and a `.log` file that captures some high-level details about what happened in our script.
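If you’d like to inspect those outputs, a quick check might look like this (the file names follow the patterns used in `statsscrape.py`, which we’ll walk through shortly):

# peek at the debug outputs produced by the command above
from datetime import datetime

import pandas as pd

# the CSV name follows the f-string used in statsscrape.py when --debug is passed
stats_df = pd.read_csv("debug_df_season_2021.csv")
print(stats_df.head())

# the log file name includes the run date, per the logging config in statsscrape.py
log_file = f"player-stats-{datetime.now().strftime('%Y-%m-%d')}.log"
with open(log_file) as f:
    print(f.read())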
In the next few sections, we’ll dive deeper into the logic used to create these outputs as well as Github Actions.
Collecting Player Data
The `statsscrape.py` script is what we’ll use to collect game data for each player active within a given season. We executed this script in “debug” mode above, which I like to do when trying out code for the first time. When the data collection process runs on Github’s server, all outputs will land in S3 instead of our local machine.
# statsscrape.py
import logging
from datetime import datetime
import argparse

import awswrangler as wr
import pandas as pd

from createplayerid import create_player_id_df


def read_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Create player id dataframe")
    parser.add_argument("--s3_bucket", type=str, help="s3 bucket name")
    parser.add_argument("--year", type=int, help="Year to create dataframe for")
    parser.add_argument("--debug", default=False, type=bool, help="Run in debug mode")
    args = parser.parse_args()
    return args


logging.basicConfig(
    format="%(levelname)s - %(asctime)s - %(filename)s - %(message)s",
    level=logging.INFO,
    filename="player-stats-{start_time}.log".format(
        start_time=datetime.now().strftime("%Y-%m-%d")
    ),
)


def collect_stats():
    logging.info(f"Starting collection of player statistics")
    args = read_args()
    season_year = args.year
    s3_bucket = args.s3_bucket
    debug = args.debug
    s3_path = f"s3://{s3_bucket}/data/playerstats/raw/{season_year}/playerstats.csv"
    logging.info(f"Collecting data for {season_year}")
    logging.info(f"Writing data to {s3_path}")
    logging.info(f"Running module in Debug mode: {debug}")
    player_id_df = create_player_id_df(season_year=season_year)
    # take the first player returned
    player_name, player_id, _ = player_id_df.iloc[0]
    player_last_name_first_letter = player_name.split()[1][0]
    player_stats_url = f"https://www.pro-football-reference.com/players/{player_last_name_first_letter}/{player_id}.htm"
    # scrape the stats for a single player
    player_stats_df = pd.read_html(player_stats_url)[0]
    # write locally or save data to S3
    if debug:
        player_stats_df.to_csv(f"debug_df_season_{season_year}.csv", index=False)
    else:
        wr.s3.to_csv(player_stats_df, s3_path, index=False)
    logging.info("Completed collection of player statistics")


if __name__ == "__main__":
    collect_stats()
To summarise what’s happening above:
- Load our dependencies, including the `createplayerid` module, which consists of a series of helper functions that create a unique ID used to access the actual season data for a given player (a hypothetical sketch of its output follows this list).
- Set up some basic logging functionality.
- Pass in our arguments: the season we want data for, the name of the S3 bucket where we’ll store our data, and whether we’re debugging.
- Extract all of the player IDs for a given season.
- Create the URL for an individual player. Again, we keep things simple and take the first player on the list.
- Read the player stats associated with the URL we’ve created.
- Save the data to S3 or locally, depending on our running mode.
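To make that intermediate data concrete, here’s a purely hypothetical stub of `create_player_id_df`. To be clear, the real `createplayerid` module in the repository scrapes the actual IDs; this stand-in only illustrates the three-column shape that `statsscrape.py` unpacks:

# hypothetical stub of createplayerid.create_player_id_df -- NOT the real
# implementation, which scrapes the IDs; this only shows the expected shape
import pandas as pd


def create_player_id_df(season_year: int) -> pd.DataFrame:
    return pd.DataFrame(
        {
            "player_name": ["Some Player"],  # placeholder name
            "player_id": ["SomePl00"],       # placeholder, pro-football-reference-style ID
            "extra": [None],                 # third column is discarded by statsscrape.py
        }
    )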
Pretty simple stuff! But how do we get this thing to run every week without executing the module above manually? That’s what we’ll cover in the next section!
Automating Recurring Processes with Github Actions
Let’s start with a quick primer on Github Actions (GA). GA is primarily used for CI/CD, which stands for Continuous Integration/Continuous Deployment. CI automates the building, testing, and validation of changes to an existing codebase. For example, a CI workflow might run unit tests to determine if changes to a current function introduced bugs or unwanted outcomes. It might also calculate code coverage or update documentation. In contrast, CD is about shipping updates you’ve made to your codebase, assuming all of the checks outlined during the CI process pass.
GA is relatively new to the CI/CD space, and several other solutions are available (e.g., Circle CI, Travis CI). However, it’s my go-to solution for personal projects for a few reasons:
- It’s one less dependency to manage, given that version control and CI/CD live in the same place.
- For public repositories, you have 2,000 free minutes per month. Each time a CI/CD workflow runs, Github spins up a machine to execute a series of steps. For small projects with only a few simple steps (like the one outlined here), you’ll never come close to the limit, which makes it easy to get started.
- The configuration consists of a single `.yaml` file, which can be tracked and versioned just like your code.
Now that we’ve covered GA’s typical use case and benefits, let’s discuss scheduling a weekly data refresh. Again, I wouldn’t recommend using GA for scheduling an intricate ETL process. However, for small, solo projects (like a personal project or a minimal viable product), it is a simple yet effective way to get code out of a local development environment and onto a remote server.
Creating a Workflow
A workflow automates a series of “actions,” which could be executing a `.py` script, installing some packages, running a suite of unit tests, or verifying that code syntax adheres to a particular style. Actions are associated with a specific “step” of the workflow. A collection of steps comprises a “job,” which is all the steps that get executed on a single Github machine. Each workflow is triggered by an event. In our case, we chose a schedule, but a workflow can also be prompted by changes to the codebase, such as merging a new feature branch into the main branch.
All workflows executed by GA live in the `.github/workflows` directory within a repository. I named mine `schedule-data-refresh.yaml`, so it is located at `stats-scrape/.github/workflows/schedule-data-refresh.yaml`, but the generic path will look something like `<your-repository-name>/.github/workflows/<your-workflow-name>.yaml`.
Below is the complete workflow I’ve used to collect up-to-date player statistics on a weekly cadence for the past season. We’ll go through each section in greater detail.
# Describe what your workflow does
name: Refresh NFL Data
# How frequently you need data refreshed
on:
  schedule:
    - cron: "0 12 * * 3"
  # uncomment below if you need to debug
  # push:
  #   branches: [ main ]
jobs:
  refresh-data:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
    # set our environment variables
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      S3_BUCKET_NAME: player-data-season
      AWS_DEFAULT_REGION: us-west-2
      SEASON_YEAR: 2021
    steps:
      # check out the repository <stats-scrape> so the job can access all your code
      - uses: actions/checkout@v2
      # create a python environment
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8.3
      # install poetry
      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true
      # if an environment already exists, load it; otherwise create a new one
      - name: Load Cached Virtual Environment
        id: cached-poetry-dependencies
        uses: actions/cache@v2
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        # if no cache exists, install packages
        run: poetry install --no-interaction --no-root
      # scrape data and save to s3
      - name: Collect + Load Raw Player Stats Data to S3
        run: |
          source .venv/bin/activate
          python statsscrape/statsscrape.py --s3_bucket "$S3_BUCKET_NAME" --year $SEASON_YEAR
      # install AWS command line interface
      - name: Install AWS CLI
        uses: unfor19/install-aws-cli-action@v1
        with:
          version: 2
          verbose: false
      # save log file to S3 bucket
      - name: Save Log File to S3
        run: aws s3 cp "player-stats-$(date +'%Y-%m-%d').log" "s3://$S3_BUCKET_NAME/logs/"
1. Decide how frequently you need data refreshed
We’ll use cron syntax to specify the cadence at which the data refresh process runs. The five fields are minute, hour, day of month, month, and day of week, so the schedule below reads, “Run this every Wednesday (day 3) at noon.” If there are two things I frequently have to Google, it’s regular expressions and cron expressions. Save your memory and use a tool like this.
name: Refresh NFL Data
on:
  schedule:
    - cron: "0 12 * * 3"
As an aside, it is more common for a workflow to run in response to a particular event. For example, the workflow below executes when either a push or pull request is made against the `main` branch.
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
2. Set Environment variables
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  S3_BUCKET_NAME: player-data-season
  AWS_DEFAULT_REGION: us-west-2
  SEASON_YEAR: 2021
Environment variables are available to every step in your workflow. A common use-case is to store sensitive information, like passwords or API keys, as a secret within the repository and then make this information available to your workflow as an environment variable. In this example, we write output from our workflow to S3. To access our S3 bucket, we need the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. We don’t want to store keys in the source code, so we add them to the environment and securely access them.
Another use-case for environment variables is when you have “static” or “constant” data - things that will change infrequently. For instance, `SEASON_YEAR` indicates which season of data we want to collect. I want to reuse all of this logic for next year’s season, and storing this information in a configuration file makes it easier to change in the future relative to keeping it in the source code.
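As an aside, the workflow above passes these values to the script as command-line arguments, but you could also read them directly inside Python. A small sketch, assuming the same variable names:

# reading the workflow's environment variables directly in Python (an
# alternative to passing them as CLI arguments, as the workflow above does)
import os

s3_bucket = os.environ["S3_BUCKET_NAME"]             # raises KeyError if unset
season_year = int(os.getenv("SEASON_YEAR", "2021"))  # falls back to a default
print(f"Writing {season_year} data to s3://{s3_bucket}")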
3. Set up Python and Install Dependencies
steps:
  # check out the repository <stats-scrape> so the job can access all your code
  - uses: actions/checkout@v2
  # create a python environment
  - name: Set up Python 3.8
    uses: actions/setup-python@v2
    with:
      python-version: 3.8.3
  # install poetry
  - name: Install Poetry
    uses: snok/install-poetry@v1
    with:
      virtualenvs-create: true
      virtualenvs-in-project: true
      installer-parallel: true
  # if an environment already exists, load it; otherwise create a new one
  - name: Load Cached Virtual Environment
    id: cached-poetry-dependencies
    uses: actions/cache@v2
    with:
      path: .venv
      key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
  - name: Install dependencies
    if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
    # if no cache exists, install packages
    run: poetry install --no-interaction --no-root
This section is a bit more involved, so let’s break it down into digestible pieces:
- Install Python version 3.8.3.
- Install `poetry`. Note the `with` keyword, which provides additional input parameters to the step.
- Check if we’ve installed the packages outlined in the `.lock` file before. If we have, Github won’t have to download and re-install them; instead, we can use the cached dependencies for our job, which leads to a faster run time and reduced cost (assuming you are paying for the time).
- The initial run will take the longest, as none of the dependencies are cached.
At this point, we have most of the critical dependencies in our environment, and the remaining steps are straightforward.
4. Collect Player Data and Save to S3
# scrape data and save to s3
- name: Collect + Load Raw Player Stats Data to S3
  run: |
    source .venv/bin/activate
    python statsscrape/statsscrape.py --s3_bucket "$S3_BUCKET_NAME" --year $SEASON_YEAR
# install AWS command line interface
- name: Install AWS CLI
  uses: unfor19/install-aws-cli-action@v1
  with:
    version: 2
    verbose: false
# save log file to S3 bucket
- name: Save Log File to S3
  run: aws s3 cp "player-stats-$(date +'%Y-%m-%d').log" "s3://$S3_BUCKET_NAME/logs/"
Again, let’s go line-by-line and understand what’s happening in our workflow:
- Activate our virtual environment. The `|` symbol allows multiple commands to be executed within a single step.
- Execute the `statsscrape.py` script. We’ll also pass in two arguments: the name of the S3 bucket where we want to save our `.csv` file and the year of the season.
- Install the AWS Command Line Interface (CLI).
- Copy the log file from the Github machine executing our workflow to S3. Assuming our workflow completes successfully, the log file helps us diagnose any unexpected results.
The entire process is repeated each Wednesday during the season, so a new `.csv` and `.log` file will be added to S3 each week.
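If you ever want to confirm that the weekly artifacts are landing where you expect, a quick check from your local machine might look like this (assuming your AWS credentials are configured locally and the bucket matches the workflow’s `S3_BUCKET_NAME`):

# list the objects the scheduled workflow has written so far
import awswrangler as wr

bucket = "player-data-season"  # the S3_BUCKET_NAME value from the workflow
print(wr.s3.list_objects(f"s3://{bucket}/data/playerstats/raw/"))
print(wr.s3.list_objects(f"s3://{bucket}/logs/"))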
Wrapping Up
That’s it! Hopefully, this has been a helpful overview of using Github Actions as a scheduler for simple workflows. As always, please comment below if you have suggestions or feedback!