
[DEV-14608] Change to local environment for Spark upgrade#4659

Open
sethstoudenmier wants to merge 46 commits into qat from mod/dev-14608-upgrade-local-to-match-emr-upgrade

@sethstoudenmier sethstoudenmier commented May 8, 2026

Contributor

Description:

Updates to support local development with EMR 7.12 and the corresponding libraries.

Technical Details:

This PR is broken down into a few categories of changes. Some of the changes were made while troubleshooting and left in place as they are improvements.

EMR Serverless Image

Currently the codebase supports usage of PySpark via two different approaches:

  1. pyspark with a Spark session running locally on your machine
  2. Spark without Hadoop, with a specific version of Hadoop installed separately, all inside of a Docker container

Using EMR makes it difficult to continue supporting the first approach because AWS uses patched versions of Hive, Hadoop, etc. that we don't readily have access to. However, AWS does supply the EMR Serverless images, which include their patched versions of these tools. This led to the creation of a new "usaspending-development" image that includes everything needed to run our entire test suite and CI / CD pipeline.

Some initial thoughts regarding the pros of this change:

  • local development environments now more closely match the deployed environments and CI / CD checks in GitHub
  • more consistent experience across developers
  • the image is large, but it builds significantly quicker than the current spark image for local development
  • modern IDEs support integration with a Python interpreter based on a Docker Compose service

With pros there are always cons:

  • developers are forced to use the docker approach now in order to perform any spark action
  • GitHub actions are required to build the Docker image prior to running tests (more on this in the following section)

GitHub Actions

Changes to the overall workflow of the test suite to use docker compose resulted in changes to the GitHub Actions for PR checks.

Reformatting

The step that generates the Broker branch to be used across the test suite was added as a single step in the beginning and the value is now provided to the subsequent steps. This change is reflected in the workflow changes shown below.

OLD:
Screenshot 2026-06-01 133054

NEW:
Screenshot 2026-06-01 133006

Usage of Docker Compose

The move away from installing Spark, Hive, and Hadoop individually meant that the test suite also needed to move to a Docker container approach. The entire test suite is now run via the Docker Compose "usaspending-test" service. A downside is that we have to build the image on each runner because it is too large to store as an artifact. However, the overall time to run the test suite is comparable to the current functionality because we no longer need to set up a Python environment (shown in screenshots below). Additionally, this removes the need to pre-install the JARs, as they are already present on the image.

OLD:
Screenshot 2026-06-01 133750

NEW:
Screenshot 2026-06-01 133638

This also allowed for removing the declaration of ENV in the different GitHub workflows because we can now use the .env.template file to populate the values for Docker Compose.

Misc.

The "init-python-environment" action is still used by the style checks because we use a Django management command for checking endpoint documentation. While we could look to update this, the changeset is already quite large and I would like to avoid further changes that aren't needed. However, to reclaim some processing time, the "spark" subset of packages was removed from the install.

Code Changes

Docker Compose

A lot of changes were made to the docker-compose.yml while troubleshooting issues and making sure that the containers will work for other developers.

Some notable changes:

  • The project's volume mount was updated to be /usaspending-api for all containers to avoid issues when mapping the source and remote paths. The mismatched mapping was causing issues with some tests and was also causing breakpoints to fail in IDEs.
  • Almost all aspects of the Docker Compose services are now based on the usaspending-development image, except for the API and download services. The idea is to leave those two services as-is so they match the deployed environments more closely.
  • The different ENV across the services were normalized at the top as YAML anchors.
  • Health checks were added to some containers to aid the usage of Docker Compose statements in the GitHub CI/CD test suite.
  • usaspending-ci service was updated to match our current CI/CD workflow.
  • Added the usaspending-style-checks service to better perform any necessary checks prior to merge.
    • This was more of a "want" than a "need" with all of the other changes.
  • Added the spark-shell service to take the place of accessing the pyspark shell with local development.

Configurations

conf.set("spark.databricks.delta.merge.materializeSource", "none")

Newer versions of Delta try to materialize the source table of a merge by default to help with performance. In almost all cases our tables are too large to materialize into memory without spilling onto disk. To help with performance this was set to "none" to avoid materializing the source.
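
A minimal sketch of how this override might be applied alongside other settings, assuming a `conf` object that exposes `set(key, value)` the way PySpark's `SparkConf` does. The helper `apply_delta_overrides` and the `DELTA_OVERRIDES` mapping are hypothetical names used for illustration and are not part of this PR; only the key and value come from the change above.

```python
from typing import Any, Dict

# Only this key/value pair comes from the PR; the mapping name is hypothetical.
DELTA_OVERRIDES: Dict[str, str] = {
    "spark.databricks.delta.merge.materializeSource": "none",
}


def apply_delta_overrides(conf: Any, overrides: Dict[str, str] = DELTA_OVERRIDES) -> Any:
    """Call conf.set(key, value) for each override and return the same conf."""
    for key, value in overrides.items():
        conf.set(key, value)
    return conf


# Assumed usage with PySpark (requires pyspark to be installed):
#   from pyspark import SparkConf
#   conf = apply_delta_overrides(SparkConf())
```

Keeping overrides like this in one mapping makes it easier to see which defaults were changed on purpose during the upgrade.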

Misc.

  • References to any JARs have been removed as we now use the JARs found inside of the EMR Serverless image. The exception is the Postgres JAR that gets downloaded when building the development image.
  • The .env.template was updated to support the test suite without any changes.
  • Some CDF tests returned rows in a different order and needed to be sorted.
  • README.md and CONTRIBUTING.md were updated to account for the changes in the PR and some overlooked lines that should have been previously updated.
  • spark.Dockerfile and testing.Dockerfile were both removed in favor of the new development.Dockerfile
  • Many changes were made to the Makefile so that local development uses the new / updated Docker Compose services
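
The sorting fix for the CDF tests in the list above follows a common pattern: rows that come back in no guaranteed order are sorted on a stable key before being compared. A minimal sketch, with plain dictionaries standing in for collected rows. The helper `sorted_rows`, the sample data, and the assumption that the rows carry a `_change_type` column are all hypothetical and not taken from the test suite.

```python
from typing import Any, Dict, Iterable, List, Sequence


def sorted_rows(rows: Iterable[Dict[str, Any]], keys: Sequence[str]) -> List[Dict[str, Any]]:
    """Return rows ordered by the given columns so comparisons ignore input order."""
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))


# Hypothetical rows standing in for results collected in a test.
actual = [
    {"id": 2, "_change_type": "insert"},
    {"id": 1, "_change_type": "update_postimage"},
    {"id": 1, "_change_type": "update_preimage"},
]
expected = [
    {"id": 1, "_change_type": "update_postimage"},
    {"id": 1, "_change_type": "update_preimage"},
    {"id": 2, "_change_type": "insert"},
]

# Sorting on every compared column makes the assertion independent of row order.
assert sorted_rows(actual, ["id", "_change_type"]) == expected
```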

Requirements for PR Merge:

  1. Unit & integration tests updated
  2. API documentation updated (examples listed below)
    1. API Contracts
    2. API UI
    3. Comments
  3. Data validation completed (examples listed below)
    1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
    2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
    3. Is the expected data returned with the expected format?
  4. Appropriate Operations ticket(s) created
  5. Jira Ticket(s)
    1. DEV-14608

Explain N/A in above checklist:

@sethstoudenmier sethstoudenmier added do not merge [PR] shouldn't be merged in progress [ISSUE | PR] being worked labels May 8, 2026