[DEV-14608] Change to local environment for Spark upgrade by sethstoudenmier · Pull Request #4659 · fedspendingtransparency/usaspending-api

sethstoudenmier · 2026-05-08T13:36:56Z

Description:

Updates to support local development with EMR 7.12 and the corresponding libraries.

Technical Details:

This PR is broken down into a few categories of changes. Some of the changes were made while troubleshooting and left in place as they are improvements.

EMR Serverless Image

Currently the codebase supports usage of PySpark via two different approaches:

Installing pyspark and using a Spark session locally on your machine
Installing Spark without Hadoop and then a specific version of Hadoop, all inside of a Docker container

The usage of EMR makes it difficult to continue supporting the first approach because AWS uses patched versions of Hive, Hadoop, etc. that we don't readily have access to. However, AWS does supply EMR Serverless images that we can use, which include their patched versions of these tools. This led to the creation of a new "usaspending-development" image that includes everything needed to run our entire test suite and CI / CD pipeline.

Some initial thoughts regarding the pros of this change:

local development environments now more closely match the deployed environments and CI / CD checks in GitHub
more consistent experience across developers
the image is large, but it builds significantly quicker than the current spark image for local development
modern IDEs support integration with a Python interpreter based on a Docker Compose service

With pros there are always cons:

developers are forced to use the docker approach now in order to perform any spark action
GitHub actions are required to build the Docker image prior to running tests (more on this in the following section)

GitHub Actions

Changes to the overall workflow of the test suite to use docker compose resulted in changes to the GitHub Actions for PR checks.

Reformatting

The step that generates the Broker branch to be used across the test suite was added as a single step in the beginning and the value is now provided to the subsequent steps. This change is reflected in the workflow changes shown below.

[Screenshots comparing the OLD and NEW workflow steps]
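As a rough illustration of how one step can hand a value to later steps, GitHub Actions lets a step append `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable. The sketch below assumes that mechanism; the output name `broker_branch` and the value `qat` are hypothetical, and the actual workflow in this PR may set the value differently (for example, directly in shell).

```python
import os


def write_step_output(name: str, value: str) -> None:
    """Append a name=value pair to the file GitHub Actions exposes as GITHUB_OUTPUT."""
    output_path = os.environ["GITHUB_OUTPUT"]  # set by the runner for each step
    with open(output_path, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")


# Hypothetical usage inside a workflow step:
# write_step_output("broker_branch", "qat")
```

Later steps in the same job could then read the value with an expression such as `${{ steps.<step_id>.outputs.broker_branch }}`, where `<step_id>` is the `id` given to the producing step.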

Usage of Docker Compose

The move away from installing Spark, Hive, and Hadoop individually meant that the test suite also needed to move to a Docker container approach. The entire test suite now runs via the Docker Compose "usaspending-test" service. A downside is that we have to build the image on each runner because it is too large to store as an artifact. However, the overall time to run the test suite is comparable to the current workflow because a Python environment no longer needs to be set up (shown in screenshots below). Additionally, this removes the need to pre-install the JARs, as they are already on the image.

[Screenshots comparing the OLD and NEW test suite run times]

This also allowed for removing the declaration of ENV in the different GitHub workflows because we can now use the .env.template file to populate the values for Docker Compose.
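For readers unfamiliar with the file format, here is a minimal sketch of how `KEY=VALUE` lines in a file like .env.template turn into a set of values. It is a simplified stand-in, not Docker Compose's own parser (which also handles interpolation and more quoting rules), and the keys in the sample are hypothetical.

```python
def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a KEY=VALUE line; ignore it
        # Drop one layer of surrounding quotes, if present.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical sample content; the real template defines its own keys.
sample = """
# Database settings
EXAMPLE_DB_HOST=localhost
EXAMPLE_DB_PORT="5432"
"""
settings = parse_env_file(sample)
```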

Misc.

The "init-python-environment" action is still used by the style checks because we use a Django management command for checking endpoint documentation. While we could look to update this, the changeset is already quite large and I would like to avoid further changes that aren't needed. However, to reclaim some processing time, the "spark" subset of packages was removed from the install.

Code Changes

Docker Compose

A lot of changes were made to the docker-compose.yml while troubleshooting issues and making sure that the containers will work for other developers.

Some notable changes:

The project's volume mount was updated to be /usaspending-api for all containers so that the source and remote paths map consistently. The previous mismatched mapping was causing issues with some tests and was also causing breakpoints to fail in IDEs.
Almost all aspects of the Docker Compose services are now based on the usaspending-development image, except for the API and download services. The idea is to leave those two services as-is so they match the deployed environments more closely.
The environment variables shared across the services were consolidated at the top of the file as YAML anchors.
Health checks were added to some containers to aid the usage of Docker Compose statements in the GitHub CI/CD test suite.
The usaspending-ci service was updated to match our current CI/CD workflow.
Added the usaspending-style-checks service to better perform any necessary checks prior to merge.
- This was more of a "want" than a "need" with all of the other changes.
Added the spark-shell service to take the place of accessing the pyspark shell with local development.
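To illustrate what the health checks in the list above provide, the sketch below shows the general idea of polling a probe until a service reports healthy or a timeout expires. It is a conceptual stand-in written for this description, not the configuration in docker-compose.yml; the function name and parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call probe repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True  # service reported healthy
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        time.sleep(interval)
```

In Docker Compose itself, a dependent service can wait on this kind of signal by declaring `depends_on` with `condition: service_healthy`.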

Configurations

conf.set("spark.databricks.delta.merge.materializeSource", "none")

Newer versions of Delta try to materialize the source table of a merge by default to help with performance. In almost all cases our tables are too large to materialize into memory without spilling onto disk. To help with performance, this was set to "none" to avoid materializing the source.
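A minimal sketch of keeping such overrides in one place and applying them to a configuration object follows. The `RecordingConf` class is a hypothetical stand-in so the example runs without Spark installed; with PySpark available, an object such as `pyspark.SparkConf()` exposes the same `set(key, value)` method. Only the setting named above is taken from this PR.

```python
class RecordingConf:
    """Hypothetical stand-in for a Spark conf object; stores settings in a dict."""

    def __init__(self):
        self.settings = {}

    def set(self, key, value):
        self.settings[key] = value
        return self


# The only override described in this PR.
MERGE_OVERRIDES = {"spark.databricks.delta.merge.materializeSource": "none"}


def apply_overrides(conf, overrides=MERGE_OVERRIDES):
    """Apply each key/value override to any object with a set(key, value) method."""
    for key, value in overrides.items():
        conf.set(key, value)
    return conf


conf = apply_overrides(RecordingConf())
```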

Misc.

References to any JARs have been removed as we now use the JARs found inside of the EMR Serverless image. The exception is the Postgres JAR that gets downloaded when building the development image.
The .env.template was updated to support the test suite without any changes.
Some CDF tests returned rows in a different order and needed their results sorted before comparison.
README.md and CONTRIBUTING.md were updated to account for the changes in the PR and some overlooked lines that should have been previously updated.
spark.Dockerfile and testing.Dockerfile were both removed in favor of the new development.Dockerfile
Many changes were made to the Makefile to support the new and updated Docker Compose services for local development.
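The row-ordering fix mentioned in the list above follows a common testing pattern: sort both the actual and expected rows by a stable key before comparing. The sketch below is a generic illustration with hypothetical column names, not the code changed in this PR.

```python
def assert_same_rows(actual, expected, key_fields):
    """Compare two lists of dict rows while ignoring their original order."""

    def sort_key(row):
        return tuple(row[field] for field in key_fields)

    assert sorted(actual, key=sort_key) == sorted(expected, key=sort_key)


# Hypothetical rows that differ only in order.
actual_rows = [{"id": 2, "change": "update"}, {"id": 1, "change": "insert"}]
expected_rows = [{"id": 1, "change": "insert"}, {"id": 2, "change": "update"}]
assert_same_rows(actual_rows, expected_rows, key_fields=["id"])
```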

Requirements for PR Merge:

Unit & integration tests updated
API documentation updated (examples listed below)
1. API Contracts
2. API UI
3. Comments
Data validation completed (examples listed below)
1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
3. Is the expected data returned with the expected format?
Appropriate Operations ticket(s) created
Jira Ticket(s)
1. DEV-14608

Explain N/A in above checklist:


sethstoudenmier added 2 commits May 8, 2026 00:36

[DEV-14608] Initial progress using EMR image for development

eeba01f

[DEV-14608] handle merge conflicts

e3ab3e0

sethstoudenmier added the "do not merge" ([PR] shouldn't be merged) and "in progress" ([ISSUE | PR] being worked) labels May 8, 2026

github-actions Bot assigned sethstoudenmier May 8, 2026

sethstoudenmier added 25 commits May 8, 2026 09:43

[DEV-14608] fix typo

9235d5d

[DEV-14608] additional updates for running test suite

2c848ec

[DEV-14608] additional updates for running test suite

9cc06f4

[DEV-14608] updates for test suite

5997fe5

[DEV-14608] troubleshoot github actions

41b7364

[DEV-14608] breakout creation of broker branch

21d06f3

[DEV-14608] cleanup

3d2ec02

[DEV-14608] troubleshoot github actions

3479f35

[DEV-14608] troubleshoot github actions

c7ac20d

[DEV-14608] troubleshoot github actions

d6caa91

[DEV-14608] troubleshoot failing tests and github actions

6b9fd45

[DEV-14608] troubleshoot failing tests and github actions

488307f

[DEV-14608] troubleshoot github actions

301e069

Use .env for all docker compose env

2e63f99

Handle merge conflict

3feaea7

[DEV-14608] troubleshoot github actions

9f45537

[DEV-14608] troubleshoot github actions

fdaa4ec

[DEV-14608] troubleshoot github actions

6fb2686

[DEV-14608] troubleshoot github actions

f6bce38

[DEV-14608] troubleshoot github actions

a41fda6

[DEV-14608] troubleshoot github actions

7322c62

[DEV-14608] troubleshoot github actions

c669f0c

[DEV-14608] adjust test cases

1a55247

[DEV-14608] make sure processes are cleaned up on containers

fdf5b10

[DEV-14608] undo removing search_path in github action

be85dc6

sethstoudenmier added 19 commits May 21, 2026 10:21

[DEV-14608] add postgres package to dockerfile for psql

bccb9c1

[DEV-14608] fix tests

7094638

[DEV-14608] fix test

cb197b2

[DEV-14608] troubleshoot slow merge statement

ffc80e3

[DEV-14608] further local config updates; test changes for transaction search incremental load

f72587d

[DEV-14608] cleanup and validating docker compose further

5b239ce

[DEV-14608] cleanup documentation

faf76ba

[DEV-14608] cleanup makefile

07ed1b4

[DEV-14608] repartition transaction_search dataframe

1042e9a

[DEV-14608] repartition search dataframes

ccc4e7b

[DEV-14608] add skew hint

03c813d

[DEV-14608] capture explain plan

11a05fb

[DEV-14608] test broadcast join

941546b

[DEV-14608] test broadcast join

8e16d43

[DEV-14608] troubleshoot

83cb206

[DEV-14608] troubleshoot

2ec0e83

[DEV-14608] undo troubleshooting

6b7f2ca

[DEV-14608] test checkpoints

0818e68

[DEV-14608] remove checkpoints

ef59250

sethstoudenmier wants to merge 46 commits into qat from mod/dev-14608-upgrade-local-to-match-emr-upgrade
