[DEV-14608] Change to local environment for Spark upgrade #4659
Open
sethstoudenmier wants to merge 46 commits into
Description:
Updates to support local development with EMR 7.12 and the corresponding libraries.
Technical Details:
This PR is broken down into a few categories of changes. Some of the changes were made while troubleshooting and left in place as they are improvements.
EMR Serverless Image
Currently the codebase supports usage of PySpark via two different approaches:
Using EMR makes it difficult to keep supporting the first approach because AWS uses patched versions of Hive, Hadoop, etc. that we don't readily have access to. However, AWS does supply EMR Serverless images, which include its patched versions of these tools, and we can use those. This led to the creation of a new "usaspending-development" image that includes everything needed to run our entire test suite and CI/CD pipeline.
Some initial thoughts regarding the pros of this change:
With pros there are always cons:
GitHub Actions
Moving the test suite workflow to Docker Compose required changes to the GitHub Actions used for PR checks.
Reformatting
The Broker branch used across the test suite is now generated once, in a single step at the beginning, and the value is passed to the subsequent steps. This change is reflected in the workflow changes shown below.
OLD:

NEW:

Usage of Docker Compose
The move away from installing Spark, Hive, and Hadoop individually meant that the test suite also needed to move to a Docker container approach. The entire test suite is now run via the Docker Compose "usaspending-test" service. A downside is that we have to build the image on each runner because it is too large to store as an artifact. However, the overall time to run the test suite is comparable to the current functionality because we no longer need to set up a Python environment (shown in screenshots below). Additionally, this removes the need to pre-install the JARs because they are already on the image.
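As a rough illustration of running the suite through Compose, the sketch below builds the command line for a one-off run of the "usaspending-test" service. The helper name and the choice of `docker compose run --rm` are assumptions for illustration; the workflow in this PR may invoke the service differently.

```python
import shlex
from typing import List, Sequence


def compose_test_command(
    service: str = "usaspending-test",
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv for a one-off run of a Docker Compose service.

    Hypothetical helper: `docker compose run --rm <service>` starts the
    service and removes the container once it exits. Anything in
    `extra_args` is passed after the service name and overrides the
    service's default command.
    """
    return ["docker", "compose", "run", "--rm", service, *extra_args]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(compose_test_command()))
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the repository root so that Compose picks up the project's docker-compose.yml.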
OLD:

NEW:

This also allowed for removing the declaration of ENV in the different GitHub workflows because we can now use the .env.template file to populate the values for Docker Compose.
Misc.
The "init-python-environment" action is still used by the style checks because we use a Django management command for checking endpoint documentation. While we could look to update this, the changeset is already quite large and I would like to avoid further changes that aren't needed. To reclaim some processing time, however, the "spark" subset of packages was removed from the install.
Code Changes
Docker Compose
A lot of changes were made to the docker-compose.yml while troubleshooting issues and making sure that the containers will work for other developers.
Some notable changes:
- /usaspending-api for all containers to avoid issues when mapping the source and remote paths. This was causing issues with some tests and also breakpoints were failing in IDEs due to this mismatched mapping.
- usaspending-development image, except for the API and download services. The idea is to leave those two services as-is so they match the deployed environments more closely.
- usaspending-ci service was updated to match our current CI/CD workflow.
- usaspending-style-checks service to better perform any necessary checks prior to merge.
- spark-shell service to take the place of accessing the pyspark shell with local development.
Configurations
Newer versions of Delta try to materialize the source table of a merge by default to help with performance. In almost all cases our tables are too large to materialize into memory without spilling onto disk. To help with performance, this was set to "none" to avoid materializing the source.
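For reference, a minimal sketch of applying this setting to a Spark session builder. It assumes the standard Delta Lake option name spark.databricks.delta.merge.materializeSource; the helper and constant names are hypothetical, and the way this repository actually wires its Spark configuration may differ.

```python
from typing import Any, Dict

# Assumed Delta Lake option name; "none" turns off source materialization for MERGE.
DELTA_MERGE_CONF: Dict[str, str] = {
    "spark.databricks.delta.merge.materializeSource": "none",
}


def apply_conf(builder: Any, conf: Dict[str, str]) -> Any:
    """Apply each key/value pair to a SparkSession-style builder and return it.

    Hypothetical helper: it only relies on the builder exposing
    `.config(key, value)` and returning a builder, as
    `SparkSession.builder` does.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, DELTA_MERGE_CONF).getOrCreate()` would produce a session with the option set.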
Misc.
- .env.template was updated to support the test suite without any changes.
- spark.Dockerfile and testing.Dockerfile were both removed in favor of the new development.Dockerfile.
Requirements for PR Merge:
Explain N/A in above checklist: