Commit 86acf61

[GLUTEN][CI] Gate Delta Spark UT against a known-failures baseline
Running delta-io/delta's spark suite against the Gluten Velox bundle produces many expected failures. This adds a committed known-failures baseline and a per-shard gate so CI is green when only baseline failures occur and red on a genuine regression, enabling incremental fixes.

- Inject ScalaTest's -u JUnit XML reporter (Delta only configures the console reporter, so no machine-readable per-test results existed).
- Capture sbt's exit so expected test failures don't fail the step; fail loudly only when zero reports are produced (compile/launch failure).
- compare-test-results.py classifies each test vs known-failures.txt (regression / expected / now-passing) in enforce/seed/aggregate modes.
- Add update_baseline + fail_on_fixed inputs and an aggregate job that emits a ready-to-commit baseline artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 03dfda5 commit 86acf61

4 files changed

Lines changed: 679 additions & 1 deletion

.github/workflows/delta_spark_ut.yml

Lines changed: 105 additions & 1 deletion
@@ -46,6 +46,16 @@ on:
         description: 'Forked test JVMs per shard (TEST_PARALLELISM_COUNT)'
         required: true
         default: '1'
+      update_baseline:
+        description: 'Seed/refresh the known-failures baseline instead of enforcing it'
+        type: boolean
+        required: false
+        default: false
+      fail_on_fixed:
+        description: 'Fail when a baseline test now passes (keeps the baseline honest)'
+        type: boolean
+        required: false
+        default: true
   pull_request:
     paths:
       - '.github/workflows/delta_spark_ut.yml'
@@ -71,6 +81,11 @@ env:
   DELTA_REF_DEFAULT: 'v4.2.0'
   DELTA_SPARK_VERSION_DEFAULT: '4.1'
   DELTA_TEST_PARALLELISM_DEFAULT: '1'
+  # Default mode for pull_request runs (where inputs.* is empty): enforce the
+  # committed baseline and fail when a baseline test starts passing. Override
+  # via the workflow_dispatch inputs above.
+  DELTA_UPDATE_BASELINE_DEFAULT: 'false'
+  DELTA_FAIL_ON_FIXED_DEFAULT: 'true'
   DELTA_SCALA_VERSION: '2.13.16'
   # Number of shards in the delta-spark-test matrix. Must equal the length of
   # the `shard` matrix below.
@@ -212,13 +227,19 @@ jobs:
           delta_ref='${{ github.event.inputs.delta_ref }}'
           spark_version='${{ github.event.inputs.spark_version }}'
           test_parallelism='${{ github.event.inputs.test_parallelism }}'
+          update_baseline='${{ github.event.inputs.update_baseline }}'
+          fail_on_fixed='${{ github.event.inputs.fail_on_fixed }}'
           : "${delta_ref:=${DELTA_REF_DEFAULT}}"
           : "${spark_version:=${DELTA_SPARK_VERSION_DEFAULT}}"
           : "${test_parallelism:=${DELTA_TEST_PARALLELISM_DEFAULT}}"
+          : "${update_baseline:=${DELTA_UPDATE_BASELINE_DEFAULT}}"
+          : "${fail_on_fixed:=${DELTA_FAIL_ON_FIXED_DEFAULT}}"
           {
             echo "delta_ref=${delta_ref}"
             echo "spark_version=${spark_version}"
             echo "test_parallelism=${test_parallelism}"
+            echo "update_baseline=${update_baseline}"
+            echo "fail_on_fixed=${fail_on_fixed}"
           } | tee -a "$GITHUB_OUTPUT"

       - name: Download Gluten bundle jar
@@ -235,7 +256,7 @@ jobs:
           # launcher needs). Install the rest of what Delta's build/sbt and the
           # tests may need. We deliberately do NOT install the full `curl`
           # package -- it conflicts with the pre-installed curl-minimal.
-          yum install -y java-17-openjdk-devel which findutils gzip
+          yum install -y java-17-openjdk-devel which findutils gzip python3
           export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
           export PATH=$JAVA_HOME/bin:$PATH
           java -version
@@ -341,13 +362,54 @@ jobs:
           # Delta's own Test/javaOptions seq so our `-Xmx6G` comes AFTER
           # `-Xmx1024m` and wins (last `-Xmx` wins). We also turn on heap
           # dump on OOM so if it happens again we can analyze the dump.
+          # `-u target/test-reports` enables ScalaTest's JUnit XML reporter so
+          # every suite writes per-test results. Delta itself only configures
+          # the console reporter (-oDF), so without this we'd have no machine-
+          # readable results to gate on. The path is relative to the forked
+          # test JVM's working dir (Test / baseDirectory = spark/), i.e.
+          # delta/spark/target/test-reports/TEST-*.xml.
+          #
+          # We deliberately do NOT let an sbt non-zero exit (which fires on the
+          # MANY expected Delta-on-Gluten failures) fail this step directly.
+          # Instead the known-failures gate below decides pass/fail: the build
+          # is green when the only failures are ones already recorded in the
+          # baseline, and red on a genuine regression.
+          set +e
           ./build/sbt \
             -DsparkVersion=${{ steps.resolve.outputs.spark_version }} \
             -v \
             -J-XX:+UseG1GC -J-Xmx4G \
             "++ ${DELTA_SCALA_VERSION}" \
             'set spark / Test / javaOptions ++= Seq("-Xmx6G", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/tmp/")' \
+            'set spark / Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-u", "target/test-reports")' \
             "spark/test"
+          SBT_EXIT=$?
+          set -e
+          echo "sbt spark/test exited with ${SBT_EXIT}"
+
+          # A compile/launch failure leaves no reports at all. In that case the
+          # gate would see zero failures and pass spuriously, so fail loudly.
+          REPORT_COUNT=$(find . -path '*/target/test-reports/*.xml' 2>/dev/null | wc -l || true)
+          echo "Found ${REPORT_COUNT} JUnit XML report file(s)."
+          if [ "${REPORT_COUNT}" -eq 0 ]; then
+            echo "::error::sbt produced no test reports (exit ${SBT_EXIT}) -- likely a compile or launch failure, not test failures."
+            exit 1
+          fi
+
+          # update_baseline=true -> SEED mode (record failures, never fail) so the
+          # baseline can be (re)generated. Otherwise ENFORCE against the baseline.
+          GATE_MODE=enforce
+          if [ "${{ steps.resolve.outputs.update_baseline }}" = "true" ]; then
+            GATE_MODE=seed
+          fi
+          mkdir -p "$GITHUB_WORKSPACE/gate-out"
+          python3 "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/compare-test-results.py" \
+            --mode "${GATE_MODE}" \
+            --reports-dir "$GITHUB_WORKSPACE/delta" \
+            --known-failures "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/known-failures.txt" \
+            --failures-out "$GITHUB_WORKSPACE/gate-out/failures-shard-${{ matrix.shard }}.txt" \
+            --ran-out "$GITHUB_WORKSPACE/gate-out/ran-shard-${{ matrix.shard }}.txt" \
+            --fail-on-fixed "${{ steps.resolve.outputs.fail_on_fixed }}"

       - name: Compress heap dumps (if any)
         if: ${{ failure() }}
@@ -364,6 +426,14 @@ jobs:
             echo "No heap dumps found in /tmp/."
           fi

+      - name: Upload per-shard gate lists
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-gate-lists-shard-${{ matrix.shard }}
+          path: gate-out/*.txt
+          if-no-files-found: warn
+
       - name: Upload test reports
         if: always()
         uses: actions/upload-artifact@v4
@@ -386,3 +456,37 @@ jobs:
             /tmp/*.hprof
             /tmp/*.hprof.gz
           if-no-files-found: ignore
+
+  # Merges every shard's failure/ran lists into a single, sorted, ready-to-commit
+  # known-failures.txt and reports global regressions / now-passing / stale
+  # entries. Runs even when some shards went red (if: always()) so the refreshed
+  # baseline artifact is always available -- this is what you download and commit
+  # to bootstrap or refresh the baseline (see util/delta-spark-ut/README.md).
+  delta-spark-aggregate:
+    needs: delta-spark-test
+    if: always()
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+      - name: Download per-shard gate lists
+        uses: actions/download-artifact@v4
+        continue-on-error: true
+        with:
+          pattern: delta-spark-ut-gate-lists-shard-*
+          path: gate-lists
+          merge-multiple: true
+      - name: Aggregate known failures
+        run: |
+          set -euo pipefail
+          python3 .github/workflows/util/delta-spark-ut/compare-test-results.py \
+            --mode aggregate \
+            --inputs-dir gate-lists \
+            --known-failures .github/workflows/util/delta-spark-ut/known-failures.txt \
+            --baseline-out aggregated/known-failures.txt
+      - name: Upload refreshed baseline
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-known-failures
+          path: aggregated/known-failures.txt
+          if-no-files-found: warn
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

# Delta Spark UT (Gluten) — managing expected failures

Running delta-io/delta's `spark` ScalaTest suite against the Gluten Velox
bundle produces **many expected failures**: Gluten does not yet offload every
Delta code path, and falls back or behaves differently in places. If CI simply
went red on any failure, the signal would be useless and we could never tell a
*new* breakage from the hundreds of already-known ones.

To make this manageable we keep a **baseline of known failures** and gate each
run against it. The build is green when the only failing tests are ones already
recorded in the baseline; it goes red the moment a **previously-passing test
starts failing** (a regression).

## Files

| File | Purpose |
|---|---|
| `known-failures.txt` | Committed baseline: the tests currently expected to fail. One `<suite>#<test>` per line. |
| `compare-test-results.py` | Parses the JUnit XML from `sbt spark/test` and gates / seeds / aggregates against the baseline. Standard-library only. |
| `setup-delta.sh` | Clones Delta, drops in the Gluten bundle, and patches `DeltaSQLCommandTest`. |
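
As a rough illustration of the `<suite>#<test>` baseline format, the sketch below parses baseline text into a set of ids. It is not the code in `compare-test-results.py`: the function names `parse_baseline` and `split_entry` are hypothetical, and treating blank lines and lines that start with `#` as comments is an assumption about the file format.

```python
def parse_baseline(text):
    """Return the set of `Suite#test` ids found in baseline text."""
    entries = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are comments.
        if not line or line.startswith("#"):
            continue
        entries.add(line)
    return entries


def split_entry(entry):
    """Split an id on the first '#', since a test name may itself contain '#'."""
    suite, _, test = entry.partition("#")
    return suite, test
```

Holding the baseline as a set keeps membership checks cheap and makes the later comparisons plain set operations.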

## How the gate works

Each test shard:

1. Runs `sbt spark/test` with ScalaTest's JUnit XML reporter enabled
   (`-u target/test-reports`), so every suite writes per-test results. (Delta
   itself only configures the console reporter, so the workflow injects this.)
2. Runs `compare-test-results.py --mode enforce`, which classifies every test:
   - **regression** — failed, but not in the baseline → **fails the shard**.
   - **expected** — failed and in the baseline → ignored.
   - **now-passing** — in the baseline but passed this run → fails the shard
     (so the baseline is kept honest), unless `fail_on_fixed=false`.
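
The classification in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: `read_results`, `classify`, and `gate_exit_code` are hypothetical names, test ids are assumed to be built from the JUnit `classname` and `name` attributes, and skipped tests are assumed to count as neither passed nor failed.

```python
import xml.etree.ElementTree as ET


def read_results(xml_text):
    """Map `Suite#test` -> True (passed) / False (failed) for one JUnit XML report."""
    results = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        if case.find("skipped") is not None:
            continue  # assumption: skipped tests are neither passed nor failed
        test_id = f"{case.get('classname')}#{case.get('name')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        # If an id appears more than once, any failing occurrence marks it failed.
        results[test_id] = results.get(test_id, True) and not failed
    return results


def classify(results, baseline):
    """Bucket every executed test against the baseline set."""
    failed = {t for t, ok in results.items() if not ok}
    passed = {t for t, ok in results.items() if ok}
    return {
        "regression": sorted(failed - baseline),
        "expected": sorted(failed & baseline),
        "now_passing": sorted(passed & baseline),
    }


def gate_exit_code(classes, fail_on_fixed=True):
    """Return 1 when the shard should go red, else 0."""
    if classes["regression"]:
        return 1
    if fail_on_fixed and classes["now_passing"]:
        return 1
    return 0
```

Baseline entries that never ran in this shard fall into none of the three buckets, which matches the idea that staleness is judged across all shards rather than per shard.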

A final `aggregate` job merges every shard's results into a single, sorted,
ready-to-commit `known-failures.txt` artifact and reports **stale** baseline
entries (tests no longer present in any shard, e.g. after a Delta version bump).
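
A minimal sketch of that merge, assuming each shard contributes one list of failed ids and one list of executed ids (as the `--failures-out` and `--ran-out` files suggest). The function name `aggregate` and the returned keys are hypothetical, not taken from the script.

```python
def aggregate(shard_failures, shard_ran, baseline):
    """Merge per-shard id lists and compare the union against the baseline set."""
    failures = set().union(*shard_failures)
    ran = set().union(*shard_ran)
    return {
        # Sorted union of all failures: the candidate refreshed baseline.
        "new_baseline": sorted(failures),
        "regressions": sorted(failures - baseline),
        "now_passing": sorted((ran - failures) & baseline),
        # Baseline entries that no shard executed at all.
        "stale": sorted(baseline - ran),
    }
```

Sorting the merged output keeps the committed file stable between runs, so baseline diffs show only real changes.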

Because Delta shards **by suite**, every suite (and therefore every test) runs
in exactly one shard, so per-shard enforcement sees complete suites and never
double-counts.

## Bootstrapping the baseline (first time)

While `known-failures.txt` has no entries the gate auto-runs in **seed mode**
(it never fails — it only records failures). To create the initial baseline:

1. Trigger **Actions → Delta Spark UT (Gluten) → Run workflow** with
   `update_baseline = true`.
2. When it finishes, download the **`delta-spark-ut-known-failures`** artifact.
3. Replace `known-failures.txt` with the file from that artifact and commit it.

From the next run onward the gate enforces the baseline.

## Day-to-day: fixing tests incrementally

- **You fixed Gluten and some Delta tests now pass.** CI will flag them as
  *now-passing*. Delete those lines from `known-failures.txt` in your PR. That
  is the whole point: the baseline should shrink as coverage improves.
- **You intentionally added a new expected failure** (e.g. a Delta path Gluten
  can't offload yet). Add the exact `Suite#test` line(s) the gate prints under
  *Regressions* to `known-failures.txt`, ideally with a comment explaining why.
- **A genuine regression.** Fix it; do **not** add it to the baseline.

The error log prints copy-pasteable `Suite#test` lines for both regressions and
now-passing tests, and each run's job summary shows the full breakdown.

## Regenerating / refreshing the whole baseline

After a Delta version bump or a large Gluten change, regenerate from scratch the
same way as bootstrapping: run the workflow with `update_baseline=true`, download
the `delta-spark-ut-known-failures` artifact, and commit it. The aggregate job
also lists **stale** entries you can prune.

## Caveats

- **Flaky tests.** A flaky test that usually passes will be flagged as a
  regression when it flakes; one that usually fails (and is in the baseline)
  may be flagged as now-passing when it happens to pass. Re-run, or set
  `fail_on_fixed=false` for that run, and keep genuinely flaky tests out of the
  enforced set.
- **Known failures still execute** (and fail) — they are gated *after* the run,
  not skipped — so they still consume CI time. This keeps us decoupled from
  Delta's sources; skipping them at runtime would require patching Delta.

## Running the comparison locally

```bash
# after an sbt spark/test run that wrote delta/**/target/test-reports/*.xml
python3 .github/workflows/util/delta-spark-ut/compare-test-results.py \
  --mode enforce \
  --reports-dir delta \
  --known-failures .github/workflows/util/delta-spark-ut/known-failures.txt \
  --failures-out /tmp/failures.txt --ran-out /tmp/ran.txt
```