Description / Background
This was raised previously, and it looks like the fix hasn't worked:
We've seen ParallelCompactionsST fail because of a SIGSEGV in a compaction task. It's happening in the DataFusion compaction runner but the logs aren't telling us much more than that right now.
Steps to reproduce
- Run ParallelCompactionsST
- Intermittently, a compaction task crashes with a SIGSEGV and the test fails
Expected behaviour
Compactions should complete reliably.
Technical Notes / Implementation Details
ParallelCompactionsST runs a large number of very small compactions concurrently across 200 compaction tasks, to test that the system can keep up with that load. It writes 10 files to each of 8192 partitions, with standard ingest writing 1 million rows to one file per leaf partition. It then runs the basic compaction strategy with 2 files per compaction job, which results in 40960 jobs being created at once.
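As a quick sanity check on the scale, the job count follows from the figures above. This is only a rough sketch: the variable names are illustrative, and the per-task average assumes jobs are spread evenly across the 200 tasks, which may not hold in practice.

```python
# Figures taken from the test description; names are illustrative only.
PARTITIONS = 8192
FILES_PER_PARTITION = 10
FILES_PER_JOB = 2
TASKS = 200

# Total input files written by ingest across all leaf partitions.
total_files = PARTITIONS * FILES_PER_PARTITION

# Each compaction job takes 2 files, so the job count is half the file count.
total_jobs = total_files // FILES_PER_JOB

# Average jobs per task, ASSUMING an even spread across tasks.
jobs_per_task = total_jobs / TASKS

print(total_files, total_jobs, jobs_per_task)
```

So each task would run roughly 200 small compactions back to back on average, which may be relevant if the crash depends on repeated invocations of the native code within one JVM.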
Screenshots/Logs
Logs from compaction task after it picked up the job:
[main] job.execution.DefaultCompactionRunnerFactory INFO - Selecting DataFusionCompactionRunner for job ID <job-id>, table <table-name> (<table-id>)
[main] statestore.transactionlog.TransactionLogHead DEBUG - Not checking for snapshot of StateStorePartitions for table <table-name> (<table-id>), next check at <time>
[main] statestore.transactionlog.TransactionLogHead DEBUG - Updating StateStorePartitions for table <table-name> (<table-id>) from log from transaction 2
[main] statestore.transactionlog.TransactionLogHead DEBUG - No new transactions found in log of StateStorePartitions for table <table-name> (<table-id>) in 0.003s, last transaction number is 2
[WARN] sleeper_df/src/lib.rs:77 - Couldn't install color_eyre error handler could not set the provided `Theme` via `color_spantrace::set_theme` globally as another was already set
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffffbcbc3010, pid=7, tid=8
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.19.10.1 (17.0.19+10) (build 17.0.19+10-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.19.10.1 (17.0.19+10-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C [libc.so.6+0xa3010]
#
# Core dump will be written. Default location: //core.7
#
# An error report file with more information is saved as:
# //hs_err_pid7.log
#
# If you would like to submit a bug report, please visit:
# https://github.com/corretto/corretto-17/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/run.sh: line 20: 7 Aborted (core dumped) java --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED -cp /compaction-job-execution.jar sleeper.compaction.job.execution.ECSCompactionTaskRunner $*