SIGSEGV in compaction task in ParallelCompactionsST #7499

@patchwork01

Description / Background

This was raised previously, and it looks like the fix hasn't worked:

We've seen ParallelCompactionsST fail because of a SIGSEGV in a compaction task. It's happening in the DataFusion compaction runner but the logs aren't telling us much more than that right now.

Steps to reproduce

  1. Run ParallelCompactionsST
  2. Intermittently, a compaction task crashes with a SIGSEGV

Expected behaviour

Compactions should complete reliably.

Technical Notes / Implementation Details

ParallelCompactionsST runs a large number of very small compactions concurrently on 200 compaction tasks, to test that the system can keep up with that load. It writes 10 files to each of 8192 partitions, with standard ingest writing 1 million rows to one file per leaf partition. It then runs the basic compaction strategy with 2 files per compaction job, which results in 40960 jobs being created at once.
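As a sanity check on the figures above, the job count can be reproduced with a few lines of arithmetic. This is only an illustrative sketch, not Sleeper code: the function name `compaction_job_count` is hypothetical, and it assumes that jobs never span partitions and that only full batches of files become jobs in a single round.

```python
def compaction_job_count(partitions: int, files_per_partition: int, files_per_job: int) -> int:
    """Number of jobs created in one round when each partition's files
    are grouped into fixed-size batches (assumption: partial batches are skipped)."""
    jobs_per_partition = files_per_partition // files_per_job
    return partitions * jobs_per_partition


# Figures taken from the test description: 8192 partitions, 10 files each, 2 files per job.
total_jobs = compaction_job_count(partitions=8192, files_per_partition=10, files_per_job=2)

# Average load if the jobs were spread evenly over the 200 compaction tasks.
jobs_per_task = total_jobs / 200

print(total_jobs, jobs_per_task)  # prints: 40960 204.8
```

On these numbers each task would handle roughly 205 small jobs on average, so each task process runs a long sequence of compactions over its lifetime.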

Screenshots/Logs

Logs from compaction task after it picked up the job:

[main] job.execution.DefaultCompactionRunnerFactory INFO  - Selecting DataFusionCompactionRunner for job ID <job-id>, table <table-name> (<table-id>)
[main] statestore.transactionlog.TransactionLogHead DEBUG  - Not checking for snapshot of StateStorePartitions for table <table-name> (<table-id>), next check at <time>
[main] statestore.transactionlog.TransactionLogHead DEBUG  - Updating StateStorePartitions for table <table-name> (<table-id>) from log from transaction 2
[main] statestore.transactionlog.TransactionLogHead DEBUG  - No new transactions found in log of StateStorePartitions for table <table-name> (<table-id>) in 0.003s, last transaction number is 2
[WARN] sleeper_df/src/lib.rs:77 - Couldn't install color_eyre error handler could not set the provided `Theme` via `color_spantrace::set_theme` globally as another was already set
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffbcbc3010, pid=7, tid=8
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.19.10.1 (17.0.19+10) (build 17.0.19+10-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.19.10.1 (17.0.19+10-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [libc.so.6+0xa3010]
#
# Core dump will be written. Default location: //core.7
#
# An error report file with more information is saved as:
# //hs_err_pid7.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/run.sh: line 20:     7 Aborted                 (core dumped) java --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED -cp /compaction-job-execution.jar sleeper.compaction.job.execution.ECSCompactionTaskRunner $*
