Benchmark: ppi #485

@github-actions

Description

Benchmark scenario ID: ppi
Benchmark scenario definition: https://github.com/ESA-APEx/apex_algorithms/blob/9a6041f4791ed0ca95a8d7be07abdf63151da5f4/algorithm_catalog/vito/ppi/benchmark_scenarios/ppi.json
openEO backend: openeo.dataspace.copernicus.eu

GitHub Actions workflow run: https://github.com/ESA-APEx/apex_algorithms/actions/runs/25001294085
Workflow artifacts: https://github.com/ESA-APEx/apex_algorithms/actions/runs/25001294085#artifacts

Test start: 2026-04-27 14:35:36.400636+00:00
Test duration: 0:14:01.046927
Test outcome: ❌ failed

Last successful test phase: create-job
Failure in test phase: run-job

Contact Information

Name: Victor Verhaert
Organization: VITO
Contact: via VITO (VITO Website, GitHub)

Process Graph

{
  "ppi1": {
    "process_id": "ppi",
    "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/main/algorithm_catalog/vito/ppi/openeo_udp/ppi.json",
    "arguments": {
      "temporal_extent": [
        "2022-06-11",
        "2022-06-12"
      ],
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              4.4387,
              50.42624
            ],
            [
              5.9539,
              50.42624
            ],
            [
              5.9539,
              51.4424
            ],
            [
              4.4387,
              51.4424
            ],
            [
              4.4387,
              50.42624
            ]
          ]
        ]
      }
    },
    "result": true
  }
}
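For local experimentation, the JSON above maps one-to-one onto a plain Python dict in the shape the openeo client accepts for job creation. A minimal offline sketch (no connection or submission is made; variable names are illustrative):

```python
# Build the same "ppi" process graph as a Python dict (offline illustration;
# submitting it would additionally require an authenticated openeo connection).
bbox = {"west": 4.4387, "south": 50.42624, "east": 5.9539, "north": 51.4424}

# GeoJSON polygons require a closed linear ring: the first point is repeated last.
ring = [
    [bbox["west"], bbox["south"]],
    [bbox["east"], bbox["south"]],
    [bbox["east"], bbox["north"]],
    [bbox["west"], bbox["north"]],
    [bbox["west"], bbox["south"]],
]

process_graph = {
    "ppi1": {
        "process_id": "ppi",
        "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/main/algorithm_catalog/vito/ppi/openeo_udp/ppi.json",
        "arguments": {
            "temporal_extent": ["2022-06-11", "2022-06-12"],
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        },
        "result": True,
    }
}

assert ring[0] == ring[-1]  # ring is closed, as in the original graph
```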

Error Logs

scenario = BenchmarkScenario(id='ppi', description='ppi example', backend='openeo.dataspace.copernicus.eu', process_graph={'ppi1'...PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/algorithm_catalog/vito/ppi/benchmark_scenarios/ppi.json'))
connection_factory = <function connection_factory.<locals>.get_connection at 0x7f4237716a20>
tmp_path = PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/qa/benchmarks/tmp_path_root/test_run_benchmark_ppi_0')
track_metric = <function track_metric.<locals>.track at 0x7f4237716b60>
track_phase = <apex_algorithm_qa_tools.pytest.pytest_track_metrics._PhaseTracker object at 0x7f42377420f0>
upload_assets_on_fail = <apex_algorithm_qa_tools.pytest.pytest_upload_assets.upload_assets_on_fail.<locals>._Collector object at 0x7f4237fadeb0>
request = <FixtureRequest for <Function test_run_benchmark[ppi]>>

    @pytest.mark.parametrize(
        "scenario",
        [
            # Use scenario id as parameterization id to give nicer test names.
            pytest.param(uc, id=uc.id)
            for uc in get_benchmark_scenarios()
        ],
    )
    def test_run_benchmark(
        scenario: BenchmarkScenario,
        connection_factory,
        tmp_path: Path,
        track_metric,
        track_phase,
        upload_assets_on_fail,
        request,
    ):
        track_metric("scenario_id", scenario.id)

        with track_phase(phase="connect"):
            # Check if a backend override has been provided via cli options.
            override_backend = request.config.getoption("--override-backend")
            backend_filter = request.config.getoption("--backend-filter")
            if backend_filter and not re.match(backend_filter, scenario.backend):
                # TODO apply filter during scenario retrieval, but seems to be hard to retrieve cli param
                pytest.skip(
                    f"skipping scenario {scenario.id} because backend {scenario.backend} does not match filter {backend_filter!r}"
                )
            backend = scenario.backend
            if override_backend:
                _log.info(f"Overriding backend URL with {override_backend!r}")
                backend = override_backend

            connection: openeo.Connection = connection_factory(url=backend)

        report_path = None
        if request.config.getoption("--upload-benchmark-report"):
            report_path = tmp_path / "benchmark_report.json"
            report_path.write_text(json.dumps({
                "scenario_id": scenario.id,
                "scenario_description": scenario.description,
                "scenario_backend": scenario.backend,
                "scenario_source": str(scenario.source) if scenario.source else None,
                "reference_data": scenario.reference_data,
                "reference_options": scenario.reference_options,
            }, indent=2))
            upload_assets_on_fail(report_path)

        def _on_phase_exception(phase: str, exc: Exception):
            if report_path is not None:
                report = json.loads(report_path.read_text())
                report["test_failed"] = True
                report["test_failed_phase"] = phase
                report["test_error_message"] = str(exc)
                report_path.write_text(json.dumps(report, indent=2))
                cwd_report_dir = Path("benchmark_reports")
                cwd_report_dir.mkdir(exist_ok=True)
                (cwd_report_dir / f"{scenario.id}_benchmark_report.json").write_text(
                    json.dumps(report, indent=2)
                )
                report_url = upload_assets_on_fail.get_url(report_path)
                if report_url:
                    exc.add_note(f"Benchmark report: {report_url}")

        track_phase.on_exception = _on_phase_exception

        with track_phase(phase="create-job"):
            # TODO #14 scenario option to use synchronous instead of batch job mode?
            job = connection.create_job(
                process_graph=scenario.process_graph,
                title=f"APEx benchmark {scenario.id}",
                additional=scenario.job_options,
            )
            track_metric("job_id", job.job_id)

            if report_path is not None:
                report = json.loads(report_path.read_text())
                report["job_id"] = job.job_id
                report_path.write_text(json.dumps(report, indent=2))

        with track_phase(phase="run-job"):
            # TODO: monitor timing and progress
            # TODO: separate "job started" and run phases?
            max_minutes = request.config.getoption("--maximum-job-time-in-minutes")
            if max_minutes:
                def _timeout_handler(signum, frame):
                    raise TimeoutError(
                        f"Batch job {job.job_id} exceeded maximum allowed time of {max_minutes} minutes"
                    )

                old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
                signal.alarm(max_minutes * 60)
            try:
>               job.start_and_wait()

tests/test_benchmarks.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <BatchJob job_id='j-26042714354145a88a343b3e01a60c5b'>

    def start_and_wait(
        self,
        *,
        print=print,
        max_poll_interval: float = DEFAULT_JOB_STATUS_POLL_INTERVAL_MAX,
        connection_retry_interval: float = DEFAULT_JOB_STATUS_POLL_CONNECTION_RETRY_INTERVAL,
        soft_error_max: int = DEFAULT_JOB_STATUS_POLL_SOFT_ERROR_MAX,
        show_error_logs: bool = True,
        require_success: bool = True,
    ) -> BatchJob:
        """
        Start the batch job, poll its status and wait till it finishes (or fails)

        :param print: print/logging function to show progress/status
        :param max_poll_interval: maximum number of seconds to sleep between job status polls
        :param connection_retry_interval: how long to wait when status poll failed due to connection issue
        :param soft_error_max: maximum number of soft errors (e.g. temporary connection glitches) to allow
        :param show_error_logs: whether to automatically print error logs when the batch job failed.
        :param require_success: whether to raise an exception if the job did not finish successfully.

        :return: Handle to the job created at the backend.

        .. versionchanged:: 0.37.0
            Added argument ``show_error_logs``.

        .. versionchanged:: 0.42.0
            All arguments must be specified as keyword arguments,
            to eliminate the risk of positional mix-ups between heterogeneous arguments and flags.

        .. versionchanged:: 0.42.0
            Added argument ``require_success``.
        """
        # TODO rename `connection_retry_interval` to something more generic?
        start_time = time.time()

        def elapsed() -> str:
            return str(datetime.timedelta(seconds=time.time() - start_time)).rsplit(".")[0]

        def print_status(msg: str):
            print("{t} Job {i!r}: {m}".format(t=elapsed(), i=self.job_id, m=msg))

        # TODO: make `max_poll_interval`, `connection_retry_interval` class constants or instance properties?
        print_status("send 'start'")
        self.start()

        # TODO: also add  `wait` method so you can track a job that already has started explicitly
        #   or just rename this method to `wait` and automatically do start if not started yet?

        # Start with fast polling.
        poll_interval = min(5, max_poll_interval)
        status = None
        _soft_error_count = 0

        def soft_error(message: str):
            """Non breaking error (unless we had too much of them)"""
            nonlocal _soft_error_count
            _soft_error_count += 1
            if _soft_error_count > soft_error_max:
                raise OpenEoClientException("Excessive soft errors")
            print_status(message)
            time.sleep(connection_retry_interval)

        while True:
            # TODO: also allow a hard time limit on this infinite poll loop?
            try:
                job_info = self.describe()
            except requests.ConnectionError as e:
                soft_error("Connection error while polling job status: {e}".format(e=e))
                continue
            except OpenEoApiPlainError as e:
                if e.http_status_code in [HTTP_502_BAD_GATEWAY, HTTP_503_SERVICE_UNAVAILABLE]:
                    soft_error("Service availability error while polling job status: {e}".format(e=e))
                    continue
                else:
                    raise

            status = job_info.get("status", "N/A")

            progress = job_info.get("progress")
            if isinstance(progress, int):
                progress = f"{progress:d}%"
            elif isinstance(progress, float):
                progress = f"{progress:.1f}%"
            else:
                progress = "N/A"
            print_status(f"{status} (progress {progress})")
            if status not in ('submitted', 'created', 'queued', 'running'):
                break

            # Sleep for next poll (and adaptively make polling less frequent)
            time.sleep(poll_interval)
            poll_interval = min(1.25 * poll_interval, max_poll_interval)

        if require_success and status != "finished":
            # TODO: render logs jupyter-aware in a notebook context?
            if show_error_logs:
                print(f"Your batch job {self.job_id!r} failed. Error logs:")
                print(self.logs(level=logging.ERROR))
                print(
                    f"Full logs can be inspected in an openEO (web) editor or with `connection.job({self.job_id!r}).logs()`."
                )
>           raise JobFailedException(
                f"Batch job {self.job_id!r} didn't finish successfully. Status: {status} (after {elapsed()}).",
                job=self,
            )
E           openeo.rest.JobFailedException: Batch job 'j-26042714354145a88a343b3e01a60c5b' didn't finish successfully. Status: error (after 0:13:56).

/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/openeo/rest/job.py:382: JobFailedException
----------------------------- Captured stdout call -----------------------------
0:00:00 Job 'j-26042714354145a88a343b3e01a60c5b': send 'start'
0:00:17 Job 'j-26042714354145a88a343b3e01a60c5b': created (progress 0%)
0:00:22 Job 'j-26042714354145a88a343b3e01a60c5b': created (progress 0%)
0:00:28 Job 'j-26042714354145a88a343b3e01a60c5b': created (progress 0%)
0:00:37 Job 'j-26042714354145a88a343b3e01a60c5b': created (progress 0%)
0:00:47 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:00:59 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:01:15 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:01:34 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:01:58 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:02:28 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:03:06 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:03:53 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:04:51 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:05:51 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:06:51 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:07:52 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:08:52 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:09:52 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:10:53 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:11:53 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:12:55 Job 'j-26042714354145a88a343b3e01a60c5b': running (progress N/A)
0:13:55 Job 'j-26042714354145a88a343b3e01a60c5b': error (progress N/A)
Your batch job 'j-26042714354145a88a343b3e01a60c5b' failed. Error logs:
[{'id': '[1777300822160, 903951]', 'time': '2026-04-27T14:40:22.160Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#68,Executor task launch worker for task 4.0 in stage 54.0 (TID 8302),5,main]'}, {'id': '[1777300824374, 665311]', 'time': '2026-04-27T14:40:24.374Z', 'level': 'error', 'message': 'Lost executor 7 on 10.42.210.127: \nThe executor with id 7 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:37:55Z\n\t container finished at: 2026-04-27T14:40:22Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777300852235, 575376]', 'time': '2026-04-27T14:40:52.235Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#62,Executor task launch worker for task 4.1 in stage 54.0 (TID 8306),5,main]'}, {'id': '[1777300856686, 614109]', 'time': '2026-04-27T14:40:56.686Z', 'level': 'error', 'message': 'Lost executor 2 on 10.42.79.150: \nThe executor with id 2 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:36:22Z\n\t container finished at: 2026-04-27T14:40:52Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777300857330, 58202]', 'time': '2026-04-27T14:40:57.330Z', 'level': 'error', 'message': 'Missing an output location for shuffle 25 partition 0'}, {'id': '[1777300857365, 17195]', 'time': '2026-04-27T14:40:57.365Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 25 partition 0\n\tat 
org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1777300956041, 729466]', 'time': '2026-04-27T14:42:36.041Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#158,Executor task launch worker for task 1.0 in stage 54.1 (TID 8393),5,main]'}, {'id': '[1777300956467, 907998]', 'time': '2026-04-27T14:42:36.467Z', 'level': 'error', 'message': 'Missing an output location for shuffle 25 partition 4'}, {'id': '[1777300956524, 978196]', 'time': '2026-04-27T14:42:36.524Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 25 partition 4\n\tat org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat 
org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1777300958836, 375793]', 'time': '2026-04-27T14:42:38.836Z', 'level': 'error', 'message': 'Lost executor 3 on 10.42.16.33: \nThe executor with id 3 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: 
spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:36:50Z\n\t container finished at: 2026-04-27T14:42:36Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777300970855, 9901]', 'time': '2026-04-27T14:42:50.855Z', 'level': 'error', 'message': 'Exception while beginning fetch of 1 outstanding blocks'}, {'id': '[1777300970862, 406944]', 'time': '2026-04-27T14:42:50.862Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.210.127:43729'}, {'id': '[1777300988735, 472555]', 'time': '2026-04-27T14:43:08.735Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#81,Executor task launch worker for task 1.1 in stage 54.1 (TID 8394),5,main]'}, {'id': '[1777300990879, 232755]', 'time': '2026-04-27T14:43:10.879Z', 'level': 'error', 'message': 'Lost executor 11 on 10.42.49.146: \nThe executor with id 11 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:39:22Z\n\t container finished at: 2026-04-27T14:43:09Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777301028471, 353839]', 'time': '2026-04-27T14:43:48.471Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#108,Executor task launch worker for task 1.0 in stage 54.2 (TID 8449),5,main]'}, {'id': '[1777301029927, 807916]', 'time': '2026-04-27T14:43:49.927Z', 'level': 'error', 'message': 'Lost executor 13 on 10.42.35.71: \nThe executor with id 13 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: 
registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:39:49Z\n\t container finished at: 2026-04-27T14:43:48Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777301075982, 908426]', 'time': '2026-04-27T14:44:35.982Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#169,Executor task launch worker for task 1.1 in stage 54.2 (TID 8451),5,main]'}, {'id': '[1777301078126, 988507]', 'time': '2026-04-27T14:44:38.126Z', 'level': 'error', 'message': 'Lost executor 1 on 10.42.84.8: \nThe executor with id 1 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:36:22Z\n\t container finished at: 2026-04-27T14:44:36Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777301078253, 40309]', 'time': '2026-04-27T14:44:38.253Z', 'level': 'error', 'message': 'Missing an output location for shuffle 25 partition 0'}, {'id': '[1777301078267, 928240]', 'time': '2026-04-27T14:44:38.267Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 25 partition 0\n\tat org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat 
org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, 
{'id': '[1777301107481, 562754]', 'time': '2026-04-27T14:45:07.481Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#188,Executor task launch worker for task 1.2 in stage 54.2 (TID 8452),5,main]'}, {'id': '[1777301111306, 585783]', 'time': '2026-04-27T14:45:11.306Z', 'level': 'error', 'message': 'Lost executor 6 on 10.42.15.135: \nThe executor with id 6 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:37:38Z\n\t container finished at: 2026-04-27T14:45:08Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777301151714, 455668]', 'time': '2026-04-27T14:45:51.714Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#103,Executor task launch worker for task 2.0 in stage 54.3 (TID 8516),5,main]'}, {'id': '[1777301155623, 306682]', 'time': '2026-04-27T14:45:55.623Z', 'level': 'error', 'message': 'Lost executor 14 on 10.42.208.168: \nThe executor with id 14 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260330-591\n\t container state: terminated\n\t container started at: 2026-04-27T14:41:31Z\n\t container finished at: 2026-04-27T14:45:52Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1777301280979, 930071]', 'time': '2026-04-27T14:48:00.979Z', 'level': 'error', 'message': 'Exception while beginning fetch of 2 outstanding blocks'}, {'id': '[1777301280995, 495542]', 'time': '2026-04-27T14:48:00.995Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.15.135:42085'}, {'id': '[1777301280997, 638471]', 'time': '2026-04-27T14:48:00.997Z', 'level': 'error', 'message': 
'Failed to get block(s) from 10.42.15.135:42085'}, {'id': '[1777301281355, 803501]', 'time': '2026-04-27T14:48:01.355Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.FetchFailedException\n\tat org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:439)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1253)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:983)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:87)\n\tat org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)\n\tat scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:594)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:608)\n\tat scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)\n\tat org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)\n\tat org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)\n\tat org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:145)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: org.apache.spark.ExecutorDeadException: [INTERNAL_ERROR_NETWORK] The relative remote executor(Id: 6), which maintains the block data to fetch is dead. SQLSTATE: XX000\n\tat org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:146)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:181)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:160)\n\tat org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:157)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:376)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.send$1(ShuffleBlockFetcherIterator.scala:1223)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:1215)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:721)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:195)\n\tat 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:73)\n\t... 18 more\n'}, {'id': '[1777301281361, 168828]', 'time': '2026-04-27T14:48:01.361Z', 'level': 'error', 'message': 'Stage error: Job aborted due to stage failure: ShuffleMapStage 54 (load_collection: read by input product) has failed the maximum allowable number of times: 4. Most recent failure reason:\norg.apache.spark.shuffle.FetchFailedException\n\tat org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:439)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1253)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:983)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:87)\n\tat org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)\n\tat scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:594)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:608)\n\tat scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)\n\tat org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)\n\tat org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)\n\tat org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:145)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: org.apache.spark.ExecutorDeadException: [INTERNAL_ERROR_NETWORK] The relative remote executor(Id: 6), which maintains the block data to fetch is dead. 
SQLSTATE: XX000\n\tat org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:146)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:181)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:160)\n\tat org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:157)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:376)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.send$1(ShuffleBlockFetcherIterator.scala:1223)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:1215)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:721)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:195)\n\tat org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:73)\n\t... 18 more\n'}, {'id': '[1777301282518, 52527]', 'time': '2026-04-27T14:48:02.518Z', 'level': 'error', 'message': 'OpenEO batch job failed: A part of your process graph failed multiple times. Simply try submitting again, or use batch job logs to find more detailed information in case of persistent failures. Increasing executor memory may help if the root cause is not clear from the logs.'}]
Full logs can be inspected in an openEO (web) editor or with `connection.job('j-26042714354145a88a343b3e01a60c5b').logs()`.
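The log snippet above is a list of dict entries with `id`, `time`, `level`, and `message` fields. When triaging a failure like this, it helps to pull out only the error-level entries in timestamp order. The sketch below shows one way to do that; in real use the entries would come from `connection.job('j-26042714354145a88a343b3e01a60c5b').logs()` on the openeo Python client (a live, authenticated call, so a small local sample shaped like the entries above is used here instead).

```python
# Filter error-level entries from openEO batch job logs, sorted by timestamp.
# In practice (not executed here), fetch the entries with:
#   import openeo
#   connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()
#   logs = connection.job("j-26042714354145a88a343b3e01a60c5b").logs()

def error_entries(logs):
    """Return only error-level log entries, oldest first."""
    return sorted(
        (entry for entry in logs if entry.get("level") == "error"),
        key=lambda entry: entry.get("time", ""),
    )

# Local sample mimicking the structure of the entries in this report.
sample_logs = [
    {"id": "b", "time": "2026-04-27T14:48:02.518Z", "level": "error",
     "message": "OpenEO batch job failed: A part of your process graph failed..."},
    {"id": "a", "time": "2026-04-27T14:48:01.355Z", "level": "error",
     "message": "Stage error: org.apache.spark.shuffle.FetchFailedException..."},
    {"id": "c", "time": "2026-04-27T14:40:00.000Z", "level": "info",
     "message": "Job started"},
]

for entry in error_entries(sample_logs):
    print(entry["time"], entry["message"][:50])
```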
------------------------------ Captured log call -------------------------------
INFO     conftest:conftest.py:145 Connecting to 'openeo.dataspace.copernicus.eu'
INFO     openeo.config:config.py:193 Loaded openEO client config from sources: []
INFO     conftest:conftest.py:158 Checking for auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED' to drive auth against url='openeo.dataspace.copernicus.eu'.
INFO     conftest:conftest.py:162 Extracted provider_id='CDSE' client_id='openeo-apex-benchmarks-service-account' from auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED'
INFO     openeo.rest.connection:connection.py:302 Found OIDC providers: ['CDSE']
INFO     openeo.rest.auth.oidc:oidc.py:410 Doing 'client_credentials' token request 'https://identity.dataspace.copernicus.eu/auth/realms/CDSE/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-apex-benchmarks-service-account')
INFO     openeo.rest.connection:connection.py:401 Obtained tokens: ['token_type', 'access_token', 'expires_in', 'id_token', 'scope']
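The final backend error suggests two remedies: simply resubmit, or increase executor memory if the failure persists. On openEO backends that expose Spark resource tuning, memory is typically passed via a `job_options` dict at job creation. A minimal sketch, assuming the backend accepts an `executor-memory` key (backend-specific; verify against the Copernicus Data Space Ecosystem documentation before relying on it):

```python
# Build a job_options dict with raised executor memory for a resubmission.
# The "executor-memory" key is an assumption based on the backend's error
# hint; supported keys and defaults vary per openEO backend.

def with_more_executor_memory(job_options=None, memory="4G"):
    """Return a copy of job_options with executor memory set (hypothetical key)."""
    opts = dict(job_options or {})
    opts["executor-memory"] = memory
    return opts

# In real use (not executed here), pass the dict when creating the batch job:
#   job = cube.execute_batch(job_options=with_more_executor_memory())
print(with_more_executor_memory({"driver-memory": "2G"}))
```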
