[TPTP Benchmark] master — 2026-05-25 #9622
Closed
Replies: 1 comment
-
|
This discussion was automatically closed because it expired on 2026-06-08T17:17:11.417Z.
|
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Date: 2026-05-25
Branch: master
Commit:
8c989f8Workflow Run: 26410384849
TPTP version: v9.2.1
Problems benchmarked: 500 (random sample, seed=42, timeout 5 s per problem)
Summary
Expected Status Distribution
Actual Verdict Distribution
None detected.
💥 Crashes
SWW989_1.pexpected=TheoremStatus Mismatches (Theorem ↔ Unsatisfiable)
None detected.
View all Timeouts (157 problems where Z3 exceeded the 5-second limit)
View full per-problem results table (all 500)
Recommendations
Timeouts (157/500 = 31.4%): Z3 timed out on a significant fraction of problems, particularly in domains GRP (group theory), LCL (logic calculi), SET (set theory), KLE (Kleene algebra), NUM (number theory), SWC/SWV/SWW (software verification), SEU/SEV (set/type theory). These are challenging for resolution-style or SMT reasoning within 5 seconds. Consider investigating whether Z3's TPTP front-end uses complete search strategies that could be bounded earlier, or whether domain-specific tactics could be applied.
Crashes (1):
SWW989_1.p(expected Theorem) caused a crash with no SZS output and a non-zero exit code. This should be investigated and filed as a bug — crashes in the TPTP front-end are unexpected regardless of problem difficulty.Soundness: No soundness errors detected. All conclusive answers that could be cross-checked were consistent with expected statuses. This is a positive result.
GaveUp (15): Z3 returned
GaveUpfor 15 problems within the time budget. These are cases where Z3 determined it could not solve the problem with its current strategies. This is expected behavior for problems outside Z3's supported fragment.Correct rate: Of 468 problems with a conclusive expected status, Z3 solved 297 correctly (63.5%). The remaining 36.5% were mostly timeouts on hard problems.
Domain coverage: Problems solved quickly (< 0.1 s) span AGT, CSR, DAT^, GRP (simple), KRS, MED, MGT, NUM (simple) — these are well within Z3's core competency. Problems consistently timing out tend to involve large axiom sets or require extended search.
Beta Was this translation helpful? Give feedback.
All reactions