This document tracks non-obvious behaviors, common pitfalls, and architectural "gotchas" in the libmagic-rs codebase to assist future maintainers and contributors.
src/error.rs is compiled by both the library and build.rs. It cannot reference lib-only types like crate::io::IoError.
- Workaround:
FileError(String)wraps structured I/O errors as strings. - Rule: Use
ParseError::IoErrorfor I/O errors in parser code, notParseError::invalid_syntax.
Serialization functions live in src/parser/codegen.rs, shared by both build.rs (via #[path] include) and src/build_helpers.rs (via crate::parser::codegen). Only format_parse_error remains duplicated because ParseError has different import paths in each context.
generate_builtin_rules() emits a use crate::parser::ast::{...} import line in generated code. When adding new AST types (e.g., PStringLengthWidth), this import must be updated or the build pipeline breaks on any rule that uses the new type.
- Symptom: Build fails with
E0433("cannot find type") in the generatedbuiltin_rules.rs. - Fix: Update the import in
generate_builtin_rules()insrc/parser/codegen.rs.
Adding a variant to TypeKind requires updating exhaustive matches in 10+ files: ast, grammar, types, codegen, strength, property_tests, evaluator/types/mod.rs (read_typed_value, coerce_value_to_type, bytes_consumed -- variable-width variants must be matched explicitly or relative-offset anchors will silently corrupt), output/mod.rs (2 length matches), output/json.rs (format_value_as_hex), and grammar/tests.rs (stale assertions). Note: coerce_value_to_type, output matches, and bytes_consumed use catch-all _ => so they compile without changes but may need semantic updates -- bytes_consumed will fire a debug_assert in test/dev builds for unhandled variable-width variants.
Adding a variant to Operator requires updating: ast, grammar, codegen, strength, property_tests, evaluator/operators/.
Adding a variant to Value requires updating: ast, codegen, strength, property_tests, output/mod.rs (2 length matches), output/json.rs (format_value_as_hex), evaluator/operators/comparison.rs (compare_values). Bitwise/equality operators use catch-all _ => so they are safe.
- Note:
Valueno longer derivesEq(removed whenValue::Float(f64)was added) -- no production code depends onValue: Eq.
Type keyword parsing lives in src/parser/types.rs (parse_type_keyword + type_keyword_to_kind); the grammar module in src/parser/grammar/mod.rs orchestrates and handles type-specific suffixes (e.g., pstring /B, /H, /L).
- Gotcha:
parse_type_keywordonly consumes the base keyword (e.g.,"pstring"), leaving suffixes for the grammar layer. - Gotcha:
parse_type_and_operatorconsumes trailing whitespace after parsing type+suffix -- test expectations for remaining input must account for this.
parse_number returns i64 and is shared by offset + value parsing. Never widen its return type. Use the separate parse_unsigned_number (u64) path only in parse_numeric_value.
The nom tuple combinator is deprecated. Use bare tuple syntax (a, b, c) directly as Parser is implemented for tuples.
type_keyword_to_kind has #[allow(clippy::too_many_lines)] because it exceeds 100 lines with all date keywords.
parse_number handles - signs but not +. When parsing syntax like +4 (e.g., indirect offset adjustments), consume the + character manually before calling parse_number.
parse_value() does not accept bare unquoted strings. String values in magic file rules must be quoted (e.g., string "MZ" not string MZ). Integration tests writing magic files must use r#"0 string "MZ" description"# format.
Lowercase pointer specifiers (.s, .l, .q) map to little-endian, not native endian. Uppercase (.S, .L, .Q) map to big-endian. All numeric pointer types are signed by default (per S6.3). The adjustment is parsed after the closing paren: (base.type)+adj, not (base.type+adj).
OffsetSpec::Relative(N) resolves against EvaluationContext::last_match_end(), which is updated after every successful match in evaluate_rules and is never saved/restored across child recursion. This is intentional and matches GNU file: a sibling rule sees the anchor wherever the deepest descendant of the previous sibling left it. The anchor is global/shared rather than stack-scoped, but its numeric value is not guaranteed to be non-decreasing -- a successful Relative(-N) rule (or any later rule that matches at a lower absolute position) can move it earlier. Do not wrap recursion in a save/restore pair "for safety" -- it would silently break sibling-after-nested chains. The recursion-depth pattern in the same loop is save/restore, and the asymmetry is correct.
The load-bearing invariant is that the anchor is updated before recursing into children (so children and their followers see the new anchor). The current code also happens to set the anchor before matches.push(...), but the push-ordering relative to set_last_match_end is incidental for anchor correctness -- only the ordering before the evaluate_rules recursion call matters. (Future code that reads the anchor while iterating matches would make this ordering load-bearing, so do not "optimize" the order without checking call sites first.) bytes_consumed() (in evaluator/types/mod.rs) is the source of truth for advance distance; for variable-width types it re-derives consumption from the buffer rather than trusting Value::String.len() (which can drift from the original byte length via from_utf8_lossy). Pascal-string consumption is also clamped against the remaining buffer to prevent attacker-controlled length prefixes from poisoning the anchor to usize::MAX.
evaluator::engine is private. Integration tests must import libmagic_rs::evaluator::evaluate_rules (re-exported), not evaluator::engine::evaluate_rules.
evaluator::MatchResult was renamed to evaluator::RuleMatch (issue #60). output::MatchResult is the public-facing type. Do not confuse the two.
Prefer pub mod over pub use module::* for submodules -- glob re-exports expand public API surface unintentionally.
There are two distinct EvaluationResult and EvaluationMetadata types with overlapping names but different fields:
src/lib.rs: the top-level library API types.EvaluationResultcarriesdescription,mime_type,confidence,matches: Vec<evaluator::RuleMatch>,metadata.EvaluationMetadatacarriesfile_size: u64,evaluation_time_ms: f64,rules_evaluated: usize,magic_file: Option<PathBuf>,timed_out: bool.src/output/mod.rs: the output-formatting types.EvaluationResultcarriesfilename: PathBuf,matches: Vec<MatchResult>,metadata,error: Option<String>.EvaluationMetadatacarriesfile_size: u64,evaluation_time_ms: f64,rules_evaluated: u32,rules_matched: u32.
The two pairs are not interchangeable -- notice that rules_evaluated is usize in one and u32 in the other, and timed_out/magic_file exist only on the lib.rs variant while rules_matched/filename/error exist only on the output variant. A one-way conversion exists: output::EvaluationResult::from_library_result(...) converts from the library type to the output type; there is no reverse conversion. When importing, qualify the path (libmagic_rs::EvaluationResult vs libmagic_rs::output::EvaluationResult) to make intent explicit. Do not "unify" them without coordinated API changes across the CLI, integration tests, and any downstream consumers.
There is no From<u32> for usize (fails on 32-bit targets). Use as usize for widening conversions from u32.
Buffer offset calculations must use checked_add to prevent overflow panics from malicious/malformed input. Always test with usize::MAX -- otherwise regressions go undetected.
inf - inf = NaN, so (a - b).abs() <= epsilon fails for infinities. Handle is_infinite() before the epsilon check. Use #[allow(clippy::float_cmp)] on the exact infinity comparison.
- Note:
apply_equal/apply_not_equaluse epsilon-aware comparison forValue::Float; ordering operators (<,>,<=,>=) use directpartial_cmp-- these are deliberately different semantics.
Uppercase pstring suffix letters indicate big-endian byte order, lowercase indicate little-endian: /H (2-byte BE), /h (2-byte LE), /L (4-byte BE), /l (4-byte LE). The /J flag means the stored length includes the length field itself (JPEG convention). Flags are combinable (e.g., pstring/HJ). Test data must use the correct byte order for the variant (e.g., \x00\x05 for TwoByteBE length 5, \x05\x00 for TwoByteLE length 5).
Middle-endian date keywords are NOT supported. They were removed until real middle-endian decoding is implemented end-to-end.
libmagic types are signed by default (byte, short, long, quad). Unsigned variants use u prefix (ubyte, ushort, ulong, uquad, etc.).
Doctests compile as external crates. Use libmagic_rs:: imports, never crate::.
Table-driven tests with byte-array buffers of different sizes need &[(&[u8], ...)] slice type annotation. Bare arrays can't hold heterogeneous-length byte literals in a [T; N].
cargo test --doc verifies doc examples. Ensure example strings don't accidentally match multiple patterns.
Adding fields to the clap Args struct requires updating ALL manual Args { ... } constructions in unit tests (search for Args { in src/main.rs). process_file and run_analysis are called from unit tests with mocked stdin -- signature changes require updating all test call sites.
Reserved-for-future enum variants (e.g., TypeReadError::UnsupportedType) need explicit doc comments explaining intent -- otherwise every reviewer flags them as dead code.
Enum variants with a shared suffix (e.g., OneByte, TwoByte, FourByte) trigger clippy::enum_variant_names. Add #[allow(clippy::enum_variant_names)] when renaming would hurt readability.
unsafe_code = "forbid" in Cargo.toml workspace lints cannot be overridden with #[allow(unsafe_code)]. Use vetted crates (e.g., chrono for timezone) instead of libc FFI or subprocess calls.
Pre-commit hook runs cargo fmt. First commit attempt may fail if fmt modifies files -- just re-stage and retry.
On error return paths, use let _ = cleanup() not cleanup()? -- the ? masks the original error with cleanup failures. Bind original errors with e @ pattern instead of reconstructing after state changes (avoids data loss).
- Use
ParseError::IoErrorfor I/O errors in parser code, notParseError::invalid_syntax - Use
LibmagicError::ConfigErrorfor config validation, notParseError::invalid_syntax
When extracting public functions to new modules, verify # Examples, # Arguments, # Returns doc sections are preserved -- they silently disappear.
When refactoring docs, verify # Returns and # Security sections match actual behavior. For example, from_utf8_lossy never errors but docs may claim it does.
All public enum variants need # Examples rustdoc sections. Clippy enforces this.
read_date/read_qdate return Value::Uint(raw_secs) for correct numeric comparisons. Use format_timestamp_value(secs, utc) at output time for GNU file-compatible display. There is no special date coercion in coerce_value_to_type.
chrono crate (no default features, std + clock) is a vetted dependency for timezone offset computation in evaluator/types/date.rs. Do not use libc FFI or subprocess calls (see S8.2).
Repository is EvilBit-Labs/libmagic-rs. NEVER guess repo/org names -- always run git remote -v to verify.
All tags and commits MUST be signed -- use git tag -s and git commit -s -S. If signing fails, STOP and troubleshoot. Never push unsigned tags/commits. Never delete or force-push tags without explicit user permission.
.github/workflows/release.yml is auto-generated by cargo-dist -- never modify manually.
EvaluationConfig::default() sets timeout_ms: None, meaning library consumers who construct a config via Default::default() get unbounded evaluation time. On crafted or adversarial input, rules with expensive operations (deep nesting, large buffers, pathological backtracking in future regex/search types) can run indefinitely and expose callers to a denial-of-service vector.
- Rule: Library consumers embedding libmagic-rs in services or untrusted-input pipelines should not use
EvaluationConfig::default(). UseEvaluationConfig::performance()(which setstimeout_ms: Some(1000)) as the safe preset, or construct a config explicitly with a non-Nonetimeout sized for your workload. - Validation:
timeout_msis clamped toMAX_SAFE_TIMEOUT_MS(5 minutes) by config validation and must be> 0if specified -- see the validation logic insrc/config.rs. - Note:
Defaultcannot be changed to set a timeout without breaking API expectations of callers who deliberately want no timeout (e.g., CLI one-shot invocations). The gotcha is that the unsafe default is the ergonomic choice; document the tradeoff prominently in any new consumer-facing docs.