williamhallpreston/musicprov

musicprov

Open-source AI music attribution & provenance layer

Embed verifiable, auditable provenance metadata into AI-generated audio — binding each output to its training sources, consent records, and generation parameters. Designed for EU AI Act compliance and the emerging music attribution ecosystem of 2026.



Why this exists

As of April 2026, over 20,000 AI-generated tracks are uploaded to Deezer daily, yet almost no open tooling exists to record how those tracks were created — which training data was used, who consented, and whether the output meets legal thresholds. The EU AI Act's full obligations for high-risk AI systems take effect in August 2026, mandating dataset disclosure and provenance tracking.

musicprov is the missing infrastructure layer. It answers "where did this music come from?" in a machine-readable, tamper-evident, embeddable format — with zero required dependencies.


Features

Capability                     Details
Provenance records             Structured attribution linking generated audio to training sources, influence weights, and consent status
Audio binding                  PCM-level SHA-256 hash binds the record to audio content; survives metadata re-tagging
EU AI Act compliance           Automatic scoring with human-readable failure reasons (90% consented-weight threshold)
WAV embedding                  Zero-dependency RIFF chunk (MPRO) embedding and extraction
MP3 / FLAC / OGG               Via optional mutagen backend
Chain verification             Cryptographic hash linking for remix/continuation chains
Tamper detection               hmac.compare_digest constant-time comparison prevents timing oracle attacks
CLI tool                       musicprov tag, show, verify, init
58 tests, 0 hard dependencies  Core library uses Python stdlib only

Tech stack

  • Language: Python 3.10+
  • Core deps: None (stdlib only: hashlib, hmac, json, struct, wave, uuid)
  • Optional: mutagen==1.47.0 for MP3/FLAC/OGG tag I/O
  • Dev tools: pytest, ruff, mypy, pip-audit
  • CI: GitHub Actions (test matrix × 3 Python versions, ruff lint, gitleaks, pip-audit)

Installation

# WAV support only — zero external dependencies
pip install musicprov

# WAV + MP3 + FLAC + OGG
pip install "musicprov[audio]"

# Development (includes linting and type-checking tools)
pip install "musicprov[dev]"

From source

git clone https://github.com/williamhallpreston/musicprov.git
cd musicprov
pip install -e ".[dev]"

Quick start

from musicprov import (
    ProvenanceRecord,
    SourceContribution,
    embed_provenance,
    extract_provenance,
    verify_chain,
)
from musicprov.provenance import ConsentStatus, GenerationParameters, LicenseType

# 1. Build a provenance record describing the AI generation
record = ProvenanceRecord(
    generator_tool="ACE-Step",
    generator_version="1.5",
    generation_params=GenerationParameters(
        model_name="ACE-Step",
        model_version="1.5",
        prompt="upbeat jazz piano with brushed drums, bossa nova feel",
        seed=42,
        duration_seconds=120.0,
    ),
    sources=[
        SourceContribution(
            source_id="USUM71402404",
            source_name="Take Five",
            artist="Dave Brubeck Quartet",
            influence_weight=0.6,
            consent_status=ConsentStatus.LICENSED,
            license_type=LicenseType.CUSTOM,
            license_url="https://your-registry.io/license/USUM71402404",
            dataset_name="JazzCorpus-2024",
            dataset_version="2.1.0",
            contribution_type="rhythmic",
            confidence=0.88,
        ),
        SourceContribution(
            source_id="CC_PIANO_001",
            source_name="Open Piano Recordings Vol. 3",
            artist="Community Contributors",
            influence_weight=0.4,
            consent_status=ConsentStatus.PUBLIC_DOMAIN,
            license_type=LicenseType.CC0,
        ),
    ],
)

# 2. Embed provenance into the generated audio file
embed_provenance("generated_track.wav", record)

# 3. Extract anywhere — no registry required
extracted = extract_provenance("generated_track.wav")
print(extracted.to_json())

# 4. Verify integrity and EU AI Act compliance
result = verify_chain(extracted, audio_path="generated_track.wav")
print(result)
# Verification: PASS  record=…
#   ✓ Hash integrity
#   ✓ EU AI Act compliance
#   ✓ Audio binding

CLI

# Generate a starter provenance template
musicprov init --tool "ACE-Step" --version "1.5" --output provenance.json

# Embed provenance into a WAV file (modifies in place)
musicprov tag output.wav --record provenance.json

# Embed and write to a new file
musicprov tag output.wav --record provenance.json --output tagged.wav

# Display embedded provenance
musicprov show tagged.wav

# Output raw JSON
musicprov show tagged.wav --json

# Verify chain integrity and EU AI Act compliance
musicprov verify tagged.wav
# exits 0 on PASS, 1 on FAIL

Data model

ProvenanceRecord

Field                Type                      Description
record_id            str                       UUID, auto-generated
generation_id        str                       UUID identifying this generation run
generator_tool       str                       Name of the AI tool (e.g. "ACE-Step")
generator_version    str                       Tool version
sources              list[SourceContribution]  Training data sources (max 256)
generation_params    GenerationParameters      Prompt, seed, checkpoint…
output_sha256        str                       sha256:<hex> of raw PCM audio data
output_fingerprint   str                       Spectral fingerprint (robust to re-encode)
parent_record_id     str | None                For remixes / continuations
parent_hash          str | None                sha256:<hex> of parent record
eu_ai_act_compliant  bool | None               Set by assess_compliance()
compliance_notes     str | None                Human-readable failure reason
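The parent_record_id and parent_hash fields are what make remix chains checkable. The sketch below shows one way such a link could be verified. It is not the library's implementation: it assumes records are plain dicts and that the parent hash is taken over a canonical JSON encoding (sorted keys, compact separators), and the helper names record_hash and link_is_valid are hypothetical.

```python
import hashlib
import hmac
import json


def record_hash(record: dict) -> str:
    """Hash a record over a canonical JSON encoding (an assumed convention)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def link_is_valid(child: dict, parent: dict) -> bool:
    """Check that the child names this parent and carries its current hash."""
    claimed = (child.get("parent_hash") or "").encode("utf-8")
    expected = record_hash(parent).encode("utf-8")
    same_id = child.get("parent_record_id") == parent.get("record_id")
    # Constant-time comparison of the two hash strings.
    return same_id and hmac.compare_digest(expected, claimed)


# Hypothetical records, reduced to the fields that matter for the link.
parent = {"record_id": "parent-1", "generator_tool": "ACE-Step"}
child = {
    "record_id": "child-1",
    "parent_record_id": parent["record_id"],
    "parent_hash": record_hash(parent),
}
```

Any later edit to the parent changes its hash, so the child's stored parent_hash no longer matches and the link fails.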

SourceContribution

Field              Type           Description
source_id          str            ISRC, catalogue ID, or UUID
source_name        str            Track or work title (max 4 096 chars)
artist             str            Rights-holder name
influence_weight   float          0.0–1.0; all sources must sum to ≤ 1.0
consent_status     ConsentStatus  licensed / opt_in / public_domain / unknown / refused
license_type       LicenseType    CC0-1.0 / CC-BY-4.0 / custom / …
dataset_name       str | None     Training dataset that included this source
contribution_type  str            One of: style, melodic, harmonic, rhythmic, timbral, lyrical, structural, other
confidence         float          Model's self-reported attribution confidence, 0.0–1.0

EU AI Act compliance logic

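A record passes compliance assessment if the combined influence_weight of its consented sources (licensed, opt_in, or public_domain) is at least 0.90. Because the check uses the absolute sum rather than a share of the total, any weight that is unknown, refused, or not attributed to a source counts against the record:

consented = {ConsentStatus.LICENSED, ConsentStatus.OPT_IN, ConsentStatus.PUBLIC_DOMAIN}
consented_weight = sum(s.influence_weight for s in sources if s.consent_status in consented)
compliant = consented_weight >= 0.90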

The threshold is configurable:

verify_chain(record, compliance_threshold=0.95)  # stricter
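To make the threshold rule concrete, here is a self-contained sketch of the same check with a failure reason attached. The Source dataclass and the assess function are stand-ins written for this example; they are not the library's SourceContribution or assess_compliance(), and the wording of the failure note is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ConsentStatus(Enum):
    LICENSED = "licensed"
    OPT_IN = "opt_in"
    PUBLIC_DOMAIN = "public_domain"
    UNKNOWN = "unknown"
    REFUSED = "refused"


# Statuses that count towards the consented weight.
CONSENTED = {
    ConsentStatus.LICENSED,
    ConsentStatus.OPT_IN,
    ConsentStatus.PUBLIC_DOMAIN,
}


@dataclass
class Source:
    """Stand-in for SourceContribution with only the two fields used here."""
    influence_weight: float
    consent_status: ConsentStatus


def assess(sources, threshold=0.90):
    """Return (compliant, note); note is None when the record passes."""
    consented = sum(
        s.influence_weight for s in sources if s.consent_status in CONSENTED
    )
    if consented >= threshold:
        return True, None
    return False, (
        f"consented weight {consented:.2f} is below threshold {threshold:.2f}"
    )
```

With the quick-start sources (0.6 licensed, 0.4 public domain) this returns (True, None). Marking the second source as unknown drops the consented weight to 0.60 and the record fails.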

Audio binding

output_sha256 is a SHA-256 hash of the raw PCM audio data only, not the full file bytes. This design means:

  • ✅ Embedding or updating provenance metadata does not invalidate the binding
  • ✅ Verifying the hash does not require re-parsing the container format
  • ❌ Re-encoding to a different bitrate does invalidate the binding — this is intentional, as the audio content changed
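The following sketch illustrates the PCM-only idea using the stdlib wave module. It is an illustration under assumptions, not the library's code: pcm_sha256 and make_wav are hypothetical helpers, and the appended chunk simply stands in for whatever a metadata writer would add.

```python
import hashlib
import io
import struct
import wave


def pcm_sha256(wav_file) -> str:
    """Hash only the PCM frames of a WAV (path or binary file object)."""
    digest = hashlib.sha256()
    with wave.open(wav_file, "rb") as wav:
        # Stream in blocks so large files are not loaded into memory at once.
        while block := wav.readframes(65536):
            digest.update(block)
    return "sha256:" + digest.hexdigest()


def make_wav(pcm: bytes) -> bytes:
    """Build a small mono, 16-bit, 8 kHz WAV in memory for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(8000)
        wav.writeframes(pcm)
    return buf.getvalue()


original = make_wav(b"\x01\x00\x02\x00" * 100)

# Simulate a metadata writer: append an extra chunk and patch the RIFF size.
body = original[8:] + b"MPRO" + struct.pack("<I", 2) + b"{}"
tagged = b"RIFF" + struct.pack("<I", len(body)) + body

# The file bytes differ, but the PCM hash does not.
same = pcm_sha256(io.BytesIO(original)) == pcm_sha256(io.BytesIO(tagged))
```

A hash over the full file bytes would change as soon as the extra chunk is written, which is why the binding is computed over the samples alone.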

WAV storage format

Provenance JSON is stored in a custom MPRO RIFF chunk appended to the WAV file:

RIFF <size> WAVE
  fmt  <size> <format data>
  data <size> <PCM samples>
  MPRO <size> <UTF-8 JSON>   ← musicprov chunk

The MPRO chunk is ignored by audio players and preserved by most DAW export workflows. It is human-readable and extractable by any RIFF-aware tool.


Security considerations

Input validation

  • All string fields are capped at 4 096 characters; JSON payloads at 32 MB; WAV MPRO chunks at 1 MB. These prevent resource exhaustion when processing untrusted audio files.
  • Unknown enum values in deserialised JSON fall back to UNKNOWN rather than raising exceptions.
  • The metadata dict on SourceContribution only accepts JSON primitive types (str, int, float, bool, None).
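The three rules above could be sketched as follows. This is an illustration only: the constant and function names are hypothetical, 32 MB is assumed to mean 32 × 1024 × 1024 bytes, and the library's actual exceptions and messages may differ.

```python
import json
from enum import Enum

MAX_STRING_CHARS = 4096
MAX_JSON_BYTES = 32 * 1024 * 1024  # assumed interpretation of "32 MB"
JSON_PRIMITIVES = (str, int, float, bool, type(None))


class ConsentStatus(Enum):
    LICENSED = "licensed"
    OPT_IN = "opt_in"
    PUBLIC_DOMAIN = "public_domain"
    UNKNOWN = "unknown"
    REFUSED = "refused"


def parse_consent(value) -> ConsentStatus:
    """Map an unrecognised value to UNKNOWN instead of raising."""
    try:
        return ConsentStatus(value)
    except ValueError:
        return ConsentStatus.UNKNOWN


def check_string(name: str, value: str) -> str:
    """Reject strings longer than the field cap."""
    if not isinstance(value, str) or len(value) > MAX_STRING_CHARS:
        raise ValueError(f"{name} must be a string of at most {MAX_STRING_CHARS} chars")
    return value


def check_metadata(metadata: dict) -> dict:
    """Allow only JSON primitive values, so no nested structures."""
    for key, value in metadata.items():
        if not isinstance(value, JSON_PRIMITIVES):
            raise TypeError(f"metadata[{key!r}] must be a JSON primitive")
    return metadata


def load_payload(raw: bytes) -> dict:
    """Check the size cap before parsing, then parse without an object_hook."""
    if len(raw) > MAX_JSON_BYTES:
        raise ValueError("JSON payload exceeds size limit")
    return json.loads(raw)
```

Checking the payload size before calling json.loads() matters: the cap only protects against resource exhaustion if it is enforced before the parser allocates anything.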

Hash comparison

  • All hash equality checks in verify_chain() use hmac.compare_digest() for constant-time comparison, preventing timing oracle attacks.
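A minimal sketch of such a comparison is shown below; hashes_match is a hypothetical helper, not a function exported by the library. Encoding both sides to bytes first is a deliberate choice here, because hmac.compare_digest() raises TypeError when given str arguments containing non-ASCII characters, which untrusted input could supply.

```python
import hashlib
import hmac


def hashes_match(expected: str, actual: str) -> bool:
    """Compare two 'sha256:<hex>' strings in constant time."""
    return hmac.compare_digest(expected.encode("utf-8"), actual.encode("utf-8"))


stored = "sha256:" + hashlib.sha256(b"pcm bytes").hexdigest()
recomputed = "sha256:" + hashlib.sha256(b"pcm bytes").hexdigest()
matches = hashes_match(stored, recomputed)
```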

No network I/O

  • The core library makes no network requests. Registry integration is optional and requires explicit configuration via environment variables.

No eval, pickle, or yaml.load

  • Deserialisation uses only json.loads() with no object_hook. No pickle, yaml.load, or dynamic code execution anywhere in the library.

Path safety

  • All file paths are resolved via Path.resolve() before use. is_file() is checked before reads and writes.
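In code, the check described above might look like the sketch below. The function name resolve_existing_file is hypothetical, and the exception type is an assumption rather than the library's documented behaviour.

```python
from pathlib import Path


def resolve_existing_file(path) -> Path:
    """Resolve symlinks and '..' segments, then require a regular file."""
    resolved = Path(path).resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"not a regular file: {resolved}")
    return resolved
```

Resolving first means the is_file() check and any later open() see the same absolute path, whatever relative segments the caller passed in.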

For a full threat model, see docs/security.md.

Known limitations

  • Truthfulness is not verified. A generator tool that lies about its training sources produces a cryptographically valid but factually incorrect record. External dataset audit is required to verify source attributions.
  • No signing in v0.1. ed25519 record signatures are on the roadmap. Until then, records are integrity-hashed but not cryptographically attested by the generator.
  • MP3/FLAC/OGG security depends on mutagen's parser hardening. If processing untrusted files in those formats, use the latest mutagen release.

Environment variables

Copy .env.example to .env and fill in values. The library operates in fully offline mode if no variables are set.

cp .env.example .env

Variable                        Default  Description
MUSICPROV_REGISTRY_URL          (unset)  URL of an external provenance registry
MUSICPROV_REGISTRY_API_KEY      (unset)  API key for registry authentication
MUSICPROV_SIGNING_KEY_PATH      (unset)  Path to ed25519 private key (future)
MUSICPROV_COMPLIANCE_THRESHOLD  0.90     Override the default compliance threshold
MUSICPROV_LOG_LEVEL             WARNING  Log verbosity (do not use DEBUG in production)
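For illustration, a threshold override like MUSICPROV_COMPLIANCE_THRESHOLD could be read and validated as in the sketch below. The function name and the decision to reject out-of-range values are assumptions; the library may handle malformed values differently.

```python
import os


def compliance_threshold(env=os.environ, default: float = 0.90) -> float:
    """Read the threshold override, falling back to the default when unset."""
    raw = env.get("MUSICPROV_COMPLIANCE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return default
    value = float(raw)  # raises ValueError on malformed input
    if not 0.0 <= value <= 1.0:
        raise ValueError("MUSICPROV_COMPLIANCE_THRESHOLD must be between 0.0 and 1.0")
    return value
```

Passing the environment as a parameter keeps the function easy to test with a plain dict.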

Testing

# Run the full test suite
pytest tests/ -v

# With coverage
pytest tests/ --cov=musicprov --cov=cli --cov-report=term-missing

# Run only security tests
pytest tests/ -v -k "Security or Bounds or Constant"

Current status: 58 tests, 0 failures.


Project structure

musicprov/
├── musicprov/           # Core library (zero external deps)
│   ├── __init__.py      # Public API surface
│   ├── provenance.py    # Data model: ProvenanceRecord, SourceContribution
│   ├── embed.py         # WAV/MP3/FLAC/OGG read-write
│   ├── verify.py        # Chain verification
│   └── fingerprint.py   # Spectral audio fingerprinting
├── cli/
│   └── __main__.py      # musicprov CLI entry point
├── tests/
│   ├── fixtures/        # Small test audio fixtures (< 100 KB each)
│   └── test_musicprov.py
├── examples/
│   └── quickstart.py    # End-to-end demonstration
├── dashboard/
│   └── index.html       # Browser-based provenance inspector
├── docs/
│   └── security.md      # Threat model and security architecture
├── .github/
│   └── workflows/ci.yml # CI: test, lint, secret-scan, dep-audit
├── .env.example         # Environment variable reference
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE              # Apache 2.0
├── SECURITY.md          # Vulnerability disclosure policy
└── pyproject.toml       # Build config, pinned dev deps, ruff/mypy config

Roadmap

  • ed25519 signing support for attested records
  • IPFS-pinned registry backend
  • Chromaprint integration for robust cross-encode fingerprinting
  • ACE-Step / YuE / SongGeneration integration adaptors
  • REST API server for centralised provenance registry
  • MusicXML / MIDI provenance for symbolic music outputs

Contributing

See CONTRIBUTING.md. All contributions are welcome — especially integrations with AI music generation tools and new audio format backends.

Please read our Code of Conduct before participating.


Security

To report a vulnerability, see SECURITY.md. Do not open a public issue for security bugs.


Licence

Apache 2.0 — see LICENSE.
