Hourly GitHub sync reprocesses all mapping files instead of only changed ones #47

@Kartiiyer12

Description:

The scheduled sync job (TokenMetadataSyncCronJob, every 60 minutes) reprocesses every mapping file in the repository on each cycle, even though the git pull --rebase that precedes it fetches only the new changes.

Current behavior:

  1. GitService.cloneCardanoTokenRegistryGitRepository() — does git pull --rebase (incremental, efficient)
  2. TokenMetadataSyncService.synchronizeDatabase() lines 59-90 — calls mappings.listFiles() and iterates over all files
  3. For each file: parses the JSON mapping, runs git log -n 1 to get author/timestamp, then calls tokenMetadataRepository.save()
  4. No commit hash or timestamp is tracked between sync cycles
  5. No git diff is used to identify only changed files

Impact:

  • Every hour, syncStatus stays in the in-progress state for longer than necessary, so the API status is not ready for queries during that time.
  • The cardano-token-registry repo has thousands of mapping files. Every hour, all of them are re-parsed. Each one triggers a separate git log subprocess and is upserted into the database via JPA .save().
  • The git log call per file (GitService.getMappingDetails(), lines 99-101) spawns a shell process for every single file: O(n) subprocesses, where n is the total number of mapping files.
  • Database receives unnecessary UPDATE statements for unchanged records

Suggested improvement:

Track the last synced commit hash (e.g., in a database table or in memory) and, after pulling, run git diff <last-synced-commit>..HEAD --name-only to identify only the files that changed. Then process only those files.
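
A minimal sketch of the diff-based approach, to make the suggestion concrete. The class name ChangedMappingFiles, its method names, and the "mappings" directory prefix are hypothetical and are not taken from the existing code base. How the last synced commit hash is stored and read is left to the caller.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the existing code base.
final class ChangedMappingFiles {

    // Builds the git command that lists files changed since the last synced commit.
    // lastSyncedCommit is assumed to be the hash stored after the previous successful sync.
    public static List<String> diffCommand(String lastSyncedCommit) {
        return List.of("git", "diff", "--name-only", lastSyncedCommit + "..HEAD");
    }

    // Keeps only the .json files under the mappings directory (directory name is an assumption).
    public static List<String> filterMappings(List<String> changedPaths, String mappingsDir) {
        String prefix = mappingsDir.endsWith("/") ? mappingsDir : mappingsDir + "/";
        List<String> result = new ArrayList<>();
        for (String path : changedPaths) {
            String trimmed = path.trim();
            if (trimmed.startsWith(prefix) && trimmed.endsWith(".json")) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Runs a single git diff in the cloned repository and returns the changed mapping files.
    public static List<String> changedMappings(Path repoDir, String lastSyncedCommit, String mappingsDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(diffCommand(lastSyncedCommit))
                .directory(repoDir.toFile())
                .redirectErrorStream(true) // merge stderr into stdout so error text is captured
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("git diff failed with exit code " + exitCode + ": " + output);
        }
        List<String> paths = output.lines().collect(Collectors.toList());
        return filterMappings(paths, mappingsDir);
    }
}
```

With something like this, each cycle would run one git diff subprocess instead of one git log per mapping file, and only the returned paths would need to be parsed and saved. Two caveats: git diff --name-only also lists deleted files, so the caller would have to check whether each path still exists, and the very first sync (no stored hash yet) would still need a full pass.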
