bwstefano/wp-datawrapper-embed-normalizer

Datawrapper Normalize Script

datawrapper_normalize.php is a WP-CLI helper script for auditing and repairing broken or outdated Datawrapper embeds stored in WordPress post and page content.

It was created to support content cleanup in cases where Datawrapper embeds were saved in inconsistent formats, where legacy markup still points to old iframe/script patterns, or where published assets need to be restored before the embed can be replaced safely.

The script is WordPress-aware and WPML-aware:

  • It runs through wp eval-file.
  • It scans only post and page entries.
  • It preserves localized permalinks when WPML is active.
  • It can work in inventory mode or normalization mode.

WPML compatibility matters because the script writes a CSV report with the permalink for each inspected entry. When WPML is active, the script asks WPML for the language assigned to the current post/page and then resolves the permalink in that language. This means the report points editors to the correct localized URL instead of always falling back to the default-language permalink.

What The Script Does

For each WordPress post or page that contains a datawrapper.dwcdn.net reference, the script:

  1. Extracts candidate Datawrapper chart IDs from the stored HTML.
  2. Requests fresh embed code from the Datawrapper API.
  3. Checks whether the public embed asset is available.
  4. If the asset is missing, tries to republish the chart through the API.
  5. If the API returns 404, tries a public CDN fallback embed.
  6. Replaces matching legacy embed markup in the post or page content.
  7. Removes redundant full.png fallback images when a working embed has been restored.
  8. Writes a CSV report with the result for every chart it inspects.
  9. Creates an HTML backup before updating any content in normalize mode.
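Step 1 above can be sketched in shell. This is an illustration only, not the script's own PHP logic: the function name `extract_chart_ids` is hypothetical, and it assumes a chart ID is the five-character alphanumeric path segment that directly follows `datawrapper.dwcdn.net/`.

```shell
# Hypothetical helper (not part of datawrapper_normalize.php).
# Reads HTML on stdin and prints each unique candidate chart ID on its own line.
# Assumption: an ID is exactly five letters or digits right after the CDN host.
extract_chart_ids() {
  grep -oE 'datawrapper\.dwcdn\.net/[A-Za-z0-9]{5}' |
    sed -e 's#.*/##' |
    sort -u
}

# Example: feed it exported post content.
# extract_chart_ids < post-content.html
```

Deduplicating with `sort -u` reflects the idea that one chart may be referenced several times in the same entry, for example by both an iframe and a script tag.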

Why It Was Created

This script is meant for editorial maintenance and migration cleanup.

Typical problems it helps resolve:

  • Old Datawrapper embeds saved as legacy iframe markup.
  • Broken or malformed embed script URLs in existing content.
  • Posts or pages where the public Datawrapper asset is missing even though the chart still exists.
  • Mixed embed states where a chart embed and a redundant static full.png image are both present.

Requirements

  • WordPress with WP-CLI available.
  • PHP with cURL enabled.
  • Access to the WordPress database through the normal WP runtime.
  • A Datawrapper API token in the DATAWRAPPER_API_TOKEN environment variable.

The script can still scan content without a token, but it will not be able to fetch official embed codes or republish charts, so the report will mostly document missing credentials.
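A quick preflight check along these lines may help confirm the requirements before a run. It is a sketch, not part of the script: `dw_preflight` is a hypothetical name, and it assumes the `php` binary on `PATH` is the same one WP-CLI uses.

```shell
# Hypothetical preflight check (not part of datawrapper_normalize.php).
# Prints one line per missing requirement and returns non-zero if any is missing.
dw_preflight() {
  dw_missing=0
  if ! command -v wp >/dev/null 2>&1; then
    echo "WP-CLI (wp) was not found on PATH"
    dw_missing=1
  fi
  if ! command -v php >/dev/null 2>&1; then
    echo "php was not found on PATH"
    dw_missing=1
  elif ! php -m 2>/dev/null | grep -qix 'curl'; then
    echo "the PHP cURL extension does not appear to be loaded"
    dw_missing=1
  fi
  if [ -z "${DATAWRAPPER_API_TOKEN:-}" ]; then
    echo "DATAWRAPPER_API_TOKEN is not set; only a token-less scan is possible"
    dw_missing=1
  fi
  return "$dw_missing"
}
```

Calling `dw_preflight && wp eval-file datawrapper_normalize.php -- inventory` would then start the audit only when nothing is reported missing.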

WPML Compatibility

The script is compatible with sites that use WPML.

It does not translate or duplicate content by itself, but it is aware of the language context of each post/page when generating report output:

  • It uses the wpml_post_language_details filter to determine the language of the current post/page.
  • It uses the wpml_permalink filter to resolve the permalink in that same language.
  • If WPML is not active, it falls back to the normal WordPress permalink.

This is especially useful during editorial cleanup because the CSV report can be used as an action list, and editors need links to the exact localized entry they are reviewing.

Datawrapper API Endpoints Used

The script relies on the official Datawrapper API for two operations:

  • GET /v3/charts/{id}/embed-codes: retrieves the available embed code for a chart.
  • POST /v3/charts/{id}/publish: republishes a chart when its public embed asset appears to be missing.

It also validates the public embed asset on the Datawrapper CDN before deciding whether to use the returned embed code as-is, republish the chart, or fall back to a public script embed.
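For manual spot checks, the same two requests can be reproduced with curl. The wrappers below are a sketch rather than the script's implementation: the function names are hypothetical, and the base URL `https://api.datawrapper.de` and the bearer-token header are assumptions taken from Datawrapper's public API conventions, not from this README.

```shell
# Hypothetical curl wrappers (not part of datawrapper_normalize.php).
# Assumption: the API is served from this base URL.
DW_API_BASE="https://api.datawrapper.de"

# Build the two endpoint URLs named above for a given chart ID.
dw_embed_codes_url() { printf '%s/v3/charts/%s/embed-codes\n' "$DW_API_BASE" "$1"; }
dw_publish_url()     { printf '%s/v3/charts/%s/publish\n' "$DW_API_BASE" "$1"; }

# GET the embed codes for a chart; prints the raw response body.
dw_get_embed_codes() {
  curl -sS -H "Authorization: Bearer ${DATAWRAPPER_API_TOKEN}" "$(dw_embed_codes_url "$1")"
}

# POST a republish request for a chart; prints the raw response body.
dw_publish_chart() {
  curl -sS -X POST -H "Authorization: Bearer ${DATAWRAPPER_API_TOKEN}" "$(dw_publish_url "$1")"
}

# Print only the HTTP status code for any URL, e.g. a public embed asset.
dw_http_status() {
  curl -sS -o /dev/null -w '%{http_code}\n' "$1"
}
```

`dw_http_status` takes a full URL on purpose, because the exact asset path the script probes on the CDN is not documented here.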

Both endpoints are documented in the official Datawrapper API reference.

Token Scopes

Because the script fetches embed codes and may republish charts, the token should have enough permissions for both operations.

Required scopes for full script functionality

If you want the script to run with all recovery features enabled, including republishing charts when public assets are missing, the token should include all of these scopes:

  • chart:read is needed to request embed codes.
  • chart:write is needed to publish the chart again.
  • theme:read is required by Datawrapper's publish endpoint.
  • visualization:read is required by Datawrapper's publish endpoint.

According to the official Datawrapper API reference:

  • GET /v3/charts/{id}/embed-codes requires chart:read.
  • POST /v3/charts/{id}/publish requires chart:read, chart:write, theme:read, and visualization:read.

Minimum useful scope for inventory-only runs

If you only want to audit content in inventory mode and do not care about republishing missing assets, chart:read is the minimum useful scope.

However, if a chart needs republishing during a later normalize run and your token does not include the publish-related scopes, the script will report the failure and skip updating that embed.

How to create the token

Datawrapper explains token creation in its official getting started guide, which points to the token management page in the Datawrapper app. In short:

  1. Open the API token page in Datawrapper.
  2. Click Create new Access Token.
  3. Give the token a name.
  4. Select the scopes listed above.
  5. Generate the token and store it safely, because Datawrapper will not show it again after the page is refreshed.

How To Run

Run the script with wp eval-file from the WordPress environment:

wp eval-file datawrapper_normalize.php -- inventory

The -- separator is important because it tells WP-CLI to pass the remaining arguments to the script.

Modes

The script supports two modes:

  • inventory: read-only audit. It validates charts and writes the CSV report, but does not update content.
  • normalize: attempts to replace stored embed markup with a fresh working embed code.

Arguments

The script accepts up to five arguments after --:

<mode> <limit> <dry-run> <start-after-id> <batch-size>

  • mode: inventory or normalize.
  • limit: maximum number of posts/pages to process. Use 0 for no limit.
  • dry-run: 1 means simulate updates without writing to WordPress. 0 means write changes to the database.
  • start-after-id: optional. Process only posts/pages with an ID greater than this value. Use 0 to start from the beginning. This is useful when a long production run is interrupted and you want to resume after the last completed post.
  • batch-size: optional. Number of candidate posts/pages to fetch per database batch. Smaller batches reduce memory pressure during long runs. The default is 20.
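Because the arguments are positional, a small wrapper that validates them before calling WP-CLI can prevent mistakes such as swapping limit and dry-run. The sketch below is hypothetical and not shipped with the script. It prints the command instead of running it, and its dry-run default of 1 is the wrapper's own cautious choice, since this README does not state the script's default.

```shell
# Hypothetical wrapper (not part of datawrapper_normalize.php).
# Validates the five positional arguments and prints the resulting command line.
dw_normalize_cmd() {
  dw_mode="${1:-inventory}"
  dw_limit="${2:-0}"
  dw_dry_run="${3:-1}"      # wrapper's own safe default, not necessarily the script's
  dw_start_after="${4:-0}"
  dw_batch="${5:-20}"       # documented default batch size

  case "$dw_mode" in
    inventory|normalize) ;;
    *) echo "mode must be inventory or normalize" >&2; return 1 ;;
  esac
  case "$dw_dry_run" in
    0|1) ;;
    *) echo "dry-run must be 0 or 1" >&2; return 1 ;;
  esac
  for dw_n in "$dw_limit" "$dw_start_after" "$dw_batch"; do
    case "$dw_n" in
      ''|*[!0-9]*) echo "limit, start-after-id and batch-size must be non-negative integers" >&2; return 1 ;;
    esac
  done

  printf 'wp eval-file datawrapper_normalize.php -- %s %s %s %s %s\n' \
    "$dw_mode" "$dw_limit" "$dw_dry_run" "$dw_start_after" "$dw_batch"
}
```

For example, `dw_normalize_cmd normalize 5` prints a dry-run command for five entries, which can be reviewed and then executed by hand.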

Usage Examples

1. Audit all matching posts and pages

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- inventory

2. Audit only the first 20 matching entries

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- inventory 20

3. Test normalization on five entries without writing changes

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- normalize 5 1

4. Normalize all matching entries and write updates

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- normalize 0 0

5. Resume a normalization run after a known post ID

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- normalize 0 0 167685

This skips everything up to and including post 167685 and continues with higher IDs only.

6. Resume with smaller batches for safer long production runs

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file datawrapper_normalize.php -- normalize 0 0 167685 10

This is a good recovery pattern after a fatal memory error because it both resumes the run and lowers the amount of content loaded per batch.

7. Run from another working directory by passing the absolute file path

export DATAWRAPPER_API_TOKEN="YOUR_TOKEN_HERE"
wp eval-file /Users/stefano/Datawrapper/datawrapper_normalize.php -- inventory

Output Files

The script generates two kinds of output in the current working directory:

  • datawrapper_normalize_report.csv: a per-chart report describing what happened for each inspected embed.
  • dw-normalize-backups/: a directory of HTML backups, one file per updated post/page, written before any database update.

CSV Columns

The CSV report contains:

  • post_id
  • post_type
  • post_status
  • post_title
  • permalink
  • chart_id
  • api_status
  • asset_status
  • republish_status
  • fallback_source
  • action
  • reason
  • script_src

These fields make it easier to review what was changed, what failed, and which charts may need manual follow-up.

Replacement Logic

The script is designed to replace several common stored embed variants found in WordPress content, including:

  • Datawrapper iframe embeds
  • Script-based embeds
  • Broken legacy script URL variants
  • Gutenberg wrapper blocks that contain Datawrapper embeds

For each detected chart, it replaces only the first matching block in a given pass, which keeps replacements predictable and makes the CSV easier to interpret.

Safety Features

The script includes several safeguards:

  • inventory mode is read-only.
  • normalize mode supports dry-run operation.
  • Backups are written before content updates.
  • Each chart is processed independently, so one failure does not stop the whole post/page.
  • Candidate posts/pages are processed in batches instead of being loaded all at once.
  • Long runs can be resumed with start-after-id.
  • On fatal shutdown, the script reports the current post and the last fully processed post.
  • The script logs per-chart outcomes to CSV for review after the run.

Suggested Workflow

For production use, a safe workflow is:

  1. Run inventory first.
  2. Review datawrapper_normalize_report.csv.
  3. Run normalize with a small limit and dry-run=1.
  4. Inspect the intended changes.
  5. Run normalize again with dry-run=0 once the results look correct.
  6. If a long production run is interrupted, resume it with start-after-id set to the last fully processed post ID and consider lowering batch-size.

Notes

  • The script only scans WordPress post and page content.
  • It does not scan custom post types.
  • It expects Datawrapper embeds to exist in post_content.
  • If WPML is not active, permalinks are handled normally.
  • If a chart cannot be resolved through the API and the public CDN asset is unavailable, the script will report the issue and skip updating that embed.

🤖 Created with Codex.
