GeoParquet 2.0: allow <authority>:<code> CRS in addition to inline PROJJSON #284

Open: cholmes wants to merge 2 commits into main from crs-representation-2.0

cholmes commented Jun 1, 2026

Member

Permits writers to emit the Parquet crs property as either inline PROJJSON (the canonical, preferred form) or an <authority>:<code> string (e.g. EPSG:4326, OGC:CRS84). The other Parquet-core forms (srid:<id>, projjson:<key>) remain disallowed for writers, but readers SHOULD continue to tolerate them for interoperability with the broader Parquet geospatial ecosystem.

The GeoParquet column-metadata crs field is widened to accept the same <authority>:<code> form, and the two CRS fields (Parquet logical-type crs and GeoParquet column-metadata crs) MUST mirror each other exactly in both form and value.
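
As a rough illustration of the four forms a reader may encounter, here is a minimal sketch of a classifier. The function name, the return labels, and the case-sensitive prefix checks are assumptions made for this example, not spec text. One thing it makes visible: `srid:<id>` and `projjson:<key>` strings also satisfy the proposed `<authority>:<code>` pattern, so a reader probably has to check those prefixes before applying the pattern.

```python
import json
import re

# Pattern copied from the schema.json change proposed in this PR.
AUTHORITY_CODE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_.-]+$")


def classify_crs(value):
    """Return (form, writer_may_emit) for a crs value.

    `value` may be None (property absent), a parsed PROJJSON object (dict),
    or a string. Names and labels here are hypothetical.
    """
    if value is None:
        return ("absent", True)
    # Inline PROJJSON may arrive as a JSON-encoded string; try to parse it.
    if isinstance(value, str) and value.lstrip().startswith("{"):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return ("invalid", False)
    if isinstance(value, dict):
        return ("projjson", True)
    if not isinstance(value, str):
        return ("invalid", False)
    # The Parquet-core prefixes also match the authority:code regex,
    # so they are recognised first. Readers tolerate them; writers do not emit them.
    if value.startswith("srid:"):
        return ("srid", False)
    if value.startswith("projjson:"):
        return ("projjson_key", False)
    if AUTHORITY_CODE.match(value):
        return ("authority_code", True)
    return ("invalid", False)
```

A writer-side check could reject any value whose second element is `False`, while a reader could still route `srid` and `projjson_key` to a best-effort handler.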

Why

The current 2.0 draft requires inline PROJJSON for any non-default CRS, which forces every writer to ship a PROJJSON generator (pyproj or equivalent) even to emit EPSG:4326. That's a real burden for pure-SQL pipelines, lightweight browser-side writers, and high-file-count workloads. Apache Parquet's own geospatial spec lists <authority>:<code> as a first-class option alongside PROJJSON; constraining writers to those two forms (and rejecting the more ambiguous srid:<id> and indirect projjson:<key> forms) gives us alignment with Parquet core without exposing readers to four parsing paths.

PROJJSON remains the preferred writer form per spec language; <authority>:<code> is a permitted shortcut.

Summary of changes

  • format-specs/schema.json: crs field now accepts object (PROJJSON), string (matching ^[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_.-]+$), or null.
  • format-specs/geoparquet.md:
    • Column-metadata crs row updated (object|string|null).
    • §crs subsection rewritten to enumerate the two permitted forms and the writer preference.
    • §crs Parquet property subsection rewritten with explicit writer MUST / MUST NOT / SHOULD / MAY rules, the reader MUST + SHOULD-tolerate rules, and a mirroring table for the GeoParquet/Parquet crs correspondence.
    • §OGC:CRS84 details: added a sentence noting that authority-code OGC:CRS84 and EPSG:4326 are likewise equivalent to OGC:CRS84.
  • scripts/test_json_schema.py: new positive cases (crs_authority_code_*) for the four common authority codes; new negative cases (crs_string_*) for malformed strings; the previous crs_string invalid case (for "EPSG:4326") is now valid.

Mirroring table (from the new spec text)

| Parquet logical-type `crs` | GeoParquet column-metadata `crs` | Meaning |
| --- | --- | --- |
| absent (Parquet default) | absent | OGC:CRS84 |
| inline PROJJSON object | the same inline PROJJSON object | CRS fully described in metadata |
| `<authority>:<code>` string | the same `<authority>:<code>` string | CRS identified by authority code |
| absent (writer signaling "no CRS") | `null` | CRS explicitly undefined / unknown |
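
The mirroring rule in this table could be checked along these lines. This is a sketch only: the `mirrors` helper and the `_ABSENT` sentinel are hypothetical names, and it assumes both values have already been parsed (PROJJSON as a `dict`, authority codes as `str`).

```python
_ABSENT = object()  # sentinel meaning "key / property not present at all"


def mirrors(parquet_crs=_ABSENT, geoparquet_crs=_ABSENT):
    """Return True when the two crs fields agree per the mirroring table."""
    if parquet_crs is _ABSENT:
        # Rows 1 and 4: Parquet carries no crs, so the GeoParquet side is
        # either absent (OGC:CRS84 default) or an explicit null (no CRS).
        return geoparquet_crs is _ABSENT or geoparquet_crs is None
    # Rows 2 and 3: same form and same value on both sides.
    if isinstance(parquet_crs, dict):
        return isinstance(geoparquet_crs, dict) and parquet_crs == geoparquet_crs
    if isinstance(parquet_crs, str):
        return isinstance(geoparquet_crs, str) and parquet_crs == geoparquet_crs
    return False
```

Note that the comparison is exact: `"EPSG:4326"` and `"epsg:4326"` do not mirror each other under this reading of "exactly in both form and value".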

Open / followup

The treatment of null (explicit "no CRS / unknown") in 2.0 is preserved here for 1.x compatibility but is awkward: Parquet core has no native way to express "no CRS." A followup discussion may propose removing null from 2.0; not in scope for this PR.

Test results

57 passed in 0.5s  (test_json_schema.py)
72 passed in 12.2s (full scripts/ test suite)

🤖 Generated with Claude Code

cholmes and others added 2 commits June 1, 2026 10:37
The GeoParquet 2.0 schema now accepts the crs field as either inline
PROJJSON, an <authority>:<code> string (e.g. 'EPSG:4326'), or null.
Adds positive and negative test cases for the string form.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Updates the column-metadata table and the crs / crs Parquet property
subsections to:
- permit either inline PROJJSON or an <authority>:<code> string,
- disallow srid:<id> and projjson:<key> on the writer side,
- preserve the reader SHOULD-tolerate language for the disallowed forms,
- require the GeoParquet column-metadata crs and the Parquet
  logical-type crs to mirror each other exactly,
- document the null/no-CRS corner case explicitly,
- note that authority-code OGC:CRS84 / EPSG:4326 are equivalent to
  OGC:CRS84 in §OGC:CRS84 details.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread format-specs/schema.json
{
"type": "string",
"description": "An <authority>:<code> CRS reference (e.g. 'OGC:CRS84', 'EPSG:4326', 'EPSG:3857', 'IGNF:ATI').",
"pattern": "^[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_.-]+$"

Collaborator

Is the pattern backed by any specification? It seems rather arbitrary.

Contributor

The only CRS-related spec regarding names of authorities or codes is WKT CRS, which allows pretty much everything (cf. "wkt Latin1 text", https://docs.ogc.org/is/18-010r11/18-010r11.pdf page 15, which would allow a colon itself in the authority or code, although that would be a bad idea). The proposed pattern is indeed a bit restrictive, but it should capture all real-world CRS codes, at least the ones in proj.db.
It has the advantage of excluding things like "EPSG:4326+3855" that PROJ understands (WGS 84 with EGM2008 height), but that might not be universally understood outside of it.

Member Author

> Is the pattern backed by any specification? It seems rather arbitrary.

No, it was just an attempt to capture the ones that are actually valid in spatialreference.org right now. I'm totally fine with relaxing it, for example if we wanted to enable "EPSG:4326+3855". The idea behind being restrictive was to help people catch spelling errors / typos.

Collaborator

Interesting. The + case seems too difficult for me.

A slightly simpler version, if we want:

Suggested change
- "pattern": "^[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_.-]+$"
+ "pattern": "^[\w.-]+:[\w.-]+$"
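
To see where the two patterns differ, a quick comparison can be run in Python. This is only a sketch: the sample strings are made up for illustration, and `re.ASCII` is used on the assumption that JSON Schema validators follow ECMA-262, where `\w` means `[A-Za-z0-9_]`.

```python
import re

# Pattern currently in the PR.
STRICT = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_.-]+$")
# Simpler pattern from the suggestion; re.ASCII keeps \w to [A-Za-z0-9_].
RELAXED = re.compile(r"^[\w.-]+:[\w.-]+$", re.ASCII)

# Illustrative inputs only; the last four are deliberately odd.
SAMPLES = [
    "EPSG:4326",
    "OGC:CRS84",
    "IGNF:ATI",
    "EPSG:4326+3855",
    "4326:EPSG",
    "my.auth:1",
    "EPSG:",
]


def compare(samples=SAMPLES):
    """Map each sample to (matches_strict, matches_relaxed)."""
    return {s: (bool(STRICT.match(s)), bool(RELAXED.match(s))) for s in samples}


if __name__ == "__main__":
    for sample, (strict, relaxed) in compare().items():
        print(f"{sample!r:20} strict={strict} relaxed={relaxed}")
```

Both patterns reject the compound `EPSG:4326+3855`; the relaxed one additionally accepts authorities that start with a digit or contain a dot.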

- A GeoParquet 2.0 writer SHOULD prefer inline PROJJSON. It is self-describing, requires no external CRS registry, and matches the GeoParquet 1.x convention.
- A GeoParquet 2.0 writer MAY use `<authority>:<code>` when:
  - the CRS is precisely identified by a well-known authority entry, AND
  - the writing environment lacks a PROJJSON generator (e.g., pure-SQL pipelines, lightweight browser-side writers), OR

m-mohr commented Jun 1, 2026

Collaborator

This moves the requirement to ship a heavy database from the writer to the reader. I'd assume readers occur more often, so this somehow worsens the situation IMHO.

But if native GEOMETRY already does this authority:code syntax, I guess the ship has sailed?

Member Author

> But if native GEOMETRY already does this authority:code syntax, I guess the ship has sailed?

Yeah, that was my main thinking. I went back and forth for a while: the GeoParquet 'way' is clearly to ship with all the info needed. But with 2.0 we want to enable the adoption of Parquet geometry / geography types, and they have that option. They have a weird srid:34 option which just seems bad, so we don't endorse that. And they have a PROJJSON reference in a file, which is hard to use, so we don't recommend that.

I didn't end up feeling that strongly about adding this vs. sticking to PROJJSON, so I could be convinced the other way. But it seems like most tools will want to implement support for what the Parquet spec says, so they'd have to do this anyway.

It does make things a bit easier for writers, who don't have to fully create PROJJSON, but it makes things more difficult for readers. Part of my hope is that we could lean on spatialreference.org so that people don't have to ship the full database. We could make that pattern a bit more explicit.

m-mohr commented Jun 1, 2026

Collaborator

In this case, wouldn't it scale better to only have the writers read from spatialreference.org once instead of a client that may even do thousands of requests per second or so?

The question to me is: What is the intent behind having GeoParquet 2 on top of the native types?
Is it just mirroring the native types and adding a bit on top, or is it meant to be a best practice on top? In the second case, I feel like I'd go with only PROJJSON in the spec, as it seems better to embed it during writing instead of requesting it each time a client opens the file. If it's the first option, we could go either way.

- A GeoParquet 2.0 writer MAY use `<authority>:<code>` when:
  - the CRS is precisely identified by a well-known authority entry, AND
  - the writing environment lacks a PROJJSON generator (e.g., pure-SQL pipelines, lightweight browser-side writers), OR
  - reducing per-file metadata size is desirable (e.g., very many very small files).

Collaborator

Is this really a usecase?

Member Author

Suggested change
- reducing per-file metadata size is desirable (e.g., very many very small files).

No, not really. I should have read what my Claude came up with a bit more closely.

Collaborator

It's an admittedly niche use case, but this shows up in Iceberg appends (where 4KB of PROJJSON in the always uncompressed footer can be >50% of the file).

| absent (Parquet default) | absent | OGC:CRS84 |
| inline PROJJSON object | the same inline PROJJSON object | CRS fully described in metadata |
| `<authority>:<code>` string | the same `<authority>:<code>` string | CRS identified by authority code |
| absent (writer signaling "no CRS") | `null` | CRS explicitly undefined / unknown |

Collaborator

The "absent" in the first column here is confusing, because it is the same as the first row, but with different mappings to GeoParquet?

@jorisvandenbossche

Collaborator

> which forces every writer to ship a PROJJSON generator (pyproj or equivalent) even to emit EPSG:4326

FWIW, I would say this part of the rationale is not really true. If your writer only wants to emit EPSG:4326, you can easily hardcode the PROJJSON for that instead of hardcoding the authority:code string (like we list the OGC:CRS84 value in the spec as detail)
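
A hardcoded constant along these lines would be enough for such a writer. Treat this as a hand-written sketch: it uses the simplified single-datum form rather than the datum ensemble that recent PROJ releases emit, the constant names are hypothetical, and the object should be checked against `projinfo` or pyproj output before being shipped.

```python
import json

# Hand-written, simplified PROJJSON for EPSG:4326 (latitude, longitude order).
# Assumption: a single-datum description is acceptable for the use case.
WGS84_PROJJSON = {
    "type": "GeographicCRS",
    "name": "WGS 84",
    "datum": {
        "type": "GeodeticReferenceFrame",
        "name": "World Geodetic System 1984",
        "ellipsoid": {
            "name": "WGS 84",
            "semi_major_axis": 6378137,
            "inverse_flattening": 298.257223563,
        },
    },
    "coordinate_system": {
        "subtype": "ellipsoidal",
        "axis": [
            {"name": "Geodetic latitude", "abbreviation": "Lat",
             "direction": "north", "unit": "degree"},
            {"name": "Geodetic longitude", "abbreviation": "Lon",
             "direction": "east", "unit": "degree"},
        ],
    },
    "id": {"authority": "EPSG", "code": 4326},
}

# Serialised once at import time; a writer can embed this string or the dict.
WGS84_PROJJSON_TEXT = json.dumps(WGS84_PROJJSON, separators=(",", ":"))
```

The serialised form is a few hundred bytes, and no CRS library is needed at write time.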

@paleolimbot

Collaborator

I'm happy with either the existing language or allowing authority:code. Writers that don't have the ability to output PROJJSON also don't have to write GeoParquet metadata (they can just write regular Parquet). Also, many readers don't actually need to resolve the full CRS definition (e.g., they want to calculate ellipsoidal distance or have a hand-rolled non-PROJ transformation), so in practice lazily resolving the definition probably presents few problems.

cholmes commented Jun 2, 2026

Member Author

We had lots of discussion of this on the community call on June 1st. No one feels that strongly, and realistically what we do in the spec isn't going to have a huge effect on the ecosystem; indeed, we hope that readers will implement all the recommendations in the Parquet spec.

It did come up that with spatialreference.org, clients would not necessarily have to ship a full projection database, and that we could lay out that path a bit in our spec, explaining that they can call spatialreference.org and look up the PROJJSON. But if that's the route, then it seems better to just have an explicit crs with the link to the PROJJSON, with a new prefix, as the Parquet spec allows anything in that field.

There was a slight lean in the group towards not allowing it and sticking with PROJJSON only, to stay a bit more backwards compatible and to nudge the ecosystem towards inlining PROJJSON.
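
For reference, the spatialreference.org lookup path discussed above might look roughly like the following on the reader side. This is a sketch under assumptions: the URL layout in `URL_TEMPLATE` is assumed rather than confirmed and should be verified against the site, and `projjson_url` / `make_resolver` are hypothetical names. The cache addresses the concern about clients issuing repeated requests for the same code.

```python
import json
from functools import lru_cache
from urllib.request import urlopen

# Assumed URL layout; verify against spatialreference.org before relying on it.
URL_TEMPLATE = "https://spatialreference.org/ref/{authority}/{code}/projjson.json"


def projjson_url(crs):
    """Build the lookup URL for an '<authority>:<code>' string."""
    authority, code = crs.split(":", 1)
    return URL_TEMPLATE.format(authority=authority.lower(), code=code)


def make_resolver(fetch=None, maxsize=256):
    """Return a resolver that fetches each distinct CRS string at most once."""
    if fetch is None:
        def fetch(url):
            # Default transport: plain HTTPS GET with a timeout.
            with urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8")

    @lru_cache(maxsize=maxsize)
    def resolve(crs):
        # The cached dict is shared between callers; copy before mutating.
        return json.loads(fetch(projjson_url(crs)))

    return resolve
```

Passing a custom `fetch` makes it possible to serve lookups from a local file or bundled table instead of the network, which keeps the reader usable offline.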

5 participants