feat: make pyarrow hot fix optional #11977

Open
aandrestrumid wants to merge 5 commits into ibis-project:main from aandrestrumid:make-pyarrow-hotfix-optional
@aandrestrumid

Description of changes

Make pyarrow-hotfix optional and only needed for pyarrow<14.0.1, where it is effective.

@github-actions github-actions bot added the tests, impala, ci, clickhouse, pyspark, datafusion, duckdb, snowflake, flink, databricks, and athena labels Mar 24, 2026
@aandrestrumid aandrestrumid marked this pull request as ready for review March 24, 2026 16:31
@NickCrews
Contributor

This makes me wonder: should ibis, as a library, even be applying the hotfix for users? Or should we leave that up to the user to control when and how to do that?

From the comment, it looks like the hotfix is a patch for https://nvd.nist.gov/vuln/detail/cve-2023-47248, which is a RCE vulnerability when reading malicious files. In all of our tests and CI environments, I don't think we read arbitrary files, so I don't think we need the hotfix for CI for OUR protection. I assume we are applying it more as a courtesy to end users.

There is something to be said for applying it when an ibis maintainer is mucking around with their dev env and might try reading some malicious file. But our lockfile uses pyarrow 23, so we shouldn't be exposed to this unless the maintainer intentionally installs an old version of pyarrow, and I think that is rare enough that we should pretty much ignore it.

So, what if we

  • removed all deps on pyarrow-hotfix
  • never apply the hotfix
  • put this as a breaking change in the release notes
  • I DO like how in this PR you switch to testing pyarrow versions old/new/None in CI; I think we should keep doing that. POSSIBLY, if we wanted to be really safe, we could still pip install pyarrow-hotfix in CI as you have done in this PR, and then in conftest.py do an if os.environ.get("CI"): import pyarrow_hotfix.

This would

  • be less burden for us
  • give users more control
  • possibly expose users to dangerous input that we were previously protecting them from, but pyarrow 14

I took the download counts by pyarrow version over the last 3 months from https://pepy.tech/projects/pyarrow and it does look like ~20% of users are still downloading unsafe versions of pyarrow

Details of Pyarrow downloads by version

version,downloads_raw,downloads
23.0.1,81.23M,81230000
23.0.0,74.72M,74720000
22.0.0,104.45M,104450000
21.0.0,59.11M,59110000
20.0.0,48.66M,48660000
19.0.1,25.63M,25630000
19.0.0,6.19M,6190000
18.1.0,38.67M,38670000
18.0.0,7.20M,7200000
17.0.0,77.05M,77050000
16.1.0,46.85M,46850000
16.0.0,9.53M,9530000
15.0.2,8.41M,8410000
15.0.1,469.70k,469700
15.0.0,3.69M,3690000
14.0.2,35.98M,35980000
14.0.1,5.41M,5410000
14.0.0,1.64M,1640000
13.0.0,4.07M,4070000
12.0.1,50.25M,50250000
12.0.0,3.18M,3180000
11.0.0,22.12M,22120000
10.0.1,32.29M,32290000
10.0.0,583.42k,583420
9.0.0,8.27M,8270000
8.0.0,5.45M,5450000
7.0.0,6.83M,6830000
6.0.1,10.24M,10240000
6.0.0,822.17k,822170
5.0.0,2.94M,2940000
4.0.1,2.81M,2810000
4.0.0,842.80k,842800
3.0.0,3.17M,3170000
2.0.0,11.84M,11840000
1.0.1,1.09M,1090000
1.0.0,1.08M,1080000
0.17.1,2.03M,2030000
0.17.0,96.22k,96220
0.16.0,353.70k,353700
0.15.1,389.46k,389460
0.15.0,8.20k,8200
0.14.1,228.10k,228100
0.14.0,156.88k,156880
0.13.0,344.35k,344350
0.12.1,21.43k,21430
0.12.0,61.21k,61210
0.11.1,132.78k,132780
0.11.0,33.39k,33390
0.10.0,18.56k,18560
0.9.0,12.72k,12720
0.8.0,994,994
0.7.1,660,660
0.7.0,615,615
0.6.0,666,666
0.5.0,348,348
0.4.1,610,610
0.4.0,558,558
0.3.0,545,545
0.2.0,10.95k,10950
0.1.0,0,0
0.9.0.post1,207,207
0.5.0.post2,312,312
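
As a rough cross-check of the ~20% estimate, the share can be recomputed from the CSV above. This is only a sketch: `parse_version`, `vulnerable_share`, and the three-row `sample` are hypothetical, and it assumes the affected range for CVE-2023-47248 is pyarrow 0.14.0 through 14.0.0 (with 14.0.1 as the first fixed release).

```python
import csv
import io
import re

# Assumption: CVE-2023-47248 affects pyarrow 0.14.0 through 14.0.0 inclusive.
FIRST_AFFECTED = (0, 14, 0)
FIRST_FIXED = (14, 0, 1)


def parse_version(text):
    """Return the leading numeric release parts, e.g. '0.9.0.post1' -> (0, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unparseable version: {text!r}")
    parts = [int(p) for p in match.group(0).split(".")][:3]
    # Pad short versions such as '14' to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def vulnerable_share(csv_text):
    """Fraction of downloads that went to versions in the affected range."""
    total = 0
    vulnerable = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        downloads = int(row["downloads"])
        total += downloads
        if FIRST_AFFECTED <= parse_version(row["version"]) < FIRST_FIXED:
            vulnerable += downloads
    return vulnerable / total if total else 0.0


# Hypothetical three-row excerpt; paste the full table to check the real figure.
sample = """version,downloads_raw,downloads
23.0.1,81.23M,81230000
14.0.0,1.64M,1640000
12.0.1,50.25M,50250000
"""
print(f"{vulnerable_share(sample):.1%}")
```

Versions older than 0.14.0 are counted as not affected here, which matches the assumed range but barely moves the result given how few downloads they have.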

Any thoughts?

If we don't do this direction I suggest, then this PR looks mostly right to me, I can take another more thorough review and then approve.

@NickCrews
Contributor

Like, this change only removes the pyarrow-hotfix dep for CI. For users, they ALWAYS get it installed, because python dep specification doesn't allow us to define the semantics of "package X is required IFF package Y version < 14.0.0". So I'm curious, what is the actual reason behind this PR? What is the benefit of making pyarrow hotfix optional only in this one CI job? If that is the only benefit, then I think the current implementation is simpler, and I don't really see the downside? Is there some downside for a user to apply the hotfix when they have a pyarrow version > 14?

@aandrestrumid
Author

Hi @NickCrews, thanks for looking into this.

I'm curious, what is the actual reason behind this PR?

For context, I am using ibis with bigquery, but I had it installed as ibis-framework and not ibis-framework[bigquery] (so without the optional dependencies).

[project.optional-dependencies]
bigquery = [
  "db-dtypes>=0.3",
  "google-cloud-bigquery>=3",
  "google-cloud-bigquery-storage>=2",
  "pyarrow>=10.0.1",
  "pyarrow-hotfix>=0.4",
  "pydata-google-auth>=1.4.0",
  "pandas-gbq>=0.26.1",
  "numpy>=1.23.2,<3",
  "pandas>=1.5.3,<4",
  "rich>=12.4.4",
]

It was working for me because I had the other dependencies installed already (by coincidence).

But then I removed pyarrow-hotfix when removing another library. Ibis stopped working. I then realized that technically I don't need pyarrow-hotfix and wanted to remove it, but ibis wouldn't let me.

I think for this PR to make sense we should also remove all "pyarrow-hotfix>=0.4" from pyproject.toml.

However I do agree that it's currently simpler. The overhead of pyarrow-hotfix is small. This PR adds complexity. That's probably the price to pay to support an older version of pyarrow.

@NickCrews
Contributor

Thanks, that story of how you ended up here makes sense. I bet other users could have that same experience.

What about this:

# ibis/__init__.py
import os
import warnings


def try_apply_pyarrow_hotfix():
    """If pyarrow and pyarrow_hotfix are available, apply the hotfix on pyarrow < 14.0.1.

    This is a patch for https://nvd.nist.gov/vuln/detail/cve-2023-47248
    """
    # Optional config, IDK if we want this or if there is a better way
    if os.environ.get("IBIS_PYARROW_HOTFIX_SKIP"):
        return

    try:
        import pyarrow as pa
    except ImportError:
        return

    # Anything newer than 14.0.0 (i.e. 14.0.1 and up) already contains the fix.
    if tuple(int(x) for x in pa.__version__.split(".")[:3]) > (14, 0, 0):
        return
    try:
        # Importing pyarrow_hotfix applies the patch as a side effect.
        import pyarrow_hotfix  # noqa: F401
    except ImportError:
        warnings.warn(
            f"You are using an old version of pyarrow ({pa.__version__}) that is "
            "affected by https://nvd.nist.gov/vuln/detail/cve-2023-47248. "
            "Either upgrade pyarrow to >=14.0.1, or install `pyarrow-hotfix` to "
            "patch the old version. Set "
            "`os.environ['IBIS_PYARROW_HOTFIX_SKIP'] = 'true'` to skip this "
            "autopatch and silence this warning."
        )


try_apply_pyarrow_hotfix()
  • allows us to remove ALL junk from the various files, just import pyarrow as pa and go
  • doesn't error if the user doesn't have pyarrow or pyarrow_hotfix installed
  • warns them of the security risk if they are doing something dangerous
  • This DOES incur the performance cost of importing pyarrow every time ibis is imported, even if the user doesn't use pyarrow.

If we wanted to avoid the above perf impact and only import when needed, we could instead put the above implementation in ibis/import_to_try_pyarrow_hotfix.py and then switch the existing import pyarrow_hotfix lines to import ibis.import_to_try_pyarrow_hotfix. Then at least all the call sites stay the same length as they are currently.
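
One caveat with the version check in the snippet above: `int(x)` raises `ValueError` on a component such as `0rc1`, and the environment-variable check treats any non-empty value, including `"0"` or `"false"`, as a request to skip. A minimal sketch of more tolerant helpers follows. `needs_hotfix` and `skip_requested` are hypothetical names, not part of ibis, and the set of accepted opt-out spellings is an assumption.

```python
import os
import re

FIRST_FIXED = (14, 0, 1)  # first pyarrow release without CVE-2023-47248


def needs_hotfix(version):
    """True if a pyarrow version string is older than the first fixed release."""
    # Keep only the leading digits of each of the first three components,
    # so '14.0.0rc1' or '14.0.0.dev5' compare as (14, 0, 0).
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group(0)))
    parts += [0] * (3 - len(parts))
    return tuple(parts) < FIRST_FIXED


def skip_requested(environ=os.environ):
    """True only for explicit opt-out values of IBIS_PYARROW_HOTFIX_SKIP."""
    value = environ.get("IBIS_PYARROW_HOTFIX_SKIP", "")
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

Passing the mapping into `skip_requested` keeps the helper easy to test without touching the real process environment.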

@aandrestrumid
Author

I've updated the PR to apply the hot fix with a simple import

from ibis.common import import_to_try_pyarrow_hotfix  # noqa: F401

To summarize I see 5 solutions:

  1. We remove pyarrow_hotfix from the whole code base and make it the responsibility of the user to import it. Very easy, it will declutter the codebase, but may be unsafe. I would suspect that's how other libraries do it.
  2. We remove pyarrow_hotfix from the whole code base and require pyarrow>=14.0.1 in pyproject.toml. Very easy, but we will force some users to upgrade pyarrow. I know there are still many users of pyarrow<14.0.1, but we don't know how many of them use ibis; it may be very few.
  3. Do nothing. This is not a huge deal. The pyarrow_hotfix overhead is small, and I only had this issue because I didn't install ibis-framework with the necessary extras.
  4. Use your suggested solution. It works, but it makes ibis import pyarrow when present, which has a non-negligible overhead.
  5. Go ahead with this solution, with your suggestion of using import ibis.import_to_try_pyarrow_hotfix, instead of calling apply_pyarrow_hotfix (less clutter)

@aandrestrumid aandrestrumid force-pushed the make-pyarrow-hotfix-optional branch from 7a3a587 to 73b02b5 Compare March 26, 2026 16:17
gen_name,
normalize_filename,
normalize_filenames,
warn_deprecated,
Contributor

If it's easy, revert this change that is just restyling so our git diff is cleaner. But not a blocker if you don't get to it.

)
from ibis.backends.sql import SQLBackend
from ibis.backends.sql.compilers.base import STAR, AlterTable, C, RenameTable
from ibis.common import import_to_try_pyarrow_hotfix # noqa: F401
Contributor

This got moved up here to the top level. So now as soon as you import this file, you need pyarrow installed. I'm guessing you did this to reduce the number of imports from 3 to 1, which is a nice simplification. But I think we should remain more conservative, and only import_to_try_pyarrow_hotfix lazily when we actually DO need pyarrow. So move this back to the original callsites.

The one exception is I don't think we need to import_to_try_pyarrow_hotfix in the if TYPE_CHECKING block?? So I think you can just delete that one??
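
The lazy placement asked for here could look roughly like the following. This is a sketch only: `SketchBackend` and its methods are hypothetical stand-ins for a real backend, and the hotfix import from this PR is shown as a comment so the snippet runs without ibis installed.

```python
class SketchBackend:
    """Hypothetical backend: importing or instantiating it never touches pyarrow."""

    name = "sketch"

    def list_tables(self):
        # No pyarrow needed on this code path, so nothing is imported.
        return ["t"]

    def to_pyarrow(self, columns):
        # Deferred import: pyarrow is only required when this method is called.
        import pyarrow as pa

        # In ibis, the hotfix import from this PR would sit right here:
        # from ibis.common import import_to_try_pyarrow_hotfix  # noqa: F401

        return pa.table(columns)
```

With this layout, a user without pyarrow can still import the module and call `list_tables()`; only `to_pyarrow()` raises `ImportError`.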

@NickCrews
Contributor

Thanks for your work here. Those summaries are great.

I think we go ahead with your solution 5, the one that this PR implements.

Let's switch the actual implementation to this though:

  • includes better docs on what the fix does
  • includes link to this discussion
  • if the user doesn't have pyarrow_hotfix installed, now this warns instead of errors. I think this is friendlier.
# ibis/common/import_to_try_pyarrow_hotfix.py
import os
import warnings


def try_apply_pyarrow_hotfix():
    """If pyarrow and pyarrow_hotfix are available, apply the hotfix on pyarrow < 14.0.1.

    This is a patch for https://nvd.nist.gov/vuln/detail/cve-2023-47248.
    See https://github.com/ibis-project/ibis/pull/11977 for discussion
    """
    # Optional config, IDK if we want this or if there is a better way
    if os.environ.get("IBIS_PYARROW_HOTFIX_SKIP"):
        return

    # This module is only imported from code paths that need pyarrow,
    # so a missing pyarrow is allowed to raise here.
    import pyarrow as pa

    # Anything newer than 14.0.0 (i.e. 14.0.1 and up) already contains the fix.
    if tuple(int(x) for x in pa.__version__.split(".")[:3]) > (14, 0, 0):
        return
    try:
        # Importing pyarrow_hotfix applies the patch as a side effect.
        import pyarrow_hotfix  # noqa: F401
    except ImportError:
        warnings.warn(
            f"You are using an old version of pyarrow ({pa.__version__}) that is "
            "affected by https://nvd.nist.gov/vuln/detail/cve-2023-47248. "
            "Either upgrade pyarrow to >=14.0.1, or install `pyarrow-hotfix` to "
            "patch the old version. Set "
            "`os.environ['IBIS_PYARROW_HOTFIX_SKIP'] = 'true'` to skip this "
            "autopatch and silence this warning."
        )


try_apply_pyarrow_hotfix()

After this:

  • if someone does uv add ibis-framework[duckdb], they get pyarrow_hotfix installed automatically. They can opt out if they really want with the environment variable.
  • if someone does uv add ibis-framework and then manually installs the other deps, they get the hotfix applied automatically if they need it and have pyarrow_hotfix installed; otherwise they get a warning if they need it but don't have it installed. Again, they can opt out of us doing anything with the environment variable.
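
The resulting behavior can be summarized as a small decision function. This is an illustrative sketch, not code from the PR: `hotfix_action` and its string results are hypothetical, and the checks are ordered the same way as in the implementation above.

```python
def hotfix_action(skip_requested, pyarrow_vulnerable, hotfix_installed):
    """Return what the import-time check would do for one environment."""
    if skip_requested:
        return "nothing (user opted out)"
    if not pyarrow_vulnerable:
        return "nothing (pyarrow already fixed)"
    if hotfix_installed:
        return "apply hotfix"
    return "warn"


# ibis-framework[duckdb] with an old pyarrow: the extra pulls in pyarrow-hotfix.
print(hotfix_action(False, True, True))
# Bare ibis-framework, old pyarrow installed by hand, no pyarrow-hotfix.
print(hotfix_action(False, True, False))
```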

Once you switch the implementation to the above, and fix the two other very small nits I had, then I will merge this.
