Releases: dathere/qsv

8.0.0

06 Oct 00:43

[8.0.0] - 2025-10-06

[Banner image: FAIR data is AI-ready data] [1]
Findable, Accessible, Interoperable & Reusable (FAIR) Data is AI-Ready Data.

A week and a half after launching our "People's API" AI Chatbot and "AI-Ready" service, we fine-tune qsv further, as it powers the FAIRification engine that allows us to "open your data" (as a verb) - to infer and calculate AI-Ready, FAIR metadata at blazing speed even for large datasets.

This release features:

These changes set the stage for even more advanced, powerful, configurable FAIRification capabilities to make ALL your Data AI-Ready, Useful, Usable & Used by Machines & Humans alike.

Added

  • table: add leftendtab alignment option #3004
  • table: add leftfwf (Fixed Width Format) alignment option 590c861
  • validate: add Extended Input Support to RFC 4180 validation mode #3012
  • added PowerPC64 LE Linux prebuilt

Changed

  • describegpt: fine-tuned default LLM Prompt template (v3.1.0) 00e52a3 6b09b7e 5be7f2e
  • luau: bump embedded Luau from 0.690 to 0.693 #3017
  • schema: make Decimal Type Scale configurable for polars schema with QSV_POLARS_DECIMAL_SCALE env var - f20edd5
  • updated optimized csv crate, adding non-allocating StringRecord::trim() and more inline()s 4a1c82a
  • deps: bump calamine to 0.31.0 bd7a04c
  • deps: Bump polars to 0.51.0 from 0.50.0 at py-1.33.1 tag #2995
  • deps: bump polars to 0.51.0 at py-1.34.0-beta.4 tag at revision b973cac (latest upstream) #3022
  • deps: bump polars to 0.51.0 at py-1.35.0 tag revision b973cac 4164875
  • deps: replace tabwriter with renamed fork qsv-tabwriter #3010
  • deps: use patched fork of whatlang-rs. Though our PR was merged, there is still no new release 6afff4f
  • build(deps): bump base62 from 2.2.2 to 2.2.3 by @dependabot[bot] in #3003
  • build(deps): bump bytemuck from 1.23.2 to 1.24.0 by @dependabot[bot] in #3026
  • build(deps): bump chrono from 0.4.41 to 0.4.42 by @dependabot[bot] in #2974
  • build(deps): bump fancy-regex from 0.16.1 to 0.16.2 by @dependabot[bot] in #3000
  • build(deps): bump flate2 from 1.1.2 to 1.1.3 by @dependabot[bot] in #3027
  • build(deps): bump flexi_logger from 0.31.2 to 0.31.3 by @dependabot[bot] in #3005
  • build(deps): bump flexi_logger from 0.31.3 to 0.31.4 by @dependabot[bot] in #3008
  • build(deps): bump indexmap from 2.11.0 to 2.11.1 by @dependabot[bot] in #2973
  • build(deps): bump indexmap from 2.11.1 to 2.11.3 by @dependabot[bot] in #2993
  • build(deps): bump indexmap from 2.11.3 to 2.11.4 by @dependabot[bot] in #2999
  • build(deps): bump libc from 0.2.175 to 0.2.176 by @dependabot[bot] in #3009
  • build(deps): bump mlua from 0.11.3 to 0.11.4 by @dependabot[bot] in #3021
  • build(deps): bump regex from 1.11.2 to 1.11.3 by @dependabot[bot] in #3011
  • build(deps): bump redis from 0.32.5 to 0.32.6 by @dependabot[bot] in #3016
  • build(deps): bump qsv-stats from 0.38.0 to 0.39.0 by @dependabot[bot] in #3028
  • build(deps): bump qsv-stats from 0.39.0 to 0.39.1 by @dependabot[bot] in #3029
  • build(deps): bump redis from 0.32.6 to 0.32.7 by @dependabot[bot] in #3025
  • build(deps): bump serde from 1.0.219 to 1.0.223 by @dependabot[bot] in #2983
  • build(deps): bump serde from 1.0.223 to 1.0.224 by @dependabot[bot] in #2988
  • build(deps): bump serde from 1.0.224 to 1.0.225 by @dependabot[bot] in #2994
  • build(deps): bump serde from 1.0.225 to 1.0.226 by @dependabot[bot] in #3002
  • build(deps): bump serde from 1.0.226 to 1.0.227 by @dependabot[bot] in #3014
  • build(deps): bump serde from 1.0.227 to 1.0.228 by @dependabot[bot] in #3019
  • build(deps): bump serde_json from 1.0.143 to 1.0.145 by @dependabot[bot] in #2981
  • build(deps): bump semver from 1.0.26 to 1.0.27 by @dependabot[bot] in #2982
  • build(deps): bump sysinfo from 0.37.0 to 0.37.1 by @dependabot[bot] in #3015
  • build(deps): bump sysinfo from 0.37.1 to 0.37.2 by @dependabot[bot] in #3024
  • build(deps): bump tempfile from 3.21.0 to 3.22.0 by @dependabot[bot] in #2975
  • build(deps): bump tempfile from 3.22.0 to 3.23.0 by @dependabot[bot] in #3007
  • build(deps): bump toml from 0.9.6 to 0.9.7 by @dependabot[bot] in #3001
  • pin zip to 4.6, as zip 5 has features that are not widely adopted b231a23
  • applied select clippy lint suggestions
  • updated indirect dependencies
  • bumped MSRV to Rust 1.90

Fixed

  • describegpt: init cache vars even when --no-cache is used #2970
  • describegpt: --base-url option being ignored #2977
  • schema: delimiter detection #2998
  • extdedup: really use memmapped ondisk hash table #3020

Removed

  • removed powerpc64-le cross-compilation directive now that we have access to IBM-provided native PowerPC GH Action runner 9659bfc
  • removed macOS on Intel (x86_64-apple-darwin) prebuilt binaries

Full Changelog: 7.1.0...8.0.0


  1. SangyaPundir, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg

7.1.0

06 Sep 16:07
df89a22

[7.1.0] - 2025-09-06

🇮🇹 csv,conf,v9 edition 🍝

Just in time for csv,conf,v9, we're Bologna-bound and will be talking all things qsv, CSV, open data, metadata standards, AI, POSE and CKAN!

For this feature release, we polished describegpt a bit more for the occasion...

Towards the "People's API"! Verso l'API del Popolo!
(Answering People/Policymaker Interface)

🚀 Enhanced describegpt Command

  • Configurable Frequency Limits: Make frequency distribution limit configurable for better control over data analysis
  • Few-shot Learning: Add --fewshot-examples option to improve LLM response quality with contextual examples
  • Advanced SQL Generation: Fine-tuned SQL generation guidance for better date handling and query optimization
  • Conditional SQL Results: Implement conditional --sql-results format for more efficient "SQL RAG" processing. If the generated SQL query executes successfully, the results are saved to the specified file with a .csv extension. If the query fails (a "SQL hallucination"), the query is saved with a .sql extension instead so the user can tweak and edit it.
  • TogetherAI Support: Add support for TogetherAI models endpoint, expanding LLM provider options
  • Enhanced Error Handling: Improved SQL parsing error handling and more informative error messages
  • Disk Cache by Default: The disk cache is now enabled by default for better performance
  • TOML Configuration: Migrate from JSON to more readable TOML format for more easily modifiable prompt files.
    (see https://github.com/dathere/qsv/blob/master/resources/describegpt_defaults.toml)
  • Better Local LLM Support: --api-key can now be set to NONE for local LLM configurations that may not necessarily run on localhost (e.g. a shared Local LLM service running on the local network)
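
The conditional --sql-results behavior described in the list above can be sketched roughly as follows. This is only an illustration, not describegpt's implementation: Python's built-in sqlite3 stands in for Polars SQL or DuckDB, and the helper name save_sql_results is hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def save_sql_results(conn, sql, requested_path):
    """Run `sql`; save rows as .csv on success, or the query text as .sql on failure.

    Hypothetical helper: sqlite3 is a stand-in for the real SQL engine.
    """
    base = Path(requested_path)
    try:
        cursor = conn.execute(sql)
        rows = cursor.fetchall()
        header = [col[0] for col in cursor.description or []]
    except sqlite3.Error:
        # The generated query did not run: keep it so the user can fix it by hand.
        out = base.with_suffix(".sql")
        out.write_text(sql + "\n", encoding="utf-8")
        return out
    out = base.with_suffix(".csv")
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out
```

The caller can then branch on the returned file's suffix to tell a usable answer from a query that still needs editing.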

partition Command Enhancements

  • New --limit Option: Implement --limit option to set the maximum number of open files
  • Streaming to Enhanced Batching Logic: Convert from streaming to a simplified, two-pass batched approach designed to partition on columns with high cardinality for very large datasets
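
To illustrate why an open-file cap matters for high-cardinality columns, here is a simplified sketch that discovers the distinct keys first and then writes them out in batches of at most `limit` outputs. It is an assumption-laden illustration: qsv's actual two-pass algorithm may differ, the function name is hypothetical, and in-memory lists stand in for open files.

```python
def partition_with_limit(rows, key_index, limit):
    """Group rows by row[key_index] while holding at most `limit` outputs open at once.

    `rows` must be a list (it is scanned more than once).
    Returns (partitions, max_open) so the cap can be checked.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Discovery pass: distinct keys in first-seen order.
    keys = list(dict.fromkeys(row[key_index] for row in rows))
    partitions = {}
    max_open = 0
    for start in range(0, len(keys), limit):
        batch = keys[start:start + limit]
        open_outputs = {key: [] for key in batch}  # stand-ins for open files
        max_open = max(max_open, len(open_outputs))
        for row in rows:
            bucket = open_outputs.get(row[key_index])
            if bucket is not None:
                bucket.append(row)
        partitions.update(open_outputs)  # "close" this batch's outputs
    return partitions, max_open
```

The trade-off in this sketch is extra scans of the input in exchange for a bounded number of simultaneously open outputs.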

Added

  • describegpt: add configurable frequency limit #2950
  • describegpt: migrate prompt file from JSON to easier-to-edit TOML format #2954
  • describegpt: refactor default prompt file; add --fewshot-examples option #2955
  • describegpt: add TogetherAI support for models endpoint #2965
  • partition: add --limit option #2960
  • added Windows ARM64 prebuilt binaries

Changed

  • describegpt: enable disk cache by default #2951
  • describegpt: Polars SQL generation tweaks #2958
  • python: replace deprecated with_gil with attach #2949. This sets the stage for "free-threaded" Python 3.14 support when it's released in October 2025. Buh-bye GIL!
  • deps: bump embedded Luau from 0.688 to 0.690 #2967
  • deps: bump Polars to 0.50.0 at py-1.33.0 tag
  • build(deps): bump actions/setup-python from 5.6.0 to 6.0.0 by @dependabot[bot] in #2962
  • build(deps): bump actions/stale from 9 to 10 by @dependabot[bot] in #2963
  • build(deps): bump log from 0.4.27 to 0.4.28 by @dependabot[bot] in #2961
  • build(deps): bump mlua from 0.11.2 to 0.11.3 by @dependabot[bot] in #2948
  • build(deps): bump pyo3 from 0.25.1 to 0.26.0 by @dependabot[bot] in #2946
  • build(deps): bump uuid from 1.18.0 to 1.18.1 by @dependabot[bot] in #2956
  • build(deps): bump zip from 4.5.0 to 4.6.0 by @dependabot[bot] in #2952
  • applied select clippy lints
  • updated indirect dependencies

Full Changelog: 7.0.1...7.1.0

7.0.1

29 Aug 03:06
aa404c3

[7.0.1] - 2025-08-28

A patch release with some minor bug fixes, benchmark tweaks and build system improvements.

Added

  • publish: add dedicated powerpc64le-unknown-linux-gnu publishing workflow (WIP)

Changed

  • docs: describegpt expanded error message about LLM URL or API key
  • deps: remove planus pinned dependency

Fixed

  • fix: geocode --batch 0 causes panic when polars feature is enabled
  • publish: remove luau feature from x86_64-pc-windows builds that was causing builds to fail
  • publish: remove powerpc64le from main publish workflow
  • benchmarks: updated to v6.8.0 with fixes to luau and clustered sample benchmarks

Full Changelog: 7.0.0...7.0.1

7.0.0

28 Aug 14:13

[7.0.0] - 2025-08-28

🥳 Open Weights with Open Data, Local LLM 🤖 edition 🚀

This is the biggest release yet - 470+ commits since v6.0.1! Packed with new AI-powered features, fixes and significant performance improvements suite-wide!

With the release of OpenAI's gpt-oss open-weight reasoning model earlier this month setting the stage, we continue on our "Automagical Metadata" journey by revamping describegpt.

🤖 Revamped describegpt - AI-Powered Metadata Inferencing and Data Analysis:

  • Intelligent Metadata Generation: Automatically generate comprehensive metadata - Data Dictionaries, Description and Tags for your Datasets using Large Language Models (LLM) prompted with summary statistics and frequency tables as detailed context - without sending your data to the cloud!
    Even if you elect to use a cloud-based LLM, your Raw Data is never sent.
  • Chat with your Data: If your prompt can be answered using this high-quality, high-resolution Metadata, describegpt will answer it! If your prompt is not remotely related to the data, it will politely refuse - "I'm sorry, I can only answer questions about the Dataset."
  • Auto SQL RAG Mode: Should the LLM decide that it doesn't have the necessary information in the metadata it compiled to answer your prompt, it will automatically enter SQL Retrieval-Augmented Generation (RAG) mode - using the rich metadata instead as context to craft an expert-level, deterministic, reproducible, "hallucination-free" SQL query [1] to respond to your prompt.
  • Database Engine Support: If DuckDB is installed or the Polars feature is enabled, and --sql-results <ANSWER.CSV> is specified - an optimized SQL query will be automatically executed with the query results saved to the specified file.
    As both DuckDB and Polars are purpose-built OLAP engines that support direct queries (no database pre-loading required), you get answers in a few seconds [2] - even for very large datasets.
  • Multi-LLM Support: Works with any OpenAI-API compatible LLM - with special support for local LLMs like Ollama, Jan and LM Studio, with the ability to customize model behavior with the --addl-props option.
  • Advanced Caching: Disk and Redis caching support for performance and cost optimization.
  • Flexible Prompting: Custom prompt files and built-in intelligent templates for various analysis tasks.

Check out these examples using a 1 million row sample of NYC's 311 data!

On top of other improvements in Datapusher+ with its new Jinja-based "metadata suggestion engine" - we're using this AI-inferred metadata along with other precalcs to prepopulate DCATv3 (both US and European profiles) and Croissant metadata fields that are otherwise too hard and expensive to compile manually.

The inferred and precalculated metadata values are offered as "suggestions", using a UI/UX purpose-built to facilitate interactive metadata curation chats.

This allows Data Stewards to compile high-quality, high-resolution metadata catalogs with an accelerated "Data Steward in the Loop" data ingestion and metadata curation workflow.

If you want to see and learn more, we're Bologna-bound to attend csv,conf,v9 to present and share how we're using this to auto-infer metadata in CKAN. Hope to see you there!

Towards the People's API!

(Answering People/Policymaker Interface)


📊 Enhanced frequency Command:

  • Rank Column: Ranking of frequency results for better data insights
  • JSON Output Mode: New --json option not only provides structured output beyond the default CSV format - it also takes advantage of JSON's nested support to include 15 additional summary statistics per field
  • Performance Boost: Speed improvements with SIMD-accelerated number parsing, remaining performant even with the added functionality
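
As a rough picture of what a ranked frequency table contains, the sketch below computes counts, percentages and a rank per value. It is not qsv's code: the function name is hypothetical, and the choice of dense ranking (tied counts share a rank) is an assumption, since the notes above do not say how ties are ranked.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percentage, rank) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    table = []
    rank = 0
    previous_count = None
    for value, count in counts.most_common():
        if count != previous_count:
            rank += 1  # dense ranking: tied counts share a rank (an assumption)
            previous_count = count
        table.append((value, count, round(100 * count / total, 2), rank))
    return table
```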

stats Command Improvements:

  • Faster Still: Enabled by improvements in the underlying qsv-stats crate
  • Improved Precision: Faster, streamlined precision calculation
  • SIMD Number Parsing: Hardware-accelerated parsing for int/float values
  • Unix Epoch Support: Proper handling of Unix timestamp 0 as valid date
  • Enhanced Date Inference: Better date and boolean type inference capabilities
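
Timestamp 0 is easy to mishandle because zero is "falsy" in many languages. The sketch below shows a parser that accepts it; whether a truthiness check was the actual cause in qsv is not stated above, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_unix_timestamp(text):
    """Parse a string of whole seconds since the Unix epoch into a UTC datetime, or None."""
    try:
        seconds = int(text)
    except ValueError:
        return None
    # Deliberately no `if not seconds` check: 0 is falsy in Python but is a
    # valid timestamp (1970-01-01T00:00:00Z).
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```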

🔧 validate & schema Enhancements:

  • Fancy Regex Support: You can now use "advanced" regex features in your JSON Schema patterns with the --fancy-regex option. Previously, you could only use the standard Rust regex engine, which (for performance reasons) does not support backreferences or look-arounds.
  • JSON Schema Improvements: Better error handling and format validation options
  • Schema Validation Refinements: More granular validation control with --no-format-validation
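
For readers unfamiliar with the two constructs named above, here is what look-arounds and backreferences look like in practice. Python's re module is used purely for illustration (it happens to support both); the patterns are made-up examples, not taken from qsv.

```python
import re

# Look-ahead: at least 8 characters, containing a digit and a lowercase letter.
LOOKAHEAD = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")

# Backreference: the same word repeated on both sides of a hyphen.
BACKREFERENCE = re.compile(r"(\w+)-\1")


def is_valid(pattern, text):
    """True when the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None
```

Patterns like these are the kind that would need the --fancy-regex option when used as a JSON Schema "pattern".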

🔄 rename Reverted and Improved:

When pairwise renaming was introduced in v6.0.0, it broke some workflows. It's now fixed by introducing two modes:

  • Positional Mode: Renaming by position is now once again the default
  • Pairwise Mode: New --pairwise flag for column renaming by column pairs
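
The difference between the two modes can be sketched as below. This is a simplified model under stated assumptions: the function names are hypothetical, pairs are assumed to be given as a flat old,new,old,new list, and qsv's real argument handling may differ.

```python
def rename_positional(headers, new_names):
    """Replace every header by position; the counts must match."""
    if len(new_names) != len(headers):
        raise ValueError("need exactly one new name per column")
    return list(new_names)


def rename_pairwise(headers, pairs):
    """Rename only the columns named in a flat [old1, new1, old2, new2, ...] list."""
    if len(pairs) % 2 != 0:
        raise ValueError("pairs must contain an even number of names")
    mapping = dict(zip(pairs[0::2], pairs[1::2]))
    return [mapping.get(name, name) for name in headers]
```

Positional mode needs a name for every column, while pairwise mode leaves unlisted columns untouched, which is why silently switching the default could break existing scripts.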

🗂️ partition Improvements:

  • Case-Insensitive Safety: Improved case-aware partitioning algorithm. Previously, case-insensitive file systems like macOS APFS and Windows NTFS were causing incorrect partitioning of case-sensitive values
  • Faster still: With better use of I/O buffering - with deferred, batched, async writes instead of after every record
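
The case-sensitivity problem arises because values such as "NY" and "ny" map to the same file name on a case-insensitive file system. One possible way to keep them apart is sketched below; the suffixing scheme and function name are hypothetical and are not claimed to be what qsv does.

```python
def safe_filenames(values):
    """Map each distinct value to a file name that stays unique when case is ignored.

    Sketch only: it does not guard against a value that already looks
    like a suffixed name (e.g. "ny_1").
    """
    used = {}      # case-folded value -> how many names were issued for it
    mapping = {}   # original value -> file name
    for value in values:
        if value in mapping:
            continue
        folded = value.casefold()
        n = used.get(folded, 0)
        mapping[value] = f"{value}.csv" if n == 0 else f"{value}_{n}.csv"
        used[folded] = n + 1
    return mapping
```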

Added

  • frequency: add rank info to frequency table #2878
  • frequency: add --json output option #2868
  • validate: add --fancy-regex option #2845
  • add CPU-accelerated, mem-mapped, chunked sha256 file checksum helper #2909

Changed

  • apply: use SIMD-accelerated base64-simd crate for Encode64 and Decode64 operations #2863
  • stats: faster precision calculation #2852
  • perf: Use simd_json instead of serde_json to serialize to JSON #2884
  • refactor: create and use reqwest client helpers to eliminate redundant code #2888
  • perf: Faster parallelized sha256 hash file #2918
  • refactor: describegpt #2890
  • refactor: describegpt setting --timeout to 0 sets no timeout #2891
  • refactor: describegpt more refinements #2892
  • feat: describegpt refactor round3 #2893
  • feat: describegpt disk & redis caching #2895
  • refactor: describegpt #2896
  • refactor: describegpt create get_cache_key helper; customizable stats options #2902
  • feat: describegpt auto SQL RAG for --prompt #2904
  • feat: describegpt major refactor #2913
  • refactor: describegpt default promptfile is now embedded in qsv binary; fine-tune tests #2924
  • feat: describegpt returning reasoning with --json option #2926
  • feat: describegpt add DuckDB support in SQL RAG mode #2929
  • feat: describegpt various DuckD...
  1. LLMs can still hallucinate a syntactically wrong SQL query. But once a valid SQL query is generated, it's fully reproducible.

  2. Depending on your LLM setup, SQL query generation may take some time. Once generated however, the SQL query itself will be blazing-fast.


6.0.1

12 Jul 13:49

[6.0.1] - 2025-07-12

This is a patch release with bug fixes and minor improvements.


Changed

  • feat: updated completions for qsv v6.0.0 by @rzmk in #2838
  • docs: updated sample schema.json based on NYC311 1M row sample benchmark data
  • docs: updated sample stats output using NYC 311 1M row sample benchmark data
  • build(deps): bump chrono-tz from 0.10.3 to 0.10.4 by @dependabot[bot] in #2839
  • build(deps): bump qsv-stats from 0.35.0 to 0.36.0 by @dependabot[bot] in #2840
  • bumped indirect dependencies
  • Added benchmark_data.* to .gitignore

Fixed

  • geocode: make --batch=0 mode more robust by setting a minimum batch size of 1,000 rows 2fa90bc
  • jsonl: correct batchsize calculation to use input file instead of output file for line counting 742dc77
  • benchmarks: fixed benchmarks with unescaped parameters with embedded spaces ad95596

Removed

  • Removed retired publishing workflows (linux-glibc-231-musl-123 and wix-installer)

Full Changelog: 6.0.0...6.0.1

6.0.0

11 Jul 12:10

Highlights:

This is a major release with significant improvements and new features!

🔍 Enhanced lens command:

  • File prompt support: You can now load prompts from files using the new file: support, making it easier to reuse complex prompts
  • Wrap mode option: Added --wrap-mode option for better text display control when viewing data
  • Improved examples: Enhanced usage examples and documentation

🔄 Improved rename command:

  • Pair-based renaming: Easier column renaming with more intuitive syntax for bulk operations.

📊 Enhanced sort command:

  • Natural sorting: Added --natural option for human-friendly sorting (e.g., "file1.txt", "file2.txt", "file10.txt" sorts "naturally"; previously lexicographical sorting would sort it as "file1.txt", "file10.txt", "file2.txt")
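
The difference between the two orderings can be reproduced with a small sort key that compares digit runs as numbers. This is a generic sketch of natural sorting, not the algorithm qsv uses, and the function name is hypothetical.

```python
import re


def natural_key(text):
    """Split into digit and non-digit runs so digit runs compare numerically."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", text)]


names = ["file10.txt", "file2.txt", "file1.txt"]
lexicographic = sorted(names)               # file1, file10, file2
natural = sorted(names, key=natural_key)    # file1, file2, file10
```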

⚡ Performance improvements:

  • Memory optimizations: Multiple performance enhancements across frequency, stats, validate, and transpose commands
  • Buffer optimizations: Improved I/O performance with better buffer sizing for various operations
  • Polars engine upgrade: Updated to the latest Polars 0.49.x series for better performance and stability

🔧 Enhanced validation:

  • Robust JSON Schema validation: More granular error messages and better schema validation
  • Improved error reporting: Clearer messages to help debug validation issues
  • UTF-8 handling: Better handling of invalid records with improved debug output

🌐 Geocoding improvements:

  • Updated geosuggest: Bumped to version 0.8 with direct index update support for better geocoding performance

🔗 SQL enhancements (joinp and sqlp):

  • Decimal comma support: Added --decimal-comma option for writing operations, improving international data support
  • Better validation: Enhanced delimiter and decimal comma validation
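
A decimal comma only works when the field delimiter is something other than a comma, which is presumably what the validation guards against. The sketch below illustrates the idea with Python's csv module; the function name and the exact rule are assumptions, not qsv's implementation.

```python
import csv
import io


def write_decimal_comma(rows, delimiter=";"):
    """Write rows as delimited text, rendering floats with a decimal comma."""
    if delimiter == ",":
        raise ValueError("a decimal comma needs a delimiter other than ','")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow(
            [str(v).replace(".", ",") if isinstance(v, float) else v for v in row]
        )
    return buffer.getvalue()
```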

🏗️ Infrastructure updates:

  • Rust 1.88 MSRV: Updated minimum supported Rust version
  • Dependency updates: Comprehensive updates to all major dependencies including Polars, Tokio, and many others
  • Compilation optimizations: Various improvements for faster builds and better runtime performance

Added

New Features:

  • lens: add file: support to load prompts from files #2805
  • lens: add --wrap-mode option #2805
  • rename: pair-based renaming for easier bulk column renaming #2806
  • sort: add --natural option for natural/human-friendly sorting #2808
  • schema: set JSON Schema description to the command line used for generation
  • joinp & sqlp: add --decimal-comma option for writing operations
  • joinp: add --decimal-comma and --delimiter validation
  • sqlp: add --decimal-comma and --delimiter validation
  • validate: more robust JSON Schema schema validation with granular error messages
  • validate: show invalid record in debug format for UTF-8 failures
  • Enhanced completions for qsv v5.1.0 and v6.0.0

Documentation & Examples:

  • lens: improved examples in usage text
  • schema: expand examples and add -P shortcut for --prompt option
  • sqlp: update description to note support for input beyond CSVs
  • Polars SQL documentation noting it's a PostgreSQL dialect
  • Added link to Polars 0.49.0 release notes
  • MSRV documentation updated to Rust 1.88
  • Additional conditions for when to use "portable" binaries

Changed

Performance Improvements:

  • frequency: microoptimize null value handling and preallocate vectors
  • stats: preallocate with_capacity for Unsorted struct and coefficient of variation handling improvements
  • transpose: performance refactoring with optimized buffer handling
  • validate: microoptimizations for JSON instance handling and buffer capacity improvements
  • apply: bigger reader buffer as apply is batch oriented
  • Enabled setter for read and write buffer sizing configuration
  • Various microoptimizations across commands

Polars Engine Updates:

  • Bumped Polars from 0.48 to 0.49.x series
  • Adapted to new Polars PlPath API
  • Updated to use latest Polars upstream throughout development cycle
  • Enabled simd-json compiler hints feature on nightly builds

Dependency Updates:

  • Major updates:

    • Polars: 0.48 → 0.49.x
    • Tokio: 1.45.1 → 1.46.1
    • qsv-stats: 0.33.0 → 0.35.0
    • kiddo: 5.0.3 → 5.2.2
    • indexmap: 2.9.0 → 2.10.0
    • calamine: updated to latest upstream
    • redis: 0.32.2 → 0.32.3
    • sysinfo: 0.35.2 → 0.36.0
    • geosuggest: bumped to 0.8
  • Build dependencies:

    • flexi_logger: 0.31.0 → 0.31.2
    • arboard: 3.5.0 → 3.6.0
    • minijinja: 2.10.2 → 2.11.0
    • minijinja-contrib: 2.10.2 → 2.11.0
    • zip: 4.1.0 → 4.3.0
    • reqwest: 0.12.20 → 0.12.22
    • indicatif: 0.17.11 → 0.17.12
    • phf: 0.11.3 → 0.12.1
    • human-panic: 2.0.2 → 2.0.3
    • jaq-std: 2.1.1 → 2.1.2
    • jaq-core: 2.2.0 → 2.2.1
    • jaq-json: 1.1.2 → 1.1.3

Code Quality & Maintenance:

  • Applied clippy lint suggestions including collapsible_if, needless_return, redundant_clone, and manual_is_multiple_of
  • Updated MSRV to Rust 1.88
  • Set nightly to 2025-06-27
  • Removed hardware-lock-elision feature on parking_lot
  • No longer use similar-asserts crate, reverted to standard assert_eq
  • Better TOML formatting
  • Removed unneeded dependency aliases
  • Various code refactoring for better maintainability

Infrastructure:

  • Updated csvlens integration with natural sorting support
  • Switched dependency management approaches for better upstream compatibility
  • Pin plist to 1.7.3 to avoid unnecessary quick-xml bumps
  • Use latest calamine upstream consistently

Fixed

  • validate: clearer JSON Schema schema error messages to differentiate validation types
  • round_num(): should return an empty string if dec_f64.is_nan()
  • joinp: non-equi-join test result order deterministic issues
  • Enhanced Snappy file decompression robustness
  • Fixed geometric mean calculation in stats
  • Better UTF-8 record validation with debug output
  • Various test adjustments to account for dependency updates and behavior changes
  • Resolved several clippy warnings and code quality issues

Test Updates:

  • rename: add pair-renaming tests
  • sort: add natural sort tests
  • joinp: add decimal_comma tests
  • sqlp: add decimal-comma validation tests
  • validate: add JSON Schema schema validation tests
  • stats: adjust test cases for qsv-stats 0.35.0 changes
  • excel: re-enable and revert formula tests based on upstream changes

Development Notes

Benchmarks:

  • Comprehensive benchmarking for versions 5.1.0 and 6.6.1
  • Performance comparisons available for major operations

Continuous Integration:

  • Multiple dependency updates via Dependabot automation
  • Comprehensive test coverage maintained throughout development
  • Regular upstream synchronization with Polars and other major dependencies

Pull Requests

NOTE: The changelog entries below only document changes with a corresponding PR. Several changes were committed to master directly and are documented in the release highlights above.

Added

  • lens: add --wrap-mode option in #2805
  • rename: add pair-based renaming in #2806
  • sort: add --natural sort option in #2808

Changed

  • geocode: now uses the faster geosuggest 0.8 crate. index-update subcommand now generates command to use geosuggest crate directly to update/create the index instead of doing it internally.
  • schema: when generating JSON schema, description property set to cmdline used to generate the JSON schema in #2796
  • sqlp & joinp: --decimal-comma option is not only for parsing input CSVs, it's also used when writing output CSVs in #2800
  • transpose: performance refactoring in #2827
  • validate improved JSON Schema schema validation in #2803
  • update completions for qsv v5.1.0 by @rzmk in #2804
  • dep: bump polars to latest upstream - adapt to PlPath api reqt in #2822
  • perf: bump to faster geosuggest to 0.8 in #2837
  • build(deps): bump arboard from 3.5.0 to 3.6.0 by @dependabot[bot] in #2814
  • build(deps): bump flexi_logger from 0.31.0 to 0.31.1 by @dependabot[bot] in #2801
  • build(deps): bump flexi_logger from 0.31.1 to 0.31.2 by @dependabot[bot] in #2812
  • build(deps): bump libc from 0.2.173 to 0.2.174 by @dependabot[bot] in #2794
  • build(deps): bump human-panic from 2.0.2 to 2.0.3 by @dependabot[bot] in #2833
  • build(deps): bump indicatif from 0.17.11 to 0.17.12 by @dependabot[bot] in #2818
  • build(deps): bump jaq-std from 2.1.1 to 2.1.2 by @dependabot[bot] in #2830
  • build(deps): bump jaq-core from 2.2.0 to 2.2.1 by @dependabot[bot] in #2831
  • build(deps): bump jaq-json from 1.1.2 to 1.1.3 by @dependabot[bot] in #2832
  • build(deps): bump minijinja from 2.10.2 to 2.11.0 by @dependabot[bot] in #2815
  • build(deps): bump minijinja-contrib from 2.10.2 to 2.11.0 by @dependabot[bot] in #2816
  • build(deps): bump phf from 0.11.3 to 0.12.1 by @dependabot[bot] in #2797
    ...

5.1.0

17 Jun 10:41

[5.1.0] - 2025-06-17

Highlights

  • lens is now colorful by default, with a --monochrome option to turn it off:

     qsv lens /tmp/NYC_311_SR_2010-2020-sample-1M.csv
    
  • lens can now have custom prompts with the --prompt option (with support for ANSI escape codes to format the prompt). Meant to be paired with the --echo-column <colname> option, e.g.:

    qsv lens --prompt $'\033[1;5;31mBlinking red, bold text\033[0m' --echo-column 'Unique Key' \
     /tmp/NYC_311_SR_2010-2020-sample-1M.csv
    


  • the qsv-stats crate - the underlying engine behind the central stats, frequency and "smart" commands, got a lot of love in this release
  • validate got a tad faster while decreasing its memory footprint. The new --no-format-validation option now also allows you to ignore all JSON Schema "format" keywords (e.g. date, email, url, currency, etc.) when validating CSVs.

Added

  • lens: add --prompt option, add examples to regex-enabled options #2772
  • lens: add --monochrome option, otherwise, columns displayed in different colors #2761
  • validate: add --no-format-validation option when in JSON Schema mode #2762
  • docs: add shell completions badges by @rzmk in #2760
  • feat: added criterion trim algorithm microbenchmarks #2789

Changed

  • frequency: performance microoptimizations - use stats cache column cardinality to pre-alloc & size frequency hash tables
  • geocode: refactor regex handling for performance & maintainability
  • json: preserve key order #2777
  • stats: performance microoptimizations - use unwrap_unchecked() instead of just unwrap() in hot sampling functions
  • validate: major refactoring for added performance/memory efficiency
  • chore: temporarily use qsv-calamine until a new calamine is released #2790
  • Bump cpc from 1.9 to 2 #2770
  • deps: bump criterion from 0.5 to 0.6 #2791
  • deps: use latest csvlens upstream with colorful columns https://github.com/dathere/qsv/commit/f2c9322e33a0ac335dafec10a490c871d3de0a6c
  • deps: temporarily use qsv-calamine until a new calamine is released #2790
  • deps: bump our patched forks of cached, csvs_convert, json-objects-to-csv, jsonschema, localzone, rfd, self_update until PRs are merged or new releases are made
  • deps: bump zip from 3 to 4 in 75909d2
  • deps: bump polars to 0.48.1 at 49ce57a revision
  • build(deps): bump atoi_simd from 0.16.0 to 0.16.1 by @dependabot in #2766
  • build(deps): bump bytemuck from 1.23.0 to 1.23.1 by @dependabot in #2778
  • build(deps): bump flate2 from 1.1.1 to 1.1.2 by @dependabot in #2781
  • build(deps): bump flexi_logger from 0.30.1 to 0.30.2 by @dependabot in #2765
  • build(deps): bump flexi_logger from 0.30.2 to 0.31.0 by @dependabot in #2793
  • build(deps): bump hashbrown from 0.15.3 to 0.15.4 by @dependabot in #2779
  • build(deps): bump libc from 0.2.172 to 0.2.173 by @dependabot in #2787
  • build(deps): bump mimalloc from 0.1.46 to 0.1.47 by @dependabot in #2792
  • build(deps): bump mlua from 0.10.3 to 0.10.5 by @dependabot in #2758
  • build(deps): bump num_cpus from 1.16.0 to 1.17.0 by @dependabot in #2771
  • build(deps): bump parking_lot from 0.12.3 to 0.12.4 by @dependabot in #2768
  • build(deps): bump pyo3 from 0.25.0 to 0.25.1 by @dependabot in #2785
  • deps: upgrade qsv-stats from 0.32 to 0.33, which features major memory and performance optimizations behind the stats & frequency commands #2786
  • deps: bump redis from 0.29.5 to 0.32
  • build(deps): bump reqwest from 0.12.15 to 0.12.16 by @dependabot in #2764
  • build(deps): bump reqwest from 0.12.16 to 0.12.18 by @dependabot in #2767
  • build(deps): bump reqwest from 0.12.18 to 0.12.19 by @dependabot in #2773
  • build(deps): bump reqwest from 0.12.19 to 0.12.20 by @dependabot in #2782
  • build(deps): bump rust_decimal from 1.37.1 to 1.37.2 by @dependabot in #2788
  • build(deps): bump smallvec from 1.15.0 to 1.15.1 by @dependabot in #2780
  • build(deps): bump sysinfo from 0.35.1 to 0.35.2 by @dependabot in #2774
  • build(deps): bump titlecase from 3.5.0 to 3.6.0 by @dependabot in #2775
  • build(deps): bump tokio from 1.45.0 to 1.45.1 by @dependabot in #2759
  • build(deps): bump uuid from 1.16.0 to 1.17.0 by @dependabot in #2757
  • applied select clippy suggestions
  • updated indirect dependencies
  • set Rust nightly to 2025-05-21, the same nightly Polars uses 872ade1

Fixed

  • fix: frequency recover from non-fatal absence of stats cache, instead of panicking b2821a0
  • fix: flaky json tests caused by hardcoding name of intermediate file - 62ca310
  • fix: flaky reverse property tests by handling BOM characters cefd490
  • fix: util::process_input helper does not honor QSV_SKIP_FORMAT_CHECK when processing dir input #2784

Full Changelog: 5.0.3...5.1.0

5.0.3

22 May 04:33

[5.0.3] - 2025-05-22 "The Geo Release" 🌍

qsv 5.0.3 represents a major milestone with significant enhancements to its geospatial data processing capabilities.
They're targeted to support the Datapusher+ Data Resource Upload First (DRUF) workflow for "automagical metadata inferencing" - focusing on DCAT-US v3 recommended spatial and temporal properties that would otherwise be too tedious to manually compile:

New Geocoding Capabilities

  • Added IP geolocation with new --iplookup and --iplookupnow subcommands in the geocode command
  • Integrated Maxmind GeoLite2 database support for accurate IP-to-location mapping
  • Enhanced geocoding performance (up to 5x faster) with rkyv serialization (contributed by @estin)

Enhanced geoconvert Command

  • Added CSV input support alongside existing geospatial formats
  • Introduced GeoJSONL output format for streaming workflows
  • Added stdin support for all formats except SHP input
  • New coordinate handling options: --latitude and --longitude parameters
  • Added --max-length option for output control
  • Comprehensive test coverage additions
  • all contributed by @rzmk!
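
To illustrate what CSV input combined with GeoJSONL output amounts to, here is a minimal Python sketch. It is not qsv's implementation (qsv is written in Rust). It assumes GeoJSONL means one GeoJSON Feature per line, and the function name and default column names are hypothetical.

```python
import csv
import io
import json


def csv_to_geojsonl(csv_text, latitude="lat", longitude="lon"):
    """Turn CSV rows into GeoJSONL: one GeoJSON Point Feature per line.

    The latitude/longitude arguments name the coordinate columns; every
    other column is copied into the feature's properties.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(latitude))
        lon = float(row.pop(longitude))
        feature = {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": row,
        }
        lines.append(json.dumps(feature))
    return "\n".join(lines)
```

Because each line is a complete JSON document, a downstream tool can process the output one feature at a time without loading the whole collection, which is what makes the format suit streaming workflows.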

🚀 Performance & Infrastructure Improvements

Polars Integration

  • Upgraded Polars from 0.46.0 to 0.48.1 with intermediate releases
  • Enhanced Polars schema support across multiple commands (schema, joinp, pivotp, sqlp)
  • Added --polars mode to the schema command to explicitly create a polars schema file on demand, rather than as a side-effect of the sqlp command using its --cache-schema option.

Core Performance

  • Microoptimizations in the sort command
  • Improved file handling with tempfile usage in edit --in-place
  • Enhanced auto-decompression support now available suite-wide for gz, zlib, and zst files

🛠️ New Features & Usability

Enhanced Commands

  • edit: New --in-place option for direct file modification with automatic backup (.bak) creation
  • foreach: Added "/" to splitter pattern for improved path handling
  • stats: New QSV_STATS_STRING_MAX_LENGTH environment variable for string analysis control
  • to: Added --all-strings option for simplified data type handling
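
The in-place editing pattern behind edit --in-place (write to a temporary file, keep a .bak copy, then swap) can be sketched in Python as follows. This is an illustration under assumptions, not qsv's Rust code: the function name is hypothetical, and whether qsv copies or renames the original to create the backup is not stated in these notes.

```python
import os
import shutil
import tempfile


def edit_in_place(path, transform):
    """Rewrite `path` with transform(text), keeping the original as path + '.bak'."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        original = f.read()

    # Write the new content to a temp file in the same directory so the
    # final replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.write(transform(original))
        shutil.copy2(path, path + ".bak")  # backup of the untouched original
        os.replace(tmp_path, path)         # swap in the edited file
    except BaseException:
        # Clean up the temp file if anything failed before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Writing to a temporary file first means a failure partway through leaves the original file untouched.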

Distribution & Installation

  • Added conda package support with installation instructions
  • New download badges and streamlined installation documentation
  • Retired the older glibc-2.31 and musl-1.2.3 "prebuilt-older" binaries, as Ubuntu 20.04 has been retired and is no longer supported by GitHub Actions.
  • Discontinued MSI installer in favor of the easier qsv Windows Easy Installer (thanks @rzmk!)

Quality & Stability

  • Applied multiple clippy lint suggestions for code quality
  • Enhanced test coverage, particularly for geospatial functions
  • Improved documentation with better examples and clearer explanations
  • Fixed stdin handling issues in the split command

🎯 Default Feature Changes

The qsvdp variant now includes geocode and geoconvert commands by default, making geospatial functionality more accessible to Datapusher+ users with Jinja2-powered metadata formulas.

NOTE:

  • for qsv v5.0.3, cargo install will NOT work, as the calamine crate (which powers the excel command) is pinned to zip 2.5.0, which was yanked.
  • unfortunately, the broken zip dependency also prevents us from publishing qsv 5.0.3 to crates.io
  • for both cases, either install the prebuilts or compile from source with cargo build.

Added

  • edit: add --in-place (and test) which uses tempfile by @rzmk in #2744
  • foreach: add "/" to splitter pattern #2754
  • geoconvert: add CSV input and GeoJSONL output and use buf by @rzmk in #2690
  • geoconvert: add stdin support (except for SHP input) by @rzmk in #2699
  • geoconvert: add --latitude and --longitude options by @rzmk in #2707
  • geoconvert: add --max-length option #2711
  • geocode: add iplookup and iplookupnow subcommands #2741
  • tests: geoconvert - add basic tests and move tests to test_geoconvert.rs by @rzmk in #2717
  • qsvdp now includes geocode & geoconvert commands by default #2697
  • stats: QSV_STATS_STRING_MAX_LENGTH env var #2709
  • to: add --all-strings option #2746
  • docs: add conda install command by @rzmk in #2718
  • docs: add qsv download badges and update install instructions by @rzmk in #2721

Changed

Fixed:

New Contributors

Full Changelog: 4.0.0...5.0.3

4.0.0

13 Apr 18:47
b4f1113

[4.0.0] - 2025-04-13

Highlights:

This is a major release with numerous improvements!

  • qsv can now read more file formats by leveraging the Polars engine:
    • Arrow/IPC, Avro, Parquet, JSON (JSON array) and JSONL
    • Automatic decompression support for compressed CSV file dialects (csv, tsv/tab & ssv) using gzip (.gz), zlib (.zlib), zstd (.zst) compression formats. (e.g. data.csv.gz, data.tsv.zst, data.ssv.zlib)
      qsv lens data.csv.gz
      qsv sample 1000 data.parquet | qsv stats | qsv lens
      qsv frequency data.tab.zlib | qsv lens
      qsv search Waldo data.ssv.zst | qsv table
      qsv select 2-5 data.jsonl | qsv lens
      
  • New geoconvert command for converting spatial formats (GeoJSON and SHP) to CSV:
     # convert TX_cities.geojson to CSV, filter out the geometry column and browse with lens
     qsv geoconvert TX_cities.geojson geojson csv | qsv select '!geometry' | qsv lens
    
  • Enhanced split command with new --filter option:
    • Similar to GNU split --filter
    • Spawns a subprocess for each chunk
      # split input.csv into outdir, each chunk having 100,000 rows, gzip compressing each chunk
      qsv split --size 100000 outdir data.csv --filter 'gzip $FILE'
      
  • Expanded to command:
    • added LibreOffice/OpenOffice Calc (ODS) support
    • re-enabled parquet generation now that it's using Arrow instead of DuckDB (which made for very long compiles)
  • New uniqueCombinedWith JSON Schema custom keyword in validate command:
    • Allows validating uniqueness across multiple columns
    • Useful for composite key validation
  • QSV_DOTENV_PATH now supports the sentinel value "<NONE>" to disable dotenv processing altogether.

Added

  • geoconvert: new command to convert spatial formats to CSV by @rzmk in #2681 & #2688
  • split: add --filter option #2660
  • sqlp: add decimal type support #2646
  • to: add back parquet support (to parquet) #2665
  • feat: Extended auto decompression support. In addition to snappy auto-decompression, auto-decompress CSV dialects (tsv/tab & ssv files) using gzip, zlib and zstd compression formats #2671
  • to: add ODS support #2674
  • validate: add uniqueCombinedWith custom JSON Schema Validation keyword #2636
  • feat: prompt: add supported file formats to the dialog box filter when the polars feature is enabled #2667
  • feat: add QSV_POLARS_FLOAT_PRECISION env var #2678
  • tests: add tests for https://100.dathere.com/lessons/3 by @rzmk in #2638
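
The extended auto-decompression entry above boils down to picking a decompressor from the trailing extension and a delimiter from the extension before it. Here is a rough Python sketch of that dispatch, not qsv's code. The function names are hypothetical, it assumes ssv means semicolon-separated, and it leaves zstd out because that needs a binding older Python standard libraries do not ship.

```python
import gzip
import zlib

COMPRESSION_SUFFIXES = (".gz", ".zlib", ".zst")


def decompress_by_extension(filename, raw):
    """Decompress raw bytes according to the file name's last extension."""
    if filename.endswith(".gz"):
        return gzip.decompress(raw)
    if filename.endswith(".zlib"):
        return zlib.decompress(raw)
    if filename.endswith(".zst"):
        # Not handled in this sketch; a zstd library would be called here
        raise NotImplementedError("zstd is not handled in this sketch")
    return raw  # not compressed


def delimiter_for(filename):
    """Guess the delimiter from the extension before any compression suffix."""
    name = filename
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    if name.endswith((".tsv", ".tab")):
        return "\t"
    if name.endswith(".ssv"):
        return ";"  # assumption: ssv is semicolon-separated
    return ","
```

With this split, data.tab.zlib is first inflated with zlib and then parsed as tab-delimited, while data.csv.gz goes through gzip and is parsed as comma-delimited.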

Changed

  • qsvdp binary variant can now use the geocode & geoconvert commands 50f0046
  • geocode feature now gates the geocode & geoconvert commands 9d046e8
  • stats: made stdin handling more robust by adding delimiter inferencing ddecd98
  • feat: setting QSV_DOTENV_PATH to sentinel value "<NONE>" disables dotenv processing #2684
  • refactor: polars special formats support #2683
  • contrib(completions): update completions to v3.3.0 by @rzmk in #2626
  • contrib(completions): update completions for qsv v4.0.0 by @rzmk in #2677
  • deps: bump polars to 0.46.0 at py-1.27.1 tag #2675 and e5d29d7
  • build(deps): bump actions/setup-python from 5.4.0 to 5.5.0 by @dependabot in #2627
  • build(deps): bump arboard from 3.4.1 to 3.5.0 by @dependabot in #2653
  • build(deps): bump chrono-tz from 0.10.2 to 0.10.3 by @dependabot in #2623
  • build(deps): bump crossbeam-channel from 0.5.14 to 0.5.15 by @dependabot in #2672
  • build(deps): bump csvs_convert from 0.11.0 to 0.11.1 by @dependabot in #2686
  • build(deps): bump data-encoding from 2.8.0 to 2.9.0 by @dependabot in #2685
  • build(deps): bump flate2 from 1.1.0 to 1.1.1 by @dependabot in #2649
  • build(deps): bump flexi_logger from 0.29.8 to 0.30.0 by @dependabot in #2650
  • build(deps): bump flexi_logger from 0.30.0 to 0.30.1 by @dependabot in #2651
  • build(deps): bump governor from 0.8.1 to 0.9.0 by @dependabot in #2625
  • build(deps): bump governor from 0.9.0 to 0.10.0 by @dependabot in #2631
  • build(deps): bump jsonschema from 0.29.0 to 0.29.1 by @dependabot in #2635
  • build(deps): bump log from 0.4.26 to 0.4.27 by @dependabot in #2622
  • build(deps): bump mimalloc from 0.1.44 to 0.1.45 by @dependabot in #2652
  • build(deps): bump minijinja from 2.8.0 to 2.9.0 by @dependabot in #2643
  • build(deps): bump minijinja-contrib from 2.8.0 to 2.9.0 by @dependabot in #2642
  • build(deps): bump pyo3 from 0.24.0 to 0.24.1 by @dependabot in #2645
  • build(deps): bump qsv-dateparser from 0.12.1 to 0.13.0 by @dependabot in #2639
  • build(deps): bump qsv-sniffer from 0.10.3 to 0.11.0 by @dependabot in #2640
  • build(deps): bump redis from 0.29.2 to 0.29.4 by @dependabot in #2663
  • build(deps): bump redis from 0.29.4 to 0.29.5 by @dependabot in #2666
  • build(deps): bump smallvec from 1.14.0 to 1.15.0 by @dependabot in #2656
  • build(deps): bump sysinfo from 0.34.0 to 0.34.1 by @dependabot in #2637
  • build(deps): bump sysinfo from 0.34.1 to 0.34.2 by @dependabot in #2648
  • build(deps): bump titlecase from 3.4.0 to 3.5.0 by @dependabot in #2669
  • build(deps): bump tokio from 1.44.1 to 1.44.2 by @dependabot in #2662
  • applied select clippy lint suggestions
  • bumped indirect dependencies to latest version

Fixed

  • fix: select panic when idx is out of bounds #2670
  • fix: correct link to qsv-dateparser accepted date formats #2632
  • fix: reset SIGPIPE handling #2664
  • docs: fix typo it's -> its by @rzmk in #2680

Full Changelog: 3.3.0...4.0.0

3.3.0

23 Mar 17:05

[3.3.0] - 2025-03-23

Highlights:

  • stats got another round of improvements:
    • boolean inferencing is now configurable!
      Before, it was limited to a simple, English-centric heuristic:
      • When a column's cardinality is 2; and the 2 values' first characters are 0/1, t/f or y/n case-insensitive, the data type of the column is inferred as boolean
      • With the new --boolean-patterns <arg> option, we can now specify arbitrary true_pattern:false_pattern pattern pairs. Each pattern can be a string of one or more characters, case-insensitive. If a pattern ends with "*", it is treated as a prefix.
        For example, t*:f* matches "true", "Truthy", "T" as boolean true so long as the corresponding false pattern (e.g. "Fake", "False", "f") is also matched. Bear in mind that the cardinality still needs to be 2, so multiple matches on the same column on different patterns will disqualify the field as boolean if cardinality > 2 (e.g. if a column's domain is "True", "truthy" and "False", it doesn't qualify as its cardinality is 3. On the other hand, if it's "True", "true", "False", "false", "FALSE", it still qualifies as they resolve to just "true/false" case-insensitive).
        For backwards compatibility, the default true/false pairs are 1:0,t*:f*,y*:n*.
    • percentiles can now be computed!
      With the --percentiles flag enabled, stats now returns the 5th, 10th, 40th, 60th, 90th and 95th percentiles by default, using the nearest-rank method, for all numeric and date/datetime columns. A different set of percentiles can be requested with the --percentile-list <arg> option.
      Note that the method for computing quartiles (Method 3) is basically a specialized implementation of the nearest rank method for q1 (25th), q2 (50th or median) and q3 (75th percentile), thus the choice of non-overlapping defaults for --percentile-list.
  • frequency: now uses qsv-stats 0.32.0, which uses the more memory-efficient, often faster foldhash crate
  • in the same vein, by replacing ahash with foldhash suite-wide, qsv got a lot more memory-efficient and often faster when doing hash lookups
  • sample: "streaming" Bernoulli sampling now works for any remotely hosted CSVs with servers that support chunked downloads, without requiring range request support.
  • we're now using the latest Polars engine - v0.46.0 at the py-1.26.0 tag.
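
The configurable boolean inferencing rules described above can be sketched in Python roughly as follows. This is an illustration, not the stats implementation: the function names are hypothetical, and it assumes that a pattern without a trailing "*" must match the whole value and that empty values are not treated specially.

```python
def parse_patterns(spec):
    """Parse 'true:false' pairs, e.g. '1:0,t*:f*,y*:n*', into lowercase tuples."""
    pairs = []
    for pair in spec.split(","):
        true_pat, false_pat = pair.strip().lower().split(":")
        pairs.append((true_pat, false_pat))
    return pairs


def matches(value, pattern):
    """A trailing '*' makes the pattern a prefix; otherwise it must match exactly."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def is_boolean_column(values, spec="1:0,t*:f*,y*:n*"):
    """Infer boolean only if cardinality is 2 and one pattern pair covers both values."""
    distinct = {v.strip().lower() for v in values}  # case-insensitive cardinality
    if len(distinct) != 2:
        return False
    a, b = sorted(distinct)
    for true_pat, false_pat in parse_patterns(spec):
        if (matches(a, true_pat) and matches(b, false_pat)) or (
            matches(b, true_pat) and matches(a, false_pat)
        ):
            return True
    return False
```

Under these assumptions, a column holding "True", "true", "False", "false" and "FALSE" is inferred as boolean, while one holding "True", "truthy" and "False" is not, because its case-insensitive cardinality is 3. A non-English column such as "oui"/"non" only qualifies when a pair like oui:non is supplied.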

Added

  • stats: add configurable boolean inferencing #2595
  • stats: add --percentiles option #2617
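
For readers unfamiliar with the nearest-rank method mentioned for --percentiles: the p-th percentile of N sorted values is the value at ordinal rank ceil(p/100 × N). A minimal Python sketch follows. It is not the qsv-stats implementation, and the function names are hypothetical.

```python
import math

# Defaults listed in these release notes
DEFAULT_PERCENTILES = [5, 10, 40, 60, 90, 95]


def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # 1-based ordinal rank; multiply before dividing to limit rounding error
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def percentiles(values, wanted=DEFAULT_PERCENTILES):
    """Compute several percentiles at once, keyed by percentile."""
    return {p: nearest_rank_percentile(values, p) for p in wanted}
```

Nearest-rank always returns a value that actually occurs in the data, with no interpolation, so the same function works for any sortable values, including dates.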

Changed

  • refactor: replace ahash with faster foldhash #2619
  • replace std assert_eq! macro with similar_asserts::assert_eq! macro for easier debugging #2605
  • deps: bump polars to 0.46.0 at py-1.25.2 tag #2604
  • deps: bump Polars to v0.46.0 at py-1.26.0 tag #2621
  • build(deps): bump actix-web from 4.9.0 to 4.10.2 by @dependabot in #2591
  • build(deps): bump indexmap from 2.7.1 to 2.8.0 by @dependabot in #2592
  • build(deps): bump mimalloc from 0.1.43 to 0.1.44 by @dependabot in #2608
  • build(deps): bump qsv-stats from 0.30.0 to 0.31.0 by @dependabot in #2603
  • build(deps): bump qsv-stats from 0.31.0 to 0.32.0 by @dependabot in #2620
  • build(deps): bump reqwest from 0.12.12 to 0.12.13 by @dependabot in #2593
  • build(deps): bump reqwest from 0.12.13 to 0.12.14 by @dependabot in #2596
  • build(deps): bump reqwest from 0.12.14 to 0.12.15 by @dependabot in #2609
  • build(deps): bump rfd from 0.15.2 to 0.15.3 by @dependabot in #2597
  • build(deps): bump rust_decimal from 1.37.0 to 1.37.1 by @dependabot in #2616
  • build(deps): bump simd-json from 0.14.3 to 0.15.0 by @dependabot in #2615
  • build(deps): bump tempfile from 3.18.0 to 3.19.0 by @dependabot in #2602
  • build(deps): bump tempfile from 3.19.0 to 3.19.1 by @dependabot in #2612
  • build(deps): bump uuid from 1.15.1 to 1.16.0 by @dependabot in #2601
  • build(deps): bump zip from 2.2.3 to 2.4.1 by @dependabot in #2607
  • apply select clippy lint suggestions
  • bumped indirect dependencies to latest version
  • set Rust nightly to 2025-03-07, the same version Polars uses 17f6bdb

Fixed

  • updated lock file, primarily to fix CVE-2025-29787 e44e5df
  • luau: fix flaky register_lookup_table CI test that fails only intermittently on Windows by using a buffered writer in the lookup write_cache_file helper f494b46
  • sample: refactor "streaming" Bernoulli sampling, so it actually works without requiring range requests support #2600
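
Bernoulli sampling suits streaming because each row is kept or dropped independently as it arrives, so the input never has to be seekable or fully downloaded first. A minimal Python sketch of the technique follows. It is not the sample command's code, and the function name and seed parameter are hypothetical.

```python
import random


def bernoulli_sample(rows, probability, seed=None):
    """Yield each row independently with the given probability, in one pass."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rng = random.Random(seed)  # seeded generator for reproducible samples
    for row in rows:
        # One coin flip per row; no buffering and no look-ahead
        if rng.random() < probability:
            yield row
```

Since rows can be any iterable, the same function works on a list, a file object or lines arriving from a chunked HTTP response. The sample size is random rather than fixed, averaging probability × the number of rows.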

Full Changelog: 3.2.0...3.3.0