Releases: dathere/qsv
0.128.0
[0.128.0] - 2024-05-25
❤️ csv,conf,v8 Edition
¡Ándale! ¡Ándale! ¡Arriba! ¡Arriba!
Yii-hah! We're Mexico bound as we head to csv,conf,v8 to present and share qsv with fellow data-makers and wranglers from all over!
And we've packed a lot into this release for the occasion:
- search got a lot of love as it now powers qsv pro's new search feature to get near-instant search results even on large datasets.
- stats - the ❤️ of qsv, now has several cache fine-tuning options with --cache-threshold. It now also computes max_precision for floats and is_ascii for strings. It also has a new --round 9999 sentinel value to suppress rounding of statistics.
- schema & tojsonl are now faster thanks to stats --cache-threshold autoindex & cache creation/deletion logic.
- We upgraded Polars to 0.40.0 to unlock additional capabilities in the count, joinp & sqlp commands.
- count now has an additional blazing fast counting mode using Polars' read_csv() table function.
- frequency gets some micro-optimizations for even faster frequency analysis.
- luau is now bundled with luau 0.625 from 0.622. We also upgraded the bundled LuaDate library from 2.2.0 to 2.2.1. All of this, while making it ~10% faster!
Overall, qsv manages to keep its performance edge despite the addition of new capabilities and features. We'll give a whirlwind tour of qsv and these updates in our talk at csv,conf,v8.
We'll also preview what we've been calling the People's APPI - our "Answering People/Policymaker Interface" in qsv pro.
This is a new way to interact with qsv that's more conversational and less command-line-y using a natural language interface. It's a way to make qsv more accessible to more people, especially those who are not comfortable with the command line.
We're excited to share all these qsv innovations with the csv,conf,v8 community and the wider world! Nos vemos en Puebla!
¡Ándele! ¡Ándele! ¡Epa! ¡Epa! ¡Epa!
Added
- count: additional Polars-powered counting mode using read_csv() SQL table function 05c5809
- input: add --quote-style option df3c8f1
- joinp: add --coalesce option 8d142e5
- search: add --preview-match option #1785
- search: add --json output option #1790
- search: add "match-only" --flag option mode #1799
- search: add --not-one flag for not using exit code 1 when no match by @rzmk in #1810
- sqlp: add --decimal-comma option #1832
- stats: add --cache-threshold option #1795
- stats: add --cache-threshold autoindex creation/deletion logic #1809
- stats: add additional mode to --cache-threshold 63fdc55
- stats: now computes max_precision for floats #1815
- stats: add --round 9999 sentinel value support to suppress rounding #1818
- stats: add is_ascii column #1824
- added new benchmarks for search command 58d73c3
Changed
- count: document three count modes 3d5a333
- describegpt: update --max-tokens type for LLMs with larger context sizes by @rzmk #1841
- excel: use simpler range::headers() to get headers 069acbf
- frequency: ensure --other-sorted works with --other-text 7430ad7
- frequency: microoptimize hot loop d9c01e1, 7c9f925 and
- luau: improve usage text cb6b4d9
- luau: we now bundle luau 0.625 from 0.622 4060975
- luau: update vendored LuaDate library from 2.2.0 to 2.2.1 #1840
- schema: adjust to reflect stats --cache-threshold option 92fed86
- slice: move json output helpers to util 1f44b48
- tojsonl: refactor boolcheck helper 74d5f5a
- docs: cross-reference split & partition commands #1828
- contrib(bashly): update completions.bash for qsv v0.127.0 by @rzmk in #1776
- contrib(bashly): update completions.bash for qsv v0.128.0 by @rzmk in #1838
- deps: upgrade to polars 0.40.0 #1831
- build(deps): bump actix-web from 4.5.1 to 4.6.0 by @dependabot in #1825
- build(deps): bump anyhow from 1.0.82 to 1.0.83 by @dependabot in #1798
- build(deps): bump anyhow from 1.0.83 to 1.0.85 by @dependabot in #1823
- build(deps): bump anyhow from 1.0.85 to 1.0.86 by @dependabot in #1826
- build(deps): bump cached from 0.50.0 to 0.51.0 by @dependabot in #1789
- build(deps): bump cached from 0.51.0 to 0.51.1 by @dependabot in #1793
- build(deps): bump cached from 0.51.1 to 0.51.2 by @dependabot in #1802
- build(deps): bump cached from 0.51.2 to 0.51.3 by @dependabot in #1805
- build(deps): bump crossbeam-channel from 0.5.12 to 0.5.13 by @dependabot in #1827
- build(deps): bump csvs_convert from 0.8.9 to 0.8.10 by @dependabot in #1808
- build(deps): bump data-encoding from 2.5.0 to 2.6.0 by @dependabot in #1780
- build(deps): bump file-format from 0.24.0 to 0.25.0 by @dependabot in #1807
- build(deps): bump flate2 from 1.0.28 to 1.0.29 by @dependabot in #1778
- build(deps): bump flate2 from 1.0.29 to 1.0.30 by @dependabot in #1784
- build(deps): bump hashbrown from 0.14.3 to 0.14.5 by @dependabot in #1781
- build(deps): bump itertools from 0.12.1 to 0.13.0 by @dependabot in #1822
- deps: bump forked jsonschema from 0.17.1 to 0.18.0 f02620f
- build(deps): bump mimalloc from 0.1.41 to 0.1.42 by @dependabot in #1829
- build(deps): bump mlua from 0.9.7 to 0.9.8 by @dependabot in #1821
- build(deps): bump qsv-stats from 0.16.0 to 0.17.1 by @dependabot in #1813
- build(deps): bump qsv-stats from 0.17.1 to 0.17.2 by @dependabot in #1814
- build(deps): bump qsv-stats from 0.17.2 to 0.18.0 by @dependabot in #1816
- build(deps): bump ryu from 1.0.17 to 1.0.18 by @dependabot in #1801
- build(deps): bump semver from 1.0.22 to 1.0.23 by @dependabot in #1800
- build(deps): bump serde from 1.0.198 to 1.0.199 by @dependabot in #1777
- build(deps): bump serde from 1.0.199 to 1.0.200 by @dependabot in #1787
- build(deps): bump serde from 1.0.200 to 1.0.201 by @dependabot in #1804
- build(deps): bump serde from 1.0.201 to 1.0.202 by @dependabot in #1817
- build(deps): bump serde_json from 1.0.116 to 1.0.117 by @dependabot in #1806
- build(deps): bump serial_test from 3.1.0 to 3.1.1 by @dependabot in #1779
- build(deps): bump simple-expand-tilde from 0.1.5 to 0.1.6 by @dependabot in #1811
- build(deps): bump sysinfo from 0.30.11 to 0.30.12 by @dependabot in https://github.com/jq...
0.127.0
Enhanced Frequency Analysis
This is a quick release adding several frequency enhancements for more detailed frequency analysis. The frequency command now includes a percentage column, calculates other values, and supports limiting unique counts and negative limits.
These options provide additional context for Datapusher+, qsv-pro and describegpt so their metadata inferences are more accurate and comprehensive.
Previously, for a 775-row CSV file containing one column named state with entries for all 50 states, frequency only showed [1]:
qsv frequency freq_state_example.csv | qsv table
field  value  count
state  NY     100
state  NJ     70
state  CA     60
state  MA     55
state  FL     45
state  TX     43
state  NM     40
state  AZ     39
state  NV     38
state  MI     35
Now, there's a new percentage column and other values calculation, both of which have configurable options:
qsv frequency freq_state_example.csv | qsv table
field  value       count  percentage
state  NY          100    12.90323
state  NJ          70     9.03226
state  CA          60     7.74194
state  MA          55     7.09677
state  FL          45     5.80645
state  TX          43     5.54839
state  NM          40     5.16129
state  AZ          39     5.03226
state  NV          38     4.90323
state  MI          35     4.51613
state  Other (40)  250    32.25806
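The percentage and Other figures in this table can be reproduced with simple arithmetic. The Python sketch below is illustrative only and is not qsv's implementation. The function name frequency_table, its limit parameter and the placeholder state names are hypothetical. It assumes the percentage is count divided by total rows, rounded to five decimals, which is consistent with the sample output.

```python
from collections import Counter


def frequency_table(values, limit=10):
    """Return (value, count, percentage) rows for the top `limit` values,
    followed by an "Other (n)" row aggregating the remaining n values."""
    counts = Counter(values)
    total = sum(counts.values())
    ranked = counts.most_common()  # sorted by count, descending
    rows = [(v, c, round(100 * c / total, 5)) for v, c in ranked[:limit]]
    rest = ranked[limit:]
    if rest:
        other = sum(c for _, c in rest)
        rows.append((f"Other ({len(rest)})", other, round(100 * other / total, 5)))
    return rows


# Rebuild the example: the 10 listed states account for 525 of the 775 rows.
top = {"NY": 100, "NJ": 70, "CA": 60, "MA": 55, "FL": 45,
       "TX": 43, "NM": 40, "AZ": 39, "NV": 38, "MI": 35}
values = [state for state, n in top.items() for _ in range(n)]

# Assumption: 40 placeholder states share the remaining 250 rows
# (10 states with 7 rows each, 30 states with 6 rows each).
for i in range(40):
    values += [f"S{i:02d}"] * (7 if i < 10 else 6)

rows = frequency_table(values)
for value, count, pct in rows:
    print(value, count, pct)
```

The last row shows why the Other bucket is useful: the 40 states hidden by the default limit together hold roughly a third of the rows.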
This release is also out of cycle to address a big performance regression in the excel command caused by unnecessary formula info retrieval for the --error-format option introduced in 0.126.0. This has been fixed, and the excel command is now back to its speedy self.
Added
- frequency: added percentage column; other values calculation, implementing #1774 #1775
- benchmarks: added new frequency and excel benchmarks b83ad3a
Changed
- contrib(bashly): update completions.bash for qsv v0.126.0 by @rzmk in #1771
- build(deps): bump mimalloc from 0.1.39 to 0.1.41 by @dependabot in #1772
- build(deps): bump qsv-stats from 0.14.0 to 0.15.0 by @dependabot in #1773
- updated several indirect dependencies
- applied select clippy recommendations
Fixed
- excel: fixed performance regression because qsv was unnecessarily getting formula info (an expensive operation) for --error-format option even when not required 772af34
- renamed 0.126.0 sqlp_vs_duckdb benchmark results so they're next to each other for easy direct comparison. 7bcd59e.
Per the benchmarks, sqlp is 2.87 times faster than duckdb v0.10.2 for a simple aggregation (0.066 secs vs 0.19 secs), and 1.42 times faster for an "expensive" aggregation (0.143 secs vs 0.203 secs).
Full Changelog: 0.126.0...0.127.0
[1] With its default --limit setting of 10, frequency only shows the top 10 unique values in the column, sorted by occurrence.
0.126.0
Expanded Metadata Inferencing
describegpt headlines this release, with its new ability to support other local Large Language Models (LLMs) using popular tools that serve them through APIs such as Ollama and Jan. This broadens the tool's utility in diverse AI environments. Beyond OpenAI, qsv can now use other popular LLMs like Llama 3, Mistral, and Gemma. It also unlocks expanded metadata inferencing capabilities in qsv pro.
Several commands got additional options: cat with --no-headers support in the rowskey subcommand; excel with new options like --error-format and short --metadata mode; and foreach with a --dry-run option. frequency also got new options, including --unq-limit for limiting unique counts, support for negative limits, and a --lmt-threshold option for compiling comprehensive frequencies below a threshold. slice now supports negative indices and new JSON output options, providing more flexibility in data slicing.
This is all rounded out with sqlp improvements, including support for single-line comments in SQL scripts and a special SKIP_INPUT value to skip input preprocessing when using table functions directly in Polars SQL (e.g. read_csv() and read_parquet()) - all while increasing performance thanks to the Polars engine being upgraded to 0.39.2.
New Features
- cat: Added --no-headers support to the rowskey subcommand.
- describegpt: Added compatibility for other local Large Language Models (LLMs) such as Ollama and Jan, broadening the tool's utility in diverse AI environments.
- excel: Introduced new options in the excel command: --error-format for better error handling and a short --metadata JSON mode.
- foreach: added a --dry-run option, allowing users to preview the results of scripts without executing them.
- frequency: New options added such as --unq-limit for limiting unique counts; support for negative limits to only show frequencies >= abs(negative limit); and a --lmt-threshold option to allow the compilation of comprehensive frequencies below the threshold - all providing more detailed control over frequency analysis.
- slice: Support for negative indices to slice from the end and new JSON output options.
- sqlp: sqlp now supports single-line comments and includes a special SKIP_INPUT value for more efficient data loading. The Polars engine has also been upgraded to 0.39.2, providing enhanced performance and stability.
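To make the negative limit behavior for frequency concrete, here is a small Python sketch of the rule as these notes describe it (a negative limit keeps only values whose count is >= abs(limit)). It is not qsv's code: apply_limit is a hypothetical helper, and treating a limit of 0 as "keep everything" is an assumption of this sketch.

```python
from collections import Counter


def apply_limit(counts, limit):
    """Hypothetical helper mirroring the described limit semantics.

    limit > 0: keep the `limit` most frequent values.
    limit < 0: keep every value whose count is >= abs(limit).
    limit == 0: keep everything (an assumption of this sketch).
    """
    ranked = Counter(counts).most_common()  # sorted by count, descending
    if limit > 0:
        return ranked[:limit]
    if limit < 0:
        return [(value, count) for value, count in ranked if count >= abs(limit)]
    return ranked


counts = {"NY": 100, "NJ": 70, "CA": 60, "MA": 55, "FL": 45}
top_two = apply_limit(counts, 2)        # the two most frequent values
at_least_60 = apply_limit(counts, -60)  # values seen 60 or more times
print(top_two)
print(at_least_60)
```

A positive limit fixes how many rows come back, while a negative limit fixes the minimum count and lets the number of rows vary with the data.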
Changes and Optimizations
- Performance Enhancements: Microoptimizations in datefmt and validate commands, and increased default length for --infer-len in sqlp for improved performance.
- Dependency Updates: Numerous updates including bumping Luau, jql-runner, pyo3, and other dependencies to enhance stability and security.
- Benchmarks Added: New performance benchmarks for sqlp vs duckdb added to ensure there are no performance regressions between releases. Right now, sqlp is faster than duckdb in most cases (thanks to Polars - see the latest TPC-H benchmarks), but we want to make sure that we keep it that way.
Security and Robustness
- Security Fixes: Updated rustls to fix a specific CVE, and other minor fixes to enhance the security and robustness of network and data processing features.
- Bug Fixes: Various bug fixes including improvements in error formatting in excel and robustness in fetch and fetchpost commands.
Added
- cat: add --no-headers support to rowskey subcommand #1762
- describegpt: add compatibility for other (local) LLMs (Ollama, Jan, etc.) by @rzmk in #1761
- excel: add --error-format option #1721
- excel: add --metadata short JSON mode #1738
- foreach: add --dry-run option #1740
- frequency: add --unq-limit option #1763
- frequency: add support for negative --limit values #1765
- frequency: add --lmt-threshold option #1766
- slice: add support for negative --index option values #1726
- slice: implement --json output option #1729
- sqlp: added support for single-line comments in SQL scripts bb52bce
- sqlp: added SKIP_INPUT special value to short-circuit input processing if the user wants to load input files directly using table functions (e.g. read_csv(), read_parquet(), etc.) fe850ad
- validate: add --valid-output option #1730
- contrib: add sample Bashly completions implementation by @rzmk in #1731
- benchmarks: added sqlp vs duckdb benchmarks.
Changed
- datefmt: microoptimize formatting 0ee27e7
- joinp: adapt to breaking change in Polars 0.39 for lazyframe sort c625ca9
- sqlp: change --infer-len option default from 250 to 1000 for increased performance da1d215
- validate: microoptimize to_json_instance() c2e4a1c
- bump Luau from 0.616 to 0.622 9216ec3
- build(deps): bump jql-runner from 7.1.6 to 7.1.7 by @dependabot in #1711
- build(deps): bump pyo3 from 0.21.0 to 0.21.1 by @dependabot in #1712
- build(deps): bump pyo3 from 0.21.1 to 0.21.2 by @dependabot in #1750
- build(deps): bump strsim from 0.11.0 to 0.11.1 by @dependabot in #1715
- build(deps): bump sysinfo from 0.30.7 to 0.30.8 by @dependabot in #1716
- build(deps): bump sysinfo from 0.30.8 to 0.30.9 by @dependabot in #1732
- build(deps): bump sysinfo from 0.30.9 to 0.30.10 by @dependabot in #1735
- build(deps): bump sysinfo from 0.30.10 to 0.30.11 by @dependabot in #1755
- build(deps): bump redis from 0.25.2 to 0.25.3 by @dependabot in #1720
- build(deps): bump mlua from 0.9.6 to 0.9.7 by @dependabot in #1724
- build(deps): bump reqwest from 0.12.2 to 0.12.3 by @dependabot in #1725
- build(deps): bump reqwest from 0.12.3 to 0.12.4 by @dependabot in #1759
- build(deps): bump anyhow from 1.0.81 to 1.0.82 by @dependabot in #1733
- build(deps): bump robinraju/release-downloader from 1.9 to 1.10 by @dependabot in #1734
- build(deps): bump chrono from 0.4.37 to 0.4.38 by @dependabot in #1744
- bump polars from 0.38 to 0.39 #1745
- build(deps): bump polars from 0.39.0 to 0.39.1 by @dependabot in #1746
- build(deps): bump polars from 0.39.1 to 0.39.2 by @dependabot in #1752
- build(deps): bump qsv-dateparser from 0.12.0 to 0.12.1 by @dependabot in #1747
- build(deps): bump serde_json from 1.0.115 to 1.0.116 by @dependabot in #1749
- build(deps): bump serde from 1.0.197 to 1.0.198 by @dependabot in #1751
- build(deps): bump rustls from 0.22.3 to 0.22.4 by @dependabot in #1758
- build(deps): bump simple-expand-tilde from 0.1.4 to 0.1.5 by @dependabot in #1767
- build(deps): bump serial_test from 3.0.0 to 3.1.0 by @dependabot in #1768
- build(deps): bump actions/setup-python from 5.0.0 to 5.1.0 by @dependabot in #1769
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
- pin Rust nightly to 2024-04-15 - the same nightly that Polars 0.39 is pinned to
- bumped MSRV to 1.77.2
Fixed
- Make init_logger more robust #1717
- count: empty CSVs count as zero also for polars. Fixes #1741 #1742
- excel: fix #1682 by adding --error-format option #1689
- fetch & fetchpost: more robust JSON response validation ebc7287
- slice: use write! macro to get rid of GH Advanced Security lint c739097
- sqlp: fixed docopt defaults that were not being parsed correctly fe850ad
- deps: bump h2 from 0.4.3 to 0.4.4 ...
0.125.0
In this release, we focused on the need for even more speed.
This was done primarily by tweaking several supporting qsv crates. qsv-docopt now parses command-line arguments slightly faster. qsv-stats, the crate behind commands like stats, schema, tojsonl, and frequency, has been further optimized for speed. qsv-dateparser has been updated to support new timezone handling options in datefmt. qsv-sniffer also got a speed boost.
Per the benchmark suite, stats is 25% faster (1.563 secs vs 2.067 secs) when computing the 13 "streaming" stats and 14% faster when computing --everything (17 columns of addl stats - 3.149 secs vs 3.656 secs) for the 1M row, 41 column, 520mb sample of NYC's 311 data.
The count command has been refactored to utilize Polars' SQLContext, which leverages LazyFrames evaluation to automagically count even very large files in just a few seconds. Previously, count was already using Polars, but it mistakenly fell back to a slower counting mode. Now, it consistently delivers fast performance, even without an index. On the same benchmark suite, it takes 0.052 secs vs 0.503 seconds - almost 10x faster!
As count is not just a top-level command, but also a widely used helper used by several qsv commands, this gives the entire suite a nice performance boost.
Continuing on the performance front, the excel command now has a new short --metadata mode, allowing users to just get a "shorter" version of the metadata report that only lists the workbook's top level metadata (sheet index, sheet name, sheet type, visibility) instead of the full metadata report (which also has info like num rows, column metadata, etc.). On the benchmark suite, the short metadata report takes all of 0.005 secs vs 11.237 secs for the 1M row xlsx version of the same NYC 311 data - more than 3 orders of magnitude faster! (It may actually be faster, since 0.005 secs is at the limits of what hyperfine can measure.)
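The speed claims in this release can be checked from the quoted timings. This Python sketch recomputes them; the helper names speedup and time_saved_pct are illustrative, not part of qsv or its benchmark suite. It uses only the numbers stated in these notes.

```python
def speedup(old_secs, new_secs):
    """How many times faster the new timing is (old / new)."""
    return old_secs / new_secs


def time_saved_pct(old_secs, new_secs):
    """Percentage reduction in wall-clock time relative to the old timing."""
    return 100 * (old_secs - new_secs) / old_secs


# Timings quoted in the release notes, as (old, new) seconds.
benchmarks = {
    "stats (streaming)": (2.067, 1.563),
    "stats --everything": (3.656, 3.149),
    "count": (0.503, 0.052),
    "excel short --metadata": (11.237, 0.005),
}

for name, (old, new) in benchmarks.items():
    print(f"{name}: {speedup(old, new):.2f}x faster, "
          f"{time_saved_pct(old, new):.1f}% less time")
```

Read this way, the "25% faster" and "14% faster" stats figures are close to the reduction in elapsed time, while the count and excel figures are ratios of old to new time.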
The datefmt command also got some major enhancements with new timezone handling and timestamp parsing options, though at the cost of a small 15% performance penalty.
Lastly, we are excited to announce that qsv will be featured at the CSV,Conf,V8 conference in Puebla, Mexico on May 28-29. I'll be presenting a talk titled "qsv: A Blazing Fast CSV Data-Wrangling Toolkit". Hope to see you there!
Added
- excel: added short mode to --metadata option #1699
- datefmt: added ts-resolution option to specify resolution to use when parsing unix timestamps #1704
- datefmt: added timezone handling options #1706 #1707 #1642
Changed
- count: refactored to use Polars SQLContext 43a236f
- stats: refactored stats_path helper function 174c30e
- apply, applydp, datefmt, excel, geocode, py, validate: use std::mem::take to avoid clone 1fd187f 8402d3a 8496157
- excel: optimized workbook opening operation 67f662e
- build(deps): bump flexi_logger from 0.27.4 to 0.28.0 by @dependabot in #1673
- build(deps): bump polars from 0.38.2 to 0.38.3 by @dependabot in #1674
- build(deps): bump uuid from 1.7.0 to 1.8.0 by @dependabot in #1675
- build(deps): bump hashbrown from 0.14.3 to 0.14.4 by @dependabot in #1680
- build(deps): bump reqwest from 0.11.26 to 0.11.27 by @dependabot in #1679
- build(deps): bump bytes from 1.5.0 to 1.6.0 by @dependabot in #1685
- build(deps): bump regex from 1.10.3 to 1.10.4 by @dependabot in #1686
- build(deps): bump indexmap from 2.2.5 to 2.2.6 by @dependabot in #1687
- build(deps): bump rayon from 1.9.0 to 1.10.0 by @dependabot in #1688
- build(deps): bump qsv_docopt from 1.6.0 to 1.7.0 by @dependabot in #1691
- build(deps): bump reqwest from 0.12.1 to 0.12.2 by @dependabot in #1693
- build(deps): bump serde_json from 1.0.114 to 1.0.115 by @dependabot in #1694
- build(deps): bump itoa from 1.0.10 to 1.0.11 by @dependabot in #1695
- build(deps): bump actions/setup-python from 5.0.0 to 5.1.0 by @dependabot in #1700
- build(deps): bump rust_decimal from 1.34.3 to 1.35.0 by @dependabot in #1701
- build(deps): bump chrono from 0.4.35 to 0.4.37 by @dependabot in #1702
- build(deps): bump tokio from 1.36.0 to 1.37.0 by @dependabot in #1703
- build(deps): bump qsv-sniffer from 0.10.2 to 0.10.3 by @dependabot in #1708
- build(deps): bump titlecase from 2.2.1 to 3.0.0 by @dependabot in #1709
- build(deps): bump qsv-stats from 0.13.0 to 0.14.0 by @dependabot in #1710
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
- bumped MSRV to 1.77.1
- use #[cfg(debug_assertions)] conditional compilation to avoid compiling debug code in release mode
- use patched forks of jsonschema, cached, self_update and localzone crates to avoid old dependencies which were causing dependency bloat
Fixed
count: fixed polars_count_input helper, as it was always falling back to "slow" counting mode 3484c89
Full Changelog: 0.124.1...0.125.0
0.124.1
Datapusher+ "Speed of Insight" Release!
This release is all about speed, speed, speed! We've made qsv even faster by leveraging Polars' multithreaded, mem-mapped CSV reader to get near-instant row counts of large CSV files, and near instant SQL queries and aggregations with Datapusher+ - automagically inferring metadata and giving you quick insights into your data in seconds!
We're demoing our qsv-powered Datapusher+ at the March 2024 installment of CKAN Monthly Live on March 20, 2024, 13:00-14:00 UTC. Join us!
Beyond pushing data reliably at speed into your CKAN Datastore (it pushes real good!), DP+ does some extended analysis, processing and enrichment of the data so it can be readily Used.
Both fetch and fetchpost commands now also have a --disk-cache option and are fully synched - forming the foundation for high-speed data enrichment from Web Services - including datHere's forthcoming, fully-integrated Data Enrichment Service.
Hi-ho Quicksilver, away!
Added
- count: automatically use Polars multithreaded, mem-mapped CSV reader when polars feature is enabled to get near-instant row counts of large CSV files even without an index #1656
- qsvdp: added polars support to Datapusher+-optimized binary variant, so we can do near instant SQL queries and aggregations during DP+ processing #1664
- fetchpost: added --disk-cache options and synced usage options with fetch #1671
- extended .infile-list to skip empty and commented lines, and to validate file paths 20a45c8 and 2650930
Changed
- sqlp: automatically disable read_csv() fast path optimization when a custom delimiter is specified #1648
- refactored util::count_rows() helper to also use polars if available 1e09e17 and 8d321fe
- publish: updated Windows MSI publish GH Action workflow to use Wix 3.14 from 3.11 75894ef
- deps: bump polars from 0.38.1 to 0.38.2 5faf90e
- deps: update Luau from 0.614 to 0.616 eb197fe and 52331da
- build(deps): bump sysinfo from 0.30.6 to 0.30.7 by @dependabot in #1650
- build(deps): bump chrono from 0.4.34 to 0.4.35 by @dependabot in #1651
- build(deps): bump strum from 0.26.1 to 0.26.2 by @dependabot in #1658
- build(deps): bump qsv-stats from 0.12.0 to 0.13.0 by @dependabot in #1663
- build(deps): bump anyhow from 1.0.80 to 1.0.81 by @dependabot in #1662
- build(deps): bump reqwest from 0.11.25 to 0.11.26 by @dependabot in #1667
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
Fixed
- dedup: fixed #1665 dedup not handling numeric values properly by adding a --numeric option #1666
- joinp: reenable join validation tests now that Polars 0.38.2 join validation is working again 5faf90e and fcfc75b
- count: broken in unreleased 0.124.0. Polars-powered count requires a "clean" CSV file as it infers the schema based on the first 1000 rows of a CSV. This will sometimes result in an invalid "error" (e.g. it infers a column is a number column, when it's not). 0.124.1 fixes this by adding a fallback to the "regular" CSV reader if a Polars error occurs a2c0869
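The count fix follows a common "fast path with fallback" pattern: try the optimized reader first, and use the slower, more tolerant reader if it errors. The Python sketch below illustrates that pattern only. It is not qsv's Rust implementation, and count_rows, fast_counter and strict_counter are hypothetical names, with strict_counter standing in for a reader that rejects data contradicting its assumed schema.

```python
import csv
import io


def count_rows(text, fast_counter=None):
    """Count data rows (excluding the header) in CSV text.

    `fast_counter` stands in for an optimized reader; it is a hypothetical
    callable that may raise on input it cannot handle.
    """
    if fast_counter is not None:
        try:
            return fast_counter(text)
        except Exception:
            # Fast path failed (e.g. a column assumed numeric was not):
            # fall through to the slower but more tolerant reader.
            pass
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


def strict_counter(text):
    """Toy fast path that assumes the second column is numeric
    and raises ValueError otherwise."""
    rows = list(csv.reader(io.StringIO(text)))[1:]
    for row in rows:
        float(row[1])
    return len(rows)


data = "name,value\na,1\nb,2\nc,n/a\n"
n = count_rows(data, fast_counter=strict_counter)  # falls back on "n/a"
print(n)
```

The caller gets a count either way; only the time taken differs between the two paths.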
Removed
- gender_guesser 0.2.0 has been released. Remove patch.crates-io entry 97873a5
Full Changelog: 0.123.0...0.124.1
0.123.0
OPEN DATA DAY 2024 Release!
In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.
We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform Desktop Data Wrangling tool marrying an Excel-like UI with the power of qsv, backed by a cloud-based data cleaning, enrichment and enhancement service. It's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.
Stay tuned!
Highlights:
- sqlp now has automatic read_csv() fast path optimization, often making optimized queries run dramatically faster - e.g. what took 6.09 seconds for a non-trivial SQL aggregation on an 18 column, 657mb CSV with 7.43 million rows now takes just 0.14 seconds with the optimization - 43.5x FASTER! [1]
# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
6.09 real 6.82 user 0.16 sys
# with fast path optimization, fully exploiting Polars' multithreaded, mem-mapped CSV reader!
/usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
0.14 real 1.09 user 0.09 sys
# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID | SUM(total_amount) |
+----------+---------------------+
| 1 | 52377417.529256366 |
| 2 | 89959869.1264675 |
| 4 | 600584.6099999828 |
+----------+---------------------+
72.46 real 65.15 user 75.17 sys
"Traditional" SQL engines
qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:
sqlite 3.43.2 takes 2.910 seconds - 20.79x slower
sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972
PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower
even with an index, qsv sqlp is still 5.96x faster
- sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
- fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
- jsonl is now multithreaded with additional --batch and --job options.
- split now has three modes: split by record count, split by number of chunks and split by file size.
- datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
- enum now has a --start option.
- excel now has a --keep-zero-time option and now has improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24.
- tojsonl now has --trim and --no-boolean options and eliminated false positive boolean inferences.
Added
- apply: add gender_guess operation #1569
- datefmt: new top-level command for date formatting. #1638
- enum: add --start option #1631
- excel: added --keep-zero-time option; improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24 #1595
- fetch: add --disk-cache option #1621
- jsonl: major performance refactor! Now multithreaded with addl --batch and --job options #1553
- sniff: added addl mimetype/file formats detected by bumping file-format from 0.23 to 0.24 #1589
- split: add <outdir> error handling and add usage text examples #1585
- split: added --chunks option #1587
- split: add --kb-size option #1613
- sqlp: added JSONL output format and compression support for AVRO and Arrow output formats in #1635
- tojsonl: add --trim option #1554
- Add QSV_DOTENV_PATH env var #1562
- Add license scan report and status by @fossabot in #1550
- Added several benchmarks for new/changed commands
Changed
- luau: bumped Luau from 0.606 to 0.614
- freq: major performance refactor - 1a3a4b4
- split: migrate to rayon from threadpool #1555
- split: refactored to actually create chunks <= desired --kb-size, obviating need for hacky --sep-factor option #1615
- tojsonl: improved true/false boolean inferencing false positive handling #1641
- tojsonl: fine-tune boolean inferencing #1643
- schema: use parallel sort when sorting enums for fields 523c60a
- Use array for rustflags to avoid conflicts with user flags by @clarfonthey in #1548
- Make it easier and more consistent to package for distros by @alerque in #1549
- Replace simple_home_dir with simple_expand_tilde crate #1578
- build(deps): bump rayon from 1.8.0 to 1.8.1 by @dependabot in #1547
- build(deps): bump rayon from 1.8.1 to 1.9.0 by @dependabot in #1623
- build(deps): bump uuid from 1.6.1 to 1.7.0 by @dependabot in #1551
- build(deps): bump jql-runner from 7.1.2 to 7.1.3 by @dependabot in #1552
- build(deps): bump jql-runner from 7.1.3 to 7.1.5 by @dependabot in #1602
- build(deps): bump jql-runner from 7.1.5 to 7.1.6 by @dependabot in #1637
- build(deps): bump flexi_logger from 0.27.3 to 0.27.4 by @dependabot in #1556
- build(deps): bump regex from 1.10.2 to 1.10.3 by @dependabot in #1557
- build(deps): bump cached from 0.47.0 to 0.48.0 by @dependabot in #1558
- build(deps): bump cached from 0.48.0 to 0.48.1 by @dependabot in #1560
- build(deps): bump cached from 0.48.1 to 0.49.2 by @dependabot in #1618
- build(deps): bump chrono from 0.4.31 to 0.4.32 by @dependabot in #1559
- build(deps): bump chrono from 0.4.32 to 0.4.33 by @dependabot in #1566
- build(deps): bump mlua from 0.9.4 to 0.9.5 by @dependabot in #1565
- build(deps): bump mlua from 0.9.5 to 0.9.6 by @dependabot in #1632
- build(deps): bump serde from 1.0.195 to 1.0.196 by @dependabot in #1568
- build(deps): bump serde from 1.0.196 to 1.0.197 by @dependabot in #1612
- build(deps): bump serde_json from 1.0.111 to 1.0.112 by @dependabot in #1567
- build(deps): bump serde_json from 1.0.112 to 1.0.113 by @dependabot in #1576
- build(deps): bump serde_json from 1.0.113 to 1.0.114 by @dependabot in #1610
- bump Polars from 0.36 to 0.37 #1570
- build(deps): bump polars from 0.37.0 to 0.38.0 by @dependabot in #1629
- build(deps): bump polars from 0.38.0 to 0.38.1 by @dependabot in #1634
- build(deps): bump strum from 0.25.0 to 0.26.1 by @dependabot in #1572
- build(deps): bump indexmap from 2.1.0 to 2.2.1 by @dependabot in https://g...
[1] Measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4.
0.122.0
REQUEST FOR USE CASES:
Please help define the future of qsv.
Add what you're currently using qsv for here - #1529
Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.
Highlights:
- qsvpy is now available in the prebuilt binaries for select platforms! It's a new qsv binary variant with the python feature, enabling the py command. Three subvariants are available - qsvpy310, qsvpy311 and qsvpy312, corresponding to Python 3.10, 3.11 and 3.12 respectively.
- Removed generate command as generate's main dependency is unmaintained and has old dependencies. generate was also not used much, as the test data it generated was not well suited for training models and it was too slow so we decided to remove it even before the synthesize (#235) command is ready.
- reverse now has index support and can work in "streaming" mode and handle larger than memory CSV files.
- sort and sample: users can now choose from three Random Number Generator (RNG) algorithms with the --rng option - standard, faster & cryptosecure.
- pseudo now has --start, --increment & --formatstr options.
- fmt now has a --no-final-newline option to suppress the final newline for better interoperability with other tools, specifically Excel. It also treats "T" as special value for tab character for the --out-delimiter option.
Added
- reverse: now has index support and can work in "streaming" mode #1531
- sort: added --rng <kind> for different kinds of RNGs - standard, faster & cryptosecure #1535
- sample: added --rng <kind> option (standard, faster & cryptosecure) #1532
- pseudo: major refactor. Added --start, --increment & --formatstr options #1541
- fmt: add --no-final-newline option #1545
- added additional benchmarks
- added additional test for new options. We now have ~1,300 tests!
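One way to picture how an index lets reverse work in "streaming" mode: if the byte offset of every record is known, records can be read last-to-first by seeking, without holding the whole file in memory. The sketch below is only an illustration under stated assumptions. It treats the index as a plain list of line offsets (qsv's actual index format is not described here), assumes no quoted field contains a newline, and uses hypothetical function names.

```python
import io


def build_offsets(f):
    """Record the starting byte offset of every line in a binary file object."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def reverse_records(f, offsets):
    """Yield the header first, then the data records last-to-first."""
    if not offsets:
        return
    # The header stays at the top of the output.
    f.seek(offsets[0])
    yield f.readline()
    # Only one record is in memory at a time.
    for pos in reversed(offsets[1:]):
        f.seek(pos)
        yield f.readline()
```

With the offsets in hand, memory use depends on the number of records (one offset each) rather than on the size of the data itself.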
Changed
- fmt: --out-delimiter now treats "T" as special value for tab character #1546
- build(deps): bump whatlang from 0.16.3 to 0.16.4 by @dependabot in #1525
- build(deps): bump serde_json from 1.0.110 to 1.0.111 by @dependabot in #1524
- build(deps): bump pyo3 from 0.20.1 to 0.20.2 by @dependabot in #1526
- build(deps): bump sysinfo from 0.30.3 to 0.30.4 by @dependabot in #1523
- build(deps): bump sysinfo from 0.30.4 to 0.30.5 by @dependabot in #1530
- build(deps): bump serial_test from 2.0.0 to 3.0.0 by @dependabot in #1534
- build(deps): bump mlua from 0.9.2 to 0.9.3 by @dependabot in #1540
- build(deps): bump mlua from 0.9.3 to 0.9.4 by @dependabot in #1542
- build(deps): bump simple-home-dir from 0.2.1 to 0.2.3 by @dependabot in #1544
- apply select clippy suggestions
- update several indirect dependencies
Removed
- removed generate command #1527
- removed generate feature from GitHub Action workflows #1528
- sample: removed --faster RNG sampling option, replacing it with --rng #1532
Full Changelog: 0.121.0...0.122.0
0.121.0
Two days ago, qsv 0.120.0 was released. Hours later, significant updates occurred in our ecosystem: Polars upgraded to version 0.36, Homebrew rolled out support for Rust 1.75.0, and our pull request for 'cached' was merged.
In light of these developments, we're releasing 0.121.0 out of cycle to leverage the new features, fixes and performance enhancements in these key components integral to qsv.
REQUEST FOR USE CASES:
Please help define the future of qsv.
Add what you're currently using qsv for here - #1529
Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.
Added
- sqlp: with Polars 0.36, it now supports:
  - subqueries for JOIN and FROM (examples)
  - REGEXP and RLIKE pattern matching (examples)
  - common variant spelling STDEV in the SQL engine (in addition to STDDEV)
  - and more under the hood improvements!
- sqlp: now supports writing to Apache Avro format 32f2fbb
- sqlp: when writing to CSV --format, if the --output file has a TSV or TAB extension, it will automatically use the tab delimiter c97048c
Changed
- Bump polars from 0.35 to 0.36 #1521
- build(deps): bump serde from 1.0.193 to 1.0.194 by @dependabot in #1520
- build(deps): bump serde_json from 1.0.109 to 1.0.110 by @dependabot in #1519
- build(deps): bump semver from 1.0.20 to 1.0.21 by @dependabot in #1518
- build(deps): bump serde_stacker from 0.1.10 to 0.1.11 by @dependabot in #1517
- build(deps): bump cached from 0.46.1 to 0.47.0 by @dependabot in #1522
- bumped MSRV to 1.75.0
Fixed
- cat: fixed performance regression in rowskey by moving unchanging variables out of hot loop - 96a40e9
- sqlp: Polars 0.36 fixed the SQL SUBSTR() function
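The cat fix above is an instance of a general technique: hoisting loop-invariant work out of a hot loop. The Python sketch below is a generic illustration, not the actual rowskey code. The function names and the row representation (dictionaries keyed by column name) are assumptions made for the example.

```python
def align_rows_slow(rows, all_keys):
    """Recomputes the column order on every iteration (wasteful)."""
    out = []
    for row in rows:
        ordered = sorted(all_keys)  # invariant: same result for every row
        out.append([row.get(k, "") for k in ordered])
    return out


def align_rows_fast(rows, all_keys):
    """Computes the column order once, before the loop."""
    ordered = sorted(all_keys)  # hoisted out of the hot loop
    return [[row.get(k, "") for k in ordered] for row in rows]
```

Both functions return the same result. The second simply avoids repeating work whose outcome cannot change between rows, which matters when the loop runs once per CSV record.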
Full Changelog: 0.120.0...0.121.0
0.120.0
Happy New Year!
Here's the first release of 2024, the biggest ever with 280+ commits! qsv 0.120.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.
Apart from wrapping qsv with a User Interface, qsv pro also comes with a retinue of related cloud-based data cleaning, enrichment and enhancement services along with expanded metadata inferencing to make your Data Useful, Usable and Used!
qsv pro draws inspiration from OpenRefine, reimagined without its file size and speed limitations: qsv pro can process multi-gigabyte files in seconds.
It incorporates hard lessons we learned in the past 12 years deploying Data Portals and Data Pipelines to create a new Data/Metadata Wrangling and AI-assisted Data Publishing service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.
But it's not quite ready for release yet, so stay tuned!
However, we're now taking signups for a preview release, so if you're interested, please sign up here!
Excitingly, qsv was also mentioned on Hacker News in this thread on Dec 23, 2023! As a result, we're now at almost 2,000 stars on GitHub, up from 900 stars on Dec 22!
Stay tuned for more advancements in 2024 - it's set to be a landmark year for qsv!
Added
- cat: add rowskey --group options; increased perf of rowskey #1508
- validate: add --trim and --quiet options #1452
- apply & applydp: operations regex_replace now supports empty --replacement with the "<NULL>" special value #1470 and #1471
- exclude: also consider rows with empty fields #1498
- extsort: add --tmp-dir option ca1f461
Changed
- validate: Faster RFC4180 validation with byterecords and SIMD-accelerated utf8 validation #1440
- excel: minor performance tweaks #1446
- apply, applydp, explode, geocode, pseudo: consolidate redundant code and use one replace_column_value helper fn in util.rs #1456
- excel: bump calamine from 0.22 to 0.23 #1473
- excel & joinp: use atoi_simd for faster &[u8] to int conversion 9521f3e
- cat, describegpt, headers, sqlp, to, tojsonl: refactor commands that accept multiple input files to use improved process_input helper #1496
- fetch & fetchpost: get_response refactor for maintainability and performance #1507
- luau: replaced --no-colindex option with --colindex option. --colindex slows down processing and is not often used, so make it an option, not the default. a0c8568
- make thousands crate optional with apply feature in #1453
- build(deps): bump uuid from 1.6.0 to 1.6.1 by @dependabot in #1430
- build(deps): bump serde from 1.0.192 to 1.0.193 by @dependabot in #1432
- build(deps): bump data-encoding from 2.4.0 to 2.5.0 by @dependabot in #1435
- build(deps): bump mlua from 0.9.1 to 0.9.2 by @dependabot in #1436
- build(deps): bump url from 2.4.1 to 2.5.0 by @dependabot in #1437
- build(deps): bump jql-runner from 7.0.6 to 7.0.7 by @dependabot in #1439
- build(deps): bump jql-runner from 7.0.7 to 7.1.0 by @dependabot in #1447
- build(deps): bump jql-runner from 7.1.0 to 7.1.1 by @dependabot in #1457
- build(deps): bump jql-runner from 7.1.1 to 7.1.2 by @dependabot in #1486
- build(deps): bump hashbrown from 0.14.2 to 0.14.3 by @dependabot in #1441
- build(deps): bump redis from 0.23.3 to 0.23.4 by @dependabot in #1442
- build(deps): bump redis from 0.23.3 to 0.24.0 by @dependabot in #1455
- build(deps): bump atoi_simd from 0.15.3 to 0.15.4 by @dependabot in #1444
- build(deps): bump atoi_simd from 0.15.4 to 0.15.5 by @dependabot in #1445
- build(deps): bump atoi_simd from 0.15.5 to 0.15.6 by @dependabot in #1512
- build(deps): bump actions/setup-python from 4.7.1 to 4.8.0 by @dependabot in #1454
- build(deps): bump actions/setup-python from 4.8.0 to 5.0.0 by @dependabot in #1459
- build(deps): bump actions/stale from 8 to 9 by @dependabot in #1463
- build(deps): bump itoa from 1.0.9 to 1.0.10 by @dependabot in #1464
- build(deps): bump tokio from 1.34.0 to 1.35.0 by @dependabot in #1465
- build(deps): bump tokio from 1.35.0 to 1.35.1 by @dependabot in #1483
- build(deps): bump ryu from 1.0.15 to 1.0.16 by @dependabot in #1466
- build(deps): bump file-format from 0.22.0 to 0.23.0 by @dependabot in #1468
- build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1476
- build(deps): bump geosuggest-utils from 0.5.1 to 0.5.2 by @dependabot in #1479
- build(deps): bump geosuggest-core from 0.5.1 to 0.5.2 by @dependabot in #1478
- build(deps): bump reqwest from 0.11.22 to 0.11.23 by @dependabot in #1480
- build(deps): bump calamine from 0.23.0 to 0.23.1 by @dependabot in #1481
- build(deps): bump qsv-sniffer from 0.10.0 to 0.10.1 by @dependabot in #1484
- build(deps): bump anyhow from 1.0.75 to 1.0.76 by @dependabot in #1485
- build(deps): bump futures from 0.3.29 to 0.3.30 by @dependabot in #1492
- build(deps): bump futures-util from 0.3.29 to 0.3.30 by @dependabot in #1491
- build(deps): bump crossbeam-channel from 0.5.9 to 0.5.10 by @dependabot in #1490
- build(deps): bump sysinfo from 0.29.10 to 0.29.11 by @dependabot in #1443
- Bump sysinfo from 0.29.11 to 0.30 #1489
- build(deps): bump sysinfo from 0.30.0 to 0.30.1 by @dependabot in #1495
- build(deps): bump sysinfo from 0.30.1 to 0.30.2 by @dependabot in #1504
- build(deps): bump sysinfo from 0.30.2 to 0.30.3 by @dependabot in #1509
- build(deps): bump tabwriter from 1.3.0 to 1.4.0 by @dependabot in #1500
- build(deps): bump tempfile from 3.8.1 to 3.9.0 by @dependabot in #1502
- build(deps): bump qsv_docopt from 1.4.0 to 1.5.0 by @dependabot in #1503
- build(deps): bump ahash from 0.8.6 to 0.8.7 by @dependabot in #1510
- build(deps): bump serde_json from 1.0.108 to 1.0.109 by @dependabot in #1511
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-12-23
Fixed
- apply: Fix for dynfmt and calcconv subcommands not working in release mode #1467
- luau: fix check for excess mapped columns earlier. Otherwise, we'll get a CSV different field count error db15811
Removed
- luau: remove unneeded --jit option as we precompile luau scripts to bytecode #1438
Full Changelog: 0.119.0...0.120.0
0.119.0
Highlights:
As we prepare for version 1.0, we're focusing on performance, stability and reliability as we set the stage for qsv pro - a cloud-backed UI version of qsv powered by Tauri, set to be released in 2024. Stay tuned!
- diff is now out of beta and blazingly fast! Give "the fastest CSV-diff in the world" a try!
- joinp now supports snappy automatic compression/decompression!
- sqlp & joinp now recognize the QSV_COMMENT_CHAR environment variable, allowing you to skip comment lines in your input CSV files. They're also faster with the upgrade to Polars 0.35.4.
- sqlp now supports subqueries, table aliases, and more!
- luau: upgraded embedded Luau from 0.599 to 0.604; refactored code to reduce unneeded allocations and increase performance (more than doubling it!) as we prepare for extended recipe support.
- cat is now even faster with the --flexible option. If you know your CSV files are valid, you can use this option to skip CSV validation and make cat run twice as fast!
- qsv can now add a Byte Order Mark (BOM) header sequence to produce Excel-friendly CSVs with the QSV_OUTPUT_BOM environment variable.
- stats, sort, schema & validate are now faster with the use of atoi_simd to directly convert &[u8] to integer, skipping unnecessary utf8 validation, while also using SIMD CPU instructions for noticeably faster performance.
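The idea behind converting bytes directly to an integer, as described for stats, sort, schema & validate above, is to skip the intermediate step of validating and building a string. The Python sketch below only illustrates that idea. It is a scalar loop with no SIMD, it is not the atoi_simd crate's algorithm, and the function name parse_int_from_bytes is hypothetical.

```python
def parse_int_from_bytes(buf):
    """Parse an optionally signed decimal integer straight from bytes.

    Returns None if the buffer is not a valid integer. No decoding to
    str takes place: each byte is checked against the ASCII digit range.
    """
    if not buf:
        return None
    negative = buf[0] == 0x2D  # ASCII "-"
    digits = buf[1:] if negative or buf[0] == 0x2B else buf  # 0x2B is "+"
    if not digits:
        return None
    value = 0
    for b in digits:
        if b < 0x30 or b > 0x39:  # outside ASCII "0".."9"
            return None
        value = value * 10 + (b - 0x30)
    return -value if negative else value
```

Because any byte outside the digit range causes an early rejection, invalid UTF-8 input is handled without a separate validation pass.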
Added
- diff: added option/flag for headers in output by @janriemer in #1395
- diff: added option/flag --delimiter-output by @janriemer in #1402
- cat: added --flexible option to make cat rows faster still #1408
- sqlp & joinp: both commands now recognize QSV_COMMENT_CHAR env var #1412
- joinp: added snappy compression/decompression support #1413
- geocode: now automatically decompresses snappy-compressed index files #1429
- Add Byte Order Mark (BOM) output support #1424
- Added Codacy code quality badge 9959129
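As a rough illustration of the BOM output support listed above, the sketch below prefixes CSV bytes with a Byte Order Mark. It assumes the UTF-8 BOM (the bytes EF BB BF) is the one meant, which the notes do not state explicitly. The helper name with_bom and its output_bom parameter are hypothetical and do not model how qsv reads QSV_OUTPUT_BOM.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # assumption: UTF-8 BOM


def with_bom(csv_bytes, output_bom=False):
    """Prefix CSV bytes with a UTF-8 BOM when requested, avoiding duplicates."""
    if output_bom and not csv_bytes.startswith(UTF8_BOM):
        return UTF8_BOM + csv_bytes
    return csv_bytes
```

Spreadsheet tools such as Excel commonly use a leading BOM as a hint that a CSV file is UTF-8 encoded, which is why the marker makes the output more "Excel-friendly".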
Changed
- stats, sort, schema & validate: use atoi_simd to directly convert &[u8] to integer skipping unnecessary utf8 validation, while also using SIMD instructions for noticeably faster performance
- cat: faster cat rows #1407
- count: optimize --width option #1411
- luau: upgrade embedded Luau from 0.603 to 0.604 #1426
- use atoi_simd for fast &[u8] to int conversion #1423
- luau: performance refactor 4cebd7c
- build(deps): bump csv-diff from 0.1.0-beta.4 to 0.1.0 by @dependabot in #1394
- build(deps): bump serde_json from 1.0.107 to 1.0.108 by @dependabot in #1393
- build(deps): bump indexmap from 2.0.2 to 2.1.0 by @dependabot in #1397
- build(deps): bump jql-runner from 7.0.4 to 7.0.5 by @dependabot in #1399
- build(deps): bump jql-runner from 7.0.5 to 7.0.6 by @dependabot in #1400
- build(deps): bump file-format from 0.21.0 to 0.22.0 by @dependabot in #1401
- build(deps): bump cached from 0.46.0 to 0.46.1 by @dependabot in #1403
- build(deps): bump serde from 1.0.190 to 1.0.192 by @dependabot in #1404
- build(deps): bump tokio from 1.33.0 to 1.34.0 by @dependabot in #1409
- build(deps): bump flexi_logger from 0.27.2 to 0.27.3 by @dependabot in #1410
- build(deps): bump qsv-stats from 0.11.0 to 0.12.0 by @dependabot in #1415
- build(deps): bump itertools from 0.11.0 to 0.12.0 by @dependabot in #1418
- build(deps): bump rust_decimal from 1.33.0 to 1.33.1 by @dependabot in #1420
- build(deps): bump polars from 0.35.2 to 0.35.4 by @dependabot in #1425
- build(deps): bump uuid from 1.5.0 to 1.6.0 by @dependabot in #1428
- bump MSRV to 1.74.0
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-11-18
Fixed
- pseudo: detect when more than one column is selected for pseudonymization 0b09372
- dotenv (.env) tweaks/fixes #1427
- fix several typos 723443e
- fix several markdown lints
Removed
- remove fast-float as std float parse is now also using Eisel-Lemire algorithm #1414
Full Changelog: 0.118.0...0.119.0
NOTE:
To verify prebuilt binary zip archives - click here.

