ℹ️ NOTE: This tour is primarily targeted to Linux and macOS users. Though qsv works on Windows, the tour assumes basic knowledge of command-line piping and redirection, and uses other command-line tools (curl, tee, head, etc.) that are not installed by default on Windows.
For a more detailed, interactive tour (which also happens to be Windows-friendly) see 100.dathere.com.
Let's say you're playing with some data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the 124MB, 2.7M row CSV file and start examining it:
# there are no headers in the original repo, so let's download a prepared CSV with headers
$ curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/wcp.zip
$ unzip wcp.zip
$ qsv headers wcp.csv
1 Country
2 City
3 AccentCity
4 Region
5 Population
6 Latitude
7 Longitude
The next thing you might want to do is get an overview of the kind of data that
appears in each column. The stats command will do this for you:
$ qsv stats wcp.csv | qsv table
field type is_ascii sum min max range sort_order sortiness min_length max_length sum_length avg_length stddev_length variance_length cv_length mean sem geometric_mean harmonic_mean stddev variance cv nullcount n_negative n_zero n_positive max_precision sparsity
Country String true ad zw Unsorted 1 2 2 5398708 2 0 0 0 0 0
City String false al lusayli ??ykkvibaer Unsorted 0.8056 1 87 26071762 9.6585 4.0135 16.1081 0.4155 0 0
AccentCity String false Al Lusayli ???zler Unsorted 0.7503 1 87 26719888 9.8986 4.1308 17.0633 0.4173 0 0
Region String true 00 Z4 Unsorted 0.3513 0 2 5303100 1.9646 0.0025 0 0.0013 4 0
Population Integer 2290536125 7 31480498 31480491 Unsorted 0.0036 48730.6639 1422.5474 10888.1976 2663.625 308414.0419 95119221210.8852 632.8952 2652350 0 0 47004 0.9826
Latitude Float 76585211.1978 -54.9333333 82.483333 137.4167 Unsorted 0.0692 28.3717 0.0134 21.9384 481.2922 77.3249 0 330964 81 2368309 9 0
Longitude Float 75976506.6643 -179.9833333 180.0 359.9833 Unsorted 0.0688 28.1462 0.038 62.4729 3902.8581 221.9586 0 633589 159 2065606 9 0
Wow! That was fast! It took just 0.72 seconds to compile all that.[1] One reason for qsv's speed is that it mainly works in "streaming" mode - computing statistics as it "streams" the CSV file line by line. This also means it can gather statistics on arbitrarily large files, as it does not have to load the entire file into memory.[2]
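To make "streaming" a bit more concrete, here is a minimal Python sketch that keeps only a handful of running values while reading rows one at a time, so memory use stays flat no matter how many rows go by. It uses Welford's online algorithm for the variance. This is an illustration only, not qsv's actual Rust implementation; the `streaming_stats` helper, the inline sample data and the choice of population variance are all assumptions made for the example.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """Single pass over rows; keeps only a few running values in memory."""
    n = 0
    nulls = 0
    total = 0.0
    lo = math.inf
    hi = -math.inf
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    for row in rows:
        raw = row[column]
        if raw == "":
            nulls += 1  # empty cells are counted, not parsed
            continue
        x = float(raw)
        n += 1
        total += x
        lo = min(lo, x)
        hi = max(hi, x)
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0  # population variance (an assumption)
    return {
        "n": n,
        "nullcount": nulls,
        "sum": total,
        "min": lo,
        "max": hi,
        "mean": mean,
        "stddev": math.sqrt(variance),
    }


# Hypothetical four-row sample standing in for a large CSV file.
sample = io.StringIO("City,Population\na,10\nb,\nc,20\nd,30\n")
stats = streaming_stats(csv.DictReader(sample), "Population")
print(stats)
```

Because nothing but the running values is retained, the same loop would work on a file far larger than available memory.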
But can we get more summary statistics? What's the median, the antimodes, the percentiles, the
Median Absolute Deviation (MAD), the cardinality of the data? No problem. That's why qsv stats
has an --everything option to compute these more "expensive" stats. Expensive - as these
extended statistics can only be computed at the cost of loading the entire file into memory.
$ qsv stats wcp.csv --everything | qsv table
field type is_ascii sum min max range sort_order sortiness min_length max_length sum_length avg_length stddev_length variance_length cv_length mean sem geometric_mean harmonic_mean stddev variance cv nullcount n_negative n_zero n_positive max_precision sparsity mad lower_outer_fence lower_inner_fence q1 q2_median q3 iqr upper_inner_fence upper_outer_fence skewness cardinality uniqueness_ratio mode mode_count mode_occurrences antimode antimode_count antimode_occurrences percentiles
Country String true ad zw Unsorted 1 2 2 5398708 2 0 0 0 0 0 231 0.0001 ru 1 176934 cc|nf|pn|tf|tk 5 1
City String false al lusayli ??ykkvibaer Unsorted 0.8056 1 87 26071762 9.6585 4.0135 16.1081 0.4155 0 0 2008182 0.7439 san jose 1 313 *PREVIEW: al lusayli|amiri-ye `olya|bab el ahmar|baqeri `olya|chahar tang... 1741957 1
AccentCity String false Al Lusayli ???zler Unsorted 0.7503 1 87 26719888 9.8986 4.1308 17.0633 0.4173 0 0 2025934 0.7505 San Antonio 1 307 *PREVIEW: Al Lusayli|Amiri-ye `Olya|Baqeri `Olya|B?b el Ahmar|Chahar Tang... 1762562 1
Region String true 00 Z4 Unsorted 0.3513 0 2 5303100 1.9646 0.0025 0 0.0013 4 0 392 0.0001 04 1 143900 *PREVIEW: H1|H6|H9|I1|I4|K5|K6|L1|L7|M8 17 1
Population Integer 2290536125 7 31480498 31480491 Unsorted 0.0036 48730.6639 1422.5474 10888.1976 2663.625 308414.0419 95119221210.8852 632.8952 2652350 0 0 47004 0.9826 8327 -69766.5 -33018 3730.5 10879 28229.5 24499 64978 101726.5 0.4164 28460 0.0105 1 2652350 *PREVIEW: 100|10001|10002|100023|10003|10005|10007|10008|10009|100105 18970 1 5: 1002|10: 1869|40: 7145|60: 15286|90: 76682|95: 145843
Latitude Float 76585211.1978 -54.9333333 82.483333 137.4167 Unsorted 0.0692 28.3717 0.0134 21.9384 481.2922 77.3249 0 330964 81 2368309 9 0 15.2667 -84.7706 -35.9076 12.9553 33.8667 45.5306 32.5753 94.3935 143.2564 -0.2839 255133 0.0945 50.8 1 1128 *PREVIEW: -.020556|-.025833|-.035|-.035833|-.047222|-.051944|-.054722|-.06... 79002 1 5: -14.1667|10: -4.5167|40: 27|60: 38.0667|90: 53.1814|95: 56.4
Longitude Float 75976506.6643 -179.9833333 180.0 359.9833 Unsorted 0.0688 28.1462 0.038 62.4729 3902.8581 221.9586 0 633589 159 2065606 9 0 30.9583 -199.3667 -98.4917 2.3833 26.8803 69.6333 67.25 170.5083 271.3833 0.2715 407568 0.151 23.1 1 590 *PREVIEW: -.185556|-.208056|-.246944|-.248333|-.263889|-.271667|-.276389|... 162640 1 5: -88.3|10: -75.5|40: 18.7333|60: 38.2167|90: 113.1|95: 124.6381
ℹ️ NOTE: The qsv table command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.
So, this command took 1.11 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:
qsv index wcp.csv
qsv stats wcp.csv --everything | qsv table
Which cuts it down to 0.41 seconds - 2.7x faster! (And creating the 21MB index took 0.22 seconds.
What about the first stats without --everything? From 0.72 seconds to 0.08 seconds with an index - 9x faster!)
Notably, the same type of "statistics" command in another
CSV command line toolkit
takes about 10 seconds to produce a subset of statistics on the same data set. VisiData
takes much longer - ~1.5 minutes to calculate a subset of these statistics with its Describe sheet.
Even Python pandas'
describe(include="all") took 12 seconds to calculate a subset of qsv's "streaming" statistics.[3]
This is another reason for qsv's speed. Creating an index accelerated statistics gathering as it enables multithreading & fast I/O.
For multithreading - running stats with an index was 9x faster because qsv divided the file into
16 equal chunks[1] with ~170k records each, ran stats on each chunk in parallel across 16
cores, and merged the results at the end. It was "only" 9x, and not 16x faster, as there is
some overhead involved in multithreading.
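The chunk-and-merge idea can be sketched in a few lines of Python. This is a simplified illustration, not qsv's code: it splits a list of values into roughly equal chunks, computes a partial result per chunk in a thread pool, and merges the partials. The helper names `partial_stats` and `merge` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Stats for one chunk; every chunk can be processed independently."""
    return {"n": len(chunk), "sum": sum(chunk), "min": min(chunk), "max": max(chunk)}


def merge(a, b):
    """Combine two partial results into one."""
    return {
        "n": a["n"] + b["n"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


values = list(range(1, 101))  # stand-in for a numeric column
n_chunks = 4
size = -(-len(values) // n_chunks)  # ceiling division
chunks = [values[i:i + size] for i in range(0, len(values), size)]

with ThreadPoolExecutor(max_workers=n_chunks) as pool:
    partials = list(pool.map(partial_stats, chunks))

merged = partials[0]
for p in partials[1:]:
    merged = merge(merged, p)
print(merged)
```

In CPython, threads will not actually speed up CPU-bound work like this; the pool is only there to show the shape of the approach. The merge step is also a small part of why the real speedup is less than the number of cores.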
For fast I/O - let's say you wanted to grab the last 10 records:
$ qsv count --human-readable wcp.csv
2,699,354
$ qsv slice wcp.csv --start -10 | qsv table
Country City AccentCity Region Population Latitude Longitude
zw zibalonkwe Zibalonkwe 06 -19.8333333 27.4666667
zw zibunkululu Zibunkululu 06 -19.6666667 27.6166667
zw ziga Ziga 06 -19.2166667 27.4833333
zw zikamanas village Zikamanas Village 00 -18.2166667 27.95
zw zimbabwe Zimbabwe 07 -20.2666667 30.9166667
zw zimre park Zimre Park 04 -17.8661111 31.2136111
zw ziyakamanas Ziyakamanas 00 -18.2166667 27.95
zw zizalisari Zizalisari 04 -17.7588889 31.0105556
zw zuzumba Zuzumba 06 -20.0333333 27.9333333
zw zvishavane Zvishavane 07 79876 -20.3333333 30.0333333
qsv count took 0.01 seconds and qsv slice, 0.01 seconds too! These commands are instantaneous
with an index: for count, the index already holds the precomputed record count, and for slice,
only the sliced portion has to be parsed, because the index lets qsv jump directly to that
part of the file. It didn't have to scan the entire file to get the last 10 records. For comparison,
without an index, slice took 0.30 seconds (30x slower).
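Here is a toy Python sketch of why an index makes count and slice so cheap. It records the byte offset of each record once, then answers "how many records?" from the index alone and seeks straight to the last records. It is not qsv's index format, and it assumes no quoted fields contain embedded newlines; the in-memory `BytesIO` object stands in for a file on disk.

```python
import io

data = b"Country,City\nzw,ziga\nzw,zimbabwe\nzw,zuzumba\nzw,zvishavane\n"
f = io.BytesIO(data)  # stand-in for open("some.csv", "rb")

# Build the index once: the byte offset where each record starts.
f.readline()  # skip the header row
offsets = []
while True:
    pos = f.tell()
    if not f.readline():
        break
    offsets.append(pos)

# "count" is answered from the index, without touching the data again.
record_count = len(offsets)

# "slice" the last 2 records: seek to the offset and parse only that part.
f.seek(offsets[-2])
last_two = f.read().decode().splitlines()
print(record_count, last_two)
```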
ℹ️ NOTE: Creating/updating an index itself is extremely fast as well. If you want qsv to automatically create and update indices, set the environment variable QSV_AUTOINDEX_SIZE.
Okay, okay! Let's switch gears and stop obsessing over how fast 🚀 qsv is... let's go back to exploring 🔎 the data set.
Hmmmm... the Population column has a lot of null values. How pervasive is that?
First, let's take a look at 10 "random" rows with sample. We use the --seed parameter
so we get a reproducible random sample. And then, let's display only the Country,
AccentCity and Population columns with the select command.
$ qsv sample --seed 42 10 wcp.csv |
qsv select Country,AccentCity,Population |
qsv table
Country AccentCity Population
cn Shijidai
cu Santa Catalina
de Werscheresch
es La Cantera
gw Reino de Antula
kz Karakamys
md Peleriya
se Norra Lagnö
sv Santa Inés
us Tellico
Whoops! The sample we got doesn't have population counts. It's quite pervasive. Exactly how many cities have empty (NULL) population counts?
$ qsv frequency wcp.csv --limit 3 | qsv table
field value count percentage rank
Country ru 176934 6.55468 1
Country us 141989 5.26011 2
Country cn 117508 4.35319 3
Country Other (228) 2262923 83.83202 0
City san jose 313 0.0116 1
City san antonio 310 0.01148 2
City santa rosa 288 0.01067 3
City Other (2,008,161) 2698443 99.96625 0
AccentCity San Antonio 307 0.01137 1
AccentCity Santa Rosa 288 0.01067 2
AccentCity Santa Cruz 268 0.00993 3
AccentCity Other (2,025,913) 2698491 99.96803 0
Region 04 143900 5.33091 1
Region 02 127736 4.7321 2
Region 03 105455 3.90668 3
Region Other (388) 2322259 86.0303 0
Region (NULL) 4
Population 2310 12 0.02553 1
Population 2137 11 0.0234 2
Population 2230 11 0.0234 2
Population Other (28,456) 46970 99.92767 0
Population (NULL) 2652350
Latitude 50.8 1128 0.04179 1
Latitude 50.95 1076 0.03986 2
Latitude 50.6 1043 0.03864 3
Latitude Other (255,130) 2696107 99.87971 0
Longitude 23.1 590 0.02186 1
Longitude 23.2 586 0.02171 2
Longitude 23.05 575 0.0213 3
Longitude Other (407,565) 2697603 99.93513 0
(The qsv frequency command builds a frequency table for each column in the
CSV data, with an "Other" rollup row summarizing values that fall outside the
top N. This one only took 1.16 seconds.)
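For a feel of what a frequency table with an "Other" rollup involves, here is a small Python sketch built on `collections.Counter`. It is an approximation for illustration: the `frequency` helper is made up, and it ignores the rank column and the separate (NULL) handling that qsv shows above.

```python
from collections import Counter


def frequency(values, limit):
    """Top `limit` values plus one rollup row for everything else."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(limit)
    rows = [(value, count, round(100 * count / total, 5)) for value, count in top]
    others = len(counts) - len(top)  # number of distinct values rolled up
    if others > 0:
        other_count = total - sum(count for _, count in top)
        rows.append((f"Other ({others})", other_count, round(100 * other_count / total, 5)))
    return rows


# Hypothetical Country column with 5 distinct values.
countries = ["ru"] * 5 + ["us"] * 3 + ["cn"] * 2 + ["ee", "nl"]
table = frequency(countries, limit=2)
for row in table:
    print(row)
```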
So it seems that most cities do not have a population count associated with them at all (2,652,350 to be exact). No matter — we can adjust our previous command so that it only shows rows with a population count:
$ qsv search --select Population '[0-9]' wcp.csv |
qsv sample --seed 1 10 |
qsv select Country,AccentCity,Population |
tee sample.csv |
qsv table
Country AccentCity Population
it Lipari 10649
cz Benesov nad Ploucnici 4042
il Kabul 9301
us Wahpeton 8395
in Pachperwa 15057
ph Pawing 3316
us Fort Dix 6868
ru Skorodnoye 3677
ee Haiba 395
nl Schoonhoven 12471
ℹ️ NOTE: The tee command reads from standard input and writes to both standard output and one or more files at the same time. We do this so we can create the sample.csv file we need for the next step, and pipe the same data to the qsv table command.
Why create sample.csv? Even though qsv is blazing-fast, we're just doing an initial investigation and a small 10-row sample is all we need to try out and compose the different CLI commands needed to wrangle the data.
Erk. Which country is ee? What continent? No clue, but datawookie
has a CSV file called country-continent-codes.csv.
$ curl -L https://raw.githubusercontent.com/datawookie/data-diaspora/master/spatial/country-continent-codes.csv > country_continent.csv
$ qsv headers country_continent.csv
1 # https://datahub.io/JohnSnowLabs/country-and-continent-codes-list
Huh!?! That's not what we were expecting. But if you look at the country_continent.csv
file we just downloaded, its first line is a comment starting with the # character.
$ head -5 country_continent.csv
# https://datahub.io/JohnSnowLabs/country-and-continent-codes-list
continent,code,country,iso2,iso3,number
Asia,AS,"Afghanistan, Islamic Republic of",AF,AFG,4
Europe,EU,"Albania, Republic of",AL,ALB,8
Antarctica,AN,Antarctica (the territory South of 60 deg S),AQ,ATA,10
No worries, qsv has us covered with its QSV_COMMENT_CHAR environment variable. Setting it
to # tells qsv to ignore any line in the CSV that starts with that character, whether the line
appears before the header or in the data part of the CSV.
$ export QSV_COMMENT_CHAR='#'
$ qsv headers country_continent.csv
1 continent
2 code
3 country
4 iso2
5 iso3
6 number
That's more like it. We can now do a join to see which countries and continents these are:
$ qsv join --ignore-case Country sample.csv iso2 country_continent.csv | qsv table
Country AccentCity Population continent code country iso2 iso3 number
it Lipari 10649 Europe EU Italy, Italian Republic IT ITA 380
cz Benesov nad Ploucnici 4042 Europe EU Czech Republic CZ CZE 203
il Kabul 9301 Asia AS Israel, State of IL ISR 376
us Wahpeton 8395 North America NA United States of America US USA 840
in Pachperwa 15057 Asia AS India, Republic of IN IND 356
ph Pawing 3316 Asia AS Philippines, Republic of the PH PHL 608
us Fort Dix 6868 North America NA United States of America US USA 840
ru Skorodnoye 3677 Europe EU Russian Federation RU RUS 643
ru Skorodnoye 3677 Asia AS Russian Federation RU RUS 643
ee Haiba 395 Europe EU Estonia, Republic of EE EST 233
nl Schoonhoven 12471 Europe EU Netherlands, Kingdom of the NL NLD 528
ee is Estonia - never would have guessed that. Thing is, now we have several unneeded
columns, and the casing of the column names is inconsistent. Also, there are two records
for Skorodnoye - one for Europe and one for Asia. This is because the Russian Federation spans
both continents.
We're primarily interested in unique cities per country for the purposes of this tour,
so we need to filter these duplicates out.
Also, apart from renaming the columns, I want to reorder them to "City, Population, Country, Continent".
No worries. Let's use the select (so we only get the columns we need, in the order we want),
dedup (so we only get unique Country/City combinations) and rename (columns in titlecase) commands:
$ qsv join --ignore-case Country sample.csv iso2 country_continent.csv |
qsv select 'AccentCity,Population,country,continent' |
qsv dedup --select 'country,AccentCity' |
qsv rename City,Population,Country,Continent |
qsv table
City Population Country Continent
Benesov nad Ploucnici 4042 Czech Republic Europe
Haiba 395 Estonia, Republic of Europe
Pachperwa 15057 India, Republic of Asia
Kabul 9301 Israel, State of Asia
Lipari 10649 Italy, Italian Republic Europe
Schoonhoven 12471 Netherlands, Kingdom of the Europe
Pawing 3316 Philippines, Republic of the Asia
Skorodnoye 3677 Russian Federation Asia
Fort Dix 6868 United States of America North America
Wahpeton 8395 United States of America North America
Nice! Notice the data is now sorted by Country,City too! That's because dedup sorts the
CSV records (loading the entire file into memory) to find duplicates - unless you tell it
the input is already sorted with --sorted for streaming-mode dedup.
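The sort-then-dedup behavior is easy to picture with a small Python sketch: sorting puts rows with equal keys next to each other, so a single pass can keep one row per key. This is a hypothetical illustration with made-up rows, not qsv's implementation. Which duplicate survives is an implementation detail; this sketch keeps the first one in sorted order, which need not match what qsv keeps.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"City": "Skorodnoye", "Country": "Russian Federation", "Continent": "Europe"},
    {"City": "Haiba", "Country": "Estonia, Republic of", "Continent": "Europe"},
    {"City": "Skorodnoye", "Country": "Russian Federation", "Continent": "Asia"},
]

key = itemgetter("Country", "City")

# Sorting needs every row in memory, but brings equal keys together ...
ordered = sorted(rows, key=key)

# ... so one pass over the sorted rows can keep a single row per key.
deduped = [next(group) for _, group in groupby(ordered, key=key)]

for row in deduped:
    print(row["City"], "|", row["Country"])
```

If the input were already sorted by the key, the `sorted` call could be dropped and the `groupby` pass alone would work on a stream, which is the idea behind a sorted-input mode.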
Now that we've composed all the commands we need, perhaps we can do this with the original CSV data?
Not the tiny 10-row sample.csv file, but all 2.7 million rows in the 124MB wcp.csv file?!
Indeed we can — because qsv is designed for speed - written in Rust with
amortized memory allocations, using the
performance-focused jemalloc allocator.
$ qsv join --ignore-case Country wcp.csv iso2 country_continent.csv |
qsv search --select Population '[0-9]' |
qsv select 'AccentCity,Population,country,continent,Latitude,Longitude' |
qsv dedup --select 'country,AccentCity,Latitude,Longitude' --dupes-output wcp_dupes.csv |
qsv rename City,Population,Country,Continent,Latitude,Longitude --output wcp_countrycontinent.csv
$ qsv sample 10 --seed 33 wcp_countrycontinent.csv | qsv table
City Population Country Continent Latitude Longitude
Cantilan 9408 Philippines, Republic of the Asia 9.3336111 125.9775
Björnli 288 Norway, Kingdom of Europe 63.133333 9.683333
Bocsa 3394 Romania Europe 47.3 22.9166667
La Garriga 13006 Spain, Kingdom of Europe 41.6833333 2.2833333
Veriora 549 Estonia, Republic of Europe 58.0041667 27.3519444
Vrani 1215 Romania Europe 45.0383333 21.4925
Vitim 3843 Russian Federation Asia 59.4511111 112.5577778
Panatau 2835 Romania Europe 45.3166667 26.3833333
Libenge 27051 Congo, Democratic Republic of the Africa 3.65 18.633333
Volovo 4089 Russian Federation Asia 53.55 37.95
$ qsv count -H wcp_countrycontinent.csv
47,004
$ qsv count -H wcp_dupes.csv
5,155
We fine-tuned dedup by adding Latitude and Longitude as there may be
multiple cities with the same name in a country. We also specified the
--dupes-output option so we can have a separate CSV of the duplicate records
it removed.
We're also just interested in cities with population counts. So we used search
with the regular expression [0-9]. This cuts down the file to 47,004 rows.
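In Python terms, that filter amounts to keeping rows whose Population field contains at least one digit. A tiny sketch with made-up rows:

```python
import re

rows = [
    {"AccentCity": "Zvishavane", "Population": "79876"},
    {"AccentCity": "Zuzumba", "Population": ""},
    {"AccentCity": "Haiba", "Population": "395"},
]

has_digit = re.compile(r"[0-9]")

# Empty Population cells contain no digit, so those rows are dropped.
with_population = [r for r in rows if has_digit.search(r["Population"])]
print([r["AccentCity"] for r in with_population])
```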
The whole thing took ~2.5 seconds on my machine. The performance of join,
in particular, comes from constructing a SIMD-accelerated hash index of one of the CSV
files. The join command does an inner join by default, but it also supports left,
right and full outer, cross, anti and semi joins. All from the command line,
without having to load the files into a database and index them just to do a SQL join.
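The general shape of a hash join can be sketched in Python as follows. This is only a conceptual illustration with made-up rows: a plain dict stands in for the hash index, there is nothing SIMD-accelerated about it, and it is not how qsv is implemented internally.

```python
from collections import defaultdict

cities = [
    {"Country": "ee", "AccentCity": "Haiba"},
    {"Country": "ru", "AccentCity": "Skorodnoye"},
    {"Country": "xx", "AccentCity": "Nowhere"},  # no matching country code
]
countries = [
    {"iso2": "EE", "country": "Estonia, Republic of", "continent": "Europe"},
    {"iso2": "RU", "country": "Russian Federation", "continent": "Europe"},
    {"iso2": "RU", "country": "Russian Federation", "continent": "Asia"},
]

# Build a hash index on one input, lowercasing keys for a case-insensitive match.
index = defaultdict(list)
for row in countries:
    index[row["iso2"].lower()].append(row)

# Stream the other input and probe the index (inner join: unmatched rows drop out).
joined = [
    {**left, **right}
    for left in cities
    for right in index.get(left["Country"].lower(), [])
]

for row in joined:
    print(row["AccentCity"], "|", row["country"], "|", row["continent"])
```

Note how the two RU rows on the right side produce two output rows for Skorodnoye, which is the same duplication seen in the join output earlier in the tour.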
Finally, can we create a CSV file for each country of all its cities? Yes we can, with
the partition command (and it took just 0.04 seconds to create all 209 country-city files!):
$ qsv partition Country bycountry wcp_countrycontinent.csv
$ cd bycountry
$ ls -1lhS | head -6 ; echo ... ; ls -1lhS | tail -5
total 7256
-rw-r--r-- 319K UnitedStatesofAmerica.csv
-rw-r--r-- 260K PhilippinesRepublicofthe.csv
-rw-r--r-- 255K RussianFederation.csv
-rw-r--r-- 169K IndiaRepublicof.csv
...
-rw-r--r-- 1K Aruba.csv
-rw-r--r-- 1K Anguilla.csv
-rw-r--r-- 1K Gibraltar.csv
-rw-r--r-- 1K Ukraine.csv
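Conceptually, partitioning is one pass over the data with one output file per distinct key. The Python sketch below shows that idea with made-up rows and a temporary directory. The file naming rule (drop everything except letters and digits) is an assumption inferred from the listing above, not taken from qsv's documentation.

```python
import csv
import re
import tempfile
from pathlib import Path

rows = [
    {"City": "Wahpeton", "Country": "United States of America"},
    {"City": "Haiba", "Country": "Estonia, Republic of"},
    {"City": "Fort Dix", "Country": "United States of America"},
]

outdir = Path(tempfile.mkdtemp())  # throwaway output directory
writers = {}
handles = []
for row in rows:
    # Assumed naming rule: keep only letters and digits from the key.
    name = re.sub(r"[^A-Za-z0-9]", "", row["Country"]) + ".csv"
    if name not in writers:
        fh = open(outdir / name, "w", newline="")
        handles.append(fh)
        writers[name] = csv.DictWriter(fh, fieldnames=list(row))
        writers[name].writeheader()  # each partition gets its own header
    writers[name].writerow(row)
for fh in handles:
    fh.close()

created = sorted(p.name for p in outdir.iterdir())
print(created)
```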
Examining the USA csv file:
$ qsv stats --everything UnitedStatesofAmerica.csv | qsv table --output usa-cities-stats.csv
$ less -S usa-cities-stats.csv
field type is_ascii sum min max range sort_order sortiness min_length max_length sum_length avg_length stddev_length variance_length cv_length mean sem geometric_mean harmonic_mean stddev variance cv nullcount n_negative n_zero n_positive max_precision sparsity mad lower_outer_fence lower_inner_fence q1 q2_median q3 iqr upper_inner_fence upper_outer_fence skewness cardinality uniqueness_ratio mode mode_count mode_occurrences antimode antimode_count antimode_occurrences percentiles
City String true Abbeville Zionsville Ascending 1 3 26 38199 9.1495 2.968 8.8093 0.3244 0 0 3439 0.8237 Springfield 1 11 *PREVIEW: Abbeville|Abilene|Abington|Acton|Acushnet|Acworth|Ada|Adelanto|... 3002 1
Population Integer 179123400 216 8107916 8107700 Unsorted 0.0072 42903.8084 2596.2217 22665.0898 15550.3713 167752.8889 28141031740.29 390.9977 0 0 0 4175 0 8846 -60516 -24217.5 12081 19235 36280 24199 72578.5 108877 0.4087 3981 0.9535 10576|10945|11971|12115|13219|13250|8771|9944 8 3 *PREVIEW: 10003|10009|10012|10015|100158|10019|10030|10032|10034|10058 3795 1 5: 7527|10: 9311|40: 15772|60: 24003|90: 73206|95: 117258
Country String true United States of America United States of America Ascending 1 24 24 100200 24 0 0 0 0 0 1 0.0002 United States of America 1 4175 0 0
Continent String true North America North America Ascending 1 13 13 54275 13 0 0 0 0 0 1 0.0002 North America 1 4175 0 0
Latitude Float 158455.7902 17.9677778 71.2905556 53.3228 Unsorted 0.0987 37.9535 0.0929 37.4081 36.7621 6.0032 36.0386 15.8173 0 0 0 4175 7 0 3.3047 10.4336 22.2444 34.0553 39.4694 41.9292 7.8739 53.74 65.5508 -0.3752 4010 0.9605 42.0333333 1 5 *PREVIEW: 17.9677778|17.9680556|17.9736111|17.9794444|17.9861111|18.00833... 3856 1 5: 26.6403|10: 29.875|40: 37.9064|60: 40.7281|90: 43.9014|95: 45.2461
Longitude Float -377616.7798 -165.4063889 -65.3013889 100.105 Unsorted 0.0086 -90.4471 0.2663 17.209 296.1482 -19.0265 0 4175 0 0 7 0 10.1369 -158.9414 -128.2139 -97.4864 -86.0342 -77.0014 20.485 -46.2739 -15.5464 -0.1181 4074 0.9758 -118.3516667|-71.0666667|-71.3972222|-71.4166667|-83.1500000 5 3 *PREVIEW: -100.0166667|-100.3505556|-100.4055556|-100.4366667|-100.4991667... 3978 1 5: -122.1286|10: -118.4381|40: -89.4556|60: -82.4894|90: -72.0339|95: -71.0106
Hhhmmm... clearly the worldcitiespop.csv file from the Data Science Toolkit does not have comprehensive coverage of city populations.
The US has far more than 179,123,400 people (Population sum) and 3,439 cities (City cardinality).
Perhaps we can get population info elsewhere with the fetch command...
But that's another tour by itself! 😄
Footnotes
1. Timings collected by setting QSV_LOG_LEVEL='debug' on an Apple M4 Max (16 cores) MacBook Pro running macOS 26.5 with 64GB of unified memory.
2. For example, running qsv stats on a CSV export of ALL of NYC's available 311 data from 2010 to Mar 2022 (27.8M rows, 16GB) took just 22.4 seconds with an index (which actually took longer to create - 39 seconds to create a 223MB index), and its memory footprint remained the same, pinning all 16 logical processors near 100% utilization on a Ryzen 7 4800H laptop with 32GB memory and 1 TB SSD running Windows 11. (qsv on current Apple Silicon hardware is even faster.)