Overview
Right now we have code all over the place for creating Vitessce data/configs:
https://github.com/hubmapconsortium/portal-containers
https://github.com/hubmapconsortium/vitessce-data
https://github.com/hubmapconsortium/portal-ui/blob/master/context/app/api/vitessce.py
This is problematic because it makes launching new Vitessce configs difficult, and the process is hard to communicate to people not familiar with our code. The problem is only going to grow, and as we gain users (probably other data portals), it would be good to have not only schemas for validating the data, but also a way of reliably generating it.
The overarching goal here is to take in a Pandas DataFrame and output compliant Arrow (in the future), Zarr, OME-TIFF, and JSON data for Vitessce. A secondary goal could be to also create Vitessce configurations based on what data has been generated - basically pre-defined view configurations based on certain standard inputs (e.g., genes/clusters + raster + cells/cell-sets without a scatterplot gives what we have for CODEX, and with a scatterplot gives Linnarsson minus one of the scatterplots).
I'll organize this issue by data type.
Genes/Clusters (Heatmap)
Our genes and clusters schemas convey very similar information, i.e., data per observation and a max for rendering. We should think about merging these, if possible, since if we can show one, we can show the other:
https://github.com/hubmapconsortium/portal-containers/blob/fb1910324fc796ff4b7d4e643de27ff2861e7d8c/containers/sprm-to-json/context/main.py#L125-L160
https://github.com/hubmapconsortium/vitessce-data/blob/master/python/cluster.py
https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py
This might require an arrow loader if it's too hard to parse out data properly using only one schema in the client across the two use cases, since they are used differently.
In any case, I think a function that takes in a Pandas DataFrame containing a Cell x Gene matrix and outputs JSON/Arrow should be the goal here. The index of such a DataFrame would be cell names and the column names genes. This will help with Cells/Cell-Sets.
>>> df_genes
Actin CD107a CD11c CD20 CD21 CD31 CD3e CD4 CD45 CD45RO CD68 CD8 DAPI_2 E_CAD Histone_H3 Ki67 Pan_CK Podoplanin
Unnamed: 0
1 0.0 3825.083089 2172.038856 0.000000 13118.704545 0.0 2619.149560 2258.743646 3018.150782 13766.025415 2475.430352 17811.810362 2472.491447 13831.021750 2155.434995 12023.281769 0.0 12854.526882
2 0.0 3158.566135 1905.015101 6.866331 9662.850531 0.0 2279.843261 2059.656600 2866.507131 9865.706096 2220.703160 10513.558166 1972.618289 10445.596337 1802.067673 8310.784396 0.0 9166.099972
3 0.0 2112.107533 1464.033661 0.935408 8152.397926 0.0 1778.593705 1477.261827 2401.413574 7463.324054 1703.527838 6728.968341 2594.646470 8001.948144 1467.260735 6173.303675 0.0 7050.821325
4 0.0 2409.139601 1568.258547 30.035613 12435.782407 0.0 1835.470442 1643.249288 2789.540598 7843.279558 1962.359687 7357.050570 2328.332977 11190.447293 1503.501068 6625.033120 0.0 8061.569801
5 0.0 1789.038279 1165.606538 23.199695 6595.104505 0.0 1401.826389 1163.010501 1994.819783 5216.277778 1378.526423 4899.289804 1745.914973 6385.073679 1220.704268 4540.830454 0.0 4463.399051
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2653 0.0 1528.167373 1040.252119 71.731638 9857.117232 0.0 1133.142655 1081.707627 2482.951977 5863.394068 1245.564972 6276.619350 2695.375000 7168.248588 1072.548729 5214.332627 0.0 5677.270480
2654 0.0 866.767553 579.135481 7.370484 3924.449898 0.0 698.100375 555.293286 1207.978357 2482.735515 713.964724 1805.677062 1886.900818 2124.561350 615.980061 1431.171097 0.0 1684.441207
2655 0.0 1534.898357 949.947653 1.008920 6614.136854 0.0 1718.979343 1471.665023 1850.167840 6816.869014 1180.052113 4810.176761 1911.350939 5107.615493 918.007746 4728.398592 0.0 5064.655399
2656 0.0 1643.330193 1080.667150 23.054348 6832.027778 0.0 1456.217874 1124.606763 2271.074879 5281.138406 1362.480193 5671.768116 1566.910870 5627.569565 986.648792 4990.973913 0.0 5253.209420
2657 0.0 2407.073093 2120.567444 2.307910 12124.994703 0.0 4122.323093 3009.756356 3979.926907 14120.478814 2581.693856 12566.961511 2934.979520 11720.578390 1956.343220 11260.825212 0.0 12085.653249
[2657 rows x 18 columns]
>>> generate_cell_by_gene(df_genes)
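For illustration, here is a minimal sketch of what `generate_cell_by_gene` could look like. The output layout (a per-gene `max` plus a `cells` mapping keyed by cell id) is an assumption based on the "data per observation and a max for rendering" description above, not the current schema, and the tiny DataFrame is made up.

```python
import json

import pandas as pd


def generate_cell_by_gene(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: turn a Cell x Gene DataFrame into a JSON-ready dict.

    Rows (the index) are cell ids and columns are gene names.
    The output layout is an assumption, not the existing Vitessce schema.
    """
    if not df.index.is_unique:
        raise ValueError("cell ids (the DataFrame index) must be unique")
    out = {}
    for gene in df.columns:
        column = df[gene]
        out[str(gene)] = {
            # The max lets a client scale values for rendering.
            "max": float(column.max()),
            # JSON keys must be strings, so cell ids are stringified.
            "cells": {str(cell): float(value) for cell, value in column.items()},
        }
    return out


# Made-up two-cell, two-gene example.
df_genes = pd.DataFrame(
    {"CD11c": [2172.0, 1905.0], "CD20": [0.0, 6.5]},
    index=pd.Index([1, 2], name="cell"),
)
result = generate_cell_by_gene(df_genes)
print(json.dumps(result, indent=2))
```

Returning a plain dict keeps serialization separate, so the same function could later feed an Arrow writer instead of `json.dumps`.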
Cell-Sets/Cells
@keller-mark knows best (feel free to comment/edit this issue!) but this is a little bit more complicated since the two are intertwined, but not necessary/sufficient in both directions (like the above); that is, one could have "Cells" without "Cell-sets" but not really "Cell-Sets" without "Cells."
Like the above, we want a function that takes in a Pandas DataFrame and outputs JSON/Arrow, but the structure for the DataFrame is a little bit hairier (not just a labeled Cell x Gene matrix where the labels are basically unchecked). I foresee us needing to either strongly define an API or rely on a properly named DataFrame (e.g., each column has a specific name like poly or xy). I think we should probably go the route of an API so we have something like:
>>> df
Shape Actin CD107a CD11c CD20 CD21 CD31 CD3e ... Ki67 Pan_CK Podoplanin Mean Covariance Total Mean All Shape Vectors
id ...
1 [[0.0, 100.5], [1.0232, 100.5232], [1.7536, 10... 0.0 3825.083089 2172.038856 0.000000 13118.704545 0.0 2619.149560 ... 12023.281769 0.0 12854.526882 4 6 6 2 3
2 [[0.0, 130.5], [1.0798, 130.5798], [1.8667, 13... 0.0 3158.566135 1905.015101 6.866331 9662.850531 0.0 2279.843261 ... 8310.784396 0.0 9166.099972 2 2 3 3 3
3 [[0.0, 647.5], [0.6596, 646.8404], [1.4515, 64... 0.0 2112.107533 1464.033661 0.935408 8152.397926 0.0 1778.593705 ... 6173.303675 0.0 7050.821325 6 2 6 4 1
4 [[0.4782, 736.0218], [0.4782, 736.0218], [0.95... 0.0 2409.139601 1568.258547 30.035613 12435.782407 0.0 1835.470442 ... 6625.033120 0.0 8061.569801 6 2 1 4 2
5 [[0.9636, 890.5], [0.9636, 890.5], [1.6556, 89... 0.0 1789.038279 1165.606538 23.199695 6595.104505 0.0 1401.826389 ... 4540.830454 0.0 4463.399051 3 2 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2653 [[1005.0357, 298.5], [1005.5179, 298.5], [1005... 0.0 1528.167373 1040.252119 71.731638 9857.117232 0.0 1133.142655 ... 5214.332627 0.0 5677.270480 6 1 2 4 2
2654 [[1006.0, 531.5], [1004.9692, 531.4692], [1004... 0.0 866.767553 579.135481 7.370484 3924.449898 0.0 698.100375 ... 1431.171097 0.0 1684.441207 1 1 2 6 3
2655 [[1005.193, 599.5], [1005.193, 599.5], [1004.5... 0.0 1534.898357 949.947653 1.008920 6614.136854 0.0 1718.979343 ... 4728.398592 0.0 5064.655399 3 2 1 1 3
2656 [[1005.233, 754.5], [1005.233, 754.5], [1004.4... 0.0 1643.330193 1080.667150 23.054348 6832.027778 0.0 1456.217874 ... 4990.973913 0.0 5253.209420 3 2 1 1 3
2657 [[1006.0, 389.5], [1005.4694, 390.0306], [1004... 0.0 2407.073093 2120.567444 2.307910 12124.994703 0.0 4122.323093 ... 11260.825212 0.0 12085.653249 2 4 1 2 3
[2657 rows x 24 columns]
generate_cells(df, poly="Shape", genes=["CD11c", "CD20", ...], factors=["Mean", "Mean All", ...]....)
where each string argument names a column in the DataFrame df whose values go into the part of the JSON corresponding roughly to the argument's keyword. The index of this DataFrame will be cell ids, just like the above.
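A rough sketch of how that column-mapping API might behave, assuming a per-cell output with `poly`, `genes`, and `factors` keys. Those output keys, the choice to stringify factor values, and the sample data are all assumptions for illustration, not a settled schema.

```python
import pandas as pd


def generate_cells(df, poly, genes=(), factors=()):
    """Hypothetical sketch: map named DataFrame columns into per-cell JSON.

    `poly` names the polygon column; `genes` and `factors` are lists of
    column names. The output keys are assumptions, not the final schema.
    """
    requested = [poly, *genes, *factors]
    missing = [name for name in requested if name not in df.columns]
    if missing:
        # Fail early so a misnamed column is not silently dropped.
        raise KeyError(f"columns not found in DataFrame: {missing}")
    cells = {}
    for cell_id, row in df.iterrows():
        cells[str(cell_id)] = {
            "poly": [[float(coord) for coord in point] for point in row[poly]],
            "genes": {name: float(row[name]) for name in genes},
            # Factors are treated as categorical labels in this sketch.
            "factors": {name: str(row[name]) for name in factors},
        }
    return cells


# Made-up two-cell example; the index holds the cell ids.
df = pd.DataFrame(
    {
        "Shape": [
            [[0.0, 100.5], [1.0, 100.5], [1.0, 101.5]],
            [[0.0, 130.5], [1.0, 130.5], [1.0, 131.5]],
        ],
        "CD11c": [2172.0, 1905.0],
        "Mean": [4, 2],
    },
    index=pd.Index([1, 2], name="id"),
)
cells = generate_cells(df, poly="Shape", genes=["CD11c"], factors=["Mean"])
print(cells["1"])
```

Checking the requested column names up front is the main benefit of an explicit API over a "properly named DataFrame": the caller gets a clear error instead of incomplete output.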
I think Cell_sets is going to be a little harder. Maybe you could add something about this @keller-mark here in terms of what input data could look like.
Raster
This one is tricky as well. We should probably support both tiff and zarr via a flag. We'll need to set up the Docker container for bioformats2raw/raw2ometiff as a dependency (which I think can be done via the setup.py file). Beyond that, the other major pain point will be input data. Are we expecting numpy arrays? dask arrays? zarr stores? File paths? Perhaps all 4 can be possible?
generate_raster(ome_tiff="/path/to/my_file.ome.tif", output_tiff=True)
# or
generate_raster(np_array=my_image, output_zarr=True)
@manzt can probably comment on this as well. I imagine most people will input OME-TIFF to bioformats2raw, but I think we can also handle other inputs and use our custom pyramid generator or something Python-specific (in contrast to bioformats2raw) that Glencoe writes.
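One way the "exactly one input, exactly one output format" rule could be enforced before any conversion runs is sketched below. `plan_raster` is a hypothetical helper, not part of any existing package; it performs no I/O and does not call bioformats2raw or raw2ometiff, and the keyword names simply mirror the two example calls above.

```python
from pathlib import Path


def plan_raster(ome_tiff=None, np_array=None, output_tiff=False, output_zarr=False):
    """Hypothetical helper: validate arguments and describe the conversion."""
    sources = {"ome_tiff": ome_tiff, "np_array": np_array}
    given = [name for name, value in sources.items() if value is not None]
    if len(given) != 1:
        raise ValueError(f"pass exactly one input, got: {given or 'none'}")
    if output_tiff == output_zarr:
        raise ValueError("set exactly one of output_tiff or output_zarr")

    source = given[0]
    if source == "ome_tiff":
        # The path is only normalized here, never opened.
        detail = str(Path(ome_tiff))
    else:
        # Duck-typed so numpy, dask, or zarr arrays could all pass through.
        if not hasattr(np_array, "shape"):
            raise TypeError("np_array must expose a .shape attribute")
        detail = tuple(np_array.shape)

    return {
        "source": source,
        "detail": detail,
        "output": "tiff" if output_tiff else "zarr",
    }


plan = plan_raster(ome_tiff="/path/to/my_file.ome.tif", output_tiff=True)
print(plan)
```

Separating validation from conversion like this would let the actual backends (bioformats2raw, a custom pyramid generator, or something else) be swapped in later without changing the public signature.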
Molecules
I think this will be relatively straightforward like the genes data - I think an input data frame with the index being molecule names plugged into an API is what we will use:
>>> df
x_um y_um
gene
Gad2 1278.683956 6020.642260
Gad2 1326.970330 6023.884788
Gad2 1292.026844 6059.337093
Gad2 1300.886241 6097.786264
Gad2 1232.410068 6102.884182
... ... ...
Mup5 3161.427603 5192.594981
Mup5 3099.698528 5221.596008
Mup5 3084.582240 5297.234605
Mup5 3054.192051 5342.142346
Mup5 3058.963217 5348.150185
[3841412 rows x 2 columns]
>>> generate_molecules(df, x="x_um", y="y_um")
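As a minimal sketch, `generate_molecules` could group rows by the index (the molecule name) and collect coordinate pairs. The output layout (name mapped to a list of `[x, y]` pairs) is an assumption for illustration, and the small DataFrame below is made up.

```python
import pandas as pd


def generate_molecules(df, x, y):
    """Hypothetical sketch: group coordinates by molecule name (the index).

    `x` and `y` name the coordinate columns. The output layout is an
    assumption, not the existing schema.
    """
    for name in (x, y):
        if name not in df.columns:
            raise KeyError(f"column not found in DataFrame: {name!r}")
    molecules = {}
    # level=0 groups on the index; sort=False keeps first-seen order.
    for gene, group in df.groupby(level=0, sort=False):
        molecules[str(gene)] = group[[x, y]].to_numpy(dtype=float).tolist()
    return molecules


# Made-up example with the molecule name as the index.
df = pd.DataFrame(
    {
        "x_um": [1278.68, 1326.97, 3161.43],
        "y_um": [6020.64, 6023.88, 5192.59],
    },
    index=pd.Index(["Gad2", "Gad2", "Mup5"], name="gene"),
)
molecules = generate_molecules(df, x="x_um", y="y_um")
print(molecules)
```

With millions of rows, as in the example above, a nested JSON list like this may get large, which is one more argument for eventually offering an Arrow or Zarr output.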