Submitting Author: Flavio F. Contreras-Torres (@NanoBiostructuresRG)
All current maintainers: Flavio F. Contreras-Torres (@NanoBiostructuresRG)
Package Name: harmonsmile
One-Line Description of Package: HARMONSMILE harmonizes molecular SMILES strings into a reproducible canonical, isomeric, and Kekulized RDKit-based representation for cheminformatics and molecular machine-learning dataset preparation.
Repository Link: https://github.com/NanoBiostructuresRG/harmonsmile
Version submitted: v0.2.4
EiC: TBD
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD
Code of Conduct & Commitment to Maintain Package
Description
HARMONSMILE is a Python package for harmonizing molecular SMILES strings from heterogeneous chemical data sources into a consistent RDKit-based representation. It supports reproducible cheminformatics workflows by reducing representation inconsistencies that can affect molecule comparison, dataset integration, preprocessing for duplicate detection, and machine-learning dataset preparation.
The package provides both a Python API and a command-line interface. Its current functionality includes local SMILES preparation workflows as well as ingestion-oriented workflows for public chemical data sources such as PubChem and ChEMBL. HARMONSMILE is designed as a lightweight workflow layer around RDKit rather than as a replacement for RDKit itself.
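As an illustration of the dataset-integration and duplicate-detection use case described above, the following is a minimal sketch, not HARMONSMILE's actual API. The function `group_by_harmonized`, the stand-in lookup table, and the record identifiers are all hypothetical. In practice the `canonicalize` callable would presumably wrap RDKit calls such as `Chem.MolFromSmiles` and `Chem.MolToSmiles`; a lookup table is used here only so the sketch runs without RDKit installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for an RDKit-backed canonicalizer. The mapped values
# are illustrative placeholders, not output produced by RDKit or HARMONSMILE.
_STAND_IN_TABLE = {
    "c1ccccc1": "C1=CC=CC=C1",     # aromatic notation for benzene
    "C1=CC=CC=C1": "C1=CC=CC=C1",  # Kekulized notation for benzene
    "OCC": "CCO",                  # ethanol written from the oxygen end
    "CCO": "CCO",
}


def stand_in_canonicalize(smiles: str) -> Optional[str]:
    """Return a harmonized string, or None when the input is not recognized."""
    return _STAND_IN_TABLE.get(smiles.strip())


Record = Tuple[str, str, str]  # (source, record_id, raw_smiles)


def group_by_harmonized(
    records: Iterable[Record],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[Record]]:
    """Group records by harmonized SMILES and collect unparseable rows."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    rejected: List[Record] = []
    for source, record_id, smiles in records:
        key = canonicalize(smiles)
        if key is None:
            # Keep failures visible instead of dropping them silently.
            rejected.append((source, record_id, smiles))
        else:
            groups[key].append((source, record_id))
    return dict(groups), rejected


# Hypothetical records from three sources, written in different notations.
records = [
    ("pubchem", "A1", "c1ccccc1"),
    ("chembl", "B7", "C1=CC=CC=C1"),
    ("local", "L3", "OCC"),
    ("local", "L4", "not-a-smiles"),
]

groups, rejected = group_by_harmonized(records, stand_in_canonicalize)

# Any key with more than one record is a cross-source duplicate candidate.
duplicates = {key: rows for key, rows in groups.items() if len(rows) > 1}
```

Passing the canonicalizer in as a parameter keeps the grouping logic independent of the chemistry toolkit, so the same function could be reused with whatever harmonization convention a dataset adopts.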
Associated Publication (Optional)
If your package is associated with a peer-reviewed publication, please provide the details below. This information helps us assess eligibility for our publication fast-track pathway. If this does not apply, you can leave these fields blank.
Publication Title: N/A
Publication DOI: N/A
Journal/Venue: N/A
Scope
Domain Specific
Community Partnerships
If your package is associated with an existing community please check below:
- For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
HARMONSMILE fits data retrieval and extraction because it provides ingestion-oriented workflows for public molecular databases such as PubChem and ChEMBL. It fits data processing/munging because its central purpose is to transform heterogeneous SMILES strings into a consistent RDKit-based molecular representation. It supports workflow automation and reproducibility through a tested Python API, CLI, documentation, examples, and packaged releases. It also supports database interoperability by helping users compare or merge molecular records from different chemical sources under a shared representation convention.
- Who is the target audience and what are scientific applications of this package?
The target audience includes computational chemists, cheminformatics researchers, molecular machine-learning practitioners, and maintainers of molecular datasets who need reproducible preprocessing of SMILES strings before downstream analysis. Scientific applications include molecular dataset curation, reproducible SMILES standardization, cross-database molecular comparison, preprocessing for duplicate detection, and preprocessing for cheminformatics or machine-learning pipelines.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
RDKit already provides core cheminformatics functionality, including SMILES parsing, molecular standardization utilities, and canonical SMILES generation. HARMONSMILE does not replace RDKit; instead, it provides reusable tabular workflows, PubChem and ChEMBL ingestion-oriented utilities, a Python API, CLI, examples, tests, documentation, PyPI distribution, versioned releases, and Zenodo archival for a specific recurring dataset-preparation problem.
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
No pre-submission enquiry was made.
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
JOSS Checks
Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step. Please note that the pyOpenSci reviewers will not be reviewing the paper.md file.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment on our GitHub Discussions
Editor and Review Templates
The editor template can be found here.
The review template can be found here.
Footnotes
Please fill out a pre-submission inquiry before submitting a data visualization package.