Hi, I’ve carefully read your article and noticed that vclust has been tested on metagenomic and phage virus datasets. I’d like to ask whether vclust is also suitable for plant viruses assembled from transcriptomic data.
I’m currently mining plant viruses from public plant transcriptome datasets. For the species I’m focusing on, I’ve collected more than 300 samples. After assembling the reads and performing virus identification, I carried out taxonomic classification and predicted potential plant viruses based on their taxonomic ranks and host relationships. Although I used MMseqs2 to dereplicate the assembled sequences, there are still a large number of fragments remaining, and I’m stuck at this step.
I’m considering using vclust to generate vOTUs as the next step. Would that be appropriate? Because the data are transcriptomic, the assembled contigs are relatively short: lengths range from 500 bp to 20 kb, but the vast majority fall between 500 and 3000 bp, so the assemblies are quite fragmented. Could having so many short, fragmented contigs affect the clustering results?
Below are some intermediate results from my pipeline (assembly results, virus identification results, MMseqs2 clustering results, and the plant virus candidates).
I would really appreciate any advice. Thank you very much!
Genome                            Size  Contigs  Max_len  N50   N90  >500bp_Num  >500bp_Ratio  >1000bp_Num  >1000bp_Ratio
megahit.mix.fasta                 1.2G  1666981  40290    789   355  786973      71.10%        266714       40.39%
mix-cobra.merged.fasta            945M  1349084  40534    894   303  560222      71.73%        217827       46.26%
mix-mmseqs.cluster_rep_seq.fasta  308M  473343   40534    741   286  183270      67.64%        59849        39.53%
Plant.classified.fasta            49M   49576    13672    1006  558  49371       99.79%        13800        50.26%
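For anyone who wants to check the length distribution of a similar contig set, statistics such as N50, N90, and the share of contigs within a length range can be computed with a short script. This is only a minimal sketch, not the script that produced the table above; the function names are hypothetical, and it assumes plain, uncompressed FASTA input.

```python
from typing import Iterable, List


def read_fasta_lengths(path: str) -> List[int]:
    """Return the sequence length of every record in an uncompressed FASTA file."""
    lengths: List[int] = []
    current = 0
    seen_header = False
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            else:
                current += len(line.strip())
    if seen_header:
        lengths.append(current)
    return lengths


def nxx(lengths: Iterable[int], fraction: float) -> int:
    """Return the Nxx value (fraction=0.5 gives N50, 0.9 gives N90).

    This is the length L such that contigs of length >= L together
    cover at least `fraction` of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= fraction * total:
            return length
    return 0  # empty input


def share_in_range(lengths: Iterable[int], low: int, high: int) -> float:
    """Return the fraction of contigs whose length lies in [low, high]."""
    values = list(lengths)
    if not values:
        return 0.0
    return sum(low <= n <= high for n in values) / len(values)
```

For example, `share_in_range(read_fasta_lengths("Plant.classified.fasta"), 500, 3000)` would give the fraction of candidate contigs in the 500 to 3000 bp range mentioned above.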
Furthermore, the most abundant families and genera found in the analysis results are distributed as follows:
─ Top-15 Family ─
Alphaflexiviridae 62,249
Betaflexiviridae 12,708
Rhabdoviridae 3,815
Caulimoviridae 530
Pospiviroidae 191
Bromoviridae 84
Closteroviridae 70
Fimoviridae 42
Potyviridae 36
Tombusviridae 25
Secoviridae 23
Atkinsviridae 15
Tymoviridae 11
Benyviridae 10
Virgaviridae 9
─ Top-15 Genus ─
Potexvirus 62,213
Foveavirus 7,056
Carlavirus 4,684
Betacytorhabdovirus 3,784
Vitivirus 855
Soymovirus 451
Pospiviroid 191
Ilarvirus 83
Banmivirus 70
Caulimovirus 51
Emaravirus 42
Lolavirus 28
Crinivirus 25
Ipomovirus 22
Closterovirus 21