step-by-step scripts for the example data

Hi,

I'm testing Spacedust to identify the conserved gene clusters between the two example genomes.

1. Creating databases
`spacedust createsetdb listOfFastaFiles.tsv setDB tmpFolder --gff-dir examples/gff.txt --gff-type CDS`

The content of `listOfFastaFiles.tsv` is:
examples/uvig_120081.fna
examples/uvig_255655.fna
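
For more than a handful of genomes, the list file could be generated rather than typed by hand. Below is a minimal sketch, assuming the list only needs one FASTA path per line as shown above; `make_fasta_list` is a hypothetical helper name of my own, not a spacedust command.

```shell
# make_fasta_list DIR OUT: write the path of every *.fna file directly
# inside DIR to OUT, one per line, sorted for a reproducible order.
# (make_fasta_list is a hypothetical helper, not a spacedust command.)
make_fasta_list() {
    fasta_dir=$1
    list_file=$2
    # Fail early instead of silently writing an empty list.
    if [ ! -d "$fasta_dir" ]; then
        echo "no such directory: $fasta_dir" >&2
        return 1
    fi
    find "$fasta_dir" -maxdepth 1 -type f -name '*.fna' | sort > "$list_file"
}
```

Calling `make_fasta_list examples listOfFastaFiles.tsv` should reproduce the two lines above, provided `examples/` contains only those two `.fna` files.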

2. Convert to a structure sequence DB (the reference Foldseek database `Alphafold/UniProt` has been downloaded to `~/database/FoldSeek/UniProt/` and named `afdb`)
`spacedust aa2foldseek setDB ~/database/FoldSeek/UniProt/afdb tmpFolder`
This produced two databases, `setDB_foldseek` and `setDB_unmapped`.
**Q:** I will analyze some virus genomes later, so are full Foldseek structure searches against precomputed structures probably a better choice than ProstT5?

3. Search querySetDB against targetSetDB (using Foldseek and MMseqs)
`spacedust clustersearch setDB setDB result.tsv tmpFolder --search-mode 1 --num-iterations 2`
4. This produced the `result.tsv` file:

[result.tsv](https://github.com/user-attachments/files/24381550/result.tsv)

I am not sure whether I have run the tool correctly. I am also confused by the results, as I would expect to observe some conserved gene clusters between the two example genomes.
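
As a quick sanity check on the output, the sketch below reports whether the file has any content and how many lines start with a given marker. It is only a rough aid: `summarize_result` is a hypothetical helper of my own, and the default `#` marker for cluster summary lines is an assumption about the output layout that should be checked against the Spacedust documentation.

```shell
# summarize_result FILE [PREFIX]: report whether FILE has any content and how
# many of its lines start with PREFIX. The default PREFIX '#' is an assumption
# about how cluster summary lines are marked; adjust it to match your output.
# (summarize_result is a hypothetical helper, not part of spacedust.)
summarize_result() {
    result_file=$1
    prefix=${2:-"#"}
    # -s is true only for a file that exists and is non-empty.
    if [ ! -s "$result_file" ]; then
        echo "$result_file is missing or empty"
        return 1
    fi
    total=$(awk 'END { print NR }' "$result_file")
    # index($0, p) == 1 matches lines that begin with the literal prefix.
    marked=$(awk -v p="$prefix" 'index($0, p) == 1 { n++ } END { print n + 0 }' "$result_file")
    echo "lines: $total, lines starting with '$prefix': $marked"
}
```

Running `summarize_result result.tsv` on the attached file would show at a glance whether any lines carry the assumed marker or whether the search returned nothing at all.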

**Q:** Also, what if I have many genomes and want to identify the conserved gene clusters among any of them?

Thanks!
Best wishes!

