step-by-step scripts for the example data #20

@BinhongLiu

Hi,
I'm trying to test Spacedust to find the conserved gene clusters between the two example genomes.

  1. Creating databases
    spacedust createsetdb listOfFastaFiles.tsv setDB tmpFolder --gff-dir examples/gff.txt --gff-type CDS

The listOfFastaFiles.tsv contains:
examples/uvig_120081.fna
examples/uvig_255655.fna

  2. Convert to structure sequence DB (the reference Foldseek DB AlphaFold/UniProt has been downloaded to ~/database/FoldSeek/UniProt/ and named afdb)
    spacedust aa2foldseek setDB ~/database/FoldSeek/UniProt/afdb tmpFolder
    Here I got two databases, setDB_foldseek and setDB_unmapped.
    Q: I will analyze some virus genomes later. Is a full Foldseek structure search against precomputed structures probably a better choice than ProstT5?

  3. Search querySetDB against targetSetDB (using Foldseek and MMseqs)
    spacedust clustersearch setDB setDB result.tsv tmpFolder --search-mode 1 --num-iterations 2

  4. I got the result.tsv file here.

result.tsv

I am not sure whether I have run the tool correctly. I am also confused by the results, as I would expect to observe some conserved gene clusters between the two example genomes.
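
For reference, the steps above can be collected into one script so the whole run is easy to reproduce. This is only a sketch of what I ran, not a confirmed workflow: the commands, flags, and paths are exactly the ones listed in this post, while the `run` helper and the `DRY_RUN` switch are hypothetical additions of mine that print each command and execute it only when `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Sketch: the three steps from this post as one script.
# Commands and paths are copied from the post; `run` and DRY_RUN are
# hypothetical helpers, not part of Spacedust.
set -euo pipefail

AFDB=~/database/FoldSeek/UniProt/afdb   # downloaded Foldseek AlphaFold/UniProt reference DB
TMP_FOLDER="tmpFolder"
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print the commands, 0 = execute them

run() {
    # Print the command line, then execute it unless this is a dry run.
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# List of genome FASTA files, one path per line.
printf '%s\n' examples/uvig_120081.fna examples/uvig_255655.fna > listOfFastaFiles.tsv

# Step 1: create the set database from the FASTA list and the GFF list.
run spacedust createsetdb listOfFastaFiles.tsv setDB "$TMP_FOLDER" --gff-dir examples/gff.txt --gff-type CDS

# Step 2: map the proteins onto the precomputed structure database.
run spacedust aa2foldseek setDB "$AFDB" "$TMP_FOLDER"

# Step 3: search the set database against itself.
run spacedust clustersearch setDB setDB result.tsv "$TMP_FOLDER" --search-mode 1 --num-iterations 2
```

Running it as is only prints the three commands; `DRY_RUN=0 bash run_example.sh` (the file name is arbitrary) would execute them.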

Q: Also, what if I have many genomes and want to identify conserved gene clusters among any of them?
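
For the many-genome case, my guess is that the file list would simply get one line per genome. A minimal sketch of building that list from a directory follows. The `genomes` directory, the `GENOME_DIR` variable, and the `make_genome_list` function are hypothetical names, and I am only assuming (not sure) that one set database built from all genomes and searched against itself is the intended approach.

```shell
#!/usr/bin/env bash
# Sketch: build the FASTA list for many genomes.
# GENOME_DIR and make_genome_list are hypothetical names, not part of Spacedust.
set -euo pipefail

make_genome_list() {
    # $1: directory holding one .fna file per genome
    # $2: output list file, one FASTA path per line (same layout as above)
    local dir="$1" out="$2"
    find "$dir" -maxdepth 1 -type f -name '*.fna' | sort > "$out"
}

GENOME_DIR="${GENOME_DIR:-genomes}"
if [ -d "$GENOME_DIR" ]; then
    make_genome_list "$GENOME_DIR" listOfFastaFiles.tsv
    echo "genomes listed: $(wc -l < listOfFastaFiles.tsv)"
    # Assumption: the same commands as in the steps above would then be run on
    # this list, with a GFF list that matches these genomes passed to --gff-dir.
fi
```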

Thanks!
Best wishes!
