Problem with pulling proteome from NCBI--solution found! #44

@ani-sch

Hello! Thank you for this great software and your time!

### EDIT/UPDATE 2--solved!
We found a solution to this problem! You can build a local BLAST database that is used specifically for the reciprocal BLAST step. There are various references/guides for doing this throughout the topiary docs and GitHub, but you may have to dig a little.
First, find the proteome files for the species in your seed dataframe and download them. I was able to find the protein.faa.gz files by searching the NCBI Datasets site, https://ncbi.nlm.nih.gov/datasets/. Then build your local database with the makeblastdb command (run makeblastdb -help if needed; there is more info online as well). I had the best luck using cat to combine the files first, then using the combined file as the input. Once the database is set up, start the pipeline: run topiary-seed-to-alignment and include the --local_recip_blast_db /path/to/databasename.faa argument. Running topiary-seed-to-alignment --help is helpful for setting this up.
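For anyone following along, here is a rough sketch of those steps as a shell script. It assumes the downloaded proteomes are still gzip-compressed and sit together in one directory; the directory name `proteomes`, the file names `recip_db.faa` and `seed-dataframe.csv`, and the two helper function names are placeholders I made up, not anything topiary requires. The only topiary argument used is the `--local_recip_blast_db` one mentioned above, so check `topiary-seed-to-alignment --help` for the exact call on your install.

```shell
#!/bin/sh
# Combine every downloaded proteome (*.faa.gz) in a directory into one
# uncompressed FASTA file.
# Usage: combine_proteomes <proteome_dir> <combined.faa>
combine_proteomes() {
    : > "$2"                          # start from an empty output file
    for f in "$1"/*.faa.gz; do
        [ -e "$f" ] || continue       # skip if the glob matched nothing
        gunzip -c "$f" >> "$2"        # decompress and append
    done
}

# Build a protein BLAST database from the combined FASTA file.
# The database gets the same name as the FASTA file.
# Usage: build_recip_db <combined.faa>
build_recip_db() {
    makeblastdb -in "$1" -dbtype prot -out "$1"
}

# Example run (hypothetical paths; uncomment to use):
# combine_proteomes proteomes recip_db.faa
# build_recip_db recip_db.faa
# topiary-seed-to-alignment seed-dataframe.csv --local_recip_blast_db recip_db.faa
```

Decompressing before combining is my own choice here: it keeps the input to makeblastdb as plain FASTA. If your files are already unzipped, a plain `cat *.faa > recip_db.faa` does the same job.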
Here are a few links that contain relevant/helpful info:

- topiary.ncbi.blast.recip API reference: https://topiary-asr.readthedocs.io/en/latest/topiary.ncbi.blast.html#module-topiary.ncbi.blast.recip
- (you may need to copy/paste this one into your browser, sorry): https://github.com/harmslab/topiary/commit/468a6d72bbdb58a1d312f068feb8e02d9facfb34

### EDIT/UPDATE:
In the docs, it says users can specify sources of sequences (using --blast_xml, --ncbi_blast_db, and --local_blast_db). However, I can't tell whether those options only apply to building the sequence dataset (before doing reciprocal BLAST), or whether they can also be used to build a database for the reciprocal BLAST step specifically. If the latter is possible, we think that could solve the problem, as we could build a database with the unretrievable proteomes... but we are unsure whether it would create a problem with building, or limit, the sequence dataset (pre-reciprocal BLAST).

### original post:
I am beginning an ASR project using this software, but am running into an issue in the seed-to-alignment phase. I have a seed dataset and am able to run the first command in the pipeline. The BLAST query seems successful, but after the "Doing reciprocal BLAST" step I get errors (text file with the error message attached). It seems like the location of the Homo sapiens proteome has changed: the error readout gives the full path to where it expects the file to be, but when I follow that path, the file isn't there.

My main question is: what is the best course of action in this situation? I really can't remove this species from my seed dataset (or set it to false) because it is a crucial species to include for my purposes. I'm assuming there's a way to get the proper proteome and provide it to topiary, but I'm unsure of the best way to do that...

Thank you for your time and help!
topiary-error-may16.txt
