Problem with pulling proteome from NCBI--solution found! #44

Hello! Thank you for this great software and your time!

**### EDIT/UPDATE 2--solved!**
We found a solution to this problem! You can build a local BLAST database that is used specifically for the reciprocal BLAST step. There are various references/guides for doing this throughout the topiary docs and GitHub, but you may have to dig a little.
First, find the proteome files for the species in your seed dataframe and download them. I was able to find the `protein.faa.gz` files by searching the NCBI Datasets site: <https://ncbi.nlm.nih.gov/datasets/>. Then, build your local database with the `makeblastdb` command (run it with the `-help` argument if needed; there is more info online as well). I had the best luck using `cat` to combine the files first, then using the combined file as the input. Once the database is set up, start the pipeline: run the `topiary-seed-to-alignment` command and include the `--local_recip_blast_db /path/to/databasename.faa` argument. Running `topiary-seed-to-alignment --help` is helpful for setting this up.
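The steps above (combine the downloaded proteomes, build the database, start the pipeline) can be sketched in Python as below. This is only a rough sketch, not the official topiary workflow. It assumes the proteomes were downloaded as gzipped FASTA files, that `makeblastdb` is called without `-out` so the database takes the name of the combined FASTA file, and that the seed dataframe is passed as the first positional argument. All file names and helper function names are hypothetical. The helpers only build the command lines; they do not run anything.

```python
import gzip
from pathlib import Path


def combine_proteomes(gz_paths, combined_path):
    """Decompress each protein.faa.gz file and append it to one FASTA file."""
    combined_path = Path(combined_path)
    with open(combined_path, "wb") as out:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as src:
                data = src.read()
            out.write(data)
            # Make sure the next file's first header starts on its own line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return combined_path


def makeblastdb_command(fasta_path):
    """Command that builds a protein BLAST database from the combined FASTA.

    Without -out, makeblastdb names the database after the input file,
    which matches the /path/to/databasename.faa form used above.
    """
    return ["makeblastdb", "-in", str(fasta_path), "-dbtype", "prot"]


def seed_to_alignment_command(seed_file, db_path):
    """Command that starts the pipeline with a local reciprocal BLAST database.

    Passing the seed dataframe as a positional argument is an assumption;
    check `topiary-seed-to-alignment --help` for the exact usage.
    """
    return [
        "topiary-seed-to-alignment",
        str(seed_file),
        "--local_recip_blast_db",
        str(db_path),
    ]
```

For example, after `combined = combine_proteomes(sorted(Path("proteomes").glob("*.faa.gz")), "recip_proteomes.faa")`, the lists returned by `makeblastdb_command(combined)` and `seed_to_alignment_command("seed-dataframe.csv", combined)` can be joined with `shlex.join` and pasted into a terminal, or passed to `subprocess.run` if both tools are installed.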
Here are a few links that contain relevant/helpful info:
- topiary.ncbi.blast.recip API reference: <https://topiary-asr.readthedocs.io/en/latest/topiary.ncbi.blast.html#module-topiary.ncbi.blast.recip>
- *(you may need to copy/paste this one into your browser, sorry)*: <https://github.com/harmslab/topiary/commit/468a6d72bbdb58a1d312f068feb8e02d9facfb34>



**### EDIT/UPDATE:**
In the docs, it says users can specify sources of sequences (using `--blast_xml`, `--ncbi_blast_db`, and `--local_blast_db`). However, I can't tell whether those options only apply to building the sequence dataset (before doing reciprocal BLAST), or whether you can also use them to build a database for the reciprocal BLAST step specifically. If the latter is possible, we think that could solve the problem, as we could build a database with the unretrievable proteomes... but we are unsure whether it would cause a problem with building, or limit, the sequence dataset (pre-reciprocal BLAST).


**### original post:**
I am beginning an ASR project using this software, but am running into an issue in the seed-to-alignment phase. I have a seed dataset and am able to run the first command in the pipeline. The BLAST query seems successful, but after the *Doing reciprocal BLAST* step I get errors (text file with the error message attached). It seems like the location of the Homo sapiens proteome has changed: the error readout provides a full path to where it thinks the proper file is, but when I try to follow that path, the file isn't there.

**My main question is:** what is the best course of action in this situation? I really can't remove this species from my seed dataset (or set it to false) because it is a crucial species to include for my purposes. I'm assuming there's a way to get the proper proteome and provide it to topiary, but I'm unsure of the best way to do that...

Thank you for your time and help!
[topiary-error-may16.txt](https://github.com/harmslab/topiary/files/15357912/topiary-error-may16.txt)

