-
Notifications
You must be signed in to change notification settings - Fork 10
Feature/new filtering tool #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for the PR! On the first glance it looks all fine and I'll have a more in depth look soon. |
|
I tested (A) The My next step would have been to run with I would expect that the outcome should be the same for |
|
Thanks, (A) The filtered reads are empty files on my runs with seqkit. I cannot check the differences. (B) I'm guessing there is an issue with the nf-core module when there are single reads not paired. Or I overlooked something. missing files are single end, right? |
No, I used Line 38 in 698eedc
No, those are empty. And are supposed to be empty. Because unclassified reads are removed, and all are unclassified (because the database is a mock database...). Thats not an ideal check, but a fast one.
Right. Missing files are: test_single-end_short_filtered.fastq.gz, test_single-end_long_longReads_filtered.fastq.gz, test_paired-end_plus_long-reads_longReads_filtered.fastq.gz |
|
I cannot see any change compared to before with |
|
I know what's happening but leaving early for the long weekend. will fix that next week. Here is what's going on: kraken2 in --paired mode emits fq header without paired info /1 and /2 |
|
I see, thanks for the explanation. Sure, no rush. |
|
The last two commits fixes A and B. The summary_classification is not accurate when both bbduk and kraken2 are used. This is because kraken2 in --paired mode trims the fastq header but bbduk performs on each fw and rv read separately. When filtering_tool is bbmap we don't touch the fq headers, so the isolated IDs are different from kraken2 and bbduk in terms of suffix (/1 /2). I guess a more robust approach to summarize the result is to use the filtered fastq files themselves. on top of that, it is a bit confusing, what does 25000 stand for in the summary report! are there 25000 paired reads for example? if that's the intention we need to be careful with the way isolate_bbduk_ids work because we take |
|
I tested as in #80 (comment) and A isnt solved but B is. A seems indeed of lower priority since that table is not optimal as-is, I agree. I'm going to test with test_full profile next. |
|
I run and found
where I attached the three trace files, feel free to check it out: So to conclude, looking only at the run time, I dont see improvements. I havent checked the output files. Do you have a reproducible example where the approach you implemented improves the runtime? |
Well I kinda expected that using the substring option would negate the whole "fast runtime" idea in the first place. But 6h is such a drama! well I made a fix which does not rely on substring matching. I'll test_full first and then push :D |
|
I just pushed a mapping step to map kraken2 sequence IDs to fq1 (and in case fq2) by position. now The file sizes might differ, most probably because of the way headers are renamed pre and after in seqkit route. I checked the headers from both tools and they match: # FILTERBYNAME
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt
# SEQKIT
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txt
# DIFF
diff CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txtOn the The summary numbers for pair end reads are the combination of R1 and R2 since the summarizing step relies on unique header names and in |
d4straub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested run times with test_full and it looks to me on our HPC as if its a similar time. So thats seems fine.
When checking the output of removed reads headers were identical, but not quality in fastqs:
diff 'results_1.2.0_full/filter/removed/CAPES_S11_R1_removed.fastq' 'results_feature-new-filtering-tool_full_bbmap/filter/removed/CAPES_S11_removed_1.fastq'
1952c1952
< #0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
---
> !0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
2132c2132
< BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B#00#00<<BB7BFBBFFFBBFBBBFF
---
> BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B!00!00<<BB7BFBBFFFBBFBBBFF
3876c3876
< #0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
---
> !0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
5204c5204
< BBBFFFFFFFFFFIIFFFIII#0BB<BFFFFFFFIFFIIFFIIII##00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
---
> BBBFFFFFFFFFFIIFFFIII!0BB<BFFFFFFFIFFIIFFIIII!!00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
(thats partial output)
Why is that?
Didn't know that BBtools I/O does that. The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale. (see Removing N basecalls) well I guess that's harmless and beneficial! |
d4straub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve already, but please have a look at my minor comments above before merging.
The new function should be activated in a test profile (or a new test created), but that can be done in a separate PR and doesn't need to come from you.
|
Thanks a lot for your effort! |
This PR adds bbmap/filterbyname.sh as an alternative read-filtering tool, selectable via the --filtering_tool parameter.
bbmap/filterbyname.sh natively handles paired-end reads, eliminating the need for FASTQ header renaming before and after filtering.
This results in a significant runtime improvement.
The default value for --filtering_tool remains seqkit to preserve the current workflow behavior.
PR checklist
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).