Skip to content

Conversation

@m3hdad
Copy link
Contributor

@m3hdad m3hdad commented Aug 12, 2025

This PR adds bbmap/filterbyname.sh as an alternative read-filtering tool, selectable via the --filtering_tool parameter.

bbmap/filterbyname.sh natively handles paired-end reads, eliminating the need for FASTQ header renaming before and after filtering.
This results in a significant runtime improvement.

The default value for --filtering_tool remains seqkit to preserve the current workflow behavior.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/detaxizer branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@d4straub
Copy link
Collaborator

Thank you for the PR! On the first glance it looks all fine and I'll have a more in depth look soon.

@d4straub
Copy link
Collaborator

I tested
(1) nextflow run nf-core/detaxizer -r 1.2.0 -profile test,singularity [...]
(2) nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity [...]
(3) nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity --filtering_tool "bbmap" [...]
where [...] is just outdir and resource config.

(A) The summary/summary.tsv was identical for (1) and (2) but differed in (3) in lines:
test_paired-end_plus_long-reads: 26000 -> 28000
test_paired-end: 25993 -> 27993
(B) Additionally, while all 5 test samples seem to have been processed and listed in summary/summary.tsv, in folder filter/filtered were in (3) only 4 files instead of 7 as produced by (1) and (2). The file sizes were also dramatically different.

My next step would have been to run with -profile test_full, but I postpone that for now.

I would expect that the outcome should be the same for --filtering_tool "bbmap" or --filtering_tool "seqkit", correct?

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 13, 2025

Thanks,

(A) The filtered reads are empty files on my runs with seqkit. I cannot check the differences.
Could you confirm if you ran with the default --tax2filter Homo sapiens ?
Are the files in filter/filtered contain anything when seqkit is used?

(B) I'm guessing there is an issue with the nf-core module when there are single reads not paired. Or I overlooked something. missing files are single end, right?

@d4straub
Copy link
Collaborator

Could you confirm if you ran with the default --tax2filter Homo sapiens ?

No, I used -profile test as detailed above and that contains tax2filter= 'unclassified', see

tax2filter = 'unclassified'

Are the files in filter/filtered contain anything when seqkit is used?

No, those are empty. And are supposed to be empty. Because unclassified reads are removed, and all are unclassified (because the database is a mock database...). Thats not an ideal check, but a fast one.

missing files are single end, right?

Right. Missing files are: test_single-end_short_filtered.fastq.gz, test_single-end_long_longReads_filtered.fastq.gz, test_paired-end_plus_long-reads_longReads_filtered.fastq.gz

@d4straub
Copy link
Collaborator

d4straub commented Aug 14, 2025

I cannot see any change compared to before with

nextflow pull m3hdad/detaxizer -r feature/new-filtering-tool
nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity --filtering_tool "bbmap" [...]

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 14, 2025

I know what's happening but leaving early for the long weekend. will fix that next week.

Here is what's going on: kraken2 in --paired mode emits fq header without paired info /1 and /2
bbtools/filterbyname.sh works on exact match. there is the possibility to turn on substring=t but I don't want to do that because the logic is shaky.

@d4straub
Copy link
Collaborator

I see, thanks for the explanation. Sure, no rush.

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 19, 2025

The last two commits fixes A and B.

The summary_classification is not accurate when both bbduk and kraken2 are used. This is because kraken2 in --paired mode trims the fastq header but bbduk performs on each fw and rv read separately. When filtering_tool is bbmap we don't touch the fq headers, so the isolated IDs are different from kraken2 and bbduk in terms of suffix (/1 /2).

I guess a more robust approach to summarize the result is to use the filtered fastq files themselves.

on top of that, it is a bit confusing, what does 25000 stand for in the summary report! are there 25000 paired reads for example? if that's the intention we need to be careful with the way isolate_bbduk_ids work because we take awk '!seen[$0]++' read_ids1.txt read_ids2.txt and since bbduk treats paired reads separately, if pairs differ the positional alignment breaks and the summary results will be confusing.

@d4straub
Copy link
Collaborator

I tested as in #80 (comment) and A isnt solved but B is. A seems indeed of lower priority since that table is not optimal as-is, I agree. I'm going to test with test_full profile next.

@d4straub
Copy link
Collaborator

I run test_full on our HPC (cfc):

#(1)
NXF_VER=25.04.6 nextflow run nf-core/detaxizer -r 1.2.0 -profile cfc,test_full --outdir results_1.2.0_full -resume
#(2)
NXF_VER=25.04.6 nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile cfc,test_full --outdir results_feature-new-filtering-tool_full_bbmap -resume --filtering_tool "bbmap"
#(3)
NXF_VER=25.04.6 nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile cfc,test_full --outdir results_feature-new-filtering-tool_full -resume

and found

measure (1): 1.2.0 (2): bbmap (3): seqkit
Walltime 23m 25s 10h 18m 8s 26m 28s
CPU hours 16.6 46.6 24.8

where BBMAP_FILTERBYNAME & BBMAP_FILTERBYNAME_REMOVED in (2) required 38m to 6h each while RENAME_FASTQ_HEADERS_PRE & RENAME_FASTQ_HEADERS_AFTER in (1) & (3) required 1m to 10m.

I attached the three trace files, feel free to check it out:
(1) execution_trace_2025-08-19_14-20-07.txt
(2) execution_trace_2025-08-19_15-29-48.txt
(3) execution_trace_2025-08-20_12-53-44.txt

So to conclude, looking only at the run time, I dont see improvements. I havent checked the output files. Do you have a reproducible example where the approach you implemented improves the runtime?

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 20, 2025

there is the possibility to turn on substring=t but I don't want to do that because the logic is shaky.

Well I kinda expected that using the substring option would negate the whole "fast runtime" idea in the first place. But 6h is such a drama!

well I made a fix which does not rely on substring matching. I'll test_full first and then push :D

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 21, 2025

I just pushed a mapping step to map kraken2 sequence IDs to fq1 (and in case fq2) by position.

now filterbyname.sh performs faster since it does not rely on substring search.

The file sizes might differ, most probably because of the way headers are renamed pre and after in seqkit route. I checked the headers from both tools and they match:

# FILTERBYNAME
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt

# SEQKIT
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txt

# DIFF
diff CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txt

On the test_full the run time is improved by about 30min, but on larger files that I tested the difference is up to an hour.

The summary numbers for pair end reads are the combination of R1 and R2 since the summarizing step relies on unique header names and in filterbynames.sh we don't touch R1 and R2 so they are different.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested run times with test_full and it looks to me on our HPC as if its a similar time. So thats seems fine.
When checking the output of removed reads headers were identical, but not quality in fastqs:
diff 'results_1.2.0_full/filter/removed/CAPES_S11_R1_removed.fastq' 'results_feature-new-filtering-tool_full_bbmap/filter/removed/CAPES_S11_removed_1.fastq'

1952c1952
< #0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
---
> !0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
2132c2132
< BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B#00#00<<BB7BFBBFFFBBFBBBFF
---
> BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B!00!00<<BB7BFBBFFFBBFBBBFF
3876c3876
< #0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
---
> !0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
5204c5204
< BBBFFFFFFFFFFIIFFFIII#0BB<BFFFFFFFIFFIIFFIIII##00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
---
> BBBFFFFFFFFFFIIFFFIII!0BB<BFFFFFFFIFFIIFFIIII!!00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B

(thats partial output)
Why is that?

@m3hdad
Copy link
Contributor Author

m3hdad commented Aug 21, 2025

I tested run times with test_full and it looks to me on our HPC as if its a similar time. So thats seems fine. When checking the output of removed reads headers were identical, but not quality in fastqs: diff 'results_1.2.0_full/filter/removed/CAPES_S11_R1_removed.fastq' 'results_feature-new-filtering-tool_full_bbmap/filter/removed/CAPES_S11_removed_1.fastq'

1952c1952
< #0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
---
> !0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
2132c2132
< BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B#00#00<<BB7BFBBFFFBBFBBBFF
---
> BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B!00!00<<BB7BFBBFFFBBFBBBFF
3876c3876
< #0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
---
> !0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
5204c5204
< BBBFFFFFFFFFFIIFFFIII#0BB<BFFFFFFFIFFIIFFIIII##00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
---
> BBBFFFFFFFFFFIIFFFIII!0BB<BFFFFFFFIFFIIFFIIII!!00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B

(thats partial output) Why is that?

Didn't know that BBtools I/O does that.

The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale. (see Removing N basecalls)

well I guess that's harmless and beneficial!

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve already, but please have a look at my minor comments above before merging.
The new function should be activated in a test profile (or a new test created), but that can be done in a separate PR and doesn't need to come from you.

@d4straub
Copy link
Collaborator

Thanks a lot for your effort!

@d4straub d4straub merged commit 4d0d5de into nf-core:dev Aug 21, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants