Feature/new filtering tool #80

m3hdad · 2025-08-12T14:51:36Z

This PR adds bbmap/filterbyname.sh as an alternative read-filtering tool, selectable via the --filtering_tool parameter.

bbmap/filterbyname.sh natively handles paired-end reads, eliminating the need for FASTQ header renaming before and after filtering.
This results in a significant runtime improvement.

The default value for --filtering_tool remains seqkit to preserve the current workflow behavior.

PR checklist

d4straub · 2025-08-13T07:08:28Z

Thank you for the PR! On the first glance it looks all fine and I'll have a more in depth look soon.

d4straub · 2025-08-13T13:05:42Z

I tested
(1) nextflow run nf-core/detaxizer -r 1.2.0 -profile test,singularity [...]
(2) nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity [...]
(3) nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity --filtering_tool "bbmap" [...]
where [...] is just outdir and resource config.

(A) The summary/summary.tsv was identical for (1) and (2) but differed in (3) in lines:
test_paired-end_plus_long-reads: 26000 -> 28000
test_paired-end: 25993 -> 27993
(B) Additionally, while all 5 test samples seem to have been processed and listed in summary/summary.tsv, in folder filter/filtered were in (3) only 4 files instead of 7 as produced by (1) and (2). The file sizes were also dramatically different.

My next step would have been to run with -profile test_full, but I postpone that for now.

I would expect that the outcome should be the same for --filtering_tool "bbmap" or --filtering_tool "seqkit", correct?

m3hdad · 2025-08-13T15:06:09Z

Thanks,

(A) The filtered reads are empty files on my runs with seqkit. I cannot check the differences.
Could you confirm if you ran with the default --tax2filter Homo sapiens ?
Are the files in filter/filtered contain anything when seqkit is used?

(B) I'm guessing there is an issue with the nf-core module when there are single reads not paired. Or I overlooked something. missing files are single end, right?

d4straub · 2025-08-14T06:57:41Z

Could you confirm if you ran with the default --tax2filter Homo sapiens ?

No, I used -profile test as detailed above and that contains tax2filter= 'unclassified', see

detaxizer/conf/test.config

Line 38 in 698eedc

tax2filter = 'unclassified'

Are the files in filter/filtered contain anything when seqkit is used?

No, those are empty. And are supposed to be empty. Because unclassified reads are removed, and all are unclassified (because the database is a mock database...). Thats not an ideal check, but a fast one.

missing files are single end, right?

Right. Missing files are: test_single-end_short_filtered.fastq.gz, test_single-end_long_longReads_filtered.fastq.gz, test_paired-end_plus_long-reads_longReads_filtered.fastq.gz

d4straub · 2025-08-14T08:13:27Z

I cannot see any change compared to before with

nextflow pull m3hdad/detaxizer -r feature/new-filtering-tool
nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile test,singularity --filtering_tool "bbmap" [...]

m3hdad · 2025-08-14T08:57:09Z

I know what's happening but leaving early for the long weekend. will fix that next week.

Here is what's going on: kraken2 in --paired mode emits fq header without paired info /1 and /2
bbtools/filterbyname.sh works on exact match. there is the possibility to turn on substring=t but I don't want to do that because the logic is shaky.

d4straub · 2025-08-14T09:29:08Z

I see, thanks for the explanation. Sure, no rush.

…ader

m3hdad · 2025-08-19T11:53:37Z

The last two commits fixes A and B.

The summary_classification is not accurate when both bbduk and kraken2 are used. This is because kraken2 in --paired mode trims the fastq header but bbduk performs on each fw and rv read separately. When filtering_tool is bbmap we don't touch the fq headers, so the isolated IDs are different from kraken2 and bbduk in terms of suffix (/1 /2).

I guess a more robust approach to summarize the result is to use the filtered fastq files themselves.

on top of that, it is a bit confusing, what does 25000 stand for in the summary report! are there 25000 paired reads for example? if that's the intention we need to be careful with the way isolate_bbduk_ids work because we take awk '!seen[$0]++' read_ids1.txt read_ids2.txt and since bbduk treats paired reads separately, if pairs differ the positional alignment breaks and the summary results will be confusing.

d4straub · 2025-08-19T12:22:51Z

I tested as in #80 (comment) and A isnt solved but B is. A seems indeed of lower priority since that table is not optimal as-is, I agree. I'm going to test with test_full profile next.

d4straub · 2025-08-20T11:44:43Z

I run test_full on our HPC (cfc):

#(1)
NXF_VER=25.04.6 nextflow run nf-core/detaxizer -r 1.2.0 -profile cfc,test_full --outdir results_1.2.0_full -resume
#(2)
NXF_VER=25.04.6 nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile cfc,test_full --outdir results_feature-new-filtering-tool_full_bbmap -resume --filtering_tool "bbmap"
#(3)
NXF_VER=25.04.6 nextflow run m3hdad/detaxizer -r feature/new-filtering-tool -profile cfc,test_full --outdir results_feature-new-filtering-tool_full -resume

and found

measure	(1): 1.2.0	(2): bbmap	(3): seqkit
Walltime	23m 25s	10h 18m 8s	26m 28s
CPU hours	16.6	46.6	24.8

where BBMAP_FILTERBYNAME & BBMAP_FILTERBYNAME_REMOVED in (2) required 38m to 6h each while RENAME_FASTQ_HEADERS_PRE & RENAME_FASTQ_HEADERS_AFTER in (1) & (3) required 1m to 10m.

I attached the three trace files, feel free to check it out:
(1) execution_trace_2025-08-19_14-20-07.txt
(2) execution_trace_2025-08-19_15-29-48.txt
(3) execution_trace_2025-08-20_12-53-44.txt

So to conclude, looking only at the run time, I dont see improvements. I havent checked the output files. Do you have a reproducible example where the approach you implemented improves the runtime?

m3hdad · 2025-08-20T15:37:47Z

there is the possibility to turn on substring=t but I don't want to do that because the logic is shaky.

Well I kinda expected that using the substring option would negate the whole "fast runtime" idea in the first place. But 6h is such a drama!

well I made a fix which does not rely on substring matching. I'll test_full first and then push :D

m3hdad · 2025-08-21T10:42:49Z

I just pushed a mapping step to map kraken2 sequence IDs to fq1 (and in case fq2) by position.

now filterbyname.sh performs faster since it does not rely on substring search.

The file sizes might differ, most probably because of the way headers are renamed pre and after in seqkit route. I checked the headers from both tools and they match:

# FILTERBYNAME
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt

# SEQKIT
zcat CAPES_S11_longReads_filtered.fastq.gz | grep "@" > CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txt

# DIFF
diff CAPES_S11_longReads_filtered.fastq.gz_filterbyname_headers.txt CAPES_S11_longReads_filtered.fastq.gz_seqkit_headers.txt

On the test_full the run time is improved by about 30min, but on larger files that I tested the difference is up to an hour.

The summary numbers for pair end reads are the combination of R1 and R2 since the summarizing step relies on unique header names and in filterbynames.sh we don't touch R1 and R2 so they are different.

d4straub

I tested run times with test_full and it looks to me on our HPC as if its a similar time. So thats seems fine.
When checking the output of removed reads headers were identical, but not quality in fastqs:
diff 'results_1.2.0_full/filter/removed/CAPES_S11_R1_removed.fastq' 'results_feature-new-filtering-tool_full_bbmap/filter/removed/CAPES_S11_removed_1.fastq'

1952c1952
< #0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
---
> !0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
2132c2132
< BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B#00#00<<BB7BFBBFFFBBFBBBFF
---
> BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B!00!00<<BB7BFBBFFFBBFBBBFF
3876c3876
< #0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
---
> !0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
5204c5204
< BBBFFFFFFFFFFIIFFFIII#0BB<BFFFFFFFIFFIIFFIIII##00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
---
> BBBFFFFFFFFFFIIFFFIII!0BB<BFFFFFFFIFFIIFFIIII!!00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B

(thats partial output)
Why is that?

modules/local/map_kraken2seqids_to_fqheaders.nf

nextflow_schema.json

workflows/detaxizer.nf

m3hdad · 2025-08-21T12:53:48Z

I tested run times with test_full and it looks to me on our HPC as if its a similar time. So thats seems fine. When checking the output of removed reads headers were identical, but not quality in fastqs: diff 'results_1.2.0_full/filter/removed/CAPES_S11_R1_removed.fastq' 'results_feature-new-filtering-tool_full_bbmap/filter/removed/CAPES_S11_removed_1.fastq'
1952c1952
< #0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
---
> !0<FFFFFFFFFFFIIIIFIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIFFFFFFFF<BFFFFFFFFBFFBFFFFFFFFFFB
2132c2132
< BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B#00#00<<BB7BFBBFFFBBFBBBFF
---
> BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFB<BFFFFFFFFFFFBBBFFFBFFFF<7B!00!00<<BB7BFBBFFFBBFBBBFF
3876c3876
< #0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
---
> !0<FFFFFFFFFFIIIIFIIFIIFIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIBFFFFFFFFFFBFFFBFBFFBFFFFFFFFFFFFBBF
5204c5204
< BBBFFFFFFFFFFIIFFFIII#0BB<BFFFFFFFIFFIIFFIIII##00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
---
> BBBFFFFFFFFFFIIFFFIII!0BB<BFFFFFFFIFFIIFFIIII!!00<BFFF'<BFFFIFFFFFFFFBBFFFF<BBBFBFBFFFFBBFFFBFFBF<<B
(thats partial output) Why is that?

Didn't know that BBtools I/O does that.

The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale. (see Removing N basecalls)

well I guess that's harmless and beneficial!

d4straub

I approve already, but please have a look at my minor comments above before merging.
The new function should be activated in a test profile (or a new test created), but that can be done in a separate PR and doesn't need to come from you.

d4straub · 2025-08-21T13:27:06Z

Thanks a lot for your effort!

m3hdad and others added 8 commits August 12, 2025 12:41

bbmap/filterbyname module added

0e1c541

feat: add bbmap filter option

e6f74c2

fix missing curly braces

6e36a00

fix missing curly braces

ee2b83a

fix single channel invocation for bbmap/filterbyname module

374ad20

define filterbyname_remove ext.args in modules.conf

51e8907

add rename&&remove to bbmap_filterbyname out names

111ee43

modify changelog

1c33d30

bbmap/filterbyname should drop reads it sees

93f5396

m3hdad added 2 commits August 19, 2025 13:42

filterbyname.sh removes isolated kraken2 IDs as substring of fastq he…

1fe5f61

…ader

update publish pattern to include single end reads.

b9eebd2

m3hdad and others added 3 commits August 21, 2025 11:09

map kraken2 seqids to fq headers by position

9bcea2d

Align Kraken2 read IDs with FASTQ headers

0ca8e1a

escape correctly

732bd35

d4straub reviewed Aug 21, 2025

View reviewed changes

d4straub approved these changes Aug 21, 2025

View reviewed changes

fix reviewer comments

f55f831

d4straub merged commit 4d0d5de into nf-core:dev Aug 21, 2025
13 checks passed

Feature/new filtering tool #80

Feature/new filtering tool #80

Uh oh!

Conversation

m3hdad commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR checklist

Uh oh!

d4straub commented Aug 13, 2025

Uh oh!

d4straub commented Aug 13, 2025

Uh oh!

m3hdad commented Aug 13, 2025

Uh oh!

d4straub commented Aug 14, 2025

Uh oh!

d4straub commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

m3hdad commented Aug 14, 2025

Uh oh!

d4straub commented Aug 14, 2025

Uh oh!

m3hdad commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

d4straub commented Aug 19, 2025

Uh oh!

d4straub commented Aug 20, 2025

Uh oh!

m3hdad commented Aug 20, 2025

Uh oh!

m3hdad commented Aug 21, 2025

Uh oh!

d4straub left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

m3hdad commented Aug 21, 2025

Uh oh!

d4straub left a comment

Choose a reason for hiding this comment

Uh oh!

d4straub commented Aug 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

m3hdad commented Aug 12, 2025 •

edited

Loading

d4straub commented Aug 14, 2025 •

edited

Loading

m3hdad commented Aug 19, 2025 •

edited

Loading