Skip to content

Conversation

@niyarin
Copy link
Contributor

@niyarin niyarin commented Dec 3, 2025

This PR fixes (partially) #137.
I applied NC-list to indexed chain.
Benchmark resources is generated by this code.

code
(require '[cljam.io.vcf :as vcf]
         '[varity.chain :as chain]
         '[varity.vcf-lift :as vl]
         '[cljam.io.sequence :as cseq])

(def rng (java.util.Random. 12345))

(defn rand-int* [n]
  (.nextInt rng n))

(defn %make-chain-data [dq-end seq-len]
  (let [size (+ 5 (rand-int* 500))
        dt (rand-int* 5)
        dq (rand-int* 5)
        new-dq-end (+ dq-end dq size)]
    (if (< new-dq-end seq-len)
      (cons {:size size :dt dt :dq dq}
            (lazy-seq (%make-chain-data new-dq-end seq-len)))
      [{:size (- seq-len dq-end)}])))


(defn make-chain [chr seq-len]
  (let [data (%make-chain-data 0 seq-len)
        t-size (+ (apply + (map :size data)) (apply + (keep :dt data)))]
    [{:header {:score 0
               :t-name chr :t-size t-size :t-start 0 :t-end t-size :t-strand :forward
               :q-name chr :q-size seq-len :q-start 0 :q-end seq-len :q-strand :forward
               :id 20}
      :data data}]))

(defn make-inverse-chain1 [chain]
  (let [{:keys [score t-name t-size t-strand t-start t-end
                  q-name q-size q-strand q-start q-end id]}
        (:header (first chain))
        data (:data (first chain))]
    [{:header {:score score
                :t-name q-name :t-start q-start :t-size q-size :t-end q-end :t-strand q-strand
                :q-name t-name :q-start t-start :q-size t-size :q-end t-end :q-strand t-strand
                :id id}
     :data (map (fn [{:keys [size dt dq]}]
                  {:size size :dt dq :dq dt})
                data)}]))

(defn write-chain-header [wtr header]
  (let [ {:keys [score t-name t-size t-strand t-start t-end
                  q-name q-size q-strand q-start q-end id]}
        header]
    (.write wtr (clojure.string/join
                    \space
                    ["chain" score
                     t-name t-size "+" t-start t-end
                     q-name q-size "+" q-start q-end id]))))

(defn write-chain-data [wtr data]
  (doseq [{:keys [size dt dq]} data]
    (if (and dt dq)
      (.write wtr (clojure.string/join \space [size dt dq]))
      (.write wtr (clojure.string/join \space [size])))
    (.write wtr "\n\n")))

(defn write-chain1 [fname chain-data]
  (with-open [wtr (clojure.java.io/writer (cljam.util/compressor-output-stream fname))]
    (let [{data :data
           header :header}
          (first chain-data)]
   (write-chain-header wtr header)
   (.write wtr "\n")
   (write-chain-data wtr data))))


(defn make-insert-variants-data [pos chr seq-len seq-rdr]
  (when (< pos seq-len)
    (let [ref' (cseq/read-sequence seq-rdr {:chr chr :start pos :end pos} {:mask? false})]
      (if (not= ref' "N")
          (cons
                {:chr chr
                 :pos pos
                 :ref ref'
                 :alt [(str ref' "TT")]
                 :qual 20
                 :filter nil
                 :info nil}
          (lazy-seq (make-insert-variants-data (+ pos 10 (rand-int* 100))  chr seq-len seq-rdr)))
          (lazy-seq (make-insert-variants-data (+ pos 10 (rand-int* 100))  chr seq-len seq-rdr))))))

(defn write-variants [fname variants]
  (with-open [wtr (vcf/writer fname
                              {}
                              ["CHROM" "POS" "ID" "REF" "ALT" "QUAL" "FILTER" "INFO"])]
    (vcf/write-variants
      wtr
      variants)))

(defn make-benchmark-resources [seq-path]
  (with-open [seq-rdr (cseq/reader seq-path)]
    (let [chain (make-chain "chr21" 46709983)
          inv-chain (make-inverse-chain1 chain)
          variants (make-insert-variants-data 1 "chr21"  46709983  seq-rdr)
          inv-chain-idx (chain/index inv-chain)
          un-liftovered-variants (keep #(vl/liftover-variant* inv-chain-idx %) variants)
          ]
      (write-chain1 "test-resources/large.chain.gz" chain)
      ;(write-chain1 "large-inv.chain.gz" inv-chain)
      (write-variants "test-resources/large.vcf" un-liftovered-variants)
      )))

;;
;; call (make-benchmark-resources seq-path)
;;

@niyarin niyarin force-pushed the fix/improve-find-chain-block branch from 0a9965b to 48d1c97 Compare December 3, 2025 06:12
@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.38%. Comparing base (fee6e9f) to head (c6f04fb).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #139      +/-   ##
==========================================
+ Coverage   48.26%   48.38%   +0.11%     
==========================================
  Files          16       16              
  Lines        2391     2412      +21     
  Branches       75       76       +1     
==========================================
+ Hits         1154     1167      +13     
- Misses       1162     1169       +7     
- Partials       75       76       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@niyarin niyarin force-pushed the fix/improve-find-chain-block branch 2 times, most recently from b647337 to c6f04fb Compare December 3, 2025 07:12
@niyarin niyarin requested a review from alumi December 4, 2025 02:22
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for identifying the bottleneck and achieving such a significant performance improvement! 👍
Before the fix, the benchmark run on my local machine was taking so long to finish, but now it’s extremely fast!⚡
I think the overall approach looks good. I’ve just left a few minor comments on code style, so please take a look.

@niyarin niyarin force-pushed the fix/improve-find-chain-block branch 2 times, most recently from 396a710 to 1adf97a Compare December 8, 2025 22:32
@niyarin
Copy link
Contributor Author

niyarin commented Dec 8, 2025

Thankyou for reviewing.

  • I have not remove clj-kondo's unresolved symbol warning for the benchmark name.
    • (although it can be resolved by creating a .clj-kondo/config.edn file and configuring it)
  • I reduced the benchmark VCF to 100,000 entries.

@niyarin niyarin force-pushed the fix/improve-find-chain-block branch from 1adf97a to ecfc682 Compare December 9, 2025 02:18
@niyarin niyarin marked this pull request as ready for review December 9, 2025 02:27
@niyarin niyarin requested a review from nokara26 December 9, 2025 02:30
Copy link
Contributor

@nokara26 nokara26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nokara26 nokara26 merged commit 4ac7837 into master Dec 9, 2025
46 checks passed
@nokara26 nokara26 deleted the fix/improve-find-chain-block branch December 9, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants