@a-ba a-ba commented Jan 22, 2026

This PR drastically improves the throughput of computing delta signatures on large files.

To test it:

mkdir rootfs
fallocate --length 4G rootfs/blob
time dar -c archive -R rootfs --delta sig:fixed:4096

before:

real    1m21,098s
user    0m40,691s
sys     0m39,587s

after:

real    0m8,752s
user    0m6,676s
sys     0m1,685s

Rationale

During archive creation, dar stores the signature in memory using the memoryfile class.

It appears that memoryfile has terrible scaling properties when subjected to short writes, because:

  1. the storage class stores each newly written block in a separate heap-allocated cellule object and these objects are collected in an internal linked list
  2. each call to memoryfile::inherited_write() implies at least one call to storage::size(), which fully traverses the linked list.
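
The two points above combine into quadratic cost. Here is a minimal sketch of that effect, assuming a toy container rather than dar's real code: `ChunkedStore` and `nodes_visited_after` are hypothetical names invented for illustration, and the real storage and memoryfile classes differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for a chunked store: every write appends one node.
class ChunkedStore {
public:
    // Total bytes stored. Walks every node, so the cost is O(number of nodes).
    std::size_t size() const {
        std::size_t total = 0;
        for (const auto& chunk : chunks_) {
            total += chunk.size();
            ++nodes_visited_;
        }
        return total;
    }

    // Appends the data as a new node, after asking for the current size,
    // which mimics a write path that calls size() at least once per call.
    void write(const char* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        (void)size();
        chunks_.emplace_back(data, data + len);
    }

    std::size_t nodes_visited() const { return nodes_visited_; }

private:
    std::list<std::vector<char>> chunks_;
    mutable std::size_t nodes_visited_ = 0;
};

// Performs `writes` one-byte writes and reports the total number of
// list nodes traversed along the way.
std::size_t nodes_visited_after(std::size_t writes) {
    ChunkedStore store;
    const char byte = 'x';
    for (std::size_t i = 0; i < writes; ++i) {
        store.write(&byte, 1);
    }
    return store.nodes_visited();
}
```

In this toy model, n short writes traverse n(n-1)/2 nodes in total, so doubling the number of writes roughly quadruples the work.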

As the file grows, throughput keeps falling. Rough measurements: 50 MB/s at t0, 3.3 MB/s at t0+0h20, 1.5 MB/s at t0+1h30, 1.2 MB/s at t0+2h30. At some point the CPU spends most of its time on page faults.

This patch modifies generic_rsync::inherited_read() so that its internal buffer is flushed only when full (rather than on every call). This reduces the length of the linked list by two orders of magnitude. Hashing a 48 GB file with a 4 KB block size now takes 4 minutes (instead of 8 hours previously).
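
The flush-only-when-full idea can be sketched in isolation. This is an illustration under stated assumptions, not the code of the patch: `CountingSink`, `BufferedWriter` and `downstream_writes` are hypothetical names, and the sink only counts how many writes reach it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical downstream store that only counts the writes it receives.
struct CountingSink {
    std::size_t calls = 0;
    std::size_t bytes = 0;
    void write(const char*, std::size_t len) {
        ++calls;
        bytes += len;
    }
};

// Accumulates short writes and forwards them only when the buffer is full.
class BufferedWriter {
public:
    BufferedWriter(CountingSink& sink, std::size_t capacity)
        : sink_(sink), buf_(capacity), used_(0) {
        assert(capacity > 0);
    }

    void write(const char* data, std::size_t len) {
        while (len > 0) {
            const std::size_t n = std::min(len, buf_.size() - used_);
            std::memcpy(buf_.data() + used_, data, n);
            used_ += n;
            data += n;
            len -= n;
            if (used_ == buf_.size()) {
                flush();  // forward only once the buffer is full
            }
        }
    }

    // Forwards whatever is pending; call once at end of stream.
    void flush() {
        if (used_ > 0) {
            sink_.write(buf_.data(), used_);
            used_ = 0;
        }
    }

private:
    CountingSink& sink_;
    std::vector<char> buf_;
    std::size_t used_;
};

// Sends `records` records of `record_len` bytes through a buffer of
// `capacity` bytes and returns how many writes reached the sink.
std::size_t downstream_writes(std::size_t records, std::size_t record_len,
                              std::size_t capacity) {
    CountingSink sink;
    BufferedWriter writer(sink, capacity);
    const std::vector<char> record(record_len, 'x');
    for (std::size_t i = 0; i < records; ++i) {
        writer.write(record.data(), record.size());
    }
    writer.flush();
    return sink.calls;
}
```

With a buffer 100 times larger than each record, the sink receives 100 times fewer writes, which is the kind of reduction described above.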

A longer-term solution would be to refactor memoryfile. Perhaps a naive std::vector would yield better results than the storage class: is the optimisation for arbitrary insertions really relevant?
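
To make the std::vector suggestion concrete, here is a rough sketch of a vector-backed in-memory file. `VectorFile` and its methods are hypothetical and do not mirror the memoryfile interface; the point is only that size() becomes O(1) and appends are amortized O(1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical in-memory file backed by one contiguous std::vector.
class VectorFile {
public:
    // O(1): the vector tracks its own length, no traversal needed.
    std::size_t size() const { return data_.size(); }

    // Moves the cursor; fails if the position is past the end of data.
    bool seek(std::size_t pos) {
        if (pos > data_.size()) {
            return false;
        }
        pos_ = pos;
        return true;
    }

    // Writes at the cursor, growing the vector when writing past the end.
    void write(const char* src, std::size_t len) {
        if (len == 0) {
            return;
        }
        if (pos_ + len > data_.size()) {
            data_.resize(pos_ + len);
        }
        std::memcpy(data_.data() + pos_, src, len);
        pos_ += len;
    }

    // Reads up to `len` bytes from the cursor; returns the count read.
    std::size_t read(char* dst, std::size_t len) {
        assert(pos_ <= data_.size());
        const std::size_t n = std::min(len, data_.size() - pos_);
        if (n > 0) {
            std::memcpy(dst, data_.data() + pos_, n);
        }
        pos_ += n;
        return n;
    }

private:
    std::vector<char> data_;
    std::size_t pos_ = 0;
};
```

The trade-off is that inserting in the middle would require shifting bytes, which is the "arbitrary insertions" case the linked structure optimises for.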

The PR also fixes a possible null-pointer dereference when generic_rsync::send_eof() is called during delta creation. send_eof() has no use when creating a delta (it is only relevant for creating signatures), and it may dereference x_output, which is null when creating a delta.
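
The general shape of such a guard might look like the sketch below. Everything in it is hypothetical (`Mode`, `Output`, `Pipeline`, `finish`); it is not generic_rsync's actual code, only an illustration of skipping an end-of-file notification when the mode does not need it or the output pointer is null.

```cpp
#include <cassert>

// Hypothetical operating modes, loosely mirroring the two use cases.
enum class Mode { Signature, Delta };

// Hypothetical output object that records whether EOF was signalled.
struct Output {
    bool eof = false;
};

class Pipeline {
public:
    Pipeline(Mode mode, Output* out) : mode_(mode), out_(out) {
        // Assumption for this sketch: signature mode always has an output.
        assert(mode != Mode::Signature || out != nullptr);
    }

    // Signals EOF only where it is meaningful and safe.
    // Returns true if the EOF was actually forwarded.
    bool finish() {
        if (mode_ != Mode::Signature || out_ == nullptr) {
            return false;  // nothing to notify, and no pointer to touch
        }
        out_->eof = true;
        return true;
    }

private:
    Mode mode_;
    Output* out_;
};
```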

@Edrusb Edrusb self-assigned this Jan 22, 2026
Edrusb (Owner) commented Jan 22, 2026

Thanks for this feedback and proposal.

The storage class is probably one of the oldest classes in dar, and its purpose was to store the bits of arbitrarily long integers (infinint class)... this was year 2001, the fear of the bug of that time was still warm... :)

I have not reviewed this code in decades, but I extended its use a few years later for the memoryfile class, as you have noted...

Well there is clearly something here to review and I thank you for pointing me to it!

For the short term, I will review your pull request, and delay the 2.8.3 release to see how to include your proposal in it.

If you can rebase your pull request on branch_2.8.x, that will help me :) Thanks

@Edrusb Edrusb added the enhancement behavor/feature enhancement label Jan 22, 2026
a-ba added 2 commits January 22, 2026 23:42
send_eof() has no use when creating a delta (it is only relevant for
creating signatures)

furthermore it may dereference x_output (which is null when creating a delta)
@a-ba a-ba changed the base branch from master to branch_2.8.x January 23, 2026 00:03
a-ba (Author) commented Jan 23, 2026

> had the purpose of storing the bits of arbitrarily long integers

OK, I see ;)

The branch is rebased!
