Bug 580137 - DateRevQueue.buildIndex called thousands of times during fetch
Summary: DateRevQueue.buildIndex called thousands of times during fetch
Status: NEW
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: 5.13   Edit
Hardware: PC All
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-10 18:37 EDT by Luca Milanesio CLA
Modified: 2022-06-13 18:10 EDT (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Milanesio CLA 2022-06-10 18:37:31 EDT
When running a JGit fetch on a local repository with millions of refs, the BasePackFetchConnection.negotiateBegin() needs to mark all local commits pointed by the refs on a RevWalk.

The RevWalk.markStart(Collection<RevCommit>) does not manage this particular bulk addition and instead is calling the individual markStart() for each commit, which will re-calculate the index every 1000 commits.

The problem was previously identified by Gustaf Lundh back in 2013 with a repository with 80k refs and that's why he identified batches of 1000 commits which would have reduced the number of reindexes to 80.

He admitted that the static bucketing of 1000 was not ideal and he mentioned:
"In the future, it should probably be dynamic and based on
the number of refs in the queue, but this should serve well
as a starting point."

I believe the "future" has come and we need to make the bucketing more dynamic based on the total number of refs in the repository.
Comment 1 Luca Milanesio CLA 2022-06-12 17:02:28 EDT
By timing the time to run buildIndex(), it looks quadratic and it doubles every 1000s refs.

Example:
6000 refs => buildIndex takes 124us
7000 refs => buildIndex takes 226us
Comment 2 Luca Milanesio CLA 2022-06-13 08:18:09 EDT
When testing with a repository with 1.5M refs, the buildIndex() execution times goes up to 42ms per call, but it is called 1500 times, which eventually takes 33s to complete (the earlier buildIndex iterations are faster than 42ms).

I am now going to try to change the implementation and use a standard priority queue and see the progress.
Comment 3 Luca Milanesio CLA 2022-06-13 16:00:05 EDT
The complexity of the current DevRevQueue is O(n^2/2) whilst the Java priority queue will be O(n log n).

I'll create an alternative implementation (DevRevPriorityQueue) that relies on a standard java priority queue and share the resulting E2E performance improvements on the Git fetch protocol.
Comment 4 Luca Milanesio CLA 2022-06-13 16:47:41 EDT
I've done an initial re-implementation of DevRevQueue (named DevRevPriorityQueue) which uses a Java PriorityQueue: the performance results are astonishing !!!

DevRevQueue.add(RevCommit) - 850k times: 4515 ms
DevRevPriorityQueue.add(RevCommit) - 850k times: 58 ms

The priority queue is 77x time faster.
Comment 5 Eclipse Genie CLA 2022-06-13 18:10:59 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/194140