Bug 577948 - Git history view should be updated as soon as a matching commit has been found
Summary: Git history view should be updated as soon as a matching commit has been found
Status: NEW
Alias: None
Product: EGit
Classification: Technology
Component: UI (show other bugs)
Version: 6.1   Edit
Hardware: PC Windows 10
: P3 enhancement with 1 vote (vote)
Target Milestone: ---   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 574368
Blocks:
  Show dependency tree
 
Reported: 2021-12-23 03:54 EST by Simon Sohrt CLA
Modified: 2023-12-11 06:38 EST (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Simon Sohrt CLA 2021-12-23 03:54:12 EST
Both git log and gitk display matching commits as soon as they are found. This is very fast and I find this behavior desirable. 

In contrast to that, the git history view displays results only after all reachable commits have been processed. 

I downloaded the source to EGit/JGit and have already narrowed down the problem somewhat. It looks to me that this problem consists of two parts. 

First part: 
----------

When the fillTo-Method of GenerateHistoryJob is called for the first time, StartGenerator.next() will be called. Within the next()-method a new generator is instantiated: 
g = new FIFORevQueue(g); 

The problem is that during instantiation of the new generator, the old generator (PendingGenerator) is completely exhausted by going through all reachable commits that are accepted by the filter. During this time the UI is not updated. 

I think this is an implementation bug because the GenerateHistoryJob.run()-method uses the GenerateHistoryJob.fillTo-method expecting it to only load as many commits as specified by highMark (but instead it loads all commits). 

Here is my stack when PendingGenerator is in the process of being exhausted during the first call to GenerateHistoryJob.fillTo: 
Thread [Worker-5: Reading history from Git repository 'webapps'] (Suspended (breakpoint at line 99 in PendingGenerator))	
	PendingGenerator.next() line: 99	
	FIFORevQueue(BlockRevQueue).<init>(Generator) line: 40	
	FIFORevQueue.<init>(Generator) line: 37	
	StartGenerator.next() line: 133	
	GitHistoryWalk(RevWalk).next() line: 601	
	GitHistoryWalk(PlotWalk).next() line: 118	
	GitHistoryWalk(SWTWalk).next() line: 61	
	GitHistoryWalk.next() line: 56	
	GenerateHistoryJob$1(RevCommitList<E>).fillTo(int) line: 274	
	GenerateHistoryJob.run(IProgressMonitor) line: 122	
	Worker.run() line: 63	


Second part: 
------------

I believe that the BATCH_SIZE in GenerateHistoryJob is too high (it is currently set to 256). Even after addressing the first part of the problem, the UI would only be updated after 256 commits have been found. When I use the default settings of eclipse only the first eight commits are displayed on my monitor. 

I propose that either the BATCH_SIZE is set to a significantly smaller value (maybe 5) or that the UI is upated every time a new commit has been found. 


Reproduction of the problem: 
----------------------------
We have a fairly large repository with around 297746 reachable commits from our HEAD. We selected a file that has more than 256 commits. We then activate the following tracing flags: 
org.eclipse.egit.ui/debug=true
org.eclipse.egit.ui/debug/ui/historyview=true

When we display the history view of a file, we see the following output: 
| Worker-11: Reading history from Git repository 'webapps' | 2021-12-23 09:22:14.504 | org.eclipse.egit.ui | /debug/ui/historyview | org.eclipse.egit.ui.internal.history.GenerateHistoryJob | run | 104 | Filling commit list |
| main | 2021-12-23 09:22:14.504 | org.eclipse.egit.ui | /debug/ui/historyview | org.eclipse.egit.ui.internal.history.GitHistoryPage | setInput | 1978 | Exiting method with a void return |
| Worker-11: Reading history from Git repository 'webapps' | 2021-12-23 09:22:19.653 | org.eclipse.egit.ui | /debug/ui/historyview | org.eclipse.egit.ui.internal.history.GenerateHistoryJob | run | 156 | Loaded 256 commits |

We see that the loading of the first 256 commits takes around ~5s. 
Subsequent runs of the GenerateHistoryJob take only around 1ms since all the work has already been done in the first run.
Comment 1 Thomas Wolf CLA 2021-12-23 04:56:28 EST
With parent rewriting on, JGit will load the whole history anyway. Parent rewriting is on if the history is filtered to some paths.

A BATCH_SIZE of 5 would be way too small for incremental loading when the user scrolls the table.

A UI update on every commit found is not a good idea; would generate way too many updates when commits are found quickly.

I'm not happy with the whole implementation of the git history filling. It's old code that has been subverted to several additional use cases (incremental loading, but not when the find toolbar is on; using the find toolbar is sometimes cumbersome because of focus hacks that make some keyboard being ignored, UI updates triggered by that background GenerateHistoryJob directly, nearly incomprehensible logic about when updates are done and when they are suppressed, interactions between the UI update (setting input on the virtual table) and triggering more loads, ...).

This should be completely rewritten. It's basically a classic producer-consumer problem, with the commit list as queue in between. The job should be a pure producer, not caring about UI updates at all. UI updates should be a pure consumer, being triggered by commits being added to the list. (With some logic to prevent too many UI updates in a short timespan; maybe one UI update at most every 200ms, adding all new commits added since the last update to the table.)

The main problem with such a rewrite is that it would be a lot of work and would need lots of additional UI tests to ensure nothing breaks, and UI testing of this history view is difficult because of all the asynchronous work that's going on.

Also, JGit's history generation should be improved to not load the full history even if parent rewriting is on. That in itself is difficult (but doable -- at least I think so). Basically it would need to buffer only as far ahead until all parents of a commit to be emitted have been rewritten; then it could emit that commit and continue. I looked at that some time ago, but didn't have the time to  implement anything -- it's again a pretty large change to fairly central infrastructure in JGit and also would need extensive testing.
Comment 2 Simon Sohrt CLA 2022-04-01 10:10:22 EDT
(In reply to Thomas Wolf from comment #1)
> Also, JGit's history generation should be improved to not load the full
> history even if parent rewriting is on. That in itself is difficult (but
> doable -- at least I think so). Basically it would need to buffer only as
> far ahead until all parents of a commit to be emitted have been rewritten;
> then it could emit that commit and continue. I looked at that some time ago,
> but didn't have the time to  implement anything -- it's again a pretty large
> change to fairly central infrastructure in JGit and also would need
> extensive testing.

Then making JGit's history generation faster should be the first step towards a faster history view in EGit. I looked into the relevant code and found that StartGenerator creates the following generator chain when computing the history: 
PendingGenerator -> FIFORevQueue -> RewriteGenerator -> TopoSortGenerator 

FIFORevQueue and TopoSortGenerator are problematic because they both completely exhaust the previous generator. Let's first focus on removing FIFORevQueue from the chain by modifying RewriteGenerator. 

When we remove FIFORevQueue from the chain, the rewrite()-method of RewriteGenerator does not work anymore, because the rewrite()-method assumes that all commits have been parsed. This can be fixed by checking for the PARSED-flag in the parents commits. If a commit has not yet been PARSED, we need to call the next()-method of the previous generator until the parent has been PARSED and thus the REWRITE-flag has been written correctly. Since it is now possible that we call the next()-method of the previous generator multiple times within the next()-method of the RewriteGenerator, we must buffer the commits that we have already received from the previous generator. 

I plan to submit a modified RewriteGenerator and a unit-test for the RewriteGenerator in the next week.
Comment 3 Eclipse Genie CLA 2022-04-05 04:36:34 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/192501
Comment 4 Simon Sohrt CLA 2022-04-06 06:12:37 EDT
I submitted a patch for RewriteGenerator but did not write a unit-test for it, because I realized that RewriteGenerator is already indirectly covered by tests in  org.eclipse.jgit.test/tst/org/eclipse/jgit/revwalk 

After modifying RewriteGenerator, I focused on TopoSortGenerator. 

After reading https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/ & https://www.spinics.net/lists/git/msg161308.html, I realized that making TopoSortGenerator not exhaust its previous generator is a much harder problem. 

There are only two solutions: 
1. Use generation numbers:
After Bug 574368 is fixed, JGit can parse generation numbers from the commit-graph file. We should create a new topo generator (maybe called CommitGraphBasedTopoSortGenerator) that is inserted into the generator chain by StartGenerator when core.commitGraph is true and a commit-graph file is present.

2. Use the "gitk approach": 
Gitk outputs commits without doing a topological sort beforehand. If a parent commit has been emitted before a child commit, gitk simply inserts the child commit before the already emitted parent commit. This has the side effect that the graphical output may "wobble" because already printed commits can switch lanes during the parsing of the git repo. 

What would it take to implement this approach in EGit/JGit? 
- The history view should no longer require the TOPO_SORT flag. Together with the patch for the RewriteGenerator, this would make emitting commits from the generator chain iterative. 
- The history view must redraw _all_ commits every time a batch of new commits is emitted from the generator chain because the lanes of the already emitted commits might have changed. 

This approach has the following cons: 
- The UI of the history view behaves differently ("wobbles")
- Significant code changes are necessary in EGit/JGit. Because the UI is not covered by unit tests, UI bugs might be introduced by the code changes. 

Conclusion: 
-----------
I prefer the generation number approach because fewer code changes are necessary and the UI behavior does not change. I have therefore added a dependency to bug 574368.
Comment 5 Simon Sohrt CLA 2022-04-29 10:20:55 EDT
I was also wondering if I can support the patch process in any way? I also wondered, why the patches for bug 574368 have not been applied yet? What's the reason? 

Just out of curiosity I did some preliminary tests to quantify the possible speed up of my proposed changes. Here's what I did: 

I created a version of EGit/JGit with my patched RewriteGenerator and removed the TopoSortGenerator from the generator chain. 

The patch can be downloaded from here: https://myshare.his.de/s/htGzR7nYC45qmH2. Just unpack it into your plugins folder of your eclipse 2022-03 installation. 

I am aware that the commits might now be in the wrong order. I suspect my proposed CommitGraphBasedTopoSortGenerator (from comment 4) won't be _much_ slower than skipping topological sort altogether ;).  

I then checked the speed-up on the Linux kernel on the CREDITS file (this file was there since the beginning of the git history of the linux kernel). My machine has the following processor: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz. I measured the time until an output occurred. 

Here are the results: 
Without patch	| With patch
~17s		| ~2.5s	

I think that is a significant speed-up and I would be glad if the patch will be applied.
Comment 6 Thomas Wolf CLA 2022-05-01 03:23:20 EDT
(In reply to Simon Sohrt from comment #5)
> I was also wondering if I can support the patch process in any way? I also
> wondered, why the patches for bug 574368 have not been applied yet? What's
> the reason? 

Same problem as always: not enough reviewers :-(. I've seen Kyle's change sequence, but didn't find the time yet to start reviewing it. It's also an area I'm unfamiliar with, so it'll take me some time to get up to speed.

It would help if other people familiar with this commit-graph feature and generation numbers and maybe even with the C git implementation could review the changes. That would take some pressure off the two active maintainers on this project (Matthias and me), and would speed up things. To review something one doesn't need to be a committer.

Having only two people doing technical reviews (in their free time!) just doesn't scale.
Comment 7 Simon Sohrt CLA 2022-05-25 03:33:22 EDT
Do you have an estimate when the gerrit change request will probably be merged? Will it make it into 2022.06 or is it too late?
Comment 8 Thomas Wolf CLA 2022-05-25 06:21:05 EDT
(In reply to Simon Sohrt from comment #7)
> Do you have an estimate when the gerrit change request will probably be
> merged? Will it make it into 2022.06 or is it too late?

Your change for the rewrite generator or the commit graph feature?

Yours will go in before 2022.06. I have not forgotten it and started with it before seeing your notice here.

Kyle's commit graph feature won't make it for 2022.06. If nobody else starts reviewing it, I hope I can give it some attention. But it doesn't have to be Matthias or me doing the review alone.
Comment 10 Simon Sohrt CLA 2022-05-25 08:35:35 EDT
Thank you for the merge and the fast response :)

I can maybe spare some time to review the commit-graph feature - I will probably come back to you next week. 

I was also thinking about another solution: 
Topological sorting is not important for my organization. We would be fine if we could just turn it off. Is it possible to introduce a preference for turning topological sorting off? It would be acceptable to us if this preference is not exposed through the UI of eclipse aka. you have to edit the preference file of EGit/JGit directly to change the value.
Comment 11 Thomas Wolf CLA 2022-05-25 09:45:18 EDT
(In reply to Simon Sohrt from comment #10)
> Is it possible to introduce a preference for
> turning topological sorting off?

PlotWalk requires a topological sort.
Comment 12 Simon Sohrt CLA 2022-07-01 10:39:46 EDT
(In reply to Simon Sohrt from comment #10)
> I can maybe spare some time to review the commit-graph feature - I will
> probably come back to you next week. 

Took me a bit longer, sorry. 

I am interested in doing a review, but I doubt it will be helpful. Here are the reasons: 
- I am also unfamiliar with the area
- I have not much experience writting c code 
- I have never reviewed code before
- From reading the review comments up until this point, I am not sure what the sticking point is 

If you still believe I could be of help, let me know and I'll give it a shot. 

I am still committed to writing the proposed CommitGraphBasedTopoSortGenerator after Kyle's changes have been merged. 


(In reply to Thomas Wolf from comment #11)
> PlotWalk requires a topological sort.

Yeah, I know. We are currently using a patched EGit where Plotwalk no longer require topological sort. This is okay for our organization because we are not using branches in the relevant project (this history is just one long line).
Comment 13 Kyle Zhao CLA 2022-10-21 04:08:31 EDT
(In reply to Simon Sohrt from comment #12)
> (In reply to Simon Sohrt from comment #10)
> > I can maybe spare some time to review the commit-graph feature - I will
> > probably come back to you next week. 
> 
> Took me a bit longer, sorry. 
> 
> I am interested in doing a review, but I doubt it will be helpful. Here are
> the reasons: 
> - I am also unfamiliar with the area
> - I have not much experience writting c code 
> - I have never reviewed code before
> - From reading the review comments up until this point, I am not sure what
> the sticking point is 
> 
> If you still believe I could be of help, let me know and I'll give it a
> shot. 
> 
> I am still committed to writing the proposed
> CommitGraphBasedTopoSortGenerator after Kyle's changes have been merged. 
> 
> 
> (In reply to Thomas Wolf from comment #11)
> > PlotWalk requires a topological sort.
> 
> Yeah, I know. We are currently using a patched EGit where Plotwalk no longer
> require topological sort. This is okay for our organization because we are
> not using branches in the relevant project (this history is just one long
> line).

hi, Simon. Glad you noticed my changes.

Some time ago, because of the busy work, I didn't pay much attention to the maintenance of this part. But now I have a lot of time to work on this.

I'm looking forward to your review and then implement CommitGraphBasedTopoSortGenerator.
Comment 14 Simon Sohrt CLA 2023-04-28 09:19:42 EDT
I just submitted a bugfix for the RewriteGenerator. The bugfix can be found here: 
https://git.eclipse.org/r/c/jgit/jgit/+/201587

For some reason the bugtracker did not add the gerrit change to this bug report...

The unit tests also failed, but it looks like a general problem on the jenkins.  

I have also finished the CommitGraphBasedTopoSortGenerator and will submit it as soon as the patch for the RewriteGenerator has been reviewed.
Comment 15 Matthias Sohn CLA 2023-04-28 11:00:06 EDT
The build failures were caused by https://github.com/eclipse/dash-licenses/issues/232 which was fixed today
Comment 16 Simon Sohrt CLA 2023-05-22 09:05:54 EDT
I have implemented CommitGraphBasedTopoSortGenerator in the following Gerrit Change: https://git.eclipse.org/r/c/jgit/jgit/+/201964
Comment 17 Simon Sohrt CLA 2023-07-27 04:30:06 EDT
Can I assist with the review process in any way? I would love to get my patch merged into the 2023-09 release. 

I would greatly appreciate a partial review if you don't find the time for a full review.
Comment 18 Simon Sohrt CLA 2023-11-27 04:22:12 EST
There are now merge conflicts with my patch. 

Would you review and eventually merge my patch if I resolve the merge conflicts?
Comment 19 Matthias Sohn CLA 2023-11-27 08:37:47 EST
(In reply to Simon Sohrt from comment #18)
> There are now merge conflicts with my patch. 
> 
> Would you review and eventually merge my patch if I resolve the merge
> conflicts?

yes, I would.

Please note that we moved the repositories to GerritHub at https://eclipse.gerrithub.io/admin/repos/q/filter:eclipse, see https://wiki.eclipse.org/EGit/Contributor_Guide#Using_Gerrit
Comment 20 Simon Sohrt CLA 2023-11-27 08:54:42 EST
Thanks for your fast reply and the info about the moved repos. I am currently in the process of finishing something else for work, but after that, I will resolve the merge conflicts. I hope I will find time this week...
Comment 21 Simon Sohrt CLA 2023-12-11 06:38:59 EST
Okay, I locally resolved the merge conflicts and just pushed to gerrithub. Gerrithub gave me a little trouble at the beginning but I eventually figured it out...
URL: https://review.gerrithub.io/c/eclipse-jgit/jgit/+/1173387

# Design considerations # 
I copied the architecture and method names from cgit. I based my work on an older commit that only uses generation number version 1 (since only version 1 is currently supported by jgit). However, the differences between cgit's HEAD and this older revision are minimal...


# Tests # 
I did not write tests for CommitGraphBasedTopoSortGenerator for the following reasons: 

1. Kyle has already written tests that verify the correct output sequence when using a commit-graph (see RevWalkCommitGraphTest). The tests from Kyle are based on the tests written for c-git that test the commit-graph functionality (see https://github.com/git/git/blob/master/t/t5318-commit-graph.sh). Thus, the correct behavior of RevWalk is ensured. 
  
2. The CommitGraphBasedTopoSortGenerator only speeds up the emission of commits. Thus, it would make sense to add a performance test for RevWalk (since the correct behavior is already tested by Kyle's tests). I am not aware of any performance tests in JGit that are automatically processed (I know of FileTreeIteratorPerformanceTest, but its result must be interpreted by a human). I feel like a performance test is out of scope for this change because it probably involves creating a performance test framework in the context of JGit. I am also not sure that even the performance tests for c-git are processed automatically (these can be found here: https://github.com/git/git/tree/master/t/perf).
   
3. I attempted to write a unit test for CommitGraphBasedTopoSortGenerator but gave up when I realized that the test would become very convoluted since it involved very complex mocks of RevWalk, the PendingGenerator, and the RewriteGenerator.