Bug 578900 - BlameGenerator is slow (because of RenameDetector?)
Summary: BlameGenerator is slow (because of RenameDetector?)
Status: RESOLVED FIXED
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: 6.0   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: 6.3   Edit
Assignee: Simeon Andreev CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-22 08:42 EST by Julien HENRY CLA
Modified: 2022-07-04 05:12 EDT (History)
4 users (show)

See Also:


Attachments
JGit Blame multiple files profiling (89.18 KB, image/png)
2022-02-22 08:42 EST, Julien HENRY CLA
no flags Details
visualvm CPU sampling snapshot of computing blame with jgit, on a moved file. (25.37 KB, application/octet-stream)
2022-06-13 09:08 EDT, Simeon Andreev CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Julien HENRY CLA 2022-02-22 08:42:24 EST
Created attachment 288102 [details]
JGit Blame multiple files profiling

Hi,

I am trying to understand why the use of JGit to blame files is so slow.

For comparison, I use this repository:
https://github.com/SonarSource/sonarlint-core

Using Git native command line:

time find . -name "*.java"  -exec git  --no-pager blame -w -p --line-porcelain "{}" > out.txt \;

real    0m7,457s
user    0m5,849s
sys     0m1,535s


Then using this sample program:

  public void blame(Path baseDir) {
    Collection<File> files = FileUtils.listFiles(baseDir.toFile(), new String[] {"java"}, true);
    try (Repository repo = buildRepository(baseDir)) {
      File gitBaseDir = repo.getWorkTree();
      files.forEach(inputFile -> blame(repo, gitBaseDir.toPath(), inputFile));
    }
  }

  private void blame(Repository repo, Path gitBaseDir, File inputFile) {
    String filename = inputFile.getName();
    System.out.println("Blame file " + filename);
    BlameResult blameResult;
    try (BlameGenerator gen = new BlameGenerator(repo, gitBaseDir.relativize(inputFile.toPath()).toString())) {
      gen.setTextComparator(RawTextComparator.WS_IGNORE_ALL);
      gen.setFollowFileRenames(true);
      gen.prepareHead();
      gen.getRenameDetector().setSkipContentRenamesForBinaryFiles(true);
      blameResult = gen.computeBlameResult();
    } catch (Exception e) {
      throw new IllegalStateException(e.getMessage(), e);
    }
    // Do something with blameResult
  }

takes around 50s (so almost 10x longer than native Git)


I run a profiler, and most of the time seems to be spent in RenameDetector. Is there a way to optmize this?

I had a few ideas:
- the BlameGenerator is not reusable, but in my use case we are likely to blame multiple files. Would it be possible to have my own blame implementation that would traverse commit history only once, and collect blame for all files in a single pass?
- maybe the results of the RenameDetector could be cached between consecutive runs of the BlameGenerator?
- maybe the RenameDetector can be optimized? I have the impression that it will compute the matrix for all files, even if the current blame is only interested by a single file
- are any of those Git CLI optimizations [1] present in JGit? 


[1] https://stackoverflow.com/a/61565481/534773
Comment 1 Thomas Wolf CLA 2022-02-22 09:41:54 EST
Nope, commit graph is not implemented in JGit. There is an open change series by Kyle Zhao for adding support for commit-graph.[1]

[1] https://git.eclipse.org/r/q/topic:%22commit-graph%22
Comment 2 Julien HENRY CLA 2022-02-22 09:59:38 EST
Hi Thomas,

Thanks, that look promising. Do you know if it is planned to make this benefit the blame command? According to https://git.eclipse.org/r/c/jgit/jgit/+/189102, I'm a bit afraid it would only target git log -- <path>

Also, any opinion on the feasibility to implement a "multi-file" blame?
Comment 3 Thomas Wolf CLA 2022-02-22 10:13:16 EST
(In reply to Julien HENRY from comment #2)
> Hi Thomas,
> 
> Thanks, that look promising. Do you know if it is planned to make this
> benefit the blame command? According to
> https://git.eclipse.org/r/c/jgit/jgit/+/189102, I'm a bit afraid it would
> only target git log -- <path>

I didn't look at this change series in detail. I would expect a commit-graph implementation be beneficial for any commit traversal filtered by one or several paths, whether that is a git log or a git blame or something else. And the first thing I see in the commit you linked is that despite the commit message there is absolutely no change in LogCommand, which seems to match my expectation.

> Also, any opinion on the feasibility to implement a "multi-file" blame?

No. I notice that the C git blame is also for a single file only according to the documentation.

I don't know whether the rename similarity matrix could be pruned.
Comment 4 Thomas Wolf CLA 2022-02-22 10:15:14 EST
(In reply to Thomas Wolf from comment #3)
> > Also, any opinion on the feasibility to implement a "multi-file" blame?
> 
> No. [...]

To clarify: that means "No opinion". I do not know if it would be feasible, or what it would entail.
Comment 5 Julien HENRY CLA 2022-02-22 11:55:04 EST
I checked out the branch change-183079-19 and used the artifact in my test. Unfortunately it did not provide any improvement to my use case.

Files in the repository I'm using for the test have been moved a lot recently, so that may explain why the RenameDetector is playing such a big (negative) part in the bad performance I see.

I will try to understand a bit more how it works and see if something can be done.
Comment 6 Thomas Wolf CLA 2022-02-23 03:15:08 EST
Perhaps something is missing. Did you generate a commit graph? If your repo doesn't have one, it cannot be used. I think running a garbage collection (with git config gc.writeCommitGraph=true) should create one.
Comment 7 Julien HENRY CLA 2022-03-01 08:03:23 EST
I checked and the commit graph was there.

> $ git commit-graph verify
> Verification des commits dans le graphe de commits: 100% (1366/1366), fait.
Comment 8 Simeon Andreev CLA 2022-06-13 09:08:30 EDT
Created attachment 288575 [details]
visualvm CPU sampling snapshot of computing blame with jgit, on a moved file.

We also observe very slow blame computation with jgit (upward of 20 seconds, whereas git blame is done in ~3 seconds). I've attached the visualvm snapshot, maybe it helps analyzing the problem.

As with the original report, a lot of time is spent in: org.eclipse.jgit.diff.RenameDetector.findContentRenames()
Comment 9 Simeon Andreev CLA 2022-06-14 13:36:23 EDT
(In reply to Simeon Andreev from comment #8)
> As with the original report, a lot of time is spent in:
> org.eclipse.jgit.diff.RenameDetector.findContentRenames()

I believe in our case we have the problem of developer branches being re-used over multiple commits and master being merged into a developer branch multiple times. So there are relatively large commits fairly often. RenameDetector.findContentRenames() is therefore called with a lot of files, multiple times, computing many large "sparse" matrices (rows/columns being new/old files). Each matrix is thrown away when the computations for the next commit start...

Maybe something can be restructured to reuse computations that were already done. Or maybe the computations can be limited to only relevant files that were affected by a rename? I don't really understand what the matrix is needed for, when computing a blame on a single file.

I'll take a look the next few days, if nothing else comes up.
Comment 10 Simeon Andreev CLA 2022-06-15 11:05:22 EDT
When I make a test repository where master is merged into a branch multiple times, the JGit blame still goes over the actual "git mv" commits when checking for renames (which as a user I understand and makes sense to me).

Whereas in the actual case (where JGit blame is very slow), its going over merge commits that allegedly add files. Any idea if this is normal (it doesn't seem to make sense to me)? Unfortunately I'm not a git expert.

E.g. we have a merge commit that "adds" 16k files and removes 9k files... so that creates a massive matrix computation to find potential renames. Its where JGit blame "hangs" in our case.
Comment 11 Eclipse Genie CLA 2022-06-15 16:24:24 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/194200
Comment 13 Julien HENRY CLA 2022-07-04 05:12:42 EDT
I just tested the latest master build, and the performance improvement is impressive!

For the record, on the same repo I used for testing, here are the durations to blame all Java files, one by one:
- native Git cli: 7s
- JGit 6.2: 60s
- JGit 6.3-SNAPSHOT (6.3.0-20220703.204523-11): 14s

I have not tested if it produces the same output.

Thanks a lot!