Community
Participate
Working Groups
Created attachment 288102 [details] JGit Blame multiple files profiling Hi, I am trying to understand why the use of JGit to blame files is so slow. For comparison, I use this repository: https://github.com/SonarSource/sonarlint-core Using Git native command line: time find . -name "*.java" -exec git --no-pager blame -w -p --line-porcelain "{}" > out.txt \; real 0m7,457s user 0m5,849s sys 0m1,535s Then using this sample program: public void blame(Path baseDir) { Collection<File> files = FileUtils.listFiles(baseDir.toFile(), new String[] {"java"}, true); try (Repository repo = buildRepository(baseDir)) { File gitBaseDir = repo.getWorkTree(); files.forEach(inputFile -> blame(repo, gitBaseDir.toPath(), inputFile)); } } private void blame(Repository repo, Path gitBaseDir, File inputFile) { String filename = inputFile.getName(); System.out.println("Blame file " + filename); BlameResult blameResult; try (BlameGenerator gen = new BlameGenerator(repo, gitBaseDir.relativize(inputFile.toPath()).toString())) { gen.setTextComparator(RawTextComparator.WS_IGNORE_ALL); gen.setFollowFileRenames(true); gen.prepareHead(); gen.getRenameDetector().setSkipContentRenamesForBinaryFiles(true); blameResult = gen.computeBlameResult(); } catch (Exception e) { throw new IllegalStateException(e.getMessage(), e); } // Do something with blameResult } takes around 50s (so almost 10x longer than native Git) I run a profiler, and most of the time seems to be spent in RenameDetector. Is there a way to optmize this? I had a few ideas: - the BlameGenerator is not reusable, but in my use case we are likely to blame multiple files. Would it be possible to have my own blame implementation that would traverse commit history only once, and collect blame for all files in a single pass? - maybe the results of the RenameDetector could be cached between consecutive runs of the BlameGenerator? - maybe the RenameDetector can be optimized? I have the impression that it will compute the matrix for all files, even if the current blame is only interested by a single file - are any of those Git CLI optimizations [1] present in JGit? [1] https://stackoverflow.com/a/61565481/534773
Nope, commit graph is not implemented in JGit. There is an open change series by Kyle Zhao for adding support for commit-graph.[1] [1] https://git.eclipse.org/r/q/topic:%22commit-graph%22
Hi Thomas, Thanks, that look promising. Do you know if it is planned to make this benefit the blame command? According to https://git.eclipse.org/r/c/jgit/jgit/+/189102, I'm a bit afraid it would only target git log -- <path> Also, any opinion on the feasibility to implement a "multi-file" blame?
(In reply to Julien HENRY from comment #2) > Hi Thomas, > > Thanks, that look promising. Do you know if it is planned to make this > benefit the blame command? According to > https://git.eclipse.org/r/c/jgit/jgit/+/189102, I'm a bit afraid it would > only target git log -- <path> I didn't look at this change series in detail. I would expect a commit-graph implementation be beneficial for any commit traversal filtered by one or several paths, whether that is a git log or a git blame or something else. And the first thing I see in the commit you linked is that despite the commit message there is absolutely no change in LogCommand, which seems to match my expectation. > Also, any opinion on the feasibility to implement a "multi-file" blame? No. I notice that the C git blame is also for a single file only according to the documentation. I don't know whether the rename similarity matrix could be pruned.
(In reply to Thomas Wolf from comment #3) > > Also, any opinion on the feasibility to implement a "multi-file" blame? > > No. [...] To clarify: that means "No opinion". I do not know if it would be feasible, or what it would entail.
I checked out the branch change-183079-19 and used the artifact in my test. Unfortunately it did not provide any improvement to my use case. Files in the repository I'm using for the test have been moved a lot recently, so that may explain why the RenameDetector is playing such a big (negative) part in the bad performance I see. I will try to understand a bit more how it works and see if something can be done.
Perhaps something is missing. Did you generate a commit graph? If your repo doesn't have one, it cannot be used. I think running a garbage collection (with git config gc.writeCommitGraph=true) should create one.
I checked and the commit graph was there. > $ git commit-graph verify > Verification des commits dans le graphe de commits: 100% (1366/1366), fait.
Created attachment 288575 [details] visualvm CPU sampling snapshot of computing blame with jgit, on a moved file. We also observe very slow blame computation with jgit (upward of 20 seconds, whereas git blame is done in ~3 seconds). I've attached the visualvm snapshot, maybe it helps analyzing the problem. As with the original report, a lot of time is spent in: org.eclipse.jgit.diff.RenameDetector.findContentRenames()
(In reply to Simeon Andreev from comment #8) > As with the original report, a lot of time is spent in: > org.eclipse.jgit.diff.RenameDetector.findContentRenames() I believe in our case we have the problem of developer branches being re-used over multiple commits and master being merged into a developer branch multiple times. So there are relatively large commits fairly often. RenameDetector.findContentRenames() is therefore called with a lot of files, multiple times, computing many large "sparse" matrices (rows/columns being new/old files). Each matrix is thrown away when the computations for the next commit start... Maybe something can be restructured to reuse computations that were already done. Or maybe the computations can be limited to only relevant files that were affected by a rename? I don't really understand what the matrix is needed for, when computing a blame on a single file. I'll take a look the next few days, if nothing else comes up.
When I make a test repository where master is merged into a branch multiple times, the JGit blame still goes over the actual "git mv" commits when checking for renames (which as a user I understand and makes sense to me). Whereas in the actual case (where JGit blame is very slow), its going over merge commits that allegedly add files. Any idea if this is normal (it doesn't seem to make sense to me)? Unfortunately I'm not a git expert. E.g. we have a merge commit that "adds" 16k files and removes 9k files... so that creates a massive matrix computation to find potential renames. Its where JGit blame "hangs" in our case.
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/194200
Gerrit change https://git.eclipse.org/r/c/jgit/jgit/+/194200 was merged to [master]. Commit: http://git.eclipse.org/c/jgit/jgit.git/commit/?id=1a364c49ec88e5b2e642dddc743df5ebd7445daf
I just tested the latest master build, and the performance improvement is impressive! For the record, on the same repo I used for testing, here are the durations to blame all Java files, one by one: - native Git cli: 7s - JGit 6.2: 60s - JGit 6.3-SNAPSHOT (6.3.0-20220703.204523-11): 14s I have not tested if it produces the same output. Thanks a lot!