Bug 494323 - Git Porcelain API: Add command is extremely slow
Summary: Git Porcelain API: Add command is extremely slow
Status: RESOLVED FIXED
Alias: None
Product: JGit
Classification: Technology
Component: JGit
Version: 4.3
Hardware: Macintosh Mac OS X
Importance: P3 normal with 1 vote
Target Milestone: 6.6
Assignee: Thomas Wolf CLA
Reported: 2016-05-23 11:13 EDT by Pavel Lobodinsky CLA
Modified: 2023-05-02 07:59 EDT

Attachments
Profiler's snapshot from VisualVM (238.43 KB, image/png)
2016-05-30 03:49 EDT, Pavel Lobodinsky CLA
AddPerformance Test (2.99 KB, application/octet-stream)
2016-07-28 11:20 EDT, Christian Halstrick CLA

Description Pavel Lobodinsky CLA 2016-05-23 11:13:35 EDT
For a code base of about 2k files totalling about 100 MiB, the "jgit add ." command is very slow even when there are no changes to add.

Reproduction steps:
1) Create a new Git repo by "Git git = Git.init().call();"
2) Add all files into index by "git.add().addFilepattern(".").call();"
3) Commit by "git.commit().setAuthor("author", "email").setMessage("message").call();"
4) Run "git.add().addFilepattern(".").call();" again.
5) Expected behavior: This should be very quick. Regular Git spends about 0.015s on this operation.
6) Actual behavior: It takes about 3 seconds to complete.

This is quite a big issue on our project, where we expect many small commits made programmatically. Over 100 commits in a row, JGit spends about 4 minutes on the Add command, while regular Git gets the same done in 1 second.
Comment 1 Andrey Loskutov CLA 2016-05-23 11:17:24 EDT
Any chance that you can run jgit with profiler, so we have an idea which code is responsible?
Comment 2 Pavel Lobodinsky CLA 2016-05-30 02:02:59 EDT
Sorry for late reply, I thought the email notification I received was telling me I created an issue. But instead, it was your reply.

I will profile the app and send you some details back.
Comment 3 Pavel Lobodinsky CLA 2016-05-30 03:49:57 EDT
Created attachment 262105
Profiler's snapshot from VisualVM

Here is the snapshot from the profiler. The test ran the `git add .` command repeatedly over a directory with 2000 files.
Comment 4 Christian Halstrick CLA 2016-05-31 08:06:48 EDT
How is the performance if directly before the "jgit add ." command you do a native git "git status" command in that repository?
Comment 5 Pavel Lobodinsky CLA 2016-05-31 15:52:34 EDT
Calling native `git status` prior to `jgit add .` does not change a thing. See what the result was:

pavel@pavel-mbp: ~/Downloads/GitTest master $ time ../jgit-4.3.1.sh add .
../jgit-4.3.1.sh add .  5.87s user 0.46s system 85% cpu 7.370 total

pavel@pavel-mbp: ~/Downloads/GitTest master $ time git add .
git add .  0.01s user 0.01s system 88% cpu 0.023 total

pavel@pavel-mbp: ~/Downloads/GitTest master $ time git status
On branch master
nothing to commit, working directory clean
git status  0.01s user 0.02s system 97% cpu 0.030 total

pavel@pavel-mbp: ~/Downloads/GitTest master $ time ../jgit-4.3.1.sh add .
../jgit-4.3.1.sh add .  5.40s user 0.45s system 92% cpu 6.362 total
Comment 6 Pavel Lobodinsky CLA 2016-07-27 08:51:08 EDT
Any luck finding the cause of the poor performance?
Comment 7 Christian Halstrick CLA 2016-07-28 11:20:59 EDT
Created attachment 263361
AddPerformance Test
Comment 8 Christian Halstrick CLA 2016-07-28 11:24:14 EDT
I looked into it. Most of the time is spent because you call git.add().addFilepattern(".").call() instead of explicitly naming the modified paths.
I wrote a test (attached to this bug) where, in a repo of 100k files, I modified 1k files each time. Constructing a JGit AddCommand, calling addFilepattern(<modifiedFileName>) 1000 times and then calling call() is 5 times faster than replacing the 1000 calls to addFilepattern(...) with one single addFilepattern(".")! The runtime for adding 1000 modified files drops from 2500ms to 500ms.
The reason is: when you tell git (or jgit) to "git add ." then git/jgit has to visit every file in the working tree to find out whether it is modified or not. Since we store a timestamp in the git index for each file in the working tree, it is easy to detect whether a file is unmodified. JGit should not be as slow for a "git add ." as you have reported and as my test case also shows. So there is a bug somewhere. But as I said: one workaround could be to explicitly specify the modified files.
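
As an illustration of the timestamp check described above, here is a minimal sketch in plain Java. It is not JGit's actual implementation: `CachedStat` and `StatCheck` are hypothetical names, and the sketch assumes that comparing only size and modification time is enough. A real index entry stores more fields and also has to deal with timestamps that are too coarse to be trusted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical, simplified stand-in for the stat data cached per index entry. */
final class CachedStat {
    final long size;
    final long lastModifiedMillis;

    CachedStat(long size, long lastModifiedMillis) {
        this.size = size;
        this.lastModifiedMillis = lastModifiedMillis;
    }

    /** Reads size and mtime from the file system without reading file content. */
    static CachedStat of(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        return new CachedStat(attrs.size(), attrs.lastModifiedTime().toMillis());
    }
}

final class StatCheck {
    /**
     * Returns true when size and mtime still match the cached values, in which
     * case the content would not need to be read and hashed again.
     */
    static boolean looksUnmodified(Path file, CachedStat cached) throws IOException {
        CachedStat now = CachedStat.of(file);
        return now.size == cached.size
                && now.lastModifiedMillis == cached.lastModifiedMillis;
    }
}
```

A caller would record `CachedStat.of(file)` when the file is added and call `StatCheck.looksUnmodified(file, cached)` on the next add to decide whether the file can be skipped.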
Comment 9 Pavel Lobodinsky CLA 2016-07-28 11:38:04 EDT
(In reply to Christian Halstrick from comment #8)

I cannot specify the files explicitly because I simply have no clue which were changed.
Anyway, it is strange that JGit is so much slower than Git, whilst in your case it was not.
Comment 10 Pavel Lobodinsky CLA 2017-06-13 07:02:41 EDT
So I met the performance issue once again, this time having a different use-case.

1) Create a directory with a huge file in it, and make it the current working directory in your terminal
2) Initialise Git repo by issuing `jgit init .`
3) Issue `jgit add . ; jgit commit -m "Init commmit"`
4) Issue `echo "test" > test.txt`
5) Issue `jgit add .`

This last `jgit add .` takes enormous time even though there is only a single change - the `test.txt` file.


Example from my local machine, having 4 files taking about 5GB:

$ time ../jgit-4.7.1.sh init .
../jgit-4.7.1.sh init .  0.66s user 0.08s system 162% cpu 0.455 total

$ time ../jgit-4.7.1.sh add .
../jgit-4.7.1.sh add .  193.70s user 10.35s system 100% cpu 3:23.05 total
                                                                                                                                                                                                                                              
$ ../jgit-4.7.1.sh status
On branch master
Changes to be committed:
	new file:   Oracle for Windows/Oracle 11.2.0.3.0 x86 64bit 1of2.zip
	new file:   Oracle for Windows/Oracle 11.2.0.3.0 x86 64bit 2of2.zip
	new file:   Oracle for Windows/winx64_12102_database_1of2.zip
	new file:   Oracle for Windows/winx64_12102_database_2of2.zip

$ ../jgit-4.7.1.sh commit -m "init"
[master a6e5db387537fb34e77d87498f303e1f322c0536] init

$ echo "test" > test.txt

$ ../jgit-4.7.1.sh status
On branch master
Untracked files:
	test.txt

$ time ../jgit-4.7.1.sh add .
../jgit-4.7.1.sh add .  180.22s user 7.96s system 100% cpu 3:08.02 total
Comment 11 Christian Halstrick CLA 2017-06-20 10:51:05 EDT
I can reproduce this problem. The thing is that when you execute 'jgit add .' then jgit will search for all files matching '.' and will add them again to the index. Adding huge binary files to the index is expensive because we read all the bytes again, compute a SHA1, etc. JGit (I guess in contrast to c-git) doesn't check whether a file has been modified since it was last added to the index. Unmodified files (compared to what is in the index) should not be read again.
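
To illustrate why re-adding is expensive, the sketch below computes a blob ID the way Git defines it: SHA-1 over a `blob <size>\0` header followed by the file content. This is plain Java rather than JGit code, `BlobHash` is a hypothetical name, and the sketch assumes that no filters or line-ending conversions apply. The point is only that every byte of the file has to be read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class BlobHash {
    /** Returns the hex SHA-1 blob ID of the file, reading its full content. */
    static String blobId(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");

        // Git hashes a header "blob <size>" plus a NUL byte before the content.
        String header = "blob " + Files.size(file) + "\0";
        sha1.update(header.getBytes(StandardCharsets.US_ASCII));

        // Stream the content; for a multi-gigabyte file this loop dominates.
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                sha1.update(buffer, 0, read);
            }
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The cost grows linearly with file size, which is consistent with the multi-minute runs reported for the 5GB test directory.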

One workaround would be to first run a jgit status to find the modified and new files and then do a jgit add with explicit paths. That should be fast.
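
A rough sketch of this workaround using the JGit porcelain API might look as follows. It assumes JGit is on the classpath; `ExplicitAdd` and `addChangedOnly` are hypothetical names, and deleted files are deliberately ignored because a plain add does not stage removals. Note that the status call still has to scan the working tree, so this is only a sketch of the idea, not a guaranteed speed-up.

```java
import java.util.Set;
import java.util.TreeSet;

import org.eclipse.jgit.api.AddCommand;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.Status;
import org.eclipse.jgit.api.errors.GitAPIException;

final class ExplicitAdd {
    /**
     * Adds only the paths that status reports as modified or untracked,
     * instead of passing "." as the file pattern.
     *
     * @return the number of paths handed to the add command
     */
    static int addChangedOnly(Git git) throws GitAPIException {
        Status status = git.status().call();

        // Collect tracked-but-modified and untracked paths.
        Set<String> paths = new TreeSet<>();
        paths.addAll(status.getModified());
        paths.addAll(status.getUntracked());
        if (paths.isEmpty()) {
            return 0; // nothing to add, so skip the add command entirely
        }

        AddCommand add = git.add();
        for (String path : paths) {
            add.addFilepattern(path);
        }
        add.call();
        return paths.size();
    }
}
```

With an open `Git` instance, `ExplicitAdd.addChangedOnly(git)` would take the place of `git.add().addFilepattern(".").call()`.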

Of course it would be nicer if JGit would also avoid adding unmodified files again. But this could be a little bit tricky in combination with filters. Maybe you have not modified the big file itself but you have configured a new filter in your .gitattributes file. This may require re-adding file content although the file has not been touched. I have to investigate further on this.
Comment 12 Pavel Lobodinsky CLA 2017-06-20 10:59:22 EDT
Yes, the workaround with `git status` may do the trick for now. However, there is quite some additional logic required. It would indeed be great if JGit itself eliminated this.

In my testing scenario, I have just added a tiny new file. No filter changes or anything like that.

Also, it seems to me that the re-adding of all files is the cause of the slowness of `git add .` when my repo contains a large number of files, which is the original reason I reported this issue.
Comment 13 Dave Hawkins CLA 2023-04-20 11:29:55 EDT
While a smudge filter would obviously change the SHA calculation, is c-git taking filter changes into account during add? Looking at the C code, I'm not sure that it is, and if that's the case, shouldn't JGit behave in the same way, i.e. just use the stat info?

In our use of JGit we, unfortunately, have to use a fairly slow filesystem. For a 10000 file test case, with each file 10k of random text and no modified files, jgit add . is taking >200s, whereas git add . is taking <2s.
Comment 14 Thomas Wolf CLA 2023-04-20 14:13:59 EDT
From code inspection it seems to me AddCommand.call() should set an IndexDiffFilter on its treewalk (possibly combined with a path filter, if it's not ".").
Comment 15 Eclipse Genie CLA 2023-04-22 06:42:08 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/201449
Comment 17 Dave Hawkins CLA 2023-05-02 05:56:43 EDT
Tested the latest nightly with my test case and there's a significant performance improvement with the new property, >100x faster.
Comment 18 Thomas Wolf CLA 2023-05-02 07:01:27 EDT
(In reply to Dave Hawkins from comment #17)
> Tested the latest nightly with my test case and there's a significant
> performance improvement with the new property, >100x faster.

Glad to hear that! Thanks for the report.

For others: that would be

  git.add().addFilepattern(".").setRenormalize(false).call();

"renormalize" is true by default in JGit, which is why it was so slow. See the commit message and [1] for details.

[1] https://git-scm.com/docs/git-add#Documentation/git-add.txt---renormalize
Comment 19 Matthias Sohn CLA 2023-05-02 07:59:38 EDT
(In reply to Thomas Wolf from comment #18)

Maybe we can switch the default in the next major release.