Bug 579715 - Far too many git config accesses during cloning
Summary: Far too many git config accesses during cloning
Status: NEW
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: 6.2   Edit
Hardware: PC Windows 10
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-15 11:24 EDT by James Poli CLA
Modified: 2022-08-15 09:07 EDT (History)
2 users (show)

See Also:


Attachments
e-mail between Thomas and myself. (59.00 KB, application/octet-stream)
2022-04-15 11:24 EDT, James Poli CLA
no flags Details
Eclipse run output when cloning to a network Samba drive (262.47 KB, application/x-zip-compressed)
2022-04-15 11:27 EDT, James Poli CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description James Poli CLA 2022-04-15 11:24:43 EDT
Created attachment 288454 [details]
e-mail between Thomas and myself.

Hello,
I've been working with Thomas Wolf on a performance issue when cloning to a Network Samba share.  Thomas asked me to open this bug.
Regards,
Jim
See also https://www.eclipse.org/forums/index.php?t=msg&th=1110388&goto=1851740&#msg_1851740
Comment 1 James Poli CLA 2022-04-15 11:27:05 EDT
Created attachment 288455 [details]
Eclipse run output when cloning to a network Samba drive

Thomas reviewed this log already; it's part of an attachment e-mail thread.
Comment 2 Thomas Wolf CLA 2022-04-15 13:30:42 EDT
Yes. This looks similar to the problems we had in EGit when the user home (and thus the user config) was on a shared drive on Windows: far too many accesses to the git repo config, each time getting the file attributes of the repo config, the user config, and the system config to check whether they were modified. Getting file attributes on Windows is a costly operation.

It looks like we have about 2-3 git config accesses *per file* that is checked out. Also a lot of checks for the pack directory.

This will need need some detailed tests to first find out where these Filesnapshot tests are coming from exactly, and then making them artificially slower (a short Thread.sleep() should do the trick). That way, this could be analyzed and fixed even on a fast filesystem.
Comment 3 James Poli CLA 2022-07-26 09:22:40 EDT
Hello, Any chance to get a few cycles on this issue?  Thanks, Jim
Comment 4 Thomas Wolf CLA 2022-08-14 08:36:17 EDT
Improving this is a lot of work.

There are several reasons for the many git config accesses.

1. During check-out: these come from DirCacheCheckout.checkoutEntry(), which
   calls repo.getConfig().get(WorkingTreeOptions.KEY) for every single file.

   repo.getConfig() checks the repo, user, and system config for being modified.

   This could be avoided in several ways:

   * Pass the config into DirCacheCheckout.checkoutEntry() and make sure it is
     loaded once outside in DirCacheCheckout.doCheckout(). There are other
     check-out paths though, also in merge, that might need to be adapted.
     (Or alternatively pass in the WorkingTreeOptions directly.)
  
   * Incorporate EGit's CachingRepository into FileRepository, port EGit's
     UnitOfWork, and modify commands that perform check-outs to run the actual
     check-out in a UnitOfWork. (A UnitOfWork caches the git config and does
     not re-check for on-disk modifications for the duration of the UnitOfWork.)

2. Writing refs (i.e., branches and tags received). There are many RefUpdates
   that call repo.getConfig() to get the core.trustFolderStat setting, and if
   false, would also check the packed refs file for having been modified. (See
   RefDirectory.getPackedRefs().)
   
   Similar remedies as above might be applied. Either read that trustFolderStat
   setting only once and pass it around, or see if a UnitOfWork could be used.

   Or handle the initial writing of refs specially when cloning.

3. (Not a git config access, but a costly file attributes access) Checks for
   the pack directory being modified: these come from the fetch phase when
   receiving pack files and parsing them. There is a check against duplicate
   objects (i.e., objects with the same object ID but different content) in
   there, which scans all packs in the directory, and if not found, checks
   whether new packs have been added to the pack directory and if so re-scans
   again. Unclear how this could be improved. This duplicate check is done
   for every object received.

   Maybe switch off this duplicate check? Or just don't re-check for new pack
   files if cloning and the first pass over the pack files didn't find a
   duplicate?
Comment 5 Eclipse Genie CLA 2022-08-14 11:01:15 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/195181
Comment 6 Thomas Wolf CLA 2022-08-14 11:22:15 EDT
(In reply to Eclipse Genie from comment #5)
> New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/195181

This fixes (1) from comment 4 by loading the WorkingTreeOptions only once instead of for every file to be checked out.

(2) and (3) are still open. Note that (3) has the biggest effect.
Comment 8 James Poli CLA 2022-08-15 08:40:10 EDT
Thanks, Tomas, for the work on this issue.  Do you know when this change will make it into Eclipse and what version that will be?  
Thanks,
Jim
Comment 9 Matthias Sohn CLA 2022-08-15 09:07:49 EDT
(In reply to James Poli from comment #8)
> Thanks, Tomas, for the work on this issue.  Do you know when this change
> will make it into Eclipse and what version that will be?  
> Thanks,
> Jim

This patch will be included in JGit 6.3.0 which is planned to be released with Eclipse 2022-09 (4.25) on Sep 14.