Bug 576165 - JGit refs lookup takes msecs to complete
Summary: JGit refs lookup takes msecs to complete
Status: NEW
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: 5.9   Edit
Hardware: All All
: P3 normal with 1 vote (vote)
Target Milestone: ---   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-21 13:59 EDT by Luca Milanesio CLA
Modified: 2023-01-05 08:41 EST (History)
4 users (show)

See Also:


Attachments
benchmark results for reloading the ref cache (15.26 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-01 20:52 EDT, Matthias Sohn CLA
no flags Details
benchmark results getExactRef and getRefsByPrefix (20.08 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-01 20:53 EDT, Matthias Sohn CLA
no flags Details
benchmark for reloading the cache (15.39 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-02 20:21 EDT, Matthias Sohn CLA
no flags Details
benchmark results getExactRef and getRefsByPrefix (20.39 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-02 20:22 EDT, Matthias Sohn CLA
no flags Details
benchmark results getExactRef and getRefsByPrefix with memory consumption (22.74 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-03 13:03 EDT, Matthias Sohn CLA
no flags Details
benchmark results getExactRef and getRefsByPrefix with memory consumption added chromium repository (23.05 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-11-13 17:30 EST, Matthias Sohn CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Milanesio CLA 2021-09-21 13:59:46 EDT
Gerrit Code Review from v5.2 assumed that JGit refs lookups are cheap and comparable to in-memory cache lookups. That assumption has lead to redesign some of its caches (e.g. account cache) to include the SHA1s of in cache keys, relying on super-fact refs lookups on JGit ref-database.

The assumption seems perfectly valid in the DFS backend implemented at Google, but for the rest of the OpenSource community using a LocalDisk repository backend things are much different.

After having observed significant regressions on Gerrit v5.2 on the account cache lookup (2600x times slower than v5.1) we have realised that the original assumption of lightning-fast refs lookups is incorrect: hence raising the issue as JGit level to understand where is the problem.

The issue was initially raised in Gerrit see [1]. After a discussion on the issue with Han-Wen, we have agreed to move the problem at JGit level to understand where the bottleneck that causes a ref-lookup to take milli-seconds instead of the expected nano-seconds.

See below some numbers:
- System: Macbook Pro 2.4 GHz 8-Core Intel Core i9 / Memory: 64GB / Storage: 4TB SSD
- Expected avg ref-lookup time: < 1 micro-seconds, Observed: 0.5 msec.

Bottom line: the refs lookup takes thousand of times more time than what was initially thought.

JGit should be itself cache:
- Repository objects
- Ref-database

It is unclear why looking up a repository in-memory and a ref in an in-memory cache should take as much as 0.5 msec.

This needs further investigation and analysis of where is the bottleneck in JGit.

[1] https://bugs.chromium.org/p/gerrit/issues/detail?id=14945
Comment 1 Matthias Sohn CLA 2021-09-21 14:25:30 EDT
- can we start this by implementing some benchmarks (we already have some in org.eclipse.jgit.benchmark) modelling the relevant scenarios ? And run them on different environments (OS, filesystem etc) in order to start improvements based on solid measurements.
- note that JGit is used in many scenarios, some of them assume that another git process can interfere (e.g. cgit and EGit looking at the same repository)

looks like you implemented time-warp, didn't know that Gerrit v5.x is already a thing :-)
Comment 2 Luca Milanesio CLA 2021-09-21 14:33:30 EDT
The "test-case" that takes 0.5 msec is doing this:

try (Repository repo = localRepoMgr.openRepository(allUsers)) {
    String refName = String.format(RefNames.REFS_USERS + "%02d/%d", id.get() % 100, id.get());
    Ref accountRef = repo.exactRef(refName);
}  
try (Repository repo = localRepoMgr.openRepository(allUsers)) {
    Ref externalIdsRef = repo.exactRef(RefNames.REFS_EXTERNAL_IDS);
}

The two open/close of the allUsers repository should be a noop (thanks to the cached repositories) and the second lookup of the same refname (RefNames.REFS_EXTERNAL_IDS) should be a certain cache-hit, because the ref-name is a constant.

Hence, the overall time should be influenced by only the lookup of repo.exactRef(refName).
Comment 3 Luca Milanesio CLA 2021-09-21 14:35:39 EDT
(In reply to Luca Milanesio from comment #0)
Apologies for the confusion, I messed up the JGit and Gerrit versions :o)

> Gerrit Code Review from v5.2
This is v3.2

[...] 
> After having observed significant regressions on Gerrit v5.2
This is v3.2

> on the account
> cache lookup (2600x times slower than v5.1)
This is v3.1

The fact that Gerrit v3.2 includes JGit 5.9 ... made a mixup in my (tired) brain, apologies for the confusion.
Comment 4 Luca Milanesio CLA 2021-09-21 14:38:44 EDT
(In reply to Matthias Sohn from comment #1)
> - can we start this by implementing some benchmarks (we already have some in
> org.eclipse.jgit.benchmark) modelling the relevant scenarios ? And run them
> on different environments (OS, filesystem etc) in order to start
> improvements based on solid measurements.

Sure, but I believe the OS and filesystem should not matter, *IF* JGit is caching repositories and refs. If that isn't the case, then you are perfectly right.

> - note that JGit is used in many scenarios, some of them assume that another
> git process can interfere (e.g. cgit and EGit looking at the same repository)

I know, would you imply that the caching is constantly checking the underlying filesystem status, also for refs? Please note that I did not set any 'trustfolderstat' and also if I do, they do not seem to make any difference.

> looks like you implemented time-warp, didn't know that Gerrit v5.x is
> already a thing :-)

Oh man, when the brain (and body) is tired, it does strange mixups :o)
I've rectified that in a follow-up, thanks for pointing that out !
Comment 5 Luca Milanesio CLA 2021-09-21 15:02:09 EDT
I believe this is all related to the consistency check between the in-memory cache and the underlying filesystem.

Yes, the packed-refs are cached in memory *BUT* we check that they have not been modified on disk:

See RefDirectory.java:882

		final PackedRefList curList = packedRefs.get();
		if (trustFolderStat && !curList.snapshot.isModified(packedRefsFile)) {
			return curList;
		}

With regards to the cost of curList.snapshot.isModified() method, that has been very well explained by Matthias in his talk at the Gerrit User Summit 2019 in Sunnyvale (see [2]) and is all about the racy-reads problem in Git.

Bottom-line: depending on the version of the JVM, the OS and the resolution of the underlying filesystem, it may take milli-seconds, micro-seconds or even seconds !

@Matthias did I get it right?

[2] https://www.youtube.com/watch?v=m44cAozuLNI
Comment 6 Luca Milanesio CLA 2021-09-21 15:28:06 EDT
I reckon that the DFS implementation of JGit does not have this issue because it does not use the org.eclipse.jgit.internal.storage.file.FileSnapshot for checking if the in-memory cache is still valid or not.

I would then also rephrase the term "issue" because it is the way the racy reads work in JGit when using a local filesystem: it works as designed.

@Han-Wen @Matthias do you believe my analysis is correct?
Comment 7 Matthias Sohn CLA 2021-09-21 16:43:28 EDT
(In reply to Luca Milanesio from comment #5)
> I believe this is all related to the consistency check between the in-memory
> cache and the underlying filesystem.
> 
> Yes, the packed-refs are cached in memory *BUT* we check that they have not
> been modified on disk:
> 
> See RefDirectory.java:882
> 
> 		final PackedRefList curList = packedRefs.get();
> 		if (trustFolderStat && !curList.snapshot.isModified(packedRefsFile)) {
> 			return curList;
> 		}
> 
> With regards to the cost of curList.snapshot.isModified() method, that has
> been very well explained by Matthias in his talk at the Gerrit User Summit
> 2019 in Sunnyvale (see [2]) and is all about the racy-reads problem in Git.
> 
> Bottom-line: depending on the version of the JVM, the OS and the resolution
> of the underlying filesystem, it may take milli-seconds, micro-seconds or
> even seconds !
> 
> @Matthias did I get it right?
> 
> [2] https://www.youtube.com/watch?v=m44cAozuLNI

it depends :-) 

- if e.g. packed-refs was modified very recently (delta less than filesystem timestamp resolution) it might be racy and we may have to re-read and re-parse packed-refs since we can't be sure if the file changed or not
- if lastModified is older than filesystem timestamp resolution we only have to stat the file in order detect this fact and can reuse the cached content

I think most of the time we have the second case (that's why FileSnapshot was implemented). The chance to hit the racy case causing expensive re-reads is higher on filesystems with more coarse grained timestamp resolution and on older Java versions (due to bugs and worse Java-internal file timestamp resolution). For the details see my presentation.
Comment 8 Matthias Sohn CLA 2021-09-21 16:47:05 EDT
Here results from the existing CreateFileSnapshotBenchmark 
on MacOS 11.5.2
on my MacBook 2.9 GHz Quad-Core Intel Core i7
APPLE SSD SM0512L filesystem: APFS

Java 8 (Temurin build 1.8.0_302-b08)

Benchmark                                           Mode  Cnt       Score       Error  Units
CreateFileSnapshotBenchmark.testCreateFile          avgt   25  191742.726 ± 31218.288  ns/op
CreateFileSnapshotBenchmark.testCreateFileSnapshot  avgt   25   13966.100 ±  1254.139  ns/op

Java 11 (Temurin 11.0.12+7)

Benchmark                                           Mode  Cnt       Score       Error  Units
CreateFileSnapshotBenchmark.testCreateFile          avgt   25  243514,723 ± 21497,747  ns/op
CreateFileSnapshotBenchmark.testCreateFileSnapshot  avgt   25   13244,470 ±   608,150  ns/op

Java 17 (SapMachine build 17+35-LTS-sapmachine)

Benchmark                                           Mode  Cnt       Score       Error  Units
CreateFileSnapshotBenchmark.testCreateFile          avgt   25  185828,450 ± 10317,174  ns/op
CreateFileSnapshotBenchmark.testCreateFileSnapshot  avgt   25   11878,963 ±   247,574  ns/op

This means creating FileSnapshot capturing stat data for a file takes around 14 (java 8), 13 (java 11), 12 (java 17) microseconds on my laptop.

The time needed for FileSnapshot.isModified should be in the same order of magnitude since it has to again stat the file to get current FileAttributes and
then compare them against the stat data cached in FileSnapshot.

If FileSnapshot.isModified returns false JGit reuses the cached packedRefs otherwise (or if trustfolderstat=false) it reads them again.

AFAICS in RefDirectory also loose refs use FileSnapshot to determine if the cached ref can be reused or if the file needs to be re-read.

The same dance is done for loose refs, but there might be many of them.
Comment 9 Matthias Sohn CLA 2021-09-21 16:55:47 EDT
More aggressive caching could be done if we would know there is only a single jgit process having access to the git repository. Then this process could manage invalidation of stale cache entries. 

But if e.g. an administrator uses another cgit process to poke in a git repository in the site of a running Gerrit server we have to stat the files to ensure we don't corrupt the repository by basing a write operation on stale cache content read earlier.
Comment 10 Luca Milanesio CLA 2021-09-21 19:43:45 EDT
I've added a specific benchmark for the Gerrit / All-Users use-case:
https://git.eclipse.org/r/c/jgit/jgit/+/185683
Comment 11 Luca Milanesio CLA 2021-09-21 21:03:54 EDT
On EL7 Linux Kernel 3.10.0 / 500GB RAM / 48 CPUs / NFS v3 / Java 11:
- avg: 410683 ns (0.4 ms) ±(99.9%) 59595 ns/op

On MacOS 11.4 (BigSur) 2.4 GHz 8-Core Intel Core i9 / 64GB RAM / APFS (4 TB SSD) / Java 11:
- avg: 131498.381 ±(99.9%) 3503.044 ns/op

From Gerrit v3.2 we were expecting the above figures to be comparable to an in-memory cached value, therefore in the order of a few ns, whilst they are between 0.13 ms (on SSD) and 0.41 ms (on NFS).

I will try to re-run the tests by disabling on JGit the check for the file snapshot and report tomorrow the results.
Comment 12 Luca Milanesio CLA 2021-09-22 04:32:43 EDT
I tried to put the opening of the Git repository out of the equation and also the file snapshot, but still it takes more than expected, in the order of 50 micro-secs whilst I was expecting something like 1000x times faster.

By looking at the code carefully, I don't believe JGit is really doing caching but just a mere in-memory buffering, see below:

RefDirectory.java:1049

private Ref readRef(String name, RefList<Ref> packed) throws IOException {
  final RefList<LooseRef> curList = looseRefs.get();
  final int idx = curList.find(name);
  [...]

Imagine you have a high number of loose refs (it may happen), the above statement would just do a linear search, without any hashing.

Also when the refs are loaded from a packfile, they are still searched by name in a linear list:

RefList.java:135

	public final int find(String name) {
		int high = cnt;
		if (high == 0)
			return -1;
		int low = 0;
		do {
			final int mid = (low + high) >>> 1;
			final int cmp = RefComparator.compareTo(list[mid], name);
			if (cmp < 0)
				low = mid + 1;
			else if (cmp == 0)
				return mid;
			else
				high = mid;
		} while (low < high);
		return -(low + 1);
	}

I believe JGit doesn't really cache refs at the moment and we should introduce such a layer using Caffeine or similar and using hashing functions to locate the refs by name.

@Matthias @Han-Wen WDYT?
Comment 13 Nasser Grainawi CLA 2021-09-22 11:57:53 EDT
(In reply to Luca Milanesio from comment #12)
> By looking at the code carefully, I don't believe JGit is really doing
> caching but just a mere in-memory buffering, see below:

What would you expect to be different with caching vs this "buffering"? From what you have below, it sounds like maybe you mean that you can do an O(1) lookup with a key?

> Imagine you have a high number of loose refs (it may happen), the above
> statement would just do a linear search, without any hashing.

How many loose refs would you need for this to cause a measurable performance problem?

> 
> Also when the refs are loaded from a packfile, they are still searched by
> name in a linear list:

This find() is doing a binary search, not a linear scan, correct? And this same code is used for both loose and packed refs.

> 
> I believe JGit doesn't really cache refs at the moment and we should
> introduce such a layer using Caffeine or similar and using hashing functions
> to locate the refs by name.

Obviously having an O(1) lookup would be better than the current O(log n) lookup (and I don't know of any reason we wouldn't want to try to be able to safely get O(1) lookups for refs), but does that difference explain the performance concern you see?
Comment 14 Luca Milanesio CLA 2021-09-22 13:57:43 EDT
I've re-run the test by:
- removing the open/close of the repository
- enabling reftable

See below the results:

On MacOS 11.4 (BigSur) 2.4 GHz 8-Core Intel Core i9 / 64GB RAM / APFS (4 TB SSD) / Java 11:
  15175.585 ±(99.9%) 877.848 ns/op [Average]
  (min, avg, max) = (13508.188, 15175.585, 17417.864), stdev = 1171.902
  CI (99.9%): [14297.737, 16053.432]

The above results are 8x times faster than the traditional packed/loose refs implementation.

However, we are still in the micro-seconds units and very far from being as fast as an in-memory cache.

Is there any cache in reftable that I can enable?

I enabled it as:
[core]
	repositoryformatversion = 1
[extensions]
	refStorage = reftable
Comment 15 Luca Milanesio CLA 2021-09-22 14:31:45 EDT
(In reply to Nasser Grainawi from comment #13)
> (In reply to Luca Milanesio from comment #12)
> > By looking at the code carefully, I don't believe JGit is really doing
> > caching but just a mere in-memory buffering, see below:
> 
> What would you expect to be different with caching vs this "buffering"? From
> what you have below, it sounds like maybe you mean that you can do an O(1)
> lookup with a key?
> 
> > Imagine you have a high number of loose refs (it may happen), the above
> > statement would just do a linear search, without any hashing.
> 
> How many loose refs would you need for this to cause a measurable
> performance problem?

I don't know, it takes a very long time to create so many refs: as the number of refs grows, the rate of creating them slows down. In the benchmark there are 10k refs and I don't see a noticeable difference between loose refs and packed refs.

I guess the number must be then much higher than that, but that's just a finger in the air.


> > Also when the refs are loaded from a packfile, they are still searched by
> > name in a linear list:
> 
> This find() is doing a binary search, not a linear scan, correct?

Yes, I said a linear list and then the code tries to be smarter with a dihcotomic search, hence with a O(log(n)) complexity. That is still far away from what you would get from a Guava/Caffeine cache, which is a O(K) for a well-balanced cache.

> > I believe JGit doesn't really cache refs at the moment and we should
> > introduce such a layer using Caffeine or similar and using hashing functions
> > to locate the refs by name.
> 
> Obviously having an O(1) lookup would be better than the current O(log n)
> lookup (and I don't know of any reason we wouldn't want to try to be able to
> safely get O(1) lookups for refs), but does that difference explain the
> performance concern you see?

I believe it does, but it isn't the only bottleneck. JGit relies in "auto-detection" of any modification on the filesystem of any files, which adds also a tons of system calls that you won't have in a Guava/Caffeine cache.

Bottom line: I believe this bug can be closed with a statement "JGit does not currently implement a Guava or Caffeine-style caching layer of refs lookups", hence the execution time of a ref lookup can vary from 15 micro-secs (ref-table on local SSD) to 0.5 msec (packed-ref on networked NFS volume).

The numbers provided would be useful for others relying on refs lookups for taking into consideration of what the execution time could be, with a local SSD, an NFS storage and with PackedRefs vs. RefTable.
Comment 16 Eclipse Genie CLA 2021-10-05 20:50:53 EDT
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/186206
Comment 17 Matthias Sohn CLA 2021-10-06 06:29:56 EDT
I implemented a PoC InMemoryRefDatabase based on a ternary search tree in jgit which can be used as a Ref cache:
https://git.eclipse.org/r/c/jgit/jgit/+/186205

I also adapted the benchmark Luca implemented 
https://git.eclipse.org/r/c/jgit/jgit/+/186206
and ran it on my laptop, here the results on a MacBook 2.9 GHz Quad-Core Intel Core i7, SSD, filesystem APFS
using doGc=true  and trustFolderStat=true:

Benchmark                                   (doGc)  (numBranches)  (trustFolderStat)  Mode  Cnt   Score   Error  Units
GetRefsBenchmark.testGetExactRef              true            100               true  avgt    2  80,186          us/op
GetRefsBenchmark.testGetExactRef              true           2500               true  avgt    2  85,664          us/op
GetRefsBenchmark.testGetExactRef              true          10000               true  avgt    2  93,523          us/op
GetRefsBenchmark.testGetExactRefCached        true            100               true  avgt    2   0,040          us/op
GetRefsBenchmark.testGetExactRefCached        true           2500               true  avgt    2   0,043          us/op
GetRefsBenchmark.testGetExactRefCached        true          10000               true  avgt    2   0,041          us/op
GetRefsBenchmark.testGetRefsByPrefix          true            100               true  avgt    2  43,639          us/op
GetRefsBenchmark.testGetRefsByPrefix          true           2500               true  avgt    2  50,517          us/op
GetRefsBenchmark.testGetRefsByPrefix          true          10000               true  avgt    2  52,297          us/op
GetRefsBenchmark.testGetRefsByPrefixCached    true            100               true  avgt    2   0,120          us/op
GetRefsBenchmark.testGetRefsByPrefixCached    true           2500               true  avgt    2   0,139          us/op
GetRefsBenchmark.testGetRefsByPrefixCached    true          10000               true  avgt    2   0,234          us/op

This means in the tested range of 100-10k refs with the cache 
- getExactRef is faster by a factor of 2000
- getRefsByPrefix is faster by a factor of 200-380

Updating the cache is implemented in a simple brute-force way:
the cache listens on RefChangedEvents and reloads all refs on each event from the RefDatabase wrapped by the cache.
Comment 18 Luca Milanesio CLA 2021-10-06 13:38:15 EDT
Cool stuff @Matthias, well done.

The performance figures are definitely encouraging: this is of course a PoC and it would be nice to a design on how this should work with the final solution.

How do you guys typically manage new designs in the JGit project?
Comment 19 Matthias Sohn CLA 2021-11-01 20:52:18 EDT
Created attachment 287417 [details]
benchmark results for reloading the ref cache

measured using
https://git.eclipse.org/r/c/jgit/jgit/+/186415/10
on a MacBook 2.6 GHz 6-Core Intel Core i7 with SSD
Comment 20 Matthias Sohn CLA 2021-11-01 20:53:45 EDT
Created attachment 287418 [details]
benchmark results getExactRef and getRefsByPrefix

measured using
https://git.eclipse.org/r/c/jgit/jgit/+/186415/10
on a MacBook 2.6 GHz 6-Core Intel Core i7 with SSD
Comment 21 Matthias Sohn CLA 2021-11-01 21:00:41 EDT
I parametrized the following tests to run without and with the cache:
- RefUpdateTest (https://git.eclipse.org/r/c/jgit/jgit/+/186950)
- BatchRefUpdateTest (https://git.eclipse.org/r/c/jgit/jgit/+/187187)
- RefDirectoryTest (https://git.eclipse.org/r/c/jgit/jgit/+/187194)

and all tests are now passing :-)
Comment 22 Luca Milanesio CLA 2021-11-01 21:09:36 EDT
Cool stuff @Matthias. I'll check the change to see if we could start using it in Gerrit.
Comment 23 Matthias Sohn CLA 2021-11-02 04:46:29 EDT
(In reply to Luca Milanesio from comment #22)
> Cool stuff @Matthias. I'll check the change to see if we could start using
> it in Gerrit.

wait a minute, I just found out that my benchmark measures crap since it's measuring an empty InMemoryRefDatabase :-(0). Found that by creating a heap dump to measure memory consumption.
Comment 24 Matthias Sohn CLA 2021-11-02 20:21:26 EDT
Created attachment 287428 [details]
benchmark for reloading the cache

fixed several bugs, added one more repository
Comment 25 Matthias Sohn CLA 2021-11-02 20:22:51 EDT
Created attachment 287429 [details]
benchmark results getExactRef and getRefsByPrefix

fixed the benchmark, override getRefsByPrefix, new method to directly load values without keys
Comment 26 Han-Wen NIenhuys CLA 2021-11-03 05:59:37 EDT
Awesome, looks like >1000x speed-up. Hopefully, that quells concerns about notedb performance
Comment 27 Matthias Sohn CLA 2021-11-03 13:03:42 EDT
Created attachment 287437 [details]
benchmark results getExactRef and getRefsByPrefix with memory consumption

yep, though this speedup comes with a memory footprint for the cache, see the latest attachment.

Should be good enough for speeding up ref access in Gerrit's All-Users repository.
Comment 28 Matthias Sohn CLA 2021-11-13 17:30:12 EST
Created attachment 287507 [details]
benchmark results getExactRef and getRefsByPrefix with memory consumption added chromium repository
Comment 30 Eclipse Genie CLA 2022-12-07 20:04:25 EST
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/197555
Comment 31 Eclipse Genie CLA 2022-12-07 20:04:27 EST
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/197557
Comment 32 Matthias Sohn CLA 2022-12-13 09:37:50 EST
I rebased my RefCache series from last year onto master and fixed the bugs it had

https://git.eclipse.org/r/c/jgit/jgit/+/197557

Tests are now passing :-)
I still need to rework the benchmarks

- the implementation provides a write-through cache which is updated incrementally when RefUpdate, BatchRefUpdate, RefRename occur.
- the cache is enabled if core.refCache=true and extensions.refStorage is unset or set to "refdir"
- the cache is caching all refs of the repos configured to use it
- the cache doesn't recognize if refs are updated behind JGit's back e.g. by another process
- FileRepository exposes the cache so that an application using JGit can trigger a reload of the cache if necessary
- tests show that the cache consumes 200-600 bytes per cached ref