Community
Participate
Working Groups
Gerrit Code Review from v5.2 assumed that JGit refs lookups are cheap and comparable to in-memory cache lookups. That assumption has lead to redesign some of its caches (e.g. account cache) to include the SHA1s of in cache keys, relying on super-fact refs lookups on JGit ref-database. The assumption seems perfectly valid in the DFS backend implemented at Google, but for the rest of the OpenSource community using a LocalDisk repository backend things are much different. After having observed significant regressions on Gerrit v5.2 on the account cache lookup (2600x times slower than v5.1) we have realised that the original assumption of lightning-fast refs lookups is incorrect: hence raising the issue as JGit level to understand where is the problem. The issue was initially raised in Gerrit see [1]. After a discussion on the issue with Han-Wen, we have agreed to move the problem at JGit level to understand where the bottleneck that causes a ref-lookup to take milli-seconds instead of the expected nano-seconds. See below some numbers: - System: Macbook Pro 2.4 GHz 8-Core Intel Core i9 / Memory: 64GB / Storage: 4TB SSD - Expected avg ref-lookup time: < 1 micro-seconds, Observed: 0.5 msec. Bottom line: the refs lookup takes thousand of times more time than what was initially thought. JGit should be itself cache: - Repository objects - Ref-database It is unclear why looking up a repository in-memory and a ref in an in-memory cache should take as much as 0.5 msec. This needs further investigation and analysis of where is the bottleneck in JGit. [1] https://bugs.chromium.org/p/gerrit/issues/detail?id=14945
- can we start this by implementing some benchmarks (we already have some in org.eclipse.jgit.benchmark) modelling the relevant scenarios ? And run them on different environments (OS, filesystem etc) in order to start improvements based on solid measurements. - note that JGit is used in many scenarios, some of them assume that another git process can interfere (e.g. cgit and EGit looking at the same repository) looks like you implemented time-warp, didn't know that Gerrit v5.x is already a thing :-)
The "test-case" that takes 0.5 msec is doing this: try (Repository repo = localRepoMgr.openRepository(allUsers)) { String refName = String.format(RefNames.REFS_USERS + "%02d/%d", id.get() % 100, id.get()); Ref accountRef = repo.exactRef(refName); } try (Repository repo = localRepoMgr.openRepository(allUsers)) { Ref externalIdsRef = repo.exactRef(RefNames.REFS_EXTERNAL_IDS); } The two open/close of the allUsers repository should be a noop (thanks to the cached repositories) and the second lookup of the same refname (RefNames.REFS_EXTERNAL_IDS) should be a certain cache-hit, because the ref-name is a constant. Hence, the overall time should be influenced by only the lookup of repo.exactRef(refName).
(In reply to Luca Milanesio from comment #0) Apologies for the confusion, I messed up the JGit and Gerrit versions :o) > Gerrit Code Review from v5.2 This is v3.2 [...] > After having observed significant regressions on Gerrit v5.2 This is v3.2 > on the account > cache lookup (2600x times slower than v5.1) This is v3.1 The fact that Gerrit v3.2 includes JGit 5.9 ... made a mixup in my (tired) brain, apologies for the confusion.
(In reply to Matthias Sohn from comment #1) > - can we start this by implementing some benchmarks (we already have some in > org.eclipse.jgit.benchmark) modelling the relevant scenarios ? And run them > on different environments (OS, filesystem etc) in order to start > improvements based on solid measurements. Sure, but I believe the OS and filesystem should not matter, *IF* JGit is caching repositories and refs. If that isn't the case, then you are perfectly right. > - note that JGit is used in many scenarios, some of them assume that another > git process can interfere (e.g. cgit and EGit looking at the same repository) I know, would you imply that the caching is constantly checking the underlying filesystem status, also for refs? Please note that I did not set any 'trustfolderstat' and also if I do, they do not seem to make any difference. > looks like you implemented time-warp, didn't know that Gerrit v5.x is > already a thing :-) Oh man, when the brain (and body) is tired, it does strange mixups :o) I've rectified that in a follow-up, thanks for pointing that out !
I believe this is all related to the consistency check between the in-memory cache and the underlying filesystem. Yes, the packed-refs are cached in memory *BUT* we check that they have not been modified on disk: See RefDirectory.java:882 final PackedRefList curList = packedRefs.get(); if (trustFolderStat && !curList.snapshot.isModified(packedRefsFile)) { return curList; } With regards to the cost of curList.snapshot.isModified() method, that has been very well explained by Matthias in his talk at the Gerrit User Summit 2019 in Sunnyvale (see [2]) and is all about the racy-reads problem in Git. Bottom-line: depending on the version of the JVM, the OS and the resolution of the underlying filesystem, it may take milli-seconds, micro-seconds or even seconds ! @Matthias did I get it right? [2] https://www.youtube.com/watch?v=m44cAozuLNI
I reckon that the DFS implementation of JGit does not have this issue because it does not use the org.eclipse.jgit.internal.storage.file.FileSnapshot for checking if the in-memory cache is still valid or not. I would then also rephrase the term "issue" because it is the way the racy reads work in JGit when using a local filesystem: it works as designed. @Han-Wen @Matthias do you believe my analysis is correct?
(In reply to Luca Milanesio from comment #5) > I believe this is all related to the consistency check between the in-memory > cache and the underlying filesystem. > > Yes, the packed-refs are cached in memory *BUT* we check that they have not > been modified on disk: > > See RefDirectory.java:882 > > final PackedRefList curList = packedRefs.get(); > if (trustFolderStat && !curList.snapshot.isModified(packedRefsFile)) { > return curList; > } > > With regards to the cost of curList.snapshot.isModified() method, that has > been very well explained by Matthias in his talk at the Gerrit User Summit > 2019 in Sunnyvale (see [2]) and is all about the racy-reads problem in Git. > > Bottom-line: depending on the version of the JVM, the OS and the resolution > of the underlying filesystem, it may take milli-seconds, micro-seconds or > even seconds ! > > @Matthias did I get it right? > > [2] https://www.youtube.com/watch?v=m44cAozuLNI it depends :-) - if e.g. packed-refs was modified very recently (delta less than filesystem timestamp resolution) it might be racy and we may have to re-read and re-parse packed-refs since we can't be sure if the file changed or not - if lastModified is older than filesystem timestamp resolution we only have to stat the file in order detect this fact and can reuse the cached content I think most of the time we have the second case (that's why FileSnapshot was implemented). The chance to hit the racy case causing expensive re-reads is higher on filesystems with more coarse grained timestamp resolution and on older Java versions (due to bugs and worse Java-internal file timestamp resolution). For the details see my presentation.
Here results from the existing CreateFileSnapshotBenchmark on MacOS 11.5.2 on my MacBook 2.9 GHz Quad-Core Intel Core i7 APPLE SSD SM0512L filesystem: APFS Java 8 (Temurin build 1.8.0_302-b08) Benchmark Mode Cnt Score Error Units CreateFileSnapshotBenchmark.testCreateFile avgt 25 191742.726 ± 31218.288 ns/op CreateFileSnapshotBenchmark.testCreateFileSnapshot avgt 25 13966.100 ± 1254.139 ns/op Java 11 (Temurin 11.0.12+7) Benchmark Mode Cnt Score Error Units CreateFileSnapshotBenchmark.testCreateFile avgt 25 243514,723 ± 21497,747 ns/op CreateFileSnapshotBenchmark.testCreateFileSnapshot avgt 25 13244,470 ± 608,150 ns/op Java 17 (SapMachine build 17+35-LTS-sapmachine) Benchmark Mode Cnt Score Error Units CreateFileSnapshotBenchmark.testCreateFile avgt 25 185828,450 ± 10317,174 ns/op CreateFileSnapshotBenchmark.testCreateFileSnapshot avgt 25 11878,963 ± 247,574 ns/op This means creating FileSnapshot capturing stat data for a file takes around 14 (java 8), 13 (java 11), 12 (java 17) microseconds on my laptop. The time needed for FileSnapshot.isModified should be in the same order of magnitude since it has to again stat the file to get current FileAttributes and then compare them against the stat data cached in FileSnapshot. If FileSnapshot.isModified returns false JGit reuses the cached packedRefs otherwise (or if trustfolderstat=false) it reads them again. AFAICS in RefDirectory also loose refs use FileSnapshot to determine if the cached ref can be reused or if the file needs to be re-read. The same dance is done for loose refs, but there might be many of them.
More aggressive caching could be done if we would know there is only a single jgit process having access to the git repository. Then this process could manage invalidation of stale cache entries. But if e.g. an administrator uses another cgit process to poke in a git repository in the site of a running Gerrit server we have to stat the files to ensure we don't corrupt the repository by basing a write operation on stale cache content read earlier.
I've added a specific benchmark for the Gerrit / All-Users use-case: https://git.eclipse.org/r/c/jgit/jgit/+/185683
On EL7 Linux Kernel 3.10.0 / 500GB RAM / 48 CPUs / NFS v3 / Java 11: - avg: 410683 ns (0.4 ms) ±(99.9%) 59595 ns/op On MacOS 11.4 (BigSur) 2.4 GHz 8-Core Intel Core i9 / 64GB RAM / APFS (4 TB SSD) / Java 11: - avg: 131498.381 ±(99.9%) 3503.044 ns/op From Gerrit v3.2 we were expecting the above figures to be comparable to an in-memory cached value, therefore in the order of a few ns, whilst they are between 0.13 ms (on SSD) and 0.41 ms (on NFS). I will try to re-run the tests by disabling on JGit the check for the file snapshot and report tomorrow the results.
I tried to put the opening of the Git repository out of the equation and also the file snapshot, but still it takes more than expected, in the order of 50 micro-secs whilst I was expecting something like 1000x times faster. By looking at the code carefully, I don't believe JGit is really doing caching but just a mere in-memory buffering, see below: RefDirectory.java:1049 private Ref readRef(String name, RefList<Ref> packed) throws IOException { final RefList<LooseRef> curList = looseRefs.get(); final int idx = curList.find(name); [...] Imagine you have a high number of loose refs (it may happen), the above statement would just do a linear search, without any hashing. Also when the refs are loaded from a packfile, they are still searched by name in a linear list: RefList.java:135 public final int find(String name) { int high = cnt; if (high == 0) return -1; int low = 0; do { final int mid = (low + high) >>> 1; final int cmp = RefComparator.compareTo(list[mid], name); if (cmp < 0) low = mid + 1; else if (cmp == 0) return mid; else high = mid; } while (low < high); return -(low + 1); } I believe JGit doesn't really cache refs at the moment and we should introduce such a layer using Caffeine or similar and using hashing functions to locate the refs by name. @Matthias @Han-Wen WDYT?
(In reply to Luca Milanesio from comment #12) > By looking at the code carefully, I don't believe JGit is really doing > caching but just a mere in-memory buffering, see below: What would you expect to be different with caching vs this "buffering"? From what you have below, it sounds like maybe you mean that you can do an O(1) lookup with a key? > Imagine you have a high number of loose refs (it may happen), the above > statement would just do a linear search, without any hashing. How many loose refs would you need for this to cause a measurable performance problem? > > Also when the refs are loaded from a packfile, they are still searched by > name in a linear list: This find() is doing a binary search, not a linear scan, correct? And this same code is used for both loose and packed refs. > > I believe JGit doesn't really cache refs at the moment and we should > introduce such a layer using Caffeine or similar and using hashing functions > to locate the refs by name. Obviously having an O(1) lookup would be better than the current O(log n) lookup (and I don't know of any reason we wouldn't want to try to be able to safely get O(1) lookups for refs), but does that difference explain the performance concern you see?
I've re-run the test by: - removing the open/close of the repository - enabling reftable See below the results: On MacOS 11.4 (BigSur) 2.4 GHz 8-Core Intel Core i9 / 64GB RAM / APFS (4 TB SSD) / Java 11: 15175.585 ±(99.9%) 877.848 ns/op [Average] (min, avg, max) = (13508.188, 15175.585, 17417.864), stdev = 1171.902 CI (99.9%): [14297.737, 16053.432] The above results are 8x times faster than the traditional packed/loose refs implementation. However, we are still in the micro-seconds units and very far from being as fast as an in-memory cache. Is there any cache in reftable that I can enable? I enabled it as: [core] repositoryformatversion = 1 [extensions] refStorage = reftable
(In reply to Nasser Grainawi from comment #13) > (In reply to Luca Milanesio from comment #12) > > By looking at the code carefully, I don't believe JGit is really doing > > caching but just a mere in-memory buffering, see below: > > What would you expect to be different with caching vs this "buffering"? From > what you have below, it sounds like maybe you mean that you can do an O(1) > lookup with a key? > > > Imagine you have a high number of loose refs (it may happen), the above > > statement would just do a linear search, without any hashing. > > How many loose refs would you need for this to cause a measurable > performance problem? I don't know, it takes a very long time to create so many refs: as the number of refs grows, the rate of creating them slows down. In the benchmark there are 10k refs and I don't see a noticeable difference between loose refs and packed refs. I guess the number must be then much higher than that, but that's just a finger in the air. > > Also when the refs are loaded from a packfile, they are still searched by > > name in a linear list: > > This find() is doing a binary search, not a linear scan, correct? Yes, I said a linear list and then the code tries to be smarter with a dihcotomic search, hence with a O(log(n)) complexity. That is still far away from what you would get from a Guava/Caffeine cache, which is a O(K) for a well-balanced cache. > > I believe JGit doesn't really cache refs at the moment and we should > > introduce such a layer using Caffeine or similar and using hashing functions > > to locate the refs by name. > > Obviously having an O(1) lookup would be better than the current O(log n) > lookup (and I don't know of any reason we wouldn't want to try to be able to > safely get O(1) lookups for refs), but does that difference explain the > performance concern you see? I believe it does, but it isn't the only bottleneck. JGit relies in "auto-detection" of any modification on the filesystem of any files, which adds also a tons of system calls that you won't have in a Guava/Caffeine cache. Bottom line: I believe this bug can be closed with a statement "JGit does not currently implement a Guava or Caffeine-style caching layer of refs lookups", hence the execution time of a ref lookup can vary from 15 micro-secs (ref-table on local SSD) to 0.5 msec (packed-ref on networked NFS volume). The numbers provided would be useful for others relying on refs lookups for taking into consideration of what the execution time could be, with a local SSD, an NFS storage and with PackedRefs vs. RefTable.
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/186206
I implemented a PoC InMemoryRefDatabase based on a ternary search tree in jgit which can be used as a Ref cache: https://git.eclipse.org/r/c/jgit/jgit/+/186205 I also adapted the benchmark Luca implemented https://git.eclipse.org/r/c/jgit/jgit/+/186206 and ran it on my laptop, here the results on a MacBook 2.9 GHz Quad-Core Intel Core i7, SSD, filesystem APFS using doGc=true and trustFolderStat=true: Benchmark (doGc) (numBranches) (trustFolderStat) Mode Cnt Score Error Units GetRefsBenchmark.testGetExactRef true 100 true avgt 2 80,186 us/op GetRefsBenchmark.testGetExactRef true 2500 true avgt 2 85,664 us/op GetRefsBenchmark.testGetExactRef true 10000 true avgt 2 93,523 us/op GetRefsBenchmark.testGetExactRefCached true 100 true avgt 2 0,040 us/op GetRefsBenchmark.testGetExactRefCached true 2500 true avgt 2 0,043 us/op GetRefsBenchmark.testGetExactRefCached true 10000 true avgt 2 0,041 us/op GetRefsBenchmark.testGetRefsByPrefix true 100 true avgt 2 43,639 us/op GetRefsBenchmark.testGetRefsByPrefix true 2500 true avgt 2 50,517 us/op GetRefsBenchmark.testGetRefsByPrefix true 10000 true avgt 2 52,297 us/op GetRefsBenchmark.testGetRefsByPrefixCached true 100 true avgt 2 0,120 us/op GetRefsBenchmark.testGetRefsByPrefixCached true 2500 true avgt 2 0,139 us/op GetRefsBenchmark.testGetRefsByPrefixCached true 10000 true avgt 2 0,234 us/op This means in the tested range of 100-10k refs with the cache - getExactRef is faster by a factor of 2000 - getRefsByPrefix is faster by a factor of 200-380 Updating the cache is implemented in a simple brute-force way: the cache listens on RefChangedEvents and reloads all refs on each event from the RefDatabase wrapped by the cache.
Cool stuff @Matthias, well done. The performance figures are definitely encouraging: this is of course a PoC and it would be nice to a design on how this should work with the final solution. How do you guys typically manage new designs in the JGit project?
Created attachment 287417 [details] benchmark results for reloading the ref cache measured using https://git.eclipse.org/r/c/jgit/jgit/+/186415/10 on a MacBook 2.6 GHz 6-Core Intel Core i7 with SSD
Created attachment 287418 [details] benchmark results getExactRef and getRefsByPrefix measured using https://git.eclipse.org/r/c/jgit/jgit/+/186415/10 on a MacBook 2.6 GHz 6-Core Intel Core i7 with SSD
I parametrized the following tests to run without and with the cache: - RefUpdateTest (https://git.eclipse.org/r/c/jgit/jgit/+/186950) - BatchRefUpdateTest (https://git.eclipse.org/r/c/jgit/jgit/+/187187) - RefDirectoryTest (https://git.eclipse.org/r/c/jgit/jgit/+/187194) and all tests are now passing :-)
Cool stuff @Matthias. I'll check the change to see if we could start using it in Gerrit.
(In reply to Luca Milanesio from comment #22) > Cool stuff @Matthias. I'll check the change to see if we could start using > it in Gerrit. wait a minute, I just found out that my benchmark measures crap since it's measuring an empty InMemoryRefDatabase :-(0). Found that by creating a heap dump to measure memory consumption.
Created attachment 287428 [details] benchmark for reloading the cache fixed several bugs, added one more repository
Created attachment 287429 [details] benchmark results getExactRef and getRefsByPrefix fixed the benchmark, override getRefsByPrefix, new method to directly load values without keys
Awesome, looks like >1000x speed-up. Hopefully, that quells concerns about notedb performance
Created attachment 287437 [details] benchmark results getExactRef and getRefsByPrefix with memory consumption yep, though this speedup comes with a memory footprint for the cache, see the latest attachment. Should be good enough for speeding up ref access in Gerrit's All-Users repository.
Created attachment 287507 [details] benchmark results getExactRef and getRefsByPrefix with memory consumption added chromium repository
Gerrit change https://git.eclipse.org/r/c/jgit/jgit/+/185683 was merged to [master]. Commit: http://git.eclipse.org/c/jgit/jgit.git/commit/?id=a706835a49b3973fd0b15ea7edf42de1aa86627a
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/197555
New Gerrit change created: https://git.eclipse.org/r/c/jgit/jgit/+/197557
I rebased my RefCache series from last year onto master and fixed the bugs it had https://git.eclipse.org/r/c/jgit/jgit/+/197557 Tests are now passing :-) I still need to rework the benchmarks - the implementation provides a write-through cache which is updated incrementally when RefUpdate, BatchRefUpdate, RefRename occur. - the cache is enabled if core.refCache=true and extensions.refStorage is unset or set to "refdir" - the cache is caching all refs of the repos configured to use it - the cache doesn't recognize if refs are updated behind JGit's back e.g. by another process - FileRepository exposes the cache so that an application using JGit can trigger a reload of the cache if necessary - tests show that the cache consumes 200-600 bytes per cached ref
Gerrit change https://git.eclipse.org/r/c/jgit/jgit/+/197555 was merged to [master]. Commit: http://git.eclipse.org/c/jgit/jgit.git/commit/?id=70b436b1b2de249c795035ca1c655c34bceaf01f