Bug 582044 - Concurrent batch of delete tags fails even if packed-refs is updated on disk
Summary: Concurrent batch of delete tags fails even if packed-refs is updated on disk
Status: RESOLVED FIXED
Alias: None
Product: JGit
Classification: Technology
Component: JGit (show other bugs)
Version: 5.13   Edit
Hardware: PC Mac OS X
: P3 normal (vote)
Target Milestone: 5.13.2   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-07 17:28 EDT by Luca Milanesio CLA
Modified: 2023-06-09 05:14 EDT (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Milanesio CLA 2023-06-07 17:28:15 EDT
When running multiple concurrent threads that are updating the same repository's packed-refs, the removal of tags can fail randomly returning:

org.eclipse.jgit.errors.ObjectWritingException: Unable to write packed-refs
       at org.eclipse.jgit.internal.storage.file.RefDirectory$1.writeFile(RefDirectory.java:1054)
       at org.eclipse.jgit.lib.RefWriter.writePackedRefs(RefWriter.java:169)
       at org.eclipse.jgit.internal.storage.file.RefDirectory.commitPackedRefs(RefDirectory.java:1063)
       at org.eclipse.jgit.internal.storage.file.RefDirectory.delete(RefDirectory.java:646)
       at org.eclipse.jgit.internal.storage.file.RefDirectoryUpdate.doDelete(RefDirectoryUpdate.java:119)
       at org.eclipse.jgit.lib.RefUpdate$2.execute(RefUpdate.java:650)
       at org.eclipse.jgit.lib.RefUpdate.updateImpl(RefUpdate.java:745)
       at org.eclipse.jgit.lib.RefUpdate.delete(RefUpdate.java:647)
       at org.eclipse.jgit.lib.BatchRefUpdate.execute(BatchRefUpdate.java:485)
       at org.eclipse.jgit.internal.storage.file.PackedBatchRefUpdate.execute(PackedBatchRefUpdate.java:107)
       at org.eclipse.jgit.lib.BatchRefUpdate.execute(BatchRefUpdate.java:582)

This has been observed on all JGit versions from 5.13 up to master, possibly also existed before.

The problem is that even if JGit throws an exception upon the call to deletion of refs through a BatchRefUpdate, the underlying operation succeeds on disk.

When using JGit with Gerrit in a multi-site environment, we rely on the success or failure of a BatchRefUpdate for updating also the global-refdb. However, because of this inconsistent behaviour from JGit, we end up in a split-brain situation, where the local packed-refs status is out of sync with the one in the global-refdb, because the local refdb on disk does not reflect the status of the JGit failure reported.
Comment 1 Luca Milanesio CLA 2023-06-07 17:49:47 EDT
This is happening because of the following sequence of events:


E1: org.eclipse.jgit.internal.storage.file.RefDirectory.delete(RefDirectory.java:646)

The delete() method takes the packed refs *before* and *after* locking it. It also assumes that getPackedRefs() always returns the list of refs that is stored in the atomic reference associated with the refdb, which MAY not be the case when there is concurrency (see below).

		// Write the packed-refs file using an atomic update. We might
		// wind up reading it twice, before and after the lock, to ensure
		// we don't miss an edit made externally.
		final PackedRefList packed = getPackedRefs(); <<< The packedList here is assumed to be the one in packedRefs atomic reference
		if (packed.contains(name)) {
			inProcessPackedRefsLock.lock();
			try {
				LockFile lck = lockPackedRefsOrThrow();
				try {
					PackedRefList cur = readPackedRefs();
					int idx = cur.find(name);
					if (0 <= idx) {
						commitPackedRefs(lck, cur.remove(idx), packed, true);
					}
				} finally {
					lck.unlock();
				}
			} finally {
				inProcessPackedRefsLock.unlock();
			}
		}

In case of concurrency though, the getPackedRefs() *DOES NOT* assures that the latest list has been stored, because the 'packedRefs.compareAndSet()' will return false and will not set 'newList' into packedRefs. Nevertheless, the newList is returned.

	PackedRefList getPackedRefs() throws IOException {
		boolean trustFolderStat = getRepository().getConfig().getBoolean(
				ConfigConstants.CONFIG_CORE_SECTION,
				ConfigConstants.CONFIG_KEY_TRUSTFOLDERSTAT, true);

		final PackedRefList curList = packedRefs.get();
		if (trustFolderStat && !curList.snapshot.isModified(packedRefsFile)) {
			return curList;
		}

		final PackedRefList newList = readPackedRefs();
		if (packedRefs.compareAndSet(curList, newList)           <<< If there is concurrency here, comparedAndSet() may NOT actually set the newList
				&& !curList.id.equals(newList.id)) {
			modCnt.incrementAndGet();
		}
		return newList;
	}

E2: org.eclipse.jgit.internal.storage.file.RefDirectory$1.writeFile(RefDirectory.java:1054)

The writing of the new packed-refs is successfully completed on disk, however, because the oldPackedList was not the one expected stored in packedRefs (see the concurrency issue happened in E1), the operation throws an exception causing the whole process to fail, even if the data on disk has been already written successfully. 

				if (!lck.commit())
					throw new ObjectWritingException(MessageFormat.format(JGitText.get().unableToWrite, name));

				byte[] digest = Constants.newMessageDigest().digest(content);               <<< When we get to this point, the packed-refs has been written
				PackedRefList newPackedList = new PackedRefList(
						refs, lck.getCommitSnapshot(), ObjectId.fromRaw(digest));   <<< This is the newly written packed-refs content and SHA1

				// This thread holds the file lock, so no other thread or process should
				// be able to modify the packed-refs file on disk. If the list changed,
				// it means something is very wrong, so throw an exception.
				//
				// However, we can't use a naive compareAndSet to check whether the
				// update was successful, because another thread might _read_ the
				// packed refs file that was written out by this thread while holding
				// the lock, and update the packedRefs reference to point to that. So
				// compare the actual contents instead.
				PackedRefList afterUpdate = packedRefs.updateAndGet(
						p -> p.id.equals(oldPackedList.id) ? newPackedList : p);   <<< This fails, because 'oldPackedList' was not set into packedRefs
				if (!afterUpdate.id.equals(newPackedList.id)) {
					throw new ObjectWritingException(
							MessageFormat.format(JGitText.get().unableToWrite, name));
				}
				if (changed) {
					modCnt.incrementAndGet();
				}
				result.set(newPackedList);


--- * ---

There are three issues here:

Issue-1: The getPackedRefs() should be retrying if packedRefs.compareAndSet(curList, newList) returns false, instead of returning the newList and making the caller assuming that the atomic ref has been updated, which is not the case.

Issue-2: Throwing an ObjectWritingException() *after* the writing has been successful, is not the right thing. Transactions should *first* check all their pre-requisites and, only when they are satisfied, execute the write. Once the data has been written and committed, there is no point in throwing an exception.

Issue-3: The exception class and description ("Unable to write packed-refs") are both inherently wrong: this was a concurrency locking issue and there is no underlying relationship with the actual writing to disk, it should have been an IllegalStateException or other locking related one. The description also is wrong and possibly the result of a copy&paste: I had to amend the code and put a different description to understand that this was failing in this part of the code and not in the actual write to disk.

I would then issue three different changes in a chain, fixing Issue-1, Issue-2 and Issue-3 as separate changes.
Comment 2 Luca Milanesio CLA 2023-06-07 19:02:14 EDT
I believe that Martin Fick has already found this issue and reverted Dave's commit with [1], therefore the problem does not apply to JGit v6.6 and master anymore.

However, all stable branches (5.13, 6.0, 6.1, 6.2, 6.3, 6.4 and 6.5) still have this serious issue and must be fixed.

I am going to cherry-pick Martin's revert to stable-5.13 and merge upstream.

[1] https://git.eclipse.org/r/c/jgit/jgit/+/197823
Comment 3 Luca Milanesio CLA 2023-06-07 19:09:54 EDT
Fixed with Martin's revert, cherry-picked to stable-5.13:
https://git.eclipse.org/r/c/jgit/jgit/+/201992
Comment 4 Matthias Sohn CLA 2023-06-09 05:14:37 EDT
cherry-pick and all merges up to master have been submitted