Bug 499471 - Hudson CI builds hanging on making workspace modifications
Summary: Hudson CI builds hanging on making workspace modifications
Status: RESOLVED FIXED
Alias: None
Product: CDT
Classification: Tools
Component: cdt-releng (show other bugs)
Version: Next   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: cdt-releng-inbox@eclipse.org CLA
QA Contact: Marc Khouzam CLA
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 499777 501923 515560
  Show dependency tree
 
Reported: 2016-08-09 15:27 EDT by Jonah Graham CLA
Modified: 2017-04-21 09:49 EDT (History)
5 users (show)

See Also:


Attachments
output of jps/jstack: https://hudson.eclipse.org/cdt/job/jstack-all/4/consoleText (310.20 KB, application/octet-stream)
2016-08-09 15:27 EDT, Jonah Graham CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jonah Graham CLA 2016-08-09 15:27:32 EDT
Created attachment 263541 [details]
output of jps/jstack: https://hudson.eclipse.org/cdt/job/jstack-all/4/consoleText

Processes are being left around after hanging tests. This problem is therefore two-fold:
- Hudson not cleaning up fully
- CDT causing Eclipse/JUnit to hang

Here is a summary of my analysis as shared on cdt-dev. Attached is the full log from the jps/jstack that Marc Andre set up:

I think they are all ours. Some of them are Hudson processes and some
are maven (presumably the maven are the parents of the eclipse ones),
but the suspicious ones all have this or something similar as one of
the thread's stack trace (see below). There was one verify build
running when I captured this.

By my quick count:
- total process: 14
- maven processes waiting on a Unix process (blocking on
java.lang.UNIXProcess.waitFor): 6
- number of processes blocked on doing
CProjectDescriptionManager.runWspModification: 6
- hudson itself: 1
- the build that was actually known to be running: 1 (maven was
zipping at the time in
org.codehaus.plexus.archiver.zip.AbstractZipArchiver.zipFile)

I'll file a bug on this for what its worth.

"Worker-3" #58 prio=5 os_prio=0 tid=0x00007fc1242d9000 nid=0x1a10 in
Object.wait() [0x00007fc10dbd0000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.eclipse.core.internal.jobs.ThreadJob.waitForRun(ThreadJob.java:270)
- locked <0x00000000ff700bc0> (a java.lang.Object)
at org.eclipse.core.internal.jobs.ThreadJob.joinRun(ThreadJob.java:197)
at org.eclipse.core.internal.jobs.ImplicitJobs.begin(ImplicitJobs.java:92)
at org.eclipse.core.internal.jobs.JobManager.beginRule(JobManager.java:307)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.runWspModification(CProjectDescriptionManager.java:483)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.runWspModification(CProjectDescriptionManager.java:454)
at org.eclipse.cdt.internal.core.settings.model.CExternalSettingsManager.settingsChanged(CExternalSettingsManager.java:439)
at org.eclipse.cdt.internal.core.settings.model.CExternalSettingContainerFactoryWithListener.notifySettingsChange(CExternalSettingContainerFactoryWithListener.java:48)
at org.eclipse.cdt.internal.core.settings.model.CfgExportSettingContainerFactory.handleEvent(CfgExportSettingContainerFactory.java:253)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.notifyListeners(CProjectDescriptionManager.java:2212)
at org.eclipse.cdt.internal.core.settings.model.AbstractCProjectDescriptionStorage.fireLoadedEvent(AbstractCProjectDescriptionStorage.java:268)
at org.eclipse.cdt.internal.core.settings.model.xml.XmlProjectDescriptionStorage.getProjectDescription(XmlProjectDescriptionStorage.java:264)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescriptionInternal(CProjectDescriptionManager.java:437)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:419)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:413)
at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:406)
at org.eclipse.cdt.core.model.CoreModel.getProjectDescription(CoreModel.java:1424)
Comment 1 Marc-André Laperle CLA 2016-08-09 15:30:36 EDT
(In reply to Jonah Graham from comment #0)
> Processes are being left around after hanging tests. This problem is
> therefore two-fold:
> - Hudson not cleaning up fully

This looks like bug 443081. I can try to switch it to a shell script and see if that helps.
Comment 2 Jonah Graham CLA 2016-08-09 17:21:53 EDT
Wohoo! I have reproduced the stack trace in Comment 0 exactly.

The (or one of the) suspect test is org.eclipse.cdt.core.settings.model.CConfigurationDescriptionExportSettings.testProjectImport().

To reproduce the error this is what I did, after identifying frame in stack dump that on code review had potential for deadlock:
1) Create a breakpoint on CProjectDescriptionManager.runWspModification(IWorkspaceRunnable, ISchedulingRule, IProgressMonitor) line 481 (in the second if (!scheduleRule) that has the comment // Run immediately
2) Run the org.eclipse.cdt.core.suite.AutomatedIntegrationSuite 
3) As that breakpoint is hit often, added this condition
  !"main".equals(Thread.currentThread().getName()) && Thread.currentThread().getStackTrace().length > 20
the first part so it didn't break in main and the second part because there were still lots of short stack traces I didn't care about
4) re-run suite, but it is very slow now because of my condition! 
5) When breakpoint is finally hit I can see testProjectImport is running in JUnit view
6) Suspend whole vm, see matching stack traces, press resume and vm appears deadlocked.

Now to narrow it down.
Comment 3 Jonah Graham CLA 2016-08-09 20:13:34 EDT
I have tracked down the root cause.

First explanation of the symptom. 

- The main thread deletes a workspace resource (deleting the project that was created for the test)
- therefore it obtains a workspace lock on the main thread
- with the lock main thread does delete with associated resource change notifiers
- which triggers PDOM to stop indexing
- which requires PDOM to acquire the PDOMIndexerJob.taskMutex, but the thread described next has already obtained it

meanwhile, on another PDOMIndexer thread (run as an Eclipse job)
- PDOM starts indexing project because it was kicked off by PDOMSetupJob
- PDOM acquires the PDOMIndexerJob.taskMutex
- PDOM (eventually, many stack frames deeper) loads project description
- for reasons explained below this is the first time the project description is loaded, therefore CProjectDescriptionManager.notifyListeners is run
- one of the listeners is CfgExportSettingContainerFactory.handleEvent which, in the case of dependencies between projects, causes an update to build settings (see Bug 312575 for why)
- which requires a workspace lock to make the modifications, but the main thread has already obtained it

and now we are deadlocked between the two threads!


As it turns out, the PDOMIndexerJob is not supposed to be able to open a project description first ever. However there is no code in the indexer job itself to indicate that because where the job starts is too disconnected from where the CProjectDescriptionManager.getProjectDescription is called.

Therefore, the implementation has the PDOMSetupJob prevent the PDOMIndexerJob from being started until some other operation has loaded the project description for the first time. This happens in PDOMSetupJob.run. 

The problem is there is a race condition in PDOMSetupJob.run. In between when fManager.postponeSetup and fManager.createIndexer are called a project can be deleted and a new (or same) project created with the same name. The delete causes the loaded project description to be removed and then as createIndexer schedules the PDOMIndexerJob, the PDOMIndexerJob causes the project description to load as described above.

I am still considering what workarounds are possible. At the moment the createIndexer method, with appropriate locks, checks to make sure the project has not been removed between the pre-checks and scheduling the PDOMSetupJob. I suspect adding a similar check within the lock on the project being added may work. However, there is still the complication of the postponed jobs set and whether or not that should be checked. At the moment the postponed jobs set is not cleared of entries when a project is deleted from the workspace, so there is a new race condition option open here.

Pardon my late night ramblings, hopefully this explanation will be clear when I return to this problem when I can.
Comment 4 Jonah Graham CLA 2016-08-09 20:14:57 EDT
Another thing to mention is that it may be best to consider working around this very hard to hit issue in the IDE by adjusting the tests instead.
Comment 5 Jonah Graham CLA 2016-08-10 05:08:23 EDT
(In reply to Jonah Graham from comment #3)
> As it turns out, the PDOMIndexerJob is not supposed to be able to open a
> project description first ever. However there is no code in the indexer job
> itself to indicate that because where the job starts is too disconnected
> from where the CProjectDescriptionManager.getProjectDescription is called.
This conclusion may not be right. The PDOMSetupJob is not supposed to open a description (see Bug 310795), but the indexer is fine with that.

However, it is still the case that there is a race condition in the PDOMSetupJob. It is possible for PDOMSetupJob to test if a project is ready to index, and then have the project changed under its feet before the index is actually created.

My next set of issue is back to the deadlock itself then.
Comment 6 Eclipse Genie CLA 2016-08-10 07:37:14 EDT
New Gerrit change created: https://git.eclipse.org/r/78748
Comment 7 Jonah Graham CLA 2016-08-10 07:38:15 EDT
With https://git.eclipse.org/r/78748 I have put in a test workaround to see the impact on the build machine.
Comment 8 Marc-André Laperle CLA 2016-08-10 09:30:07 EDT
Perhaps Sergey (CC'd) has an opinion about this?
Comment 10 Jonah Graham CLA 2016-08-12 06:06:38 EDT
I have submitted a change that I believe worksaround the tests hanging. I'll keep my eye on the builds and see if any future ones timeout.
Comment 11 Marc Khouzam CLA 2016-08-12 07:15:48 EDT
Thanks Jonah!
Comment 12 Doug Schaefer CLA 2016-08-12 10:15:00 EDT
Super thanks! :)
Comment 13 Jonah Graham CLA 2016-09-21 10:20:29 EDT
It has been a month now and AFAICT there hasn't been a hung build and there are no processes still running on the build machine.

The jstack-all runs https://hudson.eclipse.org/cdt/job/jstack-all/19 and https://hudson.eclipse.org/cdt/job/jstack-all/20 show that the only java process that are running are the expected hudson ones.