Community
Participate
Working Groups
Created attachment 263541 [details] output of jps/jstack: https://hudson.eclipse.org/cdt/job/jstack-all/4/consoleText Processes are being left around after hanging tests. This problem is therefore two-fold: - Hudson not cleaning up fully - CDT causing Eclipse/JUnit to hang Here is a summary of my analysis as shared on cdt-dev. Attached is the full log from the jps/jstack that Marc Andre set up: I think they are all ours. Some of them are Hudson processes and some are maven (presumably the maven are the parents of the eclipse ones), but the suspicious ones all have this or something similar as one of the thread's stack trace (see below). There was one verify build running when I captured this. By my quick count: - total process: 14 - maven processes waiting on a Unix process (blocking on java.lang.UNIXProcess.waitFor): 6 - number of processes blocked on doing CProjectDescriptionManager.runWspModification: 6 - hudson itself: 1 - the build that was actually known to be running: 1 (maven was zipping at the time in org.codehaus.plexus.archiver.zip.AbstractZipArchiver.zipFile) I'll file a bug on this for what its worth. "Worker-3" #58 prio=5 os_prio=0 tid=0x00007fc1242d9000 nid=0x1a10 in Object.wait() [0x00007fc10dbd0000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.eclipse.core.internal.jobs.ThreadJob.waitForRun(ThreadJob.java:270) - locked <0x00000000ff700bc0> (a java.lang.Object) at org.eclipse.core.internal.jobs.ThreadJob.joinRun(ThreadJob.java:197) at org.eclipse.core.internal.jobs.ImplicitJobs.begin(ImplicitJobs.java:92) at org.eclipse.core.internal.jobs.JobManager.beginRule(JobManager.java:307) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.runWspModification(CProjectDescriptionManager.java:483) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.runWspModification(CProjectDescriptionManager.java:454) at org.eclipse.cdt.internal.core.settings.model.CExternalSettingsManager.settingsChanged(CExternalSettingsManager.java:439) at org.eclipse.cdt.internal.core.settings.model.CExternalSettingContainerFactoryWithListener.notifySettingsChange(CExternalSettingContainerFactoryWithListener.java:48) at org.eclipse.cdt.internal.core.settings.model.CfgExportSettingContainerFactory.handleEvent(CfgExportSettingContainerFactory.java:253) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.notifyListeners(CProjectDescriptionManager.java:2212) at org.eclipse.cdt.internal.core.settings.model.AbstractCProjectDescriptionStorage.fireLoadedEvent(AbstractCProjectDescriptionStorage.java:268) at org.eclipse.cdt.internal.core.settings.model.xml.XmlProjectDescriptionStorage.getProjectDescription(XmlProjectDescriptionStorage.java:264) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescriptionInternal(CProjectDescriptionManager.java:437) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:419) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:413) at org.eclipse.cdt.internal.core.settings.model.CProjectDescriptionManager.getProjectDescription(CProjectDescriptionManager.java:406) at org.eclipse.cdt.core.model.CoreModel.getProjectDescription(CoreModel.java:1424)
(In reply to Jonah Graham from comment #0) > Processes are being left around after hanging tests. This problem is > therefore two-fold: > - Hudson not cleaning up fully This looks like bug 443081. I can try to switch it to a shell script and see if that helps.
Wohoo! I have reproduced the stack trace in Comment 0 exactly. The (or one of the) suspect test is org.eclipse.cdt.core.settings.model.CConfigurationDescriptionExportSettings.testProjectImport(). To reproduce the error this is what I did, after identifying frame in stack dump that on code review had potential for deadlock: 1) Create a breakpoint on CProjectDescriptionManager.runWspModification(IWorkspaceRunnable, ISchedulingRule, IProgressMonitor) line 481 (in the second if (!scheduleRule) that has the comment // Run immediately 2) Run the org.eclipse.cdt.core.suite.AutomatedIntegrationSuite 3) As that breakpoint is hit often, added this condition !"main".equals(Thread.currentThread().getName()) && Thread.currentThread().getStackTrace().length > 20 the first part so it didn't break in main and the second part because there were still lots of short stack traces I didn't care about 4) re-run suite, but it is very slow now because of my condition! 5) When breakpoint is finally hit I can see testProjectImport is running in JUnit view 6) Suspend whole vm, see matching stack traces, press resume and vm appears deadlocked. Now to narrow it down.
I have tracked down the root cause. First explanation of the symptom. - The main thread deletes a workspace resource (deleting the project that was created for the test) - therefore it obtains a workspace lock on the main thread - with the lock main thread does delete with associated resource change notifiers - which triggers PDOM to stop indexing - which requires PDOM to acquire the PDOMIndexerJob.taskMutex, but the thread described next has already obtained it meanwhile, on another PDOMIndexer thread (run as an Eclipse job) - PDOM starts indexing project because it was kicked off by PDOMSetupJob - PDOM acquires the PDOMIndexerJob.taskMutex - PDOM (eventually, many stack frames deeper) loads project description - for reasons explained below this is the first time the project description is loaded, therefore CProjectDescriptionManager.notifyListeners is run - one of the listeners is CfgExportSettingContainerFactory.handleEvent which, in the case of dependencies between projects, causes an update to build settings (see Bug 312575 for why) - which requires a workspace lock to make the modifications, but the main thread has already obtained it and now we are deadlocked between the two threads! As it turns out, the PDOMIndexerJob is not supposed to be able to open a project description first ever. However there is no code in the indexer job itself to indicate that because where the job starts is too disconnected from where the CProjectDescriptionManager.getProjectDescription is called. Therefore, the implementation has the PDOMSetupJob prevent the PDOMIndexerJob from being started until some other operation has loaded the project description for the first time. This happens in PDOMSetupJob.run. The problem is there is a race condition in PDOMSetupJob.run. In between when fManager.postponeSetup and fManager.createIndexer are called a project can be deleted and a new (or same) project created with the same name. The delete causes the loaded project description to be removed and then as createIndexer schedules the PDOMIndexerJob, the PDOMIndexerJob causes the project description to load as described above. I am still considering what workarounds are possible. At the moment the createIndexer method, with appropriate locks, checks to make sure the project has not been removed between the pre-checks and scheduling the PDOMSetupJob. I suspect adding a similar check within the lock on the project being added may work. However, there is still the complication of the postponed jobs set and whether or not that should be checked. At the moment the postponed jobs set is not cleared of entries when a project is deleted from the workspace, so there is a new race condition option open here. Pardon my late night ramblings, hopefully this explanation will be clear when I return to this problem when I can.
Another thing to mention is that it may be best to consider working around this very hard to hit issue in the IDE by adjusting the tests instead.
(In reply to Jonah Graham from comment #3) > As it turns out, the PDOMIndexerJob is not supposed to be able to open a > project description first ever. However there is no code in the indexer job > itself to indicate that because where the job starts is too disconnected > from where the CProjectDescriptionManager.getProjectDescription is called. This conclusion may not be right. The PDOMSetupJob is not supposed to open a description (see Bug 310795), but the indexer is fine with that. However, it is still the case that there is a race condition in the PDOMSetupJob. It is possible for PDOMSetupJob to test if a project is ready to index, and then have the project changed under its feet before the index is actually created. My next set of issue is back to the deadlock itself then.
New Gerrit change created: https://git.eclipse.org/r/78748
With https://git.eclipse.org/r/78748 I have put in a test workaround to see the impact on the build machine.
Perhaps Sergey (CC'd) has an opinion about this?
Gerrit change https://git.eclipse.org/r/78748 was merged to [master]. Commit: http://git.eclipse.org/c/cdt/org.eclipse.cdt.git/commit/?id=b34d04c0d587d829f4df413de94017739ad582b1
I have submitted a change that I believe worksaround the tests hanging. I'll keep my eye on the builds and see if any future ones timeout.
Thanks Jonah!
Super thanks! :)
It has been a month now and AFAICT there hasn't been a hung build and there are no processes still running on the build machine. The jstack-all runs https://hudson.eclipse.org/cdt/job/jstack-all/19 and https://hudson.eclipse.org/cdt/job/jstack-all/20 show that the only java process that are running are the expected hudson ones.