HIPP instance: PDT

For a couple of weeks we have been experiencing disk access problems during our integration tests; they simply fail. org.eclipse.core.resources does not allow running sync() on a FileChannel, so that is not an option. Since yesterday the problem has grown worse: around 30% of jobs failed.
Can you help me understand what is failing?
The refactoring tests fail most often. They make heavy use of I/O operations, especially during rename:

1. Create a project
2. Add files
3. Build the project and index
4. Run the refactoring processor
5. Read the file from disk and compare the result: expected vs. actual file content
6. Drop the project

Example failure: https://hudson.eclipse.org/pdt/job/eclipse-next/lastCompletedBuild/testReport/org.eclipse.php.refactoring.core.rename/RenameTraitProcessorTest/test_resources_rename_renameTrait_testRenameTrait_pdtt_/

I have never been able to reproduce it locally (even when running 5 concurrent builds), and to me it looks like a classic disk access problem. Especially yesterday, probably due to a lot of "just before release" builds.
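The expected-vs-actual comparison in step 5 can be sketched in shell terms. This is only an illustration of the failure mode, not the actual test harness; the file names and contents are made up:

```shell
#!/bin/sh
# Illustrative sketch of step 5: after the refactoring processor has
# rewritten a file, read it back from disk and diff it against the
# expected content. A stale or incomplete read here would make the
# comparison fail even though the refactoring itself succeeded.
workdir=$(mktemp -d)
printf 'renamed content\n' > "$workdir/expected.php"   # what the test expects
printf 'renamed content\n' > "$workdir/actual.php"     # what the processor wrote
if diff -u "$workdir/expected.php" "$workdir/actual.php" > /dev/null; then
    result=PASS
else
    result=FAIL
fi
echo "comparison: $result"
rm -rf "$workdir"
```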
What is the full file path to these sample files?
Test definition: https://git.eclipse.org/c/pdt/org.eclipse.pdt.git/tree/tests/org.eclipse.php.refactoring.core.tests/resources/rename/renameTrait/testRenameTrait.pdtt

Based on this file, our test suite builds the project: https://git.eclipse.org/c/pdt/org.eclipse.pdt.git/tree/tests/org.eclipse.php.refactoring.core.tests/src/org/eclipse/php/refactoring/core/rename/RenameResourceProcessorTest.java
Sure, but you mentioned a potential disk access issue. HIPP servers have a bunch of disks and mount points. Where on the HIPP server are these files being created?
Surefire stores the workspace under the target/work/data directory: https://hudson.eclipse.org/pdt/job/eclipse-next/ws/tests/org.eclipse.php.refactoring.core.tests/target/work/data/

The "Refactoring" project is removed during tearDown().
That's a physical drive set, not an NFS mount, so you're in good shape. I just noticed now that you have 2 jobs running at the same time (both started at 11:59), perhaps for the same patch:

10122 8728 126 2.6 32908944 3546676 ? Sl 11:59 1:26 /shared/common/jdk1.7.0-latest/bin/java [snip] -DGERRIT_CHANGE_URL=https://git.eclipse.org/r/47456 -DGERRIT_CHANGE_ID=I8f45d2da35cf3606a482d18839f16c762434097d -DGERRIT_CHANGE_NUMBER=47456 -DGERRIT_PATCHSET_UPLOADER_NAME=Michal Niewrzal
10122 8859 132 2.2 32911348 2934860 ? Sl 11:59 1:16 /shared/common/jdk1.7.0-latest/bin/java [snip] -DGERRIT_CHANGE_URL=https://git.eclipse.org/r/47456 -DGERRIT_CHANGE_ID=I8f45d2da35cf3606a482d18839f16c762434097d -DGERRIT_CHANGE_NUMBER=47456 -DGERRIT_PATCHSET_UPLOADER_NAME=Michal Niewrzal

Could these jobs be colliding?
Only on ~/.m2 write access while downloading dependencies. Furthermore, each one starts its own Maven process (mvn clean verify -DskipPdtPerformanceTests) and has its own workspace and target. To avoid conflicts with the .m2 write lock we run only mvn verify. The second job does the same thing, but with the Mars repo (the first is for Luna). The disk access problem also occurs if I disable one of these jobs.
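One way to take the shared ~/.m2 out of the picture entirely (my suggestion, not something the jobs currently do) would be Maven's standard maven.repo.local property, giving each job a workspace-local repository:

```shell
# Hypothetical per-job isolation: each job resolves dependencies into
# its own workspace-local repository, so two concurrent builds never
# contend for the shared ~/.m2 write lock. WORKSPACE is the usual
# Hudson-provided environment variable.
mvn verify -DskipPdtPerformanceTests \
    -Dmaven.repo.local="$WORKSPACE/.repository"
```

The trade-off is that every job re-downloads its dependencies, so this is mainly useful as a diagnostic to rule the lock out.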
I've been running this command for about 30 minutes:

time while true; do
    echo "aaa" > file
    A=$(cat file)
    echo "bbb" >> file
    B=$(cat file)
    if [ "$A" = "$B" ]; then echo "ERROR"; fi
done

And it hasn't failed once. Even when opening a second shell and cat'ing the file, I can see it change, and the kernel serves the correct content each and every time. So I suspect there's an underlying issue somewhere in the Surefire/Maven/Java stack. I'd be curious to strace the java process to see whether it actually makes a syscall for the subsequent file read.
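A slightly stronger variant of the same check puts the writer in a separate process (a sketch, not what was actually run on the host); any mismatch would implicate the kernel/filesystem rather than the Java stack:

```shell
#!/bin/sh
# Writer and reader in separate processes: the writer runs in a
# subshell, the reader then checks it sees the latest complete write.
# On a sane kernel the error count should stay at zero.
f=$(mktemp)
errors=0
i=0
while [ $i -lt 200 ]; do
    ( echo "generation-$i" > "$f" )   # writer (separate process)
    got=$(cat "$f")                   # reader
    if [ "$got" != "generation-$i" ]; then
        errors=$((errors + 1))
    fi
    i=$((i + 1))
done
echo "errors: $errors"
rm -f "$f"
```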
Although the workspace is on a physical disk set, the hipp user's home directory is on NFS (with .hudson/jobs being a symlink to the physical disk). I'm running the above test from the hipp home directory. Stay tuned.
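The NFS-vs-local question for any given path can be answered directly from the shell. A generic sketch (the path here is an example, not the actual HIPP layout):

```shell
# Print the filesystem type backing a path: "nfs"/"nfs4" would confirm
# an NFS mount, "ext4"/"xfs" a local disk. df -T reports the type
# column; -P forces one line of output per filesystem.
fstype=$(df -PT /tmp | awk 'NR==2 {print $2}')
echo "/tmp is on: $fstype"
```

Running this against both the home directory and the job workspace would confirm which side of the symlink each test file actually lands on.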
It may be related. Today, after aborting one job, the second (pending) one failed; after re-triggering, it works correctly [1]. I have seen this in the past:

Cleaning the workspace because project is configured to clean the workspace before each build.
FATAL: Unable to delete /home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace - files in dir: [/home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace/tests]
java.io.IOException: Unable to delete /home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace - files in dir: [/home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace/tests]

[1] https://hudson.eclipse.org/pdt/job/pdt-gerrit/939/console
Dawid, that is a telltale sign that another job is using the same directory. We've seen this before many times. I suggest disabling the Gerrit trigger and perhaps building on a schedule to see if it still happens. Disk errors are detected and reported by the kernel. Even NFS access errors (which are not the case here) are reported. Our logs are clean.
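If two triggers can ever race for the same workspace, a guard such as flock makes the collision explicit instead of silently corrupting the directory. A minimal sketch, with an illustrative lock path:

```shell
#!/bin/sh
# Take an exclusive, non-blocking lock before touching the workspace.
# If another build already holds it, fail fast with a clear message
# rather than deleting files out from under the other job.
lockfile=/tmp/pdt-gerrit-workspace.lock
if flock -n "$lockfile" -c 'echo "lock acquired, safe to clean and build"'; then
    status=ok
else
    status=busy
    echo "another build holds the workspace lock; aborting"
fi
```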
Just before RC4, the problem is back :P Problems with delete, examples:

https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/293/console
https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/294/console

After a few moments the build runs again (now in progress): https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/295/console

The job workspace is cleaned up before each build. Each job has its own workspace, of course.
I've taken a look and the disk on the underlying host is getting full, so I'm wondering if that's a factor here. -M.
(In reply to Eclipse Webmaster from comment #14)
> I've taken a look and the disk on the underlying host is getting full, so
> I'm wondering if that's a factor here.
>
> -M.

Hi, any progress? The problem still exists. Example log from today, not related to Maven/tests; Hudson was unable to clean up the workspace before the build: https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/711/console
The filesystem on the host is reporting 100% full (it's nearly 1TB), so something is either not cleaning up or downloading the world (and then not cleaning up). As such, I've reached out to the projects consuming the most space, asking them to clean up. One thing we could try is to transfer your instance to another host (we've got one left with a bit of space). The PDT HIPP instance would probably be offline for a couple of hours while we moved the job data around. -M.
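The usual way to see which directories are eating the disk is a du sweep plus an overall df check. A sketch with illustrative paths; on a HIPP host this would be pointed at the job directories rather than /tmp:

```shell
#!/bin/sh
# Summarize disk usage per top-level entry and list the five largest,
# then report overall filesystem usage for the same mount.
scan_dir=/tmp
du -sk "$scan_dir"/* 2>/dev/null | sort -n | tail -5
overall=$(df -Pk "$scan_dir" | awk 'NR==2 {print $5}')
echo "filesystem usage: $overall"
```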
(In reply to Eclipse Webmaster from comment #16)
> One thing we could try is to transfer your instance to another host (we've
> got one left with a bit of space). The PDT HIPP instance would probably be
> offline for a couple of hours while we moved the job data around.
>
> -M.

If this might help, +1 from me.
Does the 11th of Sept (tomorrow) work for you to move the PDT instance? -M
(In reply to Eclipse Webmaster from comment #18)
> Does the 11th of Sept (tomorrow) work to move the PDT instance?
>
> -M

Sure, I'll inform the team about the downtime. The job workspaces can be safely cleaned up.
OK, I've moved PDT to another HIPP host with more disk space, so let's see how that works.
(In reply to Eclipse Webmaster from comment #20)
> OK, I've moved PDT to another HIPP host with more disk space, so let's see
> how that works.

Unfortunately, Hudson is still sometimes unable to clean the workspace:
https://hudson.eclipse.org/pdt/job/pdt-gerrit/1469/console
https://hudson.eclipse.org/pdt/job/pdt-gerrit/1470/console
Dawid, are you still seeing this? -M.
I haven't seen such problems in a long time. We reduced I/O operations during tests by roughly 5x; maybe that helped ;) Closing for now; I'll reopen if the problem comes back.
Thanks for circling back.