Bug 466743 - HIPP disk access problem (overloaded?)
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: CI Admin Inbox CLA
Reported: 2015-05-07 10:46 EDT by Dawid Pakula CLA
Modified: 2016-08-26 11:14 EDT

Description Dawid Pakula CLA 2015-05-07 10:46:40 EDT
HIPP instance: PDT
For a couple of weeks we have been experiencing disk access problems during our integration tests; they just fail.

org.eclipse.core.resources doesn't allow us to run sync() on the FileChannel, so this is not an option.

Since yesterday the problem has grown worse: around 30% of jobs failed.
Comment 1 Denis Roy CLA 2015-05-07 11:04:45 EDT
Can you help me understand what is failing?
Comment 2 Dawid Pakula CLA 2015-05-07 11:16:40 EDT
The refactoring tests fail most often. They make heavy use of I/O operations, especially during rename:
1. Create project
2. Add files
3. Build project and index
4. Run refactoring processor
5. Read file from disk and compare result: expected vs actual file content
6. Drop project

Example failure: https://hudson.eclipse.org/pdt/job/eclipse-next/lastCompletedBuild/testReport/org.eclipse.php.refactoring.core.rename/RenameTraitProcessorTest/test_resources_rename_renameTrait_testRenameTrait_pdtt_/

I was never able to reproduce it locally (even when running 5 concurrent builds), and to me it looks like a classic disk access problem. It was especially bad yesterday, probably due to a lot of "just before release" builds.
Comment 3 Denis Roy CLA 2015-05-07 11:21:57 EDT
What is the full file path to these sample files?
Comment 5 Denis Roy CLA 2015-05-07 11:33:16 EDT
Sure, but you're mentioning a potential disk access issue.  Hipp servers have a bunch of disks and mount points.  Where on the hipp server are these files being created?

--FILE--
FILENAME://testRenameTrait.php
Comment 6 Dawid Pakula CLA 2015-05-07 11:40:07 EDT
Surefire stores the workspace under the target/work/data directory:

https://hudson.eclipse.org/pdt/job/eclipse-next/ws/tests/org.eclipse.php.refactoring.core.tests/target/work/data/

The "Refactoring" project was removed during tearDown().
Comment 7 Denis Roy CLA 2015-05-07 12:04:13 EDT
That's a physical drive set, not an NFS mount, so you're in good shape. 

I just noticed now that you have 2 jobs running at the same time (both started at 11:59), perhaps for the same patch:

10122     8728  126  2.6 32908944 3546676 ?    Sl   11:59   1:26 /shared/common/jdk1.7.0-latest/bin/java [snip] -DGERRIT_CHANGE_URL=https://git.eclipse.org/r/47456 -DGERRIT_CHANGE_ID=I8f45d2da35cf3606a482d18839f16c762434097d -DGERRIT_CHANGE_NUMBER=47456 -DGERRIT_PATCHSET_UPLOADER_NAME=Michal Niewrzal 

10122     8859  132  2.2 32911348 2934860 ?    Sl   11:59   1:16 /shared/common/jdk1.7.0-latest/bin/java [snip] -DGERRIT_CHANGE_URL=https://git.eclipse.org/r/47456 -DGERRIT_CHANGE_ID=I8f45d2da35cf3606a482d18839f16c762434097d -DGERRIT_CHANGE_NUMBER=47456 -DGERRIT_PATCHSET_UPLOADER_NAME=Michal Niewrzal


Could these jobs be colliding?
Comment 8 Dawid Pakula CLA 2015-05-07 12:11:57 EDT
Only on write access to ~/.m2 while downloading dependencies.

Furthermore, each one starts its own Maven process (mvn clean verify -DskipPdtPerformanceTests) and has its own workspace and target.

To avoid conflicts with the .m2 write lock we run only mvn verify.

The second job does the same thing, but with the Mars repo (the first is for Luna).

The disk access problem also occurs if I disable one of these jobs.
Comment 9 Denis Roy CLA 2015-05-07 13:35:26 EDT
I've been running this command for about 30 minutes:

time while [ 1 ]; do echo "aaa" > file; A=$(cat file); echo "bbb" >> file; B=$(cat file); if [ "$A" = "$B" ]; then echo "ERROR"; fi; done;

And it hasn't failed once.  Even opening a second shell and cat'ing the file, I can see it change.  But the kernel is serving the correct content each and every time.

So I suspect there's an underlying issue somewhere in the Surefire/Maven/Java stack.  I'd be curious to strace the java process to see if it actually does a call for the subsequent file read.
Comment 10 Denis Roy CLA 2015-05-07 13:43:31 EDT
Although the workspace is on a physical disk set, the hipp user's home directory is on NFS (with .hudson/jobs being a symlink to the physical disk).

I'm running the above test from the hipp home directory.  Stay tuned.
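
A bounded variant of the read-after-write loop from comment 9 may be easier to run from a script or a build step. The following is only a sketch: the function name check_read_after_write, its arguments, and the temporary file name are assumptions for illustration, not something known to be used on the HIPP servers. It performs the same write, read, append, read sequence and counts the iterations in which the second read did not reflect the append.

```shell
# Sketch only: check_read_after_write is a hypothetical helper name.
# Usage: check_read_after_write DIR [ITERATIONS]
# Prints the number of iterations in which the appended data was not
# visible on the second read (0 means every read was consistent).
check_read_after_write() {
    dir=$1
    iterations=${2:-1000}
    file="$dir/raw-check.$$"
    failures=0
    i=0
    while [ "$i" -lt "$iterations" ]; do
        echo "aaa" > "$file"
        a=$(cat "$file")
        echo "bbb" >> "$file"
        b=$(cat "$file")
        # After the append the two reads must differ; identical
        # content means the second read returned stale data.
        if [ "$a" = "$b" ]; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    rm -f "$file"
    echo "$failures"
}
```

Running it once against the NFS-mounted home directory and once against the job workspace on the physical disk set would allow the two storage paths to be compared directly.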
Comment 11 Dawid Pakula CLA 2015-05-13 07:46:47 EDT
This may be related.

Today, after I aborted one job, the second (pending) one failed; after a re-trigger it worked correctly [1]. I have seen this in the past:

Cleaning the workspace because project is configured to clean the workspace before each build.
FATAL: Unable to delete /home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace - files in dir: [/home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace/tests]
java.io.IOException: Unable to delete /home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace - files in dir: [/home/hudson/genie.pdt/.hudson/jobs/pdt-gerrit/workspace/tests]

[1] - https://hudson.eclipse.org/pdt/job/pdt-gerrit/939/console
Comment 12 Denis Roy CLA 2015-05-13 07:52:09 EDT
Dawid,

That is a telltale sign that another job is using the same directory.  We've seen this before many times.

I suggest disabling the Gerrit trigger and perhaps building on a schedule to see if it still happens.

Disk errors are detected and reported by the kernel.  Even NFS access errors (which is not the case here) are reported.  Our logs are clean.
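
If two jobs could end up in the same directory, one way to make the collision visible is to guard the directory with a lock, so that the second job fails fast instead of tripping over files the first one is still using. This is a sketch under assumptions: acquire_dir_lock and release_dir_lock are hypothetical helper names, the ".lock" suffix is made up, and the approach relies only on mkdir either creating the directory or failing. It is not how Hudson itself manages workspaces.

```shell
# Sketch only: helper names and the ".lock" suffix are assumptions.
# mkdir either creates the directory or fails, so only one caller
# can hold the lock for a given workspace at a time.
acquire_dir_lock() {
    lock="$1.lock"
    if mkdir "$lock" 2>/dev/null; then
        # Record the owner to help diagnose a stale lock later.
        echo "$$" > "$lock/pid"
        return 0
    fi
    echo "workspace $1 is in use by PID $(cat "$lock/pid" 2>/dev/null)" >&2
    return 1
}

# Removes the lock directory created by acquire_dir_lock.
release_dir_lock() {
    rm -rf "$1.lock"
}
```

A build step could call acquire_dir_lock on its workspace path before the build and release_dir_lock afterwards, for example from a trap on EXIT. A failed acquire then points directly at the other process instead of surfacing later as an "Unable to delete" error.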
Comment 13 Dawid Pakula CLA 2015-06-09 11:03:15 EDT
Just before RC4 the problem is back :P

Examples of the delete problem:
https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/293/console
https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/294/console

After a few moments the build ran again (now in progress):
https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/295/console

The job workspace is cleaned up before each build. Each job has its own workspace, of course.
Comment 14 Eclipse Webmaster CLA 2015-06-26 16:14:56 EDT
I've taken a look and the disk on the underlying host is getting full, so I'm wondering if that's a factor here.

-M.
Comment 15 Dawid Pakula CLA 2015-09-09 10:53:38 EDT
(In reply to Eclipse Webmaster from comment #14)
> I've taken a look and the disk on the underlying host is getting full, so
> I'm wondering if that's a factor here.
> 
> -M.

Hi, any progress? The problem still exists. Here is an example log from today that is not related to Maven or the tests; Hudson was unable to clean up the workspace before the build: https://hudson.eclipse.org/pdt/job/eclipse-next-gerrit/711/console
Comment 16 Eclipse Webmaster CLA 2015-09-10 16:16:54 EDT
The filesystem on the host is reporting 100% full (it's nearly 1TB), so something is either not cleaning up or downloading the world (and then not cleaning up).  As such I've reached out to the projects consuming the most space, asking them to clean up.

One thing we could try is to transfer your instance to another host (we've got one left with a bit of space).  The PDT HIPP instance would probably be offline for a couple of hours while we move the job data around.

-M.
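
The "100% full" condition described in comment 16 is straightforward to check from a shell. The helpers below are a minimal sketch; disk_use_percent and largest_entries are hypothetical names, and nothing here is taken from the webmaster's actual tooling. The first reports how full the filesystem holding a path is, the second lists the biggest entries directly under a directory.

```shell
# Sketch only: disk_use_percent and largest_entries are hypothetical names.
# Prints the used-capacity percentage (without the % sign) of the
# filesystem holding the given path. "df -P" forces the POSIX layout
# with one line per filesystem, where column 5 is the capacity.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prints the N largest non-hidden entries directly under a directory,
# largest first, with sizes in KiB.
largest_entries() {
    dir=$1
    count=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

A job could, for instance, refuse to start when disk_use_percent on its workspace reports a value above some chosen threshold, and largest_entries on the jobs directory would show which projects consume the most space.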
Comment 17 Dawid Pakula CLA 2015-09-10 16:34:28 EDT
(In reply to Eclipse Webmaster from comment #16)
> One thing we could try is to transfer your instance to another host(we've
> got one left with a bit of space).  The PDT HIPP instance would probably be
> offline for a couple of hours while we moved the job data around.  
> 
> -M.

If this may help, +1 from me.
Comment 18 Eclipse Webmaster CLA 2015-09-10 16:43:07 EDT
Does the 11th of Sept (tomorrow) work for me to move the PDT instance?

-M
Comment 19 Dawid Pakula CLA 2015-09-10 16:48:32 EDT
(In reply to Eclipse Webmaster from comment #18)
> Does the 11th of Sept(tomorrow) work for me to move the PDT instance?
> 
> -M

Sure, I'll inform the team about the downtime. Job workspaces can be safely cleaned up.
Comment 20 Eclipse Webmaster CLA 2015-09-11 13:51:14 EDT
OK, I've moved PDT to another HIPP host with more disk space, so let's see how that works.
Comment 21 Dawid Pakula CLA 2015-09-16 06:30:21 EDT
(In reply to Eclipse Webmaster from comment #20)
> OK I"ve moved PDT to another hipp host with more disk space so lets see how
> that works.

Unfortunately, Hudson is still sometimes unable to clean the workspace:
https://hudson.eclipse.org/pdt/job/pdt-gerrit/1469/console
https://hudson.eclipse.org/pdt/job/pdt-gerrit/1470/console
Comment 22 Eclipse Webmaster CLA 2016-02-29 14:43:57 EST
Dawid, are you still seeing this?

-M.
Comment 23 Dawid Pakula CLA 2016-08-26 11:13:44 EDT
I haven't seen such problems for a long time. We reduced the I/O operations during tests by ~5x; maybe this helped ;)

Closing for now; I'll reopen if the problem comes back.
Comment 24 Denis Roy CLA 2016-08-26 11:14:48 EDT
Thanks for circling back.