Bug 571952 - Slow build performance since beginning of March (okd cluster)
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Windows 10
Importance: P3 critical
Target Milestone: ---
Assignee: CI Admin Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 572152 572275 572295 574013
Depends on:
Blocks: 572158
 
Reported: 2021-03-15 07:05 EDT by Philippe Dul CLA
Modified: 2021-06-18 11:58 EDT
CC List: 18 users

See Also:


Attachments
Xtext core build times (164.67 KB, image/png)
2021-05-06 03:31 EDT, Christian Dietrich CLA

Description Philippe Dul CLA 2021-03-15 07:05:07 EDT
Hi, 

Since the beginning of March, we have noticed that our jobs now take twice as long.
Could you investigate on your side? Perhaps something unnoticed has changed in the Capella CI.

A Capella build that was taking around 2h before March is now taking up to 4h to complete.

https://ci.eclipse.org/capella/job/capella-product/job/master

Thanks
Regards

Philippe
Comment 1 Frederic Gurr CLA 2021-03-15 07:15:34 EDT
We are still in the process of migrating build machines to our new cluster. Since the fastest machines have not been moved yet, build times can be longer than usual. We will try to fix this in the coming days.
Comment 2 Boubekeur Zendagui CLA 2021-03-19 10:27:20 EDT
Hi,

We are facing the same issue with the build of Kitalpha [1].

Best regards,
Boubekeur.

[1] https://ci.eclipse.org/kitalpha/job/Kitalpha/job/master/
Comment 3 Mikaël Barbero CLA 2021-03-19 11:17:35 EDT
Faster nodes are coming soon. Thanks for your patience.
Comment 4 Mikaël Barbero CLA 2021-03-22 03:44:09 EDT
*** Bug 572152 has been marked as a duplicate of this bug. ***
Comment 5 Mikaël Barbero CLA 2021-03-22 03:51:44 EDT
What took a long time in https://ci.eclipse.org/xtext/job/xtext-lib/job/cd_lsp4j011/ was for the agent to be provisioned. See the timestamps at the beginning of the logs:

07:44:08  Waiting for next available executor on ‘centos-7’
08:11:57  Agent centos-7-zl6fx is provisioned from template centos-7

It should not happen very often, and it will certainly disappear as more compute nodes are added.
Comment 6 Mikaël Barbero CLA 2021-03-22 03:51:57 EDT
whoops, wrong ticket.
Comment 7 Mikaël Barbero CLA 2021-03-25 09:17:44 EDT
*** Bug 572295 has been marked as a duplicate of this bug. ***
Comment 8 Mikaël Barbero CLA 2021-03-25 09:18:35 EDT
*** Bug 572275 has been marked as a duplicate of this bug. ***
Comment 9 Mikaël Barbero CLA 2021-03-25 09:19:22 EDT
From bug 572275 comment 5:

We've added a couple of faster machines, but not all of them have been brought over yet. Some more will most probably be added next week.

Tomorrow's maintenance https://www.eclipsestatus.io/incidents/lr3z03npxgy1 will hopefully also help mitigate the issue, as it fixes a couple of regressions we've been observing since the last update.

Thanks for your patience.
Comment 10 Christian Dietrich CLA 2021-03-26 05:10:47 EDT
I will increase the xtext-eclipse and xtext-xtend timeouts to 5 and 6 hours respectively (normal build times were around 30 mins).
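
(For context, a minimal sketch of how such a timeout is typically raised, assuming the jobs are declarative pipeline jobs; the agent label, stage name and Maven command below are placeholders, not the actual Xtext configuration.)

// Hypothetical Jenkinsfile sketch: raise the whole-build timeout so that runs
// landing on a slow agent are not aborted before they can finish.
pipeline {
    agent {
        label 'centos-7'                    // placeholder pod-template label
    }
    options {
        timeout(time: 5, unit: 'HOURS')     // normal runs took around 30 minutes
    }
    stages {
        stage('Build') {
            steps {
                sh 'mvn -B clean verify'    // placeholder build command
            }
        }
    }
}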
Comment 11 Ed Willink CLA 2021-03-29 00:50:12 EDT
OCL too. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=572158
Comment 12 Mikaël Barbero CLA 2021-03-29 04:24:28 EDT
Last Friday's upgrade went well. You may already see some improvements.

Also, some more performant nodes should be added this week.

Thanks for your patience.
Comment 13 Mikaël Barbero CLA 2021-04-01 02:37:59 EDT
From the couple of projects I've checked, it seems that performance has been better since yesterday (end of day). Can anyone confirm or deny? Of course we will wait for the trend to stabilize before closing this ticket; I just want to know whether the impression is the same for everyone.
Comment 14 Christian Dietrich CLA 2021-04-01 02:39:50 EDT
With Xtext I see good times again, but also bad ones:
https://ci.eclipse.org/xtext/job/xtext-eclipse/job/master/
Comment 15 Mikaël Barbero CLA 2021-04-01 02:44:25 EDT
Thanks, Christian, for the quick feedback. A 1x to 2x difference does not surprise me too much, as we're still missing some of the more powerful machines to run builds on (and we have a couple of really slow ones). But there should not be a 5x/10x difference like the one you were mentioning in bug 572152.

More powerful machines are coming...
Comment 16 Christian Dietrich CLA 2021-04-01 02:46:24 EDT
The good time is already 1.5x (compared with https://ci.eclipse.org/xtext/job/xtext-eclipse/job/cd_justAnotherTest/), so the bad one is about 4x.
I will continue to observe.
Comment 17 Ed Willink CLA 2021-04-01 04:18:56 EDT
Rebuild of https://ci.eclipse.org/ocl/job/ocl-compatibility-photon/ 'improved' from 39 mins to 38 mins.

Presumably, as long as there is a mix of slow and fast servers, it will be pot luck whether the build runs on a slow or a fast server, so timeouts must be set according to the slowest server.
Comment 18 Ed Willink CLA 2021-04-01 07:18:29 EDT
Rebuild of https://ci.eclipse.org/ocl/job/ocl-compatibility-2020-06/ got worse; 40 rather than 34 mins and it used to run fine in a 30 minute timeout.
Comment 19 Radek Felcman CLA 2021-04-07 02:48:41 EDT
In EclipseLink we have mixed results. Some job executions are quicker than before (20%), others worse (80%). Increasing CPU and memory resources doesn't help. See
https://ci.eclipse.org/eclipselink/job/eclipselink-nightly-master/
The promotion/publishing job to Jakarta staging (no resource changes at all) is slower than before and sometimes gets randomly killed. See
https://ci.eclipse.org/eclipselink/job/eclipselink-promote-3.0/

So the overall summary is that performance is worse than before.
Comment 20 Christian Dietrich CLA 2021-04-08 04:27:53 EDT
Another instance of super-slowness:
https://ci.eclipse.org/xtext/job/xtext-xtend/job/master/1277/consoleFull
Comment 21 Ed Willink CLA 2021-04-12 02:37:27 EDT
(In reply to Ed Willink from comment #18)
> Rebuild of https://ci.eclipse.org/ocl/job/ocl-compatibility-2020-06/ got
> worse; 40 rather than 34 mins and it used to run fine in a 30 minute timeout.

Bug 572158 describes fixes for some OCL regressions; builds are now back to 20 mins. It appears that there are/were three components to the problem.

a) OCL had degraded by 30% - now fixed
b) Servers had degraded by 30% - now generally better
c) End-of-job termination is weird

The approximately one-minute start-up and two-minute slowness in job startup are a longstanding issue, but Bug 572158 identifies that an "[INFO] Total time:  20:41 min" in the log file is reported as 27 mins on the Dashboard. This particular build used the Eclipse Oxygen platform and provokes numerous warnings such as

(SWT:861): Gtk-WARNING **: ... Negative content width -1 (allocation 1, extents 1x1) while allocating gadget (node trough, owner GtkProgressBar)

Perhaps the vintage platform has some adverse interaction with e.g. the VNC support.
Comment 22 Jörg Kubitz CLA 2021-05-06 03:05:06 EDT
A status update and a solution, please.
On https://ci.eclipse.org/jdt/job/eclipse.jdt.core-Gerrit/ we still have the problem that the builds are sometimes too slow.
"Too slow" because individual JUnit tests have a time limit after which they abort and fail.
Some tests that normally take only one second sometimes take one minute to execute, i.e. the build server seems to be doing nothing (or other things?) during that time. A great number of tests currently fail because of timeouts.
Total build times vary between 1 and 2.5 hours. A long total build time is not as problematic as the unpredictable pauses during the build. It is a big problem that we need to restart builds over and over just to get a lucky run.
And the more builds we start, the more will probably fail because they fight for resources.
Comment 23 Radek Felcman CLA 2021-05-06 03:24:27 EDT
I have to confirm the previous comment. In the EclipseLink jobs at https://ci.eclipse.org/eclipselink we see the same behavior.
Comment 24 Christian Dietrich CLA 2021-05-06 03:26:53 EDT
I can share the observation of fluctuating build times, and we also don't see the very good times from before the change anymore.
Comment 25 Christian Dietrich CLA 2021-05-06 03:31:29 EDT
Created attachment 286331 [details]
Xtext core build times
Comment 26 Mikaël Barbero CLA 2021-05-06 03:34:25 EDT
We are in the last phase of our machine migration. This means that we are currently moving all the remaining Jenkins instances from the previous cluster (labeled as "openshift" on https://ci.eclipse.org) to the new one (labeled as "okd"). Once that is done (by the end of the week), the majority of the remaining resources will be scheduled to be moved to the new cluster (hopefully sometime next week). After that, you should again get the performance you were used to. Thanks for your patience.
Comment 27 Jörg Kubitz CLA 2021-05-06 05:48:16 EDT
(In reply to Mikaël Barbero from comment #26)
> to the new one (labeled as "okd"). 

JDT is already labeled as "okd". So why is it slower than on the old server?
Comment 28 Frederic Gurr CLA 2021-05-06 06:24:07 EDT
(In reply to Jörg Kubitz from comment #27)
> JDT is already labeled as "okd". So why is it slower than on the old server?
Build jobs do not run on Jenkins master/controller instances themselves. They run on ephemeral build agents (containers) that run on cluster nodes (aka physical machines).

So it does not matter that JDT is already on OKD. What matters is the number of available (and fast) physical machines in the new cluster. Not all of them have been migrated from the old cluster (openshift) yet, therefore the build times can vary depending on the cluster load and other factors.
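
(To illustrate the mechanism, here is a minimal scripted-pipeline sketch of how a job requests such an ephemeral agent, assuming a pod template labeled "centos-7" is defined on the instance, as the provisioning logs earlier in this ticket suggest; the Maven command is a placeholder.)

// Hypothetical sketch: the job only names a pod-template label. The Kubernetes
// plugin then provisions a throwaway agent container on whichever cluster node
// currently has capacity, runs the steps there, and tears the container down
// afterwards. Which physical machine is picked can differ from run to run,
// which is why the same job can be fast one day and slow the next.
node('centos-7') {                 // placeholder pod-template label
    stage('Build') {
        checkout scm               // fetch the project sources
        sh 'mvn -B clean verify'   // placeholder build command
    }
}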
Comment 29 Ed Willink CLA 2021-05-10 02:49:27 EDT
It is now 3 weeks since the major issue was fixed. Thank you.

OCL runs weekly tests [1] to confirm that its latest build is compatible with the last 13 Eclipse releases. Since the test contributions from platform/EMF/UML2/Xtext run-times are small, the same time would be expected for each platform. However...

The tests show a spread of times from 12 to 20 mins, with most towards the fast end; the 20 mins is almost an outlier. The slowest outlier test is different each week. See Bug 572158.

The tests run at the same random hour each Sunday and are mostly CPU / workspace I/O bound so presumably the variation is almost entirely due to pot luck as to whether a new/old server is used, with perhaps some influence as to how busy the overall servers are.

This variation is presumably unavoidable unless older hardware is needlessly discarded. 

Perhaps there could be alternative targets such as "mixed-migration" and "fast-migration" depending on whether users are prepared to accept pot-luck or need more predictability.

[1] https://ci.eclipse.org/ocl/
Comment 30 Mikaël Barbero CLA 2021-05-10 02:53:47 EDT
Thanks for the feedback, Ed.

(In reply to Ed Willink from comment #29)
> Perhaps there could be alternative targets such as "mixed-migration" and
> "fast-migration" depending on whether users are prepared to accept pot-luck
> or need more predictability.
> 
> [1] https://ci.eclipse.org/ocl/

We could design something like that, but your jobs may end up being queued until the "fast" resources are freed to run them... So, your job would run faster but would stay in the queue longer. Do you think it is a desirable behavior?
Comment 31 Ed Willink CLA 2021-05-10 03:06:29 EDT
(In reply to Mikaël Barbero from comment #30)
> So, your job
> would run faster but would stay in the queue longer. Do you think it is a
> desirable behavior?

For my 16-odd secondary test jobs, which run once a week, I am happy to have a lower queuing priority / poorer server, just so long as a reasonable turnaround occurs with adequate performance. It is the unexpected unpredictability that suddenly doubles run-times and so breaks time limits that is upsetting.

My secondary jobs are only interesting in so far as they email me if there is a problem. I don't care if the email is six hours 'late'; they might have been running overnight anyway.
Comment 32 Mikaël Barbero CLA 2021-05-10 03:19:27 EDT
Gotcha. We will keep that in mind and we may offer a solution around this idea at some point in the future. Thanks.
Comment 33 Mikaël Barbero CLA 2021-05-10 12:04:26 EDT
FYI, 4 fast nodes have been added today.
Comment 34 Denis Roy CLA 2021-05-10 13:03:58 EDT
\o/
Comment 35 Thomas Wolf CLA 2021-05-17 10:14:58 EDT
Where are we with this? We're seeing a severe performance degradation in JGit and EGit builds. JGit builds normally take some 30-40 minutes; in the past few weeks, the majority of them time out after 60 minutes. Similar for EGit.

It's very frustrating.
Comment 36 Denis Roy CLA 2021-05-17 10:19:26 EDT
Do you have an idea for which steps are taking longer? Pulling from Maven Central? Building? Signing?
Comment 37 Thomas Wolf CLA 2021-05-17 11:39:23 EDT
(In reply to Denis Roy from comment #36)
> Do you have an idea for which steps are taking longer? Pulling from Maven
> Central?  Building? Signing?

Unfortunately the JGit build doesn't write timestamps :-(
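
(As an aside, a minimal sketch of how per-line timestamps could be enabled, assuming the Timestamper plugin is available on the instance and a scripted pipeline is used; the agent label and Maven command are placeholders.)

// Hypothetical sketch: wrapping the build steps in timestamps {} prefixes every
// console line with a clock time, which makes slow steps much easier to pinpoint.
node('centos-7') {                 // placeholder pod-template label
    timestamps {
        checkout scm               // fetch the project sources
        sh 'mvn -B clean verify'   // placeholder build command
    }
}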

Taking a fast EGit build from early March (23 min; about normal) and a recent EGit build (60 min; unfortunately about normal these days), I see

* ~10 min instead of 2 min for pulling from Maven Central
* 4 min instead of 1 min for target platform building (D/L from repo.eclipse.org & downloads.eclipse.org)
* 5 min vs 3 min for building until first test running
* 1 min vs. 20 sec test run
* 30 sec vs 15 sec building until next test run
* 15 min vs. 2 min test run
* 30 sec vs 10 sec building until next test run
* 20 min vs 12 min test run

It's much slower across the board. Signing does not appear to be a hotspot; it's slower in the slow build, but it's just a few seconds.

And no, we didn't add that many tests :-) Both builds do roughly the same work and should take roughly the same time.
Comment 38 Thomas Wolf CLA 2021-05-26 06:28:01 EDT
Still extremely bad performance. How often shall I retrigger https://git.eclipse.org/r/c/jgit/jgit/+/177593 ??

Look at https://ci.eclipse.org/jgit/job/stable/job/jgit.gerrit-pipeline/ :

Most builds take extremely long. Some manage to finish just before the timeout after 60 minutes; many are aborted due to the timeout. The timeout is **not** too short; a normal build should not take longer than 30 minutes! (See for instance https://ci.eclipse.org/jgit/job/stable/job/jgit.gerrit-pipeline/4899/ which is a build that by chance happened to run with the "normal" speed we've usually had.)

Most jobs have to be retriggered umpteen times until they happen not to get aborted. This is not only super frustrating, it also will mean that building a release is not just one or two Jenkins job runs, but will take several days with retriggering jobs many times.

At this point this CI infrastructure is close to non-functional for us.
Comment 39 Frederic Gurr CLA 2021-05-26 07:40:36 EDT
(In reply to Thomas Wolf from comment #38)
> Most jobs have to be retriggered umpteen times until they happen not to get
> aborted. This is not only super frustrating, it also will mean that building
> a release is not just one or two Jenkins job runs, but will take several
> days with retriggering jobs many times.
> 
> At this point this CI infrastructure is close to non-functional for us.
I understand that slow(er) builds are frustrating. We will investigate and try to improve the situation.

In the meantime, instead of wasting precious time on builds that are aborted due to a fixed timeout, please increase the timeout (e.g. 90 or 120min).
Comment 40 Thomas Wolf CLA 2021-05-26 08:17:52 EDT
(In reply to Frederic Gurr from comment #39)
> In the meantime, instead of wasting precious time on builds that are aborted
> due to a fixed timeout, please increase the timeout (e.g. 90 or 120min).

My conclusion is different: CI is broken; I'll run verify builds locally if needed and set verified+1 manually.

Not sure if EGit and JGit can produce a new release for SimRel 2021-06 with CI in this state. Perhaps we'll have to make the release builds run with -DskipTests=true. Or we'll have to ship the last release.
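
(For reference, a minimal sketch of that stop-gap as a scripted-pipeline step; the agent label and release goal are placeholders. Note that -DskipTests compiles the tests but does not run them, whereas -Dmaven.test.skip=true would skip compiling them as well.)

// Hypothetical sketch: run the release build without executing tests.
node('centos-7') {                              // placeholder pod-template label
    checkout scm                                // fetch the project sources
    sh 'mvn -B clean deploy -DskipTests=true'   // placeholder release goal
}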
Comment 41 Frederic Gurr CLA 2021-05-26 08:27:31 EDT
(In reply to Thomas Wolf from comment #40)
> My conclusion is different: CI is broken; I'll run verify builds locally if
> needed and set verified+1 manually.
> 
> Not sure if EGit and JGit can produce a new release for SimRel 2021-06 with
> CI in this state. Perhaps we'll have to make the release builds run with
> -DskipTests=true. Or we'll have to ship the last release.
I'm sorry that you would rather have no build than a slow build while we are working on a fix. I hope that you reconsider your conclusion.
Comment 42 Ed Willink CLA 2021-05-26 08:37:02 EDT
I find this puzzling. For my builds there was a problem with speed for a couple of weeks, but that has been fixed and so my builds are all much the same as before, pace an unlucky choice of server may cost a 50% longer run. (The time to get a server online has also increased.)

But broadly speaking I'm happy, so is the problem that EGit/JGit are now stressing some facility that I don't?

It might be helpful for EGit/JGit to attempt to rerun a one-year-old build to distinguish CI from project evolution.
Comment 43 Thomas Wolf CLA 2021-05-26 08:56:39 EDT
(In reply to Ed Willink from comment #42)
> It might be helpful for EGit/JGit to attempt to rerun a one-year-old build
> to distinguish CI from project evolution.

Not necessary. The same effect is also seen in EGit builds. Look at https://ci.eclipse.org/egit/job/egit/buildTimeTrend, builds 606 and 607ff. That's a release build, not a Gerrit patchset verify build. It has an 80-minute timeout.

Build 606 (April 24) was normal (~30 min); since 607 (May 7) it has been much slower, and the only change between the two builds is the version numbers in the poms & manifests. The build script is the same (they're versioned; both builds use egit-pipelines@v0.0.6).

The EGit verify job shows a similar pattern: https://ci.eclipse.org/egit/job/egit.gerrit/buildTimeTrend

For JGit, I have pointed in comment 38 to a recent build that ran normally.
We still have some jobs that run normally, but most are much slower.

(In reply to Ed Willink from comment #42)
> as before, pace an unlucky choice of server may cost a 50% longer run. (The
> time to get a server online has also increased.)

Perhaps we're just exceptionally unlucky and hit this case far more often, and when we do, it's not a 50% longer run but more like 100% longer.
Comment 44 Thomas Wolf CLA 2021-05-26 09:20:01 EDT
@Fred: I see https://ci.eclipse.org/jgit/job/webmaster-test/ running. I presume that's you :-) If you ever want a green build, run that maven build with a JDK 8. JGit doesn't build with Java 11.
Comment 45 Frederic Gurr CLA 2021-05-26 09:37:41 EDT
(In reply to Thomas Wolf from comment #44)
> @Fred: I see https://ci.eclipse.org/jgit/job/webmaster-test/ running. I
> presume that's you :-) If you ever want a green build, run that maven build
> with a JDK 8. JGit doesn't build with Java 11.
Thanks for the heads-up. https://ci.eclipse.org/jgit/job/webmaster-test/7/consoleFull (without tests) seems to build fine with JDK11 though.
Comment 46 Matthias Sohn CLA 2021-05-26 16:55:03 EDT
Here is another example of the same issue:
https://ci.eclipse.org/egit/job/egit-github-stable/buildTimeTrend
Usually this took 4-5 min; nowadays it takes 15-20 min.
Comment 47 Denis Roy CLA 2021-05-27 09:05:55 EDT
Matthias, is this intentional:

#2056  May 27, 2021, 6:47 AM  Success      180112,5
#2055  May 27, 2021, 6:47 AM  Success      180111,5
#2054  May 27, 2021, 6:47 AM  Aborted      180110,4
#2053  May 27, 2021, 6:47 AM  In progress  180108,4
#2052  May 27, 2021, 6:47 AM  Success      180109,4
#2051  May 27, 2021, 6:47 AM  Aborted      179520,4
#2050  May 27, 2021, 6:47 AM  Aborted      179460,5

Looks like you've had a bunch of the same build running at the exact same time. Not sure if that makes sense?


https://ci.eclipse.org/egit/job/egit.gerrit/
Comment 48 Jonah Graham CLA 2021-05-27 09:30:39 EDT
(In reply to Denis Roy from comment #47)
> Looks like you've had a bunch of the same build running at the exact same
> time. Not sure if that makes sense?

FWIW this makes sense to me - those Gerrit checks were a series of commits*, and if there are enough resource packs to run the checks in parallel, they should be. This happens all the time in CDT too.

If there are not enough resource packs, Jenkins still reports them all starting at the same time, but the "real work" happens later, so you get logs like this[1]:

06:48:13  Waiting for next available executor on ‘centos-7-swzjq’
08:09:45  Agent centos-7-lmf0l is provisioned from template centos-7

showing that it took > 1 hour for the resource to become available.

* see the "Submitted together" section in https://git.eclipse.org/r/c/egit/egit/+/180112

[1] https://ci.eclipse.org/egit/job/egit.gerrit/2056/consoleFull
Comment 49 Thomas Wolf CLA 2021-05-27 16:11:33 EDT
(In reply to Jonah Graham from comment #48)
> FWIW this makes sense to me - those gerrit checks were a series of commits*

Exactly. This is business as usual, and that some of these builds have to wait before they even start is normal. But that builds, once they run, take that long is not.

The various build time trend charts all show that the problems started roughly at the beginning of May. Around that time, EGit was moved to OKD? [1] JGit followed a little later. [2]

Since builds appear to be much slower in general (no particular operation stands out), I wonder if it could be a file system issue. Downloading Maven or Nexus artifacts and compiling are file-system intensive, and so are JGit and EGit tests (Git repos live in the file system, and the tests set up and tear down lots of repositories and perform lots of commits and other Git operations).

Just a wild hunch...

[1] https://github.com/eclipse-cbi/jiro/commit/c978a217a
[2] https://github.com/eclipse-cbi/jiro/commit/5b3bdafb
Comment 50 Denis Roy CLA 2021-05-27 20:23:48 EDT
I've been looking at our shared file system for CBI and it has been strained lately. We're working on fixing that, and implementing a high-performance cluster-based storage solution.

Thanks for your patience.
Comment 51 Frederic Gurr CLA 2021-05-28 09:33:28 EDT
With some limited testing, we have identified at least one node of our cluster with reduced performance. While the root cause is still unclear, we have adjusted some settings so pod templates on the JGit instance will _prefer_ other nodes.

We will test those settings and check with the JGit project if they see shorter build times over the next days. If proven successful, we will roll this out to other Jenkins instances soon.
Comment 52 Thomas Wolf CLA 2021-06-01 09:43:16 EDT
(In reply to Frederic Gurr from comment #51)
> We will test those settings and check with the JGit project if they see
> shorter build times over the next days. 

Time for feedback now that we've run a couple of builds (still a small sample, though):

Looking at the egit.gerrit build time trend [1], it seems that this change did help a little; it's now at about 40 min. It used to be around 25 min before OKD, so something is still not entirely good.

In JGit it looks similar [2]; build times are closer to the normal time range now.

[1] https://ci.eclipse.org/egit/job/egit.gerrit/buildTimeTrend
[2] https://ci.eclipse.org/jgit/job/stable/job/jgit.gerrit-pipeline/buildTimeTrend
Comment 53 Frederic Gurr CLA 2021-06-01 10:26:11 EDT
Thanks for the feedback.

Please note that no settings have been changed for the EGit Jenkins instance. The shorter build times that you see on the EGit instance are mostly due to the fact that the node with reduced performance was taken out of rotation completely last week and has not been added back yet.

For JGit you are seeing better build times for the same reason. The changed node preferences have had no effect yet.

tl;dr: your builds are faster, but for another reason. ;)
Comment 54 Denis Roy CLA 2021-06-01 16:12:52 EDT
We have identified a filesystem and a network segment that were often congested. The network segment was moved, and this Sunday we'll be changing the filesystem. Already, though, things should be getting better.
Comment 55 Matthias Sohn CLA 2021-06-01 18:30:32 EDT
Thanks, we see some improvement.

I just saw this in the latest egit build (https://ci.eclipse.org/egit/job/egit-stable/176/console):

23:43:01  wrapper script does not seem to be touching the log file in /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
23:43:01  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
23:43:21  Cannot contact centos-7-0l4c9: java.lang.InterruptedException
23:48:06  wrapper script does not seem to be touching the log file in /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
23:48:06  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
Comment 56 Denis Roy CLA 2021-06-01 19:59:36 EDT
Which one is the correct local Ottawa time?

17:32:54  21:32:53 [INFO]
Comment 57 Thomas Wolf CLA 2021-06-02 02:28:48 EDT
(In reply to Denis Roy from comment #56)
> Which one is the correct local Ottawa time?
> 
> 17:32:54  21:32:53 [INFO]

I think the first. The second is written by mvn; that one appears to be UTC.

Matthias reported this at 18:30 EDT and you asked at 19:59 EDT, so if it were the second one, the build had run in the future :-) Or is it "will have had run"? Grammar gets difficult then.
Comment 58 Matthias Sohn CLA 2021-06-02 03:29:51 EDT
(In reply to Denis Roy from comment #56)
> Which one is the correct local Ottawa time?
> 
> 17:32:54  21:32:53 [INFO]

I see

23:32:54  21:32:53 [INFO]

It seems Jenkins is translating UTC to local time depending on the user's timezone.
Comment 59 Mikaël Barbero CLA 2021-06-02 03:59:11 EDT
(In reply to Matthias Sohn from comment #55)
> thanks, we see some improvement
> 
> I just saw this in the latest egit build
> (https://ci.eclipse.org/egit/job/egit-stable/176/console):
> 
> 23:43:01  wrapper script does not seem to be touching the log file in
> /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
> 23:43:01  (JENKINS-48300: if on an extremely laggy filesystem, consider
> -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.
> HEARTBEAT_CHECK_INTERVAL=86400)
> 23:43:21  Cannot contact centos-7-0l4c9: java.lang.InterruptedException
> 23:48:06  wrapper script does not seem to be touching the log file in
> /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
> 23:48:06  (JENKINS-48300: if on an extremely laggy filesystem, consider
> -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.
> HEARTBEAT_CHECK_INTERVAL=86400)

I've created a PR to add this option to all Jenkins instances: https://github.com/eclipse-cbi/jiro/pull/163.

Note that the default value is 300s (https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L115), and going straight to 86400s (24h) is a little too much of a jump IMO. Hence the 7200s value in the PR.

Once merged, the new option will be taken into account on next deployment.
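
(For reference, the value is read as a JVM system property, so it is normally passed as a -D flag on the controller rather than set from a script; a minimal script-console sketch to check the effective value could look like the following, where the printed fallback string is just a reminder of the plugin default.)

// Hypothetical Jenkins script-console check: print the effective heartbeat
// interval. The durable-task plugin's default is 300s; the jiro PR proposes 7200s.
println System.getProperty(
    'org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL',
    '300 (plugin default)')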
Comment 60 Thomas Wolf CLA 2021-06-02 04:26:59 EDT
(In reply to Matthias Sohn from comment #58)
> (In reply to Denis Roy from comment #56)
> > Which one is the correct local Ottawa time?
> > 
> > 17:32:54  21:32:53 [INFO]
> 
> I see
> 
> 23:32:54  21:32:53 [INFO]
> 
> it seems Jenkins is translating UTC to local time depending on the user's
> timezone

There's a checkbox on the left in the Jenkins UI for whether you want to see the bold timestamps in your browser's timezone. Yours is UTC+2 (CEST), so you'd see 23:32. If unchecked, you'll see 17:32, which is whatever the local time on the Jenkins instance is (apparently EDT).

Things would be *much* simpler if all timestamps were always UTC everywhere.
Comment 61 Thomas Wolf CLA 2021-06-02 04:32:47 EDT
(In reply to Mikaël Barbero from comment #59)
> (In reply to Matthias Sohn from comment #55)
> > thanks, we see some improvement
> > 
> > I just saw this in the latest egit build
> > (https://ci.eclipse.org/egit/job/egit-stable/176/console):
> > 
> > 23:43:01  wrapper script does not seem to be touching the log file in
> > /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
> > 23:43:01  (JENKINS-48300: if on an extremely laggy filesystem, consider
> > -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.
> > HEARTBEAT_CHECK_INTERVAL=86400)
> > 23:43:21  Cannot contact centos-7-0l4c9: java.lang.InterruptedException
> > 23:48:06  wrapper script does not seem to be touching the log file in
> > /home/jenkins/agent/workspace/egit-stable/repo@tmp/durable-544f1233
> > 23:48:06  (JENKINS-48300: if on an extremely laggy filesystem, consider
> > -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.
> > HEARTBEAT_CHECK_INTERVAL=86400)
> 
> I've created a PR to add this option to all Jenkins instances:
> https://github.com/eclipse-cbi/jiro/pull/163.
> 
> Note that the default value is 300s
> (https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/
> org/jenkinsci/plugins/durabletask/BourneShellScript.java#L115), and going
> straight to 86400s (24h) is a little too much of a jump IMO. Hence the 7200s
> value in the PR.
> 
> Once merged, the new option will be taken into account on next deployment.

Is that a good idea? Isn't that just hiding some problem? What is this heartbeat good for?
Comment 63 Denis Roy CLA 2021-06-02 08:22:02 EDT
(In reply to Thomas Wolf from comment #57)
> Matthias reported this at 18:30 EDT and you asked at 19:59 EDT, so if it
> were the second one, the build had run in the future :-) Or is it "will have
> had run"? Grammar gets difficult then.

I didn't see a timestamp on the build, so it could have been the day before :)

I think "the build would have run in the future" is correct :)

I looked at the performance metrics for the networked filesystem behind CI at that time, and everything was basically idle. I suppose that, compared to a direct-attached SSD, everything else is a LAGGY FILESYSTEM.
Comment 64 Thomas Wolf CLA 2021-06-02 10:34:03 EDT
In any case, it has been super slow again for the past couple of builds.

https://ci.eclipse.org/egit/job/egit.gerrit/buildTimeTrend

Currently running build 2080 is also a super-slow one; it's after more than 20 minutes still on the first maven project.
Comment 65 Denis Roy CLA 2021-06-02 10:39:01 EDT
> Currently running build 2080 is also a super-slow one; it's after more than
> 20 minutes still on the first maven project.

Good to know; it's not due to slow shared file storage or network segments. It's something else.
Comment 66 Matthias Sohn CLA 2021-06-02 16:44:57 EDT
(In reply to Denis Roy from comment #65)
> > Currently running build 2080 is also a super-slow one; it's after more than
> > 20 minutes still on the first maven project.
> 
> Good to know; it's not due to slow shared file storage or network segments.
> It's something else.

Today builds are really slow:
https://ci.eclipse.org/jgit/job/stable/job/jgit.gerrit-pipeline/
https://ci.eclipse.org/jgit/job/stable/job/jgit-stable-pipeline/

Looks like we won't be able to finish our RC1 build in time.

We had an RC1 ready yesterday; today it turned out that we need another fix.
I started a respin around 16:00 CEST; as of 22:44 the JGit build still isn't finished. I increased the timeout to 90 min after I hit the 60 min timeout twice. I already did verification locally to shorten the turnaround. When the JGit build is done I still need to do the same for EGit :-(
Comment 67 Thomas Wolf CLA 2021-06-03 05:30:10 EDT
JGit and EGit builds are essentially dead.

Even skipping tests and spotbugs, a build takes 40 minutes. It should take a couple of minutes at most. [1] 

Also, builds now have to wait a very long time to get an executor at all.

[1] https://ci.eclipse.org/egit/job/egit-stable/190/consoleFull
Comment 68 Christian Dietrich CLA 2021-06-03 05:43:32 EDT
We see the waiting for an executor too, and still long build times (3x-5x) and instabilities on Xtext builds.
Comment 69 Mikaël Barbero CLA 2021-06-03 05:56:04 EDT
I've taken the slowest nodes out of scheduling. Let us know if it helps.
Comment 70 Frederic Gurr CLA 2021-06-07 12:35:17 EDT
*** Bug 574013 has been marked as a duplicate of this bug. ***
Comment 71 Ed Willink CLA 2021-06-07 13:07:22 EDT
Something has clearly changed. Whereas the similar OCL builds were previously spread from 12 to 20 mins, they are now mostly 10 mins but with slower 17, 18, 20 and 23 min runs. Almost as if we now have two gears. See Bug 572158.
Comment 72 Denis Roy CLA 2021-06-11 09:39:45 EDT
(In reply to Thomas Wolf from comment #67)
> JGit and EGit builds are essentially dead.

Have you seen any improvements this week?
Comment 73 Thomas Wolf CLA 2021-06-11 12:02:17 EDT
(In reply to Denis Roy from comment #72)
> (In reply to Thomas Wolf from comment #67)
> > JGit and EGit builds are essentially dead.
> 
> Have you seen any improvements this week?

Yes, we did. Whatever you did helped. Interestingly, the JGit build is now much faster than it was before (from about 40 min to 20-30 min). For EGit I don't have a direct comparison since there were no commits on master. egit-stable.gerrit now takes around 40 min; that job essentially does the same as egit.gerrit, which normally took about 25 min.

In any case it looks OK right now.

Thanks for your efforts!
Comment 74 Aparna Argade CLA 2021-06-18 10:19:12 EDT
Today SWTBot gerrit build https://ci.eclipse.org/swtbot/job/swtbot-gerrit is very slow. It normally takes 20 mins. Today it took 2hr or more. See builds #1232, #1234. Build #1234 failed due to UI slowness.
Comment 75 Mikaël Barbero CLA 2021-06-18 10:29:57 EDT
(In reply to Aparna Argade from comment #74)
> Today SWTBot gerrit build https://ci.eclipse.org/swtbot/job/swtbot-gerrit is
> very slow. It normally takes 20 mins. Today it took 2hr or more. See builds
> #1232, #1234. Build #1234 failed due to UI slowness.

Your job got stuck for 2h in https://ci.eclipse.org/swtbot/job/swtbot-gerrit/1232/. See the timestamps in https://ci.eclipse.org/swtbot/job/swtbot-gerrit/1232/consoleFull:

09:03:43 (SWT:1574): Gtk-WARNING **: 07:03:43.657: gtk_widget_size_allocate(): attempt to allocate widget with width -12 and height 0
11:03:43 
11:03:43 (SWT:1574): Gdk-CRITICAL **: 09:03:43.720: gdk_threads_set_lock_functions: assertion 'gdk_threads_lock == NULL && gdk_threads_unlock == NULL' failed

It looks more like a stuck SWT test than slowness of the build infra.
Comment 76 Mikaël Barbero CLA 2021-06-18 11:03:47 EDT
AFAICT, build performance is back to normal. Here is what we did:

* All machines from the old cluster have been allocated to the new cluster
* Spinning HDDs on machines with high IO saturation have been replaced with SSDs
* Backend storage server has been cleaned up and restarted.
* Antiquated machines with poor performance have been retired.

We continue to monitor performance, but I guess this ticket can be closed for now. Feel free to reopen if you still face issues.

Thank you everyone for your patience.
Comment 77 Matthias Sohn CLA 2021-06-18 11:58:55 EDT
(In reply to Mikaël Barbero from comment #76)
> AFAICT, build performance is back to normal. Here is what we did:
> 
> * All machines from the old cluster have been allocated to the new cluster
> * Spinning HDDs on machines with high IO saturation have been replaced with
> SSDs
> * Backend storage server has been cleaned up and restarted.
> * Antiquated machines with poor performance have been retired.
> 
> We continue to monitor performance, but I guess this ticket can be
> closed for now. Feel free to reopen if you still face issues.
> 
> Thank you everyone for your patience.

I can confirm that we are back to normal regarding build times for both JGit and EGit. I very much appreciate the improvements you implemented :-)