Bug 464947 - Many "MALFORMED" exceptions during platform build, during pack200 with signing at Eclipse.org
Summary: Many "MALFORMED" exceptions during platform build, during pack200 with signin...
Status: RESOLVED FIXED
Alias: None
Product: CBI
Classification: Technology
Component: signing-service (show other bugs)
Version: unspecified   Edit
Hardware: PC Linux
: P3 critical (vote)
Target Milestone: ---   Edit
Assignee: CBI Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 465338 465527 494679 (view as bug list)
Depends on:
Blocks: 465376
  Show dependency tree
 
Reported: 2015-04-18 23:46 EDT by David Williams CLA
Modified: 2016-06-10 09:03 EDT (History)
20 users (show)

See Also:


Attachments
full log from a failed run (779.33 KB, application/x-zip-compressed)
2015-04-19 15:59 EDT, David Williams CLA
no flags Details
Here's bundle, including target, where failure occurred (41.75 KB, application/x-zip-compressed)
2015-04-19 16:10 EDT, David Williams CLA
no flags Details
zipped target folder of failed artifact (30.19 KB, application/x-zip)
2015-04-23 09:12 EDT, Marvin Mueller CLA
no flags Details
Other issue after using fork (44.91 KB, application/x-zip)
2015-04-23 10:19 EDT, Marvin Mueller CLA
no flags Details
bundle (and target) of failed run (4.02 MB, application/x-xz)
2015-04-23 12:48 EDT, David Williams CLA
no flags Details
log from run that failed on org.eclipse.core.commands (110.88 KB, application/x-xz)
2015-04-23 12:55 EDT, David Williams CLA
no flags Details
bundle (and target) that failed with Illegal Argument exception (3.85 MB, application/x-xz)
2015-04-23 16:56 EDT, David Williams CLA
no flags Details
log that goes with previous attached failure (3.85 MB, application/x-xz)
2015-04-23 16:58 EDT, David Williams CLA
no flags Details
REDO: here is the log that goes with previous failed bundle (146.23 KB, application/x-xz)
2015-04-23 18:20 EDT, David Williams CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Williams CLA 2015-04-18 23:46:59 EDT
This is probably related to bug 463510, but thought I'd open it separately, since its cause is unknown at the moment. 

I've tried about 5 test I-builds of the eclipse platform, 4 with it JAR_PROCESSOR_JAVA=java8 set, and one without. Even the one without, ended up with this error! A couple of the others had a similar error (always at a different point), twice it failed in a more "normal" way, but was unexplainable, looking at the results. And one ended with a "crashed VM"! Again, for nothing very obvious. 

I'm going to try a few more without JAR_PROCESSOR_JAVA in the equation, since if it still fails then, it would seem something fundamental is wrong (since, that should use the same underlying mechanism we have been using all along). 

And ... recall ... I was pretty sure the "batch signing" changes would not effect Tycho/Maven builds! 

I'm beginning to wonder if there is some odd "threading" change? 


So, questions are, how is is possible Maven/Tycho is effected by "the batch signer" ... I'm assuming it goes through the "web service" via the maven plugin, to do the signing. 

Have arguments to the "web service" changed? In a way similar to the the way the arguments to batch signer changed? (I believe now it's mandatory to specify one of 'mail' or 'mailnow' to the batch signer? 


In addition, I believe the 'java8' option uses the "newer" jarprocessor from the Eclipse platform? Perhaps that changes things it ways that were meant to be fixed (in another context)? 

I'll try a few more builds with out JAR_PROCESSOR_JAVA, to be positive something is broken, even using "the old way". 

= = = = = 

[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.ltk.ui.refactoring.tests: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.ltk.ui.refactoring.tests: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED

...

Caused by: org.apache.maven.plugin.PluginExecutionException: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:115)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
        ... 19 more
Caused by: java.lang.IllegalArgumentException: MALFORMED
        at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
Comment 1 David Williams CLA 2015-04-18 23:58:21 EDT
Minor point, I have determined the JAR_PROCESSOR_JAVA was not set set to java8, but, the way I had things, would have been set to 
JAR_PROCESSOR_JAVA=""

I'm assuming this would be equivalent to "not being set at all", (it is for pure bash, but I don't really know what is "reading" this environment variable? 

So, now I'll be sure it is "not there at all". And see if I can get a clean run.
Comment 2 David Williams CLA 2015-04-19 12:12:18 EDT
I took the variable out completely, and got two good runs. 

I then set back to 

export JAR_PROCESSOR_JAVA="java8"

And lo and behold it ran ok, that time. 

Any thoughts? 
Coincidence? 
Server Load? 
Do certain "NFS nodes" take long to all "get in sync" with your changes? 

Or, did you fix something over the weekend? 

I should note, the failures I were seeing really were producing "bad" pack.gz files. So bad, that they could not even be decompressed. The jars themselves, presumably conditioned, and signed, looked ok. So ... don't know.
Comment 3 David Williams CLA 2015-04-19 12:26:37 EDT
Jan, Igor, 

Adding you as Tycho experts, to see if you know of any possible way, that the Tycho "re-pack", or "pack" Mojo's could interact with the "batch signing service" (which calls 're-pack' and then 'sign', ordinarily, but I know you do not use it for that. 

So, I'd sure think no interaction ... but, guess what I'm wondering, is if they both used the same Java VM on disk, and it in turn perhaps "called out" to some native code, is that a possible source of "concurrency issues"? (I don't actually know that it calls out to native code, but some very old bug reports, when this was first implemented, implied that it did). 

And, hmmm, I wonder, can files in general get stepped-on, by two processes? Perhaps especially if doesn't flush/close it correctly? (Though, I think would have shown up before now, if that was that case).
Comment 4 David Williams CLA 2015-04-19 12:46:38 EDT
Also adding Mikael and Thanh, with similar question about Maven "signing mojo" ... any possible way it could interact with the "batch signing service"? 

Again, I would sure think not. 

Remember, we in the Platform build should not be using the "batch signing service" at all ... so, I was just so surprised a change in that service, would impact Maven/Tycho builds. 

Grasping at straws ... :)
Comment 5 David Williams CLA 2015-04-19 15:59:20 EDT
Created attachment 252518 [details]
full log from a failed run

Since my last comment, I ran two more successful test I-builds, but was thinking "maybe is was a temporary thing", but thought I'd just as well run a few more, then "boom", the third once failed with the "MALFORMED" exception. A different location that last time, so sure sounds like some semi-random issue -- the time often associated with "threads" or similar.
Comment 6 David Williams CLA 2015-04-19 16:10:38 EDT
Created attachment 252520 [details]
Here's bundle, including target, where failure occurred

Not sure this can be useful ... just shows the 'incomplete' bundle where the error occurred. (In complete, since the pack.gz file is much smaller than it should be since, obviously, it was interrupted by exception.
Comment 7 David Williams CLA 2015-04-19 16:25:18 EDT
For what it's worth, I've been trying, so far, to "run the build" using Java 7, 
even though, by use of the environmental variable, was saying to use "Java 8" to do the re--pack and sign. 

If anyone thinks that makes a difference, I'd sure like to hear why, since, again, I have been assuming this change to "batch signer" was completely independent of Maven/Tycho, that they use a completely different method and do not use the batch signer at all. 

Are the methods related, somehow, "under the covers"; "behind the scenes"? 


At any rate, I'll try a few runs using Java 8 (which, used to result in another bug, I'll log :) ... and, then, probably, remove the variable, and run a bunch with Java 7, as before. Perhaps the issue is independent of what I do?
Comment 8 David Williams CLA 2015-04-20 04:27:22 EDT
(In reply to David Williams from comment #7)

> 
> At any rate, I'll try a few runs using Java 8 (which, used to result in
> another bug, I'll log :) ... and, then, probably, remove the variable, and
> run a bunch with Java 7, as before. Perhaps the issue is independent of what
> I do?

I've ran 5 "test I-builds" since my last comment, and so far they all complete with the "MALFORMED" exception. 

bug 464980 is that other bug I mentioned we have, running under 1.8, but is is unrelated to the current signing/packing issues. I first saw several weeks ago, before all this started, but didn't have time then to write it up.

I guess teh questions that need to be answered are: 

a) Why does the "malformed" exception occur at all, if Tycho and Maven builds not related to this "batch processor". 

b) if they are related to batch processor, in what way? 

c) what is the root cause? concurancy? Do people using the "web service" have to change what they specify? (Such as that 'mail|nomail' is no longer optional? 

I am not *that* concerned about moving up to Java 1.8, but am concerned if we say that *everyone* has to ... as we see even in our case, "it can change things" ... so .... multiply that times 30? and I think we'll have a lot of chaos. 

I might suggest a "java7" option also be provided ... for now ... but, unless we know the root cause of why it happens, I am not sure that alone will really fix anything (i.e. a sample of 5 builds in one day, doesn't "prove" that the problem is not still there, since it was intermittant to begin with.
Comment 9 David Williams CLA 2015-04-20 09:35:56 EDT
(In reply to David Williams from comment #8)
> (In reply to David Williams from comment #7)
> 
> > 
> > At any rate, I'll try a few runs using Java 8 (which, used to result in
> > another bug, I'll log :) ... and, then, probably, remove the variable, and
> > run a bunch with Java 7, as before. Perhaps the issue is independent of what
> > I do?
> 
> I've ran 5 "test I-builds" since my last comment, and so far they all
> complete with the "MALFORMED" exception. 

Whoops, major typo, re-reading in the light of day ... 

they all completed *without* the "MALFORMED" exception.
Comment 10 David Williams CLA 2015-04-20 09:48:52 EDT
(In reply to David Williams from comment #9)
> (In reply to David Williams from comment #8)
> > (In reply to David Williams from comment #7)
> > 
> > > 
> > > At any rate, I'll try a few runs using Java 8 (which, used to result in
> > > another bug, I'll log :) ... and, then, probably, remove the variable, and
> > > run a bunch with Java 7, as before. Perhaps the issue is independent of what
> > > I do?
> > 
> > I've ran 5 "test I-builds" since my last comment, and so far they all
> > complete with the "MALFORMED" exception. 
> 
> Whoops, major typo, re-reading in the light of day ... 
> 
> they all completed *without* the "MALFORMED" exception.

Oh, BUT, a test build running overnight ended *WITH* the MALFORMED exception: 

[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.core.filesystem.macosx: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.core.filesystem.macosx: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED
Comment 11 David Williams CLA 2015-04-20 09:54:46 EDT
I think this may be a Tycho bug? 

It occurs with both Java 1.7, and Java 1.8. 

Perhaps interacts with the CBI-signing-plugin?
Comment 12 Jan Sievers CLA 2015-04-20 10:21:04 EDT
the only recent pack200 change in Tycho I am aware of is

http://git.eclipse.org/c/tycho/org.eclipse.tycho.extras.git/diff/pack200/tycho-pack200-impl/src/main/java/org/eclipse/tycho/extras/pack200/Pack200Archiver.java?id=f080a224e1c1f575d52fa0be1106f828bd49073a

but this shouldn't matter because it only removed special workarounds for pack200 JDK bugs in Java 6 or earlier and Java 6 is no longer supported anyway with Tycho 0.23.0-SNAPSHOT (bug 463433)

that said, the JDK version you are using can have an impact as the pack200 code comes with the JDK. So if you changed recently to Java 8, going back to Java 7 may be worth checking to see if that makes any difference in behaviour.

To my knowledge signing and packing are extremely sensitive to subtle bytestream level changes if used together (e.g. if unpack200 behaves slightly different on one JDK vs another JDK where pack200 was originally done, this could lead to invalid signatures of the packed signed jar). There is a so-called pack normalization step to make sure that a pack+unpack operation has the binary identical result as the original (signed) artifact but I'm unsure if using a different JDK somewhere in the process can throw it off.

So another idea would be to double-check if the JDKs used for packing/unpacking across the whole process are consistently the same and if not, check if aligning them helps.
Comment 13 Igor Fedorenko CLA 2015-04-20 10:34:52 EDT
Overall, packing and signing process looks like this

1. tycho performs pack200 normalize, which invokes java.util.jar.Pack200, see below.
2. tycho calls remote signing service via eclipse-jarsigner-plugin.
3. tycho performs pack200 pack, which, again, invokes java.util.jar.Pack200.

I don't see how jvms used to pack200:normalize and pack200:pack can get "out-of-sync" during the same build. Signing, on the other hand, is not supposed to change class files, and it should not matter what jvm is used for signing.

I don't know what JAR_PROCESSOR_JAVA does, but I don't think this is something tycho or maven understand.

I can't really think of a plausible explanation of the observed problem. Failure in different projects does suggest some sort of concurrency issue, but I can't think of what it might be.
Comment 14 Jan Sievers CLA 2015-04-20 10:43:37 EDT
David,

to make this easier to analyse we'll somehow have to strip it down.

One idea is if you could provide an original valid and not yet pack-conditioned bundle jar for which (un)packing failed, we could test if pack normalization behaves exactly the same when using Java 7 and Java 8.

>And ... recall ... I was pretty sure the "batch signing" changes would not 
>effect Tycho/Maven builds! 

that's supicious... what changes were done exactly?

>Tycho "re-pack", or "pack" Mojo's could interact with the "batch signing 
>service" (which calls 're-pack' and then 'sign', ordinarily, but I know you do 
>not use it for that. 

if the signing service (which is called during build AFAICT) is doing pack200 operations independent of Tycho then it could affect pack200 operations on the Tycho side if the two pack200 implementations somehow differ in an incompatible way.
Comment 15 Jan Sievers CLA 2015-04-20 11:23:00 EDT
(In reply to Igor Fedorenko from comment #13)
> 1. tycho performs pack200 normalize, which invokes java.util.jar.Pack200,
> see below.
> 2. tycho calls remote signing service via eclipse-jarsigner-plugin.

if this step somehow fails silently, the next step 3 below could fail as a follow-up e.g. due to corrupted file.
At least the HTTP status code handling in [1] leaves room for returning without an error although no content/only partial content was served [2].

I suggest to check the logs of the jar signer service at the time the build failure(s) occured.

> 3. tycho performs pack200 pack, which, again, invokes java.util.jar.Pack200.


[1] https://git.eclipse.org/c/cbi/org.eclipse.cbi.maven.plugins.git/tree/common/src/main/java/org/eclipse/cbi/common/signing/Signer.java#n44
[2] http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_Success
Comment 16 David Williams CLA 2015-04-20 12:45:07 EDT
(In reply to Jan Sievers from comment #14)
> David,
> 
> to make this easier to analyse we'll somehow have to strip it down.
> 
> One idea is if you could provide an original valid and not yet
> pack-conditioned bundle jar for which (un)packing failed, we could test if
> pack normalization behaves exactly the same when using Java 7 and Java 8.

Ok, not sure how to do that ... but, am willing to try. I'm guessing you mean more than what is attached to this bug, the 'snapshot' jar would have already been conditioned? 

> 
> >And ... recall ... I was pretty sure the "batch signing" changes would not 
> >effect Tycho/Maven builds! 
> 
> that's supicious... what changes were done exactly?
> 


By CC, I'll have to ask the webmasters to describe the changes, AND to describe the relationship between the "signing web service" (which the maven CBI signing bundle uses) and the "batch signing service". But, I do know the maven CBI signing bundle does not do any pack/re-pack ... well ... as long as the webservice does not, which as far as I know it never has. 

But, what I do know is: the "batch signing service" (used by Buckminster, with -nopack option) and PDE builds (by tradition using the pack function, when the VM used to -repack was well known) was found to be using 
Java 6 + the old "update manager jarprocessor". So, I asked that to be upgraded to Java 7 or 8 and to use the new p2 jarprocessor (bug 464541). 

Then there were problems, and it was discovered many of the problems were caused by "builders" (or, the IT service, depending on your point of view) was re-signing bundles that were already signed. This was discussed in original bug 463510 and a very old bug 174475 (and, it used to be "sometimes ok, sometimes not", but with the change in VMs, etc., was obvious it was now almost always not ok to resign a bundle). 


So, in the end, the webmasters changed the batch signing service, at least, to not re-sign a bundle, if "jarsigner -verify" returned "ok", plus added an option to use Java 8 to "run" the jarprocessor. All this, while leaving the old Java 6 stuff in place, while we tested. I believe specifying 'java8' ALSO uses the new p2 jarprocessor. 


> >Tycho "re-pack", or "pack" Mojo's could interact with the "batch signing 
> >service" (which calls 're-pack' and then 'sign', ordinarily, but I know you do 
> >not use it for that. 
> 
> if the signing service (which is called during build AFAICT) is doing
> pack200 operations independent of Tycho then it could affect pack200
> operations on the Tycho side if the two pack200 implementations somehow
> differ in an incompatible way.

I'm pretty sure this is not the case ... at least by design ... but, have not yet heard from webmasters what the relationship is, or if this could be a source of error (on the nature of a "typo" or something?)
Comment 17 Eclipse Webmaster CLA 2015-04-20 14:13:05 EDT
(In reply to David Williams from comment #16)

> By CC, I'll have to ask the webmasters to describe the changes,

The changes to the batch signer consisted of:

1)The sign command now refuses to queue jars(it emits a '$FILE already signed') when the result of 'jarsigner -verify' is "verified"
2) Forces the second parameter of the sign command to be one of: now,mail or nomail(which is consistent with the 'help' instructions)
3) It passes the value of 'JAR_PROCESSOR_JAVA' from the environment to the signing request queue.
4) The queue processor now switches between using the Java 6 or Java 8 jarsign command based on the presence of 'java8' in the signing queue(#3). This also impacts which jar processor is called(by default the same jar processor we've always used, with Java 8 the jar processor from Luna)

> describe the relationship between the "signing web service"

The web service uses Java 7 and just calls jarsign on the requested file, it does no reprocessing(that I can see).  If you use the 'now' flag with the batch signer it basically fires a request to the web signing service.  Outside of that there isn't a direct connection between the two services.

> > if the signing service (which is called during build AFAICT) is doing
> > pack200 operations independent of Tycho then it could affect pack200
> > operations on the Tycho side if the two pack200 implementations somehow
> > differ in an incompatible way.
> 
> I'm pretty sure this is not the case ... at least by design ... but, have
> not yet heard from webmasters what the relationship is, or if this could be
> a source of error (on the nature of a "typo" or something?)

We don't do any explicit (un)pack operations in the batch signer scripts, but I thought the jarprocessor did (un)pack things as needed?

-M.
Comment 18 David Williams CLA 2015-04-20 14:26:58 EDT
(In reply to Eclipse Webmaster from comment #17)

> We don't do any explicit (un)pack operations in the batch signer scripts,
> but I thought the jarprocessor did (un)pack things as needed?

Yes, to be explicit, (with correct terminology, not to confuse with unpacking pack.gz files, which it does not do in this context ... at least, shouldn't!) ... the jarprocessor calls pack200 with the re-pack option. (i.e. "conditions" it in preparation for signing) (and then calls jarsigner, but does not call the final "pack200"). 

So, in the end, sounds my assumption that they "should be separate" sounds right ... but ... why did we start getting this "MALFORMED" exceptions, after the change? 

[Tycho team, am I correct in assuming you just use the "system" JRE to call Java's Pack/repack, correct? (i.e. don't try to match "BREES" or anything?).] 

As one possible point of "collision", can anyone tell if Java's PACK/repack classes make native calls to the executable that comes with the VM, or is it pure Java? If so, is that native call an opportunity for "collisions" of some sort?
Comment 19 David Williams CLA 2015-04-20 21:37:12 EDT
(In reply to Eclipse Webmaster from comment #17)
> (In reply to David Williams from comment #16)
> 
> > By CC, I'll have to ask the webmasters to describe the changes,
> 
> The changes to the batch signer consisted of:
> 
> 1)The sign command now refuses to queue jars(it emits a '$FILE already
> signed') when the result of 'jarsigner -verify' is "verified"
> 2) Forces the second parameter of the sign command to be one of: now,mail or
> nomail(which is consistent with the 'help' instructions)
> 3) It passes the value of 'JAR_PROCESSOR_JAVA' from the environment to the
> signing request queue.
> 4) The queue processor now switches between using the Java 6 or Java 8
> jarsign command based on the presence of 'java8' in the signing queue(#3).

And that's "either, or", right. Clients can set JAR_PROCESSOR_JAVA=java8 as an environment variable OR specify java8 it on the command line? Or, did I get confused? 

I did notice Orbit builds, using java8 on command line are signed with Java 6 jarsigner (SHA1 in manifest) ... (as I noted in bug 463510, I assumed an error in server scripts ... now I am wondering if I misunderstood?) 

I do see that our Tycho builds are signed with Java 7 (or 8?) as I can see 
SHA256 in the manifest.mf files 

I've removed JAR_PROCESSOR_JAVA from our Tycho/Maven builds ... should not make a difference anyway .... so if I see "MALFORMED" again ... then we'll know ... well, we'll know I'm more confused than ever. :)
Comment 20 Jan Sievers CLA 2015-04-21 03:00:50 EDT
(In reply to David Williams from comment #18)
> [Tycho team, am I correct in assuming you just use the "system" JRE to call
> Java's Pack/repack, correct? (i.e. don't try to match "BREES" or
> anything?).] 

we use the java.util.jar.Pack200 API which is provided by the JDK the build is being run with. This is done in-process, no forking, no special BREE handling.
Comment 21 Jan Sievers CLA 2015-04-21 03:07:37 EDT
(In reply to David Williams from comment #19)
> I've removed JAR_PROCESSOR_JAVA from our Tycho/Maven builds ... should not
> make a difference anyway .... so if I see "MALFORMED" again ... then we'll
> know ... well, we'll know I'm more confused than ever. :)

the next time this happens, we'll need the logs of the signing service as well as the zipped contents of the target/ folder of the failed project to see what's going on.
Comment 22 David Williams CLA 2015-04-21 03:34:48 EDT
(In reply to Jan Sievers from comment #21)

> we use the java.util.jar.Pack200 API which is provided by the JDK the build
> is being run with. This is done in-process, no forking, no special BREE
> handling.

Thanks Jan. Looking at those classes (or, their byte codes, for the "implementation" part, I see two suspicious things. 1) They use threadLocal at some points -- an area hared to get just right, IMHO. and 2) they have hardcoded /tmp/.. in a few places, instead of using the file.io.tmpdir property we supplied. So, in theory, tmp could be getting filled up, and somewhere (perhaps even in cbi-signer, they "eat" an exception, instead of failing (or, at least, it doesn't make it's way back "through the queue". 

[only on my last failure did I think to look at /tmp, and at that time, it was only 65% full ... but, that was a good 45 minutes after the failure, so could have cleaned itself up a little by then. (If anyone sees it really getting full, let me know and I'll start a quick test job :) 


(In reply to Jan Sievers from comment #21)

> the next time this happens, we'll need the logs of the signing service as
> well as the zipped contents of the target/ folder of the failed project to
> see what's going on.

I assume you mean something different than what is attached here? Something from "server side"?
Comment 23 Jan Sievers CLA 2015-04-21 03:51:05 EDT
(In reply to David Williams from comment #22)
> I assume you mean something different than what is attached here? Something
> from "server side"?

ah, oversaw the zipped target/ folder attached already. 
the signed jar seems OK according to jarsigner -verify.
I can manually pack200/unpack200 the signed jar without problems using JDK 7 and 8...

There were memory leaks in the pack200 code in the past (JDK 6) and it seems the problem only occurs sporadically/after some time. I'm beginning to think it may be a good idea to always fork the pack200 process so we would avoid accumulating spurious state.
Comment 24 Jan Sievers CLA 2015-04-21 04:56:26 EDT
(In reply to Jan Sievers from comment #23)
> There were memory leaks in the pack200 code in the past (JDK 6) and it seems
> the problem only occurs sporadically/after some time. I'm beginning to think
> it may be a good idea to always fork the pack200 process so we would avoid 
> accumulating spurious state.

https://git.eclipse.org/r/#/c/46156/1

when this patch has gone through gerrit you will be able to configure fork=true for Tycho's pack200 operations (default is still false).
There is a price to pay for forking in terms of performance but if turning on this switch works around the problem, we would have narrowed down the root cause to some kind of accumulated state in the JDK's pack200 code.
Comment 25 Igor Fedorenko CLA 2015-04-21 07:44:41 EDT
@David where do you see use of hardcoded /tmp location? Note that forked pack200 process was almost certainly using default temp directory location.

@Jan the exception happens reading recently written jar file, not creating pack200 archive. I don't believe forking pack200 will narrow down the problem to accumulated jvm state. Forked pack200 execution changes timing and jvm state is different, so it may still workaround the problem, or at least make the problem much less likely, so probably not a bad change.
Comment 26 Jan Sievers CLA 2015-04-21 08:01:38 EDT
stacktrace above was truncated, attaching full relevant stacktrace extracted from log zip:

[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.e4.core.di.annotations: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack (pack200-pack) on project org.eclipse.e4.core.di.annotations: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:317)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:152)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:555)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:214)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.23.0-SNAPSHOT:pack failed: MALFORMED
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:115)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
	... 19 more
Caused by: java.lang.IllegalArgumentException: MALFORMED
	at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
	at java.util.zip.ZipFile.getZipEntry(ZipFile.java:531)
	at java.util.zip.ZipFile.access$900(ZipFile.java:56)
	at java.util.zip.ZipFile$1.nextElement(ZipFile.java:513)
	at java.util.zip.ZipFile$1.nextElement(ZipFile.java:483)
	at java.util.jar.JarFile$1.nextElement(JarFile.java:244)
	at java.util.jar.JarFile$1.nextElement(JarFile.java:239)
	at java.util.Collections.list(Collections.java:3688)
	at com.sun.java.util.jar.pack.PackerImpl$DoPack.scanJar(PackerImpl.java:631)
	at com.sun.java.util.jar.pack.PackerImpl$DoPack.run(PackerImpl.java:457)
	at com.sun.java.util.jar.pack.PackerImpl.pack(PackerImpl.java:99)
	at org.eclipse.tycho.extras.pack200.Pack200Wrapper.pack(Pack200Wrapper.java:51)
	at org.eclipse.tycho.extras.pack200.Pack200Wrapper.pack(Pack200Wrapper.java:41)
	at org.eclipse.tycho.extras.pack200.Pack200Archiver.pack(Pack200Archiver.java:148)
	at org.eclipse.tycho.extras.pack200.mojo.Pack200PackMojo.execute(Pack200PackMojo.java:70)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:106)
	... 20 more
[ERROR] 
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
Comment 27 David Williams CLA 2015-04-21 08:08:21 EDT
(In reply to Igor Fedorenko from comment #25)
> @David where do you see use of hardcoded /tmp location? Note that forked
> pack200 process was almost certainly using default temp directory location.

False alarm, looking at it now, I see only in the comment of pack200 class, 
where they are showing an example.  

 *    try {
 *        JarFile jarFile = new JarFile("/tmp/testref.jar");
 *        FileOutputStream fos = new FileOutputStream("/tmp/test.pack");
 *        // Call the packer
 *        packer.pack(jarFile, fos);
 *        jarFile.close();
 *        fos.close();
 *
 *        File f = new File("/tmp/test.pack");
 *        FileOutputStream fostream = new FileOutputStream("/tmp/test.jar");
 *        JarOutputStream jostream = new JarOutputStream(fostream);
 *        Unpacker unpacker = Pack200.newUnpacker();
 *        // Call the unpacker
Comment 28 Jan Sievers CLA 2015-04-21 08:54:34 EDT
I made an attempt to fix this in 0.23.0-SNAPSHOT

http://git.eclipse.org/c/tycho/org.eclipse.tycho.extras.git/commit/?id=a52a0d258fc0889b9eb45f2092c10845e0575d73

could be this was a latent bug in Tycho related to jar file handles still open at the same time that pack200 reads the jar file.

Two things:

1. Wait a while to see if this fixes the problem
2. If the patch by itself does not help, it also allows to set a new parameter fork=true for tycho-pack200a-plugin/tycho-pack200b-plugin, so you could use this as an alternative workaround
Comment 29 Eclipse Webmaster CLA 2015-04-21 09:42:45 EDT
(In reply to David Williams from comment #19)
 
> And that's "either, or", right. Clients can set JAR_PROCESSOR_JAVA=java8 as
> an environment variable OR specify java8 it on the command line? Or, did I
> get confused? 

No.  Based on your comment in 463510 about it being difficult to add command line flags I tuned the script to use the environment instead.

-M.
Comment 30 Marvin Mueller CLA 2015-04-23 02:47:16 EDT
Hey there, at this moment we seem to have the same issue[1] in our Jubula HIPP instance. Nearly each Build is failing due to the error in different places:
[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.22.0:pack (pack200-pack) on project org.eclipse.jubula.feature.api: Execution pack200-pack of goal org.eclipse.tycho.extras:tycho-pack200b-plugin:0.22.0:pack failed: MALFORMED

We are using java 8 for building since we have some jars which depend on JavaFX. Also it seems to be getting worse if we use maven deploy(but that is just an assumption). I have switched back to not deploy our snapshot bundles into the nexus and the build gets the error in later bundles. I am unsure if this is related to the problem or if this is to the sporadic nature of the problem.

We had this problem sometimes, but not very often. Since the last days it is really getting worse and a lot of builds fail(nearly each). Could we do something here as a workaround?


[1] https://hudson.eclipse.org/jubula/job/jubula-nightly/2375/console
Comment 31 Jan Sievers CLA 2015-04-23 04:22:28 EDT
(In reply to Marvin Mueller from comment #30)
> We had this problem sometimes, but not very often. Since the last days it is
> really getting worse and a lot of builds fail(nearly each). Could we do
> something here as a workaround?

you can try using Tycho 0.23.0-SNAPSHOT [1], see comment 28
If you do, let us know if this helped or not.

[1] https://wiki.eclipse.org/Tycho/Release_Notes/0.23#SNAPSHOT_builds
Comment 32 Marvin Mueller CLA 2015-04-23 06:30:47 EDT
(In reply to Jan Sievers from comment #31)
> (In reply to Marvin Mueller from comment #30)
> > We had this problem sometimes, but not very often. Since the last days it is
> > really getting worse and a lot of builds fail(nearly each). Could we do
> > something here as a workaround?
> 
> you can try using Tycho 0.23.0-SNAPSHOT [1], see comment 28
> If you do, let us know if this helped or not.
> 
> [1] https://wiki.eclipse.org/Tycho/Release_Notes/0.23#SNAPSHOT_builds

I have tested this and i am unsure if this does make any difference. The Build is still failing with the 0.23.0-SNAPSHOT version. I have also reverted the maven setting "Deploy to repo.eclipse.org" to NONE, afters this the build run successful, but it might only be by accident.
Comment 33 Jan Sievers CLA 2015-04-23 07:35:31 EDT
(In reply to Marvin Mueller from comment #32)
> (In reply to Jan Sievers from comment #31)
> > (In reply to Marvin Mueller from comment #30)
> > > We had this problem sometimes, but not very often. Since the last days it is
> > > really getting worse and a lot of builds fail(nearly each). Could we do
> > > something here as a workaround?
> > 
> > you can try using Tycho 0.23.0-SNAPSHOT [1], see comment 28
> > If you do, let us know if this helped or not.
> > 
> > [1] https://wiki.eclipse.org/Tycho/Release_Notes/0.23#SNAPSHOT_builds
> 
> I have tested this and i am unsure if this does make any difference. The
> Build is still failing with the 0.23.0-SNAPSHOT version. I have also
> reverted the maven setting "Deploy to repo.eclipse.org" to NONE, afters this
> the build run successful, but it might only be by accident.

thanks for letting us know. Next attempt to work around using Tycho 0.23.0-SNAPSHOT would be to fork pack200 operations by configuring in your parent pom

<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.eclipse.tycho.extras</groupId>
        <artifactId>tycho-pack200a-plugin</artifactId>
        <version>${tycho-version}</version>
        <configuration>
          <fork>true</fork>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.eclipse.tycho.extras</groupId>
        <artifactId>tycho-pack200b-plugin</artifactId>
        <version>${tycho-version}</version>
        <configuration>
          <fork>true</fork>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>

also, before doing another build, can you attach the zipped target/ folder of the artifact that failed the build?

Another thing you could try independently is to use eclipse-jarsigner-plugin 1.1.2-SNAPSHOT from https://repo.eclipse.org/content/repositories/cbi-snapshots/. It seems there have been some error handling fixes in the meantime since last release 1.1.1:

https://git.eclipse.org/c/cbi/org.eclipse.cbi.maven.plugins.git/log/

For the record, JDK and OS where the problem occured for Jubula:

Maven home: /home/hudson/genie.jubula/.hudson/maven/slavebundle/bundled-maven
Java version: 1.8.0_40, vendor: Oracle Corporation
Java home: /opt/public/common/jdk1.8.0_40/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.0.101-0.46-default", arch: "i386", family: "unix"
Comment 34 Jan Sievers CLA 2015-04-23 07:58:06 EDT
recent fix in JDK 8 update 45 that seems slightly suspicious to me:

https://bugzilla.redhat.com/show_bug.cgi?id=1211504
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/rev/165947f5448a

which JDK version are we using on the signing server?
Comment 35 Marvin Mueller CLA 2015-04-23 09:12:06 EDT
Created attachment 252676 [details]
zipped target folder of failed artifact
Comment 36 Marvin Mueller CLA 2015-04-23 10:19:27 EDT
Created attachment 252681 [details]
Other issue after using fork

I have done the changes as mentioned in comment 33 but this does lead to other issues (2 new issues 1 successful build). I have attached the project with the hudson log where a jvm crash was happening [1]. Another issue was that we got an illegal argument exception[2]. Both happend in the  "org.eclipse.tycho:tycho-p2-plugin:0.23.0-SNAPSHOT:p2-metadata (p2-metadata)" goal.

Thanks for the help.



[1] https://hudson.eclipse.org/jubula/job/jubula-nightly/2390/console
[2] https://hudson.eclipse.org/jubula/job/jubula-nightly/2388/console
Comment 37 Jan Sievers CLA 2015-04-23 10:33:17 EDT
a JVM crash and another seemingly unrelated error.

Now I'm completely confused. Sorry I have no clue what's going on for now.
Comment 38 David Williams CLA 2015-04-23 11:26:16 EDT
(In reply to Jan Sievers from comment #34)
> recent fix in JDK 8 update 45 that seems slightly suspicious to me:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1211504
> http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/rev/165947f5448a
> 
> which JDK version are we using on the signing server?

I believe it is 

$ /shared/common/jdk1.8.0_x64-latest/bin/java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

(And, it was Java 7, up to a few days ago). 

WEBMASTERS (sorry to yell) ... can you please confirm (this bug would be about the webservice jar signer): both what's currently used, and if it will ALWAYS be "-latest"? or, is it a separately installed VM?

FYI: I've have asked for some method of "finding out" which version is in use, bug 464432).
Comment 39 David Williams CLA 2015-04-23 11:36:14 EDT
FWIW, I am about to start "pounding" the service again, just to test this ... is this a good time? (From Tycho's point of view). 

I had been using  
 <cbi-plugins.version>1.1.1</cbi-plugins.version>
and will start with a few runs from that, and then 
move up to 
 <cbi-plugins.version>1.1.2-SNAPSHOT</cbi-plugins.version>
if you think that makes a difference.
Comment 40 Jan Sievers CLA 2015-04-23 11:45:42 EDT
(In reply to David Williams from comment #39)
> FWIW, I am about to start "pounding" the service again, just to test this
> ... is this a good time? (From Tycho's point of view). 

sure. 

> I had been using  
>  <cbi-plugins.version>1.1.1</cbi-plugins.version>
> and will start with a few runs from that, and then 
> move up to 
>  <cbi-plugins.version>1.1.2-SNAPSHOT</cbi-plugins.version>
> if you think that makes a difference.

that was just a desperate idea trying to find the cause. Not sure it helps. There are too many variables at play right now and we should slow down and try one after the other I think. It's going to be hard as the issue is sporadic and for now it seems the only place where we can observe it is on eclipse.org infrastructure (due to the specific signing service being used).
Comment 41 David Williams CLA 2015-04-23 12:48:07 EDT
Created attachment 252687 [details]
bundle (and target) of failed run

My "tests" didn't take long. Failed first time! 

This attachmeent is "the build area" for the bundle that failed, included original as checked out from Git code, plus the "target" in partially complete state. 

I'll attach log next.
Comment 42 Igor Fedorenko CLA 2015-04-23 12:49:45 EDT
This was with pack200 fork=true, correct?
Comment 43 David Williams CLA 2015-04-23 12:55:12 EDT
Created attachment 252688 [details]
log from run that failed on org.eclipse.core.commands


If it helps ... this is pretty far fetched ... but, I did notice if I did following command: 

unzip -t org.eclipse.core.commands.jar_ ...-SNAPSHOT 

it gives a error about 
EF block length (0 bytes) invalid (< 4)

AND, as reported in bug 461944, I believe this is entirely spurious, an error in the "unzip --test" code (that was introduced with a security fix). 

But, even if spurious, I'm wondering if any part of the process uses "unzip -t" to confirm a "jar is ok" and then when it see this spurious error, it reports an error? 


WEBMASTERS, this would have occurred shortly after 12 noon today, 4/23, if there's anything in the "webservices" log that needs to be checked? 

Thanks,
Comment 44 David Williams CLA 2015-04-23 12:56:32 EDT
(In reply to Igor Fedorenko from comment #42)
> This was with pack200 fork=true, correct?

No. I take your question to mean I should change to use that. 
Will do.
Comment 45 Eclipse Webmaster CLA 2015-04-23 13:22:37 EDT
(In reply to David Williams from comment #38)

> WEBMASTERS (sorry to yell) ... can you please confirm (this bug would be
> about the webservice jar signer): both what's currently used, and if it will
> ALWAYS be "-latest"? or, is it a separately installed VM?

Right now its using /opt/public/common/jdk1.8.0_x64-latest and I expect it to continue using that path until we update to Java 9(or whatever comes next.

> WEBMASTERS, this would have occurred shortly after 12 noon today, 4/23, if
> there's anything in the "webservices" log that needs to be checked? 
 
The webservice only logs the requests, and returns the error to the requesting client.  I've tweaked the code to try and get it to log the signing output.

-M.
Comment 46 David Williams CLA 2015-04-23 16:46:30 EDT
(In reply to David Williams from comment #44)
> (In reply to Igor Fedorenko from comment #42)
> > This was with pack200 fork=true, correct?
> 
> No. I take your question to mean I should change to use that. 
> Will do.

Using 'forked', the build still fails, but in a different way. At first, I thought it might be due to some change in our code (see bug 465338) but then the next test run, it happened again, at a different place, so still seems "semi-random" issue related to "state" or concurrency? 

I'll attach log and the failed bundle from latest failure.
Comment 47 David Williams CLA 2015-04-23 16:56:36 EDT
Created attachment 252695 [details]
bundle (and target) that failed with Illegal Argument exception
Comment 48 David Williams CLA 2015-04-23 16:58:41 EDT
Created attachment 252696 [details]
log that goes with previous attached failure

FWIW, I did check /tmp, and plenty of space there and rest of file system.
Comment 49 Igor Fedorenko CLA 2015-04-23 17:03:19 EDT
I think attachment 252696 [details] is the same bundle, it's not a log.
Comment 50 David Williams CLA 2015-04-23 18:20:38 EDT
Created attachment 252699 [details]
REDO: here is the log that goes with previous failed bundle

Sorry about that. Just "clicked" wrong.
Comment 51 Lars Vogel CLA 2015-04-23 21:33:50 EDT
*** Bug 465338 has been marked as a duplicate of this bug. ***
Comment 52 Igor Fedorenko CLA 2015-04-23 23:16:33 EDT
The log has one interesting message right before the exception

    !ENTRY org.eclipse.equinox.p2.publisher.eclipse 4 0 2015-04-23 16:08:24.894
    !MESSAGE Unable to acquire PluginConverter service during generation for: /opt/public/eclipse/builds/4I/gitCache/eclipse.platform.releng.aggregator/rt.equinox.bundles/bundles/org.eclipse.equinox.bidi/target/org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar.

This message is produced by org.eclipse.equinox.p2.publisher.eclipse.BundlesAction.basicLoadManifest(File) when it can't read bundle manifest from bundle jar. I confirmed that bundle manifest is present in the attached copy of org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar and the manifest is readable using the code similar to BundlesAction.basicLoadManifest.

Given sporadic nature of the problem I suspect build jvm is running out of some system resources, like open file handles, either because of a resource leak triggered by use of newer version of java or simply because new version of java uses the resource in question more freely.
Comment 53 David Williams CLA 2015-04-24 00:19:36 EDT
(In reply to Igor Fedorenko from comment #52)
> The log has one interesting message right before the exception
> 
>     !ENTRY org.eclipse.equinox.p2.publisher.eclipse 4 0 2015-04-23
> 16:08:24.894
>     !MESSAGE Unable to acquire PluginConverter service during generation
> for:
> /opt/public/eclipse/builds/4I/gitCache/eclipse.platform.releng.aggregator/rt.
> equinox.bundles/bundles/org.eclipse.equinox.bidi/target/org.eclipse.equinox.
> bidi-0.10.0-SNAPSHOT.jar.
> 
> This message is produced by
> org.eclipse.equinox.p2.publisher.eclipse.BundlesAction.
> basicLoadManifest(File) when it can't read bundle manifest from bundle jar.
> I confirmed that bundle manifest is present in the attached copy of
> org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar and the manifest is readable
> using the code similar to BundlesAction.basicLoadManifest.

I was assuming this was some form of 

Bug 437466 - erroneous PluginConverter message caused by a directory in "dropins"

Feasible? I'm looking at the latest failure (not attached) and seems it occurs 3 or so times, before the one that fails. 

I used to have some "resource monitors" ... I'll see if I can find those.
Comment 54 Martin Schreiber CLA 2015-04-24 09:02:03 EDT
(In reply to Igor Fedorenko from comment #52)
> The log has one interesting message right before the exception
> 
>     !ENTRY org.eclipse.equinox.p2.publisher.eclipse 4 0 2015-04-23
> 16:08:24.894
>     !MESSAGE Unable to acquire PluginConverter service during generation
> for:
> /opt/public/eclipse/builds/4I/gitCache/eclipse.platform.releng.aggregator/rt.
> equinox.bundles/bundles/org.eclipse.equinox.bidi/target/org.eclipse.equinox.
> bidi-0.10.0-SNAPSHOT.jar.
> 
> This message is produced by
> org.eclipse.equinox.p2.publisher.eclipse.BundlesAction.
> basicLoadManifest(File) when it can't read bundle manifest from bundle jar.
> I confirmed that bundle manifest is present in the attached copy of
> org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar and the manifest is readable
> using the code similar to BundlesAction.basicLoadManifest.
> 
> Given sporadic nature of the problem I suspect build jvm is running out of
> some system resources, like open file handles, either because of a resource
> leak triggered by use of newer version of java or simply because new version
> of java uses the resource in question more freely.

I can confirm that the IllegalArgumentException seeing in the log of attachment 252699 [details] is related to the warning message "Unable to acquire PluginConverter...".

I ran a debug session during a simple plugin build and added a breakpoint to the BundlesAction.basicLoadManifest method. If I change the manifest stream to null, I do get exactly the same stacktrace as in the attached log. 

If I read the code lines of BundlesAction.basicLoadManifest [1] correctly the manifestStream can be null in 2 cases:

case1: the bundleLocation is not a jar file (line 627); don't think that this is the case here) or

case2: when the jarFile.getEntry(JarFile.MANIFEST_NAME) returns null which could be the case.


[1] http://git.eclipse.org/c/tycho/org.eclipse.tycho.p2-fork.git/tree/bundles/org.eclipse.equinox.p2.publisher.eclipse/src/org/eclipse/equinox/p2/publisher/eclipse/BundlesAction.java?h=tycho-integration#n601
Comment 55 Martin Schreiber CLA 2015-04-24 09:10:12 EDT
Small addition to the last comment:

The javadoc of ZipFile.getEntry says "returns the zip file entry, or null if not found" which might be related to Davids comment #43.
Comment 56 David Williams CLA 2015-04-24 09:24:59 EDT
(In reply to Martin Schreiber from comment #54)
 
> I can confirm that the IllegalArgumentException seeing in the log of
> attachment 252699 [details] is related to the warning message "Unable to
> acquire PluginConverter...".

Sounds like a good piece of data .. do we know, though, what causes the "Unable to
acquire PluginConverter..." message? Is it a "spurious" case, such as described in Bug 437466, or ... and indication that something deeper has gone wrong? (Like, no more handles, or something?) 

[To be explicit, there is no "PluginConverter in our Mars 4.5 code, that is "shipped by default". We still put a version in our repository if you in Tycho (or others) need one? ... but, don't think that's the issue here.]. 


In reply to comment #55, 
I believe that message was quite specific to "unzip" executable (not Java) and then only with the "test" option. I know at one point, the "sign" plugin looks for "nested jars" ... but not sure even then it would use the executable -t approach ... though ... it might?
Comment 57 David Williams CLA 2015-04-24 09:30:33 EDT
(In reply to David Williams from comment #53)

> I used to have some "resource monitors" ... I'll see if I can find those.

I have been running some builds that monitors "handles" of open files. 

And we do some some pretty big hogs: JavaDoc generation and apiDescriptors: 
    Number of handles in process 27763:         605
    Number of handles in process 22352:         233

And these do, for some reason, appear to be "running in parrallel" at times. 

But, the default limit (ulimit -n) is 1024. 

I've increased that to the "max" (for a user) of ulimit -n 2048, and the results are at least promising ... some of the builds succeed when I do that ... but in my tests so far, still one out of three failed. 

Hard to know if that is a "good hint" or "mere coincidence".
Comment 58 Igor Fedorenko CLA 2015-04-24 09:42:07 EDT
(In reply to David Williams from comment #56)
> (In reply to Martin Schreiber from comment #54)
>  
> > I can confirm that the IllegalArgumentException seeing in the log of
> > attachment 252699 [details] is related to the warning message "Unable to
> > acquire PluginConverter...".
> 
> Sounds like a good piece of data .. do we know, though, what causes the
> "Unable to
> acquire PluginConverter..." message? Is it a "spurious" case, such as
> described in Bug 437466, or ... and indication that something deeper has
> gone wrong? (Like, no more handles, or something?) 
> 
> [To be explicit, there is no "PluginConverter in our Mars 4.5 code, that is
> "shipped by default". We still put a version in our repository if you in
> Tycho (or others) need one? ... but, don't think that's the issue here.]. 
> 

The jar in question, org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar, does have valid bundle manifest, so p2 is not expected to need PluginConverter. I believe the only way to explain the message, is the jvm was not able to read the zip entry for whatever reason. Also note that Tycho p2-metadata was successfully invoked number of times for other bundles, which I believe adds plausibility to the 'insufficient resources' hypothesis.
Comment 59 David Williams CLA 2015-04-24 09:50:48 EDT
(In reply to David Williams from comment #57)
> (In reply to David Williams from comment #53)
> 
> > I used to have some "resource monitors" ... I'll see if I can find those.
> 
> I have been running some builds that monitors "handles" of open files. 
> 
> And we do some some pretty big hogs: JavaDoc generation  
>     Number of handles in process 27763:         605

After conferring with the experts (Thanks Markus) I was "mis-reading" part of the output, and this isn't "JavaDoc" related at all ... it is simply the main Java process running Maven! (and, happens to end with a -D parameter that mentions javadoc). 

But, conclusion is the same ... we are certainly "flirting" with the 1024 "open files" limit and 2048 seems better, but did fail once, even with that limit (in my informal testing overnight).
Comment 60 Igor Fedorenko CLA 2015-04-24 09:55:02 EDT
(In reply to David Williams from comment #57)
> (In reply to David Williams from comment #53)
> 
> > I used to have some "resource monitors" ... I'll see if I can find those.
> 
> I have been running some builds that monitors "handles" of open files. 
> 
> And we do some some pretty big hogs: JavaDoc generation and apiDescriptors: 
>     Number of handles in process 27763:         605
>     Number of handles in process 22352:         233
> 
> And these do, for some reason, appear to be "running in parrallel" at times. 
> 
> But, the default limit (ulimit -n) is 1024. 
> 
> I've increased that to the "max" (for a user) of ulimit -n 2048, and the
> results are at least promising ... some of the builds succeed when I do that
> ... but in my tests so far, still one out of three failed. 
> 
> Hard to know if that is a "good hint" or "mere coincidence".

I am not sure ulimit is a good indication of max number of open files on modern linux. For example, even though `ulimit -n` says 1024 in my build.eclipse.org login, I was able to open >8K files before getting "Too many open files".

Also, max open files was just an example. I was not able to reproduce the problem reading bundle manifest last night. My test either loaded bundle manifest successfully or it failed with "too many open files". I think native out-of-heap memory, which is used by Deflater, is more likely the culprit here, but I don't have a good idea how to test this theory yet (and I don't really have time to investigate).
Comment 61 David Williams CLA 2015-04-24 10:18:28 EDT
(In reply to Igor Fedorenko from comment #60)

> I am not sure ulimit is a good indication of max number of open files on
> modern linux. For example, even though `ulimit -n` says 1024 in my
> build.eclipse.org login, I was able to open >8K files before getting "Too
> many open files".
> 
> Also, max open files was just an example. I was not able to reproduce the
> problem reading bundle manifest last night. My test either loaded bundle
> manifest successfully or it failed with "too many open files". I think
> native out-of-heap memory, which is used by Deflater, is more likely the
> culprit here, but I don't have a good idea how to test this theory yet (and
> I don't really have time to investigate).

Thanks Igor. Sorry to hear my old PPC knowledge is out of date. :) 

(And, I just had my second failure, so only 2 out of 4 were successful. 

WEBMASTERS, 
(honest, not yelling, just trying to make it easy for you to see that's directed towards you :) 
1) any recommendations on "monitoring resources"? (I do think we use a lot of heap, too, currently set at 3G, so hate to increase too much more, but does seem like swap is being used!) 
2) Besides *our* processes ... could you monitor processes in the "webservices" signer and/or pack process (especially if done "in same process") -- even though Tycho only uses the webservices, I recall some early experiments where me setting that environment variable (which should have had no effect on Tycho) seemed to increase failure rate. 

[I'll be off-line a few hours, but will restart testing this afternoon -- and weekend, since we are supposed to start our "M7 stabilization builds" on Sunday!] 

Thanks all,
Comment 62 Jan Sievers CLA 2015-04-24 11:09:35 EDT
(In reply to Igor Fedorenko from comment #52)
> This message is produced by
> org.eclipse.equinox.p2.publisher.eclipse.BundlesAction.
> basicLoadManifest(File) when it can't read bundle manifest from bundle jar.
> I confirmed that bundle manifest is present in the attached copy of
> org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar and the manifest is readable
> using the code similar to BundlesAction.basicLoadManifest.

BundlesAction.basicLoadManifest() is using the ZipFile/ZipEntry API (just like pack200 does). To me, this is an indication that it's a different symptom of the same issue.
Comment 63 Igor Fedorenko CLA 2015-04-24 11:35:19 EDT
(In reply to Jan Sievers from comment #62)
> (In reply to Igor Fedorenko from comment #52)
> > This message is produced by
> > org.eclipse.equinox.p2.publisher.eclipse.BundlesAction.
> > basicLoadManifest(File) when it can't read bundle manifest from bundle jar.
> > I confirmed that bundle manifest is present in the attached copy of
> > org.eclipse.equinox.bidi-0.10.0-SNAPSHOT.jar and the manifest is readable
> > using the code similar to BundlesAction.basicLoadManifest.
> 
> BundlesAction.basicLoadManifest() is using the ZipFile/ZipEntry API (just
> like pack200 does). To me, this is an indication that it's a different
> symptom of the same issue.

Yes, this was exactly my point, i.e. both observed problems are caused by the same underlying problem that somehow breaks reading zip/jar files.

Which reminds me. Since we eliminated pack200 forking as the fix, I think we should revert the tycho change. Or at least clearly document the parameter as "for build troubleshooting only, can be removed without prior notice".
Comment 64 Jan Sievers CLA 2015-04-24 11:56:03 EDT
(In reply to Igor Fedorenko from comment #60)
> I am not sure ulimit is a good indication of max number of open files on
> modern linux. For example, even though `ulimit -n` says 1024 in my
> build.eclipse.org login, I was able to open >8K files before getting "Too
> many open files".
> 
> Also, max open files was just an example. I was not able to reproduce the
> problem reading bundle manifest last night. My test either loaded bundle
> manifest successfully or it failed with "too many open files". 

I also tried forcing the problem by limiting max file handles

sysctl -w fs.file-max=<slightly above currently used>

This results in a "too many open files" error, not "MALFORMED" zip entries.
So, does not look like it's a file handle leak problem to me either. The native JVM crash from 

https://hudson.eclipse.org/jubula/job/jubula-nightly/2390/console

also points into another direction.

Maybe it would be good to get the JVM crash files (if available) to try to do some post-mortem based on these.
Comment 65 David Williams CLA 2015-04-24 12:56:31 EDT
When all else fails, search the web :) 

Here's at least one similar entry related to "encoding". 

http://stackoverflow.com/questions/20013091/java-unzip-error-malformed

Are "all the machines involved" using UTF-8? 
(And, if so, does even that need to be specified when a zip file is created?)
Comment 66 Igor Fedorenko CLA 2015-04-24 13:00:42 EDT
(In reply to David Williams from comment #65)
> When all else fails, search the web :) 
> 
> Here's at least one similar entry related to "encoding". 
> 
> http://stackoverflow.com/questions/20013091/java-unzip-error-malformed
> 
> Are "all the machines involved" using UTF-8? 
> (And, if so, does even that need to be specified when a zip file is created?)

"Our" failure is intermittent, if I understand correctly, sometimes it works and sometimes it fails for the same project. I don't believe zip file encoding is an issue here.
Comment 67 David Williams CLA 2015-04-24 13:53:08 EDT
(In reply to Igor Fedorenko from comment #66)
> (In reply to David Williams from comment #65)
> > When all else fails, search the web :) 
> > 
> > Here's at least one similar entry related to "encoding". 
> > 
> > http://stackoverflow.com/questions/20013091/java-unzip-error-malformed
> > 
> > Are "all the machines involved" using UTF-8? 
> > (And, if so, does even that need to be specified when a zip file is created?)
> 
> "Our" failure is intermittent, if I understand correctly, sometimes it works
> and sometimes it fails for the same project. I don't believe zip file
> encoding is an issue here.

True. And, your probably right, but I think both signing and and re-packing (with batch mode) run on "different queues" ... so ... I was just thinking those processes might differ. I've no idea how it's implemented, and I know it's a stretch ... but ... trying to cover all bases. 

Webmasters, if in doubt, you can query with 
locale charmap
and/or set with 
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

And from experience, I know this varies "per user". I remember some early days where my "shell" would report UTF-8, 
but a job started from a cronjob had something else set by default.
Comment 68 David Williams CLA 2015-04-26 21:03:22 EDT
I *think* this is running ok now, or at least better. Over the weekend, I fixed some issues in the CBI's eclipse-jarsigner-plugin. One of them was some cases of JarFile instances not being closed (bug 465458). The other was a changing a static method to an instance method (bug 465482). The unclosed files might be the obvious explanation -- and that was a bug in 1.1.1, but, I worry about the static method too, since even though introduced after 1.1.1, seemed to me things got worse after I switched to try 1.1.2-SNAPSHOT. 

But, after getting that new version deployed to "snapshots", I've not had a single failures yet (from about 5 attempts). That may not seem like many, for an intermittent failure, but considering from Wednesday to Friday, out of about 20 attempts only one or two made it all the way through ... 

We are starting our "milestone stability" period now, meaning we'll be building twice a day till Wednesday ... might through in a few test builds too? 

So, I'd say by then, if no failures, we can claim victory. 

FYI, I've left "fork" set to true, because I did run out of memory once (but has also been playing around with GC settings. And, even with fork set to true, sure seems like we are using  LOT of memory, though, unfortunately, I have not really been keeping track of it. So, not sure if worse that it was with Java 7 and Tycho 0.22.0. In the 'top' line below, note the virtual memory (VIRT), resident memory (RES), SWAP, and page faults (nFLT) ... not to mention the badness score :/ 


  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  SWAP   TIME CODE DATA nFLT Badness COMMAND                                                      
19364 e4Build   20   0 6206m 3.2g  17m S     74 13.8  30:26.11 2.8g  30:26    4 6.0g 3280     127 /shared/common/jdk1.8.0_x64-latest/bin/java -Xms2048m -Xmx4096m
Comment 69 Mickael Istria CLA 2015-04-27 02:33:38 EDT
*** Bug 465527 has been marked as a duplicate of this bug. ***
Comment 70 Mickael Istria CLA 2015-04-27 02:41:27 EDT
It seems to be an issue in the signing service that sometimes provides erroneous output according to the pack200b mojo. Not sure it's related to Tycho then. Should we redirect it to the CBI project ?
Comment 71 David Williams CLA 2015-04-27 03:05:52 EDT
(In reply to Mickael Istria from comment #70)
> It seems to be an issue in the signing service that sometimes provides
> erroneous output according to the pack200b mojo. Not sure it's related to
> Tycho then. Should we redirect it to the CBI project ?

I don't really care where the bug resides, but, I checked a lot of "failed builds", and the "results of signing" looked correct. So, I think it's more complicated than any one component. So far, all we know is some unclosed jars were cleaned up in Tycho, some were cleaned up in jarsigner plugin. And, I'll feel lucky if "that's all it was" ... but, has seemed to improve it, if not fix it. (Could be many other things, server load, etc.). 

If not obvious from my previous post, people experiencing problems should 
1) say which version of tycho and cbi jar-signer they are using, and 
2) try tycho 0.23.0-SNAPSHOT and cbi plugins (jarsigner) 1.1.2-SNAPSHOT and if anyone still sees in this case, that'd be the important ones to know about. (with stack traces, "DEBUG" (-X) output, etc.
Comment 72 Matthias Sohn CLA 2015-04-27 03:16:26 EDT
we also see such random build failures on JGit Hudson since last week:

https://hudson.eclipse.org/jgit/job/jgit.gerrit/6317/console
https://hudson.eclipse.org/jgit/job/jgit.gerrit/6316/console
https://hudson.eclipse.org/jgit/job/jgit.gerrit/6315/console
https://hudson.eclipse.org/jgit/job/jgit.gerrit/6309/console
https://hudson.eclipse.org/jgit/job/jgit.gerrit/6304/console
https://hudson.eclipse.org/jgit/job/jgit2/54/console

we never saw such errors earlier and we didn't touch the build for quite a while

some of the errors show a different cause:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb98ae90dab, pid=1834, tid=140434887821056
#
# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x84dab]  memcpy+0x15b
#
# Core dump written. Default location: /jobs/genie.jgit/jgit2/workspace/core or core.1834 (max size 1 kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /jobs/genie.jgit/jgit2/workspace/hs_err_pid1834.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
[ERROR] Failure: hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
FATAL: hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
org.hudsonci.utils.tasks.OperationFailure: hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
	at org.hudsonci.utils.tasks.PerformOperation.execute(PerformOperation.java:64)
	at org.hudsonci.maven.plugin.builder.MavenBuilder.perform(MavenBuilder.java:169)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:34)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:646)
	at hudson.model.Build$RunnerImpl.build(Build.java:181)
	at hudson.model.Build$RunnerImpl.doRun(Build.java:136)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:434)
	at hudson.model.Run.run(Run.java:1390)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:40)
	at hudson.model.ResourceController.execute(ResourceController.java:82)
	at hudson.model.Executor.run(Executor.java:137)
Caused by: hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.Request.call(Request.java:142)
	at hudson.remoting.Channel.call(Channel.java:643)
	at org.hudsonci.maven.plugin.builder.internal.PerformBuild.doExecute(PerformBuild.java:198)
	at org.hudsonci.utils.tasks.PerformOperation.execute(PerformOperation.java:50)
	... 10 more
Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.Request.abort(Request.java:262)
	at hudson.remoting.Channel.terminate(Channel.java:743)
	at hudson.slaves.Channels$1.terminate(Channels.java:71)
	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1042)
Caused by: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1023)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1017)
Comment 73 Mickael Istria CLA 2015-04-27 03:23:34 EDT
(In reply to David Williams from comment #71)
> I don't really care where the bug resides, but, I checked a lot of "failed
> builds", and the "results of signing" looked correct. So, I think it's more
> complicated than any one component. So far, all we know is some unclosed
> jars were cleaned up in Tycho, some were cleaned up in jarsigner plugin.
> And, I'll feel lucky if "that's all it was" ... but, has seemed to improve
> it, if not fix it. (Could be many other things, server load, etc.). 

Before updating the signing service, all jobs were running fine. Now, everything using the Eclipse.org signing service and building with Tycho seems affected, without any change. So it really seems like a regression in the signing service.

Could it be simply that the signing service returns too early, before the jar is totally processed and that pack200b mojo only gets access to a partial jar at that time?
Comment 74 Jan Sievers CLA 2015-04-27 03:27:44 EDT
(In reply to Matthias Sohn from comment #72)
> we also see such random build failures on JGit Hudson since last week:

please see comment 71
Comment 75 Matthias Sohn CLA 2015-04-27 03:58:09 EDT
(In reply to Mickael Istria from comment #73)
> (In reply to David Williams from comment #71)
> > I don't really care where the bug resides, but, I checked a lot of "failed
> > builds", and the "results of signing" looked correct. So, I think it's more
> > complicated than any one component. So far, all we know is some unclosed
> > jars were cleaned up in Tycho, some were cleaned up in jarsigner plugin.
> > And, I'll feel lucky if "that's all it was" ... but, has seemed to improve
> > it, if not fix it. (Could be many other things, server load, etc.). 
> 
> Before updating the signing service, all jobs were running fine. Now,
> everything using the Eclipse.org signing service and building with Tycho
> seems affected, without any change. So it really seems like a regression in
> the signing service.
> 
> Could it be simply that the signing service returns too early, before the
> jar is totally processed and that pack200b mojo only gets access to a
> partial jar at that time?

this theory seems to be supported by the error
"Caused by: java.io.IOException: Unexpected termination of the channel" we saw in some JGit builds (see comment 72)
Comment 76 Igor Fedorenko CLA 2015-04-27 07:58:15 EDT
(In reply to Matthias Sohn from comment #75)
> 
> this theory seems to be supported by the error
> "Caused by: java.io.IOException: Unexpected termination of the channel" we
> saw in some JGit builds (see comment 72)

I believe this exception simply means Hudson could not maintain dialog with the crashed jvm, it does not imply anything about the reasons the jvm crashed.

I also agree with David that open JarFile instance leak is the most plausible explanation for the platform build failure we have so far.
Comment 77 Jan Sievers CLA 2015-04-27 08:32:57 EDT
(In reply to David Williams from comment #68)
> I *think* this is running ok now, or at least better. Over the weekend, I
> fixed some issues in the CBI's eclipse-jarsigner-plugin. 

just found one more suspect: bug 465570
Comment 78 Marc-André Laperle CLA 2015-04-28 17:36:55 EDT
The CDT build was failing pretty consistently but now if works after I upgraded jarsigner to 1.1.2-SNAPSHOT. Tycho 0.22.0 is still used. So the recent changes to the jarsigner seem to have helped.
Comment 79 Dani Megert CLA 2015-04-29 10:55:53 EDT
Sorry, I didn't read all comments.

Can someone answer the following questions for me:
- Why did it start to fail?
- Do we now know what the problem is/was?
- Where are we with a fix?
- What can be done to avoid this in the future? See bug 465809.

Thanks, and sorry again for not reading all 78 comments.
Comment 80 David Williams CLA 2015-04-29 14:53:32 EDT
(In reply to Dani Megert from comment #79)
> Sorry, I didn't read all comments.
> 
> Can someone answer the following questions for me:
> - Why did it start to fail?

I think when the "batch signing service" was updated. (see Bug 463510, for 105 comments :). 

> - Do we now know what the problem is/was?

I do not think we do, exactly. One "fix" that helped was to make sure all "jar files" were closed, in the cbi jarsigner maven plugin. (bug 465458). But, those bugs existed for a long time, so it's unclear how the "batch signer" change, would effect Tycho Maven builds (which do not use the batch signer). (But, was explained in one of these bugs, there is a part of the system they have in common). 

> - Where are we with a fix? 

Besides CDT, I have heard that WTP too ran into this issue and changing to 1.1.2-SNAPSHOT fixed it for them too. So ... I'd say "fixed for practical purposes". But, hard to know if fixed completely, with intermittent failures, unless root cause is observed and can be recreated. And, I think I'm not alone, is thinking it is the part of the system we can not see, that might be interacting with all the pieces, where root cause is. Perhaps a "running out of resources or memory" issue? Perhaps a concurrency issue? All made worse my leaving file handles open. 

> - What can be done to avoid this in the future? See bug 465809.

Test early, test often. :) 

I'll mention two other, minor, almost off topic items: 

We do need to be sure that *everywhere* we use proper encodings settings for working with zipfiles (UTF-8). This is automatic is working with "JarFile" class, but if someone happens to use ZipFile, on a jar file, they need to be sure to specify UTF-8). I only emphasize this one, because I've confirmed, on build machine anyway, that the "default platform encoding" used by, say, a cronjob, is ANSI_X3.4-1968, not UTF-8, which is what it is (for most of us?) if we login with a system shell. I did not check any "Hudson server settings" ... but some of the "signing infrastructure" is behind closed doors, as it should be. This doesn't really account for the apparent "random" nature of this bug, but, I wouldn't rule it out completely. (Simply because I do not know, a lot of details about the back end, implementation ... such as if queues are on different machines or if those queues could have different "environment variables" ?) Just seemed like a good idea to remind everyone of the importance of charset, and not assume that "since on Linux, must be UTF-8". 

Second, in making even further improvements to the cbi plugins Mikael Barbero (bug 465686) used the Java 7 "try-with-resources" construct, which caused me to read up  on it :) [1] Besides being syntactic candy, and an easy way to make sure resources are closed if an exception occurs, it also will also not allow one exception to be "hidden" (suppressed) by another exception that occurs in a finally block. (But, must use Throwable.getSuppressed method to get at them).
I doubt that this "MALFORMED" exception is hiding an original, earlier exception, but you never know, unless you have studied the code. So, just take this as interesting "tip of the day" :) 

[1] https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html

But, bottom line, is I think this can be closed as fixed, with cbi 1.1.2-SNAPSHOT, until we get a report of it occurring even with that snapshot. 

If anyone disagrees with me, please re-open, explaining. 

P.S. We've not talked about when to *release* CBI 1.1.2 -- have been assuming it would be the same time as Tycho 0.23.0, but I do not know their schedule, and for a critical fix such as this, should probably release before Mars RCs.
Comment 81 Igor Fedorenko CLA 2015-04-29 15:46:39 EDT
Did the problem start when the build moved to java 7+ or it also happens for projects still using java 6?

New version of jarsigner will work with older versions of Tycho. I do not believe it is necessary to couple jarsigner plugin release with Tycho 0.23. Not a big deal, but you may want to increase jarsigner plugin version from 1.1.x to 1.2.x to signify new java version requirement.
Comment 82 David Williams CLA 2015-04-29 21:33:54 EDT
(In reply to Igor Fedorenko from comment #81)
> Did the problem start when the build moved to java 7+ or it also happens for
> projects still using java 6?

The problem with the batch signing process started when people started moving their builds to Java 7+, since the -repack was occurring with Java 6, and they'd (try) to finish with pack200 in their build with Java 7 (that's the topic of bug 463510). But I do not know of anyone doing Tycho builds with Java 6 ... and, not that I would know ... but, sort of doubt many would be, if any. 

> New version of jarsigner will work with older versions of Tycho. I do not
> believe it is necessary to couple jarsigner plugin release with Tycho 0.23.

Well ... the jarsigner plugin does depend on Tycho, and in the past, once, when a new version of Tycho came out, CBI was broken due to some incompatible change, based on CBI using a non-API from Tycho. Are you saying you've reviewed the code, and jarsigner now uses well established Tycho APIs? If so, that's good news. Thank you for that review. 

> Not a big deal, but you may want to increase jarsigner plugin version from
> 1.1.x to 1.2.x to signify new java version requirement.

Good point. This got me thinking and looking at this in detail, to the point I opened CBI bug 465866 to resurrect some history. (In short, it was changed to 1.7 after 1.1.1 came out, but, I can't really see the reason why it was changed). 

Do you happen to know, are there projects at Eclipse.org using Java 6 to do Tycho/Maven builds? 

= = = = = = = = = = = = = = = = = = = = = = = = = = = =

Back to the topic of "is this fixed": 

by coincidence, I saw bug 465430 comment 13 which speaks to the importance of using 
<fork>true</fork> 
in the "packa" and "packb" plugin configuration. (From one of the many comments buried in this bug.)  In that bug, from WTP, they are still using Tycho 0.22.0 (or, older?) so the memory issues they were encountering, after moving to CBI 1.1.2-SNAPSHOT, *might* be improved in Tycho 0.23.0-SNAPSHOT (since Tycho did some jar closing cleanup too) ... but, wanted to mention it here, in closing, in case others are trying to "work around" one problem, only to hit another, it may require both cbi snapahot, plus adding <fork>. 

Just (trying) to be helpful! :) 

(The memory leaks are suspected to be in the VM itself, but I don't think anyone knows for sure or has done any profiling -- again, AFAIK.)
Comment 83 Andreas Sewe CLA 2015-04-30 03:25:17 EDT
(In reply to David Williams from comment #82)
> (In reply to Igor Fedorenko from comment #81)
> > Did the problem start when the build moved to java 7+ or it also happens for
> > projects still using java 6?
> 
> The problem with the batch signing process started when people started
> moving their builds to Java 7+, since the -repack was occurring with Java 6,
> and they'd (try) to finish with pack200 in their build with Java 7 (that's
> the topic of bug 463510). But I do not know of anyone doing Tycho builds
> with Java 6 ... and, not that I would know ... but, sort of doubt many would
> be, if any. 

We (Code Recommenders) are using Java 7 in our builds, because that's what Hudson picked. AFAIK, we have been doing so for quite some time and using Tycho's toolchain feature to compile and test against JavaSE-1.6.

I wonder whether tycho-pack200-plugin would do good to implement toolchain support as well. This would give projects explicit control over which version of pack200 to run, even if the "mvn" process itself runs with, say, a Java 8 VM because of build performance.

Should I open a feature request with the Tycho project?
Comment 84 Jan Sievers CLA 2015-04-30 03:46:22 EDT
(In reply to Dani Megert from comment #79)

here is my take on the most plausible chain of cause and effect:

> - Why did it start to fail?

Hard to say since the problem was latent and depending on things like GC timing (see below) but most probably surfaced due to use of different JDK.

> - Do we now know what the problem is/was?

as per comment 78, the problem was JarFile handles left open (and only closed by garbage collector) in the client (maven) part of the CBI signing plugin. The hard part was figurng out that the place where eventually the JVM crashed or failed to handle any more ZipFiles (mostly in Tycho code) was not pointing to the root cause somewhere else.

> - Where are we with a fix?

It seems last couple of commits in CBI jarsigner 
http://git.eclipse.org/c/cbi/org.eclipse.cbi.maven.plugins.git/log/ 

(closing streams/JarFiles in finally blocks/try-with-resources) fixed the problem.

> - What can be done to avoid this in the future? See bug 465809.

I think the stream handling/leaking resource programming errors in the jarsigner plugin were easy to spot in the code but hard to analyze once they showed up with cryptic stacktraces in production. IMHO gerrit review could have caught them.
Comment 85 Mickael Istria CLA 2015-04-30 03:51:06 EDT
(In reply to Jan Sievers from comment #84)
> I think the stream handling/leaking resource programming errors in the
> jarsigner plugin were easy to spot in the code but hard to analyze once they
> showed up with cryptic stacktraces in production. IMHO gerrit review could
> have caught them.

Or just using FindBugs plugin in the IDE, or looking at reports SonarQube can generate would have make it more directly visible.
Comment 86 Igor Fedorenko CLA 2015-04-30 07:37:03 EDT
(In reply to Andreas Sewe from comment #83)
> 
> I wonder whether tycho-pack200-plugin would do good to implement toolchain
> support as well. This would give projects explicit control over which
> version of pack200 to run, even if the "mvn" process itself runs with, say,
> a Java 8 VM because of build performance.
> 
> Should I open a feature request with the Tycho project?

What exact problems do you expect to solve by adding toolchain support to tycho pack200 plugin? I honestly think 'fork' parameter is unnecessary complication, which only slows down the build, and toolchains will only make it worse. Unless, of course, there are specific reasons to require particular version of pack200.
Comment 87 Andreas Sewe CLA 2015-04-30 09:01:11 EDT
(In reply to Igor Fedorenko from comment #86)
> (In reply to Andreas Sewe from comment #83)
> > 
> > I wonder whether tycho-pack200-plugin would do good to implement toolchain
> > support as well. This would give projects explicit control over which
> > version of pack200 to run, even if the "mvn" process itself runs with, say,
> > a Java 8 VM because of build performance.
> > 
> > Should I open a feature request with the Tycho project?
> 
> What exact problems do you expect to solve by adding toolchain support to
> tycho pack200 plugin? I honestly think 'fork' parameter is unnecessary
> complication, which only slows down the build, and toolchains will only make
> it worse. Unless, of course, there are specific reasons to require
> particular version of pack200.

I agree that forking slows down the build. From the comments in this thread (e.g., comment #12), however, I was under the impression that controlling the exact version of pack200 and jarsigner may be necessary. *If* that is the case, I'd rather execute the "mvn" process with one (new, fast) JVM and only fork to a older pack200/jarsigner than having to run the entire build using the old JVM.

That being said, whatever cbi-jarsigner-plugin 1.1.2-SNAPSHOT changed, it fixed the issues our Code Recommenders build was experiencing, so the discussion may be moot for now.
Comment 88 David Williams CLA 2015-05-07 11:23:29 EDT
(In reply to David Williams from comment #82)

> Back to the topic of "is this fixed": 

I just wanted to mention, using Tycho 0.23.0-SNAPSHOT, and CBI 1.1.2-SNAPSHOT, we have not seen this original problem after doing many many builds AND I removed the <fork> workaround a week ago or so, and still no problem. I have not profiled or scientifically investigated if there might still be memory leaks, but if there are, I have not seen them in day to day use.
Comment 89 Matthias Sohn CLA 2015-05-07 17:18:32 EDT
We still use Tycho 0.22.0 and just switched to CBI signing plugin 1.1.2-SNAPSHOT and this fixed the problem for us
Comment 90 Andreas Sewe CLA 2015-05-08 02:57:42 EDT
(In reply to Matthias Sohn from comment #89)
> We still use Tycho 0.22.0 and just switched to CBI signing plugin
> 1.1.2-SNAPSHOT and this fixed the problem for us

Same here (Code Recommenders). Have been using 1.1.2-SNAPSHOT with Tycho 0.22.0 without any issues so far.
Comment 91 Jan Sievers CLA 2016-06-10 09:03:02 EDT
*** Bug 494679 has been marked as a duplicate of this bug. ***