Bug 546434 - Reduce disk space on download.eclipse.org
Summary: Reduce disk space on download.eclipse.org
Status: RESOLVED MOVED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Servers
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
Reported: 2019-04-15 10:37 EDT by Denis Roy CLA
Modified: 2021-11-03 14:28 EDT
CC List: 16 users

Attachments
screenshot of baobab at work (296.26 KB, image/png)
2019-12-06 16:26 EST, Jonah Graham CLA
Screenshot (65.02 KB, image/png)
2020-06-25 10:53 EDT, Denis Roy CLA

Description Denis Roy CLA 2019-04-15 10:37:01 EDT
An Eclipse mirror has reported that the disk footprint of mirroring download.eclipse.org is 1.4T.

download.eclipse.org should only be used to store current release builds. Archived builds should be moved to archive.eclipse.org, and nightly and integration builds should be deleted once stale.

See the docs for more info:
https://wiki.eclipse.org/IT_Infrastructure_Doc#Downloads

Below is a table of projects that seem to be consuming more disk space than I'd expect.

_ALL PROJECTS SHOULD CLEAN THEIR AREA, NOT JUST THOSE LISTED BELOW_

14G	4diac	
6.6G	acceleo	
15G	app4mc	
57G	birt	
11G	dirigible	
20G	e4	e4/sdk/drops 2011 -> move to archive.eclipse.org
72G	eclipsescada	
12G	ecp	
9.7G	efxclipse	
9.5G	epsilon	
16G	gemoc	
16G	ice	
25G	jdtls	
9.8M	keyple	
13G	kura	
6.5G	mat	
4.8G	mmt	
160G	modeling	
37G	emf	
36G	gmp	
72G	mdt	
8.0G	tmf	
14G	n4js	
41G	orion	
98G	rcptt	
76G	eclipselink	
38G	sirius	
18G	staging	
398G	epp	
71G	cdt	
110G	orbit	
41G	ptp	
36G	tracecompass	
40G	webtools	Lots of old stuff that can be moved
Comment 1 Dani Megert CLA 2019-04-15 10:53:29 EDT
> 20G	e4	e4/sdk/drops 2011 -> move to archive.eclipse.org
Sravan, please do this for e4. Move all but the latest.
Comment 2 Ed Merks CLA 2019-04-15 11:34:46 EDT
Lots of people and infrastructure rely on the stability of update sites, even for older releases.  Simply moving them is definitely going to break references from builds and from target platforms, which in turn will break those builds and break target platform resolution across the ecosystem.  If a release is to be moved to a different host, it seems important that the existing site be transformed into a composite site that references the new (archived) location to prevent such breakage.  Keep in mind, for example, how many older Eclipse installations will reference older update sites...

So what exactly does "current" release builds mean?  When does a release no longer become current?  Unfortunately a large part of the ecosystem really does use very old releases...
Comment 3 Ed Merks CLA 2019-04-15 11:59:53 EDT
For "160G modeling", I think the number is a sum total of all the modeling subprojects...  It seems to me there are no "modeling" sites other than those of the subprojects.  But there is no such sum-total number for "tools", for example.

Thinking about this some more, i.e., preserving the integrity of the long-established stable update sites, it's really not safe to move a composite because it's likely to have relative references based on the current location.  Of course composites use no significant space, so they really aren't much of a concern.

It's not totally clear that it's even safe for simple repositories when they have a mirror URL in them:

https://wiki.eclipse.org/Equinox/p2/p2.mirrorsURL#Moving_a_repo_to_archive.eclipse.org

The wiki suggests the artifacts would likely still be found if available at the corresponding location on archive.eclipse.org (or that, because the repository is no longer mirrored, the mirror script will return an empty list so that only archive.eclipse.org will be used), but it's not totally clear that this all really works without a problem.

If this really does work, then it should be pretty easy to script the moving of a simple repository to archive.eclipse.org and to also script the replacement composite that is left behind at the established old location...
Comment 4 David Williams CLA 2019-04-15 15:14:04 EDT
(In reply to Denis Roy from comment #0)
> ... 
> and nightly and integration builds should be deleted once stale.
> ...

True enough, but shouldn't these not be mirrored in the first place? 

For many, by the time the "outgoing bytes" from a build get pushed to, what, 50 mirrors, it is already time for another build to start becoming "outgoing out". During that time, perhaps 2 to 20 people used the mirrors to get the content. 

The benefit is not worth the cost, if those numbers are close to valid. 

I know that you do (or used to) filter out the N and I builds from some projects. Perhaps this procedure should be codified.
Comment 5 Ed Merks CLA 2019-04-15 17:27:38 EDT
I'm pretty sure (based on watching how the platforms I builds are resolved in a target platform resolution) that I builds (and N builds) are not mirrored, though they are probably included in the disk usage count...

I know that GMP was not properly cleaning up milestones and has (had?) a monstrous composite with endless outdated repos in it; I'm sure that's included in the sums.  I only noticed it because it takes a very long time to load such a huge composite...
Comment 6 Dani Megert CLA 2019-04-16 03:33:28 EDT
(In reply to Ed Merks from comment #5)
> I'm pretty sure (based on watching how the platforms I builds are resolved
> in a target platform resolution) that I builds (and N builds) are not
> mirrored,
Correct.
Comment 7 Mikaël Barbero CLA 2019-04-16 03:35:24 EDT
FYI, https://wiki.eclipse.org/IT_Infrastructure_Doc#Use_mirror_sites.2Fsee_which_mirrors_are_mirroring_my_files.3F (see the "note" at the bottom of this section with the list of excluded file patterns).
Comment 8 Ed Merks CLA 2019-04-16 03:46:44 EDT
Are these totals taking into account the exclusion patterns? 

How can I compute these numbers for a given folder?  I.e., I can do the following to compute such information for "just the EMF project's downloads (excluding other projects such as CDO, Compare, and so on)", but this total will include nightly and integration builds as well as Javadoc:

emerks@build:/home/data/httpd/download.eclipse.org/modeling/emf> du -h -c -s emf
5.2G    emf
5.2G    total
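A minimal sketch of how a per-folder total could be computed while skipping content that is not mirrored. The exclusion patterns below are hypothetical placeholders (the real list is in the wiki note mentioned in comment 7), and the function sums apparent file sizes rather than allocated disk blocks, so the result may differ slightly from "du".

```python
import fnmatch
import os

# Hypothetical exclusion patterns, for illustration only.
# The authoritative list is the one published in the infrastructure wiki.
EXCLUDE_PATTERNS = ["N20*", "I20*", "*nightly*"]


def is_excluded(name, patterns):
    """Return True if a single file or folder name matches any exclusion pattern."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def mirrored_size(root, patterns=EXCLUDE_PATTERNS):
    """Sum file sizes in bytes under root, skipping excluded names and symlinks."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not is_excluded(d, patterns)]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or is_excluded(name, patterns):
                continue
            total += os.path.getsize(full)
    return total
```

Calling mirrored_size with an empty pattern list gives the unfiltered total, so the two numbers can be compared for the same folder.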
Comment 9 Mikaël Barbero CLA 2019-04-16 03:47:36 EDT
(In reply to Ed Merks from comment #8)
> Are these totals taking into account the exclusion patterns? 

No idea, sorry.
Comment 10 Pierre-Charles David CLA 2019-04-16 04:29:16 EDT
(In reply to Ed Merks from comment #5)
> I'm pretty sure (based on watching how the platforms I builds are resolved
> in a target platform resolution) that I builds (and N builds) are not
> mirrored, though they are probably included in the disk usage count...
> 
> I know that GMP was not properly cleaning up milestones and has (had?) a
> monstrous composite with endless outdated repos in it; I'm sure that's
> included in the sums.  I only noticed it because it takes a very long time
> to load such a huge composite...

I thought I had fixed it once, but apparently there's some subtlety in the legacy releng scripts that I missed. I just checked, and GMF Runtime alone uses 27G, a large part of which can probably be simply removed.

I'll clean this and Sirius (though I can't guarantee this will be done this week).
Comment 11 Ed Merks CLA 2019-04-16 05:25:38 EDT
We really need to know how to compute numbers that are reflective of how much space we will actually save for the mirrors.   Of course generally reducing disk space is a good thing, but given much of the action (other than to delete stale/outdated builds/drops) is simply to move the disk space from one host to another, such moving of the disk space seems not all that useful for saving overall resources for the overall set of Eclipse hosts themselves.

Just as a suggestion, perhaps one way to get such accurate information would be to temporarily set up a host that acts as a mirror so that each project can see which of their files are actually copied to a mirror.  From that file system we could simply use "du" to compute numbers that would definitely be relevant.
Comment 12 Michael Keppler CLA 2019-04-16 06:57:52 EDT
Has anyone ever checked if deduplication of files would improve the situation? I.e. if there are many files with identical content in different update sites, it would be sufficient to store one copy.

However, I'm a Windows guy, so I really don't have a good understanding whether that would require the underlying file system to support this, or if something needs to be done on the update site file level (and whether that would only be effective for new or also for existing update sites).
Comment 13 Quentin Le Menez CLA 2019-04-16 11:04:02 EDT
(In reply to Ed Merks from comment #3)
> For "160G modeling", I think the number is a sum total of all the modeling
> subprojects...  

Papyrus contributes its fair share to that amount (about 60G+) and could be trimmed down significantly. I'll try to tackle this tomorrow during the M1 release.
Comment 14 Denis Roy CLA 2019-04-16 11:04:36 EDT
(In reply to Ed Merks from comment #2)
> Lots of people and infrastructure rely on the stability of update sites even
> for older releases.  Simply moving them is definitely going to break

We can implement stable paths for download.e.o to redirect to archive.e.o if the same path exists. Would this be helpful?

if (download.eclipse.org/some/path/myfile == 404 Not Found) {
  if (file_exists(archive.eclipse.org/some/path/myfile)) {
    send_302_redirect(archive.eclipse.org/some/path/myfile);
  } else {
    send_404();
  }
}
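The fallback logic above can be sketched as a small, testable function. This is only an illustration of the decision, not the actual 404 handler: the two sets stand in for file-existence checks on each host, and the function name and the archive base URL are assumptions.

```python
def resolve(path, download_files, archive_files,
            archive_base="https://archive.eclipse.org"):
    """Return (status, location) for a request made to download.eclipse.org.

    download_files and archive_files are hypothetical stand-ins for
    file-existence checks on the two hosts.
    """
    if path in download_files:
        return 200, None  # served directly, no redirect needed
    if path in archive_files:
        return 302, archive_base + path  # same path exists on the archive host
    return 404, None  # not found on either host
```

Because the redirect only fires when the download host would otherwise return 404, files that still live on download.eclipse.org are unaffected.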


 
> So what exactly does "current" release builds mean?  When does a release no
> longer become current?  Unfortunately a large part of the ecosystem really
> does use very old releases...

I don't have specific guidelines here. 2011 is not current. Eclipse Neon is not current. Lots of people use Windows 7, but it is not current.


(In reply to Ed Merks from comment #3)
> For "160G modeling", I think the number is a sum total of all the modeling
> subprojects...  

Correct, I should have removed it.

> The wiki suggests the artifacts likely would still be found if available at
> the corresponding location on archive.eclipse.org (or that because it's not
> mirrored anymore the mirror script will return an empty list so that only
> archive.eclipse.org will be used) but it's not totally clear that this
> all really works without a problem.

It works if you use the mirrors explicitly. It does not work for direct links to download.eclipse.org but, as above, we can make it work transparently.

(In reply to Ed Merks from comment #8)
> Are these totals taking into account the exclusion patterns?

They do not. The Eclipse Foundation still needs to maintain backups of download.e.o and regardless of mirror footprint, stale files are still costly to maintain.

(In reply to Ed Merks from comment #11)
> Just as a suggestion perhaps one way to get such accurate information would
> be to temporarily set up a host that acts as a mirror

I'll try to put together a size report based on
rsync -a --list-only rsync://rsync.osuosl.org/eclipse/
Comment 15 Denis Roy CLA 2019-04-16 13:31:39 EDT
This one-liner will create a full directory structure of sparse files based on what is on the OSUOSL mirror. It's much cheaper than creating a mirror, takes a fraction of the space and allows disk space calculations in the same manner.

rsync -a --list-only rsync://rsync.osuosl.org/eclipse/ | awk '/^-r/ {gsub(",","",$2); print $2 " " $5}' | while read -r size file ; do echo "Create file: $file size: $size bytes"; mkdir -p "$(dirname "$file")"; truncate -s "$size" "$file"; done

I'll make this available shortly.
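If only the totals are needed, the same listing can be summarized without creating any files. The sketch below assumes the listing format implied by the one-liner above (permissions, size with thousands separators, date, time, path); the function names are illustrative.

```python
from collections import defaultdict


def parse_listing_line(line):
    """Parse one `rsync --list-only` line; return (path, size) for regular files, else None."""
    if not line.startswith("-"):
        return None  # directories ("d"), symlinks ("l") and blank lines are skipped
    fields = line.split(None, 4)  # permissions, size, date, time, path
    if len(fields) < 5:
        return None
    size = int(fields[1].replace(",", ""))
    return fields[4].rstrip("\n"), size


def size_by_top_level(lines):
    """Sum file sizes in bytes per top-level folder of the listing."""
    totals = defaultdict(int)
    for line in lines:
        parsed = parse_listing_line(line)
        if parsed is None:
            continue
        path, size = parsed
        totals[path.split("/", 1)[0]] += size
    return dict(totals)
```

Splitting on at most four separators keeps file names that contain spaces intact, which the awk version does not.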
Comment 16 Denis Roy CLA 2019-04-16 14:48:05 EDT
(In reply to Ed Merks from comment #2)
> Lots of people and infrastructure rely on the stability of update sites even
> for older releases.  Simply moving them is definitely going to break

I've made a small change to our 404 handler, for file requests only.

BEFORE

wget -S https://download.eclipse.org/modeling/OLD/birt-repo-3.7.2.v20120207.zip
HTTP request sent, awaiting response... 
  HTTP/1.1 404 Not Found
  Date: Tue, 16 Apr 2019 14:31:17 GMT
  X-NodeID: download1
2019-04-16 10:31:20 ERROR 404: Not Found.



AFTER:

wget -S https://download.eclipse.org/modeling/OLD/birt-repo-3.7.2.v20120207.zip
HTTP request sent, awaiting response... 
  HTTP/1.1 307 Moved Permanently
  Date: Tue, 16 Apr 2019 18:40:43 GMT
  Location: http://archive.eclipse.org/modeling/OLD/birt-repo-3.7.2.v20120207.zip
  X-NodeID: download1
Location: http://archive.eclipse.org/modeling/OLD/birt-repo-3.7.2.v20120207.zip [following]
Connecting to archive.eclipse.org (archive.eclipse.org)|198.41.30.199|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK        
(snip)


Comment 17 Markus Knauer CLA 2019-04-17 05:31:24 EDT
Because EPP is one of the largest consumers of disk space, I've synchronised the important parts (again) to archive.eclipse.org, and removed *many* of the older packages from download.eclipse.org. This reduces the size below /technology/epp a lot.
Comment 18 Ed Merks CLA 2019-04-17 12:15:37 EDT
I've indexed data available on the mirrors. I specifically used http://ftp.fau.de/eclipse/ as the data source.

The index is available here:

https://download.eclipse.org/oomph/archive/eclipse

This structure mirrors the folder structure found on the mirrors, but includes the cumulative sizes for each folder and sorts folder according to size, largest first.  It also shows the % of space used by that folder relative to the other sibling folders as well as the date of the folder. 

The header of each page shows the total size of the parent folder and its % usage relative to the total size of the mirror.  The link on the header is navigable; it opens a mirror page that allows you to navigate to the actual folders hosted by an actual mirror (so you could look at what files are actually in the folders).
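The kind of report described above can be sketched as follows. This is not the code behind the index; it only illustrates cumulative folder sizes and sibling percentages, assuming the input is a list of (relative path, size in bytes) pairs.

```python
from collections import defaultdict
import posixpath


def cumulative_sizes(files):
    """Map every folder ('' is the root) to the total bytes of all files beneath it."""
    totals = defaultdict(int)
    for path, size in files:
        folder = posixpath.dirname(path.lstrip("/"))
        while True:
            totals[folder] += size
            if folder == "":
                break
            folder = posixpath.dirname(folder)
    return dict(totals)


def children_report(totals, parent):
    """Return (folder, bytes, percent of sibling total) rows, largest first."""
    rows = [(folder, size) for folder, size in totals.items()
            if folder != "" and posixpath.dirname(folder) == parent]
    rows.sort(key=lambda row: row[1], reverse=True)
    sibling_total = sum(size for _, size in rows)
    return [(folder, size,
             round(100.0 * size / sibling_total, 2) if sibling_total else 0.0)
            for folder, size in rows]
```

Calling children_report(totals, "") lists the top-level folders, and passing a folder such as "modeling" drills down one level.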

The EPP changes have already reduced the total mirror size to 895G, so that's a big improvement.

Papyrus stands out as a heavy hitter

https://download.eclipse.org/oomph/archive/eclipse/modeling/mdt/papyrus/index.html

But that will apparently be addressed.

I don't see so many folks on the CC list so I don't think many people are paying attention.

Perhaps the additional useful details would be helpful.
Comment 19 Denis Roy CLA 2019-04-17 12:49:28 EDT
Darn, Ed, that is really cool.

I've also filed bug 546528 - with very little code and effort, I think we can make managing downloads and archives much, much easier.
Comment 20 Ed Willink CLA 2019-04-17 15:30:17 EDT
Is it possible to accommodate not-yet-decontainerized projects? e.g ocl and qvtd are both "0" probably because I have requested the root downloads but have yet to pluck up courage to actually move from e.g. modeling/mdt/ocl.
Comment 21 Pierre-Charles David CLA 2019-04-18 03:26:04 EDT
(In reply to Pierre-Charles David from comment #10)
> (In reply to Ed Merks from comment #5)
> > I'm pretty sure (based on watching how the platforms I builds are resolved
> > in a target platform resolution) that I builds (and N builds) are not
> > mirrored, though they are probably included in the disk usage count...
> > 
> > I know that GMP was not properly cleaning up milestones and has (had?) a
> > monstrous composite with endless outdated repos in it; I'm sure that's
> > included in the sums.  I only noticed it because it takes a very long time
> > to load such a huge composite...
> 
> I thought I had fixed it once, but apparently there's some subtlety in the
> legacy releng scripts that I missed. I just checked, and GMF Runtime alone
> uses 27G, a large part of which can probably be simply removed.

Done for EMF Services (gained about 2G) and GMF Notation/Runtime (gained almost 27G).

> I'll clean this and Sirius (though I can't guarantee this will be done this
> week).

Partly done, down to 17G, from 38G initially. I should be able to remove at least 10 more, but I'm waiting for feedback on downstream projects which may still depend on milestones before removing them.
Comment 22 Ed Merks CLA 2019-04-18 05:38:14 EDT
Denis, do you know what mirrors do with symbolic links?

For example, CDO uses quite a bit of space, almost 6G or 0.66% of the mirror size, but that is "accounted for" multiple times.

The "real" location of CDO is accounted for like this:

https://download.eclipse.org/oomph/archive/eclipse/modeling/emf/cdo/index.html

I.e., the following folder exists at this location on build.eclipse.org:

/home/data/httpd/download.eclipse.org/modeling/emf/cdo

But this "same" content is also "accounted for" here:

https://download.eclipse.org/oomph/archive/eclipse/modeling/emft/cdo/index.html

That's because /home/data/httpd/download.eclipse.org/modeling/emft/cdo is a symbolic link like this:

lrwxrwxrwx   1 nickb   modeling.emft           10 Feb 18  2009 cdo -> ../emf/cdo

I have a feeling that mirrors actually do duplicate the content because while the following link works to load a p2 repository:

  http://ftp.fau.de/eclipse/modeling/emft/cdo/updates/releases/

This direct download.eclipse.org link does not work:

  http://download.eclipse.org/eclipse/modeling/emft/cdo/updates/releases/

That kind of makes sense, because I've been told that servers won't serve up links.

There are yet more links to make this even worse. I.e., we find CDO again at this location because emft_LNK is a link to ../technology/emft

https://download.eclipse.org/oomph/archive/eclipse/modeling/emft_LNK/cdo/index.html

And then we find it yet again in technology because technology/emft/cdo is a link to modeling/emft/cdo (which, as mentioned above, is a link to modeling/emf/cdo):

https://download.eclipse.org/oomph/archive/eclipse/technology/emft/cdo/index.html

So it seems to me that CDO has 4 copies on every mirror.

I don't know how best to unravel this mess of links. :-(

I'm tempted just to delete all these links; it seems that download.eclipse.org can't serve them and so mirrors should not copy them...  Their only possible use is by builds that are directly accessing the file system.  Or do I miss some other important reason for these links existing?

Please share your understanding of how mirrors handle symbolic links.
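To see how much content is reachable through more than one path, the folder links can be grouped by the real directory they resolve to. This is a sketch for anyone with direct file system access; the function names are illustrative, and it only reports links, it does not remove anything.

```python
import os


def find_symlinked_dirs(root):
    """Yield (link_path, resolved_target) for every directory symlink under root."""
    # With followlinks=False, os.walk lists links to directories in dirnames
    # but does not descend into them.
    for dirpath, dirnames, _ in os.walk(root, followlinks=False):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                yield full, os.path.realpath(full)


def group_by_target(root):
    """Group links by the real directory they resolve to, exposing duplicated content."""
    groups = {}
    for link, target in find_symlinked_dirs(root):
        groups.setdefault(target, []).append(link)
    return groups
```

Chains of links (a link pointing at another link) resolve to the same final target, so every alias of a folder ends up in one group.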
Comment 23 Ed Merks CLA 2019-04-18 06:03:17 EDT
(In reply to Ed Willink from comment #20)
> Is it possible to accommodate not-yet-decontainerized projects? e.g ocl and
> qvtd are both "0" probably because I have requested the root downloads but
> have yet to pluck up courage to actually move from e.g. modeling/mdt/ocl.

What specifically would it entail to "accommodate" that?  The data of course reflects (and should reflect) the actual contents of a mirror, so if you have nothing in those folders at download.eclipse.org, then there is nothing in the mirrors for them...
Comment 24 Denis Roy CLA 2019-04-18 09:13:07 EDT
> Please share your understanding of how mirrors handle symbolic links.

Mirrors likely don't handle symbolic links, and that's why we don't enable their usage on http://download.e.o. However, to be safe, the rsync mechanism that mirrors use probably translates links to real paths, so it's entirely possible that mirrors duplicate data  :/  I was not even aware of this.

I can likely turn that off at our server but I wouldn't want to break anything.  I don't think many Eclipse projects make use of symlinks.


(In reply to Ed Merks from comment #22)
> I'm tempted just to delete all these links; it seems that
> download.eclipse.org can't serve them and so mirrors should not copy them...
> Their only possible use is by builds that are directly accessing the file
> system.  Or do I miss some other important reason for these links existing?

That is my understanding as well.
Comment 25 Ed Willink CLA 2019-04-18 13:37:32 EDT
(In reply to Ed Merks from comment #23)
> (In reply to Ed Willink from comment #20)
> > Is it possible to accommodate not-yet-decontainerized projects? e.g ocl and
> > qvtd are both "0" probably because I have requested the root downloads but
> > have yet to pluck up courage to actually move from e.g. modeling/mdt/ocl.
> 
> What specifically would it entail to "accommodate" that?  The data of course
> reflects (and should reflect) the actual contents of a mirror, so if you
> have nothing in those folders at download.eclipse.org, then there is nothing
> in the mirrors for them...

Request retracted. All the information I wanted is in your hyper-linked report. I naively assumed it was a flat report file.
Comment 26 Sravan Kumar Lakkimsetti CLA 2019-04-22 22:16:25 EDT
(In reply to Dani Megert from comment #1)
> > 20G	e4	e4/sdk/drops 2011 -> move to archive.eclipse.org
> Sravan, please do this for e4. Move all but the latest.

Hi Dani,

I don't have committer rights on the e4 project, so I don't have permission to do this. The folder has group permissions eclipse.e4.

There are two ways to approach this.

1. Add me to committers list
2. Combine e4 project with eclipse platform project.

Since this project is used for tips, I suggest going with option 2.

Thanks
Sravan
Comment 27 Ed Merks CLA 2019-04-23 00:36:05 EDT
Note that I've enhanced the support for producing this page:

https://download.eclipse.org/oomph/archive/eclipse/

There is now a job that rebuilds it once per day:

https://ci.eclipse.org/oomph/job/mirror-index/

The page header now has a breadcrumb for better navigation/summary information, and the mirror page (accessed via any -> icon in the nav bar) shows a nice table:

https://download.eclipse.org/oomph/archive/mirror.php?location=

The only problem I can't work around is the automatic computation of the list of mirrors.  That's because while this URL works for me at home:

http://www.eclipse.org/downloads/download.php?file=/favicon.ico&format=xml

When running on Jenkins, it produces an empty list, i.e., this is in the log:

<?xml version="1.0" encoding="ISO-8859-1"?>
<mirrors></mirrors>
No mirrors found; hard-coded defaults will be used.
 
Denis, is there any way/URL that would return me a list of mirrors while running on Jenkins? (And isn't this a poor choice of encoding, especially given the file contains Chinese characters?)
Comment 28 Ed Willink CLA 2019-04-23 10:03:22 EDT
(In reply to Ed Merks from comment #27)
> There is now a job that rebuilds it once per day:

Thanks.

After cleaning up some of my own projects, some of which have long-established releng practices, I see that some practices are questionable.

Download ZIPs are pruned to the last two years of R-builds, 3 recent S-builds and perhaps a couple of I-builds and N-builds. Older R-builds are moved to archive and linked from the downloads page. Seems good, albeit a bit manual.

P2 repos are not pruned to the same extent, since relevant Wiki authors seem to have neglected to advocate P2 repo archiving. The release aggregate therefore has every release ever, costing mirror space and useless content scanning time. The milestone repo grows and grows unless some enthusiastic releng manually removes both composite entry and content consistently.

Taking EMF as a typical example of a long established project...

https://download.eclipse.org/oomph/archive/eclipse/modeling/emf/emf/updates/index.html#releases

has 15 release versions from 2.6 to 2.14. Since policies/tooling evolve, earlier and later releases are somewhere else.

Surely we should try to have just the last ?5 years of P2 repo releases in one place, with all older P2 repos moved to archive without aggregation from the main releases aggregate? Perhaps a separate archive aggregate might point at them, perhaps just a Wiki/PMI page. Perhaps for really old releases, users can be told that the archive ZIPs are the only option. Why waste archiving space on almost identical P2 repos and ZIPs?
Comment 29 Denis Roy CLA 2019-04-23 15:53:23 EDT
> Denis, is there any way/URL that would return me a list of mirrors while
> running on Jenkins? 

It's designed to not return mirrors for internal (to us) hosts.
Comment 30 Ed Merks CLA 2019-04-24 01:19:23 EDT
(In reply to Denis Roy from comment #29)
> > Denis, is there any way/URL that would return me a list of mirrors while
> > running on Jenkins? 
> 
> It's designed to not return mirrors for internal (to us) hosts.

Is there perhaps some file in the file system that contains this same information? The PHP script must compute it from something...  Though I suppose the list of mirrors doesn't often change so I'm being overly picky...
Comment 31 Ed Merks CLA 2019-04-24 01:48:56 EDT
(In reply to Ed Willink from comment #28)
> 
> After cleaning up some of my own projects, some of which have long
> established releng practices, I see that some practices are questionable.
> 

Yes, when I migrated EMF to Tycho I was not very happy with the old structure under modeling/emf/emf/downloads and modeling/emf/emf/updates, so I replaced it all with modeling/emf/emf/builds.

> Download ZIPs are pruned to the last two years of R-builds, 3 recent
> S-builds and perhaps a couple of I-builds and N-builds. Older R-builds are
> moved to archive and linked from the downloads page. Seems good, albeit a
> bit manual.
> 

Nightly builds and integration builds are generally excluded from the mirrors, but that depends on the naming pattern used.  The EMF build job automatically removes stale builds, i.e., it keeps at most 5 N builds, and all "stale" milestone builds are removed as soon as there is a milestone build with an incremented version.  So it's all completely automatic.
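A retention policy like the one described can be sketched in a few lines. This is not the EMF build job; the naming scheme (nightly names that start with "N" and sort chronologically) and the (version tuple, name) representation of milestones are assumptions made for illustration.

```python
def stale_nightlies(build_names, keep=5):
    """Return nightly builds to delete, keeping the `keep` newest.

    Assumes (hypothetically) names like 'N201904160400' that sort chronologically.
    """
    nightlies = sorted(name for name in build_names if name.startswith("N"))
    return nightlies[:-keep] if keep > 0 else nightlies


def stale_milestones(milestones):
    """Given (version_tuple, name) pairs, return names whose version is below the newest."""
    if not milestones:
        return []
    newest = max(version for version, _ in milestones)
    return [name for version, name in milestones if version < newest]
```

Both functions only compute what would be removed, so the result can be reviewed or logged before anything is deleted.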

I will not move/remove releases at this time; modeling/emf/emf/builds uses 0.07% of the mirror space, so it's not exactly compelling to spend time on this.


> P2 repos are not pruned to the same extent, since relevant Wiki authors seem
> to have neglected to advocate P2 repo archiving. The release aggregate
> therefore has every release ever, costing mirror space and useless content
> scanning time. The milestone repo grows and grows unless some enthusiastic
> releng manually removes both composite entry and content consistently.
> 

Yes, that is why I provide a "latest" child and have asked on cross-projects for others to do the same; it's pointless to process large composites when one generally (almost inevitably) ends up resolving to the last version anyway.


> Taking EMF as a typical example of a long established project...
> 
> https://download.eclipse.org/oomph/archive/eclipse/modeling/emf/emf/updates/
> index.html#releases
> 
> has 15 release versions from 2.6 to 2.14. Since policies/tooling evolve,
> earlier and later releases are somewhere else.
> 

Yes, I'd like to move all stuff under updates and downloads to archive.eclipse.org, but this would have a 0.16% impact so also not the most compelling activity.

> Surely we should try to have just the last ?5 years of P2 repo releases in
> one place, with all older P2 repos moved to archive without aggregation from
> the main releases aggregate? Perhaps a separate archive aggregate might
> point at them, perhaps just a Wiki/PMI page. Perhaps for really old
> releases, users can be told that the archive ZIPs are the only option. Why
> waste archiving space on almost identical P2 repos and ZIPs?

Overall modeling/emf/emf uses .27% of the space, so there are definitely *many* projects that could invest time to have a more significant impact. But already we see the mirror size reduced from 1.1T to close to 800G...

In principle, the following "automatic" process should work for a folder such as /modeling/emf/emf/updates/ and probably more generally:

1. Copy the entire folder to the corresponding file location in archive.eclipse.org, preserving the path structure.
2. Delete all files in all folders of the original folder tree (but preserve the folder structure).
3. For each folder which is/was a p2 repository in the original folder tree, replace it with a p2 composite that references http(s?)://archive.eclipse.org/<corresponding-archived-folder-copy>.
4. Finally, prune empty folders.
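The composite-replacement step could be sketched as below. It assumes the commonly documented p2 composite layout (a compositeContent.xml and a compositeArtifacts.xml with a single child), and it should be checked against the p2 documentation before use. The function names, the default https scheme for the archive host, and the repository name are illustrative assumptions.

```python
import time
from xml.sax.saxutils import quoteattr

# File name -> (processing instruction, repository type), per the usual p2 composite layout.
_COMPOSITE_TYPES = {
    "compositeContent.xml": (
        "compositeMetadataRepository",
        "org.eclipse.equinox.internal.p2.metadata.repository.CompositeMetadataRepository",
    ),
    "compositeArtifacts.xml": (
        "compositeArtifactRepository",
        "org.eclipse.equinox.internal.p2.artifact.repository.CompositeArtifactRepository",
    ),
}


def archive_url(repo_path, archive_root="https://archive.eclipse.org"):
    """Build the archived location for a repository path (scheme is an assumption)."""
    return archive_root.rstrip("/") + "/" + repo_path.strip("/") + "/"


def composite_files(repo_path, name, timestamp_ms=None):
    """Return {filename: xml_text} for a composite whose only child is the archived copy."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    child = archive_url(repo_path)
    files = {}
    for filename, (instruction, repo_type) in _COMPOSITE_TYPES.items():
        files[filename] = (
            "<?xml version='1.0' encoding='UTF-8'?>\n"
            f"<?{instruction} version='1.0.0'?>\n"
            f"<repository name={quoteattr(name)} type={quoteattr(repo_type)} version='1.0.0'>\n"
            "  <properties size='1'>\n"
            f"    <property name='p2.timestamp' value='{timestamp_ms}'/>\n"
            "  </properties>\n"
            "  <children size='1'>\n"
            f"    <child location={quoteattr(child)}/>\n"
            "  </children>\n"
            "</repository>\n"
        )
    return files
```

Writing the two returned files into the emptied folder on download.eclipse.org would keep the established repository URL resolvable while the content lives on the archive host.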

This way all the older established URLs for p2 repositories continue to work and if what Denis suggests is working properly, mirror URLs in the copied/archived repos do not need to change because archive.eclipse.org will act as a mirror automatically.  And if everyone was well-behaved, links would all be using download.php and also would continue to work properly (according to Denis' suggestion):

https://www.eclipse.org/downloads/download.php?file=/modeling/emf/emf/builds/release/2.17/EMF-Updates-2.17.zip

When I have some spare time, I will experiment with this and test that it actually works. But I will not do something like this manually.  It's too time consuming and too error prone!
Comment 32 Ed Willink CLA 2019-04-24 04:10:36 EDT
(In reply to Ed Merks from comment #31)
> When I have some spare time, I will experiment with this and test that it
> actually works. But I will not do something like this manually.  It's too
> time consuming and too error prone!

It would be great to have something automatic that we could all share since from the moment we have built and tested a P2 repo I think many projects' requirements are identical but independently and often manually implemented. I regularly raise bugs in regard to bad download maintenance.

The new EMF Updates page is a huge improvement on its predecessor and a few initial limitations have now vanished. (The traditional alias name such as emf-xsd-Update-2.12.0M6.zip is perhaps the main regression. ?? also pre-release hiding ??)

An integration of the EMF Updates page with archiving and the PMI would definitely prompt me to rip-off the technology now that bit rot has set into the PHP underlying the old modeling downloads pages. Bug 534467.
Comment 33 Denis Roy CLA 2019-04-24 09:41:18 EDT
> > It's designed to not return mirrors for internal (to us) hosts.
> 
> Is there perhaps some file in the file system that contains this same
> information? The PHP script must compute it from something...  Though I
> suppose the list of mirrors doesn't often change so I'm being overly picky...

The mirrors are stored in a database and the list is dynamic based on the GeoIP lookup of the caller. It's specifically designed to not give a mirror list to callers on our LAN as downloading from mirrors wouldn't make sense.  I'm trying to think of ways this could work for you without adding a kludge to the code.
Comment 34 Jonah Graham CLA 2019-12-05 15:08:48 EST
I plan to do this long overdue cleanup for CDT in the coming days. I will be sending an email to cdt-dev as a heads up.
Comment 35 Denis Roy CLA 2019-12-05 15:28:58 EST
Thanks, Jonah.

I've recently seen another ping about this on the mirrors mailing list, so expect me to start yelling about this on multiple channels.


*****************
  A forewarning  
*****************

We will -- eventually -- be implementing CBI disk quotas, just as we've implemented CPU and memory quotas, because we do not have enough resources for unlimited storage and free-for-all disk space, and neither do our mirrors. 


As a reminder, Ed's disk space browser tool is here:

https://download.eclipse.org/oomph/archive/eclipse/



Many thanks to all the projects that perform regular housecleaning. Your work is appreciated. If you need help doing this maintenance, please file a separate bug.
Comment 36 Greg Watson CLA 2019-12-06 15:42:31 EST
Since we no longer have the ability to 'cd' or even find the size of a directory using 'df', please provide information on how you expect us to do this? Or are you planning to provide some web based tools that enable us to do this?
Comment 37 Greg Watson CLA 2019-12-06 15:43:55 EST
(In reply to Greg Watson from comment #36)
> Since we no longer have the ability to 'cd' or even find the size of a
> directory using 'df', please provide information on how you expect us to do
> this? Or are you planning to provide some web based tools that enable us to
> do this?

I should have read the previous post!
Comment 38 Ed Merks CLA 2019-12-06 15:56:39 EST
(In reply to Greg Watson from comment #37)
> (In reply to Greg Watson from comment #36)
> > Since we no longer have the ability to 'cd' or even find the size of a
> > directory using 'df', please provide information on how you expect us to do
> > this? Or are you planning to provide some web based tools that enable us to
> > do this?
> 
> I should have read the previous post!

The report is definitely useful for identifying where you can save space, but the ability to manage the files on disk with the badly crippled set of tools available is definitely a problem.
Comment 39 Greg Watson CLA 2019-12-06 16:02:57 EST
(In reply to Ed Merks from comment #38)

> 
> The report is definitely useful for identifying where you can save space,
> but the ability to manage the files on disk with the badly crippled set of
> tools available is definitely a problem.

Agreed. Someone will need to figure out how to move directories to archive.eclipse.org with the restricted shell before this can be done.
Comment 40 Jonah Graham CLA 2019-12-06 16:17:54 EST
(In reply to Greg Watson from comment #39)
> (In reply to Ed Merks from comment #38)
> 
> > 
> > The report is definitely useful for identifying where you can save space,
> > but the ability to manage the files on disk with the badly crippled set of
> > tools available is definitely a problem.
> 
> Agreed. Someone will need to figure out how to move directories to
> archive.eclipse.org with the restricted shell before this can be done.

You can move directories just fine with the restricted shell:

ssh <user>@build.eclipse.org
$ mv /home/data/httpd/download.eclipse.org/tools/cdt/dir1 /home/data/httpd/archive.eclipse.org/tools/cdt/
Comment 41 Jonah Graham CLA 2019-12-06 16:20:18 EST
(In reply to Jonah Graham from comment #40)
> You can move directories just fine with the restricted shell:

I should have said *I* can. But I assume that I am not privileged in this way.

I also find it useful to browse with my file manager to sftp://jograham@build.eclipse.org/home/data/httpd/

baobab, on Linux, can even make pretty charts if Ed's tool (https://download.eclipse.org/oomph/archive/eclipse/) does not work.
Comment 42 Jonah Graham CLA 2019-12-06 16:26:59 EST
Created attachment 280901 [details]
screenshot of baobab at work

It is fairly fast to use baobab - but I don't know about using it on very large directories. This 25GB (which is most of CDT's mirrored downloads) took less than a minute.
Comment 43 Ed Willink CLA 2019-12-06 17:13:43 EST
(In reply to Jonah Graham from comment #40)
> You can move directories just fine with the restricted shell:

or you use a 'shell' Jenkins job to do each of e.g. 

ssh genie.qvt-oml@projects-storage.eclipse.org cd /home/data/httpd/download.eclipse.org/mmt/qvto/updates/releases ; ant -f /shared/modeling/tools/promotion/manage-composite.xml remove -Dchild.repository=3.4.0

ssh genie.qvt-oml@projects-storage.eclipse.org cd /home/data/httpd/download.eclipse.org/mmt/qvto/updates/releases ; mv 3.4.0 /home/data/httpd/archive.eclipse.org/mmt/qvto/updates/releases

ssh genie.qvt-oml@projects-storage.eclipse.org cd /home/data/httpd/archive.eclipse.org/mmt/qvto/updates/releases ; ant -f /shared/modeling/tools/promotion/manage-composite.xml add -Dchild.repository=3.4.0

Another option in /shared/modeling/tools/promotion/manage-composite.xml would be good.
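
A note on quoting, with a minimal sketch: as written above, an unquoted ';' is interpreted by the local shell, so only the 'cd' would run on the remote host. Passing the remote commands to ssh as one quoted argument avoids that. The helper name remote_archive and the SSH override variable are hypothetical, and the composite add/remove steps are left out.

```shell
#!/bin/sh
# Hypothetical helper: move one child folder from download to archive
# over ssh, with the remote commands passed as a single quoted argument.
remote_archive() {
    host=$1   # e.g. genie.<project>@projects-storage.eclipse.org
    rel=$2    # path relative to the download/archive roots
    child=$3  # folder to move, e.g. 3.4.0
    src=/home/data/httpd/download.eclipse.org/$rel
    dst=/home/data/httpd/archive.eclipse.org/$rel
    # SSH is an assumed override so the call can be dry-run with SSH=echo.
    # '&&' stops the mv if the cd fails; the trailing '/' makes mv fail
    # when the target directory does not exist on archive.
    ${SSH:-ssh} "$host" "cd '$src' && mv '$child' '$dst/'"
}
```

For example: remote_archive genie.qvt-oml@projects-storage.eclipse.org mmt/qvto/updates/releases 3.4.0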
Comment 44 Ed Willink CLA 2019-12-06 17:16:28 EST
(In reply to Jonah Graham from comment #41)
> I also find it useful to browse with my file manager to
> sftp://jograham@build.eclipse.org/home/data/httpd/

Provided projects have not put in a custom index.html, as was once a good idea, the EF's default 404 page is now quite good. You can happily browse in your favourite browser. (I steadily raise Bugzillas against projects that have an inferior index.html.)
Comment 45 Michael Keppler CLA 2019-12-09 05:26:58 EST
Is it actually useful to have milestones and release candidates available for eternity on the download server? Looking at the report from Eike it looks like every release version has between 4 and 7 sub directories due to the RCs and Ms. To my mind those should be removed the moment the final release becomes available.

Is there any document describing how to retire RCs and Ms?
Comment 46 Eike Stepper CLA 2019-12-09 05:39:03 EST
I think the process is as easy as "rm -rf" ;-)

Of course a project should publicly document the retention policies for their different build types (I, M, S, R) and for M/S builds it should probably be something like "are kept here until the next release".
Comment 47 Eike Stepper CLA 2019-12-09 05:41:54 EST
... and if M/S builds are also offered in a composite repo, it would be great if that was primed with the latest release build until the first M/S build of the subsequent release shows up.
Comment 48 Ed Merks CLA 2019-12-09 06:16:24 EST
(In reply to Michael Keppler from comment #45)
> Is it actually useful to have milestones and release candidates available
> for eternity on the download server? 

No, these should be removed at some point after the release. Of course updating composites that reference them...

Overeager removal is not so great because it's possible and perhaps even likely that your downstream consumers are using your integration builds in their builds, so it would be nasty to potentially break their builds before 2019-12 itself is available.

And as Eike mentions leaving an integration composite empty also makes it useless.

What I do for EMF is ensure that the whole cleanup process is automated.  So only the last 5 nightly builds are retained (and referenced by the composite).  For the milestones, the process detects that a new version is being added and then removes all older versions.  So this deletes the folders and cleans the composite as soon as I do a milestone build for the next release...
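
For projects without such automation, the "retain only the last 5 nightly builds" part could be sketched roughly as below. This is not the EMF implementation: the function name prune_old_builds is hypothetical, it assumes build folder names sort chronologically and contain no newlines, and it does not touch the composite metadata, which would still need updating separately.

```shell
#!/bin/sh
# Hypothetical helper: keep only the newest KEEP build folders under DIR.
# Assumption: folder names sort chronologically (e.g. N201912090600).
prune_old_builds() {
    dir=$1
    keep=$2
    # Emit the names of sub-directories only; plain files are ignored.
    for path in "$dir"/*/; do
        [ -d "$path" ] || continue
        basename "$path"
    done | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r name; do
        # Everything past the first $keep names (newest first) is removed.
        echo "removing $dir/$name"
        rm -rf -- "${dir:?}/$name"
    done
}
```

For example, prune_old_builds /path/to/nightly 5 leaves the five newest folders in place.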
Comment 49 Denis Roy CLA 2019-12-09 09:37:27 EST
> The report is definitely useful for identifying where you can save space,
> but the ability to manage the files on disk with the badly crippled set of
> tools available is definitely a problem.

Agreed. I filed bug 546528. I think we could address this rather easily with my initial proposal there.


> Provided projects have not put in a custom index.html, as was once a good
> idea, the EF's default 404 page is now quite good. You can happily browse in
> your favourite browser.

Thanks - agreed the 404 does an honest job of providing users with workarounds.
Comment 50 Jonah Graham CLA 2019-12-11 10:12:57 EST
(In reply to Denis Roy from comment #0)
> 71G	cdt	

CDT is done (pending Bug 553887 for a small cleanup from webmaster). CDT is now <9GB on download, approx 5GB in mirrored directories. The tools.cdt archive is up to 100GB - however I plan to delete 70GB of it (old builds dating back to 2007).

(In reply to Denis Roy from comment #16)
> I've made a small change to our 404 handler, for file requests only.

This was very helpful in doing this cleanup - Thank you!
Comment 51 Denis Roy CLA 2020-06-25 10:53:01 EDT
Created attachment 283412 [details]
Screenshot

The 404 handler on download and archive can now give you options on your project's downloads if you're logged into https://eclipse.org

From the Download server, files/folders can be moved to the same location on the Archive server. If the parent directory doesn't exist on Archive, the action will fail.

From the Archive server, files/folders can be deleted permanently.
Comment 52 Denis Roy CLA 2021-11-03 14:28:53 EDT
Moved to GitLab Helpdesk: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/78