Add ability to NOT write timestamps for zip file entries #48

Closed
danielwegener opened this issue Jun 30, 2016 · 34 comments

@danielwegener

danielwegener commented Jun 30, 2016

To be discussed:
Thinking about reproducible builds, it would be great if build tools that use plexus-archiver behaved as pure transformers/functions: given the same input, they would always produce the same output (at least if they are used synchronously and always receive the same input in the same sequence). This is currently not possible with maven-archiver (which transitively uses plexus-archiver) because AbstractZipArchiver always writes ZipEntry timestamps. I wish there were a property on AbstractZipArchiver, like private boolean createZipEntryTime = true;, to turn the setTime behaviour off.

I'd be happy to submit a PR if this feature is desirable (is there maybe something in the jar spec that insists on these fields? I have not found any reference).

@danielwegener danielwegener changed the title Add ability to NOT write timestamps for file entries Add ability to NOT write timestamps for zip file entries Jun 30, 2016
@michael-o
Member

What timestamp should be written then (instead of)?

@danielwegener
Author

danielwegener commented Jun 30, 2016

I thought modificationTime/Date was an optional field in the ZIP file format, but https://www.iana.org/assignments/media-types/application/zip says it's not. So my initial idea would be to set the field to 0/epoch.

This is also what reproducible-builds-maven-plugin is doing.
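
A minimal sketch of the idea, using only the JDK's java.util.zip rather than the Plexus Archiver API: every entry gets the same caller-supplied timestamp, so two runs over the same input yield identical bytes. The class name FixedTimestampZip and its write method are hypothetical. The classic ZIP date field cannot represent dates before 1980, so the usage below picks a later, arbitrary instant instead of a literal 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of plexus-archiver.
final class FixedTimestampZip {

    /** Writes the entries in sorted name order, stamping each with the same time. */
    static byte[] write(Map<String, String> entries, long fixedTimeMillis) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            // TreeMap gives a stable entry order regardless of the input map.
            for (Map.Entry<String, String> e : new TreeMap<>(entries).entrySet()) {
                ZipEntry entry = new ZipEntry(e.getKey());
                // Without this call, ZipOutputStream stamps the entry with the current time.
                entry.setTime(fixedTimeMillis);
                zip.putNextEntry(entry);
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }
}
```

Calling FixedTimestampZip.write(entries, 1_000_000_000_000L) twice with the same entries should return equal byte arrays, whenever the build runs.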

@michael-o
Member

The timestamp of the class file shouldn't change if the class has not been recompiled, should it? How do you define a reproducible build in this case?

@danielwegener
Author

danielwegener commented Jun 30, 2016

I mean that mvn clean install on the same input and environment should always produce the same output. The time at which the build is executed should not matter. Class files are just intermediate artifacts that should be recreatable without changing the result of the whole build.

@sewe

sewe commented Jun 30, 2016

I think the Plexus archiver should allow its caller to explicitly set a single, fixed timestamp which will be used for all entries. If the timestamp is not set, then the Plexus archiver should default to its current behavior: using the modification timestamps of the entries to be archived.

If this single timestamp is then exposed as <archive> configuration of the maven-jar-plugin and friends, one can set it explicitly, e.g., to the time when the commit being built was made.
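
The fallback rule described here could look roughly like the sketch below. EntryTimePolicy and its methods are hypothetical names for illustration, not existing Plexus Archiver API: an optional fixed timestamp wins when set, otherwise the source file's own modification time is used.

```java
import java.util.OptionalLong;

// Hypothetical policy object, not part of plexus-archiver.
final class EntryTimePolicy {
    private final OptionalLong fixedTimeMillis;

    private EntryTimePolicy(OptionalLong fixedTimeMillis) {
        this.fixedTimeMillis = fixedTimeMillis;
    }

    /** Current behaviour: keep each source file's modification time. */
    static EntryTimePolicy useSourceTimes() {
        return new EntryTimePolicy(OptionalLong.empty());
    }

    /** Proposed behaviour: one caller-chosen timestamp for all entries. */
    static EntryTimePolicy fixed(long millis) {
        return new EntryTimePolicy(OptionalLong.of(millis));
    }

    /** Returns the time to write for an entry whose source was last modified at the given time. */
    long timeFor(long sourceLastModifiedMillis) {
        return fixedTimeMillis.orElse(sourceLastModifiedMillis);
    }
}
```

Keeping the decision in one small object means the archiver only asks "which time do I write?", while the caller decides where a fixed value comes from.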

@krosenvold
Member

krosenvold commented Jun 30, 2016

Would it make more sense to actually adjust the timestamp of the .class file to match the source file ? (I'm thinking about an additional plugin/step that ensures this behaviour)

@danielwegener
Author

@sewe sounds legit. I'll report back with some code :)

@danielwegener
Author

@krosenvold The point is that the modification time of the source file says absolutely nothing about its content. This would just shift the problem from "when have I built something" to "when have I touched something"; it still would not be about "what was built" (independent of time).

@sewe

sewe commented Jun 30, 2016

Would it make more sense to actually adjust the timestamp of the .class file to match the source file ?

That would still not always help in creating reproducible builds, as it assumes that the modification timestamp of a freshly checked out source file stays the same across checkouts, which I don't think holds for every SCM out there.

@danielwegener
Author

danielwegener commented Jun 30, 2016

That would still not always help in creating reproducible builds, as it assumes that the modification timestamp of a freshly checked out source file stays the same across checkouts, which I don't think holds for every SCM out there.

Not for git at least. There are only commit-timestamps

@sewe

sewe commented Jun 30, 2016

@sewe sounds legit. I'll report back with some code :)

Great! Thank you.

@krosenvold
Member

Yeah, I suppose it makes sense to use something like the commit timestamp of the head commit.

@danielwegener
Author

danielwegener commented Jun 30, 2016

Well, even then, if I have different commits that change things that do not affect the compiler input (.java files), and hence do not influence the compiler output (.class files), I'd still get different artifacts. I don't see any use for the modification timestamp of files within a jar (even the weird JSP template case only makes sense, AFAIK, if I modify an exploded template in the war).
But for this component, a timestamp parameter will offer best flexibility.

@krosenvold
Member

As long as the change just makes it possible to set a fixed file date on the archiver level I think this is a good solution. Someone else can decide what to set :)

danielwegener added a commit to danielwegener/plexus-archiver that referenced this issue Jul 1, 2016
@Zlika

Zlika commented Oct 7, 2017

Hi,
For your information, there is a global Maven ticket for reproducible builds: https://issues.apache.org/jira/browse/MNG-6276. Having the ability to set fixed timestamps in ZIP files is of course one of the big things to solve, along with a fixed ordering of the files inside the ZIP file.

@hboutemy
Member

hboutemy commented Oct 7, 2017

while working on MNG-6276, perhaps supporting the SOURCE_DATE_EPOCH environment variable would be a good idea: see https://reproducible-builds.org/docs/timestamps/

@michael-o
Member

I am not really fond of using env variables because this is not the Java way of doing this.

@hboutemy
Member

hboutemy commented Oct 8, 2017

sure, the env variable should not be used at library level
but instead of coding a magic timestamp value on any archive, I like the idea of giving archiver user a sourceDateEpoch configuration (yes, in Java, then as a field in a configuration bean) to let him decide what timestamp to use on his archive
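
One way to keep the environment lookup out of the library is to resolve it in the caller and hand the archiver a plain value. The sketch below assumes the reproducible-builds.org convention that SOURCE_DATE_EPOCH holds whole seconds since the Unix epoch; the class SourceDateEpoch and its resolve method are hypothetical.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;

// Hypothetical caller-side helper; the archiver itself never reads the environment.
final class SourceDateEpoch {

    /** Returns the configured instant, or empty when the variable is unset or blank. */
    static Optional<Instant> resolve(Map<String, String> env) {
        String raw = env.get("SOURCE_DATE_EPOCH");
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        // Assumption: the value is an integer number of seconds since 1970-01-01T00:00:00Z.
        return Optional.of(Instant.ofEpochSecond(Long.parseLong(raw.trim())));
    }
}
```

A build tool would call SourceDateEpoch.resolve(System.getenv()) once and pass the result on as ordinary configuration.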

@hboutemy
Member

complementary idea: if this sourceDateEpoch value could be automatically populated with the aggregator pom file timestamp, this would have the magic I personally expect:

  • automatic (so forgetting to change the value cannot happen)
  • gives a good hint on when the component was released

@sewe

sewe commented Oct 26, 2017

IMHO, plexus-archiver should just provide the infrastructure to set timestamps of archive entries for arbitrary files/file sets.

Where exactly this timestamp comes from (or whether it should even be the same across all archive entries) should be left for higher levels to decide. I hence dislike having a single, global fixedEntryModificationTime as @danielwegener proposed.

I agree with @hboutemy that “the env variable should not be used at library level”, but I would expect to be able to say something like this in my POM:

<archive>
    <defaultTimestamp>${env.SOURCE_DATE_EPOCH}</defaultTimestamp>
</archive>

… possibly along with <defaultTimestampFormat> (inspired by maven.build.timestamp.format).

@michael-o
Member

@sewe the format is pregiven by the archive spec, not us.

@sewe

sewe commented Oct 26, 2017

@sewe the format is pregiven by the archive spec, not us.

@michael-o I know. But I can envision other valid sources of timestamps besides ${env.SOURCE_DATE_EPOCH} that follow formats different from “seconds since the Unix epoch”. Take Jenkins’ ${env.BUILD_ID}, for example, which uses YYYY-MM-DD_hh-mm-ss. In that case, it would be nice to be able to directly configure the archiver to parse the value rather than having to use another Maven plug-in (worst case: maven-antrun-plugin) merely to convert from one format to another.

I hope that explains my reasoning behind the hypothetical <defaultTimestampFormat>.
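
As a rough illustration of what a <defaultTimestampFormat> could do behind the scenes, the sketch below parses a value with a caller-supplied pattern using java.time. The class TimestampParser is hypothetical, the YYYY-MM-DD_hh-mm-ss notation above is assumed to correspond to the Java pattern yyyy-MM-dd_HH-mm-ss, and the value is assumed to be in UTC because it carries no zone information.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing pattern-based timestamp parsing.
final class TimestampParser {

    /** Parses a zone-less date-time string with the given pattern, interpreting it as UTC. */
    static Instant parse(String value, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        // Assumption: the value has no zone of its own, so UTC is chosen explicitly
        // instead of the machine's default zone.
        return LocalDateTime.parse(value, formatter).toInstant(ZoneOffset.UTC);
    }
}
```

For example, TimestampParser.parse("2017-10-26_12-30-00", "yyyy-MM-dd_HH-mm-ss") yields the instant 2017-10-26T12:30:00Z.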

@Zlika

Zlika commented Oct 26, 2017

On a side note, be careful that the timestamp effectively written in the ZIP file should NOT depend on the user's time zone.
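
The classic ZIP date/time field stores wall-clock fields without any zone, and ZipEntry.setTime(long) converts the epoch value using the JVM's default time zone, so the same instant can be written differently on two machines. A possible way around this, sketched below with the JDK's ZipEntry.setTimeLocal (Java 9+), is to convert the instant to UTC wall-clock fields first. The class ZoneIndependentEntry is a hypothetical name.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.zip.ZipEntry;

// Hypothetical helper; requires Java 9 or later for setTimeLocal.
final class ZoneIndependentEntry {

    /** Creates an entry whose stored date/time fields are the UTC fields of the instant. */
    static ZipEntry create(String name, Instant timestamp) {
        ZipEntry entry = new ZipEntry(name);
        // The fields are fixed here, so the default time zone plays no part
        // for timestamps inside the ZIP date range.
        entry.setTimeLocal(LocalDateTime.ofInstant(timestamp, ZoneOffset.UTC));
        return entry;
    }
}
```

A reader of the archive then sees the same stored fields regardless of where it was built; whether those fields are displayed as UTC is up to the reading tool.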

@ebourg

ebourg commented Oct 27, 2017

I am not really fond of using env variables because this is not the Java way of doing this.

This is true for many languages/environments/build tools that have to deal with reproducible build issues. Using SOURCE_DATE_EPOCH may seem ugly, but an environment variable is the lowest common denominator between various ecosystems. If every tool/library comes with its own way of setting the date it quickly becomes a headache to integrate heterogeneous tools together to achieve reproducible builds.

@plamentotev
Member

Making a bit-by-bit reproducible build requires more than just making sure the timestamps are the same. For example, the jar files are created in parallel threads, so there is no guarantee for the order of the entries. Of course we can make sure the timestamps are the same, add an option to create the jar files in a single thread with predictable entry order, and so on, but the list may grow over time and make Plexus Archiver hard to maintain. I think it would be better if the archives are created first and, after that, the entries are sorted and any information that may differ between builds (such as timestamps) is stripped or replaced with given values. For example, there is a plugin that does that: https://zlika.github.io/reproducible-build-maven-plugin/.

What do you think?

@sewe

sewe commented Jul 23, 2018

@plamentotev Post-processing the artifact, like the reproducible-build-maven-plugin does, adds yet more I/O, as you are effectively writing the artifact to disk twice. So, if the issue can be fixed upstream (i.e., in the Plexus archiver), that would IMHO be preferable, even if that means doing more than just fixing the timestamps (in a separate ticket).

@plamentotev
Member

@sewe I do agree, but my concern is whether it is feasible to achieve that in Plexus Archiver. There are a lot of moving parts: Plexus Archiver allows a lot of customization and it relies on dependencies that do not guarantee determinism. I'm not saying it is impossible to achieve, but it is quite tricky. I would prefer to see a discussion about the whole picture and how the goal can be achieved. If we're doing this piece by piece, I'm afraid that we'll end up in a dead end or create a solution that is not optimal.

If there is added value in having the timestamp fixed for Zip entries besides reproducible builds, then we should do it anyway. Otherwise I think we should first guarantee the order of the inputs and the outputs - IMHO that would be the hardest part to achieve without additional I/O.

@plamentotev
Member

Sorry. Looks like I've missed that there is already a discussion (issue and wiki page) to track the reproducible/verifiable builds[1][2].

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74682318
[2] https://issues.apache.org/jira/browse/MNG-6276

@hboutemy
Member

hboutemy commented May 8, 2019

I think we should first guarantee the order of the inputs and the outputs - IMHO that would be the hardest part to achieve without additional I/O

just done apache/commons-compress#78, waiting for merge
now, the last key missing part is setting a defined timestamp for zip entries

@plamentotev
Member

plamentotev commented May 11, 2019

just done apache/commons-compress#78, waiting for merge
now, the last key missing part is setting a defined timestamp for zip entries

Great work @hboutemy! But I'm not sure it is enough. This will guarantee that the entries in the ZIP file are stored in the same order in which they are added. But what if they are added in a different order every time? Do we have a guarantee that the resource collections (PlexusIoFileResourceCollection, for example) are going to return their elements in a consistent way? I'm not saying they don't, I just didn't have the time to check, but as I don't see any ordering code I have some concerns about whether the traversal order of the file system is reproducible.

@Zlika

Zlika commented May 11, 2019

As far as I know, the traversing order of the filesystem is not reproducible between two computers (or between two copies on the same computer). Anyway, Hervé's work is an important step towards reproducibility.
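
Since directory traversal order is not something to rely on, one common remedy is to sort the discovered paths before adding them to the archive. The sketch below uses only java.nio.file and a plain string sort; SortedEntryNames is a hypothetical name and this is not how PlexusIoFileResourceCollection is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists regular files under a root in a stable order.
final class SortedEntryNames {

    /** Returns '/'-separated paths relative to root, sorted by natural String order. */
    static List<String> list(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    // Normalise separators so the names match on every platform.
                    .map(p -> root.relativize(p).toString().replace('\\', '/'))
                    // The sort, not the file system, decides the final order.
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Feeding entries to the archiver in this order, combined with an archiver that preserves insertion order, removes the dependency on how the file system happens to enumerate a directory.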

@Tibor17

Tibor17 commented May 12, 2019

@plamentotev @Zlika
No reason to worry. I was working on it together with @hboutemy and the code makes sense.
apache/commons-compress#79
apache/commons-compress#78

@hboutemy
Member

latest news: see #121
currently tested with maven-source-plugin and maven-assembly-plugin; works with zip files, but still requires forcing uid/gid for tar reproducibility

@hboutemy hboutemy added this to the 4.2.0 milestone Oct 12, 2019
@hboutemy
Member

closing this issue as it is superseded by #121
