Squash build dependencies #6906

Closed
shykes opened this issue Jul 8, 2014 · 20 comments
Labels
area/builder area/distribution kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@shykes
Contributor

shykes commented Jul 8, 2014

The current implementation of “docker commit” and “docker build” makes it difficult to strip images of their build dependencies. This causes 2 problems:

  • Many images are unnecessarily bloated.
  • To avoid the bloat, some developers avoid using “docker build”, creating unnecessary fragmentation.
@unclejack
Contributor

We could collapse instructions which only introduce metadata changes into single layers.

Something like:

FROM someimage
MAINTAINER someone
RUN apt-get update
RUN apt-get install -y somepackage
ENV foo bar
ENV bar baz
ENV boo foo

could be safely reduced to fewer layers:

FROM someimage
MAINTAINER someone
RUN apt-get update
RUN apt-get install -y somepackage
ENV foo bar ENV bar baz ENV boo foo

We could also introduce automated squashing which would keep the original images around to make rebuilds faster.

@shykes
Contributor Author

shykes commented Jul 8, 2014

@unclejack collapsing layers in this way addresses a different problem: the limit in number of filesystem layers. It doesn't address the problem of disk space. I suggest we address it in a separate issue.

@aweiteka

Assumptions:

  • cached image layers are only useful or meaningful to the application developer during the build process
  • new Docker users naively expect each docker build to result in a single new image layer

I've considered build-artifact layer management to be a good candidate for image signing, which would squash the new build layers into a single new image layer.

docker build -t my/app . [--sign] adds one new layer to the application and optionally signs it.

@trevorjay

The caching is most useful for cases like package system commands that take a long time to run. However, you often want to explicitly re-run even these commands. What about a simple --no-cache flag for the build command?

Alternatively, what about a FRESH command for Dockerfiles? Basically it would have the same effect as:

RUN echo "nonce"

but wouldn't require the author to keep changing it.

Tweaking @aweiteka's idea: since you only want to sign "finished" images anyway, what about making --sign (if and when implemented) implicitly also behave as a --no-cache ?

@proppy
Contributor

proppy commented Jul 17, 2014

A somewhat "emergent" pattern is to use a builder and runner image.

Only the builder image contains the build dependencies; the artefacts are extracted using volumes, docker cp, or even stdout, and injected into another context.

I wonder how (and if) the docker CLI / API could bless and facilitate the pattern in a way that's compatible with the hub.

And more importantly: would that fix this issue, or is it a separate discussion?

@thaJeztah
Member

since you only want to sign "finished" images anyway, what about making --sign (if and when implemented) implicitly also behave as a --no-cache

-1 on that, or (at least) have --no-cache as a separate flag as well; if I don't need to sign my image, I still want to be able to disable caching layers or be able to squash.

@shykes
Contributor Author

shykes commented Jul 22, 2014

@proppy yes, I think we should support the "builder / runner" pattern you talk about. That is the goal of nested builds (#7115). Note that even with nested builds, you still need dependency squashing to make sure the leftover "unpublished" build dependencies are not carried into the image.

@SvenDowideit
Contributor

The example I have is a Dockerfile with:

RUN apt-get install make
RUN make whatever
RUN apt-get remove make

Even if I want my build artifacts, right now it's not simple enough to get rid of the build tools.

plus, it enforces a Dockerfile hygiene thing - it will encourage users to extract the

RUN apt-get update
RUN apt-get install apache
ADD certificates
RUN echo "" > domain_settings_files

that they have copied and pasted into a few places into one common image, which they then use via FROM local-web-debian and can update and manage more carefully.
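
As a rough illustration of why the make example above stays large, here is a simplified Python model (an assumption for illustration, not Docker's actual storage code). Each layer records only what it adds, and a deletion is just a marker, so a later layer cannot reclaim space from an earlier one. The `stored_size` helper, the paths, and the sizes are all made up.

```python
# Toy model: each layer maps a path to the bytes it adds,
# or to None when the layer deletes that path (a whiteout marker).
layers = [
    {"/usr/bin/make": 400_000},           # RUN apt-get install make
    {"/usr/local/bin/whatever": 50_000},  # RUN make whatever
    {"/usr/bin/make": None},              # RUN apt-get remove make
]


def stored_size(layers):
    """Bytes stored across all layers; deletion markers reclaim nothing."""
    return sum(
        size
        for layer in layers
        for size in layer.values()
        if size is not None
    )


print(stored_size(layers[:2]))  # 450000, before the "remove" step
print(stored_size(layers))      # 450000, unchanged after it
```

In this model the third step hides make from the final filesystem, but the bytes from the first layer are still shipped with the image.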

@phemmer
Contributor

phemmer commented Jul 23, 2014

I'm just curious, how does this issue differ from #332 or #4232?

@aigarius

Wasn't there supposed to be work on "ghost" layers that would be present in the build process, but disappear from the final image? If such a feature were technically possible, one could easily imagine the following Dockerfile:

FROM debian
RUN apt-get install -y libjpeg
~RUN apt-get install -y libjpeg-dev build-essential gcc
~ADD source /build
~WORKDIR /build
~RUN ./configure
~RUN make
RUN make install
CMD /usr/local/bin/myexe

Where all layers generated with lines that start with a "~" would actually not appear in the final image.

@phemmer
Contributor

phemmer commented Jul 28, 2014

I personally like the syntax @txomon mentioned in #332, which uses a COMMIT directive.
The reason is that it makes it clearer where the resulting image would be generated.

For example

FROM debian
RUN foo
~RUN bar
~RUN baz
CMD bash

Will the resulting image have the RUN bar and RUN baz as a single image, and CMD bash as another? Or will there be one image for all three?

On the other hand, if we reuse concepts from database systems:

FROM debian
RUN foo
BEGIN
RUN bar
RUN baz
COMMIT
CMD bash

It seems very clear that RUN bar and RUN baz will be squashed into a single image, and CMD bash will be a separate one.

@aigarius

I think that there is a miscommunication about what "squash" build dependencies means in the context of this ticket. I understand it as "remove": any changes to the filesystem that are made by tilde commands will not show up at all in the final image. This means that you can install build dependencies, do the compiles, and none of that will actually be included in the final image. Only the result of "RUN make install" in my example above (only the actual already compiled binaries in /usr/local/bin) and the runtime dependencies installed in the second line of the example would be present in the final image.

As far as I can tell, half of the comments here confuse this with #332, which keeps all the changes to the filesystem in the final image, but just in fewer layers.
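
The difference between squashing (#332) and removing build-only layers (this issue) can be sketched with a toy Python model. Everything here is hypothetical: the `merge`, `squash`, and `strip` helpers, the paths, and the sizes are assumptions for illustration, and the "~" marker is the syntax proposed in this thread, not an existing Dockerfile feature.

```python
def merge(layers):
    """Apply layers in order; a value of None deletes the path."""
    merged = {}
    for layer in layers:
        for path, size in layer.items():
            if size is None:
                merged.pop(path, None)
            else:
                merged[path] = size
    return merged


def squash(build):
    """#332 style: one layer holding every change from every step."""
    return merge(files for _ghost, files in build)


def strip(build):
    """This issue: drop the layers marked as build-only ("~")."""
    return merge(files for ghost, files in build if not ghost)


# (build-only?, files) per Dockerfile step; paths and sizes are made up.
build = [
    (False, {"/usr/lib/libjpeg.so": 300}),   # RUN apt-get install -y libjpeg
    (True, {"/usr/bin/gcc": 9000}),          # ~RUN apt-get install -y ... gcc
    (True, {"/build/myexe.o": 700}),         # ~RUN make
    (False, {"/usr/local/bin/myexe": 500}),  # RUN make install
]

print(sum(squash(build).values()))  # 10500, compiler and objects still included
print(sum(strip(build).values()))   # 800, only runtime library and binary
```

In this model squashing reduces the layer count but keeps the compiler and object files, while stripping leaves only the runtime library and the installed binary.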

@phemmer
Contributor

phemmer commented Jul 28, 2014

Ahh, thank you. That makes a lot more sense. Perhaps we could call it "stripping" instead of "squashing".

@TomasTomecek
Contributor

Any updates?

@stain

stain commented Jan 18, 2015

What about just having a multi-line RUN-MANY mean the equivalent of the awkward && \ chaining pattern?

RUN-MANY
  apt-get update
  apt-get install wget unzip build-essential
  wget http://example.com/source.zip
  mkdir /tmp/src
  cd /tmp/src
  unzip source.zip
  # CRAZY - comments allowed in the middle!
  make install
  apt-get remove --purge wget unzip build-essential
  apt-get --purge autoremove
  apt-get autoclean
  rm -rf /tmp/*
END

This would give a lean image, but also make it easier to copy-paste in any existing install-scripts.

Also - say you start with a traditional RUN style to have faster development time - now you can just insert RUN-MANY and END, search-replace the multiple RUN - and hey presto - you have the lean, autocleaning version of your script.
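
Until something like RUN-MANY exists, the conversion described above can be approximated outside of Docker. The sketch below is a hypothetical helper (`chain_run` is not part of any Docker tooling) that joins a list of shell commands into one RUN instruction using the existing `&& \` chaining pattern; the command list is only an example.

```python
def chain_run(commands):
    """Join shell commands into a single RUN instruction."""
    # Each command after the first goes on its own continuation line.
    return "RUN " + " && \\\n    ".join(commands)


commands = [
    "apt-get update",
    "apt-get install -y wget unzip build-essential",
    "make install",
    "apt-get remove --purge -y wget unzip build-essential",
    "rm -rf /tmp/*",
]

print(chain_run(commands))
```

The output is a single RUN instruction, so the installed build tools and their removal land in the same layer. Because the result is an ordinary shell chain, comments between commands are not preserved.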

@phemmer
Contributor

phemmer commented Jan 19, 2015

@stain As you mentioned, you can already accomplish the same thing with a simple && shell operator. This issue is about doing things which aren't possible at all (not just inconvenient), such as removing layers which are not necessary in the final image.

@thaJeztah
Member

@phemmer to give some context; @stain's original Dockerfile used a COPY to add the source. I suggested a build-container or curl/wget workaround and pointed to existing issues on the issue tracker wrt getting rid of large intermediate layers.

@stain

stain commented Jan 19, 2015

But isn't there something wrong when most of the Dockerfile commands can't be used in a production image because they generate enormous intermediate images that nobody else will ever need?

The tiny difference between ADD and COPY does not help.

The COMMIT should be a better solution than my RUN-MANY, as it would allow you to do COPY and so on without worrying about waste, and just do multiple RUN in a naive way.

In one project I have a GitHub repository which somehow is 600 MB when checked out. Ideally I would like to host the Dockerfile right there, so the image on the hub would track updates to the GitHub code.

Then I would just COPY . /src and do the build within the Docker image (installing about 40 MB of binaries) and then clean up before a COMMIT.

Currently I need to have a massive && which in the end does all the cleanup, like deleting /tmp and deleting all the library files needed for compilation. Obviously testing this takes forever, as it always does everything.

Moving from Dockerfile style during development to mega-RUN-&& style takes considerable effort, as one has to do so much more housekeeping, like installing wget and unzip, keeping track of the temporary files downloaded (or using pipes if tar.gz), and cleaning everything up afterwards.

During such a sequence not a single Dockerfile command can be used, as it would make it pointless to clean up.

Flattening layers should be straightforward, as it's just set difference (after all, it is done at runtime by the container). You can set the boundary to each Dockerfile, so you can't flatten outside layers.

@aigarius

@stain you are also confusing this with #332; please read my comments above for clarification.

@jessfraz added the kind/feature, /dist/registry, and area/builder labels Feb 27, 2015
@jessfraz
Contributor

Hello!
We are no longer accepting patches to the Dockerfile syntax as you can read about here: https://github.com/docker/docker/blob/master/ROADMAP.md#22-dockerfile-syntax

Mainly:

Allowing the Builder to be implemented as a separate utility consuming the Engine's API will open the door for many possibilities, such as offering alternate syntaxes or DSL for existing languages without cluttering the Engine's codebase

Then from there, patches/features like this can be re-thought. Hope you can understand.
