Proposal: Extending Build Context with Intermediate State #12415

Closed
WhisperingChaos opened this issue Apr 15, 2015 · 12 comments

Comments

@WhisperingChaos
Contributor

TOC

Background
Essential
TLDR
Syntax
Semantics
Benefits
Example

Description

This proposal outlines another solution to separate build-time from run-time concerns. Its predecessors #7115, #7149, and #8660 present thorough summaries of the issues arising from the current build system's inability to completely isolate these concerns. Therefore, this proposal, at this time, will not duplicate their content but instead, will focus on describing its solution.

Background

Highly declarative languages rely on the notion of immutable values when executing a program. The docker build context embodies this notion, as it represents an immutable set of values employed to create one or more images. However, although the initial state provided to a declarative program represents all the information necessary to produce the final result, the program itself is generally decomposed into a set of transforms, which themselves produce intermediate state required to eventually compute the result. Once produced, this intermediate state must also be immutable, and its production must not destructively overwrite existing execution state. For example, the declarative program (-1 * cos(X)), where X is bound to the value 0 degrees, requires computing the transform cos(0) before evaluating the multiplication. Computing cos(0) extends the execution state/context to include the value 1. Notice that extending the execution state preserves the value 0 (it's not destructively overwritten), leaving variable X still bound to this same value.
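The (-1 * cos(X)) walk-through can be sketched in a few lines of Python. This is only an illustration of nondestructive extension, not part of the proposal: the `extend` helper and the key names are hypothetical, and a read-only mapping stands in for the execution state.

```python
import math
from types import MappingProxyType


def extend(state, name, value):
    """Return a new read-only state that adds name -> value.

    The existing state is never modified, and rebinding an existing
    name is rejected because it would overwrite prior execution state.
    """
    if name in state:
        raise KeyError(f"{name!r} is already bound")
    return MappingProxyType({**state, name: value})


# Initial state: X is bound to 0 degrees.
initial = MappingProxyType({"X": 0.0})

# Computing cos(X) extends the state with the value 1; X stays bound to 0.
step1 = extend(initial, "cos(X)", math.cos(math.radians(initial["X"])))

# The multiplication consumes the intermediate value and extends the state again.
step2 = extend(step1, "-1 * cos(X)", -1 * step1["cos(X)"])
```

Each step yields a larger state while every earlier state remains intact, which is the property the build context needs to keep once intermediate values are added to it.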

In an analogous way, a Dockerfile FROM and its associated Dockerfile commands (its ImageContext) can implement intermediate transforms when executing processes, either directly by specifying RUN or indirectly through an ONBUILD RUN ... trigger. As above, these intermediate transforms must nondestructively extend the current execution state with the intermediate or final values required, respectively, by a succeeding, dependent transform or to construct the desired image. The new values must also be addressable, so they can be bound (coupled), making their values visible (accessible), or unbound, preventing access to them.

Indeed, the Nested/Chained Build proposal #7149 accomplished the objectives of nondestructively extending the build context and ensuring the addressability of execution state, as each intermediate build step would reconstitute the entire build context required to satisfy the remaining ones. Reconstructing the build context included operators to selectively propagate an existing value, by assigning it an address (path/file name), or to extend the build context by including values generated by the current step. Both the addressability and extension operators accomplished their tasks by physically copying execution state to a directory within the current step's filesystem. For example, the addressability operator, ADD, transfers the pre-existing build context values required by the remaining steps to the image's file system at a specific address (path), while the implementation of the extension operator relies on processes executing within an ImageContext to eventually write their values to the same directory/subdirectory targeted by the addressability operator (ADD).

Although Nested/Chained Build provides a means to extend the build context and permits its addressability, it lacks a coupling mechanism to declare a value to be either bound (visible) or unbound (inaccessible) to a variable (path/file reference). This results in tightly coupled code, artificial dependencies between build steps, and other undesirable traits as discussed in these posts: #7149 comment 1, #7149 comment 2, and demonstrated through coding examples: Compare Function Idiom to Nested/Chained Build. That said, could Nested/Chained Build's incremental, evolutionary approach to delivering isolation between build-time and run-time concerns, as compared to #8660 Function Idiom's more extensive one, be realized without Nested/Chained Build's drawbacks?

Essential

As outlined above a solution would require mechanisms to:

  • nondestructively extend the build context with intermediate execution state without physically copying it,
  • address (locate) preexisting and newly computed execution state without physically copying it,
  • bind (couple) and unbind (decouple) variables, path/file name references specified by Dockerfile commands like ADD and COPY, to build context state.

Since this Proposal: Dynamic Coupling via Local Build Context details and suggests syntax to implement a capable coupling/binding mechanism, the remainder of this description will outline the implementation of a language feature to deliver the other required mechanisms mentioned above.

Through its implicit container commits, docker build already preserves intermediate execution state and enforces its immutability, eliminating the need to formulate another mechanism. Therefore, what remains unimplemented is a suitable addressability mechanism, which, until recently, was absent from the Dockerfile language. An appropriate addressability mechanism can be found in the deferred proposal: Build multiple tagged images per Dockerfile #3251, which alluded to "Adding layers to a base image from arbitrary points in the current build". In particular, this post demonstrates an addressability mechanism, based on the TAG command, which would eliminate the unnecessary physical copying imposed by Nested/Chained Build's implementation. Although deferred, the proposal's addressability mechanism reappears with the recent introduction of the Dockerfile LABEL command, as discussed and implemented by Proposal: One Meta Data to Rule Them All => Labels #9882.

TLDR

Borrowing from these efforts, this proposal presents an addressability mechanism that integrates LABEL, but doesn't necessitate its use, to extend the initial build context with references to immutable execution state from other already performed and terminated ImageContexts encapsulated within the given Dockerfile. This addressability mechanism is similar to a bind mount or symbolic link in that it extends an addressing scheme, in this instance the build context directory structure, with references to another addressed context, in this case an ImageContext:directory, to provide addressability to the immutable values exposed by ImageContext:directory.

The remainder of this proposal discusses the feature's Syntax and Semantics, provides an Example, and lists its Benefits.

Syntax

FROM ... [MOUNT { <ImageContextDirectoryReference>:<RootName>
                [ <ImageContextDirectoryReference>:<RootName> ] ... 
                | 
                  //<LABEL>/<ImageContextDirectoryReference>:<RootName>
                [ //<LABEL>/<ImageContextDirectoryReference>:<RootName> ] ...
                }
         ]
  • FROM
  • <ImageContextDirectoryReference>: An absolute directory path resolved within a terminated ImageContext.
  • <RootName>: An absolute directory path within build context.
  • <LABEL>: A LABEL reference to a potentially intermediate image constructed during the execution of the lexically associated ImageContext.
  • Why specify MOUNT as a FROM keyword?
    • Although it can be applied independently of CONTEXT, the synergy between MOUNT and CONTEXT and their similar encoding naturally suggest their joint expression.
    • Facilitates generation of a Dockerfile's computational dependency graph, as all information necessary to compute it appears on only the FROM statement.
    • Please read the bullet points before CONTEXT's Syntax section, which explain the rationale for CONTEXT's lexical association to FROM.
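To make the grammar above concrete, here is a minimal Python sketch of how one MOUNT mapping token might be split into its parts. It is not Builder code. The names `MountSpec` and `parse_mount` are hypothetical, and the sketch assumes that in the `//<LABEL>/...` form the slash following the label doubles as the leading slash of the absolute directory reference, a detail the syntax leaves open.

```python
from typing import NamedTuple, Optional


class MountSpec(NamedTuple):
    label: Optional[str]  # LABEL naming an intermediate image, None for the final one
    reference: str        # absolute directory path inside the terminated ImageContext
    root_name: str        # absolute directory path added to the build context


def parse_mount(token: str) -> MountSpec:
    """Parse '<reference>:<root>' or '//<label>/<reference>:<root>'."""
    # Split on the last colon so the root name is always the final component.
    source, sep, root_name = token.rpartition(":")
    if not sep or not source or not root_name.startswith("/"):
        raise ValueError(f"malformed MOUNT mapping: {token!r}")

    label = None
    if source.startswith("//"):
        # Assumption: the slash after the label starts the absolute reference.
        label, slash, rest = source[2:].partition("/")
        if not label or not slash:
            raise ValueError(f"malformed LABEL reference: {token!r}")
        source = "/" + rest

    if not source.startswith("/"):
        raise ValueError(f"reference must be an absolute path: {token!r}")
    return MountSpec(label, source, root_name)


plain = parse_mount("/$GOPATH/bin/app:/binStgt1/stgt1")
labeled = parse_mount("//compiled/$GOPATH/bin/app:/binApp/app")
```

Environment variable references such as `$GOPATH` are left untouched here, since the proposal resolves them later within the appropriate ImageContext.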

Semantics

  • The build context's root directory represents a composite value. This proposal mutates the composite value of this directory, potentially violating the idempotent constraint for Dockerfile ADD and COPY references that transfer the content of this directory to an image's file system. In general, even in Dockerfiles with only one FROM command, the root directory contains resources that are build-time only and shouldn't appear in the run-time image. For example, the Dockerfile executed to construct the image should probably be excluded from its run-time image. Furthermore, implementing the visibility (invisibility) mechanism of Proposal: Dynamic Coupling via Local Build Context could hide a mutated value, preventing it from influencing the execution environment of subsequent ImageContexts. Therefore, this deviation, which enables unified access to both initial and intermediate execution state from the address space of the initial build context, provides a benefit that seems to warrant the possible violation of idempotence.
  • <RootName>: The topmost (root) directory name of this path must not already exist within the build context. If it does, the immutability principle is violated, as the composite values of overlapping directories and the individual values of overlapping files are changed. These types of alterations might, but do not necessarily, violate the idempotent constraint.
  • <ImageContextDirectoryReference>: A reference (path/file) resolved within the ImageContext of the last committed container, which represents the final execution state of these objects. However, LABEL can be used to address one of the intermediate committed containers generated within the current ImageContext. In this situation, the values of directories and their files may represent a state different from their final one. Also, environment variable references used to construct path/file names are resolved within the appropriate ImageContext.
  • <LABEL>: Must refer to a committed container (image) produced within the ImageContext of the current FROM. This constraint limits its scope/visibility to only the intermediate execution state produced within the current build context to better preserve idempotence and a developer's ability to reason about its state.
  • MOUNT: Binds <ImageContextDirectoryReference> to <RootName> of the build context. It confers addressability to the intermediate values accessible by <ImageContextDirectoryReference> via the build context. It also identifies values contributed by the ImageContext that other ImageContexts within the same Dockerfile consume. When paired with the notion of CONTEXT, defined by Proposal: Dynamic Coupling via Local Build Context, a (hopefully) Directed Acyclic Graph (DAG) documenting computational dependencies between "build steps" (FROM statements) can be generated to order the execution of build steps.
  • Image History associated to the resulting minimal image no longer reflects the entire Dockerfile command stream required to assemble it. The introduction of a "used by" composition relationship, as differentiated from the currently prescribed "inherited" one, requires integration of a separate Dockerfile command lineage for each "used by" component. Currently, this "integration" appears as a series of ADD operations, transferring values addressable from the build context into the resultant image's file system. In certain cases, such as secret keys, this incomplete recording benefits a build's security profile. Although important, this proposal postpones the discussion of lineage until its adoption gains traction within the community. Also note, any credible proposal addressing separation of build-time and run-time concerns will encounter this issue.
  • As with Image History, the introduction of a "used by" composition relationship, as differentiated from the currently prescribed "inherited" one, affects the disposition of intermediate images, as those related through the "used by" relationship would currently be considered "dangling parents". This disposition could result in their inadvertent removal from the build host's image cache. Again, a solution to this issue won't be attempted until indicated by interest in this proposal.
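The rule that a MOUNT root name must not already exist within the build context can be sketched as a simple validation pass. The Python below is an illustration only. `root_of` and `extend_context` are hypothetical helpers, and the sketch assumes the strict reading in which a root added by one MOUNT also blocks later MOUNTs from reusing it.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set


def root_of(path: str) -> str:
    """Return the topmost directory name of an absolute POSIX path."""
    parts = PurePosixPath(path).parts  # '/binStgt1/stgt1' -> ('/', 'binStgt1', 'stgt1')
    if len(parts) < 2 or parts[0] != "/":
        raise ValueError(f"expected an absolute path with a root name: {path!r}")
    return parts[1]


def extend_context(context_roots: Iterable[str], root_names: Iterable[str]) -> Set[str]:
    """Return the context roots extended by the MOUNT roots, rejecting overlaps."""
    extended = set(context_roots)  # copy, so the caller's collection is untouched
    for name in root_names:
        root = root_of(name)
        if root in extended:
            raise ValueError(f"root {root!r} already exists in the build context")
        extended.add(root)
    return extended


uploaded = {"Dockerfile", "app", "stgt1", "stgt2"}
extended = extend_context(uploaded, ["/binStgt1/stgt1", "/binApp/app", "/binStgt2/stgt2"])
```

With the build context from the Example section, the three mounted roots are accepted, while a mapping such as `/app/bin` would be rejected because `app` is already a top-level directory of the uploaded context.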

Benefits

  • Completely separates build-time and run-time concerns, when applied alongside the CONTEXT feature. An image representing the minimal desired state can be indirectly composed from values (files/paths) available from the initial build context, as well as any ImageContext encapsulated within a given Dockerfile.
  • Improves the declarative benefit of a Dockerfile as:
    • Computational dependencies precisely determine the execution order of an ImageContext. In other words, FROM statements can appear in any order within a Dockerfile.
    • Builder implements MOUNT's binding mechanism from simple mapping statements eliminating its manual encoding by individuals developing Dockerfiles.
  • Can accelerate the apparent execution speed (not total CPU time) by executing build steps concurrently rather than serially, once the promise documented by the DAG of dependent values is satisfied.
  • Unifies the notions of the initial build context and the derived, intermediate build context created by other ImageContexts encapsulated within the current build, as simply the build context.
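The ordering and concurrency benefits can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a sketch of the scheduling idea, not Builder's algorithm: the step names and the `execution_waves` helper are hypothetical, and the dependency map is written by hand here rather than derived from a Dockerfile.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group build steps into waves; steps within a wave can run concurrently.

    `dependencies` maps each step to the set of steps whose output it consumes.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose inputs now exist
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical step names: three independent compile steps feed one final image.
deps = {
    "compile-a": set(),
    "compile-b": set(),
    "compile-c": set(),
    "final-image": {"compile-a", "compile-b", "compile-c"},
}
waves = execution_waves(deps)
```

The order in which the steps are written down has no influence on the result, which mirrors the claim that FROM statements can appear in any order within a Dockerfile.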

Example

Purpose:

The example consists of an initial go application that decides which one of two competing strategies to execute when solving a problem. The competing strategies are also written in go. Assume all go programs are linked as static images. The selected Docker Hub image: google/golang-runtime executes a go compiler request converting source to a static executable via ONBUILD triggers. The Dockerfile reflects the task of generating the executables and adding them to the minimal "scratch" image.

Build Context:

/
    Dockerfile
/app
    main.go
/stgt1
    main.go
/stgt2
    main.go

Dockerfile
# Regarding the FROM statement that follows,
# CONTEXT creates a virtual directory structure from the uploaded build context corresponding
# to the build context expected by google/golang-runtime's "ONBUILD ADD . /gopath/src/app" 
# trigger statement.  From the perspective of google/golang-runtime, this virtual build context
# is simply: "/main.go".  All other paths/files appearing in the uploaded build context are
# excluded/inaccessible to google/golang-runtime.  Once google/golang-runtime's
# triggers terminate, completing the compilation of main.go into a static executable image,
# the MOUNT keyword extends the uploaded build context with essentially a link to the
# produced executable.  In this case, MOUNT extends the uploaded build context with
# the directory path "/binStgt1/stgt1" that points to "/$GOPATH/bin/app".  
FROM google/golang-runtime CONTEXT /stgt1/*:/  MOUNT /$GOPATH/bin/app:/binStgt1/stgt1

# The same process applies to the FROMs below, except the bindings change.
FROM google/golang-runtime CONTEXT /app/*:/    MOUNT /$GOPATH/bin/app:/binApp/app
FROM google/golang-runtime CONTEXT /stgt2/*:/  MOUNT /$GOPATH/bin/app:/binStgt2/stgt2

# To construct the minimal image,
# CONTEXT creates a virtual directory structure from the extended, uploaded build
# context corresponding to the build context expected by scratch and its associated
# Dockerfile commands.  In this case scratch's "ADD /bin/* /" requires all the
# executable files be accessible through the path reference "/bin".  CONTEXT
# coalesces the contents of "/binApp", "/binStgt1", and "/binStgt2" to form "/bin"
# with the following virtual directory structure:
#    /bin 
#        /app
#            main
#        /stgt1
#            main
#        /stgt2
#            main
# The "/bin" virtual directory structure becomes the build context provided to scratch.
FROM scratch CONTEXT /binApp/*:/bin/ /binStgt1/*:/bin/ /binStgt2/*:/bin/
  ADD /bin/* /
  ENTRYPOINT  [ "/app/main", "/stgt1/main", "/stgt2/main" ]
  • Any FROM command can be located anywhere within the Dockerfile, as the execution order is determined by the computational dependencies, independent of the order in which they were defined.
  • MOUNT can include environment variable references, in this instance $GOPATH, that are resolved within the ImageContext of the last committed container or the one identified when resolving a LABEL.
  • Given the computational dependencies documented by CONTEXT & MOUNT, the three google/golang-runtime ImageContexts can be concurrently executed while delaying the initiation of scratch's ImageContext until the promised dependencies defined by MOUNT exist.
  • Although secrets, if necessary, would appear in the inherited ImageContext (in this instance google/golang-runtime), as long as the secrets themselves aren't compiled into the resultant go executables, they would not appear in the resultant image derived from scratch.
  • The absence of the MOUNT keyword on a FROM, in this case scratch, identifies the most likely image to associate with the repository name and tag provided by docker build's -t option. Since it doesn't expose intermediate state to be used by other build steps, it's most likely a resultant image that accepts content from other FROMs.
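The dependencies in this example can be derived mechanically from the CONTEXT and MOUNT mappings. The Python sketch below hand-transcribes the four FROM statements into a table and matches each CONTEXT source root against the roots published by MOUNT. The step names and helper functions are hypothetical, and matching by top-level directory name is an assumption about how a builder might relate the two keywords.

```python
from pathlib import PurePosixPath


def root(path):
    """Topmost directory name of an absolute path, e.g. '/binApp/*' -> 'binApp'."""
    return PurePosixPath(path).parts[1]


# step name -> (CONTEXT source paths, MOUNT root names), transcribed from the Dockerfile
steps = {
    "stgt1":   (["/stgt1/*"], ["/binStgt1/stgt1"]),
    "app":     (["/app/*"], ["/binApp/app"]),
    "stgt2":   (["/stgt2/*"], ["/binStgt2/stgt2"]),
    "scratch": (["/binApp/*", "/binStgt1/*", "/binStgt2/*"], []),
}


def dependencies(steps):
    """Map each step to the steps whose MOUNT roots its CONTEXT sources refer to."""
    producer = {root(m): name for name, (_, mounts) in steps.items() for m in mounts}
    return {
        name: {producer[root(src)] for src in sources if root(src) in producer}
        for name, (sources, _) in steps.items()
    }


deps = dependencies(steps)

# Steps without MOUNT expose nothing to others: candidates for the -t name and tag.
final_candidates = [name for name, (_, mounts) in steps.items() if not mounts]
```

Here the three compile steps depend only on the uploaded build context, so they have no predecessors, while `scratch` depends on all three and is the only step without a MOUNT.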
@WhisperingChaos
Contributor Author

@erikh @tiborvass @proppy @burke @discordianfish

Mechanisms promoted by the following proposals:

guided the formulation of the following proposals, that together, may lead to an incremental solution to properly separating build-time and run-time concerns:

  • Proposal: Extending Build Context with Intermediate State #12415 : Extends, in a mostly idempotent way, the uploaded build context with intermediate and final state (path/files). This allows files derived from the uploaded build context, like compiled executables, located within a committed container's file system to be accessible via a path resolvable within the extended structure of the uploaded build context.
  • Proposal: Dynamic Coupling via Local Build Context #12072 : Maps, in a read only way, the directory structure of the uploaded build context provided by docker build to present a virtual directory structure that corresponds to the one required by a given FROM statement. Think of it as creating a directory structure comprised entirely of symbolic links that conforms to what's needed by FROM and its associated Dockerfile commands.

Read #12415 TLDR and view its Example, as it concisely (for me) demonstrates both mechanisms. It's worth a look ...

@WhisperingChaos
Contributor Author

The "Proposal: SLINK instruction for Dockerfile" #7654 @cmfatih furthers the notion of addressability conferring the following benefits as it:

  • completely removes transient secret keys from both the final run-time and the build host's image cache,
  • improves build performance by optimizing the number of physical copy operations.

I would suggest that checksums produced by SLINK represent the "actual" file/path checksums not the file/path checksum computed for the symbolic link, as the symbolic link should present a proxy indistinguishable from the actual artifact it represents.
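The distinction between the two checksums can be shown with ordinary symbolic links. The Python sketch below is not SLINK itself, which is only a proposal. SHA-256 is an assumed hash choice, the function names are hypothetical, and the demo needs a platform that permits creating symlinks.

```python
import hashlib
import os
import tempfile


def content_checksum(path):
    """SHA-256 of the file a path ultimately refers to (symlinks are followed)."""
    with open(os.path.realpath(path), "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def link_checksum(path):
    """SHA-256 of the link's own target text, shown only for contrast."""
    return hashlib.sha256(os.readlink(path).encode()).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "main.go")
    link = os.path.join(workdir, "main-link.go")
    with open(target, "w") as handle:
        handle.write("package main\n")
    os.symlink(target, link)

    # The link acts as a proxy: its content checksum equals the target's.
    via_link = content_checksum(link)
    via_target = content_checksum(target)
    # Hashing the link itself yields an unrelated value.
    of_link = link_checksum(link)
```

Using the content checksum means a change to the real file is detected through the link, which is what cache invalidation would rely on.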

@WhisperingChaos
Contributor Author

@thaJeztah
I appreciate your labeling the proposal. Would it be possible to also add "/tools/build" label? Thanks!

@WhisperingChaos
Contributor Author

@ibuildthecloud

Noticed you looking to build minimal images. Would like your feedback regarding the following proposals: #12415 & #12072. Thanks.

@erikh
Contributor

erikh commented Apr 22, 2015

I worry that this may be too complicated for the typical Docker use-case.

On Apr 21, 2015, at 6:41 PM, Rich Moyse notifications@github.com wrote:

@ibuildthecloud
Noticed you looking to build minimal images. Would like your feedback regarding the following proposals: #12415 & #12072. Thanks.


@WhisperingChaos
Contributor Author

@erikh

Neither feature is required.

However, if you want added security/reliability, the ability to reformulate the build context to match the ONBUILD triggers executed by library go and/or ruby images, and a true separation of build-time and run-time concerns, then supplying input/output mapping specifications seems a small price to pay. It's also familiar, as it acts like an argument list to a function call.

Do you have something in mind that's more minimal than specifying the data dependencies?

@ibuildthecloud
Contributor

Honestly it may take me a bit to grok this full PR. To be honest, that may be its downfall. When I read through the example it's not obvious what is going on. Docker generally favors usability and simplicity, often at the cost of correctness.

So I will try to understand the concepts of this PR and then maybe they can be presented in a slightly different form.

@WhisperingChaos
Contributor Author

@ibuildthecloud

I appreciate your responding to my request to review the two proposals.

Honestly it may take me a bit to grok this full PR.

Sometimes, the perspective of a stranger in a strange land at least provides some level of insight into a problem that may be improved by others. Remember what happens if you don't grok...

I can be available via IRC/Skype if you wish to discuss the mechanisms and reasoning behind their application.

Also, I'm interested in understanding the “complexity” argument. Certainly there's much to read but the essential mechanism is bind mount influenced by the set Union operator. It is this composite mechanism that implements the mapping function common to CONTEXT and MOUNT. The mapping function's arguments are pairs of file/directory names which simply declare the associations, that in the case of CONTEXT, capture the specific input dependencies & structure required by, and expose output artifacts (MOUNT) produced by the processes running within the Dockerfile commands associated to a given FROM statement. Since these dependencies are unavoidable, exposing them facilitates their management by, for example, automatically generating the binding code, reducing the complexity required to write a Dockerfile.

Perhaps comparing this Proposal's example solution of (6 statements) above to a similarly encoded Chained Build solution (23 statements) below may help?:

#Chained Build
FROM scratch
  # reconstruct the build context to exclude the Dockerfile
  # and bury the other go source code so it's ignored by the go compiler
  # and doesn't overlay the code for app
  ADD ./app/* /go/
  ADD ./stgt1 /go/source/
  ADD ./stgt2 /go/source/
BUILD /go
  FROM google/golang-runtime
  RUN  mkdir -p /go/bin/app && mv /$GOPATH/bin/app/* /go/bin/app/
  ADD ./stgt1/* /go/
  ADD ./stgt2 /go/source/
BUILD /go
  FROM google/golang-runtime
  RUN mkdir -p /go/bin/stgt1 && mv /$GOPATH/bin/app/* /go/bin/stgt1/
  ADD ./bin/app/* /go/bin/app/
  ADD ./stgt2/* /go
BUILD /go
  FROM google/golang-runtime
  RUN mkdir -p /go/bin/stgt2 && mv /$GOPATH/bin/app/* /go/bin/stgt2/
  ADD ./bin/app/*   /go/bin/app/
  ADD ./bin/stgt1/* /go/bin/stgt1/
BUILD /go
  FROM scratch
  ADD ./bin/* /
  ENTRYPOINT  [ "/app/main", "/stgt1/main", "/stgt2/main" ]

Note - every Dockerfile statement other than FROM represents binding requests to either couple build context objects, that are inputs to the go compiler, to the particular container's file system, or bind generated output objects and those input objects not processed yet, to form a new (extended) build context for the next build step. Implementing CONTEXT and MOUNT eliminates the complexity of manually coding, debugging, and maintaining these 17 extra Dockerfile commands.

There are other important differences between the approaches but I don't want to overwhelm.

@errordeveloper
Contributor

I would be very excited to see this implemented!

My few thoughts on this are:

  • the description is quite theoretical and I am worried that many folks won't have the time to read it
  • examples use Go, which actually has next-to-none runtime dependencies; some dynamic language would be best

The bigger issue I am seeing is that people using Node.js, Python, Ruby or other dynamic language don't realise that they have heaps of build-time-only dependencies bundled into their runtime images, things like the compiler that is used to build native extensions and all the *-dev packages to go along with it.

Speed of deployment and security surface are probably the most important aspects, it's not the size on disk or theoretical purity.

@WhisperingChaos
Contributor Author

@errordeveloper

Thanks! Community support is critically important to convince core maintainers to adopt both proposals. Therefore, if you know of others interested in achieving what's proposed, please let them know.

the description is quite theoretical and I am worried that many folks won't have the time to read it

I'll put together a TLDR section for the companion proposal #12072. I hope the TLDR above addresses your concern for this one? When implementing proposals whose effect broadly impacts a critical component, like Builder, I would want a thoughtful assessment presenting the reasoning behind the proposal and its scope. Besides, what you see is an approach that's a bit ingrained, as I prefer to deconstruct concepts into more abstract ones in order to identify similarities that can then be leveraged to hopefully arrive at a minimal solution. For example, both CONTEXT and MOUNT are very closely related, which should improve the reliability of the resultant code and minimize the time required to develop it.

examples use Go, which actually has next-to-none runtime dependencies; some dynamic language would be best

The bigger issue I am seeing is that people using Node.js, Python, Ruby or other dynamic language don't realise that they have heaps of build-time-only dependencies bundled into their runtime images, things like GCC that they use to compile native extensions with along with all the *-dev packages etc.

  • I agree, but even google/golang-runtime packs dependencies into an image, like build-essential, which includes a GNU C++ compiler (from google/golang, golang-runtime's base image), and a go library implementing garbage collection that's dynamically loaded unless an application is statically linked.
  • The example purposely employs go because it's known to the core maintainers of Docker and, as you point out, it's easy to create a statically linked executable that can be "lifted" from the compiler's output bin and relocated to the desired directory of a targeted image. Although it would be useful, an automated method to extract a component and its related run-time dependencies is beyond the scope of this Proposal. Instead, for now, developers will have to rely on the tools offered by these programming environments and their experience to determine the minimal component set necessary to satisfy run-time dependencies.

Speed of deployment and security surface are probably the most important aspects, it's not the size on disk or theoretical purity.

  • Regarding theoretical purity, I would suggest idempotence is not a theory, but a trait and without exploring and enumerating the effects these proposals incur on preserving/violating this trait, core maintainers would certainly reject proposals that violate it or if they do bend this trait, that there be sufficient cause to do so, as well as a mechanism to control the effects of the violation. Systems that obey idempotence speed development, therefore, in a sense, accelerate deployment by not only supporting build-time optimization but also improving a developer's reasoning/understanding about the deployed component.

  • Disk size is at least causally related to the security surface, so it's difficult to realize one without addressing the other. Also, there are others that would disagree with your sense of priorities.

  • Implementing "Proposal: SLINK instruction for Dockerfile" #7654 along with "Proposal: Dynamic Coupling via Local Build Context" #12072 would both reduce the security surface of the build environment itself, as initial build context information, like secret keys, would not even appear in the host's build cache, and improve the performance of a build by eliminating unnecessary copying of both initial and extended build context files/directories. SLINK is essentially a special form of ADD. It should align with ADD's purpose: to identify a process' input dependencies and determine through its state (checksum) if the process should be executed (invalidated cache), as all Dockerfile processes are considered pure. However, instead of physically copying content to the image's file system, SLINK would simply refer to the referenced resource. Additionally, SLINK needs, like both ADD and COPY, the coupling and visibility mechanisms offered by CONTEXT so the static build context file/directory references specified by SLINK can be retargeted/coupled to the actual file/directory and limited to only those objects required by the executing build process.

    For example, by replacing google/golang-runtime's Dockerfile "ONBUILD ADD . /gopath/src/app/" with "ONBUILD SLINK . /gopath/src/app/", the go source, which itself may represent a secret, will appear as only a reference, not as an actual source file in the host's build cache. SLINK's reference of "." would be resolved through the virtual directory structure defined by CONTEXT to the appropriate build context object(s).

@WhisperingChaos
Contributor Author

@erikh

Why do you worry that this may be too complicated for the typical Docker use-case?

@jessfraz
Contributor

Hello!
We are no longer accepting patches to the Dockerfile syntax as you can read about here: https://github.com/docker/docker/blob/master/ROADMAP.md#22-dockerfile-syntax

Mainly:

Allowing the Builder to be implemented as a separate utility consuming the Engine's API will open the door for many possibilities, such as offering alternate syntaxes or DSL for existing languages without cluttering the Engine's codebase

Then from there, patches/features like this can be re-thought. Hope you can understand.
