Understanding Containerization

I was quite baffled the first time I saw docker running CentOS on an Amazon Linux host. I thought I would jot down everything I understand about containers so someone can save time without having to read a whole lot of documents. In this document I will try to demonstrate most (all) of the concepts without using any docker (or rkt) commands, so we can see how these tools work behind the scenes.

NOTE: As of today this document references docker, but the same ideas hold for other systems too. The topics covered here are only proofs of concept. Tools like docker, rkt, etc. wrap around these same primitives, but they handle corner cases and use cases far beyond what is mentioned here.

DISCLAIMER: I am not an expert in these domains; only my curiosity led me to read and write this document. (In other words, don't try this in production.)

The reason behind writing this document: in the operations world, when we debug a production issue, you had better know the internals rather than make foolish assumptions (assumption is the root of all major screw-ups).

Basic Docker

I assume the reader already knows about Docker and its features. Knowing Docker's features will help the reader understand the internals better (otherwise you might wonder why the heck I am saying all this).

Very Basic Docker

FROM centos

If you save the above contents to a Dockerfile and do a docker build -t vigith/centos . followed by a docker run -t -i vigith/centos /bin/bash, you will get a bash prompt inside centos (you can confirm this by doing a cat /etc/system-release at your new prompt).

Docker also supports mounting volumes, exposing ports, container linking, etc. Though the names may sound unfamiliar, the technology behind them remains the same. Since I am focusing on containers as a whole, I leave it to the reader to explore docker further.

Images

UnionFS

UnionFS lets you overlay files and directories of different filesystems, forming a single unified mountable filesystem. The merged filesystem has no duplicates, and later layers take precedence over former layers, so we end up with a new unified, coherent virtual filesystem. A couple of well-known implementations (or storage backends that achieve the same layering effect) are AUFS, OverlayFS, btrfs, DeviceMapper, etc.
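For a concrete illustration, here is a minimal OverlayFS sketch (the paths are made up for the example, and the overlay filesystem type assumes a reasonably recent kernel): a read-only lower dir and a writable upper dir are combined into one merged view.

# union a lower and an upper dir into a single merged view
> mkdir -p /tmp/lower /tmp/upper /tmp/work /tmp/merged
> echo "from lower" > /tmp/lower/a
> mount -t overlay overlay -o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work /tmp/merged
# the merged view shows files from both layers
> cat /tmp/merged/a
from lower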

CopyOnWrite

CopyOnWrite allows read-only and read-write filesystems to be merged. When a write is made, the affected data is copied up into the writable layer, leaving the read-only layers untouched. The writes can be either discarded or persisted; persisting them lets you snapshot the changes and later build layers on top of the snapshot as if it were the base layer.
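Continuing the OverlayFS sketch above (again only an illustration): a write to the merged view is copied up into the writable upper layer, while the read-only lower layer stays untouched.

# writes to the merged view land in the upper (read-write) layer
> echo "changed" > /tmp/merged/a
> cat /tmp/upper/a
changed
# the lower (read-only) layer is untouched
> cat /tmp/lower/a
from lower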

A Use Case

You have to install nginx for your website. The end container on your website server will be an nginx process tailored for the website, called website-nginx. You would also like to reuse your nginx build because it has a lot of patches specific to your env.

This can be done in 2 steps

  • getting a specific version of nginx (patched with all the crazy stuff) called ops-nginx
  • using ops-nginx to build out the website-nginx server; the same ops-nginx can be reused for other servers by just putting in the right confs

step 1 Create the ops-nginx image from the base os; it can then be reused later for many other apps

(base os)                              -> layer 1
   \_ installing patched nginx         -> layer 2
        \_ install users               -> layer 3
        |_ giving sudo for ops         -> layer 3 (snapshot as ops-nginx)

step 2 Create the website-nginx image from ops-nginx, a server with a specific set of configs and other packages

(ops-nginx)                 -> layer 1 (snapshot)
  \_ nginx website conf     -> layer 2 
  |_ ssl conf               -> layer 2
  |_ log conf               -> layer 2 (snapshot as website-nginx)

Docker Way

step 1 Create the ops-nginx image from base os

  • docker pull centos
  • docker run -t -i centos /bin/bash
    • yum install nginx foobar
    • ... other crazy command ...
  • docker commit -m "ops nginx image" CONTAINER_ID ops-nginx

step 2 Create the website-nginx image from ops-nginx

  • docker pull ops-nginx (only needed if the image was pushed to a registry; after step 1 it is already local)
  • docker run -t -i ops-nginx /bin/bash
    • .. change your config ..
    • ... other voodoo stuff ...
  • docker commit -m "website nginx image" CONTAINER_ID website-nginx

HOWTO

If there were no docker, how could we have done this? Docker and other tools might do it quite differently, but let's look at a couple of ways we could do it ourselves.

To understand how we can do it, we need to understand the concepts covered in the following sections.

I will try to give some crude examples using device mapper. These examples do a 2-step snapshot:

empty filesystem
 |
 +- load filesystem (snapshot 1)
     |
     +- edit filesystem (snapshot 2)

If we can achieve this, we can repeat it as many times as we like, and we can make and persist any kind of change at any level of snapshot.

Snapshot

This is a crude HOWTO on a working example of device mapper snapshots.

Loop Device

# create a sparse 100G file
> truncate -s100G test.block

# create /dev/loop0
# -f will find an unused device and use it
# --show will print the device name
> losetup -f --show test.block 

Now we have /dev/loop0 (my example is based on loop0; if loop0 is not free, do a losetup -d /dev/loop0) attached to test.block (a file exposed as a block device).

Create Origin and Snapshot targets

# create base target (1953125 = 1000 * 1000 * 1000 / 512)
# where 512 bytes = 1 sector, and 1 GB = 1000 * 1000 * 1000 bytes (it would have been
# 1024-based if GiB were the unit)
> dmsetup create test-snapshot-base-real --table '0 1953125 linear /dev/loop0 0'

# create the cow snapshot target
# the CoW store is 390625 sectors (200 MB) placed right after the origin on loop0
# 390625 + 1953125 = 2343750 sectors (== 1.2 GB used)
> dmsetup create test-snapshot-snap-cow --table '0 390625 linear /dev/loop0 1953125'

Populate the Origin Device

I downloaded a centos rootfs (actually I took a docker centos image and converted it to a tar via docker2aci). This centos tar is named centos-latest.tar.

# format the origin as an ext4 device
> mkfs.ext4 /dev/mapper/test-snapshot-base-real
# create a dir to mount the new ext4 fs
> mkdir -p /mnt/loados
# mount it
> mount /dev/mapper/test-snapshot-base-real /mnt/loados
# load centos to new ext4
> tar -xf centos-latest.tar -C /mnt/loados/
# umount the dir
> umount /mnt/loados

Mark the device as Origin

We will make the newly created ext4 filesystem containing the centos rootfs our origin.

# make /dev/mapper/test-snapshot-base-real as origin 
> dmsetup create test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'

Create CoW Snapshot

This will make a snapshot target, which can be mounted and edited. The snapshot target has the origin as its backend (i.e., if no write is made to the snapshot, origin == snapshot; otherwise all new writes go to the snapshot's CoW store).

# P (2nd last arg) means persistent across reboots
# 8 (last arg) is the chunk-size, the granularity of copying data to the snapshot
> dmsetup create test-snapshot-cow --table '0 1953125 snapshot /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'

Note how the origin argument is not the snapshot-origin device we just created (i.e. test-snapshot-base), but rather the origin's underlying device test-snapshot-base-real.

At this point, if you do a dmsetup status you will see something like the following:

> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin
test-snapshot-base-real: 0 1953125 linear
test-snapshot-cow: 0 1953125 snapshot 16/390625 16

Editing on CoW Snapshot

Let's add some data on the CoW snapshot. The origin won't have these changes; only the CoW snapshot will.

# mount the CoW device
> mount /dev/mapper/test-snapshot-cow /mnt/loados
# create a dir (one way to edit)
> mkdir /mnt/loados/vigith_test
# add some data
> echo bar > /mnt/loados/vigith_test/foo
# umount the device
> umount /mnt/loados

Merging the Snapshot

Take the changes we have made and merge them into the origin, so the origin will have all of them. This is useful because the next time we create a snapshot, it will already contain the changes.

To merge a snapshot,

  • the origin must be suspended
  • the snapshot device unmapped
  • the origin's table reloaded with a snapshot-merge target
  • the origin resumed
  • once the merge is complete (check it via dmsetup status)
  • suspend the origin again
  • replace the snapshot-merge target with snapshot-origin
  • resume
## suspend the origin, remove the snapshot device, and reload the origin's table with the snapshot-merge target
> dmsetup suspend test-snapshot-base
> dmsetup remove test-snapshot-cow
> dmsetup reload test-snapshot-base --table '0 1953125 snapshot-merge /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'

If you do a dmsetup status, you will see that test-snapshot-cow is missing now.

> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin  <--- it is snapshot-origin
test-snapshot-base-real: 0 1953125 linear

do a resume

> dmsetup resume test-snapshot-base

If you do dmsetup status, you will see that snapshot-origin became snapshot-merge

> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-merge 16/390625 16  <--- snapshot-merge
test-snapshot-base-real: 0 1953125 linear 
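The status fields for a snapshot-merge target are <sectors_allocated>/<total_sectors> <metadata_sectors>. A simple (hedged) way to wait for the merge to finish is to poll dmsetup status until the allocated count has dropped down to the metadata count, as in the 16/390625 16 line above.

# a sketch: keep polling the status until the merge completes
# (i.e. until the first number equals the last number on the snapshot-merge line)
> watch -n 1 dmsetup status test-snapshot-base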

suspend; replace the snapshot-merge target with snapshot-origin; resume

## dmsetup status output needs to be polled to find out when the merge is complete.
## Once the merge is complete, the snapshot-merge target should be replaced with the snapshot-origin target
> dmsetup suspend test-snapshot-base
> dmsetup reload test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'
> dmsetup resume test-snapshot-base

Now dmsetup status will confirm that snapshot-merge has become snapshot-origin

> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin    <--- snapshot-origin
test-snapshot-base-real: 0 1953125 linear

Load snapshot-origin to check for merge

We should now see the new directory we created, /mnt/loados/vigith_test, and the file inside that dir, /mnt/loados/vigith_test/foo.

# mount
> mount /dev/mapper/test-snapshot-base /mnt/loados
# you should be seeing 'bar' as output
> cat /mnt/loados/vigith_test/foo
bar
# unmount it
> umount /mnt/loados

file based FileSystem

If you remember, we started with a file called test.block. If you run file test.block or tune2fs -l test.block you will see it is an ext4 filesystem image. Also, you can loop-mount that file on any dir and you will see it contains the merged origin you just created.

# run file
> file test.block
# tune2fs
> tune2fs -l test.block
# create a mount dir
> mkdir /tmp/testmnt
# lets mount this test.block
> mount -o loop test.block /tmp/testmnt
# look for the dir and file we created
> cat /tmp/testmnt/vigith_test/foo
bar
# umount it
> umount /tmp/testmnt

Now you have a file that can be mounted.

An astute reader might say: of course you can add files and manipulate the FS, but what about installing packages or compiling source code against the libraries in the new FS? The answer to that is: keep reading (or skip ahead to the pivot_root / chroot section if you are really curious).

Thin Provisioning

The advantage over the previous implementation of snapshots is that thin provisioning allows many virtual devices to be stored on the same data volume. Please read the kernel doc to understand more about it.

An example of how thin provisioning works.

Loop Device

Thin provisioning requires a metadata device and a data device.

# create a sparse 100G data file
> truncate -s100G testthin.block
# create a sparse 1G metadata file
> truncate -s1G testmetadata.block

# create /dev/loop0
# -f will find an unused device and use it
# --show will print the device name
> losetup -f --show testthin.block
# create /dev/loop1 for metadata
> losetup -f --show testmetadata.block
# clean it with zeros
> dd if=/dev/zero of=/dev/loop1 bs=4096 count=1

Create a Thin Pool

# test-thin-pool => poolname
# /dev/loop1 /dev/loop0  => metadata and data devices
# 20971520 => 10GiB (20971520 = 10 * 1024 * 1024 * 1024 / 512)
# 128 => data block size in sectors (128 * 512 bytes = 64 KiB)
> dmsetup create test-thin-pool --table '0 20971520 thin-pool /dev/loop1 /dev/loop0 128 0'

Create a Thin Volume

  • send message to active pool device
  • activate the new volume (allocate storage)
# create a new thin volume
# "create_thin 0": the last 0 is the 24-bit identifier for the new thin volume
# the other 0 => the sector (512 bytes) of the logical device the message is sent to
> dmsetup message /dev/mapper/test-thin-pool 0 "create_thin 0"

# allocate storage/activate
# 0  (last arg) => thinp device identifier
# 2097152 => 1GiB (2097152 sectors = 1024 * 1024 * 1024 / 512)
> dmsetup create test-thin --table '0 2097152 thin /dev/mapper/test-thin-pool 0'

Load Data

Load the data to the new thin device. We will use this loaded thin device to create snapshots.

# create an ext4 partition
> mkfs.ext4 /dev/mapper/test-thin
# mount the dir
> mount /dev/mapper/test-thin /mnt/loados
# load the partition with centos
> tar -xf centos-smaller.tar -C /mnt/loados/
# unmount it
> umount /mnt/loados/

Internal Snapshot

  • suspend the origin device whose snapshot is being taken
  • send message "create_snap"
  • resume the origin device
# suspend origin
> dmsetup suspend /dev/mapper/test-thin

# create snapshot
# 1 => identifier for snapshot
# 0 => identifier for origin device (last arg 0)
> dmsetup message /dev/mapper/test-thin-pool 0 "create_snap 1 0"

#resume the origin
> dmsetup resume /dev/mapper/test-thin

If you do an ls -l /dev/mapper, you won't see any snapshot device yet.

> ls /dev/mapper/
control  test-thin  test-thin-pool

Activating Internal Snapshot

Once created, the user doesn't have to worry about any connection between the origin and the snapshot. The snapshot can be used like yet another thinly-provisioned device (i.e., you can take snapshots of it too).

# activate the snapshot (note that we pass 1 here)
# 1 => snapshot identifier (same value we gave when we called "create_snap")
# 2097152 => 1GiB (2097152 sectors = 1024 * 1024 * 1024 / 512)
> dmsetup create test-thin-snap --table '0 2097152 thin /dev/mapper/test-thin-pool 1'

If you do a ls -l /dev/mapper you should now see test-thin-snap in the listing

> ls /dev/mapper/
control  test-thin  test-thin-pool  test-thin-snap

Editing Snapshot

Let's mount this thin snapshot and put some data in it.

# mount
> mount /dev/mapper/test-thin-snap /mnt/loados
# create some new dir
> mkdir /mnt/loados/vigith_test
# write some data
> echo bar > /mnt/loados/vigith_test/foo
# umount
> umount /mnt/loados/

Internal Snapshot (Again)

This is exactly the same as the Internal Snapshot discussed earlier.

# suspend the origin (origin for this snap, but it is a snapshot of 1st origin)
> dmsetup suspend /dev/mapper/test-thin-snap
# please note we have incremented identifier to 2 and origin is 1
# (for the earlier run it was 1 and 0)
> dmsetup message /dev/mapper/test-thin-pool 0 "create_snap 2 1"
# resume the origin
> dmsetup resume /dev/mapper/test-thin-snap

Activating Internal Snapshot (Again)

Same as the earlier Activating Internal Snapshot, except that the identifier is now 2 (it was 1 before).

# earlier the identifier was 1
# lets call it test-thin-snap-2
> dmsetup create test-thin-snap-2 --table '0 2097152 thin /dev/mapper/test-thin-pool 2'

Load Snapshot

Load the latest snapshot to see the new dir created

> mount /dev/mapper/test-thin-snap-2 /mnt/loados
> ls -l /mnt/loados/vigith_test/foo
# you should be seeing 'bar' as output
> cat /mnt/loados/vigith_test/foo
bar
> umount /mnt/loados

file based FileSystem

If you remember, we started with a file called testthin.block. If you run file testthin.block or tune2fs -l testthin.block you will see ext4 data in it. Unlike the snapshot-merge example, nothing was merged back into the thin origin; the edits live in the snapshots, so to see them again we recreate the pool and activate the snapshot device.

# mounting the thin snapshots is a little different from the plain snapshots

# load the thin block
> losetup -f --show testthin.block
# load the metadata
> losetup -f --show testmetadata.block
# create the pool
> dmsetup create test-thin-pool --table '0 20971520 thin-pool /dev/loop1 /dev/loop0 128 0'
# activate the snapshot that holds our edits (identifier 2)
> dmsetup create test-thin-snap-2 --table '0 2097152 thin /dev/mapper/test-thin-pool 2'
# create a mount dir
> mkdir /tmp/testmnt
# mount the snapshot device
> mount /dev/mapper/test-thin-snap-2 /tmp/testmnt
# look for the dir and file we created
> cat /tmp/testmnt/vigith_test/foo
bar
# umount it
> umount /tmp/testmnt

Layering

If you are wondering how layering is done, it is not done by the operating system. docker has an fsdiff.go which does it: basically you take a diff between the filesystem that was brought in by the pull and the changes you made.

FIXME: correct me if I am wrong
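As a crude illustration (a sketch, not what docker's fsdiff actually does), you can mount the base thin device and the edited snapshot from the thin provisioning example side by side and take a recursive diff to see what the "layer" added or changed:

# mount the base and the edited snapshot read-only, side by side
> mkdir -p /mnt/base /mnt/snap
> mount -o ro /dev/mapper/test-thin /mnt/base
> mount -o ro /dev/mapper/test-thin-snap-2 /mnt/snap
# a recursive diff shows what changed between the two layers
> diff -qr /mnt/base /mnt/snap
Only in /mnt/snap: vigith_test
# clean up
> umount /mnt/base /mnt/snap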

Kernel Namespaces

The clone syscall, really! That is all there is to kernel namespaces.

Using kernel namespaces, we achieve process isolation:

  • ipc - InterProcess Communication (flag: CLONE_NEWIPC)
  • mnt - Mount points (flag: CLONE_NEWNS)
  • pid - Process ID (flag: CLONE_NEWPID)
  • net - Networking (flag: CLONE_NEWNET)
  • uts - set of identifiers returned by uname(2) (flag: CLONE_NEWUTS)

clone syscall

When a process is created, the new process inherits most of the parent process's attributes (including its namespaces). To use namespaces we just need to pass the right flags to clone.

int clone(int (*fn)(void *), void *child_stack,
   int flags, void *arg, ...
   /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

The 3rd argument is the flags.

For example, clone(fn_child, child_stack, SIGCHLD|CLONE_NEWPID|CLONE_NEWNET, &fn_child_args); can be called to create a child process with new net and pid namespaces.

Clone /bin/bash

To understand kernel namespaces, let's write a sample clone program and start building on it. The only change from example to example will be the line static int clone_flags = SIGCHLD;, where we will add more flags. When executed, this code runs a new bash process in the child context.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>
#include <string.h>

#define STACKSIZE (1024*1024)

/* the flags */
static int clone_flags = SIGCHLD;

/* fn_child_exec is the func that will be executed by clone
   and when this function returns, the child process will be
   terminated */
static int fn_child_exec(void *arg) {
  char * const cmd[] = { "/bin/bash", NULL};
  fprintf(stderr, "Child Pid: [%d] Invoking Command [%s] \n", getpid(), cmd[0]);
  if (execv(cmd[0], cmd) != 0) {
    fprintf(stderr, "Failed to Run [%s] (Error: %s)\n", cmd[0], strerror(errno));
    exit(-1);
  }
  /* exec never returns */
  exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
  char *child_stack = (char *)malloc(STACKSIZE*sizeof(char));

  /* create a new process, the function fn_child_exec will be called */
  pid_t pid = clone(fn_child_exec, child_stack + STACKSIZE, clone_flags, NULL);
    
  if (pid < 0) {
    fprintf(stderr, "clone failed (Reason: %s)\n", strerror(errno));
    exit(EXIT_FAILURE);
  }
  /* wait on our child process; if the parent exits, init will reap the child.
     we could also do other bookkeeping in the parent here (cgroups etc) */
  if (waitpid(pid, NULL, 0) == -1) {
    fprintf(stderr, "'waitpid' for pid [%d] failed (Reason: %s)\n", pid, strerror(errno));
    exit(EXIT_FAILURE);
  }

  return 0;
}

To compile, save this code as clone_example.c (TODO: check this code into the repo).

> gcc clone_example.c -o bash_ex

Now when you run ./bash_ex, you will get a new bash child process.

> ./bash_ex 
Child Pid: [13225] Invoking Command [/bin/bash]

ipc

The clone man page describes the CLONE_NEWIPC flag as below:

If  CLONE_NEWIPC  is set, then create the process in a new IPC namespace.  If this flag is not set,
then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
This flag is intended for the implementation of containers.

Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWIPC;

We will create a Shared Memory Segment in the parent shell and confirm that we can see the segment we created.

# create a segment
> ipcmk -M 4096

# list the shared memory segments
> ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x6b68f06d 0          ec2-user   644        4096       0              

Now run the ./bash_ex you just compiled (with the new flag). When you do ipcs -m, you won't see the segment created earlier, because you are in a new IPC namespace.

# cloned process with CLONE_NEWIPC set
> ./bash_ex
Child Pid: [12624] Invoking Command [/bin/bash]

## no shared memory listed
> ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      

mnt

CLONE_NEWNS creates a new mount namespace. If the child process you created has CLONE_NEWNS set, then the mount(2) and umount(2) system calls will only affect the child process (or processes that live in the same namespace). You can have multiple processes in the same mount namespace if you create new child processes without setting CLONE_NEWNS.

Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNS;

Run ./bash_ex; we will umount tmpfs and show that it only got unmounted in the child process, not in the parent.

> ./bash_ex
Child Pid: [12785] Invoking Command [/bin/bash]
## list the mount type tmpfs
> mount -l -t tmpfs
tmpfs on /dev/shm type tmpfs (rw,relatime)
## unmount tmpfs
> umount tmpfs
## show that the tmpfs got unmounted
> mount -l -t tmpfs

Meanwhile, any other shell (or any other process) on the host will still see tmpfs mounted.

## tmpfs is still mounted
> mount -l -t tmpfs
tmpfs on /dev/shm type tmpfs (rw,relatime)

NOTE: Please don't mistake mount namespaces for process jailing; this has nothing to do with jailing.

pid

A PID namespace provides an isolated environment for PIDs: PIDs in a new namespace start at 1, somewhat like a standalone system,
and calls to fork(2), vfork(2), or clone() will produce processes with PIDs that are unique within the namespace. The first process
created in a new namespace (i.e., the process created using the CLONE_NEWPID flag) has the PID 1, and is the "init" process for the
namespace. Children that are orphaned within the namespace will be reparented to this process rather than init(8).

Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWPID;

Execute the code and check the pid of the process; it should be 1.

> ./bash_ex
Child Pid: [1] Invoking Command [/bin/bash]
> echo $$
1

If you do a pstree or ps auxwww, you will still see a lot of other processes. This is because those tools work by reading the /proc dir, and our /proc is still the one mounted in the parent's namespace.
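As a hedged illustration, if you also pass CLONE_NEWNS along with CLONE_NEWPID, you can remount /proc inside the child so that the proc-reading tools only see the new pid namespace (without CLONE_NEWNS this remount would be visible outside the child as well):

# recompile with: static int clone_flags = SIGCHLD|CLONE_NEWPID|CLONE_NEWNS;
> ./bash_ex
Child Pid: [1] Invoking Command [/bin/bash]
# on hosts with shared mount propagation, make our copy of the mounts private first
> mount --make-rprivate /
# remount /proc so the proc tools read this pid namespace
> mount -t proc proc /proc
# now pstree / ps list only the processes of this pid namespace
> pstree -p
bash(1)───pstree(2)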

net

Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNET;

If you do ip addr on the terminal you will see multiple interfaces, like lo, eth0, etc. Now execute the newly compiled code and do an ip addr at the child bash prompt; you will see only the lo interface.

ip addr on normal bash prompt

> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
..snip..
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP qlen 1000
    link/ether 0a:a5:84:25:0a:db brd ff:ff:ff:ff:ff:ff
..snip..

ip addr on bash process created with CLONE_NEWNET

> ./bash_ex
Child Pid: [13182] Invoking Command [/bin/bash]
## only lo is shown
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
..snip..

uts

A UTS namespace is the set of identifiers returned by uname(2); among these, the domain name and the host name can be modified
by setdomainname(2) and sethostname(2), respectively. Changes made to the identifiers in a UTS namespace are visible to all
other processes in the same namespace, but are not visible to processes in other UTS namespaces.

Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWUTS;

We should be able to change the hostname in the new process and still not affect the hostname of the global namespace.

> ./bash_ex
Child Pid: [13225] Invoking Command [/bin/bash]
# change the hostname
> hostname foo.bar
> hostname
foo.bar

Meanwhile the hostname in the global namespace is still unaltered.

## hostname of the system
> hostname
test.qa

Resource Management (cgroups)

The Red Hat cgroup doc is a very good document on per-process resource management; reading it is a must to really understand cgroups and use them efficiently. The kernel doc has the implementation details. The cgroup subsystem-level docs are useful when you want to tweak each subsystem.

cgroups (control groups) is a Linux kernel feature that limits, accounts for and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes. cgroups allow you to allocate resources—such as CPU time, system memory, network bandwidth, or combinations of these resources—among user-defined groups of tasks (processes) running on a system.

I will just show an example of how to do it; me trying to explain cgroups would be doing them injustice :-)

Limiting Memory

As you know, cgroups doesn't add any system call of its own; it is driven entirely through a virtual filesystem. On this box the cgroup filesystem is mounted under /cgroup. So to limit memory we just need to put a cap on memory.limit_in_bytes and memory.memsw.limit_in_bytes in /cgroup/memory/<subgroup>/.
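If the memory controller isn't already mounted there (the mount point varies by distro), something along these lines will mount it; this is a sketch assuming cgroup v1:

# mount the cgroup v1 memory controller if it isn't mounted already
> mkdir -p /cgroup/memory
> mount -t cgroup -o memory memory /cgroup/memory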

# create the subgroup
> mkdir /cgroup/memory/test/
# jump to newly created subgroup
> cd /cgroup/memory/test/
# list the cgroup
> lscgroup
..snip..
memory:/
memory:/test
..snip..

Now let's run a quick one-liner that uses well over 1MB of memory and make sure it is able to run.

# create an array with elements 1 to 1024x1024x10 > 1M
> perl -le '@x=[1..1024*1024*10]; print "done"'
done

Using cgroups we can put a cap on the memory and swap the process is allowed to use.

# cap on memory
> echo $((1024*1024)) > /cgroup/memory/test/memory.limit_in_bytes
# cap on swap too (else it will swap out and run)
> echo $((1024*1024)) > /cgroup/memory/test/memory.memsw.limit_in_bytes
# run the process in the newly created cgroup `test` (and get killed)
> cgexec -g memory:test perl -le '@x=[1..1024*1024*10];print "done"'
Killed
# We can check the `dmesg` to confirm it
> dmesg | tail
[76470.936104]  [<ffffffff8148b948>] page_fault+0x28/0x30
[76470.938226] Task in /test killed as a result of limit of /test
[76470.940815] memory: usage 1024kB, limit 1024kB, failcnt 7
[76470.943113] memory+swap: usage 1024kB, limit 1024kB, failcnt 0
[76470.945670] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[76470.948289] Memory cgroup stats for /test: cache:0KB rss:1024KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:976KB inactive_file:0KB active_file:0KB unevictable:0KB
[76470.957530] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[76470.960909] [ 4986]     0  4986    85839      595      22        0             0 perl
[76470.964260] Memory cgroup out of memory: Kill process 4986 (perl) score 2339 or sacrifice child
[76470.968144] Killed process 4986 (perl) total-vm:343356kB, anon-rss:828kB, file-rss:1552kB
>

You can also use the cgroup.procs file to achieve the same without using cgexec.

 - cgroup.procs: list of thread group IDs in the cgroup.  This list is
   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
   should sort/uniquify the list if this property is required.
   Writing a thread group ID into this file moves all threads in that
   group into this cgroup.

To use cgroup.procs, we just need to write our pid to /cgroup/memory/test/cgroup.procs (you might wonder who created this file: when you create a subgroup via mkdir, it gets created for you).

> echo $$ > /cgroup/memory/test/cgroup.procs
> perl -le '@x=[1..1024*1024*10];print "done"'
Killed

cgroups are very powerful; I would suggest you read the docs before using them.

Networking

  • veth - Virtual Ethernet device that comes as a pair of devices; anything that is sent to one device comes out of the other
  • bridge - Virtual Ethernet Bridge Device
  • netns - Network Namespace

Inter Container Communication

+--------------+                                     +--------------+
|              |                                     |              |
| Container 1 (iface)  <====== (bridge) ======> (iface) Container 2 |
|              |                                     |              |
+--------------+                                     +--------------+
  • create a bridge
  • activate the bridge
  • create a vethA pair (vethA1/vethA2)
  • set bridge as the master to one end of the veth pair (vethA1)
  • bring the vethA1 up
  • attach the vethA2 to your container namespace (container1)
    • name the vethA2 interface as eth1 (optional)
    • give ip addr to interface (use the new name or use vethA2)
    • bring up the interface
    • do an arping to make sure interface is good
  • create another veth pair (vethB1/vethB2)
  • set bridge as the master to one end of the veth pair (vethB1)
  • bring the vethB1 up
  • attach the vethB2 to your other container namespace (container2)
    • name the vethB2 interface as eth1 (optional)
    • give ip addr to interface (use the new name or use vethB2)
    • bring up the interface
    • do an arping to make sure interface is good

Example

Start two processes, each with a new network namespace. In case you forgot how, recompile the template code we wrote for Kernel Namespaces after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNET;. Each time you execute the compiled binary you get a process in a new network namespace, so run it twice to get two processes in two different network namespaces.

By default these two containers won't have any interface attached other than lo, so they won't be able to talk to each other. Let's make them talk!

Start Process 1 (prompt 1)

# process 1
> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]

Start Process 2 (prompt 2)

# process 2
> ./bash_ex
Child Pid: [2307] Invoking Command [/bin/bash]

On the Global Namespace

# to make life easier, lets set the 2 pids as our
# namespaces
# (pid from prompt 1)
> pidA=2264
# (pid from prompt 2)
> pidB=2307

# make it ready for `ip netns` to read
# (strace told me so)
> mkdir -p /var/run/netns
> ln -s /proc/$pidA/ns/net /var/run/netns/$pidA
> ln -s /proc/$pidB/ns/net /var/run/netns/$pidB

# create the bridge
> ip link add dev br1 type bridge

# bring up the bridge
> ip link set br1 up

# veth pair I
# mtu can be fetched by calling (ip link show br1)
> ip link add name vethA1 mtu 1500 type veth peer name vethA2 mtu 1500

# enslave vethA1 to br1
> ip link set vethA1 master br1
# bring vethA1 up
> ip link set vethA1 up

# attach other end of veth to a namespace
> ip link set vethA2 netns $pidA
# rename vethA2 to eth1 (optional)
> ip netns exec $pidA ip link set vethA2 name eth1
# attach an ipaddr to the interface
> ip netns exec $pidA ip addr add 192.168.1.1/24 dev eth1
# bring the interface up
> ip netns exec $pidA ip link set eth1 up
# test by an arping
> ip netns exec $pidA arping -c 1 -A -I eth1 192.168.1.1
ARPING 192.168.1.1 from 192.168.1.1 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)

# veth pair II
> ip link add name vethB1 mtu 1500 type veth peer name vethB2 mtu 1500

# enslave vethB1 to br1
> ip link set vethB1 master br1
# bring vethB1 up
> ip link set vethB1 up

# attach vethB to a namespace
> ip link set vethB2 netns $pidB
# rename to eth1 (optional)
> ip netns exec $pidB ip link set vethB2 name eth1
# attach an ipaddr to interface
> ip netns exec $pidB ip addr add 192.168.1.2/24 dev eth1
# bring the interface up
> ip netns exec $pidB ip link set eth1 up
# arping test
> ip netns exec $pidB arping -c 1 -A -I eth1 192.168.1.2
ARPING 192.168.1.2 from 192.168.1.2 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)

# remove the stuff we brought in
> unlink /var/run/netns/$pidA
> unlink /var/run/netns/$pidB
> rmdir /var/run/netns

Testing the Setup

In the prompts you created earlier (that is how you got the PIDs), try to make a connection. (Reuse the same prompts; don't kill the processes already created, and thus the namespaces.)

Prompt 1

> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 6a:8f:4a:61:97:90 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global eth1
..snip..
# listen on one container
> nc -l 1234
hi

Prompt 2

> ./bash_ex
Child Pid: [2307] Invoking Command [/bin/bash]
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 1e:44:08:bf:a5:ad brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.2/24 scope global eth1
..snip..
# talk to the other container 
> echo "hi" | nc 192.168.1.1 1234
> 

To poke into your docker container's network namespace (netns), you need to link its /proc/$pid/ns/net into /var/run/netns and use ip netns exec (the ip command looks into that dir to find namespaces). namespaces(7) has more info about the /proc/$pid/ns files.

eg,

# it might be missing
> mkdir -p /var/run/netns

# link the process in the container
# you should find $pid
> ln -s /proc/$pid/ns/net /var/run/netns/$pid

# list your namespaces (it should return the pid)
> ip netns ls
<pid>

# list the interfaces in the namespace
> ip netns exec <pid> ip addr
....info about interface in the namespace <pid> ...

# please remove what you did after you are done with the experiments
> unlink /var/run/netns/$pid
# don't force delete, someone else too might be mucking around :-)
> rmdir /var/run/netns

A curious mind can do tcpdump -i br1 -n (provided br1 is your bridge name) to see the packets going back and forth.

Host to Container Communication

+---------------------------------------+
|                                       |
|   +-----------+                       |
|   |           |                       |
|   | Container |                       |
|   |          (iface) <===== bridge    |
|   +-----------+               ^       |
|                               |       |
|  Host                        (iface)  |
+---------------------------------------+
  • create a bridge
  • activate the bridge
  • create a vethA pair (vethA1/vethA2)
  • set bridge as the master to one end of the veth pair (vethA1)
  • bring the vethA1 up
  • attach the vethA2 to your container namespace (container1)
    • name the vethA2 interface as eth1 (optional)
    • give ip addr to interface (use the new name or use vethA2)
    • bring up the interface
    • do an arping to make sure interface is good
    • add routing entry from container to bridge
  • assign ip addr to bridge
  • add routing entry from host to container (via bridge)

We need to attach one end of the veth pair to the container while the other end stays in the global namespace (enslaved to the bridge). Here we need to manually assign an ip to the bridge br1, add a routing table entry from the host to the container (via the bridge), and also add a routing entry back from the container to the host (via the container's veth endpoint).

Start Process 1 (prompt 1)

> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]

On Global Namespace

# pid from the prompt 1
> pidA=2264

# make network namespace visible to `ip netns`
> mkdir -p /var/run/netns
> ln -s /proc/$pidA/ns/net /var/run/netns/$pidA

# setup the veth
> ip link add dev br1 type bridge
> ip link set br1 up

> ip link add name vethA1 mtu 1500 type veth peer name vethA2 mtu 1500
> ip link set vethA1 master br1
> ip link set vethA1 up
> ip link set vethA2 netns $pidA

> ip netns exec $pidA ip link set vethA2 name eth1
> ip netns exec $pidA ip addr add 192.168.1.1/24 dev eth1
> ip netns exec $pidA ip link set eth1 up
> ip netns exec $pidA arping -c 1 -A -I eth1 192.168.1.1
ARPING 192.168.1.1 from 192.168.1.1 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
# route entry back to the host
> ip netns exec $pidA ip route add 192.168.2.0/24 dev eth1 proto kernel  scope link src 192.168.1.1

# add ip addr to bridge
> ip addr add 192.168.2.1/24 dev br1
# add a route entry
> ip route add 192.168.1.0/24 dev br1 proto kernel  scope link src 192.168.2.1 

Testing the Setup

Reuse the same prompt, don't kill the process (and thus the namespaces)

Start nc on container Prompt 1

> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
> nc -l 1234
hi

Send "hi" to port 1234 listening on container using nc. You should be seeing "hi" in the container. Global Namespace

> echo hi | nc 192.168.1.1 1234

pivot_root / chroot

pivot_root moves the root file system of the current process to the directory put-old and makes new-root the new root file system. chroot runs a command with a changed root dir. This lets the process run against the rootfs (and the custom libraries) of whatever distribution it prefers, while still sharing the host's kernel.

Example

Earlier we mentioned test.block, which can be mounted and contains the rootfs for centos. It also contains some changes we brought in (a file with content "bar"). You can mount test.block and make your process run with the new mount point as its rootfs.

# mount the ext4 filesystem
> mount -o loop test.block  /tmp/mnt/
# copy the new code to the new mount
> cp bash_ex /tmp/mnt
# change to new root
> cd /tmp/mnt/
# dir for pivot_root
> mkdir put-old
# pivot root, so you can umount put-old
> pivot_root . put-old
# chroot and start your process
> chroot . ./bash_ex
# you are now in /
> pwd
/
# ls should return a view from /
> ls
bash_ex  bin  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  put-old  root  run  sbin  selinux  srv  sys  tmp  usr  var  vigith_test
# mount your proc
> mount -t proc none /proc
# put your resolv.conf
> cat > /etc/resolv.conf
.. write your stuff ..
# fill in the mtab
> cat /proc/mounts > /etc/mtab
# you are process 1 now
> pstree -a -p
bash,1
  └─bash,22 -i
        └─pstree,46 -a -p

Miscellaneous

I am new to this and still learning, so there will be a lot of bugs and misunderstood concepts. Please send me pull requests if you find something really preposterous.
