This repository has been archived by the owner on Sep 8, 2018. It is now read-only.

how to run a cluster in containers #86

Open · timwebster9 opened this issue Dec 5, 2017 · 8 comments

timwebster9 commented Dec 5, 2017

Hi - great project :-)

I'm trying to get an OK Log cluster running as containers in a Rancher environment. I see all the command-line options for starting a cluster, but I think I'm running into issues similar to #51. That issue was closed, so I've opened a new one.

After reading through that issue, I'm not sure what to put for the -cluster option, or if I should be using -cluster.advertise-addr somehow. Ideally I wouldn't have to use hard-coded IP addresses or FQDNs - with Rancher each node is accessible by its service name (basically the name of the service in the docker-compose.yml file). I can ping these no problem from inside the running containers. Maybe I'm going about it the wrong way?

Here's the output of one of the nodes:

ts=2017-12-05T20:50:51.251259081Z level=info cluster_bind=oklog-1:7659
ts=2017-12-05T20:50:51.251327396Z level=warn err="couldn't deduce an advertise address: failed to parse bind addr 'oklog-1'"
ts=2017-12-05T20:50:51.251450516Z level=info fast=tcp://0.0.0.0:7651
ts=2017-12-05T20:50:51.251498846Z level=info durable=tcp://0.0.0.0:7652
ts=2017-12-05T20:50:51.251535893Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-12-05T20:50:51.25159731Z level=info API=tcp://0.0.0.0:7650
ts=2017-12-05T20:50:51.252797008Z level=info ingest_path=data/ingest
ts=2017-12-05T20:50:51.254293637Z level=info store_path=data/store
ts=2017-12-05T20:50:51.254323005Z level=debug component=cluster bind_addr=oklog-1 bind_port=7659 ParseIP=<nil>
ts=2017-12-05T20:50:51.268763952Z level=debug component=cluster received=NotifyJoin node=195a150c-b09b-4d2f-9521-4245be6e7926 addr=:::7659
ts=2017-12-05T20:50:51.345736483Z level=debug component=cluster received=NotifyJoin node=79ca8b0d-d4dd-4058-93fb-e23833b8e74f addr=:::7659
ts=2017-12-05T20:50:54.359662923Z level=debug component=cluster Join=2
ts=2017-12-05T20:50:54.460529024Z level=warn component=Consumer state=gather replication_factor=3 available_peers=2 err="replication currently impossible"
ts=2017-12-05T20:50:55.460798346Z level=warn component=Consumer state=gather replication_factor=3 available_peers=2 err="replication currently impossible"
ts=2017-12-05T20:50:56.461030986Z level=warn component=Consumer state=gather replication_factor=3 available_peers=2 err="replication currently impossible"

peterbourgon commented Dec 6, 2017

Reading the logs, if an oklog process in your environment tries to resolve the string oklog-1, will it get a routable IP?

edit: Maybe you can paste the complete set of flags you use to start the processes?


timwebster9 commented Dec 6, 2017

Hi,

You can see the startup commands in the docker-compose file here: https://github.com/timwebster9/rancher-catalog/blob/master/templates/oklog/0/docker-compose.yml

e.g. command: 'ingeststore -debug -store.segment-replication-factor 3 -cluster oklog-1 -peer oklog-1 -peer oklog-2 -peer oklog-3'

I'm not sure if an OK Log process can resolve those hostnames, but ping definitely can. This is just standard Docker stuff - it would behave the same way in 'normal' Docker Compose (i.e. outside of Rancher).
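
(As an aside, one quick way to see exactly what a service name resolves to from inside a running container, assuming getent is available in the image and with the container name as a placeholder:)

$ docker exec -it <oklog-1-container> getent hosts oklog-1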

Is there a command I can try with oklog to test out what you are asking for?

BTW I have this running fine as a single node, but we typically run multiple hosts per environment, so it would be nice to have clustering as an option in the future (i.e. not a deal-breaker or anything)


peterbourgon commented Dec 7, 2017

OK, so the first error you get is telling you that oklog-1 doesn't parse as an IP, which I guess is obvious :) But this is actually a red herring: the bit of code which produces that log line is just a little sanity check meant to shed some light on what's going on inside the memberlist library. The "failure" (so to speak) shouldn't matter, because the code that actually consumes the oklog-1 string has no problems with hostnames.

I built an oklog:dev container

$ git checkout 2ae3702
$ env GOOS=linux GOARCH=amd64 go build -o oklog_linux_amd64 github.com/oklog/oklog/cmd/oklog
$ cat Dockerfile
FROM ubuntu:16.04
ADD oklog_linux_amd64 /oklog
$ docker build -t oklog:dev .

and then I made this stripped-down docker-compose.yaml

version: "2"

services:
  oklog-1:
    image: oklog:dev
    command: '/oklog ingeststore -debug -store.segment-replication-factor 3 -cluster oklog-1 -peer oklog-1 -peer oklog-2 -peer oklog-3'

  oklog-2:
    image: oklog:dev
    command: '/oklog ingeststore -debug -store.segment-replication-factor 3 -cluster oklog-2 -peer oklog-1 -peer oklog-2 -peer oklog-3'

  oklog-3:
    image: oklog:dev
    command: '/oklog ingeststore -debug -store.segment-replication-factor 3 -cluster oklog-3 -peer oklog-1 -peer oklog-2 -peer oklog-3'

which reproduced the warning messages (bad), but also successfully formed a cluster (good). Can you try that ^^ and see if you get the same behavior?

I've also filed #88 to improve the logging situation, which I'll merge directly. Once that's in, you can try that rev instead, if you like.

timwebster9 commented:

Hi ok yeah that works now - although did you make a functional change or was the logging just misleading?

Sort of related - but there isn't much information to go on in the docs/help about setting up a cluster. For example, do I really want store.segment-replication-factor=3 if I'm running 3 nodes? Does it make any sense to have a different value there? I guess I'm just looking for what you would recommend for the simplest possible case (like what I'm doing here with 3 nodes...). Thanks :-)

timwebster9 commented:

Next I tried to run it in Rancher (I think this would be similar to a k8s setup), with just one container 'definition' but telling it to replicate on every host. I followed your advice in the readme and only used one -peer flag (because I won't know the hostnames or IPs of the other hosts). Both the TCP and UI endpoints are also now behind a load balancer (HAProxy):

services:
  oklog:
    image:  registry.rancher.zone/oklog:dev
    command: '/oklog ingeststore -debug -store.segment-replication-factor 3 -cluster oklog -peer oklog'
    labels:
      io.rancher.scheduler.global: 'true'

It seemed to start up OK (that is, the cluster formed and there were no horrible errors in the logs), and logspout seemed to be forwarding logs. However I wasn't seeing any in the UI. Looking in the /data directories in the containers, only one of them had data in it, and I'm guessing that wasn't the one the UI was hitting.

Then I looked at the OK Log container logs and one of them had this in it (see below).

ts=2017-12-07T20:38:49.68900692Z level=info cluster_bind=oklog:7659
ts=2017-12-07T20:38:49.689054463Z level=warn err="couldn't deduce an advertise address: failed to parse bind addr 'oklog'"
ts=2017-12-07T20:38:49.68924706Z level=info fast=tcp://0.0.0.0:7651
ts=2017-12-07T20:38:49.689292798Z level=info durable=tcp://0.0.0.0:7652
ts=2017-12-07T20:38:49.689331103Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-12-07T20:38:49.689368852Z level=info API=tcp://0.0.0.0:7650
ts=2017-12-07T20:38:49.692868926Z level=info ingest_path=data/ingest
ts=2017-12-07T20:38:52.232325158Z level=info store_path=data/store
ts=2017-12-07T20:38:52.232381798Z level=debug component=cluster bind_addr=oklog bind_port=7659 ParseIP=<nil>
ts=2017-12-07T20:38:52.234012963Z level=debug component=cluster received=NotifyJoin node=0e22cc60-f926-44d4-bce0-7bc2e062458a addr=:::7659
ts=2017-12-07T20:38:52.238185493Z level=debug component=cluster received=NotifyJoin node=27110c7d-f28b-41f0-948a-ec0d009440db addr=:::7659
ts=2017-12-07T20:38:52.240033105Z level=debug component=cluster received=NotifyJoin node=bb43f936-0636-4082-a446-e883a19c5049 addr=:::7659
ts=2017-12-07T20:38:52.240139814Z level=debug component=cluster received=NotifyJoin node=6d53ba4f-472e-4378-b520-0f87195df60b addr=:::7659
ts=2017-12-07T20:38:52.240210544Z level=debug component=cluster received=NotifyJoin node=796009b3-d58d-4949-b4ae-a5b811e3d102 addr=:::7659
ts=2017-12-07T20:38:52.258680637Z level=debug component=cluster Join=3
ts=2017-12-07T20:38:55.563568417Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:55.571320925Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:55.576920555Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:55.582295764Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:55.587756374Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:55.587808914Z level=error component=Consumer op=replicate error="failed to fully replicate: want 3, have 0"
ts=2017-12-07T20:38:58.966004981Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"
ts=2017-12-07T20:38:58.972232201Z level=error component=Consumer op=replicate error="500 Internal Server Error" msg="target [::]:7650, during /replicate: bad status code"

So I don't know if this has anything to do with it or not, but the way the Rancher internal DNS works, the hostname oklog that I used for -peer and -cluster resolves to all 3 IP addresses (one for each replicated container). Can OK Log deal with this?

; <<>> DiG 9.10.3-P4-Ubuntu <<>> oklog
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33597
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;oklog.				IN	A

;; ANSWER SECTION:
oklog.			1	IN	A	10.42.30.129
oklog.			1	IN	A	10.42.226.3
oklog.			1	IN	A	10.42.208.199

;; Query time: 6 msec
;; SERVER: 169.254.169.250#53(169.254.169.250)
;; WHEN: Thu Dec 07 20:49:42 UTC 2017
;; MSG SIZE  rcvd: 71

peterbourgon commented:

> Hi ok yeah that works now - although did you make a functional change or was the logging just misleading?

No functional change, just logging (so far).

> Sort of related - but there isn't much information to go on in the docs/help about setting up a cluster. For example, do I really want store.segment-replication-factor=3 if I'm running 3 nodes? Does it make any sense to have a different value there? I guess I'm just looking for what you would recommend for the simplest possible case (like what I'm doing here with 3 nodes...). Thanks :-)

The replication factor is the number of times each log record will be duplicated within the cluster. (Note that records are always deduplicated when returned by a query.) It's a way of accommodating things like failed nodes, broken disks, etc. at a cost of increased disk usage. If you have a 3-node cluster, then a replication factor of 3 means each log record will be duplicated on each node. In my opinion it's a little excessive. A replication factor of 2 should be OK.
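
For example, with the 3-node setup from the compose file earlier in this thread, dropping to a factor of 2 would only change that one flag (a sketch, not a tested command line):

/oklog ingeststore -debug -store.segment-replication-factor 2 -cluster oklog-1 -peer oklog-1 -peer oklog-2 -peer oklog-3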

> Next I tried to run it in Rancher (I think this would be similar to a k8s setup), with just one container 'definition' but telling it to replicate on every host.

This probably isn't going to work, because...

> but the way the Rancher internal DNS works, the hostname oklog that I used for -peer and -cluster resolves to all 3 IP addresses (one for each replicated container). Can OK Log deal with this?

No :) Each OK Log node needs to be uniquely addressable. In your Rancher example, you'll need to give each node a unique hostname or IP, like what happens in the docker-compose example. And then you either need to bind the OK Log cluster listener to that hostname/IP via -cluster xxx; or, if you want to bind the cluster listener to a different address, you need to make sure you advertise that unique hostname/IP as the routable address to other nodes in the cluster, via e.g. -cluster 0.0.0.0 -cluster.advertise-addr xxx.

> I looked at the OK Log container logs and one of them had this in it (see below).

Those logs tell us that each of the nodes is advertising itself as reachable on the address :::7659, which is obviously non-routable: when the state machine tries to replicate segments by HTTP POSTing to that address, we see that it fails. I guess this is an artifact of passing a non-unique hostname to -cluster.

Your next step to getting this working in Rancher is to give each node a unique hostname, like oklog-{1..N}, and to change the flags to use that unique hostname as the -cluster parameter. Each oklog-{1..N} hostname should resolve to a single IP.
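
For illustration, a sketch of how the single global service above could be split into uniquely named services, reusing the flags from this thread (untested, just to show the shape):

services:
  oklog-1:
    image: registry.rancher.zone/oklog:dev
    command: '/oklog ingeststore -debug -store.segment-replication-factor 2 -cluster oklog-1 -peer oklog-1 -peer oklog-2 -peer oklog-3'

  # oklog-2 and oklog-3 are defined the same way, with -cluster oklog-2 and
  # -cluster oklog-3 respectively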

timwebster9 commented:

Hi thanks for your reply - yeah it would work fine with the explicitly defined services like in the docker-compose file...

I guess to make this work with things like autoscaling (or manual scaling of the 'click a button type') we would have to get into 'service discovery' territory?

I haven't taken the time to fully understand how the clustering works, but with the 'gossip' protocol you are using I wouldn't think it would be a big stretch to make this work. For example, would a node be able to automatically determine its own IP address and advertise that, without it having to be explicitly configured somehow? As long as there was one node resolvable by an explicit IP/hostname (say, a 'bootstrapping' node that only needs to be available/explicitly resolvable when the cluster is started), could the others then be scaled up with that information?

I don't want to try and shoehorn my requirement into your project - but rather just thinking out loud about how this could work in an autoscaling fashion in any orchestration environment.

Autoscaling services kind of go hand-in-hand with container orchestration, although I realise that most such services are usually stateless. But I feel OK Log is on the right track with the replication and gossip protocol you have in place...


peterbourgon commented Dec 11, 2017

> I guess to make this work with things like autoscaling (or manual scaling of the 'click a button type') we would have to get into 'service discovery' territory?

As a general handwavey statement, it's not really possible to autoscale stateful clustered applications like OK Log — or, perhaps better said, it's possible to do, but very easy to do really badly, and end up worse off than if you hadn't tried in the first place. Service discovery is an important part of that, but so, too, is figuring out how to manage persistence volumes, spread ingest load, etc. etc.

> I haven't taken the time to fully understand how the clustering works, but with the 'gossip' protocol you are using I wouldn't think it would be a big stretch to make this work. For example, would a node be able to automatically determine its own IP address

In general it's not possible for a process to determine its own IP address. There are too many factors at play: multiple network interfaces on a box or virtual machine, layers of NAT (especially acute in our post-Docker containerized world!!), firewalls, etc. etc.

The safest, easiest, and most reliable way of doing this is to declare which interface(s) the process should bind to, and (when necessary) declare which IP address the process should advertise itself as being reachable via. These correspond to the -cluster and -cluster.advertise-addr flags respectively.
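
A minimal sketch of that split, reusing the flag names from this thread (the advertised address is a hypothetical routable IP for the node):

# bind the cluster listener on every interface inside the container,
# but advertise the address other nodes can actually reach
/oklog ingeststore -cluster 0.0.0.0 -cluster.advertise-addr 10.0.1.17 -peer oklog-1 -peer oklog-2 -peer oklog-3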

> I don't want to try and shoehorn my requirement into your project - but rather just thinking out loud about how this could work in an autoscaling fashion in any orchestration environment.

Understood. I've thought long and hard about these things :) and have been on the frontlines of deploying and operating clustered distributed systems for a long time. OK Log represents my current best effort guess at how this should work. With that said, there may be interesting opportunities for making OK Log more elastic, if we're willing to sacrifice some performance. I'll think on this a bit.

I am quite curious to hear if you manage to get it going, though. Please let me know if you hit any other roadblocks, I'll be happy to help however I can.
