New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite agent websocket client #2560
Conversation
|
||
case msg | ||
when String | ||
@ws.send(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: consider having a separate defer
thread or something to send these... these calls can block the actor, but they are also not thread-safe, because each @ws.send
-> multiple socket write
calls may not be atomic with large messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defer does not make sense here because it messes order of message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean a single defer
thread that each send
call goes through, to preserve the order and atomicity of messages.
Alternatively, instead of assuming that the Websocket::Driver
class is thread-safe for concurrent parse
, text/binary
, ping
, close
etc calls (which I doubt...), I think I'll just throw in a mutex in the Kontena::Websocket::Driver
to serialize all of those calls.
|
||
case msg | ||
when String | ||
@ws.send(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defer does not make sense here because it messes order of message.
# Valid after on :open | ||
# | ||
# @return [Websocket::Driver::Headers] | ||
def heaers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
I think the def websocket_connect
# connecting
ws = Kontena::Websocket::Client.new(url, ...)
ws.listen do |message|
actor.on_message(message)
end
ws.run do
actor.on_open
end
# server closed connection cleanly with code 1000
rescue Kontena::Websocket::CloseError => error
actor.on_close
rescue Kontena::Websocket::Error => error
actor.on_error
ensure
# disconnected
end
end Whereby the EDIT: the api has now been slightly changed: def websocket_connect
# connecting
@ws = Kontena::Websocket::Client.new(url, ...)
@ws.on_message do |message|
actor.on_message(message)
end
@ws.run do
actor.on_open
end
actor.on_close(@ws.close_code, @ws.close_reason)
rescue Kontena::Websocket::Error => exc
actor.on_error(exc)
ensure
@ws = nil
end
end |
Started writing some pseudo-e2e certs, where the
Still needs a bit more work to test the actual websocket messages :) EDIT: done
|
agent/bin/kontena-agent
Outdated
api_token: api_token | ||
api_token: api_token, | ||
ssl_verify: ENV['KONTENA_SSL_VERIFY'], | ||
ssl_ca: ENV['KONTENA_SSL_CA'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't necessarily need the KONTENA_SSL_CA
, because openssl itself understands SSL_CERT_DIR=
and SSL_CERT_FILE=
when using the default cert store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't need this then we probably should not add it (hard to remove later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... somehow just replacing KONTENA_SSL_CA=
with SSL_CERT_FILE=
didn't work... keeping this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: For the CLI SSL_CERT_FILE=
, excon actually handles it: https://github.com/excon/excon/blob/v0.57.1/lib/excon/ssl_socket.rb#L34
I find the behavior of the libssl SSL_CERT_FILE=
env in the kontena/agent
container odd... it doesn't seem to be respected by the default cert store:
irb(main):015:0> ssl_cert = OpenSSL::X509::Certificate.new File.read(ENV['SSL_CERT_FILE'])
=> #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name:0x0055eafa90aef8>, issuer=#<OpenSSL::X509::Name:0x0055eafa90af20>, serial=#<OpenSSL::BN:0x0055eafa90af48>, not_before=2017-07-06 08:07:56 UTC, not_after=2018-07-06 08:07:56 UTC>
irb(main):013:0> ssl_store = OpenSSL::X509::Store.new
=> #<OpenSSL::X509::Store:0x0055eafa9b6280 @verify_callback=nil, @error=nil, @error_string=nil, @chain=nil, @time=nil>
irb(main):014:0> ssl_store.set_default_paths
=> nil
irb(main):016:0> ssl_store.verify(ssl_cert)
=> false
Have to explicitly set the ENV['SSL_CERT_FILE']
as a CA path for it to work:
irb(main):017:0> ssl_store = OpenSSL::X509::Store.new
=> #<OpenSSL::X509::Store:0x0055eafa8ecf70 @verify_callback=nil, @error=nil, @error_string=nil, @chain=nil, @time=nil>
irb(main):018:0> ssl_store.add_file ENV['SSL_CERT_FILE']
=> #<OpenSSL::X509::Store:0x0055eafa8ecf70 @verify_callback=nil, @error=nil, @error_string=nil, @chain=nil, @time=nil>
irb(main):019:0> ssl_store.verify(ssl_cert)
=> true
Same thing with ssl_context.set_params
vs ssl_context.set_params(ca_file: ENV['SSL_CERT_FILE'])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compare this to ruby 2.3.1 on Ubuntu 16.04:
irb(main):001:0> require 'openssl'
=> true
irb(main):002:0> ssl_cert = OpenSSL::X509::Certificate.new File.read(ENV['SSL_CERT_FILE'])
=> #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name:0x000000026ad130>, issuer=#<OpenSSL::X509::Name:0x000000026ad158>, serial=#<OpenSSL::BN:0x000000026ad180>, not_before=2017-07-06 08:07:56 UTC, not_after=2018-07-06 08:07:56 UTC>
irb(main):003:0> ssl_store = OpenSSL::X509::Store.new
=> #<OpenSSL::X509::Store:0x000000026b49f8 @verify_callback=nil, @error=nil, @error_string=nil, @chain=nil, @time=nil>
irb(main):004:0> ssl_store.set_default_paths
=> nil
irb(main):005:0> ssl_store.verify(ssl_cert)
=> true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why the libssl envs are broken on the agent alpine image, but I implemented env handling in the websocket client class, similarly to what excon does.
Agent specs are now ✅, and the websocket client specs are now much more comprehensive than with the EM-based websocket. The timeouts still need work. |
8da6b47
to
a2a80dc
Compare
Untangled from #2559 by moving the |
agent/lib/kontena/agent.rb
Outdated
start_em | ||
@client.ensure_connect | ||
# XXX: does not re-start after crash | ||
Celluloid::Actor[:websocket_client].async.start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO
Timeouts are now implemented, including automatic keepalive pings:
|
@connection = nil | ||
|
||
# TODO: errors and timeout? SSLSocket.close in particular is bidirectional? | ||
@socket.close if @socket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it's still possible for this to hang without a timeout, particularly looking at OpenSSL::SSL::SSLSocket#close
-> ossl_ssl_shutdown
, which makes multiple calls to SSL_shutdown
, which waits for the server to reply to the close alert... not sure now to deal with that yet?
A plain TCPSocket#close
could also block, depending on SO_LINGER
, but this is murky stuff...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick testing with ip route add blackhole
shows that this close
does not block with an SSL connection:
D, [2017-07-11T14:44:46.857275 #1] DEBUG -- Kontena::Websocket::Client: ping on read timeout after 29.995182353s
D, [2017-07-11T14:44:46.857926 #1] DEBUG -- Kontena::Websocket::Client: pinging with id=20
D, [2017-07-11T14:44:46.859518 #1] DEBUG -- Kontena::Websocket::Client::Connection: write #buf=8: size=8
D, [2017-07-11T14:44:46.860144 #1] DEBUG -- Kontena::Websocket::Client::Connection: wait read: timeout=4.99993131
D, [2017-07-11T14:44:47.169036 #1] DEBUG -- Kontena::RpcClient: waiting 16.0s of 30.0s until: request /nodes/update has response wth id=1343584851
D, [2017-07-11T14:44:48.881232 #1] DEBUG -- Kontena::RpcClient: waiting 16.0s of 30.0s until: request /node_service_pods/list has response wth id=857451443
D, [2017-07-11T14:44:51.865077 #1] DEBUG -- Kontena::Websocket::Client: disconnect
E, [2017-07-11T14:44:51.865846 #1] ERROR -- Kontena::WebsocketClient: websocket error: read timeout after 4.99993131s while waiting 5.0s for pong
I, [2017-07-11T14:44:51.906012 #1] INFO -- Kontena::WebsocketClient: connecting to master at wss://kontena.test:9293
Which is a little surprising, I might have expected it to...
@socket.write_nonblock(buf) | ||
end | ||
debug "write #buf=#{buf.size}: size=#{size}" | ||
buf = buf[size..-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See reidmorrison/net_tcp_client#10 for why this is important.
agent/bin/kontena-agent
Outdated
api_token: api_token | ||
api_token: api_token, | ||
ssl_verify: ENV['KONTENA_SSL_VERIFY'], | ||
ssl_ca: ENV['KONTENA_SSL_CA'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't need this then we probably should not add it (hard to remove later).
@@ -58,7 +59,7 @@ def handle_request(ws_client, message) | |||
# @param [WebsocketClient] ws_client | |||
# @param [Array, Hash] msg | |||
def send_message(ws_client, msg) | |||
ws_client.send_message(MessagePack.dump(msg).bytes) | |||
ws_client.async.send_message(MessagePack.dump(msg).bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why async
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ws_client.send_message
can block or raise errors... no point calling it sync, because the rpc server can't really do anything about that. They will get logged as warnings by the websocket client.
TODO: agent probably also needs some |
ssl_verify_context = OpenSSL::X509::StoreContext.new(ssl_cert_store) | ||
ssl_verify_context.error = verify_result | ||
|
||
Kontena::Websocket::SSLVerifyError.new(ssl_verify_context.error, "certificate verify failed: #{ssl_verify_context.error_string}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a hack with the OpenSSL::X509::StoreContext#error=
, but it was the only way I could figure out how to call the libssl X509_verify_cert_error_string
on the OpenSSL::SSL::SSLSocket#verify_result
and get nice verify error messages likecertificate verify failed: self signed certificate
.
end | ||
|
||
def handle_invalid_token | ||
error 'master does not accept our token, shutting down ...' | ||
EM.next_tick { abort('Shutting down ...') } | ||
Kernel.abort('Shutting down ...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: This doesn't work in a celluloid actor:
E, [2017-07-13T13:08:35.672300 #1] ERROR -- Kontena::WebsocketClient: master does not accept our token, shutting down ...
Shutting down ...
E, [2017-07-13T13:08:35.672802 #1] ERROR -- : Actor crashed!
SystemExit: Shutting down ...
/app/lib/kontena/websocket_client.rb:261:in `abort'
/app/lib/kontena/websocket_client.rb:261:in `handle_invalid_token'
/app/lib/kontena/websocket_client.rb:247:in `on_close'
/app/lib/kontena/websocket_client.rb:133:in `run_websocket'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'
E, [2017-07-13T13:08:35.679187 #1] ERROR -- : thread crashed
SystemExit: Shutting down ...
/app/lib/kontena/websocket_client.rb:261:in `abort'
/app/lib/kontena/websocket_client.rb:261:in `handle_invalid_token'
/app/lib/kontena/websocket_client.rb:247:in `on_close'
/app/lib/kontena/websocket_client.rb:133:in `run_websocket'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
/usr/lib/ruby/gems/2.3.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'
I, [2017-07-13T13:08:35.684933 #1] INFO -- Kontena::WebsocketClient: initialized with token tBsoc...
I, [2017-07-13T13:08:36.687613 #1] INFO -- Kontena::WebsocketClient: connecting to master at wss://...
# SSL_connect returned=1 errno=0 state=error: certificate verify failed | ||
if exc.message.end_with? 'certificate verify failed' | ||
# ssl_socket.peer_cert is not set on errors :( | ||
raise ssl_verify_error(ssl_socket.verify_result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lack of ssl_socket.peer_cert
on verify errors means that the resulting agent websocket client errors with KONTENA_SSL_VERIFY=true
are less informative:
E, [2017-07-13T14:09:22.617530 #1] ERROR -- Kontena::WebsocketClient: unable to connect to SSL server with KONTENA_SSL_VERIFY=true: certificate verify failed: self signed certificate
Compared to the KONTENA_SSL_VERIFY=
warning, when using ssl_cert!
-> ssl_verify_cert!
:
W, [2017-07-13T14:10:03.892937 #1] WARN -- Kontena::WebsocketClient: insecure connection established with SSL errors: certificate verify failed: self signed certificate: /CN=kontena.test (issuer /CN=kontena.test)
It might be possible to include the cert information in these verify errors by replacing the built-in verify_mode: OpenSSL::SSL::VERIFY_PEER
with the equivalent (?) verify checks that already happen in ssl_verify_cert!
, but I'm not 100% sure yet that those are the same?
end | ||
elsif ssl_cert | ||
if !ssl_verify | ||
warn "secure connection established without KONTENA_SSL_VERIFY=true: #{ssl_cert.subject} (issuer #{ssl_cert.issuer})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning because while this specific connection is secure, the configuration is still vulnerable to MITM attacks, because the agent will still accept an invalid cert unless you update the configuration to use KONTENA_SSL_VERIFY=true
.
agent/lib/kontena/agent.rb
Outdated
def initialize(opts) | ||
@@instance = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a singleton to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean include Singleton
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Means that it's now configure(opts)
, because it doesn't look like a Singleton
can be initialized with parameters?
@close_timeout = close_timeout | ||
@write_timeout = write_timeout | ||
|
||
unless @uri.scheme == 'ws' || @uri.scheme == 'wss' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might break old installations that have http(s)
scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The agent bin/kontena-agent
normalizes the URI:
if api_uri.match(/^http.*/)
api_uri = api_uri.sub('http', 'ws')
end
end | ||
end | ||
|
||
#protected XXX: called by specs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... the specs call the protected methods. And the send
method breaks the first workaround that came to mind :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the send(message)
method to send_message
, and restored the protected
methods... the specs use subject.send :method
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the send_message
vs protected send
specs, this is now back to #protected XXX
which is TODO on refactoring those protected methods called by specs out to separate classes... most of that is the TCP/SSL connection stuff, which is entirely independent of the websocket client.
end | ||
|
||
# TODO: connect_deadline to impose a single deadline on the entire process | ||
# XXX: specs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main remaining TODO is that ssl_connect
can take longer than the connect_timeout
, because the same timeout is used for each read/write step of the SSL handshake... needs to be fixed + spec'd to use a combined deadline.
dea3f49
to
4077e6a
Compare
API design review: High leveldef websocket_client
Kontena::Websocket::Client.connect(url, **options) do |client|
# websocket handshake complete
# other threads can start writing messages
Thread.new { client.send('foo') }
# this blocks until the websocket is closed
client.read do |msg|
if msg == 'go away'
client.close
end
end
end
rescue Kontena::Websocket::CloseError => error
# server closed connection
rescue Kontena::Websocket::ConnectError => error
# failed to open
rescue Kontena::Websocket::Error => error
# ...
else
# client closed connection, and server completed close handshake with close frame
end Low levelws_client = Kontena::Websocket::Client.new(url, **options)
Thread.new {
sleep 1 until ws_client.open?
ws_client.send('foo') # raises if not connected
ws_client.close
}
ws_client.on_pong do |time, delay|
end
def connect_websocket(ws_client)
ws_client.connect # blocks until open, raises Kontena::Websocket::ConnectError
# alternatively: ws_client.read do |msg|
# returns nil (EOF) once client-initiated close handshake complete (server responded with close)
while msg = ws_client.read
end
rescue Kontena::Websocket::Error
# can raise from connect or read
ensure
ws_client.disconnect
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
debug "ping-pong at #{ping_at} in #{ping_delay}s" | ||
|
||
# XXX: defer call without mutex? | ||
@on_pong.call(ping_delay) # TODO: also pass ping_at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced the @queue
blocks with just @message_queue
, so now the driver ping callback -> client on_pong
callback gets called directly, with the driver mutex held... that means that the following fails with a ThreadError: deadlock; recursive locking
ATM:
client.on_pong do
client.close
end
Need to either accept that, or restore some kind of deferred-pong-callback-execution thing.
…d.abort to safely re-raise errors
99b8ed6
to
3d0a2bd
Compare
Split the This branch installs it directly from github while the |
Needs re-approval once the final kontena-websocket-client is released
@@ -7,7 +7,7 @@ RUN apk update && apk --update add tzdata ruby ruby-irb ruby-bigdecimal \ | |||
ADD Gemfile /app/ | |||
ADD Gemfile.lock /app/ | |||
|
|||
RUN apk --update add --virtual build-dependencies ruby-dev build-base openssl-dev && \ | |||
RUN apk --update add --virtual build-dependencies ruby-dev build-base openssl-dev git && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The git
is required for bundle install
when testing changes to kontena-websocket-client
using gems from git.
Quick testing to verify that the examples cases still work with the latest code. |
Fixes #2500 by replacing the EventMachine-based
faye-websocket
client with the newkontena-websocket-client
running as a Celluloid actor, which has full support for SSL certificate verification.Drop
eventmachine
andfaye-websocket
dependenciesNow with 100% less EventMachine.
Add
kontena-websocket-client
dependencySupports SSL certification verification.
Agent
WebsocketClient
is now a celluloid actorThe new
Kontena::Websocket::Client
should implement open/ping/close timeouts with keepalive pings at least as robustly as the old EM-based client, probably even more robustly.Fix agent RPC server/client websocket message sends to log a warning and abort the celluloid call on websocket send errors
Testing
Default behavior with
KONTENA_URI=wss://
+ invalid certIntended to be backwards-compatible, but show a warning with the validation error + cert details.
Default behavior with
KONTENA_URI=wss://
+ valid certBackwards compatible, but suggesting to configure
KONTENA_SSL_VERIFY=true
to protect from MITM attacks:Error case with
KONTENA_SSL_VERIFY=true
(not signed by default CAs)Error case with verify + CA, but wrong host
Error case with valid public cert, but invalid server (not a kontena master)
Error case with DNS errors
Success behavior with valid custom cert (using
SSL_CERT_FILE=/etc/kontena-agent/ca.pem
)Success with a public cert (using a hosted master)
Success with a custom
KONTENA_SSL_HOSTNAME=kontena.test
Error case with DNS MITM + invalid self-signed cert
Error case with DNS MITM + valid cert, but wrong subject