
UX: Keep the 50 most recent envs rather than the first 50 #103

Merged
OsamaSayegh merged 8 commits into discourse:master from recent-envs on Jan 10, 2020
Conversation

OsamaSayegh
Member

At the moment when Logster merges similar messages, it keeps the envs of the first 50 occurrences and then stops accepting new envs. This PR changes that so that Logster keeps the envs of the 50 most recent occurrences.

I think it makes sense and is more useful to have the 50 most recent envs; however, this change comes at a performance cost. It effectively reverts the performance improvement we made in this commit: cbca14a. Right now, when Logster merges two similar messages, it checks the number of envs, and if we already have 50 it doesn't bother loading the env array from redis and subsequently doesn't need to write the array back. With this change, we're forced to always load the env array and write it back to redis after adding the new env. Thoughts on this @SamSaffron?
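
For concreteness, here is the behavioural difference in miniature (a plain-Ruby sketch, not Logster's actual code):

```ruby
MAX_ENVS = 50

# Current behaviour: keep the first 50 envs, then stop accepting new ones.
# Once the cap is hit, the stored env count alone is enough to short-circuit,
# so the env array never has to be loaded from redis.
def keep_first(envs, new_env)
  envs << new_env if envs.size < MAX_ENVS
  envs
end

# Proposed behaviour: always accept the new env and drop the oldest.
# This always touches the array, which is where the performance cost comes from.
def keep_most_recent(envs, new_env)
  ([new_env] + envs).first(MAX_ENVS)
end
```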

@SamSaffron
Member

Can we use a different data structure in redis that supports this transparently, like the sort of list message_bus uses?

something like this? https://github.com/SamSaffron/message_bus/blob/master/lib/message_bus/backends/redis.rb#L110-L124
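
(That backend caps its backlog with a push followed by a trim. A minimal sketch of the same idiom with the redis gem; the key name here is hypothetical, not Logster's actual key:)

```ruby
require "redis"
require "json"

# Push-and-trim: prepend the newest env, then cut the list back to 50.
# Redis does the "most recent N" bookkeeping, so adding an env never
# requires reading the whole list back into memory.
def push_env(redis, message_key, env)
  key = "logster-env-#{message_key}"
  redis.lpush(key, JSON.generate(env)) # newest entry at the head
  redis.ltrim(key, 0, 49)              # keep only the 50 most recent
end
```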

@OsamaSayegh
Member Author

We could use a different data structure, but then we'd need a top-level key in redis for each message/log. We currently have a single hash where each key is a message id pointing to a JSON string of that message's env array.

We'd need to change that to a separate sorted set (or a simple list would also work) for each message. At most, Logster would need as many lists/keys as Logster.store.max_backlog, which is 1000 by default. Is this OK? Could it cause problems?

@SamSaffron
Member

I think it is fine for Logster to have lots of keys/lists in redis; we just need to make sure we clean up properly on clear-all and be careful that our operations never leave orphaned bits and pieces around.
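
(A sketch of what that clear-all cleanup could look like; the key pattern is hypothetical, and SCAN is used instead of KEYS so a large keyspace doesn't block redis:)

```ruby
require "redis"

# Delete every per-message env list when all messages are cleared,
# so no orphaned lists are left behind.
def clear_env_lists(redis)
  cursor = "0"
  loop do
    cursor, keys = redis.scan(cursor, match: "logster-env-*", count: 1000)
    redis.del(*keys) unless keys.empty?
    break if cursor == "0"
  end
end
```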

@OsamaSayegh
Member Author

Sounds good!

I can think of two approaches we could use to migrate from the one hash we currently have to a list per message:

  1. Create a script that goes through all the messages and moves each one's env to its own list. This would be a one-off script that runs when Logster boots up for the first time with this change (the script would set a key in redis when it finishes so that it knows it doesn't have to run again).

  2. Or defer the migration until a message is read from redis. We'd add logic to RedisStore#get and RedisStore#bulk_get that first checks whether the env exists in a top-level list; if it does, we use that and no further action is needed. If the list doesn't exist, we'd fetch and remove the env from the old hash and write it to a top-level list. That way things gradually migrate away from the hash to top-level lists (sketched after this list).

For 2, we'd have extra overhead when fetching messages, so at some point in the future (maybe 6 months?) we'd need to drop the extra logic. I'm not too sure about 1; do you see any downsides to it? Any other better approaches?
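
Approach 2 on the read path would look roughly like this (a sketch; the key names and helper shape are hypothetical, not the final implementation):

```ruby
require "json"

# Read a message's envs, migrating them out of the legacy hash on first touch.
def read_envs(redis, msg_key)
  list_key = "logster-env-#{msg_key}"
  entries = redis.lrange(list_key, 0, -1)
  return entries.map { |e| JSON.parse(e) } unless entries.empty?

  # Not migrated yet: move this message's env out of the old hash
  # into its own list, then delete it from the hash.
  old = redis.hget("logster-env", msg_key)
  return [] unless old

  envs = JSON.parse(old)
  envs = [envs] unless envs.is_a?(Array)
  envs.reverse_each { |e| redis.lpush(list_key, JSON.generate(e)) } # newest first
  redis.hdel("logster-env", msg_key)
  envs
end
```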

@SamSaffron
Member

I am not super concerned about migration, we can just clear logs and start from scratch.

I guess if you really want something, a script is fine.

@OsamaSayegh
Member Author

Ok, I've made the changes so that env is stored in a separate list for each message. However, there is one issue that affects the maximum_message_size_bytes config we added recently. It currently relies on the entire env of the message being in memory: it calculates the size of the message with its env and determines whether the message would still be below the limit after adding the extra env:

```ruby
def merge_similar_message(other)
  self.first_timestamp ||= self.timestamp
  self.timestamp = [self.timestamp, other.timestamp].max
  self.count += other.count || 1
  return false if self.count > Logster::MAX_GROUPING_LENGTH

  size = self.to_json(exclude_env: true).bytesize + self.env_json.bytesize
  extra_env_size = other.env_json.bytesize
  return false if size + extra_env_size > Logster.config.maximum_message_size_bytes

  other_env = JSON.load JSON.fast_generate other.env
  if Hash === other_env && !other_env.key?("time")
    other_env["time"] = other.timestamp
  end
  if Hash === self.env && !self.env.key?("time")
    self.env["time"] = self.first_timestamp
  end

  if Array === self.env
    Array === other_env ? self.env.concat(other_env) : self.env << other_env
  else
    Array === other_env ? self.env = [self.env, *other_env] : self.env = [self.env, other_env]
  end
  @env_json = nil
  true
end
```

With this change, we no longer have env in memory when the similar message receives the new env; the new env is handed off to redis which will take care of prepending the new env to the env list and trimming the list from the end to keep it at 50 envs.

I can think of two ways to fix this:

  1. The maximum_message_size_bytes config currently only looks at env and trims it to bring the total message size below the limit, because env is usually the elephant in the room. I'm thinking we could drop this config and add a new one, maximum_env_hash_size_bytes, to limit the size of a single hash within the env list, i.e. if env is a list of 10 hashes, each hash shouldn't exceed this config; if one does, it's dropped and never stored.

  2. Redis has a MEMORY USAGE command that returns the number of bytes a key and its value take up, but per the documentation this number includes the internal administrative overhead redis needs for the key, so it can be a little off. I was thinking I could write a Lua script that pushes the new env when we merge messages, and as part of that process keeps dropping old env hashes until the size falls below the limit.

@SamSaffron
Member

I would simply introduce new settings here: have maximum_message_size_bytes ignore env, then add two settings:

  1. Maximum size a single env can be (if it is larger, truncate it).
  2. Maximum number of envs per message.

I think this is a far simpler approach and we can ship with sane defaults here.

@OsamaSayegh
Member Author

@SamSaffron I've added two new settings, maximum_size_of_single_env_bytes and maximum_number_of_env_per_message, and made maximum_message_size_bytes ignore env. Here is what each one does:

maximum_size_of_single_env_bytes: the maximum size a single env (a hash object) can be. The default value is 1000. If an env is larger than the limit, we remove the minimum number of keys needed to bring its size below the limit.

maximum_number_of_env_per_message: if env is an array, this config limits the number of elements in the array. The default is 50.

maximum_message_size_bytes: the size of the message minus its env. The default value is 10,000. If a message is above this limit, we first remove all occurrences of the gems directory from the backtrace and compute the size again. If it's still larger than the limit, we remove the last line from the backtrace and repeat until the size falls below the limit. There is protection to ensure we never get stuck in an infinite loop.
Note: this setting won't really work with very low values (e.g. < 1000) because there are message attributes we can't touch, such as timestamp, count, key, and progname, and they contribute to the message size. I think that's fine; we can mention this caveat in the readme.
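
The trimming loop for maximum_message_size_bytes, in sketch form (a standalone illustration, not the actual implementation):

```ruby
# Thin a backtrace until base_size (the untouchable attributes) plus the
# backtrace fits under the limit: first drop all gem frames, then drop
# trailing lines one at a time.
def trim_backtrace(base_size, backtrace, limit)
  if base_size + backtrace.join("\n").bytesize > limit
    backtrace = backtrace.reject { |line| line.include?("/gems/") }
  end
  # The array shrinks on every pass, so this cannot loop forever; an empty
  # backtrace is the stopping point even if base_size alone still exceeds
  # the limit (the caveat noted above).
  while base_size + backtrace.join("\n").bytesize > limit && !backtrace.empty?
    backtrace.pop
  end
  backtrace
end
```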

Does this sound good? Any changes you would like me to make?

@SamSaffron
Member

I would go with the names:

max_env_bytes and max_env_count_per_message

Changes sound good to me! Be sure to update the readme here to specify all the settings we have.

@OsamaSayegh
Member Author

I've renamed the settings and added some documentation about them in the readme. Merging this in now 🎉
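
With the final names, configuration looks like this (defaults as discussed above; assuming the new settings are set the same way as maximum_message_size_bytes):

```ruby
Logster.config.maximum_message_size_bytes = 10_000 # message size, excluding env
Logster.config.max_env_bytes = 1_000               # cap on a single env hash
Logster.config.max_env_count_per_message = 50      # envs kept per message
```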

OsamaSayegh merged commit 4a81cfd into discourse:master on Jan 10, 2020
OsamaSayegh deleted the recent-envs branch on January 10, 2020