RFC for QoS: Quality of Service #58


@gregwebs (Contributor) commented Sep 24, 2020

[Figure: qos-architecture]

[Figure: QoS-capacity-slice]


#### QoS Policy stored in PD

A QoS policy is set by an administrator in PD. It is a combination of a region group and a QoS value. The main region group is a key space. A smaller region group within a key space, such as a table, may also be specified, and its QoS setting takes precedence over that of the key space. These groups are dynamic (new regions can be added) and are translated to regions by PD, which has knowledge of tenant and table groupings.
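
To make the precedence rule concrete, here is a minimal sketch of how a lookup might resolve a key to a QoS value. Everything named here (`RegionGroup`, `QoSPolicy`, `resolve_qos`, the integer QoS value, and modelling a group as a half-open key range) is an assumption for illustration; the RFC does not specify a representation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionGroup:
    """Hypothetical region group modelled as a half-open key range [start, end)."""
    name: str
    start: bytes
    end: bytes

    def contains(self, key: bytes) -> bool:
        return self.start <= key < self.end


@dataclass(frozen=True)
class QoSPolicy:
    """Hypothetical policy: a region group plus an opaque QoS value."""
    group: RegionGroup
    qos: int


def resolve_qos(policies: Iterable[QoSPolicy], key: bytes) -> Optional[int]:
    """Return the QoS value of the innermost group containing `key`.

    Assumes groups are properly nested (a table lies inside its key space),
    so the innermost match is the one enclosed by every other match.
    """
    best: Optional[QoSPolicy] = None
    for p in policies:
        if not p.group.contains(key):
            continue
        if best is None or (
            best.group.start <= p.group.start and p.group.end <= best.group.end
        ):
            best = p
    return best.qos if best is not None else None


policies = [
    QoSPolicy(RegionGroup("tenant-a", b"a", b"b"), qos=10),
    QoSPolicy(RegionGroup("tenant-a/t1", b"a/t1", b"a/t2"), qos=50),
]
table_qos = resolve_qos(policies, b"a/t1/row7")   # table policy applies
tenant_qos = resolve_qos(policies, b"a/t3/row1")  # falls back to the key space
```

The sketch picks the innermost matching group, so a table-level policy overrides the key-space policy regardless of the order in which policies are stored.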

Presumably when a region splits, the new regions inherit the QoS parameters of their parent. What happens when two regions with different QoS are merged?

Does PD have knowledge of how tables/tenants are represented within a key space? My assumption is that only TiDB knows this.

Contributor Author

I am a little fuzzy on this detail, but I know we can now prevent tables from sharing regions, so it should be possible for PD to know this.

Member

@BusyJay BusyJay Oct 12, 2020

Perhaps QoS policy should bind to a range instead of some regions, just like placement rules.

Contributor Author

> Perhaps QoS policy should bind to a range instead of some regions, just like placement rules.

This proposal doesn't closely specify how regions will be grouped. Grouping by key range could be a great solution. The problem with this approach is that a region could span multiple key ranges. We could use key ranges as the underlying primitive but also offer APIs that talk in terms of key spaces. We can also reject key ranges that don't fully enclose the existing regions they overlap. As noted in placement rules, the key range of a table can change due to DDL commands. So I am thinking that for the first version of QoS, PD can understand key spaces but won't understand tables, and may need to accept key ranges.
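
As a rough illustration of the "reject key ranges that don't fully enclose their regions" idea, the check could look something like the sketch below. The function name `misaligned_regions` and the representation of regions as `(start, end)` byte-string pairs with finite bounds are assumptions for the example, not PD's actual data model.

```python
from typing import List, Sequence, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open [start, end), finite bounds assumed


def misaligned_regions(
    range_start: bytes, range_end: bytes, regions: Sequence[KeyRange]
) -> List[KeyRange]:
    """Return regions that overlap the key range without lying fully inside it.

    An empty result means the proposed range is aligned to region boundaries
    and could be accepted; a non-empty result lists the regions that would
    end up governed by more than one policy.
    """
    bad = []
    for r_start, r_end in regions:
        overlaps = r_start < range_end and range_start < r_end
        enclosed = range_start <= r_start and r_end <= range_end
        if overlaps and not enclosed:
            bad.append((r_start, r_end))
    return bad


regions = [(b"a", b"c"), (b"c", b"f"), (b"f", b"k")]
aligned = misaligned_regions(b"c", b"f", regions)     # matches a region exactly
straddling = misaligned_regions(b"b", b"f", regions)  # cuts through (a, c)
```

An alternative to rejecting a misaligned range would be to split the offending regions at the range boundaries first; the sketch only covers detection.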


Ti Components are loosely coupled:
* PD stores policies and communicates them to TiKV
* TiKV performs query admission, providing localized back pressure

Presumably TiFlash would work in the same way as TiKV

Contributor Author

@gregwebs gregwebs Oct 12, 2020

I haven't thought about TiFlash. Managing QoS for the OLTP workload path is critical. For OLAP it is less important. TiFlash is also gaining some direct write support, but I have no idea how that works. As the MPP support for TiFlash improves, it will be easier to handle TiFlash load by scaling out. Additionally, applications that benefit from TiFlash would generally be big enough to have their own TiDB cluster. This proposal will most benefit smaller applications that must use a shared TiDB cluster until they grow larger.


The amount of inhibition required depends on the number of requests and amount of resources being requested. Effectively when resources are highly utilized we build up a queue of pending requests with a limited size where the overflow is rejected.
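
The queue-with-overflow behaviour described above can be sketched as follows. The class name `AdmissionQueue`, the `busy_threshold` parameter, and the single scalar `utilization` input are all assumptions made for illustration; the RFC does not define how utilization is measured or what the queue limit should be.

```python
from collections import deque


class AdmissionQueue:
    """Hypothetical bounded queue of pending requests with overflow rejection."""

    def __init__(self, max_pending: int, busy_threshold: float = 0.8):
        self.max_pending = max_pending
        self.busy_threshold = busy_threshold  # assumed utilization cut-off in [0, 1]
        self.pending = deque()

    def submit(self, request, utilization: float) -> str:
        """Decide what happens to a request at the current resource utilization."""
        if utilization < self.busy_threshold:
            return "run"        # resources are available: no inhibition
        if len(self.pending) < self.max_pending:
            self.pending.append(request)
            return "queued"     # highly utilized: hold the request back
        return "rejected"       # queue is full: overflow is rejected

    def drain_one(self):
        """Pop the oldest pending request once resources free up, or None."""
        return self.pending.popleft() if self.pending else None


q = AdmissionQueue(max_pending=2)
outcomes = [
    q.submit("a", utilization=0.50),
    q.submit("b", utilization=0.90),
    q.submit("c", utilization=0.95),
    q.submit("d", utilization=0.99),
]
```

Rejecting on overflow is what turns the queue into back pressure: callers see the rejection immediately instead of waiting behind an unbounded backlog.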

Policy application is allowed to take into account resources that will be used

Here you talk about prioritisation of queries but in the above section it sounds like TiKV just has a run/reject binary for queries.

Contributor Author

Yes, this was not written clearly. To decide which queries to admit, we need to apply QoS policies. But admission can also take into account the resources being used. Once queries are admitted, I think we will just do policy (region-based) prioritization. This should be fleshed out more.
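
One possible reading of this two-stage idea, sketched under heavy assumptions: admission checks an estimated resource cost against remaining capacity, and admitted queries are then ordered purely by the QoS value of their policy. The names `Scheduler`, `admit`, `next_query`, and `finish`, the scalar `cost`, and "higher QoS runs first" are all hypothetical; policy-driven admission is deliberately left out because it is not yet specified.

```python
import heapq
import itertools
from typing import Optional, Tuple


class Scheduler:
    """Hypothetical admit-then-prioritize scheduler."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_use = 0               # cost of admitted, unfinished queries
        self._heap = []               # entries: (-qos, arrival order, name, cost)
        self._seq = itertools.count()

    def admit(self, name: str, qos: int, cost: int) -> bool:
        """Admit only if the estimated cost fits in the remaining capacity."""
        if self.in_use + cost > self.capacity:
            return False
        self.in_use += cost
        heapq.heappush(self._heap, (-qos, next(self._seq), name, cost))
        return True

    def next_query(self) -> Optional[Tuple[str, int]]:
        """Return (name, cost) of the highest-QoS admitted query, FIFO on ties."""
        if not self._heap:
            return None
        _, _, name, cost = heapq.heappop(self._heap)
        return name, cost

    def finish(self, cost: int) -> None:
        """Release the resources held by a completed query."""
        self.in_use -= cost


s = Scheduler(capacity=10)
admitted = [s.admit("low", qos=1, cost=4), s.admit("high", qos=9, cost=4)]
too_big = s.admit("big", qos=5, cost=4)  # would exceed capacity
first = s.next_query()
```

Capacity is only released by `finish`, so a query keeps counting against admission while it is running, not just while it waits.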
