
Feat/blocks #1695

Merged: 25 commits into cortexproject:master on Oct 30, 2019
Conversation

thorfour (Contributor)

The motivation behind this PR was to be able to store larger objects in the S3 storage backend. It leverages a lot of code from the Thanos project to accomplish that.

This implements ingester storage and S3 storage using TSDB blocks and the Thanos shipper.
Each user has a TSDB opened under their userID, and a shipper that periodically scans that directory and uploads new blocks to S3.

The querier creates a block querier that syncs against S3 to perform the long-retention queries.
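
As a rough sketch of the per-user layout described above (illustrative only: the Shipper interface, helper names, and paths here are stand-ins, not this PR's actual code, which uses the Thanos shipper package and the standalone prometheus/tsdb library):

```go
package main

import (
	"context"
	"path/filepath"
	"time"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

// Shipper is a stand-in for the Thanos shipper: it scans a TSDB
// directory and uploads any new blocks to object storage.
type Shipper interface {
	Sync(ctx context.Context) error
}

// userTSDB pairs one tenant's local TSDB with its shipper.
type userTSDB struct {
	db      *tsdb.DB
	shipper Shipper
}

// openUserTSDB opens (or creates) a TSDB under <baseDir>/<userID> and
// wires up a shipper for that directory.
func openUserTSDB(baseDir, userID string, newShipper func(dir string) Shipper) (*userTSDB, error) {
	dir := filepath.Join(baseDir, userID)
	db, err := tsdb.Open(dir, log.NewNopLogger(), nil, &tsdb.Options{
		BlockRanges: []int64{int64(2 * time.Hour / time.Millisecond)}, // 2h block range
	})
	if err != nil {
		return nil, err
	}
	return &userTSDB{db: db, shipper: newShipper(dir)}, nil
}

// shipLoop periodically lets the shipper pick up finished blocks and
// upload them to the object store.
func (u *userTSDB) shipLoop(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			_ = u.shipper.Sync(ctx) // errors would be logged in a real implementation
		}
	}
}

// noopShipper does nothing; it only makes the example self-contained.
type noopShipper struct{}

func (noopShipper) Sync(ctx context.Context) error { return nil }

func main() {
	u, err := openUserTSDB("/tmp/tsdb", "user-1", func(string) Shipper { return noopShipper{} })
	if err != nil {
		panic(err)
	}
	defer u.db.Close()
	go u.shipLoop(context.Background(), time.Minute)
	time.Sleep(time.Second)
}
```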

thorfour force-pushed the feat/blocks branch 6 times, most recently from 2c5ef1b to 11deb9d, on October 1, 2019.
bboreham (Contributor) commented on Oct 2, 2019

Thanks for the PR! Could you comment on the relationship to #1427 which seems similar from the description?

thorfour (author) commented on Oct 2, 2019

> Thanks for the PR! Could you comment on the relationship to #1427 which seems similar from the description?

I had not seen that draft before. I'll delve into it and see what the relationship is.

thorfour (author) commented on Oct 2, 2019

@bboreham it looks like #1427 has similar goals to this PR. It's hard to know for certain since a lot of that draft is unfinished, but the original idea there appears to have been to make TSDB itself multi-tenant, whereas this PR keeps each TSDB single-tenant and opens one per user (an approach that was later suggested in the comments on that PR).

That draft also appears to deal with metrics at the per-chunk level, whereas this PR handles them in block format.

There is plenty of future work here, such as user limits, shipping blocks to non-S3 storage backends, etc., but I think it's a working starting point that can be built on and that satisfies the high-level goal of #1427.

tomwilkie mentioned this pull request on Oct 3, 2019.
thorfour force-pushed the feat/blocks branch 4 times, most recently from 2c9b399 to 4eda36b, on October 4, 2019.
gouthamve (Contributor) left a comment:

Some comments:

  • Looks like only S3 is supported; how hard would it be to add other storage backends (GCS in particular)?
  • I see a huge scalability bottleneck in the querier, as every querier will need to hold all the block metadata for all the users. It is memory-intensive, and ideally we should be able to shard it. I don't have a good solution here.

Other than this, I'll need to test this all out :) It would be super helpful if you could share the manifests for running a small cluster on DOK8S.

return nil
}

func (u *UserStore) syncUserStores(ctx context.Context, f func(context.Context, *store.BucketStore) error) error {
Contributor:

We'll hit this quite quickly in a multi-tenant context: thanos-io/thanos#1471

Further, I don't see a good way to horizontally scale this, as there is no sharding and every querier has to hold all the blocks.

thorfour (author):

Yeah, makes sense. I wonder if we could do something like sharding users across a stateful set of queriers, so that only some queriers hold a given user. This problem is definitely food for thought.

pracucci (Contributor):

Let me add some math. Scenario:

  • 1K tenants (reasonably low)
  • 14 months retention (approx. 31 days / month)
  • 2 hours TSDB block range

Number of blocks per tenant: 14 * 31 * (24 / 2) ≈ 5.2K blocks
Total number of blocks across 1K tenants: 1,000 * 5.2K ≈ 5.2M blocks
Number of API calls required to list all the blocks (assuming 1,000 max items per response): 5.2M / 1,000 = 5,200

The initial sync of a querier needs to download the meta.json file for each block of each tenant, which means 5.2M calls to the object store; that does not look feasible to me.

A possible alternative would be to index blocks by time range and implement it in a fully lazy way. We could prepend each block name stored on S3 with the min timestamp of that block (using a timestamp format that guarantees alphabetical ordering).

Given a query for the time range [A,B], we then need to sync only the blocks whose min timestamp is in [A-delta,B], where delta is the max block range allowed by the system. The querier could have a config option to initially sync only the last X hours of blocks (assuming most queries are against recent data), while the metadata of older blocks would be lazily fetched.

Moreover, the periodic sync to find new blocks wouldn't require iterating over all blocks of a tenant; we could set the "StartAfter" param (of the S3 ListBucket API) to the last min timestamp synced minus a delta.

Does that make sense?
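
As a very rough sketch of that StartAfter-based listing with aws-sdk-go (the bucket name, the `<tenant>/<RFC3339 min time>` key layout, and the helper names are assumptions for illustration, not anything from this PR):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// listBlocksSince lists block keys for one tenant whose alphabetically
// sortable min-timestamp prefix is at or after since-delta.
// Assumed key layout: <tenant>/<RFC3339 min timestamp>-<block ULID>/...
func listBlocksSince(svc *s3.S3, bucket, tenant string, since time.Time, delta time.Duration) ([]string, error) {
	startAfter := fmt.Sprintf("%s/%s", tenant, since.Add(-delta).UTC().Format(time.RFC3339))

	var keys []string
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket:     aws.String(bucket),
		Prefix:     aws.String(tenant + "/"),
		StartAfter: aws.String(startAfter),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			keys = append(keys, aws.StringValue(obj.Key))
		}
		return true // keep paging
	})
	return keys, err
}

func main() {
	sess := session.Must(session.NewSession())
	keys, err := listBlocksSince(s3.New(sess), "cortex-blocks", "tenant-1", time.Now().Add(-12*time.Hour), 2*time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(keys), "keys since cutoff")
}
```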

Contributor:

S3-based indexing is just a trick to avoid introducing another dependency. Indexes on Bigtable (like Cortex chunks work today) would serve the same purpose in a probably cleaner way.

thorfour (author):

Totally makes sense. I personally think this is still an optimization for after this PR is merged, as this change is already large. Block names are already ULIDs, which sort by time, so we would have time ordering and could lazy-sync on that already.
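
For reference, a small sketch of pulling the timestamp out of a block's ULID name with github.com/oklog/ulid (the block name below is made up):

```go
package main

import (
	"fmt"
	"time"

	"github.com/oklog/ulid"
)

func main() {
	// Block directories in object storage are named by their ULID,
	// whose first 48 bits encode a millisecond timestamp, so names
	// sort lexicographically by creation time.
	id, err := ulid.Parse("01DQC4K9PTPQ6WD7JXJ3RZJ7EF") // made-up block name
	if err != nil {
		panic(err)
	}
	created := ulid.Time(id.Time()) // ms since epoch -> time.Time
	fmt.Println("block created at", created.UTC().Format(time.RFC3339))
}
```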

thorfour (author):

Do you view this as a blocker for the initial feature merge?

Contributor:

Not a blocker for sure. I was sharing thoughts on possible evolutions.

Contributor:

Hi 👋

Yup, in Thanos the key thing is that blocks have to be compacted (up to 2w) somewhere, so you avoid having that many blocks. Plus, right now, at least on the Thanos Store and Compactor, you can shard the reads using relabelling, so it can be horizontally scaled.

The only problem might be a single bucket and our flat block structure: even with sharding we still iterate over all blocks when syncing. The StartAfter logic sounds good indeed; however, it is not sufficient, as blocks (at least in the Thanos case) can be compacted/removed, so we need something that checks whether a block is still there. We could do that lazily as well, in theory. Something that would make sense is maybe some prefix support, so we can upload/read blocks from different subdirectories of the object store, and the prefix could be the external labels (I hope this is how you recognize tenants as well?). (:

Contributor:

... which is discussed here: thanos-io/thanos#1318. Please join the discussion if you have ideas that would work for you as well! (:

gouthamve (Contributor) left a comment:

I don't see where we load the WAL after the transfer is done. I could see the WAL load taking some time too, and I don't have a good idea of what to do during the ~10 minutes it takes to replay the WAL.

But beyond these, I think the PR is super close. Thanks for all the work on this!

return nil
}

// Close all user databases
Contributor:

So while the ingester is in the LEAVING state it will continue to receive queries. We shouldn't close the DBs, AFAICS.

thorfour (author):

As for the first comment: the WAL is replayed when the TSDB is opened for a given user, which may not work well for large WALs. I've been using 10m blocks in my testing and that particular issue didn't become a problem, though I can see how it might.

Good call on not closing the DBs. Do you know if it's safe to copy out a TSDB before it's been closed? Maybe the better idea here is to snapshot it and leave it open for queries. Thoughts?
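
For context, a minimal sketch of what snapshot-and-leave-open could look like, assuming the prometheus/tsdb Snapshot API (paths and helper names are illustrative):

```go
package main

import (
	"path/filepath"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

// snapshotForTransfer writes a consistent copy of the open TSDB into
// snapDir, while leaving the original DB open so it can keep serving
// queries.
func snapshotForTransfer(db *tsdb.DB, snapDir string) error {
	// withHead=true also writes the in-memory head data into the
	// snapshot, so recent samples not yet persisted as a block are
	// included.
	return db.Snapshot(snapDir, true)
}

func main() {
	db, err := tsdb.Open("/data/tsdb/user-1", log.NewNopLogger(), nil, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := snapshotForTransfer(db, filepath.Join("/data/transfer", "user-1")); err != nil {
		panic(err)
	}
}
```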

Contributor:

> Do you know if it's safe to copy out a tsdb before it's been closed?

I'm not sure. I checked the code a bit, and Close() also ensures that the WAL is fsync-ed. Without closing it, we may have no guarantee that the last samples have been synced to disk.

Would it be possible to reverse the problem, and not allow a LEAVING ingester to serve queries when the storage is TSDB?

khaines (Contributor) left a comment:

Most of this PR is its own branch of logic, so it doesn't impact any existing use of S3. I'm leaning towards approving, pending responses to a select few comments here.

I am really interested in the performance at scale, but much of that can only be found through load testing and observation.

@bboreham since you are on the list as the 0.4 release manager, is this something you would want merged for that release, should it wait until after 0.4, or do we want to mark it as a "beta/experimental" feature?

@@ -138,6 +156,16 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.SpreadFlushes, "ingester.spread-flushes", false, "If true, spread series flushes across the whole period of MaxChunkAge")
f.IntVar(&cfg.ConcurrentFlushes, "ingester.concurrent-flushes", 50, "Number of concurrent goroutines flushing to dynamodb.")
f.DurationVar(&cfg.RateUpdatePeriod, "ingester.rate-update-period", 15*time.Second, "Period with which to update the per-user ingestion rates.")

// Prometheus 2.x block storage
f.BoolVar(&cfg.V2, "ingester.v2", false, "If true, use Prometheus block storage for metrics")
Contributor:

IIRC it was covered in the community call that we need a migration path for existing deployments via the storage schema config, but we shouldn't block on that here. This is a reminder comment to open an issue for it once blocks are merged.


var bkt *s3.Bucket
s3Cfg := s3.Config{
Bucket: cfg.S3Bucket,
Contributor:

This references the Thanos package, so is there a potential to run into the service limits at scale that prompted the work in #1625?

thorfour (author):

This can definitely hit the same API limitations of S3 providers (and has in my testing). I'm currently working on a follow-on PR to shard across buckets as in #1625. It's mostly straightforward, except for the SyncBlocks func, which iterates over a bucket.
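
A toy sketch of sharding tenants across a fixed set of buckets by hashing the user ID (the bucket naming scheme and hash choice are my own assumptions, not necessarily what the follow-on PR does):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor picks one of n buckets for a tenant by hashing its user ID,
// so each tenant's blocks always land in the same bucket.
func bucketFor(userID string, n uint32) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(userID))
	return fmt.Sprintf("cortex-blocks-%d", h.Sum32()%n)
}

func main() {
	fmt.Println(bucketFor("tenant-1", 8))
	fmt.Println(bucketFor("tenant-2", 8))
}
```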

}
filesXfer++

// TODO(thor) To avoid corruption from errors, it's probably best to write to a temp dir, and then move that to the final location
Contributor:

@thorfour was this something you intended to implement before merging?

thorfour (author):

I didn't plan to, no; I was trying to keep the changes in a single PR smaller and was going to do that as a follow-on PR. But if it's seen as a blocker to merging, I'm happy to complete it in this one.

thorfour force-pushed the feat/blocks branch 2 times, most recently from 34cd5c1 to 09ed472, on October 7, 2019.
bboreham (Contributor) commented on Oct 7, 2019

> since you are on the list as the 0.4 release manager

I have cut the RC for 0.3 today, so no, I don't want any last-minute merges except fixes for serious bugs.

bboreham (Contributor) commented on Oct 7, 2019

Do we really want to call this v2? We already have schemas named up to v11, and code from Prometheus v1 and v2; it could be confusing.

thorfour force-pushed the feat/blocks branch 3 times, most recently from 72ddad1 to 9096870, on October 7, 2019.
thorfour (author) commented on Oct 8, 2019

> Some comments:
>
> • Looks like only S3 is supported; how hard would it be to add other storage backends (GCS in particular)?
>
> • I see a huge scalability bottleneck in the querier, as every querier will need to hold all the block metadata for all the users. It is memory-intensive, and ideally we should be able to shard it. I don't have a good solution here.
>
> Other than this, I'll need to test this all out :) It would be super helpful if you could share the manifests for running a small cluster on DOK8S.

@gouthamve

For GCS support, I believe we'd create a different objstore.Bucket client that implements that interface (rough sketch below), as well as an objstore.BucketReader for the querier. It just wasn't on my priority list.

I've uploaded a k8s gist: kubectl apply -f https://gist.githubusercontent.com/thorfour/7da604b93856023bd41b86b110737db4/raw/4a7e512dafe4b15874d4c387008e84f8b0b42692/k8s-cortex-blocks.yml

That should create everything needed in the cluster, including a preconfigured Grafana. Let me know if you run into problems with it.
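
As a rough illustration of that shape, here is a sketch of a GCS-backed bucket client built directly on cloud.google.com/go/storage; the Bucket interface below is a simplified stand-in for Thanos's objstore.Bucket, not its exact definition, and the bucket name is made up:

```go
package main

import (
	"context"
	"io"
	"strings"

	"cloud.google.com/go/storage"
)

// Bucket is a simplified stand-in for the Thanos objstore.Bucket interface;
// the real interface has a few more methods (Iter, GetRange, Exists, ...).
type Bucket interface {
	Get(ctx context.Context, name string) (io.ReadCloser, error)
	Upload(ctx context.Context, name string, r io.Reader) error
	Delete(ctx context.Context, name string) error
	Close() error
}

// gcsBucket implements Bucket on top of a GCS bucket handle.
type gcsBucket struct {
	client *storage.Client
	bkt    *storage.BucketHandle
}

func NewGCSBucket(ctx context.Context, bucketName string) (Bucket, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return &gcsBucket{client: client, bkt: client.Bucket(bucketName)}, nil
}

func (b *gcsBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	return b.bkt.Object(name).NewReader(ctx)
}

func (b *gcsBucket) Upload(ctx context.Context, name string, r io.Reader) error {
	w := b.bkt.Object(name).NewWriter(ctx)
	if _, err := io.Copy(w, r); err != nil {
		_ = w.Close()
		return err
	}
	return w.Close()
}

func (b *gcsBucket) Delete(ctx context.Context, name string) error {
	return b.bkt.Object(name).Delete(ctx)
}

func (b *gcsBucket) Close() error { return b.client.Close() }

func main() {
	ctx := context.Background()
	bkt, err := NewGCSBucket(ctx, "my-cortex-blocks") // made-up bucket name
	if err != nil {
		panic(err)
	}
	defer bkt.Close()
	_ = bkt.Upload(ctx, "hello.txt", strings.NewReader("hello"))
}
```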

thorfour force-pushed the feat/blocks branch 3 times, most recently from bc1c78c to 878bcf0, on October 16, 2019.
pracucci (Contributor) left a comment:

You did a great job, @thorfour. I've only approached Cortex recently, so my comments may miss something, but I would be glad if you could take a look.


Thor and others added 19 commits on October 25, 2019 (each Signed-off-by: Thor <thansen@digitalocean.com>).
Thor added 2 commits on October 25, 2019 (each Signed-off-by: Thor <thansen@digitalocean.com>).
gouthamve (Contributor) left a comment:

LGTM! I think we have a good starting point and can iterate on this!

khaines (Contributor) left a comment:

I agree with Goutham, this LGTM for merging and further iterations.

gouthamve merged commit d8af0ab into cortexproject:master on Oct 30, 2019.
thorfour deleted the feat/blocks branch on October 30, 2019.