
Feat/blocks #1695

Merged: 25 commits into cortexproject:master on Oct 30, 2019
Conversation

thorfour (Contributor)

The motivation behind this PR was to be able to store larger objects in the S3 storage backend. It leverages a lot of code from the Thanos project to accomplish that.

This implements ingester storage and S3 storage using TSDB blocks and the Thanos shipper.
Each user has a TSDB opened under their userID, and a shipper that periodically scans that directory and uploads new blocks to S3.

The querier creates a block querier that syncs against S3 to perform the long-retention queries.
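
As a rough sketch of the per-user layout described above (illustrative only: the Shipper interface, helper names, and paths here are stand-ins, not this PR's actual code, which uses the Thanos shipper package and the standalone prometheus/tsdb library):

```go
package main

import (
	"context"
	"path/filepath"
	"time"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

// Shipper is a stand-in for the Thanos shipper: it scans a TSDB
// directory and uploads any new blocks to object storage.
type Shipper interface {
	Sync(ctx context.Context) error
}

// userTSDB pairs one tenant's local TSDB with its shipper.
type userTSDB struct {
	db      *tsdb.DB
	shipper Shipper
}

// openUserTSDB opens (or creates) a TSDB under <baseDir>/<userID> and
// wires up a shipper for that directory.
func openUserTSDB(baseDir, userID string, newShipper func(dir string) Shipper) (*userTSDB, error) {
	dir := filepath.Join(baseDir, userID)
	db, err := tsdb.Open(dir, log.NewNopLogger(), nil, &tsdb.Options{
		BlockRanges: []int64{int64(2 * time.Hour / time.Millisecond)}, // 2h block range
	})
	if err != nil {
		return nil, err
	}
	return &userTSDB{db: db, shipper: newShipper(dir)}, nil
}

// shipLoop periodically lets the shipper pick up finished blocks and
// upload them to the object store.
func (u *userTSDB) shipLoop(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			_ = u.shipper.Sync(ctx) // errors would be logged in a real implementation
		}
	}
}

// noopShipper does nothing; it only makes the example self-contained.
type noopShipper struct{}

func (noopShipper) Sync(ctx context.Context) error { return nil }

func main() {
	u, err := openUserTSDB("/tmp/tsdb", "user-1", func(string) Shipper { return noopShipper{} })
	if err != nil {
		panic(err)
	}
	defer u.db.Close()
	go u.shipLoop(context.Background(), time.Minute)
	time.Sleep(time.Second)
}
```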

thorfour force-pushed the feat/blocks branch 6 times, most recently from 2c5ef1b to 11deb9d, on October 1, 2019.
bboreham (Contributor) commented on Oct 2, 2019

Thanks for the PR! Could you comment on the relationship to #1427 which seems similar from the description?

thorfour (author) commented on Oct 2, 2019

> Thanks for the PR! Could you comment on the relationship to #1427 which seems similar from the description?

I had not seen that draft before. I'll delve into it and see what the relationship is.

thorfour (author) commented on Oct 2, 2019

@bboreham it looks like #1427 has similar goals to this PR. It's hard to know for certain since a lot of that draft is unfinished, but the original idea there appears to have been to make TSDB itself multi-tenant, whereas this PR keeps each TSDB single-tenant and opens one per user (an approach that was later suggested in the comments on that PR).

That draft also appears to deal with metrics at the per-chunk level, whereas this PR handles them in block format.

There is plenty of future work here, such as user limits, shipping blocks to non-S3 storage backends, etc., but I think it's a working starting point that can be built on and that satisfies the high-level goal of #1427.

tomwilkie mentioned this pull request on Oct 3, 2019.
thorfour force-pushed the feat/blocks branch 4 times, most recently from 2c9b399 to 4eda36b, on October 4, 2019.
gouthamve (Contributor) left a comment:

Some comments:

  • Looks like only S3 is supported; how hard would it be to add other storage backends (GCS in particular)?
  • I see a huge scalability bottleneck in the querier, as every querier will need to hold all the block metadata for all the users. It is memory-intensive, and ideally we should be able to shard it. I don't have a good solution here.

Other than this, I'll need to test this all out :) It would be super helpful if you could share the manifests for running a small cluster on DOK8S.

return nil
}

func (u *UserStore) syncUserStores(ctx context.Context, f func(context.Context, *store.BucketStore) error) error {
Contributor:

We'll hit this quite quickly in a multi-tenant context: thanos-io/thanos#1471

Further, I don't see a good way to horizontally scale this, as there is no sharding and every querier has to hold all the blocks.

thorfour (author):

Yeah, makes sense. I wonder if we could do something like sharding users across a stateful set of queriers, so that only some queriers hold a given user. This problem is definitely food for thought.

pracucci (Contributor):

Let me add some math. Scenario:

  • 1K tenants (reasonably low)
  • 14 months retention (approx. 31 days / month)
  • 2 hours TSDB block range

Number of blocks per tenant: 14 * 31 * (24 / 2) ≈ 5.2K blocks
Total number of blocks across 1K tenants: 1,000 * 5.2K ≈ 5.2M blocks
Number of API calls required to list all the blocks (assuming 1,000 max items per response): 5.2M / 1,000 = 5,200

The initial sync of a querier needs to download the meta.json file for each block of each tenant, which means 5.2M calls to the object store; that does not look feasible to me.

A possible alternative would be to index blocks by time range and implement it in a fully lazy way. We could prepend each block name stored on S3 with the min timestamp of that block (using a timestamp format that guarantees alphabetical ordering).

Given a query for the time range [A,B], we then need to sync only the blocks whose min timestamp is in [A-delta,B], where delta is the max block range allowed by the system. The querier could have a config option to initially sync only the last X hours of blocks (assuming most queries are against recent data), while the metadata of older blocks would be lazily fetched.

Moreover, the periodic sync to find new blocks wouldn't require iterating over all blocks of a tenant; we could set the "StartAfter" param (of the S3 ListBucket API) to the last min timestamp synced minus a delta.

Does that make sense?
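
As a very rough sketch of that StartAfter-based listing with aws-sdk-go (the bucket name, the `<tenant>/<RFC3339 min time>` key layout, and the helper names are assumptions for illustration, not anything from this PR):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// listBlocksSince lists block keys for one tenant whose alphabetically
// sortable min-timestamp prefix is at or after since-delta.
// Assumed key layout: <tenant>/<RFC3339 min timestamp>-<block ULID>/...
func listBlocksSince(svc *s3.S3, bucket, tenant string, since time.Time, delta time.Duration) ([]string, error) {
	startAfter := fmt.Sprintf("%s/%s", tenant, since.Add(-delta).UTC().Format(time.RFC3339))

	var keys []string
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket:     aws.String(bucket),
		Prefix:     aws.String(tenant + "/"),
		StartAfter: aws.String(startAfter),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			keys = append(keys, aws.StringValue(obj.Key))
		}
		return true // keep paging
	})
	return keys, err
}

func main() {
	sess := session.Must(session.NewSession())
	keys, err := listBlocksSince(s3.New(sess), "cortex-blocks", "tenant-1", time.Now().Add(-12*time.Hour), 2*time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(keys), "keys since cutoff")
}
```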

Contributor:

S3-based indexing is just a trick to avoid introducing another dependency. Indexes on Bigtable (like Cortex chunks work today) would serve the same purpose in a probably cleaner way.

thorfour (author):

Totally makes sense. I personally think this is still an optimization for after this PR is merged, as this change is already large. Block names are already ULIDs, which sort by time, so we would have time ordering and could lazy-sync on that already.
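
For reference, a small sketch of pulling the timestamp out of a block's ULID name with github.com/oklog/ulid (the block name below is made up):

```go
package main

import (
	"fmt"
	"time"

	"github.com/oklog/ulid"
)

func main() {
	// Block directories in object storage are named by their ULID,
	// whose first 48 bits encode a millisecond timestamp, so names
	// sort lexicographically by creation time.
	id, err := ulid.Parse("01DQC4K9PTPQ6WD7JXJ3RZJ7EF") // made-up block name
	if err != nil {
		panic(err)
	}
	created := ulid.Time(id.Time()) // ms since epoch -> time.Time
	fmt.Println("block created at", created.UTC().Format(time.RFC3339))
}
```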

thorfour (author):

Do you view this as a blocker for the initial feature merge?

Contributor:

Not a blocker for sure. I was sharing thoughts on possible evolutions.

Contributor:

Hi 👋

Yup, in Thanos the key thing is that blocks have to be compacted (up to 2w) somewhere, so you avoid having that many blocks. Plus, right now, at least on the Thanos Store and Compactor, you can shard the reads using relabelling, so it can be horizontally scaled.

The only problem might be a single bucket and our flat block structure: even with sharding we still iterate over all blocks when syncing. The StartAfter logic sounds good indeed; however, it is not sufficient, as blocks (at least in the Thanos case) can be compacted/removed, so we need something that checks whether a block is still there. We could do that lazily as well, in theory. Something that would make sense is maybe some prefix support, so we can upload/read blocks from different subdirectories of the object store, and the prefix could be the external labels (I hope this is how you recognize tenants as well?). (:

Contributor:

... which is discussed here: thanos-io/thanos#1318. Please join the discussion if you have ideas that would work for you as well! (:

gouthamve (Contributor) left a comment:

I don't see where we load the WAL after the transfer is done. I could see the WAL load taking some time too, and I don't have a good idea of what to do during the ~10 minutes it takes to replay the WAL.

But beyond these, I think the PR is super close. Thanks for all the work on this!

return nil
}

// Close all user databases
Contributor:

So while the ingester is in the LEAVING state it will continue to receive queries. We shouldn't close the DBs, AFAICS.

thorfour (author):

As for the first comment: the WAL is replayed when the TSDB is opened for a given user, which may not work well for large WALs. I've been using 10m blocks in my testing and that particular issue didn't become a problem, though I can see how it might.

Good call on not closing the DBs. Do you know if it's safe to copy out a TSDB before it's been closed? Maybe the better idea here is to snapshot it and leave it open for queries. Thoughts?
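
For context, a minimal sketch of what snapshot-and-leave-open could look like, assuming the prometheus/tsdb Snapshot API (paths and helper names are illustrative):

```go
package main

import (
	"path/filepath"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

// snapshotForTransfer writes a consistent copy of the open TSDB into
// snapDir, while leaving the original DB open so it can keep serving
// queries.
func snapshotForTransfer(db *tsdb.DB, snapDir string) error {
	// withHead=true also writes the in-memory head data into the
	// snapshot, so recent samples not yet persisted as a block are
	// included.
	return db.Snapshot(snapDir, true)
}

func main() {
	db, err := tsdb.Open("/data/tsdb/user-1", log.NewNopLogger(), nil, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := snapshotForTransfer(db, filepath.Join("/data/transfer", "user-1")); err != nil {
		panic(err)
	}
}
```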

Contributor:

> Do you know if it's safe to copy out a tsdb before it's been closed?

I'm not sure. I checked the code a bit, and Close() also ensures that the WAL is fsync-ed. Without closing it, we may have no guarantee that the last samples have been synced to disk.

Would it be possible to reverse the problem, and not allow a LEAVING ingester to serve queries when the storage is TSDB?

khaines (Contributor) left a comment:

Most of this PR is its own branch of logic, so it doesn't impact any existing use of S3. I'm leaning towards approving, pending responses to a select few comments here.

I am really interested in the performance at scale, but much of that can only be found through load testing and observation.

@bboreham since you are on the list as the 0.4 release manager, is this something you would want merged for that release, should it wait until after 0.4, or do we want to mark it as a "beta/experimental" feature?

@@ -138,6 +156,16 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.SpreadFlushes, "ingester.spread-flushes", false, "If true, spread series flushes across the whole period of MaxChunkAge")
f.IntVar(&cfg.ConcurrentFlushes, "ingester.concurrent-flushes", 50, "Number of concurrent goroutines flushing to dynamodb.")
f.DurationVar(&cfg.RateUpdatePeriod, "ingester.rate-update-period", 15*time.Second, "Period with which to update the per-user ingestion rates.")

// Prometheus 2.x block storage
f.BoolVar(&cfg.V2, "ingester.v2", false, "If true, use Prometheus block storage for metrics")
Contributor:

IIRC it was covered in the community call that we need a migration path for existing deployments via the storage schema config, but we shouldn't block on that here. This is a reminder comment to open an issue for it once blocks are merged.


var bkt *s3.Bucket
s3Cfg := s3.Config{
Bucket: cfg.S3Bucket,
Contributor:

This references the Thanos package, so is there a potential to run into the service limits at scale that prompted the work in #1625?

thorfour (author):

This can definitely hit the same API limitations of S3 providers (and has in my testing). I'm currently working on a follow-on PR to shard across buckets as in #1625. It's mostly straightforward, except for the SyncBlocks func, which iterates over a bucket.
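
A toy sketch of sharding tenants across a fixed set of buckets by hashing the user ID (the bucket naming scheme and hash choice are my own assumptions, not necessarily what the follow-on PR does):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor picks one of n buckets for a tenant by hashing its user ID,
// so each tenant's blocks always land in the same bucket.
func bucketFor(userID string, n uint32) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(userID))
	return fmt.Sprintf("cortex-blocks-%d", h.Sum32()%n)
}

func main() {
	fmt.Println(bucketFor("tenant-1", 8))
	fmt.Println(bucketFor("tenant-2", 8))
}
```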

}
filesXfer++

// TODO(thor) To avoid corruption from errors, it's probably best to write to a temp dir, and then move that to the final location
Contributor:

@thorfour was this something you intended to implement before merging?

thorfour (author):

I didn't plan to, no; I was trying to keep the changes in a single PR smaller and was going to do that as a follow-on PR. But if it's seen as a blocker to merging, I'm happy to complete it in this one.

thorfour force-pushed the feat/blocks branch 2 times, most recently from 34cd5c1 to 09ed472, on October 7, 2019.
bboreham (Contributor) commented on Oct 7, 2019

> since you are on the list as the 0.4 release manager

I have cut the RC for 0.3 today, so no, I don't want any last-minute merges except fixes for serious bugs.

bboreham (Contributor) commented on Oct 7, 2019

Do we really want to call this v2? We already have schemas named up to v11, and code from Prometheus v1 and v2; it could be confusing.

thorfour force-pushed the feat/blocks branch 3 times, most recently from 72ddad1 to 9096870, on October 7, 2019.
thorfour (author) commented on Oct 8, 2019

> Some comments:
>
> • Looks like only S3 is supported; how hard would it be to add other storage backends (GCS in particular)?
>
> • I see a huge scalability bottleneck in the querier, as every querier will need to hold all the block metadata for all the users. It is memory-intensive, and ideally we should be able to shard it. I don't have a good solution here.
>
> Other than this, I'll need to test this all out :) It would be super helpful if you could share the manifests for running a small cluster on DOK8S.

@gouthamve

For GCS support, I believe we'd create a different objstore.Bucket client that implements that interface (rough sketch below), as well as an objstore.BucketReader for the querier. It just wasn't on my priority list.

I've uploaded a k8s gist: kubectl apply -f https://gist.githubusercontent.com/thorfour/7da604b93856023bd41b86b110737db4/raw/4a7e512dafe4b15874d4c387008e84f8b0b42692/k8s-cortex-blocks.yml

That should create everything needed in the cluster, including a preconfigured Grafana. Let me know if you run into problems with it.
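
As a rough illustration of that shape, here is a sketch of a GCS-backed bucket client built directly on cloud.google.com/go/storage; the Bucket interface below is a simplified stand-in for Thanos's objstore.Bucket, not its exact definition, and the bucket name is made up:

```go
package main

import (
	"context"
	"io"
	"strings"

	"cloud.google.com/go/storage"
)

// Bucket is a simplified stand-in for the Thanos objstore.Bucket interface;
// the real interface has a few more methods (Iter, GetRange, Exists, ...).
type Bucket interface {
	Get(ctx context.Context, name string) (io.ReadCloser, error)
	Upload(ctx context.Context, name string, r io.Reader) error
	Delete(ctx context.Context, name string) error
	Close() error
}

// gcsBucket implements Bucket on top of a GCS bucket handle.
type gcsBucket struct {
	client *storage.Client
	bkt    *storage.BucketHandle
}

func NewGCSBucket(ctx context.Context, bucketName string) (Bucket, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return &gcsBucket{client: client, bkt: client.Bucket(bucketName)}, nil
}

func (b *gcsBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	return b.bkt.Object(name).NewReader(ctx)
}

func (b *gcsBucket) Upload(ctx context.Context, name string, r io.Reader) error {
	w := b.bkt.Object(name).NewWriter(ctx)
	if _, err := io.Copy(w, r); err != nil {
		_ = w.Close()
		return err
	}
	return w.Close()
}

func (b *gcsBucket) Delete(ctx context.Context, name string) error {
	return b.bkt.Object(name).Delete(ctx)
}

func (b *gcsBucket) Close() error { return b.client.Close() }

func main() {
	ctx := context.Background()
	bkt, err := NewGCSBucket(ctx, "my-cortex-blocks") // made-up bucket name
	if err != nil {
		panic(err)
	}
	defer bkt.Close()
	_ = bkt.Upload(ctx, "hello.txt", strings.NewReader("hello"))
}
```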

thorfour force-pushed the feat/blocks branch 3 times, most recently from bc1c78c to 878bcf0, on October 16, 2019.
pracucci (Contributor) left a comment:

You did a great job, @thorfour. I've only approached Cortex recently, so my comments may miss something, but I would be glad if you could take a look.


Thor and others added 19 commits on October 25, 2019 (each Signed-off-by: Thor <thansen@digitalocean.com>).
Thor added 2 commits on October 25, 2019 (each Signed-off-by: Thor <thansen@digitalocean.com>).
gouthamve (Contributor) left a comment:

LGTM! I think we have a good starting point and can iterate on this!

khaines (Contributor) left a comment:

I agree with Goutham, this LGTM for merging and further iterations.

gouthamve merged commit d8af0ab into cortexproject:master on Oct 30, 2019.
thorfour deleted the feat/blocks branch on October 30, 2019.