Closed
Description
Is there an existing feature request for this?
- I have searched the existing issues
Describe the Feature
- dbt 0.20.0 changed how tests were selected, to avoid errors in more complex dbt projects that don't build the entire project during development/CI.
- Specifically: In a fresh schema, operators like
@
and+
can cause models to be tested that don't exist (1827, 2132). Singular tests that touch a model are run even if they are only indirectly selected (3832) - The change was to only run tests if all first-order parents are included in selection (2891). The writeup in the PR (3235) is very good so I won't re-summarise it.
- Specifically: In a fresh schema, operators like
- As an unfortunate side effect, it broke people's mental model (it used to do what you meant; now it does what you asked)
- I was originally writing an article explaining why the changes are important and helping dbt users understand when they should use the new-in-0.21.0
--greedy
flag. As I dug into the edge cases, my thinking changed and I now think we should restore the 0.19.0 behaviour as the default, with an option to disable greedy selection for the use cases that necessitated the changes in the first place.
Matrix of functionality/options:
Describe alternatives you've considered
-
Make the new behaviour much more discoverable.
- I think if the DAG in dbt Docs also visualised tests, then we could more reasonably expect users to preview their selections there (I used to use it as a visual debugger of complex selection syntax)
- Document the crap out of it, particularly in the tutorial.
-
Change nothing: there's a risk of creating more confusion by flipping back to the old way and annoying the people who liked it
- If we're going to change, we should really do it before 1.0 or we're stuck.
- There's already significant confusion in the community, for both new and longstanding members. 3496, 3571, 3706, 3903, Slack 1 (9mths in community), Slack 2 (2yrs in community), Slack 3 (5mths in community), Slack 4 (18mths in community), Slack 5 (new to community), Slack 6 (1mth in community)
-
Try harder to internalise the new system.
- I don't think it's better to expect this of new users: the old failure modes were real, but they tended to impact advanced dbt users (the green users).
NB: Venn diagram is entirely unscientific - The new failure mode catches people immediately (the yellow users), especially since the Tutorial currently says "Running tests on one model looks very similar to running a model: use the
--select
flag (or-s
flag), followed by the name of the model"
- I don't think it's better to expect this of new users: the old failure modes were real, but they tended to impact advanced dbt users (the green users).
Who will this benefit?
- People whose mental model has been broken by the changes
- Specifically, who expect to be able to switch between
dbt run -m some_model
anddbt test -m some_model
and have all tests attached tosome_model
run. - The union of
dbt test --exclude tag:x
anddbt test --select tag:x
is no longer the same asdbt test
when one model in a test is taggedx
and one isn't. - Side note: docs#514 correctly calls for endless examples of node selection. We should do this, no matter what the outcome here!
- Specifically, who expect to be able to switch between
- A lot of the takeaway from my original post was going to be "the good thing is that there's now a flag to allow you to choose the right approach for your use case". That remains true even if we flip the default back.
- I wanted to be able to provide a clear set of guidelines covering when you should use
--greedy
and not. All of the cases devolved towards "yes, but..." or "it depends". If the settings were flipped, I think a lot of the guidance becomes easier because we can describe the handful of cases where it becomes desirable with less ambiguity: "Exclude the other parents when they won't exist"
Are you interested in contributing this feature?
Sure am! @VersusFacit and I are going to tag-team it
Anything else?
No response
Activity
jtcohen6 commentedon Oct 18, 2021
Thanks for opening this @joellabes! We've discussed this extensively offline, and you've managed to convince me.
In retrospect, we should have introduced non-greedy test selection in v0.20 as an option (as called for by #2891!) rather than a change to the default. In my hubris, I thought I had managed to square the circle of test selection in a way that would make everyone happy. Instead, it made everyone confused. We live and you learn :)
I think this change should be accompanied by:
GREEDY
toGREEDY_SELECTION
, and the flags to--greedy-selection
and--no-greedy-selection
(for clarity IMO)profiles.yml
, and turned off via a--no
flagThen, the actual change just looks like:
GREEDY
flag toFalse
True
by default (for yaml selectors, which set their specification separately from the flag)Now that greedy selection is going to be the default, a few questions:
my_model+1
instead ofmy_model
)?Thrilled to see that you and @VersusFacit are going to take a swing at this :) We've got a couple of weeks ahead of v1.0.0-rc1. Let's get this in before then!
joellabes commentedon Oct 18, 2021
The only way to be right all the time is to change your mind when you get one wrong! You're still on a pretty solid batting average ⚾️
Thinking a bit more about the naming, I don't love greedy's implications. I know there's a ton of precedent for the term in technical contexts, and I'm not going to die on this hill, but do you think we could get the same thing done while calling it eager-selection or similar?
No I think this can be removed again. If someone has a compelling reason for needing it, adding logging will be a non-breaking change in the future.
It's only in the last few weeks that I've even realised that my_model+1 is a first-class way to access tests (see also my comments about discoverability of them in the visual DAG).
I'm not averse to it, but I don't really have strong feelings one way or the other.
jtcohen6 commentedon Oct 18, 2021
As far as slugging percentage is concerned, we have to take some big swings, and more than a few will be misses!
Certainly! I'd be all for renaming this to
EAGER_SELECTION
(or evenEAGER_INDIRECT_SELECTION
). Alternatively, we could recast this as an optional boolean that'sFalse
by default (CAUTIOUS_SELECTION
), since that's a recommended pattern.Since we're already breaking this by turning it on by default, we should also take the opportunity to rename it to something we'll be happy with for a long while.
Let's think a bit more about whether less-eager selection should mean "less eager for multi-parent tests only," vs. "avoid indirect selection entirely" (a much more sweeping change). There's a real chance that the latter is the right move in the 5% of cases where "full control mode" is desirable: Only execute the tests that are themselves selected. Or, to put it in the terms you described above: Do what's asked only, and not at all what might be meant (but not asked).
It will come as no surprise that I think selection is really important, especially as projects get bigger, and complementary tools like deferral/cloning get more powerful. The right answer, ultimately, might be to support multiple configurations, or gradations of eagerness. At one time, that would have felt like overdoing it; nowadays, it's clear to me that there's a real need here. This is something that folks are willing to think about in great depth and detail, especially if we're able to give them solid tools to do that thinking with—e.g., a DAG for visualizing test selection, too.
joellabes commentedon Oct 19, 2021
I don't think cautious correctly covers it as a standalone word, but I'm on board with an off-by-default naming pattern. I'll have a ponder. For the rest of this I'll call it
--non-eager-selection
as a placeholder. Let's not do that though:no-non-eager-selection
would be a disaster 🙃I'm wary of making more big, sweeping behavioural changes right before 1.0 ships (especially ones I don't use so don't have strong opinions on their behaviour), but I'm on board with keeping the door open for us to do it in the future.
How about this?
--non-eager-selection
with the same behaviour as was in 0.19.0--non-eager-selection
can take an argument to set the level of eagerness, e.g.some
– or no value for backwards compatibility – andextreme
for "full control mode".Or perhaps better still (writing is thinking):
--indirect-selection
now, witheager
as default (same behaviour as 0.19.0) andcautious
(the current greedy=false behaviour)none
to enable "full control mode".I like this because it means we're not constrained to a boolean representation of a spectrum, and eager and cautious as the two opposite options are very grokkable ☯️ . The naming gets very fraught if we ever plan to have a fourth option...
extremely-cautious
?overenthusiastic
? bleugh.jtcohen6 commentedon Oct 19, 2021
I'm on board!
Your second proposal tickles me. The naming and string type is a bit against convention, but the thing we're asking people to do here is more complex than a standard on/off global config. It feels more like "dbt language," wherein we speak about ephemeral materializations, high-maturity exposures, etc.
[-]Return to greedy test selection by default, with an option to disable it[/-][+]Return to eager test selection by default, with an option to tone it down[/+]