
Parallel Twisted trial workers should know an index to differentiate between each other #11949

Open
p12tic opened this issue Sep 3, 2023 · 16 comments · May be fixed by #11950

Comments

@p12tic
Contributor

p12tic commented Sep 3, 2023

Is your feature request related to a problem? Please describe.
Currently, tests run under parallel trial do not know about one another. This makes it difficult to coordinate access to shared resources (e.g. a database) that take a long time to construct.

For example, consider a case of 10000 tests that access a database. Currently tests can only be run in sequence, each using the same database and cleaning up after itself. Running tests in parallel would involve constructing 10000 separate databases, which is infeasible. However, if each test knew the index of the parallel Twisted trial worker it runs under, only a small number of databases would need to be created, and tests would use them just as in the serial case. E.g. when running tests with trial -j16, only 16 databases would need to be created.

Describe the solution you'd like
Expose the index of the parallel trial worker as an environment variable e.g. TWISTED_TRIAL_PARALLEL_INDEX.

Describe alternatives you've considered
An alternative is for the tests to inspect os.getcwd(), which contains the index of the parallel trial worker as its last path component, e.g. _trial_temp/21. However, this is an implementation detail of trial that is problematic to rely on; Twisted should provide a proper way to expose this information.
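For illustration, the fragile workaround might look like the sketch below (the _trial_temp/<index> layout is an undocumented implementation detail, so the helper name and the fallback behavior here are assumptions):

import os

def workerIndexFromCwd():
    # Relies on the trial implementation detail that each parallel worker
    # runs in _trial_temp/<index>; falls back to 0 when not parallelized.
    last = os.path.basename(os.getcwd())
    return int(last) if last.isdigit() else 0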

@glyph
Member

glyph commented Sep 3, 2023

Expose the index of the parallel trial worker as an environment variable e.g. TWISTED_TRIAL_PARALLEL_INDEX.

If we expose this it should be an API, like twisted.trial.workerIndex(), not an environment variable. Among other things, reporting an error like "trial isn't running", "trial is running but it's not running in parallel", or "we're running in parallel, but in a (heretofore unimplemented) elastic mode that doesn't support fixed indexing" will be much easier if the code can actually determine those things, rather than having everyone implement their own client code for parsing this env var.

@exarkun
Member

exarkun commented Sep 4, 2023

For example, consider a case of 10000 tests that access a database. Currently tests can only be run in sequence, each using the same database and cleaning up after itself.

This isn't quite the case. It's already possible for each worker process to manage a single database, if that's the behavior you want. I don't know how knowing "this is worker 7" even helps make it easier to solve that problem.

@glyph
Member

glyph commented Sep 4, 2023

This isn't quite the case. It's already possible for each worker process to manage a single database, if that's the behavior you want. I don't know how knowing "this is worker 7" even helps make it easier to solve that problem.

Managing a database connection is pretty easy by just using a global variable in the relevant process, but I can imagine some hypotheticals where it might help. Like, say, you have a database pre-populated in your CI environment with gigabytes of test information, and your trial tests need to connect to it but not interfere with each other. You might want to have separate actual database servers 1-N and have those correspond to your trial-worker index, assuming that your CI has some kind of exclusive lock around those.

However, this hypothetical example is a sort of hodgepodge drawn from experiences that I had around 2002 in a previous epoch of testing technology, so while I could believe this might be useful in some circumstances, I'd love to hear what @p12tic specifically wants to do and whether there might be better approaches, either with trial as it is today or perhaps with other improvements to trial. For example, if there's really some very tricky fixture setup and teardown coordination, it might make more sense to have a way to allow client code to run something in the manager process that the workers can exchange messages with, rather than depending on hard-coding fixed indexes.

@glyph
Member

glyph commented Sep 4, 2023

I'd also like to hear about specific information associated with this claim:

Running tests in parallel would involve constructing 10000 separate databases, which is infeasible

I wrote a script to quickly create 10000 separate databases using createdb in PostgreSQL on my laptop, and it completed in about 600 seconds. While I wouldn't consider this a great baseline for excellent test performance (and cleanup took another 600 or so), it's not so bad that it's beyond the pale of feasibility, especially considering that we're talking about 10,000 real-database-using tests for some reason. There are much more efficient ways to do this: turning off fsync, for example, or even just disabling fseventsd for the database volume (which took more CPU than the database in my case). So it's not clear that this is a great use-case for Twisted doing anything at all, and perhaps just creating the databases is fine?
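For reference, a minimal sketch of such a script (assuming a local PostgreSQL server with createdb/dropdb on PATH; the database names are illustrative):

import subprocess

# Create, then drop, 10000 scratch databases to measure the overhead.
names = [f"trial_scratch_{i}" for i in range(10000)]
for name in names:
    subprocess.run(["createdb", name], check=True)
for name in names:
    subprocess.run(["dropdb", name], check=True)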

@p12tic
Contributor Author

p12tic commented Sep 4, 2023

Thanks for responses.

If we expose this it should be an API, like twisted.trial.workerIndex(), not an environment variable

Agreed.

I'd love to hear what @p12tic specifically wants to do and whether there might be better approaches, either with trial as it is today or perhaps with other improvements to trial.

You're right that in my case the problem is not performance per se. My problem description was misleading: I framed it in terms of performance even though I hadn't actually measured the cost of create database xyz; drop database xyz; pairs.

The databases in question require special setup before the tests, outside anything run by Twisted. This setup can't be part of the tests run by Twisted, and even if it could, we would be looking at a performance penalty of more than 2 seconds for each test.

It would be great if I could just run the setup N times for N databases and just use them.

@p12tic
Contributor Author

p12tic commented Sep 4, 2023

Also, just a note: having a worker index is one of the standard approaches for optimizing parallel execution. It makes it possible to avoid all synchronization for work done by a single worker, and since the number of workers is small compared to the number of tasks, the synchronization overhead is practically zero.

So exposing a worker index is not something special I came up with.
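As a sketch of that pattern, assuming the proposed index were exposed (TWISTED_TRIAL_PARALLEL_INDEX and the database naming here are assumptions, not existing trial behavior), each worker claims one pre-built database and no cross-worker locking is needed:

import os

def databaseForThisWorker():
    # Worker i always uses database i, so workers never contend for the
    # same database and need no synchronization between them.
    index = int(os.environ.get("TWISTED_TRIAL_PARALLEL_INDEX", "0"))
    return f"test_db_{index}"  # hypothetical pre-created database name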

@adiroiban
Member

I left some feedback on PR #11954.

I have approved the PR but I will wait for another person to take a look and merge or reject.


I think that if this helps in some use cases, the implementation is simple enough.

If we want to implement a high level API, I guess the API could use a similar env variable to detect the worker index.


rather than having everyone implement their own client code for parsing this env var.

I don't expect this env var to be used that much.

Once we have more use cases, it should be easier to design an API.

So maybe it's easier to have this as it is and see what people need.

@exarkun
Member

exarkun commented Sep 4, 2023

Also, just a note: having a worker index is one of the standard approaches for optimizing parallel execution. It makes it possible to avoid all synchronization for work done by a single worker, and since the number of workers is small compared to the number of tasks, the synchronization overhead is practically zero.

So exposing a worker index is not something special I came up with.

To be explicit, I don't necessarily think tracking a worker index is a bad idea or a useless feature. I just want to understand how doing it like this is useful.

As a counter-point, some CI systems themselves support parallelization and are in charge of how many workers are run - and will tell the job what its index is. Should trial allow its own worker configuration to somehow be influenced by this? If there are two indexing systems, should they be related to each other somehow? Should they be completely distinct?

I suspect there are good answers to these questions out there and I suspect any feature we add to trial will be better if we consider those answers before designing it.

@p12tic
Contributor Author

p12tic commented Sep 4, 2023

As a counter-point, some CI systems themselves support parallelization and are in charge of how many workers are run - and will tell the job what its index is. Should trial allow its own worker configuration to somehow be influenced by this? If there are two indexing systems, should they be related to each other somehow? Should they be completely distinct?

I think it's enough to just let the user access the trial worker index. The fact that the user needs the index in the first place suggests that there are likely some very specific requirements. In that case it's likely that trial cannot reliably guess what the user wants, and any higher-level API it provides will be confusing and unlikely to match the requirements of the majority of users anyway.

For example, in your scenario with two indexing systems, it's quite clear how everything works from the user's perspective. The CI runs a job and gives a CI-level index to the job. Trial is run by a command specified by the user. The maximum number of trial worker indexes is already known, because the user specifies the -jN option. The user can then easily calculate a global worker index as <ci_index> * N + <worker_index>. If trial tried to guess this formula, we would need a page of documentation on how exactly it works, and even then, as a user, I would be wary of the trial API implementation being wrong and would write my own code anyway.
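As a sketch of that calculation (CI_NODE_INDEX stands in for whatever variable the CI system provides, TWISTED_TRIAL_PARALLEL_INDEX is the variable proposed above, and N must match the -jN passed to trial):

import os

N = 16  # must match trial -j16
ciIndex = int(os.environ["CI_NODE_INDEX"])  # hypothetical CI-provided index
workerIndex = int(os.environ["TWISTED_TRIAL_PARALLEL_INDEX"])  # proposed variable
globalIndex = ciIndex * N + workerIndex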

Another example: we started to design the workerIndex() API. For some users it's an error if a test runs without parallelization. For me it's perfectly fine - I would want workerIndex() to return zero in that case. So even here there is a chance for trial to encode a wrong assumption about how the API will be used into the API itself.

As a user I would be completely fine using an environment variable. There are hundreds of lines of code related to test setup anyway; writing two or three lines to parse an environment variable is perfectly fine. What is bad is relying on trial internals such as os.getcwd().

@exarkun
Member

exarkun commented Sep 4, 2023

Another example: we started to design the workerIndex() API. For some users it's an error if a test runs without parallelization. For me it's perfectly fine - I would want workerIndex() to return zero in that case. So even here there is a chance for trial to encode a wrong assumption about how the API will be used into the API itself.

There isn't really a difference between the hypothetical workerIndex API and the environment variable in this regard. Both must be capable of representing the two cases: with parallelization and without parallelization. Above, there was discussion of how the environment variable represents the "with parallelization" case; I think the implication is that in the "without parallelization" case the environment variable is simply missing. This perhaps feels natural because it is common for the existence or non-existence of environment variables to signal something - though it often leads to silent errors, since you must trust the environment, and correct propagation of environment variables ends up being tricky in many cases.

A Python API like workerIndex (I don't know if that's the right API or not, just continuing to use it as an example for discussion) has a richer signaling ability. It can return a result. It can return different kinds of results in different cases. It can raise an exception. This doesn't imply any requirement that the API "guess" anything. It can faithfully signal what's going on to the best of its knowledge and leave the rest to the application.

So for the "without parallelization" case, environment variables seem worse (that is, more failure prone) than a Python API.

@p12tic
Contributor Author

p12tic commented Sep 4, 2023

So for the "without parallelization" case, environment variables seem worse (that is, more failure prone) than a Python API.

OK, I was focusing more on the fact that the initial suggestion was to error out in cases that may not be an error for the end user.

I agree that for pure reporting workerIndex() is fine.

I suggest the following:

import os

def workerIndex(defaultValue=None):
    # inTests() is a hypothetical helper that reports whether we are
    # running under trial at all.
    if not inTests():
        raise Exception("Not in tests")
    # "ENVIRONMENT_VARIABLE" is a placeholder for whatever name gets
    # chosen, e.g. TWISTED_TRIAL_PARALLEL_INDEX.
    value = os.environ.get("ENVIRONMENT_VARIABLE", None)
    if value is None:
        return defaultValue
    return int(value)
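With this sketch, workerIndex(0) returns 0 both when the environment variable is unset (no parallelization) and for the first worker, while callers that treat non-parallel runs as an error can omit the default and check for None.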

Would that work?

@glyph
Member

glyph commented Sep 4, 2023

I'd probably prefer that this value be passed as something other than an env var, like adding something to workercommands.Start. I don't want code to start relying on the presence of an env var; you should have to go through the "front door" of the API to get this information.

Also, I think that for a lot of use-cases here, you need to know both the worker's index and the total number of workers currently configured, so we shouldn't just be passing one value. I'd want to see something like this:

from dataclasses import dataclass
from typing import Union

@dataclass
class InStaticPool:
    totalWorkerCount: int
    thisWorkerIndex: int

class SingleProcess:
    "Trial is in single-process mode."

class NotRunningInTrial:
    "You're not running under Trial at all, good luck"

def workerConfiguration() -> Union[InStaticPool, SingleProcess, NotRunningInTrial]:
    "implementation left as an exercise for the reader"

@exarkun
Member

exarkun commented Sep 5, 2023

I'd probably prefer that this value be passed as something other than an env var

Yes, definitely (for all of the reasons given against using env vars upthread from here).

def workerConfiguration() -> Union[InStaticPool, SingleProcess, NotRunningInTrial]:

I'd suggest exploring not making this global. Instead, making it part of the runner or result interfaces - or possibly introducing a new value specific to trial or disttrial where it could live.

With an env var-based implementation, a global interface to access the information is the obvious move (because the env vars are process-global). If the values are passed in the disttrial Start command, though, promoting that non-global state to global state is both more work and ruins a lot of the value of getting away from env vars in the first place. For example, consider how you would test the behavior of this interface such that your tests work both under normal trial and in a disttrial worker. It's possible, but by going global you invent a bunch more work for yourself and all future maintainers.

@glyph
Member

glyph commented Sep 5, 2023

I'd suggest exploring not making this global. Instead, making it part of the runner or result interfaces - or possibly introducing a new value specific to trial or disttrial where it could live.

I'd definitely prefer it not be global. The reason I proposed it as such was that I didn't really want to think, in a quick sketch, about how to thread this through to the (Synchronous)TestCase implementation, whether stdlib unittests would somehow get this interface, and how result proxies would deal with it. But you're right, actually implementing it globally would involve a bunch of similarly annoying details anyway.

@p12tic
Contributor Author

p12tic commented Sep 7, 2023

Just a note: I haven't disappeared, but it will take some time to get back to this issue and implement it to the maintainers' satisfaction. I agree with all points raised here.

@glyph
Member

glyph commented Sep 7, 2023

Thanks @p12tic !
