Parallel Twisted trial workers should know an index to differentiate between each other #11949
Comments
If we expose this it should be an API, like |
This isn't quite the case. It's already possible for each worker process to manage a single database, if that's the behavior you want. I don't know how knowing "this is worker 7" even helps make it easier to solve that problem. |
Managing a database connection is pretty easy by just using a global variable in the relevant process, but I can imagine some hypotheticals where it might help. Like, say, you have a database pre-populated in your CI environment with gigabytes of test information, and your trial tests need to connect to it but not interfere with each other. You might want to have separate actual database servers 1-N and have those correspond to your trial-worker index, assuming that your CI has some kind of exclusive lock around those. However, this hypothetical example is a sort of hodgepodge drawn from experiences that I had around 2002 in a previous epoch of testing technology, so while I could believe this might be useful in some circumstances, I'd love to hear what @p12tic specifically wants to do and whether there might be better approaches, either with trial as it is today or perhaps with other improvements to trial. For example, if there's really some very tricky fixture setup and teardown coordination, it might make more sense to have a way to allow client code to run something in the manager process that the workers can exchange messages with, rather than depending on hard-coding fixed indexes. |
I'd also like to hear about specific information associated with this claim:
I wrote a script to quickly create 10000 separate databases using |
Thanks for the responses.
Agreed.
You're right that in my case the problem is not performance per se. My problem description was misleading because the problem was described in terms of performance when I wasn't even measuring performance of The databases in question have special setup before the test, outside the tests run by Twisted. This setup can't be part of the tests run by Twisted. Even if it could, we would be looking at a performance penalty of more than 2 seconds for each test. It would be great if I could just run the setup N times for N databases and then use them. |
Also, just a note: having a worker index is one of the standard approaches to optimizing parallel execution. With it, the tasks run by a single worker need no synchronization among themselves, and since the number of workers is small compared to the number of tasks, the remaining synchronization overhead is practically zero. So exposing a worker index is not something special that I came up with. |
I left some feedback on PR #11954. I have approved the PR, but I will wait for another person to take a look and merge or reject it. I think that if this helps in some use cases, the implementation is simple enough. If we want to implement a high-level API, I guess the API could use a similar env variable to detect the worker index.
I don't expect this env var to be used that much. Once we have more use cases, it should be easier to design an API. So maybe it's easier to leave this as it is and see what people need. |
To be explicit, I don't necessarily think tracking a worker index is a bad idea or a useless feature. I just want to understand how doing it like this is useful. As a counter-point, some CI systems themselves support parallelization and are in charge of how many workers are run - and will tell the job what its index is. Should trial allow its own worker configuration to somehow be influenced by this? If there are two indexing systems, should they be related to each other somehow? Should they be completely distinct? I suspect there are good answers to these questions out there and I suspect any feature we add to trial will be better if we consider those answers before designing it. |
I think it's enough to just allow the user to access the trial worker index. The fact that the user needs the index in the first place suggests that there are likely some very specific requirements. In that case it's likely that trial cannot reliably guess what the user wants. The provided API will be confusing and unlikely to match the requirements of the majority of users anyway. For example, in your example with two indexing systems, it's quite clear how everything works from the user's perspective. CI runs a job and gives the CI-level index to the job. Trial is run by a command specified by the user. The maximum number of trial worker indexes is already known because Another example: we started to design As a user it would be completely fine to use an environment variable. There are hundreds of lines of code related to test setup anyway. Writing two or three lines for parsing environment variables is perfectly fine. What is bad is relying on trial internals such as |
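For illustration, the "two or three lines for parsing environment variables" might look roughly like the sketch below. It assumes the variable name `TWISTED_TRIAL_PARALLEL_INDEX` proposed in the issue description; whether trial actually sets a variable with that name is not established in this thread, and the helper name `worker_index_from_env` is hypothetical.

```python
import os
from typing import Mapping, Optional

# Assumption: the variable name proposed in this issue, not a confirmed trial feature.
ENV_NAME = "TWISTED_TRIAL_PARALLEL_INDEX"


def worker_index_from_env(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the worker index found in ``environ``, or None if it is unset."""
    raw = environ.get(ENV_NAME)
    if raw is None:
        # Variable absent: treat this as "not running under a parallel worker".
        return None
    try:
        index = int(raw)
    except ValueError:
        raise RuntimeError(f"{ENV_NAME} is not an integer: {raw!r}") from None
    if index < 0:
        raise RuntimeError(f"{ENV_NAME} must be non-negative, got {index}")
    return index
```

Passing the mapping as a parameter keeps the helper testable without touching the real process environment. Note that an unset variable cannot distinguish "trial without parallelization" from "not running under trial at all".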
There isn't really a difference between the hypothetical A Python API like So for the "without parallelization" case, environment variables seem worse (that is, more failure prone) than a Python API. |
OK, I was focusing more on the fact that the initial suggestion was to error out in cases that may not be an error for the end user. I agree that for pure reporting I suggest the following:
Would that work? |
I'd probably prefer that this value be passed as something other than an env var, like adding something to Also, I think that for a lot of use-cases here, you need to know both the worker's index and the total number of workers currently configured, so we shouldn't just be passing one value. I'd want to see something like this:

    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class InStaticPool:
        totalWorkerCount: int
        thisWorkerIndex: int

    class SingleProcess:
        "Trial's in single-process mode."

    class NotRunningInTrial:
        "You're not running under Trial at all, good luck"

    def workerConfiguration() -> Union[InStaticPool, SingleProcess, NotRunningInTrial]:
        "implementation left as an exercise for the reader"

|
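To make the shape of that proposal more concrete, here is a sketch of how test setup code might consume such a value. The three classes mirror the sketch above; the helper `database_name` and the `testdb` prefix are hypothetical, and nothing here is an existing trial API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class InStaticPool:
    totalWorkerCount: int
    thisWorkerIndex: int


class SingleProcess:
    "Trial's in single-process mode."


class NotRunningInTrial:
    "You're not running under Trial at all, good luck"


WorkerConfiguration = Union[InStaticPool, SingleProcess, NotRunningInTrial]


def database_name(config: WorkerConfiguration, prefix: str = "testdb") -> str:
    """Pick the pre-built database this process should use (hypothetical helper)."""
    if isinstance(config, InStaticPool):
        # One database per worker, so workers never share state.
        return f"{prefix}_{config.thisWorkerIndex}"
    # Single-process trial or no trial at all: there is only one consumer.
    return f"{prefix}_0"
```

Because the pool case carries both the index and the total count, setup code could also verify that enough databases were prepared before any test runs.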
Yes, definitely (for all of the reasons given against using env vars upthread from here).
I'd suggest exploring not making this global. Instead, making it part of the runner or result interfaces - or possibly introducing a new value specific to trial or disttrial where it could live. With an env var-based implementation, a global interface to access the information is the obvious move (because the env vars are process-global). If the values are passed in the disttrial |
I'd definitely prefer it not be global. The reason I proposed it as such was that I didn't really want to think in a quick sketch about how to thread this through to the ( |
Just a note: I haven't disappeared, but it will take some time to get back to this issue and implement this to the maintainers' satisfaction. I agree with all points raised here. |
Thanks @p12tic! |
Is your feature request related to a problem? Please describe.
Currently tests run under parallel trial do not know about each other. This makes it difficult to coordinate access to shared resources (e.g. a database) that take a long time to construct.
For example, consider a case of 10000 tests that access a database. Currently the tests can only be run in sequence, each using the same database and cleaning up after itself. Running the tests in parallel would involve constructing 10000 separate databases, which is infeasible. However, if each test knew the index of the parallel Twisted trial worker it runs under, then only a small number of databases would need to be created and the tests would use them as in the serial case. E.g. when running tests using trial -j16, only 16 databases would need to be created.
Describe the solution you'd like
Expose the index of the parallel trial worker as an environment variable, e.g. TWISTED_TRIAL_PARALLEL_INDEX.
Describe alternatives you've considered
An alternative is for the tests to inspect os.getcwd(). It contains the index of the parallel trial worker as the last path component, e.g. _trial_temp/21. However, this is an implementation detail of trial which is problematic to rely on. Twisted should provide a proper way to detect this.
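For reference, the working-directory workaround described above could be sketched as follows. It relies on the observation from this issue that the last path component is the worker index (as in `_trial_temp/21`), which is an implementation detail of trial and may change; the helper name `worker_index_from_cwd` is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def worker_index_from_cwd(cwd: Optional[str] = None) -> Optional[int]:
    """Guess the worker index from the last path component of the working directory.

    Fragile by design: this depends on a directory layout that trial does not
    promise to keep.
    """
    name = Path(cwd if cwd is not None else os.getcwd()).name
    if name.isascii() and name.isdigit():
        return int(name)
    # Last component is not a plain number: probably not a parallel worker.
    return None
```

A numeric directory name that has nothing to do with trial would be misread as an index, which is one more reason to prefer an explicit mechanism.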