
Is it possible to have a task list execute in parallel #24

Open
MartinWallgren opened this issue Apr 27, 2021 · 15 comments
Labels
help wanted

Comments

@MartinWallgren

Is it possible to create a task that executes multiple tasks in parallel?

I know I can create a compound task like

test = ["mypy", "pylint", "pytest"]

Calling poe test will run each task in sequence, one after the other. It would be nice to be able to configure these task lists as safe to start in parallel.

Parallel execution should of course not be the default, since some tasks require output from previous tasks (coverage being the prime example: it needs a completed test run before a coverage report can be generated).

@nat-n
Owner

nat-n commented Apr 27, 2021

This is not currently supported. I considered it when first implementing the sequence task type. I thought it might be nice if, by default, an array inside an array were interpreted as a ParallelTask type within a SequenceTask, so for example the following would run mypy and pylint in parallel, then pytest after that:

test = [["mypy", "pylint"], "pytest"]

And of course you could also do:

test.parallel = ["mypy", "pylint", "pytest"]

However, the problem is that I'm not sure what it should do with stdout. I imagine one wouldn't simply want both subprocesses to write to the same console at the same time! Maybe there could be a solution along the lines of capturing the output and feeding it to the console one line at a time (maybe with a prefix linking it to the task that produced it, kind of like docker-compose does), but that's getting complicated to implement.
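For illustration, here is a minimal sketch of that prefixing idea using asyncio.subprocess (this is not part of poethepoet; the task names and commands are placeholders):

import asyncio
import sys

async def run_prefixed(name: str, args: list[str]) -> int:
    # Run one task and prefix every line of its output with the task name
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    while True:
        line = await proc.stdout.readline()
        if not line:
            break
        print(f"{name} | {line.decode(errors='replace').rstrip()}")
    return await proc.wait()

async def main() -> int:
    # Placeholder subtasks; a real implementation would take these from the task config
    tasks = {"mypy": ["mypy", "."], "pylint": ["pylint", "my_package"]}
    codes = await asyncio.gather(*(run_prefixed(n, a) for n, a in tasks.items()))
    return 1 if any(codes) else 0

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))

Interleaving then happens at line granularity, and the prefixes are what keep the combined log readable.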

As I mention in #26, if the stdout of those tasks were configured to be captured anyway – such as for use in another task, or maybe to be piped to a file or discarded – then this problem goes away, and the tasks might as well be run in parallel. There's just the question left of how to handle a failure of one task in the set (whether to wait for the others).

I'd like to support parallel execution, but I'm really not sure how it should work. What do you think @MartinWallgren?

@nat-n
Owner

nat-n commented Apr 27, 2021

Also, a potential if inelegant workaround might be to use a shell task with background jobs, along the lines of:

[tool.poe.tasks.test]
shell = """
poe mypy &
poe pylint &
poe pytest &
wait $(jobs -p)
"""

@asfaltboy

We do something like this in bash:

  1. First, we run all the processes together, using a temp file to store each command's output.
  2. Then, we iterate over each command, waiting for its completion, and display that command's output.

The side effect of this simple method is that it seemingly "stalls" on the slowest command, returning when they all complete. This means that the CMDS array should preferably be sorted fastest to slowest.

# Example values (assumed for this snippet): the commands to run and ANSI colour codes
CMDS=("mypy" "pylint" "pytest")
C_GREEN=$'\e[32m'; C_RED=$'\e[31m'; C_UNDERLINE=$'\e[4m'; C_RESET=$'\e[0m'
pids=(); stdouts=(); timers=(); codes=()

# Start every command in the background, capturing its output and timing in temp files
for cmd in "${CMDS[@]}"; do
    stdout="$(mktemp)"
    timer="$(mktemp)"
    { { time $cmd >>"$stdout" 2>&1 ; } >>"$timer" 2>&1 ; } &
    pids+=($!)
    stdouts+=("$stdout")
    timers+=("$timer")
done

# Wait for each command in list order and report its result
for i in ${!CMDS[*]}; do
    if wait "${pids[$i]}"; then
        codes+=(0)
    else
        codes+=(1)
    fi

    if [ "${codes[$i]}" -eq "0" ]; then
        echo -en "${C_GREEN}"
        echo -en "${CMDS[$i]}"
        echo -en "$C_RESET"
        echo -e " ($(cat "${timers[$i]}")s)"
    else
        echo -en "${C_RED}${C_UNDERLINE}"
        echo -en "${CMDS[$i]}"
        echo -e "$C_RESET"
        echo -e "$(cat "${stdouts[$i]}")"
    fi
    echo ""
done

@jnoortheen

Another way is to use the gnu-parallel command:

parallel ::: "flake8" "mypy dirname"

@nat-n the initial implementation can be very simple.

  1. Let's say there are three tasks passed: we give the tty to the first task only (that means no capturing), so the user can see the progress from that task. Once the first task has finished running, we print the output/error from the next task, and so on.
  2. Regarding errors: we run all tasks even if we encounter errors, return a failure code, and mention which ones failed. (Doing 1, we will be printing the errors already.)

We can later add some config about how these are executed. It could be a cross-platform alternative to parallel.

@ThatXliner
Contributor

We can later add some config about how these are executed. It could be a cross-platform alternative to parallel.

like a backend config on how to parallelize?

@jnoortheen

like a backend config on how to parallelize?

Yes, some task- or project-level configs.

@nat-n
Owner

nat-n commented Apr 18, 2022

Hi @jnoortheen, thanks for the idea.

I understand that you're proposing the following strategy, which I'll call Strategy 1:

  1. let the first task in the list output directly to stdout until it completes
  2. for each subsequent task: buffer its stdout in memory (or a tempfile to avoid unbounded memory use) until it completes
  3. dump the buffered output of each completed task, once all previous tasks have been output

This is probably the best solution in terms of having a coherent output log at the end. Though it assumes that the tasks in the list are meaningfully ordered, which doesn't seem necessary. Therefore it might sometimes make more sense to use the following Strategy 2 instead:

  1. treat all tasks in the list as having equal precedence and buffer their output until they complete
  2. whenever a task completes then dump its output to stdout (even if tasks specified earlier in the list are still running)

Both Strategy 1 and Strategy 2 would benefit from poe providing some extra output lines to clarify which output is from which task (unless running in quiet mode).

Strategy 3 would be like Strategy 2, except that we capture and output each line of task output as it arrives (with some prefix indicating which task it came from).

And Strategy 4 would be to just let all tasks output directly to stdout on top of one another, which may sometimes be necessary to support.

Are there any other strategies worth considering? Is it worthwhile also being able to direct outputs to separate filesystem locations? e.g. f"task_name_{subtask_number}.out"
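If writing to separate files were supported, the plumbing could be as simple as handing each subprocess its own file handle. A rough sketch, where the naming scheme and function name are just assumptions for illustration:

import subprocess

def run_to_files(task_name: str, subtasks: list[list[str]]) -> list[int]:
    # Start each subtask with stdout/stderr redirected to its own file, then wait for all of them
    procs = []
    for i, args in enumerate(subtasks):
        out = open(f"{task_name}_{i}.out", "w")
        procs.append((subprocess.Popen(args, stdout=out, stderr=out), out))
    codes = []
    for proc, out in procs:
        codes.append(proc.wait())
        out.close()
    return codes

# e.g. codes = run_to_files("test", [["mypy", "."], ["pylint", "my_package"]])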

I think it would be best if the user can configure the strategy for a specific parallel task independently for stdout and stderr, with Strategy 1 being the default for stdout and Strategy 3 or 4 being the default for stderr.

Maybe how to handle errors should also be configurable, with continuing the other tasks but returning non-zero at the end as the default behaviour if one or more tasks fail. But having the option to stop all tasks if one fails, or even to always continue and return zero, would also make sense.

I'm thinking this would require having a thread per running subtask, which is responsible for monitoring the subtask and handling its output.

To be clear, I would not be keen on making gnu parallel (or any other binary less common than bash itself) a dependency of poethepoet, and implementing such an integration mechanism would probably be a bit complex to get right.

Any other ideas?

@ThatXliner
Contributor

ThatXliner commented Apr 18, 2022

Seems good, but why 3 or 4 as the default for stderr?

On second thought, yeah: you want to see the errors quickly. I was thinking of those multi-line errors/warnings like those from pip… so maybe buffer the lines a bit until, say, 0.2 seconds have passed and no more new lines have been seen so far from process X?
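A rough sketch of that debouncing idea, assuming the reader comes from an asyncio.subprocess pipe as in the earlier sketch (the names and the 0.2 second quiet period are just illustrative):

import asyncio

async def flush_when_quiet(name: str, reader: asyncio.StreamReader, quiet: float = 0.2) -> None:
    # Buffer lines and flush the group once no new line arrives within `quiet` seconds
    buffer: list[str] = []
    while True:
        try:
            line = await asyncio.wait_for(reader.readline(), timeout=quiet)
        except asyncio.TimeoutError:
            if buffer:
                print(f"[{name}]\n" + "".join(buffer), end="", flush=True)
                buffer.clear()
            continue
        if not line:  # EOF: the subprocess closed its output
            break
        buffer.append(line.decode(errors="replace"))
    if buffer:
        print(f"[{name}]\n" + "".join(buffer), end="", flush=True)

This would replace the line-by-line printing in a Strategy 3 style implementation, so related lines from one process stay grouped together.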

@jnoortheen
Copy link

I've actually implemented that suggested solution here using the asyncio.subprocess module. It just outputs stdout from the commands to sys.stdout, and stderr to stderr.

@ThatXliner
Contributor

We could take some inspiration from https://github.com/open-cli-tools/concurrently#readme

@luketych

luketych commented Apr 9, 2023

+1 interest in implementing this

@nat-n
Owner

nat-n commented Apr 10, 2023

I think this is an important feature, but it's currently not near the top of my list. If someone wants to submit a PoC for one or more of the strategies discussed above then that would help move it along :)

Strategy 1 using asyncio.subprocess as @jnoortheen suggests is probably a good place to start. I'm thinking this would be a new parallel task type that is otherwise similar to the sequence task type.
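To make that concrete, here is a simplified sketch of Strategy 1 with asyncio.subprocess (not an actual poethepoet API; the subtask commands are placeholders). The first subtask inherits the console, the rest are captured, and their output is replayed in declaration order once everything has finished; a real implementation could flush each buffer as soon as all earlier subtasks are done.

import asyncio
import sys

async def run_subtask(args: list[str], live: bool) -> tuple[int, bytes]:
    # The "live" subtask writes straight to the console; the others are captured
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=None if live else asyncio.subprocess.PIPE,
        stderr=None if live else asyncio.subprocess.STDOUT,
    )
    captured, _ = await proc.communicate()
    return proc.returncode, captured or b""

async def run_parallel(subtasks: list[list[str]]) -> int:
    results = await asyncio.gather(
        *(run_subtask(args, live=(i == 0)) for i, args in enumerate(subtasks))
    )
    # Replay the captured output in the order the subtasks were declared
    for code, output in results:
        if output:
            sys.stdout.write(output.decode(errors="replace"))
    return 1 if any(code for code, _ in results) else 0

if __name__ == "__main__":
    sys.exit(asyncio.run(run_parallel([["mypy", "."], ["pylint", "my_package"], ["pytest"]])))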

@nat-n added the help wanted label Apr 10, 2023
@luketych

@nat-n what is currently at the top of your list? Maybe some of us could help on those.

@sewi-cpan

+1 for this request

@JCHacking

JCHacking commented Apr 11, 2024

I think it would be easier to run it in threads, since the current code is not asynchronous (maybe for version 0.3 it could be rewritten asynchronously?).
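For comparison, a thread-based sketch along those lines using only the standard library (nothing here reflects poethepoet's actual internals; the commands are placeholders): each subtask runs in its own worker thread and its captured output is replayed in declaration order.

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_captured(args: list[str]) -> subprocess.CompletedProcess:
    # Each worker thread simply blocks on its own subprocess
    return subprocess.run(args, capture_output=True, text=True)

def run_parallel(subtasks: list[list[str]]) -> int:
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(run_captured, subtasks))
    for result in results:
        # Replay output in the order the subtasks were declared
        sys.stdout.write(result.stdout)
        sys.stderr.write(result.stderr)
    return 1 if any(r.returncode for r in results) else 0

if __name__ == "__main__":
    sys.exit(run_parallel([["mypy", "."], ["pylint", "my_package"], ["pytest"]]))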
