Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get-Content should release its file handle as early as possible #10025

Closed
vexx32 opened this issue Jun 28, 2019 · 18 comments
Closed

Get-Content should release its file handle as early as possible #10025

vexx32 opened this issue Jun 28, 2019 · 18 comments
Labels
Issue-Enhancement the issue is more of a feature request than a bug Resolution-No Activity Issue has had no activity for 6 months or more WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module

Comments

@vexx32
Copy link
Collaborator

vexx32 commented Jun 28, 2019

Summary of the new feature/enhancement

Currently, a pipeline like this is impossible:

Get-Content $Path | ForEach-Object { $_ -f $a } | Set-Content $Path

This is due to Get-Content not releasing its file handle until the completion of its pipeline. Adding parentheses around the Get-Content call to force it to run to completion and collect output before proceeding does work but is a bit clunky, and there's no reason that this shouldn't also work:

Get-Content $Path -Raw | ForEach-Object { $_ -f $a } | Set-Content $Path

Currently this doesn't, because the file handle is still not released until the pipeline's completion, despite all the data being read at once. There are other caveats to using -Raw, but this would at least enable simple operations with less clunky syntax.

Proposed technical implementation details (optional)

Something that will help alleviate the issue a little bit is some of the changes to command disposal behaviour in #9900, which causes commands to be disposed as soon as their role in the pipeline is fully completed (after EndProcessing / end{} is completed).

Beyond that, the rest of the changes can be implemented in Get-Content itself for -Raw, and it may be worth looking into whether there are other avenues for alleviating the issue.

For example, in Set-Content we can have a fallback behaviour whereby if it can't get a write handle on the file, it instead writes to a temporary file, and during EndProcessing() or with a Dispose() step it then copies the file over the original.

@vexx32 vexx32 added the Issue-Enhancement the issue is more of a feature request than a bug label Jun 28, 2019
@iSazonov iSazonov added the WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module label Jun 28, 2019
@iSazonov
Copy link
Collaborator

Pipeline executes BeginProcessing for every command in the pipeline, then ProcessRecord for every command in the pipeline and then EndProcessing for every command in the pipeline. It is not clear that we could fix here.

@vexx32
Copy link
Collaborator Author

vexx32 commented Jun 28, 2019

Aye, I suspect the main opportunity here would be doing this for the Get-Content -Raw case (where the whole file is read at once and submitted to the pipeline as a single object) and then for Set-Content to attempt to overwrite at the end of the pipeline.

Potentially, Set-Content could wait until the pipeline is disposed before it actually writes the data to the file, or we could simply defer the write to a temporary file by default before copying it over the original if there is a problem like this that prevents writing during the pipeline.

@SeeminglyScience
Copy link
Collaborator

Pipeline executes BeginProcessing for every command in the pipeline, then ProcessRecord for every command in the pipeline and then EndProcessing for every command in the pipeline. It is not clear that we could fix here.

If every command in the pipeline always writes an object for ProcessRecord and never writes any objects in EndProcessing that is correct. Otherwise, EndProcessing will run after after the last input object is received for each command:

$one = { end { 0; Write-Host '1 end called' }}
$two = { process { $_ } end { Write-Host '2 end called' }}
$three = { process { } end { Write-Host '3 end called'; 0..2;}}
$four = { process { Write-Host $_ } end { Write-Host '4 end called' }}

& $one | & $two | & $three | & $four

1 end called
2 end called
3 end called
0
1
2
4 end called

In the original issue, the idea was that if you add Out-String before Set-Content then Get-Content should be disposed after Out-String has consumed all of it's output. That would be after EndProcessing runs for Get-Content, but before it runs for Out-String.

@iSazonov
Copy link
Collaborator

This looks like a workaround but cannot be a general fix, as it causes the accumulation of information that for large files will lead to memory exhaustion. Even in other languages, the best practice is to write to an auxiliary file and then rename it to the original one that is more safe. I guess this can be achieved with Rename-Item in the pipeline end.

@vexx32
Copy link
Collaborator Author

vexx32 commented Jun 29, 2019

Sure, it can be done manually, but it would be nice if PS could handle that behind the scenes, in my opinion. 😄

Also, that still can't be done in a single pipeline, I don't think, since currently Get-Content won't release the handle until the entire pipeline is complete regardless of the use case.:/

@KirkMunro
Copy link
Contributor

Sounds more like a need for an optional switch on Set-Content to make it buffer updates and write the content to the file in the end block.

@vexx32
Copy link
Collaborator Author

vexx32 commented Jun 29, 2019

That would be OK for small to medium files, but if you're writing large files it will probably need to buffer to a temporary file, or we'll have huge memory usage. 😕

@KirkMunro
Copy link
Contributor

But that's what you're doing by using Get-Content -Raw, which is the use case you talked about in the first place. My point was that maybe you should be buffering downstream rather than upstream if you want to support one-liner content updates so that it works whether you use raw or not, and it's explicit rather than implicit.

@iSazonov
Copy link
Collaborator

Even with downstream buffering the operation is dangerous. We should use intermediate/temporary file and rename it or use a transactional file system.

@KirkMunro
Copy link
Contributor

I agree. My point was just about doing "the work" downstream. Writing to a temp file and then moving it and/or doing transactional work would be the right way to go.

@HumanEquivalentUnit
Copy link
Contributor

HumanEquivalentUnit commented Jul 29, 2019

Sounds more like a need for an optional switch on Set-Content to make it buffer updates and write the content to the file in the end block.

That's sooo close to -outvariable:

Get-Content $Path | ForEach-Object { $_ -f $a } | Set-Content $Path
Get-Content $Path | ForEach-Object { $_ -f $a } -ov B; Set-Content $Path $B

(nb. small risk of using get-content -wait and then having a temporary file grow indefinitely)

@vexx32
Copy link
Collaborator Author

vexx32 commented Aug 19, 2019

Some filesystems support transactions, but unfortunately PS's provider does not.

But yeah writing to a temp file would be a good solution if the file is still open when Set-Content tries to get a write handle.

@iSazonov
Copy link
Collaborator

iSazonov commented Aug 20, 2019

but unfortunately PS's provider does not.

The code is in the repo but was commented at porting time (Windows only).

@michaeltlombardi
Copy link
Contributor

This same class of problem applies to Select-String, which similarly keeps the file handles open until the pipeline closes.

I do think this is a bit of a trap for new users (and people like me, who forget this happens), because on its face this pattern is perfectly reasonable:

Get-Something | Format-Something | Write-Something

But because file cmdlets (at least Get-Content and Select-String) keep the file handle open until the entire pipeline ends, you need to do something like:

(Get-Something) | Format-Something | Write-Something

Which is both non-intuitive and a little mysterious if you're not familiar with why that pattern is being used.

Copy link
Contributor

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

2 similar comments
Copy link
Contributor

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

Copy link
Contributor

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

@microsoft-github-policy-service microsoft-github-policy-service bot added Resolution-No Activity Issue has had no activity for 6 months or more labels Nov 16, 2023
Copy link
Contributor

This issue has been marked as "No Activity" as there has been no activity for 6 months. It has been closed for housekeeping purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue-Enhancement the issue is more of a feature request than a bug Resolution-No Activity Issue has had no activity for 6 months or more WG-Cmdlets-Management cmdlets in the Microsoft.PowerShell.Management module
Projects
None yet
Development

No branches or pull requests

6 participants