Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The data packets are misaligned when SFTP timeout caught by outside code and continue downloading. #1959

Open
SyuTingSong opened this issue Nov 14, 2023 · 3 comments

Comments

@SyuTingSong
Copy link

SyuTingSong commented Nov 14, 2023

I believe I have identified a bug. I am downloading multiple files in a loop. The code is based on Laravel and use the league/flysystem-sftp-v3 as the underlying storage driver. The relevant part is as follows:

foreach ($file_paths as $file_path) {
    try {
        Log::info("[$file_path] is selected for downloading.");
        $source_stream = Storage::disk('sftp')->readStream($file_path);
        $this->processData($source_stream);
    } catch (Throwable $e) {
        Log::error('Failed to download. Error: ' . $e->getMessage());
    }
}

I use a try-catch block to catch exceptions that occur during the remote file download process and continue with the next file. However, I've noticed that sometimes, after a network error occurs, the downloaded file size does not match as the size from metadata.

This issue occurs sporadically, and I have spent several months trying to reproduce it a few times. I enabled the NET_SFTP_LOGGING and added some more info to SFTP logs for debugging. Finally, I have obtained enough log information to reconstruct the entire process of the issue occurring.

Here are the key logs. sftp.log


Everything begins with the \phpseclib3\Net\SFTP::close_handle call $this->get_sftp_packet() timeout. As the get_sftp_packet method will return false when timeout, the $response of course not equal NET_SFTP_STATUS in this scenario, so it throw an UnexpectedValueException.

67| 2023-11-10T11:21:48.718Z <- [31]NET_SFTP_STATUS(EOF) (0.0001s)
68| 2023-11-10T11:21:48.718Z -> [01]NET_SFTP_CLOSE(1) (0.0001s)
69| 2023-11-10T11:22:18.095Z Failed to download. Error: Expected NET_SFTP_STATUS. Got packet type: false
70| 2023-11-10T11:22:18.267Z [08001525/08001525.CRS20231107.RPS.zip] is selected for downloading.
71| 2023-11-10T11:22:18.749Z -> [01]NET_SFTP_OPEN(/Outbound/08001525/08001525.CRS20231107.RPS.zip) (0.0001s)
72| 2023-11-10T11:22:48.291Z Failed to download. Error: Expected NET_SFTP_HANDLE or NET_SFTP_STATUS. Got packet type: false

As shown in line 69 of the log, the close_handle method throws an UnexpectedValueException when it waits for a NET_SFTP_STATUS packet timeout. The try-catch block in loop caught the exception, recoard the message and trying to download the next file. Since the network is still stalled, the NET_SFTP_OPEN also waits timeout and throws another UnexpectedValueException. As shown in line 72 of the log.

The loop keep trying next and next files, the client keep sending OPEN and OPEN packets. Until sometime the stalled network resume to respond packets.

142| 2023-11-10T11:34:19.956Z [08003970/08003970.CRS20231107.RPS.zip] is selected for downloading.
143| 2023-11-10T11:34:20.277Z -> [01]NET_SFTP_OPEN(/Outbound/08003970/08003970.CRS20231107.RPS.zip) (0.0001s)
144| 2023-11-10T11:34:49.981Z Failed to download. Error: Expected NET_SFTP_HANDLE or NET_SFTP_STATUS. Got packet type: false
145| 2023-11-10T11:34:50.026Z [08003970/08003970.CRS20231107.UPN.zip] is selected for downloading.
146| 2023-11-10T11:34:50.306Z -> [01]NET_SFTP_OPEN(/Outbound/08003970/08003970.CRS20231107.UPN.zip) (0.0001s)
147| 2023-11-10T11:34:57.848Z <- [01]NET_SFTP_STATUS(OK) (7.5397s)
148| 2023-11-10T11:34:57.867Z Failed to download. Error: Unable to read file from location: 08003970/08003970.CRS20231107.UPN.zip.
149| 2023-11-10T11:34:57.931Z [08003970/08003970.CRS20231107.UPT.zip] is selected for downloading.
150| 2023-11-10T11:34:58.311Z -> [01]NET_SFTP_OPEN(/Outbound/08003970/08003970.CRS20231107.UPT.zip) (0.0001s)
151| 2023-11-10T11:34:58.311Z <- [01]NET_SFTP_HANDLE(1) (0.0001s)
152| 2023-11-10T11:34:58.311Z -> [00]NET_SFTP_READ(1) (0s)
153| 2023-11-10T11:34:58.311Z -> [01]NET_SFTP_READ(1) (0s)

As shown in line 147 of the log, the network resume and a NET_SFTP_STATUS packet response. It should have corresponded to the NET_SFTP_CLOSE packet (log line 68) that occurred before the network stalled. But the get method have no idea about this and return false so the league/flysystem-sftp-v3 throws an UnableToReadFileException that recorded in line 148 of the log. Then the loop catch it again and try to download the next file.

Coincidentally, the next NET_SFTP_OPEN packet gets a NET_SFTP_HANDLE packet "as expected"(log line 151), which should have corresponded to the first NET_SFTP_OPEN(log line 71) during the network outage. But the get method again have no idea about this and happily send 32 × NET_SFTP_READ for downloading the file.

183| 2023-11-10T11:34:58.313Z -> [31]NET_SFTP_READ(1) (0s)
184| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(2) (0s)[buffer]
185| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(3) (0s)[buffer]
186| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(4) (0.0001s)[buffer]
187| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(5) (0s)[buffer]
188| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(6) (0s)[buffer]
189| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(7) (0s)[buffer]
190| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(8) (0s)[buffer]
191| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(9) (0s)[buffer]
192| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(10) (0s)[buffer]
193| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_HANDLE(11) (0s)[buffer]
194| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
195| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0.0001s)[buffer]
196| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
197| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0.0001s)[buffer]
198| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
199| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
200| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
201| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
202| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
203| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
204| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
205| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
206| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
207| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
208| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
209| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0.1028s)[buffer]
210| 2023-11-10T11:34:58.313Z <- [00]NET_SFTP_DATA(32768bytes) (0.1937s)
211| 2023-11-10T11:34:58.313Z -< [01]NET_SFTP_STATUS(NO_SUCH_FILE) (0s)[buffer]
212| 2023-11-10T11:34:58.313Z <- [01]NET_SFTP_DATA(32768bytes) (0.0654s)[buffer]
213| 2023-11-10T11:34:58.313Z <- [02]NET_SFTP_DATA(32768bytes) (0.0643s)
214| 2023-11-10T11:34:58.313Z <- [03]NET_SFTP_DATA(32768bytes) (0.0645s)
215| 2023-11-10T11:34:58.313Z <- [04]NET_SFTP_DATA(32768bytes) (0.0645s)
216| 2023-11-10T11:34:58.313Z <- [05]NET_SFTP_DATA(32768bytes) (0.0645s)
217| 2023-11-10T11:34:58.313Z <- [06]NET_SFTP_DATA(32768bytes) (0.0005s)
218| 2023-11-10T11:34:58.313Z <- [07]NET_SFTP_DATA(2157bytes) (0.0001s)
219| 2023-11-10T11:34:58.313Z <- [08]NET_SFTP_STATUS(EOF) (0.0001s)
220| 2023-11-10T11:34:58.313Z <- [09]NET_SFTP_STATUS(EOF) (0s)
221| 2023-11-10T11:34:58.313Z <- [10]NET_SFTP_STATUS(EOF) (0.0457s)

When calling get_sftp_packet(0) to get the response data for the first (with index 0) NET_SFTP_READ, response packets correspond to the previous timeout NET_SFTP_OPENs received with packet_id 1(log line 184-209). The get_sftp_packet method will keep setting buffer for packet_id 1. When the NET_SFTP_DATA with packet_id 0 finally recevied (log line 210), the 32k data block will be written into local file.

And then calling get_sftp_packet(1) will get the buffered packet NET_SFTP_STATUS (log line 211), which should have corresponded to the last NET_SFTP_OPEN during the network outage. But the get method have no idea about this and it will start clearing the responses by reading response without handling. In these processes, when calling get_sftp_packet(2), a NET_SFTP_DATA with packet_id 1 corresponded to the recent NET_SFTP_READ will be set into the buffer for packet_id 1 (log line 212).

242| 2023-11-10T11:34:58.313Z <- [31]NET_SFTP_STATUS(EOF) (0.0001s)
243| 2023-11-10T11:34:58.313Z -> [01]NET_SFTP_CLOSE(1) (0.0001s)
244| 2023-11-10T11:34:58.321Z <- [01]NET_SFTP_STATUS(OK) (0.086s)
245| 2023-11-10T11:34:58.374Z Failed to download. Error: Downloaded file[08003970/08003970.CRS20231107.UPT.zip] size[32768] doesn't equal to source file size[33336].
246| 2023-11-10T11:34:58.417Z [08003970/08003970.CRS20231108.ACC.zip] is selected for downloading.
247| 2023-11-10T11:34:59.321Z -> [01]NET_SFTP_OPEN(/Outbound/08003970/08003970.CRS20231108.ACC.zip) (0.0001s)
248| 2023-11-10T11:34:59.321Z <- [01]NET_SFTP_HANDLE(11) (0.095s)
249| 2023-11-10T11:34:59.321Z -> [00]NET_SFTP_READ(11) (0.0001s)
250| 2023-11-10T11:34:59.321Z -> [01]NET_SFTP_READ(11) (0s)

The rest of the response packets will be read and dropped (line 213-242), and the file could be closed successfullly (line 243-244). However, since the NET_SFTP_HANDLE is misaligned with the NET_SFTP_OPEN, the downloaded file content is not as expected. Also the downloaded file will only contain the first 32k data.

At this time, all packet buffers except for packet_id 1 are empty. The leftover packet_id 1 buffer would impact the following file downloads. Everytime calling get_sftp_packet(1) will return the previously buffered packet and everytime calling get_sftp_packet(2) will fill the packet_id 1 buffer with new data. That makes the downloaded file contains wrong content bewteen 32k-64k.

@SyuTingSong
Copy link
Author

I have considered several approaches to resolve the issue, but none of them are perfect, so I comment on this to discuss it.

Initially, I focused on the unconsumed packets left over from a timeout. One possible solution is to send a cleanup flag packet with a random ID, which could be of a harmless type such as NET_SFTP_STAT. This random ID could be stored in the SFTP object, and then the get_sftp_packet method should include a check for the packet ID. Any packets received before encountering the cleanup flag ID would be discarded. This way, if a timeout occurs but the network connection remains open, the correspondence between packets can be quickly restored.

However, SFTP is a stateful protocol, and simply discarding the packets corresponding to timed-out OPEN requests would result in resource leaks on the server side. So this may not be a good approach.

Later, I considered actively closing the channel when a timeout occurs. However, the RFC does not specify that the server must clean up any open files when a channel close occurs. When I tried sending SSH_MSG_CHANNEL_CLOSE to my test SFTP server, the server abruptly closed the entire TCP connection.

The last solution I can think of is to call $this->reconnect() whenever a timeout occurs, which clears the erroneous state by closing the SSH connection. However, the reconnect is designed to be a private method in the SSH2 class, there must be some considerations I missed.

@terrafrost
Copy link
Member

First, nice sleuthing!

As shown in line 147 of the log, the network resume and a NET_SFTP_STATUS packet response. It should have corresponded to the NET_SFTP_CLOSE packet (log line 68) that occurred before the network stalled. But the get method have no idea about this and return false so the league/flysystem-sftp-v3 throws an UnableToReadFileException that recorded in line 148 of the log. Then the loop catch it again and try to download the next file.

I think the best fix would be for phpseclib to support multiple channels. So like each SFTP transfer would be on its own dedicated channel. If the server times out, like you described, then it doesn't matter because the next download or upload would be on a whole new channel. This would also enable multiple concurrent file downloads as well.

Unfortunately, this would be a pretty significant rewrite of SSH2.php and SFTP.php and would only be included in a major release and it's very doubtful the next major release will happen even next year.

The last solution I can think of is to call $this->reconnect() whenever a timeout occurs, which clears the erroneous state by closing the SSH connection. However, the reconnect is designed to be a private method in the SSH2 class, there must be some considerations I missed.

I could always make it public. Altho I feel like just creating a new SFTP object would be sufficient as well?

@SyuTingSong
Copy link
Author

It sounds that the next major release is unlikely to occur soon.

Since I am indirectly using the phpseclib, I am not able to access the SFTP instance held by \League\Flysystem\PhpseclibV3\SftpAdapter. It is difficult for me to invoke the reconnect method eventhough it's changed to public.

I submitted a PR to thephpleague/flysystem to catch the UnexpectedValueException and then call the disconnect method to close the timeout connection. This may be a workaround for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants