Incorrect header encoding conversion #2011

Open
821938089 opened this issue Oct 15, 2023 · 11 comments

@821938089

821938089 commented Oct 15, 2023

        private static String fixHeaderEncoding(String val) {
            byte[] bytes = val.getBytes(ISO_8859_1);
            if (!looksLikeUtf8(bytes))
                return val;
            return new String(bytes, UTF_8);
        }

This encoding conversion is wrong: you cannot restore the original bytes from a string without knowing which encoding was used to decode it.
When the guess is wrong, the conversion loses characters.
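
To illustrate the lossy case, here is a minimal sketch that uses only the JDK. The class `HeaderRoundTripSketch` and the method `roundTrip` are hypothetical names, not part of jsoup; the method performs the same re-encode and decode steps as the snippet above, without the `looksLikeUtf8` check. It restores a value whose UTF-8 bytes were mis-decoded as ISO-8859-1, but it destroys a value that was already decoded correctly, because `String.getBytes` substitutes `?` for every character ISO-8859-1 cannot represent.

```java
import java.nio.charset.StandardCharsets;

class HeaderRoundTripSketch {
    // Hypothetical helper: re-encode as ISO-8859-1, then decode the bytes as UTF-8.
    static String roundTrip(String val) {
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The search term from this issue, written with Unicode escapes so the
        // source file compiles regardless of its own encoding.
        String original = "\u6211\u7684";

        // Case 1: UTF-8 bytes that were wrongly decoded as ISO-8859-1.
        // Every byte maps to one char, so the round trip recovers the text.
        String misDecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip(misDecoded).equals(original)); // true

        // Case 2: the value was already decoded correctly. Neither character
        // exists in ISO-8859-1, so getBytes emits '?' for each of them.
        System.out.println(roundTrip(original)); // ??
    }
}
```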

image

Related reference: https://stackoverflow.com/a/39308860

By the way: when will the next version be released?

@jhy
Owner

jhy commented Oct 18, 2023

Can you give me the code for this, rather than screenshots, so that I can review it?

@821938089
Author

Is this it? search.php?search=我的

@jhy self-assigned this Oct 20, 2023
@jhy closed this as completed in 9de27fa Oct 20, 2023
@jhy added the "bug" (Confirmed bug that we should fix) and "fixed" labels Oct 20, 2023
@jhy added this to the 1.16.2 milestone Oct 20, 2023
@jhy
Owner

jhy commented Oct 20, 2023

OK, I've moved the re-encoding fix-up so that it applies only to response headers. It's in place to fix #706, where the header was decoded as ISO-8859-1 but actually held UTF-8 bytes. Browsers seem to do this fix-up too, so the solution seems necessary. We do need better tests for this: I wasn't able to get Jetty to emit the header incorrectly, so I can't directly add a test case.

For request headers, the value set by the user is now retained directly. When making the request, Java will encode the header as UTF-8. Servers will probably expect ISO-8859-1, so this may or may not work. Per spec, the content should either be limited to ISO-8859-1 characters or be encoded with RFC 2047. We don't attempt to do that automatically (and some servers will be OK). It's a bit of a grey area. Happy to hear other suggestions.
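
For callers who want to apply the RFC 2047 option themselves, the following is a rough sketch, not something jsoup does. The class `RequestHeaderSketch` and the method `encodeHeaderValue` are hypothetical names. Values that fit in ISO-8859-1 pass through unchanged; anything else becomes a single Base64 encoded-word. It assumes the server decodes encoded-words in the header in question, which many do not, and it ignores the 75-character limit that RFC 2047 places on each encoded-word.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class RequestHeaderSketch {
    // Hypothetical helper: leave ISO-8859-1 content alone, otherwise wrap the
    // UTF-8 bytes in one RFC 2047 "B" (Base64) encoded-word.
    static String encodeHeaderValue(String val) {
        if (StandardCharsets.ISO_8859_1.newEncoder().canEncode(val)) {
            return val;
        }
        String b64 = Base64.getEncoder()
                .encodeToString(val.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        System.out.println(encodeHeaderValue("plain-value"));  // unchanged
        System.out.println(encodeHeaderValue("\u6211\u7684")); // one encoded-word
    }
}
```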

@821938089
Author

image
This header comes from the server's response. Is there a way to fix it?

@jhy
Owner

jhy commented Oct 20, 2023

Can you give me a sample URL or code so that I can actually review the server's response properly?

@821938089
Author

https://www.zhenshezw.com/
image
Enter "我的" and click on the search icon.

@821938089
Author

Hi, if you can't reproduce the issue, could you add a configuration option to skip fixHeaderEncoding?

@jhy
Owner

jhy commented Nov 10, 2023

I get caught in bot detections when I try this. Can you provide sample code so that I can try and repro?

I won't add a configuration option unless I can validate it. You could always fork the code yourself, of course.

@jhy reopened this Nov 10, 2023
@jhy removed the "bug" (Confirmed bug that we should fix) and "fixed" labels Nov 10, 2023
@821938089
Author

package org.example;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Main {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    // Copy these header values from the browser's devtools; they are left blank here.
    headers.put("User-Agent", "");
    headers.put("Cookie", "");

    try {
      Connection.Response response = Jsoup.connect("https://www.zhenshezw.com/gut.php")
              .followRedirects(false)
              .requestBody("search=%E6%88%91%E7%9A%84")
              .headers(headers)
              .method(Connection.Method.POST)
              .execute();

      System.out.println(response.header("Location"));
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

@821938089
Author

I found that this issue is platform-related: it works fine on Java but fails on Android.
After some research, I found that Java and Android use different HttpURLConnection implementations, and they handle headers differently.
Because Android's HttpURLConnection implementation already decodes the headers correctly, decoding them a second time in jsoup corrupts the value.

Java:
image

Android:
image
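
One possible way to detect this situation is sketched below, under stated assumptions; it is not jsoup's code and not necessarily how the library should fix it. The class `HeaderFixSketch` and the method `fixHeaderEncodingGuarded` are hypothetical names. The idea is that a string produced by ISO-8859-1 decoding can only contain chars up to U+00FF, so any char above that means the platform has already decoded the header and the value should be left alone. The sketch also uses a strict JDK `CharsetDecoder` as a stand-in for `looksLikeUtf8`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class HeaderFixSketch {
    // Hypothetical guarded variant of the fix-up.
    static String fixHeaderEncodingGuarded(String val) {
        boolean sawNonAscii = false;
        for (int i = 0; i < val.length(); i++) {
            char c = val.charAt(i);
            if (c > 0xFF) return val;        // cannot come from ISO-8859-1 decoding
            if (c > 0x7F) sawNonAscii = true;
        }
        if (!sawNonAscii) return val;        // pure ASCII: nothing to fix

        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1); // lossless here
        try {
            // Strict decode: fail instead of substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return val;                      // not valid UTF-8: keep the value as-is
        }
    }

    public static void main(String[] args) {
        String original = "\u6211\u7684";
        String misDecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        // Already-decoded value (the Android case described above) is untouched.
        System.out.println(fixHeaderEncodingGuarded(original).equals(original));
        // Mis-decoded value (UTF-8 bytes read as ISO-8859-1) is repaired.
        System.out.println(fixHeaderEncodingGuarded(misDecoded).equals(original));
    }
}
```

A genuine ISO-8859-1 value whose bytes happen to form valid UTF-8 remains ambiguous under this check, so the guard narrows the problem rather than eliminating it.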

@jhy
Owner

jhy commented Nov 16, 2023

Thanks, that's good sleuthing! Need to think of a good way to detect and handle this situation...
