Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encoding charset sniffing #736

Merged
merged 3 commits into from Oct 19, 2021
Merged

Conversation

boarwell
Copy link
Contributor

@boarwell boarwell commented Oct 2, 2021

Please ensure that your pull request fulfills these requirements:

  • The pull request is being made against the master branch
  • Tests for the changes have been added (for bug fixes / features)

What is the purpose of this pull request? (bug fix, enhancement, new feature,...)

Resolves #426

What changes did you make?

Added charset sniffing for text files.

Is there anything you'd like reviewers to focus on?

To sniff charset, text files need to be read in sync (fs.readFileSync), rather than in stream (fs.createReadStream). This change would cause more memory usage and thus, if a file exceeds buffer max length (about 4GB on 64-bit architectures1), Node.js will throw an error.

However, I think there are not so many cases that a plain text file is over 4GB and this is acceptable for charset sniffing. What do you think about this?

UPDATE:
I added file size checking on 4392921, so the above problem is fixed.

Please provide testing instructions, if applicable:

npm run test

Footnotes

  1. https://nodejs.org/api/buffer.html#buffer_buffer_constants_max_length

@boarwell boarwell marked this pull request as ready for review October 2, 2021 05:28
@boarwell
Copy link
Contributor Author

boarwell commented Oct 5, 2021

The CI test is failing in Node.js 10.x environments, but since 2021-04-301 version 10.x has been EOL. So, I created another PR (#739) to remove it from the CI environments.

Once that PR is merged, I will rebase this branch onto master and update this PR.

Footnotes

  1. https://github.com/nodejs/Release#end-of-life-releases

@thornjad
Copy link
Member

thornjad commented Oct 5, 2021

What about these changes fail in Node 10? I'm not opposed to dropping support, but I don't want to without good reason

@thornjad thornjad mentioned this pull request Oct 5, 2021
2 tasks
@thornjad thornjad added the major version Major, potentially breaking, change label Oct 5, 2021
@boarwell
Copy link
Contributor Author

boarwell commented Oct 6, 2021

@thornjad

What about these changes fail in Node 10? I'm not opposed to dropping support, but I don't want to without good reason

I see. That makes sense. I looked into it and found that the problem seemed to be where the buffer read into memory for sniffing was being reused to create the response stream. Commenting out the line made everything fine.

https://github.com/boarwell/http-server/blob/4392921610551087231db925d3ad65c8907e3e83/lib/core/index.js#L240

Some errors said that "Cannot set headers after they are sent to the client"1, so it seems there may be differences in the event loop (or something like that) between v10.x and newer versions, but I'm not sure.

So, I think we have two options:

  1. (Minor Update): Stop reusing loaded buffer. It will read files two times, the first is for sniffing, the second is for creating a response.
  2. (Major Update): Drop v10.x, and leave the code as-is.

I think Option 1 may be better at this time because it doesn't matter to read text files twice in most cases and it can ship this as a minor update. But also, dropping support for unmaintained versions itself can be a good reason for a major update. What do you think?

Footnotes

  1. https://github.com/http-party/http-server/pull/736/checks?check_run_id=3777027455

@boarwell
Copy link
Contributor Author

boarwell commented Oct 7, 2021

Rebased onto master

@thornjad
Copy link
Member

So, I think we have two options:

(Minor Update): Stop reusing loaded buffer. It will read files two times, the first is for sniffing, the second is for creating a response.
(Major Update): Drop v10.x, and leave the code as-is.

I think Option 1 may be better at this time because it doesn't matter to read text files twice in most cases and it can ship this as a minor update. But also, dropping support for unmaintained versions itself can be a good reason for a major update. What do you think?

Yeah I think I'm actually leaning toward dropping support if that's the case, I'd rather drop 10 than introduce reading files twice. You're right that it shouldn't matter too much, but less I/O is always better.

Some errors said that "Cannot set headers after they are sent to the client", so it seems there may be differences in the event loop (or something like that) between v10.x and newer versions, but I'm not sure.

That error message can (but not always) mean the server timed out, and so once the request finally completes, it tries to send a response which has already been sent. I've seen the same in Apache, it may mean we have some tests which are too slow.

@thornjad
Copy link
Member

Node 10 dropped in master! Another merge/rebase should fix the tests!

@boarwell
Copy link
Contributor Author

That error message can (but not always) mean the server timed out, and so once the request finally completes, it tries to send a response which has already been sent. I've seen the same in Apache, it may mean we have some tests which are too slow.

Thanks! This helped me understand what had happened. Though I rebased this branch onto master, there are still some errors on 12.x, macOS-latest1. It says:

  394 passing (31s)
  4 failing

  1) test/cli.test.js setting mimeTypes via cli - directly test unfinished:
     Error: test unfinished
      at Object.<anonymous> (test/cli.test.js:84:1)

  2) test/cli.test.js setting mimeTypes via cli - directly test count !== plan:

      test count !== plan
      + expected - actual

      -1
      +4
      
  

  3) test/cli.test.js child test left in queue: t.test --proxy requires you to specify a protocol:
     child test left in queue: t.test --proxy requires you to specify a protocol
  

  4) test/cli.test.js test count !== plan:

      test count !== plan
      + expected - actual

      -4
      +3

According to the messages (eg. Error: test unfinished, child test left in queue and test count !== plan), it seems that some tests are failing due to timeouts as you mentioned above. I think I could fix this by reducing the test repetition times (like t.plan(4)). Do you have any idea?

Footnotes

  1. https://github.com/http-party/http-server/pull/736/checks?check_run_id=3891594723

@thornjad
Copy link
Member

thornjad commented Oct 18, 2021

Yeah those are some non-deterministic errors we're getting at random, usually re-running the tests fixes them. I've been trying to weed those out, including adding more plans. I've managed to make them better recently, but they still pop up. I reran tests for this PR and they're passing now.

But there's a conflict now to clear up, probably just need to recreate the package-lock again. Once we get this merged I'll probably make the next major release!

@boarwell
Copy link
Contributor Author

Fixed the conflict and passed all tests! I'm looking forward to the next release!!

Copy link
Member

@thornjad thornjad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it! This is a great improvement, thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement major version Major, potentially breaking, change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default Encoding Charset
2 participants