Jekyll fails if markdown contains UTF-8 Byte Order Marker #2853

Closed
ericlaw1979 opened this issue Aug 29, 2014 · 45 comments · Fixed by #4404

Comments

@ericlaw1979

This is difficult to troubleshoot and causes lots of problems. UTF-8 BOMs should not cause Jekyll to fail.

http://andrewbolster.info/2014/01/unicode-madness-in-jekyll/
http://stackoverflow.com/questions/3140111/jekyll-does-not-parse-utf-8

@ericlaw1979
Author

Is fixing this as simple as changing the "r" parameters in File.open calls to "r:bom|utf-8" and updating file_read_opts to have this option too?

@parkr
Member

parkr commented Nov 29, 2014

Have you hit this again? How common is this issue? Does the above option explode if your page doesn't include the BOM?
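A minimal sketch of how that question can be checked, assuming nothing about Jekyll itself: the helper `round_trip` and the temp-file names are made up for illustration, and only the `"r:bom|utf-8"` mode string comes from the thread. It writes one file with a BOM and one without, then reads both back.

```ruby
require "tempfile"

BOM = "\uFEFF".freeze # U+FEFF, stored in UTF-8 as the bytes EF BB BF

# Hypothetical helper: write `text` to a temp file, read it back using `mode`.
def round_trip(text, mode)
  Tempfile.create("bom-demo") do |f|
    f.close
    File.binwrite(f.path, text)
    File.open(f.path, mode, &:read)
  end
end

with_bom    = round_trip("#{BOM}---\ntitle: Hi\n---\n", "r:bom|utf-8")
without_bom = round_trip("---\ntitle: Hi\n---\n", "r:bom|utf-8")
plain_read  = round_trip("#{BOM}---\ntitle: Hi\n---\n", "r:utf-8")

puts with_bom.start_with?("---")    # true: the BOM was consumed
puts without_bom.start_with?("---") # true: no BOM present, nothing breaks
puts plain_read.start_with?("---")  # false: the BOM is still the first character
```

In this sketch the `bom|` prefix only skips a signature when one is present, so files without a BOM read exactly as before.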

@redknitin

The documentation states at http://jekyllrb.com/docs/frontmatter/ that users should avoid the UTF8 BOM (I'm guessing that's why the documentation tag was taken off from the issue 2 days ago).

I recall Notepad++ (or was it EditPlus?) on Windows has an encoding option with the ability to disable the BOM when saving the file. That could help work around the problem till a fix is in place.

@arebee

arebee commented Dec 1, 2014

You can use a variety of tools to work around it... but a fix is what's required. It's not as if the BOM is out of the ordinary.


@mattr-
Member

mattr- commented Dec 1, 2014

Windows is the only platform we're aware of where the BOM is even used on a
consistent basis, so to us, it is out of the ordinary. That's why we
suggest the workaround of having the editors disable the BOM when saving
files.

@ericlaw1979
Author

Let's step back a bit-- this is a huge annoyance for at least some users, and it seems like a trivial thing to fix. http://stackoverflow.com/questions/543225/how-to-avoid-tripping-over-utf-8-bom-when-reading-files implies that the r:bom option works fine even if the three byte signature isn't present.

If I were to go learn Ruby and submit the patch, would my pull request get accepted?

@mattr-
Member

mattr- commented Dec 1, 2014

I'd say your chances are good that it would be accepted. 😃


@redknitin

@ericlaw1979 , @mattr

I tried to go down that path: learning Ruby and figuring out how to patch the issue. I got a copy of the Jekyll source code and tried to track down where it reads from the files. I spotted a File.readlines().join() along the way, but I'll probably have to keep shovelling to get to the right point in the code.

Meanwhile, I also tried to reproduce the issue with a simple Ruby script that reads a file with File.open followed by File_object.read and displays the output. That didn't seem to have an issue with a UTF-8 file with a BOM (I used Notepad++ to convert a text file to UTF-8 with BOM). I used Ruby 2.1.5 x64 on Windows 8.1, so I'm not sure whether newer versions of Ruby handle the BOM themselves, whether it's because I'm calling open and read while Jekyll calls a different set of methods, or whether Notepad++ left the file unchanged for some reason. Also, I have Jekyll running on a Linux box and haven't yet installed it on Windows, so I haven't experienced the problem first hand.
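One possible explanation for why a read-and-print script looks fine is that the BOM survives the read but is invisible when printed. The sketch below illustrates that under stated assumptions: the string stands in for a file's contents, and the regex and helper names (`FRONT_MATTER`, `front_matter?`, `strip_leading_bom`) are invented for this example rather than taken from Jekyll.

```ruby
# Stand-in for the contents of a UTF-8 file saved with a BOM.
raw = "\uFEFF---\ntitle: Hello\n---\nBody\n"

# On a terminal the BOM is a zero-width character, so this output looks normal.
puts raw

# Illustrative front matter check anchored at the start of the string.
FRONT_MATTER = /\A---\s*\n/

def front_matter?(text)
  !!(text =~ FRONT_MATTER)
end

# Remove a single leading U+FEFF if there is one.
def strip_leading_bom(text)
  text.sub(/\A\uFEFF/, "")
end

puts front_matter?(raw)                    # false: the BOM sits before the dashes
puts front_matter?(strip_leading_bom(raw)) # true
```

So a script that only prints the text would not reveal the problem; a check anchored at the start of the content would.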

@jheidemann

The following patch tries to fix the problem, but it leaves some files with mojibake. Perhaps someone else can get it closer.

--- document.rb-    2015-07-10 11:52:17.194428864 -0700
+++ document.rb 2015-07-10 14:20:34.117080950 -0700
@@ -210,6 +210,10 @@
             @data = defaults
           end
           @content = File.read(path, merged_file_read_opts(opts))
+          # Ignore stray byte-order marks.  (xxx: maybe better fix is to open :utf8?).
+     if content =~ /\A\xEF\xBB\xBF/m
+       @content = content.sub( '\xEF\xBB\xBF', '' )
+     end
           if content =~ /\A(---\s*\n.*?\n?)^(---\s*$\n?)/m
             @content = $POSTMATCH
             data_file = SafeYAML.load($1)
--- convertible.rb- 2015-07-10 11:52:39.131473331 -0700
+++ convertible.rb  2015-07-10 14:20:40.670092976 -0700
@@ -45,6 +45,10 @@
       begin
         self.content = File.read(site.in_source_dir(base, name),
                                  merged_file_read_opts(opts))
+        if content =~ /\A\xEF\xBB\xBF/m
+          self.content = content.sub( '\xEF\xBB\xBF', '' )
+        end
+        # Ignore stray byte-order marks.  (xxx: maybe better fix is to open :utf8?).
         if content =~ /\A(---\s*\n.*?\n?)^((---|\.\.\.)\s*$\n?)/m
           self.content = $POSTMATCH
           self.data = SafeYAML.load($1)
--- utils.rb-   2015-07-10 14:16:43.509709222 -0700
+++ utils.rb    2015-07-10 14:23:29.550402893 -0700
@@ -99,7 +99,8 @@
     #
     # Returns true if the YAML front matter is present.
     def has_yaml_header?(file)
-      !!(File.open(file, 'rb') { |f| f.read(5) } =~ /\A---\r?\n/)
+      !!(File.open(file, 'r:bom|UTF-8') { |f| f.read(8) } =~ /\A\uFEFF?---\r?\n/u)
+#      !!(File.open(file, 'rb') { |f| f.read(8) } =~ /\A(\xEF\xBB\xBF)?---\r?\n/u)
     end

     # Slugify a filename or title.
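
One detail in the patch above that may matter: the `sub` calls pass `'\xEF\xBB\xBF'` in single quotes, and Ruby does not process `\x` escapes in single-quoted strings. The thread does not establish whether this is the whole cause of the reported mojibake, so treat the following as a sketch of that one point only. The constant and method names are invented for illustration.

```ruby
# Single quotes: twelve literal characters (backslash, x, E, F, ...).
SINGLE_QUOTED = '\xEF\xBB\xBF'.freeze

# Double quotes: three bytes, the UTF-8 encoding of U+FEFF.
DOUBLE_QUOTED = "\xEF\xBB\xBF".freeze

# Remove the first occurrence of `pattern` from `text`.
def strip_bom(text, pattern)
  text.sub(pattern, "")
end

text = "\uFEFF---\ntitle: Hello\n---\n"

puts SINGLE_QUOTED.length                               # 12
puts DOUBLE_QUOTED.bytesize                             # 3
puts strip_bom(text, SINGLE_QUOTED).start_with?("---")  # false: nothing was removed
puts strip_bom(text, DOUBLE_QUOTED).start_with?("---")  # true
```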

@envygeeks
Contributor

Or you could just r:bom|utf-8 every read...

@RX14

RX14 commented Dec 21, 2015

Just spent an hour debugging this, why is this not fixed already?

marcusteaton pushed a commit to marcusteaton/marcusteaton.github.io that referenced this issue Dec 21, 2015
@johnlinvc

I encountered this issue too. Please merge the fix if possible.

@mkpankov

#4404 was merged, but how does it help with this issue? How does one use that option (Utils.merged_file_read_opts)?

@arebee

arebee commented Sep 20, 2016

@parkr Did this option ever make it into the documentation? https://jekyllrb.com/docs/configuration/ doesn't mention it. Could it be added as a default, given that it is transparent?

@parkr
Member

parkr commented Sep 20, 2016

@arebee what option?

@arebee

arebee commented Sep 21, 2016

In #4404 the code appears to pull a value from site config - am I reading it correctly? lib/jekyll/utils.rb line 279 in commit aad54c9
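
As a rough illustration of what "pulling a value from site config" could look like, here is a sketch. The helper names `read_opts` and `read_source` and the plain config hash are hypothetical and are not taken from Jekyll's source; only `File.read` and the `"bom|utf-8"` encoding value are real.

```ruby
require "tempfile"

# Hypothetical helper (not Jekyll's code): build File.read options from a
# config hash, letting per-call overrides win.
def read_opts(config, overrides = {})
  opts = {}
  opts[:encoding] = config["encoding"] if config["encoding"]
  opts.merge(overrides)
end

# Read a source file using whatever encoding the config asks for.
def read_source(path, config)
  File.read(path, **read_opts(config))
end

Tempfile.create("config-demo") do |f|
  f.close
  File.binwrite(f.path, "\xEF\xBB\xBF---\ntitle: Hi\n---\n".b)
  text = read_source(f.path, { "encoding" => "bom|utf-8" })
  puts text.start_with?("---") # true: the configured encoding consumed the BOM
end
```

A setting like this only helps on code paths that actually pass the merged options to the read call.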

@steveklabnik

I've set

encoding: "bom|utf-8"

In my _config.yml, yet it still seems that jekyll is ignoring files with a BOM.

@parkr
Member

parkr commented Sep 28, 2016

@steveklabnik That might be because of this:

!!(File.open(file, "rb", &:readline) =~ %r!\A---\s*\r?\n!)
:(
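
A small sketch of why that line would defeat the `encoding` setting: it opens the file in `"rb"` mode, so the first line still begins with the three BOM bytes and the `\A---` anchor cannot match. The regex is the one quoted above; the method names `yaml_header?` and `with_sample_file`, and the idea of passing the mode as a parameter, are assumptions made for this illustration.

```ruby
require "tempfile"

HEADER = %r!\A---\s*\r?\n!

# Read the first line of `path` using `mode` and test it for a front matter opener.
def yaml_header?(path, mode)
  !!(File.open(path, mode, &:readline) =~ HEADER)
end

# Hypothetical helper: write raw bytes to a temp file and yield its path.
def with_sample_file(bytes)
  Tempfile.create("header-demo") do |f|
    f.close
    File.binwrite(f.path, bytes)
    yield f.path
  end
end

with_sample_file("\xEF\xBB\xBF---\r\ntitle: Hi\r\n---\r\n".b) do |path|
  puts yaml_header?(path, "rb")          # false: the BOM bytes precede the dashes
  puts yaml_header?(path, "r:bom|utf-8") # true: the BOM is skipped on open
end
```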

@pathawks
Member

Or differently, does this issue need a sample file attached?

That would certainly be a good first step, yes 👍

@steveklabnik

@arebee

arebee commented Apr 5, 2017

Contains Markdown files typical on Windows, with CRLF line endings: one saved as UTF-8 + BOM (Notepad2) and one saved as Unicode (UTF-16 Little Endian + BOM, from the Unicode option in Notepad).

CRLFandBOM.zip

@envygeeks
Contributor

Please stop posting zips to tickets; it's a security problem. Provide a repository instead.

@arebee

arebee commented Apr 6, 2017

@jekyllbot jekyllbot added the stale Nobody stepped up to work on this issue. label Jun 6, 2017
@DirtyF DirtyF removed the stale Nobody stepped up to work on this issue. label Jun 6, 2017
@arebee

arebee commented Jun 6, 2017

@pathawks You added a "not-reproduced" label to this issue. Were you not able to reproduce using the linked repo?

@DirtyF DirtyF modified the milestones: 3.5, 3.6 Jun 18, 2017
@parkr parkr mentioned this issue Aug 17, 2017
@jekyllbot
Contributor

This issue has been automatically marked as stale because it has not been commented on for at least two months.

The resources of the Jekyll team are limited, and so we are asking for your help.

If this is a bug and you can still reproduce this error on the 3.3-stable or master branch, please reply with all of the information you have about it in order to keep the issue open.

If this is a feature request, please consider building it first as a plugin. Jekyll 3 introduced hooks which provide convenient access points throughout the Jekyll build pipeline whereby most needs can be fulfilled. If this is something that cannot be built as a plugin, then please provide more information about why in order to keep this issue open.

This issue will automatically be closed in two months if no further activity occurs. Thank you for all your contributions.

@jekyllbot jekyllbot added the stale Nobody stepped up to work on this issue. label Aug 18, 2017
@steveklabnik

#2853 (comment) still applies

@jekyllbot jekyllbot removed the stale Nobody stepped up to work on this issue. label Aug 18, 2017
@DirtyF
Member

DirtyF commented Aug 18, 2017

Reproduced with the latest Jekyll and Ruby under macOS, using @arebee's example files:

Configuration file: /Users/frank/code/jekyll/tests/blog/_config.yml
            Source: /Users/frank/code/jekyll/tests/blog
       Destination: /Users/frank/code/jekyll/tests/blog/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
             Error: could not read file /Users/frank/code/jekyll/tests/blog/_posts/2017-07-17-Unicode16LECRLFandBOM.md: invalid byte sequence in UTF-8
  Liquid Exception: invalid byte sequence in UTF-8 in /Users/frank/code/jekyll/tests/blog/_posts/2017-07-17-Unicode16LECRLFandBOM.md
bundler: failed to load command: jekyll (/Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/bin/jekyll)
ArgumentError: invalid byte sequence in UTF-8
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/tokenizer.rb:23:in `split'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/tokenizer.rb:23:in `tokenize'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/tokenizer.rb:8:in `initialize'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/template.rb:226:in `new'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/template.rb:226:in `tokenize'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/template.rb:132:in `parse'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/liquid-4.0.0/lib/liquid/template.rb:116:in `parse'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/liquid_renderer/file.rb:11:in `block in parse'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/liquid_renderer/file.rb:47:in `measure_time'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/liquid_renderer/file.rb:10:in `parse'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/renderer.rb:118:in `render_liquid'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/renderer.rb:76:in `render_document'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/renderer.rb:62:in `run'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:456:in `block (2 levels) in render_docs'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:454:in `each'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:454:in `block in render_docs'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:453:in `each'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:453:in `render_docs'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:194:in `render'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/site.rb:73:in `process'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/command.rb:26:in `process_site'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/commands/build.rb:63:in `build'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/commands/build.rb:34:in `process'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/lib/jekyll/commands/build.rb:16:in `block (2 levels) in init_with_program'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/mercenary-0.3.6/lib/mercenary/command.rb:220:in `block in execute'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/mercenary-0.3.6/lib/mercenary/command.rb:220:in `each'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/mercenary-0.3.6/lib/mercenary/command.rb:220:in `execute'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/mercenary-0.3.6/lib/mercenary/program.rb:42:in `go'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/mercenary-0.3.6/lib/mercenary.rb:19:in `program'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/jekyll-3.5.2/exe/jekyll:13:in `<top (required)>'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/bin/jekyll:23:in `load'
  /Users/frank/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/bin/jekyll:23:in `<top (required)>'
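
The failing file in this trace is the UTF-16 LE one, which is a different case from a UTF-8 BOM. A hedged sketch of the mechanism, independent of Jekyll and Liquid: UTF-16 bytes labelled as UTF-8 are not valid UTF-8, and regex-based calls such as `String#split` raise on such strings. The helper names `utf8_safe?` and `decode_utf16le` are invented for this example.

```ruby
# Does this byte string form valid UTF-8?
def utf8_safe?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Interpret the bytes as UTF-16 LE, convert to UTF-8, and drop a leading BOM.
def decode_utf16le(bytes)
  bytes.dup.force_encoding("UTF-16LE").encode("UTF-8").sub(/\A\uFEFF/, "")
end

# Bytes of a tiny file saved as UTF-16 LE with a BOM (FF FE ...).
utf16_bytes = "\uFEFF---\ntitle: Hi\n---\n".encode("UTF-16LE").b

puts utf8_safe?(utf16_bytes) # false

begin
  # Mislabel the bytes as UTF-8, then run a regex-based operation on them.
  utf16_bytes.dup.force_encoding("UTF-8").split(/x/)
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end

puts decode_utf16le(utf16_bytes).start_with?("---") # true
```

Skipping a UTF-8 BOM on open would not help here; a file like this needs to be transcoded or re-saved as UTF-8.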

@parkr
Member

parkr commented Aug 18, 2017

@DirtyF Oh great! Would you mind adding that file in as a test fixture so we can set up a test for it?

@DirtyF
Member

DirtyF commented Aug 18, 2017

@parkr Done. There are two files. I left them as provided, because if I open them in Atom, I can no longer reproduce the bug. One of them is detected as binary.

@pathawks
Member

pathawks commented Oct 18, 2017

Fixed via #6433 🎉🌮

@jekyll jekyll locked and limited conversation to collaborators Jun 3, 2019