Search plugin crashes for h[1-6] contained in other elements #4824

Closed
4 tasks done
jeffreytolar opened this issue Jan 4, 2023 · 8 comments
Labels
bug (Issue reports a bug), resolved (Issue is resolved, yet unreleased if open)

Comments

@jeffreytolar
Contributor

Context

Our CI pulled in the newly released 9.0.1 this week; the build for one of our sites started failing afterwards.

Bug description

We have some old (non-ideal) markdown that crashes the search plugin during indexing:

$ ../bin/mkdocs build
INFO     -  Cleaning site directory
INFO     -  Building documentation to directory: /private/tmp/mkdocs/test/site
ERROR    -  Error building page 'index.md': list index out of range
Traceback (most recent call last):
  File "/tmp/mkdocs/test/../bin/mkdocs", line 8, in <module>
    sys.exit(cli())
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/mkdocs/__main__.py", line 250, in build_command
    build.build(cfg, dirty=not clean)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/mkdocs/commands/build.py", line 329, in build
    _build_page(file.page, config, doc_files, nav, env, dirty)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/mkdocs/commands/build.py", line 226, in _build_page
    context = config.plugins.run_event(
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/mkdocs/plugins.py", line 520, in run_event
    result = method(item, **kwargs)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/material/plugins/search/plugin.py", line 90, in on_page_context
    self.search_index.add_entry_from_context(page)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/material/plugins/search/plugin.py", line 143, in add_entry_from_context
    parser.feed(page.content)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
    self.handle_endtag(elem)
  File "/private/tmp/mkdocs/lib/python3.9/site-packages/material/plugins/search/plugin.py", line 417, in handle_endtag
    elif data[-1].isspace() and data[-2] == f"<{tag}>":
IndexError: list index out of range

The markdown that results in this is a section header in a list:

- # Section header in a list

With the search plugin disabled, that snippet results in this HTML:

<ul>
<li>
<h1 id="section-header-in-a-list">Section header in a list</h1>
</li>
</ul>

It's entirely possible that this is "invalid" markdown; however, it did "work" (not crash) with previous releases (including 8.5.11).

We have no real need for that hierarchy (I suspect it was originally a typo), so I've updated our site to remove the bullet points. It took some digging (aided by breakpoint() :)) to narrow down which part of the file was triggering the crash.


Changing

elif data[-1].isspace() and data[-2] == f"<{tag}>":
to include a check for len(data) >= 2 results in a successful build, although I'm not sure the generated search index makes sense:

$ jq . site/search/search_index.json
{
  "config": {
    "lang": [
      "en"
    ],
    "separator": "[\\s\\-]+",
    "pipeline": [
      "stopWordFilter"
    ]
  },
  "docs": [
    {
      "location": "",
      "title": "Welcome to MkDocs",
      "text": "<ul> <li>"
    },
    {
      "location": "#section-header-in-a-list",
      "title": "Section header in a list",
      "text": "</li> </ul>"
    }
  ]
}
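The length check described above can be sketched in isolation. This is only an illustration of the guard, not the plugin's code: the helper name `closes_empty_element` and the assumption that `data` is a plain list of strings are hypothetical.

```python
def closes_empty_element(data: list[str], tag: str) -> bool:
    """Return True if `data` ends with an opening `<tag>` followed by whitespace.

    Hypothetical helper mirroring the condition from the traceback,
    with the length guard added in front.
    """
    # Check the length first: data[-2] raises IndexError when the list
    # holds fewer than two items, which is the crash reported here.
    return len(data) >= 2 and data[-1].isspace() and data[-2] == f"<{tag}>"


print(closes_empty_element([" "], "li"))          # too short, no crash
print(closes_empty_element(["<li>", " "], "li"))  # opening tag, then whitespace
```

Because `and` short-circuits, the indexing expressions are never evaluated for lists that are too short. As noted above, this only prevents the crash; it does not make the resulting index entries balanced.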

Reproduction

example.zip

Steps to reproduce

Run mkdocs build with the attached zip (probably need to remove info from the mkdocs.yml)

Browser

No response

@squidfunk
Owner

squidfunk commented Jan 4, 2023

Thanks for reporting. Great report and reproduction! While it is good to know that hoisting the headlines fixes the problem, I think we need to think about how to handle nested headlines, as this is likely to occur in practice. For example, headlines may be contained in admonitions or other block elements, which would lead to the same problem. Your suggested fix would likely break the search, as it leaves unbalanced HTML tags in the search index.

As a possible solution, we could include in a search index section only the tags that appear on the same level as the headline. However, we would also need to take care of content that appears after a nested block with a headline is closed and before the next headline:

<div class="admonition">
  <h2>Headline</h2>
  <p>This would be indexed, but would need to be hoisted</p>
</div>
<p>This also needs to be indexed</p>
<h2>Next headline</h2>

This will likely take a little time to fix.

@squidfunk squidfunk changed the title Crash (IndexError) with new search plugin Search plugin crashes for h1...6 contained in other elements Jan 4, 2023
@squidfunk squidfunk changed the title Search plugin crashes for h1...6 contained in other elements Search plugin crashes for h[1-6] contained in other elements Jan 4, 2023
@squidfunk
Owner

I've generalized the title, so it's easier to find for other users.

@squidfunk squidfunk added the bug Issue reports a bug label Jan 4, 2023
zhanbao2000 added a commit to zhanbao2000/blog that referenced this issue Jan 6, 2023
@defenestration

We also ran into this.

Additionally, numbered list items that start with an octothorpe also caused this problem for us, e.g.: 1. ### blah

We also found some docs with empty list items, which caused a similar error with the search plugin. (Let me know if you want a separate issue opened.)

We ended up just cleaning up the docs but didn't have this issue on 8.x.

    -   NetApp SnapDrive/LUN vs VMDK
    - 

Traceback (most recent call last):
  File "/usr/local/bin/mkdocs", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mkdocs/__main__.py", line 250, in build_command
    build.build(cfg, dirty=not clean)
  File "/usr/local/lib/python3.11/site-packages/mkdocs/commands/build.py", line 329, in build
    _build_page(file.page, config, doc_files, nav, env, dirty)
  File "/usr/local/lib/python3.11/site-packages/mkdocs/commands/build.py", line 226, in _build_page
    context = config.plugins.run_event(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mkdocs/plugins.py", line 520, in run_event
    result = method(item, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/material/plugins/search/plugin.py", line 90, in on_page_context
    self.search_index.add_entry_from_context(page)
  File "/usr/local/lib/python3.11/site-packages/material/plugins/search/plugin.py", line 143, in add_entry_from_context
    parser.feed(page.content)
  File "/usr/local/lib/python3.11/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python3.11/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
        ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/html/parser.py", line 413, in parse_endtag
    self.handle_endtag(elem)
  File "/usr/local/lib/python3.11/site-packages/material/plugins/search/plugin.py", line 417, in handle_endtag
    elif data[-1].isspace() and data[-2] == f"<{tag}>":
                                ~~~~^^^^
IndexError: list index out of range

@squidfunk
Owner

Fixed in 81e7b8c. Interesting problem! This was tricky, but I think I found a generalizable solution. The search plugin now tracks the depth of sections when parsing and attributes content to the nearest unclosed section. A section is considered closed once it's exited. Nested headlines should now work without any problems.

I temporarily removed the filtering of empty paragraphs and tags in order to fix the nesting problem first, and I'm now looking into a more resilient approach for it. Expect a fix later today.

Example:

# Example

- ## List

    !!! info "Admonition title"

        Admonition content (before)

        ### Admonition

        Admonition content (after)

    List content 1

    - ### Nested list

        Nested list content

    List content 2

Some other text

## Foo

Bar

!!! info "Admonition title 2"

    Admonition content 2 (before)

    # Admonition 2

    Admonition content 2 (after)

Result:

{
  // ...
  "docs": [
    {
      "location": "",
      "title": "Example",
      "text": "<ul> <li> </li> </ul> <p>Some other text</p>"
    },
    {
      "location": "#list",
      "title": "List",
      "text": "<p>Admonition title</p> <p>Admonition content (before)</p>  <p>List content 1</p> <ul> <li> </li> </ul> <p>List content 2</p>"
    },
    {
      "location": "#admonition",
      "title": "Admonition",
      "text": "<p>Admonition content (after)</p>"
    },
    {
      "location": "#nested-list",
      "title": "Nested list",
      "text": "<p>Nested list content</p>"
    },
    {
      "location": "#foo",
      "title": "Foo",
      "text": "<p>Bar</p>  <p>Admonition title 2</p> <p>Admonition content 2 (before)</p>"
    },
    {
      "location": "#admonition-2",
      "title": "Admonition 2",
      "text": "<p>Admonition content 2 (after)</p>"
    }
  ]
}
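The depth-tracking idea described in this comment can be sketched with the standard library's `html.parser`. This is a simplified illustration, not the code from 81e7b8c: the names `SectionParser` and `parse_sections` are hypothetical, only text is collected (no tags), and the input is assumed to be well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Void elements never get a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SectionParser(HTMLParser):
    """Attribute text to the nearest unclosed section (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        # The first section stands for the page itself and is never closed.
        self.sections = [{"title": "", "text": [], "depth": 0}]
        self.unclosed = [self.sections[0]]
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag in HEADINGS:
            # A headline closes sections opened at the same or a deeper level.
            while len(self.unclosed) > 1 and self.unclosed[-1]["depth"] >= self.depth:
                self.unclosed.pop()
            section = {"title": "", "text": [], "depth": self.depth}
            self.sections.append(section)
            self.unclosed.append(section)
            self.in_heading = True
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.depth -= 1
        if tag in HEADINGS:
            self.in_heading = False
            return
        # Leaving a container closes every section that was opened inside it.
        while len(self.unclosed) > 1 and self.unclosed[-1]["depth"] > self.depth:
            self.unclosed.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_heading:
            self.unclosed[-1]["title"] += text
        else:
            self.unclosed[-1]["text"].append(text)


def parse_sections(html: str) -> list[tuple[str, str]]:
    """Return (title, text) pairs in document order."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return [(s["title"], " ".join(s["text"])) for s in parser.sections]
```

Fed the admonition snippet from earlier in this thread, the sketch assigns the paragraph inside the admonition to "Headline" and the paragraph after the closing `</div>` to the page-level section, since the nested section is closed once its container is exited.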

@squidfunk squidfunk added the resolved Issue is resolved, yet unreleased if open label Jan 8, 2023
@squidfunk
Owner

6825734 makes collapsing of adjacent whitespace and removal of empty elements more resilient. This results in the following, more compact search index that only includes paragraphs with content:

{
  // ...
  "docs": [
    {
      "location": "",
      "title": "Example",
      "text": "<p>Some other text</p>"
    },
    {
      "location": "#list",
      "title": "List",
      "text": "<p>Admonition title</p> <p>Admonition content (before)</p> <p>List content 1</p> <p>List content 2</p>"
    },
    {
      "location": "#admonition",
      "title": "Admonition",
      "text": "<p>Admonition content (after)</p>"
    },
    {
      "location": "#nested-list",
      "title": "Nested list",
      "text": "<p>Nested list content</p>"
    },
    {
      "location": "#foo",
      "title": "Foo",
      "text": "<p>Bar</p> <p>Admonition title 2</p> <p>Admonition content 2 (before)</p>"
    },
    {
      "location": "#admonition-2",
      "title": "Admonition 2",
      "text": "<p>Admonition content 2 (after)</p>"
    }
  ]
}

Additionally, the case with empty lists reported by @defenestration doesn't crash the plugin.
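The effect of removing empty elements and collapsing adjacent whitespace can be approximated with a small regex-based sketch. This is not the implementation from 6825734; the function name `compact` and the regex approach are assumptions made for illustration only.

```python
import re

# An element whose body is empty or whitespace only, e.g. "<li> </li>".
# The \b keeps the captured tag name from matching only a prefix of a longer name.
EMPTY_ELEMENT = re.compile(r"<(\w+)\b[^>]*>\s*</\1>")


def compact(text: str) -> str:
    """Drop empty elements, then collapse runs of whitespace (hypothetical helper)."""
    previous = None
    # Repeat until nothing changes: removing "<li> </li>" can leave
    # its parent "<ul> </ul>" empty as well.
    while previous != text:
        previous = text
        text = EMPTY_ELEMENT.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(compact("<ul> <li> </li> </ul> <p>Some other text</p>"))
# <p>Some other text</p>
```

Applied to the "Example" and "List" entries of the earlier index, this yields the same compact strings as shown in the second index above.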

@squidfunk
Owner

Released as part of 9.0.3.
