Introduce treatAsNonHtml to remove isHTML implementation #1230

Open
wants to merge 1 commit into base: main
Conversation

edwinv
Contributor

edwinv commented Mar 25, 2024

Fixes #608

When Turbo navigates to a new URL, locationIsVisitable is called to check whether the location is visitable. Part of that check is determining whether the requested URL will return HTML. We can never determine with 100% confidence that a request will return HTML: a page with a .html extension might still return another content type, and a page with a .foobar extension might return HTML. A check based on the file extension is therefore imprecise and prone to bugs.

Turbo already handles non-HTML responses. When a response doesn't contain HTML, a contentTypeMismatch is raised, which causes a full page load of the requested URL. So we could remove the isHTML check completely and navigation would still work. The major downside is having two requests: one Turbo fetch and one regular browser request.

A user could prevent these double requests by adding data-turbo="false" to such links. But a sensible default would be to never handle known non-HTML extensions as a Turbo request. This PR introduces the Turbo.treatAsNonHtml setting to enable this.

By default, a range of common non-HTML extensions is included in this setting, such as .png and .jpg. If users have many links to other extensions and don't want to sprinkle their HTML with data-turbo="false" attributes, they can add new extensions to the set:

Turbo.treatAsNonHtml.add(".mp3")

Because these changes are made in the lower-level checks, they work both for regular visits and the newly introduced page refresh morphs.
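Roughly, the lower-level check could look like this (a sketch rather than the exact diff; it reuses the existing getExtension and isPrefixedBy URL helpers, and the defaults shown are only a few illustrative entries):

export const treatAsNonHtml = new Set([".png", ".jpg", ".gif", ".pdf", ".zip"])

export function locationIsVisitable(location, rootLocation) {
  // Only bypass Turbo for extensions we know are not HTML; everything else,
  // including extensionless paths, stays a Turbo visit.
  return isPrefixedBy(location, rootLocation) && !treatAsNonHtml.has(getExtension(location))
}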

Mentioning @packagethief @dhh @jorgemanrubia for some visibility.

@packagethief
Contributor

packagethief commented Mar 25, 2024

I'm wary of using an exclusion list and assuming that most things are HTML by default. I think the list will just keep growing. But I do like the API you've proposed better than isHTML's existing regex. What would you think of using that API with an HTML extension allowlist? Then applications can more easily configure the extensions they want treated as HTML:

Turbo.htmlExtensions.add(".foobar")

The implementation would be as you've proposed, but would check for inclusion in the HTML extensions, rather than exclusion. We'd ship with the defaults already defined by the regex:

export const htmlExtensions = new Set([ ".html", ".htm", ".xhtml", ".php" ])

export function locationIsVisitable(location, rootLocation) {
  return isPrefixedBy(location, rootLocation) && htmlExtensions.has(getExtension(location))
}

EDIT: I guess that won't work when there's no extension – the implementation would need to account for that.
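For example, extensionless paths could be assumed to be HTML (a sketch, assuming getExtension returns an empty string when there is no extension):

export function locationIsVisitable(location, rootLocation) {
  const extension = getExtension(location)
  // No extension: assume HTML. Otherwise the extension must be in the allowlist.
  return isPrefixedBy(location, rootLocation) && (extension === "" || htmlExtensions.has(extension))
}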

@edwinv
Contributor Author

edwinv commented Mar 25, 2024

On the one hand, finding the extension is always an educated guess. In our application we have URLs with a filter containing a date range representation, for example /invoices/filter/period:202401..202403. This is a valid URL and is handled perfectly well by Rails routing, but Turbo thinks the extension is .202403. That part of the URL is dynamic and can never be added to an htmlExtensions array on the frontend, so Turbo decides it is not HTML and will always reload the page. Up until recently this wasn't an issue: while navigating, the user just got a fresh page. With Turbo Streams refreshing/morphing the page, the user now faces a complete reload while we are only trying to update a small part of it, which degrades the user experience.
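To illustrate the problem (assuming an extension check along the lines of what Turbo does today):

const url = new URL("https://example.com/invoices/filter/period:202401..202403")
const extension = (url.pathname.match(/\.[^.\/]*$/) || [""])[0]
// extension is ".202403" — it looks like a file extension, so the URL is treated as non-HTML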

On the other hand, we need to think about fallback situations. If the treatAsNonHtml set doesn't cover every case, the page will still load. It takes two requests, but the user won't notice; the main cost is wasted traffic and a slight delay. Compare this to an incomplete htmlExtensions list: there the user gets a degraded experience due to full page reloads where morphing is expected.

So, I'm still in favor of the proposed set for non-HTML extensions. I'm aware it is not covering all possible extensions, but when it does not cover an extension, the user won't notice and the experience stays the same.

I like the htmlExtensions name, so maybe we should rename treatAsNonHtml to a simpler name like nonHtmlExtensions or neverLoadWithTurbo.

@packagethief
Contributor

> So, I'm still in favor of the proposed set for non-HTML extensions. I'm aware it is not covering all possible extensions, but when it does not cover an extension, the user won't notice and the experience stays the same.

Thanks for the explanation. I hadn't considered dynamic paths like the one you described. I think your reasoning makes sense.

I think we'd need a much larger default exclusion list, however. Just looking at Basecamp and HEY, we have a pretty extensive list of extensions we'll auto-link to: zip, tar, gz, bz2, rar, 7z, dmg, exe, msi, pkg, deb, iso, bmp, mp4, mov, avi, mkv, wmv, heic, heif, mp3, wav, ogg, aac, wma, webm, ogv, mpg, mpeg. While these could return HTML, it's extremely unlikely, and we wouldn't want to start making two requests to resolve them.

@packagethief
Contributor

Thinking about it more, I think there are more downsides to an exclusion list than the inclusion regex we have now. It makes more sense to define the extensions that we're confident are HTML than it does to define the ones that aren't. The former is going to be a much smaller set than the latter for all practical purposes.

The example with a pathname like /period:202401..202403 feels like an uncommon case, and one that could still be supported by augmenting the regex we have now. I'd be in favor of a nicer API that didn't involve replacing the regex though, like a way to define additional extension matchers.
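For instance, the matchers could be regular expressions consulted alongside the current check (illustrative only; additionalHtmlMatchers and locationIsHtml don't exist in Turbo today):

const additionalHtmlMatchers = new Set()
additionalHtmlMatchers.add(/\/period:\d{6}\.\.\d{6}$/)

function locationIsHtml(location) {
  return isHTML(location) ||
    [...additionalHtmlMatchers].some((pattern) => pattern.test(location.pathname))
}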

@edwinv
Contributor Author

edwinv commented Mar 25, 2024

For the inclusion list/regex to work, it needs perfect coverage. My situation is very specific, but there have been plenty of other examples: FQDNs like example.com, Rails ids in paths, .deploy, file extensions, and usernames containing a dot. I don't see how a default list/regex is going to cover all these situations.

If the list/regex is configurable, a user could extend it to cover their needs. But even then it is hard to cover everything: many paths contain dynamic parts that are hard to capture in a list or regex.

But I'm also thinking about the developer experience. A new developer starting with Turbo expects Turbo to work all the time. When some of the requests/refreshes suddenly turn into full page reloads, it is not immediately clear what is happening. This was my firsthand experience over the last few days, when users complained about random refreshes and we couldn't find the cause, because the . only appears in the URL in specific situations.
Having an inclusion list/regex makes things more flexible for the developer, but also opens the door to new bugs. If the regex accidentally matches every extension, Turbo suddenly starts loading full PDF files or large binaries.

I don't see a downside to a long(er) exclusion list. We could probably come up with a reasonable list of, say, the 100 most-used file extensions on the web that don't return an HTML response. If we miss one, the end user won't notice anything other than the double request, and the developer can safely extend the list with a specific extension.

What I was also thinking about is the introduction of the (first) Turbo configuration setting. Instead of going the Turbo.someConfig route, we could also introduce a specific <meta> tag for the configuration. This would be more in line with how other Turbo configuration is done.
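For example (hypothetical; neither this tag name nor such a startup hook exists today), a meta tag could seed the set when Turbo boots:

// <meta name="turbo-non-html-extensions" content=".pdf .zip .mp3">
const meta = document.querySelector('meta[name="turbo-non-html-extensions"]')
if (meta) {
  for (const extension of meta.content.split(/\s+/).filter(Boolean)) {
    treatAsNonHtml.add(extension)
  }
}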

@brunoprietog
Contributor

Is it too complicated to add a slash at the end? In Rails you can easily do it with the trailing_slash option, at least. It would automatically solve your problem.

@edwinv
Contributor Author

edwinv commented Apr 5, 2024

> Is it too complicated to add a slash at the end? In Rails you can easily do it with the trailing_slash option, at least. It would automatically solve your problem.

Yes, this has all kinds of side effects in our application. Furthermore, I believe a library like Turbo shouldn't dictate whether trailing slashes are added or not. Rails might have an easy solution for this, but isn't this library intended to be used with other frameworks too?

@edwinv
Contributor Author

edwinv commented May 13, 2024

How are we going to move forward with this PR? I haven't read any arguments that convince me the exclusion list isn't going to work, other than some minor inconvenience in maintaining the list. From my point of view, that downside doesn't outweigh the unexpected behavior of full page refreshes caused by some random . being interpreted as a file extension. I still consider detecting file types by their extension a code smell.

Successfully merging this pull request may close: Turbo will not navigate to URLs that have a . in the pathname (#608)