Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snippets : warn against file or part changes through checksum #1824

Closed
feloxyde opened this issue Oct 1, 2022 · 11 comments
Closed

Snippets : warn against file or part changes through checksum #1824

feloxyde opened this issue Oct 1, 2022 · 11 comments
Labels
P: maybe Pending approval of low priority request. S: rejected The pull request is rejected for the stated reasons. T: feature Feature.

Comments

@feloxyde
Copy link

feloxyde commented Oct 1, 2022

Description

Issue #1462, closed not long ago, added a feature to allow including only specified line ranges from external file with snippets.
One concern regarding this feature was inconsistency it could create if file changed (e.g if lines were added/removed, creating an offset).
I think a checksum mechanism could be used to warn against changes, and it isn't limited to range inclusion, but could also benefit general snippets.

Context

I use pymdownx through Material for Mkdocs to build a computer science course. The course is for absolute beginners, and quite young. For this reason, there are heavy constraints :

  1. I can't give them full code if the file is large, otherwise they get confused, as for example if tests are included.
  2. I often need to explain the code in text, and thus make references to lines.
  3. I refer to some snippets at different places, for example exercises are often referred to in the course and in an "exercises" part of the doc.

Range inclusion solves problem 1, but inconsistency is still a danger here, as even in file fully included, an inconsistency in lines numbers can confuse the students.

I think a lot of tutorials and docs have this problems to a certain extent, and it would be beneficial to have a change warning mechanism.

Benefits

Protects against possible inconsistency when a file included through snippets in the docs changes, makes easier to structure large docs with multiple references to the same file, or with references to out-of-docs files, as test cases.

Solution Idea

Here is a draw at a possible solution :

Through the inclusion syntax, snippets could allow user to provide, at the end of file name chain, a checksum between bracket. So for example :

somefile::4[EA8A25E440]

Then snippet extension would check this checksum against either :

  • The entire file (likely easier to implement, but a slight bit less practical, yet sufficient i think)
  • The snippet included (more practical, but maybe more difficult to implement)

In case a checksum isn't matching, a warning message would be issued, mentioning both checksum :

Inclusion somefile::4[EA8A25E440] has mismatching checksum for somefile::4 : expected EA8A25E440, yet found EA5E4408A2.

This way, it is simple for user to update checksum after checking file content. Warning would likely be failing the build, or maybe an option in config could allow to choose whether warn or fail.

An empty checksum [] would be considered as a checksum as well, always failing, simply to allow user to use snippets to compute checksum.

@feloxyde feloxyde added the T: feature Feature. label Oct 1, 2022
@gir-bot gir-bot added the S: triage Issue needs triage. label Oct 1, 2022
@facelessuser
Copy link
Owner

I understand the problem associated with using line numbers, it's one of the reasons I was against it for so long, but, people want what they want.

Now, reducing this problem with hashes does increase the maintenance burden even more.

  • Some git repos, depending on settings, may checkout files with Windows new lines (\r\n) and on Linux/Unix systems they will use \n. So, now we need to normalize content to provide more realistic hashes as Python Markdown usually does this as well.
  • Hashes are done with byte data, not Unicode data, so we'll have to encode all the data, and perform the hash.
  • How do we make it easy for the user to know what the hash should be? Well, now we have to provide a tool that will also do the above steps that they can probably run on files.
  • We may start out with a hash for the full file, but I can also see the future push for per-line hashes coming as well 🙃 .

The better solution would simply be to provide some arbitrary way to target a partial from a file. That would be something like what is found here: https://rust-lang.github.io/mdBook/format/mdbook.html#including-portions-of-a-file. The link shows a feature where a user can specify a named section in addition to specifying lines.

The problem is how do you may it agnostic to the file in question. What if I want to extract a portion of a JavaScript file, but not render the JavaScript file broken? Well, I have to use comments, but how do I detect comments in a variety of file types. Well, I probably can't look for comments specifically, so maybe we just look for a special syntax anywhere, even inline in some cases, but maybe ignore everything on that line except the start and end markers (i.e. we don't ever include any characters from the line with the start and end markers).

Maybe:

// ---8<--- [start: name]
var test = 3
console.log(3 + 3)
// ---8<--- [end: name]

Then we can just specify in our Markdown:

---8<--- "somefile.js:name"

I don't know, I'm just thinking out loud here, so this isn't an official proposal, but maybe it could be 🤷🏻. Currently, I'm taking baby steps to allow people to utilize Snippets in an easier way. With that said, I'm not particularly sold on the hash idea as of now.

@facelessuser
Copy link
Owner

I've created a prototype at #1825. You are welcome to try it out.

@adriangb
Copy link

adriangb commented Oct 1, 2022

Docsify does it via comments as well: https://docsify.js.org/#/embed-files?id=embedded-code-fragments

@facelessuser
Copy link
Owner

Yeah, allowing them to be inline to be placed in comments is probably the most sane approach.

@feloxyde
Copy link
Author

feloxyde commented Oct 1, 2022

Thanks for quick reply !

Some git repos, depending on settings, may checkout files with Windows new lines (\r\n) and on Linux/Unix systems they will use \n. So, now we need to normalize content to provide more realistic hashes as Python Markdown usually does this as well

Oh, right, forgot about this ! Anyway I would say that is a Git problem, not a snippets problem, even if it is indeed better if snippets can convert, to avoid portability issues, especially if you want snippet be easy to use !

Hashes are done with byte data, not Unicode data, so we'll have to encode all the data, and perform the hash.

Why the need to encode data ? Can't the raw file be hashed instead ? Of course for lines that's quite a problem

How do we make it easy for the user to know what the hash should be? Well, now we have to provide a tool that will also do the above steps that they can probably run on files.

And that tool could be called snippets extension :D As I said (sorry if I was unclear), in error log, there would simply be the "expected" (in comment) hash, and the "actual" (currently computed) hash. When user runs build, he/she can copy-paste the actual hash and use it in the inclusion line.

The better solution would simply be to provide some arbitrary way to target a partial from a file. That would be something like what is found here: https://rust-lang.github.io/mdBook/format/mdbook.html#including-portions-of-a-file. The link shows a feature where a user can specify a named section in addition to specifying lines.

I agree with this, it would be really cool ! However mdbook approach is unsuited for simplicity I think, as it introduces comments that come polluting the code. In a course, this is especially problematic, because I tend to show both parts of code and full files, at different moments. For example, part of the code step by step, and a full file example recap at the end.

It also doesn't solve detection of changes in general, for a full file inclusion.

The problem is how do you may it agnostic to the file in question. What if I want to extract a portion of a JavaScript file, but not render the JavaScript file broken? Well, I have to use comments, but how do I detect comments in a variety of file types.

There, I think comment aren't needed actually, or that we can come to a more generic solution : simply specifying part of a line to be looked for.

For example, if we want to include following function from the file cool_file.py :

def cool_function(a, b):
    #do stuff here
    return "whao"

Syntax could be :

--8<-- "cool_file.py":{{cool_function}}:{{return}}"

This include lines from the first (included) containing cool_function, and until the first (included) line containing return.
I used double brackets {{ }} as delimiters, as they are already quite rares in code and other representations, but maybe could provide more alternatives, as for example [[ ]].

This would allow to both use delimiters comments "cool_file.py":{{//js section start}}:{{//js section end}}", and also to include code by function names and so on, letting user decide how he/she wants to identify lines. We can assume someone writing docs for whatever content knows the content sufficiently well to define unique lines identifiers.

I've created a prototype at #1825. You are welcome to try it out.

Thanks, that was fast ! trying out when I a bit more free time, likely in few days.

@facelessuser
Copy link
Owner

Oh, right, forgot about this ! Anyway I would say that is a Git problem, not a snippets problem, even if it is indeed better if snippets can convert, to avoid portability issues, especially if you want snippet be easy to use !

It's a problem that would keep coming my way regardless.

Why the need to encode data ? Can't the raw file be hashed instead ? Of course for lines that's quite a problem

If you want to do something like an MD5 hash, they are applied to the raw bytes of a file. You don't apply it to Unicode. None of this matters except that there is a bit more overhead, and you'd have to provide a way the user can figure out what the hash is

And that tool could be called snippets extension :D As I said (sorry if I was unclear), in error log, there would simply be the "expected" (in comment) hash, and the "actual" (currently computed) hash. When user runs build, he/she can copy-paste the actual hash and use it in the inclusion line.

How do you get your initial hash? Put in a bogus hash and run Python Markdown to get a dump of all the hashes?

It also doesn't solve detection of changes in general, for a full file inclusion.

The problem is, I'm not sure this is a problem I am necessarily wanting to solve. If a partial from a snippet is enclosed in special fences, as I was suggesting, you do not have to be notified that the file is changed. You are no longer stuck with specific lines that could change, you just have to place what you want to be included between the fence, and what you don't want outside the fence.

There, I think comment aren't needed actually, or that we can come to a more generic solution : simply specifying part of a line to be looked for.

I think this becomes a highly fragile solution potentially having numerous false positives. It also greatly complicates the syntax as well.

Keep in mind, I'm attempting to solve a number of general complaints. Another request was to include only content under a particular Markdown header, which frankly doesn't have much chance of happening as we do not have the appropriate context to do this without running the content through the parser which we are trying to preprocess before running through the parser.

Personally, I do not think it unreasonable to suggest that a user that intends to heavily use a file, extracting partials from it, use section fences to gain greater control of what is included.

@feloxyde
Copy link
Author

feloxyde commented Oct 4, 2022

How do you get your initial hash? Put in a bogus hash and run Python Markdown to get a dump of all the hashes?

I was thinking about an empty hash. For example, if we put hash as [defb945096...], it would be []. This way there is one single process to both state a file should be hash-checked and to detect and correct mismatching hashes ^^

The problem is, I'm not sure this is a problem I am necessarily wanting to solve. If a partial from a snippet is enclosed in special fences, as I was suggesting, you do not have to be notified that the file is changed.

I think the problem is indeed secondary, but there are still cases where it is nice to detect changes. For example if the documentation explains code referring to identifiers or other constructs in the code, and the code is refactored, out-of-code explanations can become confusing. However this is a really marginal problem, as such explanations can eventually be done through comments in the code in most cases.

I think this becomes a highly fragile solution potentially having numerous false positives. It also greatly complicates the syntax as well.

Keep in mind, I'm attempting to solve a number of general complaints. Another request was to include only content under a particular Markdown header, which frankly doesn't have much chance of happening as we do not have the appropriate context to do this without running the content through the parser which we are trying to preprocess before running through the parser.

It is more fragile than section comments for sure, but I allows inclusion of nearly-arbitrary content. User then is responsible for providing his/her own unique delimiters, even in files that have no comment syntax.
Of course, syntax would be a bit more complicated, however you don't need to add per-language support for comment syntax.

From an implementation perspective, maybe a simpler way to parse such expression would be to specify delimiters first in the expression so they are not comprised in the delimited parts.

Possible syntax :

--8<-- "filename"!|function_name|return|
  • You can remove filename with a shortest regex match on "" delimiters
  • The ! indicates that the next character is a delimiter.
  • Having a delimiter | not in either specified part, we can simply split the last part of the expression on | to get the start and stop content.
  • If you want to support sort of comment delimiter syntax, there could be two start symbols, for example ? and !, stating whether the line with content specified should be included or not.

However maybe it is a bit too complex for some users ?

@facelessuser
Copy link
Owner

This keeps referencing the following syntax.

--8<-- "filename"!|function_name|return|

I know the idea is that it is separate from the filename enclosed in quotes, but that is just for the single-line notation. We also allow block notation that does not use quotes for file names. Adding this additional requirement in syntax to filenames is not really desired.

--8<--
filename
filename2
--8<--

We can also allow URLs, and URLs already have more uncommon formats, which makes such a complex syntax even harder to use without running into issues.

Snippets was always meant to be a more simple file inclusion plugin. Recently I've caved and URL support and line numbers to appease people despite my apprehensions. I'm open to explicit sections as well to ease this a little more, but there are probably various, more complex enhancements that I'm likely not willing to add.

I've had requests to add more advanced templating features, and I'm frankly not likely to add all of that complexity to provide such behaviors. I feel that if we start going down this road, it would make more sense for someone to just implement a Jinja2 preprocessor plugin.

As far as this request is concerned, I'm not sure how I feel about adding arbitrary search specifiers. I'm sure that this syntax may work well for your specific needs, but I'm almost positive that it could easily have false positives. What about functions with multiple exit points that look similar? In languages that end functions with braces, it would be quite easy to get a false positive ending if reliant on brace endings.

In this case, I feel explicit, known fences are going to be better, and while I understand how that may be frustrating to someone who'd like more freedom to pull snippets from any arbitrary source code without polluting the file with special comments, this is probably as far as I am willing to extend things at this time.

Unfortunately, the hash idea is probably something I'm not interested in adding at this point either. The overhead and complexity to add it is currently more than I am currently willing to maintain at this time. I'm not sure if I'll change my mind in the future or not, but currently, this is not a direction in which I wish to go.

@facelessuser
Copy link
Owner

I think I'm willing to pursue the explicit sections, but not the arbitrary search tokens. I'm also not planning on implementing the checksums at this time either. With that said, I will probably leave this issue open (specifically the request for checksums) for now as a possibility for the future, assuming I change my mind on checksums.

@facelessuser facelessuser added P: maybe Pending approval of low priority request. and removed S: triage Issue needs triage. labels Oct 4, 2022
@feloxyde
Copy link
Author

feloxyde commented Nov 7, 2022

Sorry for the (very) late answer, and thank you a lot for your time !

I think I'm willing to pursue the explicit sections, but not the arbitrary search tokens. I'm also not planning on implementing the checksums at this time either. With that said, I will probably leave this issue open (specifically the request for checksums) for now as a possibility for the future, assuming I change my mind on checksums.

I definitely understand, I'll adapt myself to the tool then, which shouldn't be too hard. I'll eventually write my own extension when I have time and/or if things becomes too critical. In case I do so I'll keep you informed if you want to include it totally or partially in pymdownx.

As far as this request is concerned, I'm not sure how I feel about adding arbitrary search specifiers. I'm sure that this syntax may work well for your specific needs, but I'm almost positive that it could easily have false positives. What about functions with multiple exit points that look similar? In languages that end functions with braces, it would be quite easy to get a false positive ending if reliant on brace endings.

False positives are definitely a bad thing in this case, especially for you since people would get frustrated and post issues and so on. After thinking about it, while it would be a convenient feature, you are right it can't be reasonably implemented as a general-purpose solution, as it would too much of a bother most users to deal with.

In this case, I feel explicit, known fences are going to be better, and while I understand how that may be frustrating to someone who'd like more freedom to pull snippets from any arbitrary source code without polluting the file with special comments, this is probably as far as I am willing to extend things at this time.

One way to use special comments without polluting most files would be to have a way to tell snippets to include a file whole while stripping the lines where the comments live. Not sure about syntax, but maybe something like :

---8<--- "somefile.js:-"

In my case it would solve most problems, and I think most documentation cases as well, since it is common to use single file standalone examples for in docs.

Stripping could also be done manually with an external script, but it is less convenient unfortunately.

@facelessuser facelessuser added the S: rejected The pull request is rejected for the stated reasons. label Feb 23, 2023
@facelessuser
Copy link
Owner

I've spent some time thinking about this, and I've decided I'm not going to go this way. If I change my mind in the future, I'll open a new issue, but its time to clean out old issues that I'm probably not going to tackle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P: maybe Pending approval of low priority request. S: rejected The pull request is rejected for the stated reasons. T: feature Feature.
Projects
None yet
Development

No branches or pull requests

4 participants