Adding source endings to tokens #76

Open · jbsulli opened this issue Jan 17, 2017 · 26 comments
@jbsulli commented Jan 17, 2017

Currently working on adding source line endings to all tokens as discussed here. However, I wanted to move this to its own issue instead of hijacking the other one.

Going to make it match Babylon's loc format, which looks like:

var token = {
    loc: {
        start: { line:1, column:1 },
        end: { line:1, column:13 }
    }
};

@jbsulli (Author) commented Jan 17, 2017

Note: Babylon's loc object also includes a filename property; however, attaching the filename using the lexer's this.filename property resulted in the absolute file path, which produced unit tests that are OS/folder dependent. I'm leaving it off for now until I get input on this. Using the relative file path, or reworking the tests a bit, would solve the issue.

Any thoughts?

@jbsulli (Author) commented Jan 17, 2017

So, given the following examples:

- var name = "Josh"
^ ^
1 2

p hello #{name}!
        ^ ^
        1 2

p= name
 ^ ^
 1 2

The lexer currently marks the beginning of the code/interpolated-code token as being at position 1 in all examples; however, it then sets val starting at position 2. I would argue that any issue with Pug's code syntax will throw an error in the lexer at the place of the error, and after lexing, the position of the code syntax is not nearly as important as the code itself.

If we create tokens for JavaScriptExpressions (as @ForbesLindesay mentioned here) and store them inside code/interpolated-code tokens, then we would be storing both positions.

Thoughts?

For now, I'll just use position 1 and will change it later if needed.
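For concreteness, here is roughly what that code token might look like for the p= name example (just a sketch; the 1-based, end-exclusive column convention is my assumption):

var token = {
    type: 'code',
    val: 'name',                       // the code itself starts at position 2
    loc: {
        start: { line: 1, column: 2 }, // position 1: the `=`
        end: { line: 1, column: 8 }    // one past the last character of `name`
    }
};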

@ForbesLindesay (Member) commented:

For the filename, in the test cases you can just replace __dirname with '' and then replace \\ with /.

Yes, I think we will need a separate JavaScript expression token. This will be essential when we generate source maps/use babel-code-gen. Using position 1 sounds fine for now.
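Something like this, for example (just a sketch; it assumes the token stores an absolute path on loc.filename):

function normalizeFilename(token) {
    // strip the OS/folder-dependent prefix, then normalize Windows path separators
    token.loc.filename = token.loc.filename
        .replace(__dirname, '')
        .replace(/\\/g, '/');
    return token;
}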

@jbsulli (Author) commented Jan 18, 2017

Cool, I'll go with that. I suppose we could have the lexer just store the relative file path as well.

Yeah, I keep thinking an identifier token for tag/attribute/mixin/class/id/filter/etc. names, as well as a JavaScript expression token, will be in order down the road. For now I'm just focusing on getting an end location for all current tokens.

For future reference though, consider the following:

a("foo"=bar)
--^^-^^---^
  12 34   5

I would expect the lexer to have an attribute token spanning from 1 to 5, followed by an identifier spanning 2 to 3 (since quoted attribute names do not include the quotes), and then a JavaScript expression spanning bar, correct?

So far, things are going pretty well. I had to write a function to console.log every resulting token followed by the text it spans, so I can quickly see if one of the endpoints is off. Getting a little hung up on the attributes, as the code is a little tricky, but it's not bad. Attributes with quoted names currently start at position 2 (using the example above), where I would expect them to start at position 1, correct?

@jeromew commented Jan 18, 2017

For info, if you need it: here is the work-in-progress rewrite of pug-code-gen using the Babel AST. It works, and it parses the JavaScript expressions into an AST, which we can remove once the lexer 'caches' the parsed AST.

https://github.com/jeromew/pug-code-gen/blob/master/index.js

Currently I use https://github.com/jeromew/pug-code-gen/blob/master/index.js#L108 for parsing expressions and https://github.com/jeromew/pug-code-gen/blob/master/index.js#L111 for parsing argument-like structures.

The loc numbering with these hacks has an offset, and there are corner cases that these constructs could parse wrongly, so a PR is in progress on Babylon (babel/babylon#213) to be able to parse expressions directly.

@ForbesLindesay (Member) commented Jan 18, 2017

The lexer should not produce a separate tag token + identifier token, since the tag token in the lexer ends at the end of the tag name. It's in the parser that we would create most of that nesting.

Yes, I think we currently produce a single "attributes" token, but I think it would be better if the lexer produced a stream of start-attributes, attribute-key, attribute-value, end-attributes tokens.
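Illustratively, lexing a(foo="bar") might then yield something like the following (the exact token names and fields are not settled; this is just the shape I have in mind):

[
    { type: 'start-attributes' },
    { type: 'attribute-key', val: 'foo' },
    { type: 'attribute-value', val: '"bar"', mustEscape: true },
    { type: 'end-attributes' }
]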

@jbsulli (Author) commented Jan 22, 2017

@jeromew: Nice! I'm excited for that babylon pull request. I was doing the hacky stuff you described on a project myself. Thank you for pointing these lines out! I'll keep them in mind as I get closer to that piece.

@ForbesLindesay: That makes sense. I've been reading up on the difference between lexers and parsers and believe I was blurring the two. I'm keeping the single attribute token for now, until I have spans on all tokens, but it is ready for the setup you described. It definitely cleaned things up, IMO.

I did find an inconsistency. One of the tests has the following line:

a(foo, bar, baz)

The expected tokens for these attributes are:

{"type":"attribute","line":3,"col":3,"name":"foo","val":true,"mustEscape":true}
{"type":"attribute","line":3,"col":8,"name":"bar","val":true,"mustEscape":false}
{"type":"attribute","line":3,"col":13,"name":"baz","val":true,"mustEscape":false}

Notice the values of mustEscape for each? The first is true while the others are false. Maybe it doesn't matter because a boolean val doesn't use the mustEscape value? Or is this by design? The value is initialized here and reset here.

@TimothyGu (Member) commented:

@jbsulli, good catch! Practically, I guess it doesn't matter, but I think the preferred behavior should be to reset mustEscape to true (in case some plugin that changed val forgot to set mustEscape to true).

@ForbesLindesay (Member) commented:

Yes, this definitely looks like a bug. mustEscape should be set to the same value in all three attributes. I would tend to agree with @TimothyGu that true is the better default here.
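As a sketch of the fix (variable names are illustrative, not the actual lexer internals):

if (isBooleanAttribute) {
    tok.val = true;
    tok.mustEscape = true; // always reset; never reuse the previous attribute's flag
}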

@ForbesLindesay (Member) commented:

@jbsulli We're moving everything to a single monorepo: https://github.com/pugjs/pug

There are a few advantages:

  1. It is easy to run all the tests, rather than only the tests that directly test the lexer.
  2. Jest, the test runner we're now using, has a great watch mode (npm run watch).
  3. Jest has snapshots built in, and I've set them up to automatically truncate absolute file paths so they work fine, provided the filename is a string stored in a "filename" property.

I suggest making your changes on the monorepo as this repo will be disappearing soon :)

@jbsulli (Author) commented Jan 25, 2017

Done with the lexer! That was much crazier than I expected but I learned a lot. All unit tests are passing and I have comparable test coverage (I'd be down for getting it to 100%, if you want).

All tokens are the same except the location data and there are no new tokens, yet. Most tokens just needed the this.tokEnd(tok) function I wrote to be called just before saving the token.
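For reference, the idea behind tokEnd is essentially just this (assuming the lexer tracks lineno/colno as it consumes input; the real code may differ):

// inside the Lexer prototype
tokEnd: function (tok) {
    // the lexer's current position is the end of the token being saved
    tok.loc.end = { line: this.lineno, column: this.colno };
    return tok;
}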

Attributes, however, were completely rewritten, as it was much too complicated to keep track of the actual starting and ending locations. There are now three functions handling attributes: attrs, attributeName, and attributeValue. This was to make it easy to pull them out if we want separate tokens in the future. The mustEscape value is now true for all boolean attributes.

I'll work on building a pull request for the new monorepo so you guys can look at it. Need to work on updating the parser as well.

I'm super excited for the monorepo. Should be much easier to handle pulling in changes for both the lexer and parser for this.

@ForbesLindesay (Member) commented:

Thanks, this all sounds incredibly awesome. Getting to 100% test coverage would be great, but don't stress over getting it there as part of this pull request. I'd much rather see this pull request merged with decent test coverage, then see a subsequent pull request that just increases the test coverage.

Can't wait to see the code you've written, especially the new attributes parsing code.

@jbsulli (Author) commented Jan 26, 2017

Started working on moving it over to the monorepo. The snapshots are making my eyes cross. Is there a way to break those out so it's not a single 46,000-line file? I haven't messed with Jest before, so I'm still learning it. So far it seems pretty cool. Their expect function is more in line with what I'm used to (Mocha/Chai), and they have a pretty sweet plugin for VSCode. The custom matchers sound awesome! I've been writing things like that in Mocha for a while.

@ForbesLindesay (Member) commented:

The only way to split the file up is to run the tests in separate JavaScript files. I'm starting to move towards this approach with new tests (e.g. pugjs/pug@d91bab7), but I don't think it's worth doing with existing ones. I don't think it makes much difference to how easy the diffs are to review, though.

It might be something we could propose to Jest, though.

@jbsulli (Author) commented Jan 29, 2017

Yeah, that's true. I'll play with them more and see how it goes. I like some of the things Jest does to try to keep diffs readable. For now, I was able to copy the updated tokens from the lexer output over to the parser test cases, but I see that being a pain to maintain in the future.

I spent most of the weekend trying to track down why I was getting strange behavior when debugging Jest tests with VSCode. Oddly, the fix was to do npm install --legacy-bundling. I submitted an issue for it, but now I should finally be able to get to the parser.

@jbsulli (Author) commented Feb 9, 2017

Hey everyone. I've been fighting the flu so I've not been able to get anything done on this lately. But now I'm ready to get this going again.

What are your thoughts for things like tags, mixins, blocks, etc? Do you want the loc value to span just the tag and attributes? Or all child content as well? Ex:

#content
    p.first
        a(href="http://www.google.com") Google
        | is planning on taking over the world.

So do you want the loc to span just the tag name and attributes (ex: #content, p.first, a(href="http://www.google.com")), or the children as well (ex: #content would span the entire source above, p.first would span all the way to the period after world, and a(...) would span Google)? We could have the tag span just its definition but have block nodes (which hold the children anyway) span the children. Any thoughts?

@jbsulli (Author) commented Feb 9, 2017

Hm, you also have different source files being included into the main file. The loc can only span lines/columns of a single file as it is currently defined. I don't think this is really an issue, but a block node spanning an include would then contain content from different source files. Just some thoughts.

So what I'm thinking for now: blocks span all their children, but a tag spans only its name, id, classes, attributes, and &attributes, not its children. When includes/extends happen, the container block's loc value stays the same, but the children merged in have their own loc values pointing at their source files.

@ForbesLindesay (Member) commented:

The general rule is: the loc should refer to the entirety of the source code that was part of that token (before linking). In the lexer, this normally means just the tag. In the parser, it always means tag + children, i.e. everything that's within that object. Don't worry about the multiple-files issue, since that doesn't come up until the pug-linker stage; after linking, we just preserve the values from before linking.
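So, for the a(...) line in the example above, the parser's node would end up looking something like this (field names are illustrative; columns assume a 1-based, end-exclusive convention):

{
    type: 'Tag',
    name: 'a',
    loc: {
        start: { line: 3, column: 9 },  // the `a`
        end: { line: 3, column: 48 }    // one past the end of its child text "Google"
    },
    block: { /* the child nodes, each with their own loc */ }
}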

@jbsulli (Author) commented Feb 12, 2017

@ForbesLindesay: That makes things much easier.

I've gotten pretty far with the parser, but I'm not sure I like how I've implemented adding the loc value to the node. The problem is merging all children into the current node. Right now I have wrapper functions that you wrap around each node you add as a child; this takes care of expanding the parent's loc so it includes the child. However, the wrapper-function approach seems fragile and requires a pretty good knowledge of what is actually going on, so it doesn't seem like it will be very friendly for other maintainers going forward. Right now there are 4 different functions handling this, and they really don't seem intuitive. You can see what is going on by checking out this commit.

I feel like a better approach would be to track the last legitimate token's ending location and expand the parent's end to that position when the children are done being parsed. I don't know that I can do this in a friendly way without some pretty big changes. I'm curious whether a lot of the parser logic could be cleaned up by treating the nodes more anonymously. I'm going to play with it a bit and see whether it's a bad road to go down. Any thoughts? Are you OK with some pretty big changes? I don't want to hijack this.
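A sketch of that alternative (names are hypothetical, not the current parser API):

function closeNode(node, lastTokEnd) {
    // once the children are parsed, expand the parent's end
    // to the ending location of the last token consumed
    var end = node.loc.end;
    if (lastTokEnd.line > end.line ||
        (lastTokEnd.line === end.line && lastTokEnd.column > end.column)) {
        node.loc.end = lastTokEnd;
    }
    return node;
}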

If you want to really see what the loc values look like, you should check out my handy CLI tool, pug-loc-debugger. It's been a life-saver for this project and is kind of fun to see in action. It should be pretty future-proof too, because it's a global package that just takes over the pug modules in the folder it's run from and taps into the plugin system (it's also a great example of using the pug plugin system). Just clone my repo, switch to the source-end-locations branch, and run pug-loc-debugger from that directory.

@ForbesLindesay (Member) commented:

I don't have any strong opinions on this at the moment. Would you be able to submit a pull request for adding token endings to the lexer? I'd much rather review the updates to the lexer, get that landed, then start looking at what changes are needed in the parser. It's much easier if we do one thing at a time :)

@ForbesLindesay (Member) commented:

P.S. pug-loc-debugger looks awesome. I've been meaning to build something like that for a while.

@nojvek commented Jan 28, 2018

Kindly following up on this.

Would love to see this merged so I can build proper source maps for pug based webpack loaders.

Anything that I can help with?

@ForbesLindesay (Member) commented:

A pull request to the lexer (https://github.com/pugjs/pug/tree/master/packages/pug-lexer) adding full location info to each token would be helpful and would get accepted.

@jbsulli (Author) commented Feb 11, 2018

@nojvek, thanks for the bump! I forgot about this project with all the life changes I've had (new job, big move). I'm wrapping up the pug-lexer piece now and just got it to pass all the tests. This is just the first phase, though; I've still got to update the parser etc. to implement endings as well.

Pug source maps are going to be sick. Also looking forward to getting everything moved over to babel and passing the babel AST along so we (hopefully) have a nice speed improvement.

@nojvek commented Feb 11, 2018 via email

@jbsulli (Author) commented Feb 12, 2018

pugjs/pug#2957
