Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: interface type productions #229

Closed
mccolljr opened this issue Jun 10, 2022 · 21 comments
Closed

Proposal: interface type productions #229

mccolljr opened this issue Jun 10, 2022 · 21 comments

Comments

@mccolljr
Copy link
Contributor

In the past, when I've written recursive descent parsers by hand, I have used interfaces to represent "classes" of productions, such as Stmt, Expr, etc - this simplifies the consuming code considerably.

I would like to be able to do something similar using participle - for example, I would like to define an Expr interface, and then parse that manually while allowing the rest of the grammar to be managed by participle.

Perhaps it could look something like this:

// Expr is the interface implemented by expression productions
type Expr interface { expr() }

// These types all implement Expr, and are parsed manually
type Atom struct { /* ... */ }
type Unary struct { /* ... */ }
type Binary struct { /* ... */ }

// This is the parser definition, with a new `UseInterface` option to define a custom parsing func for the Expr interface
var theParser = participle.MustBuild(&Grammar{}, participle.UseInterface(
    func (lex *lexer.PeekingLexer) (Expr, error) {
        /* Provide an implementation of the expression parser here */
    },
))

// And then I can use `Expr` like this:
type Stmt struct {
    Expr   Expr    `@@` // <- participle has registered `Expr` as an interface that can be parsed
    Assign *Assign `@@`
}

As an additional nice-to-have, it would be cool if there was a way to give a *lexer.PeekingLexer and a production to the parser, and have it fill in the data for me. Then I could compose participle-managed productions within the manual parsing code for Expr. This would allow me to break out of participle in a limited way, just for the things that I need (such as managing precedence climbing)

@alecthomas
Copy link
Owner

I've thought the same and I agree it would be nice, but I haven't come up with an API that I like. Maybe something like this could work...

package main

import "fmt"

type sum interface{ sum() }

type One struct{}
func (o One) sum() {}

type Two struct{}
func (t Two) sum() {}

func OneOf[T any](a ...T) {
	for _, b := range a {
		fmt.Printf("%T\n", b)
	}
}

func main() {
	OneOf[sum](One{}, Two{})
	fmt.Println("Hello, 世界")
}

@alecthomas
Copy link
Owner

As an additional nice-to-have, it would be cool if there was a way to give a *lexer.PeekingLexer and a production to the parser, and have it fill in the data for me.

You can achieve this with ParseFromLexer

@mccolljr
Copy link
Contributor Author

You can achieve this with ParseFromLexer

Ah, I will give this a try - the docs made me think this only accepted the top-level production:

ParseFromLexer into grammar v which must be of the same type as the grammar passed to Build().

@mccolljr
Copy link
Contributor Author

I've thought the same and I agree it would be nice, but I haven't come up with an API that I like. Maybe something like this could work...

Intriguing, but I think I'm not making the connection with how the example function would be used with the parser. Would OneOf be a parser option?

@alecthomas
Copy link
Owner

Ah, I will give this a try - the docs made me think this only accepted the top-level production:

That's correct, but the idea would be to have a separate sub-parser, then implement Parseable and call ParseFromLexer with the lexer, on the sub-grammar.

Would OneOf be a parser option?

Yes that was my thought, and internally it would effectively translate to a disjunction of each element, a bit like:

type intermediate struct {
  One *One `  @@`
  Two *Two `| @@
}

@mccolljr
Copy link
Contributor Author

That's correct, but the idea would be to have a separate sub-parser, then implement Parseable and call ParseFromLexer with the lexer, on the sub-grammar.

Ah that's unfortunate. In my case, the "grammar" and the "sub-grammar" have overlap in what productions they can produce. I don't think this solution would work for me in this case.

Yes that was my thought, and internally it would effectively translate to a disjunction of each element, a bit like:

Ah ok, I see - and what if the scenario is that my interface sum needs to be parsed manually?

@mccolljr
Copy link
Contributor Author

I threw together an ugly proof of concept, but it works and demonstrates a bit of what I'd like to be able to do:

master...mccolljr:feat/interface_productions

@alecthomas
Copy link
Owner

Ah that's unfortunate. In my case, the "grammar" and the "sub-grammar" have overlap in what productions they can produce. I don't think this solution would work for me in this case.

Why is that a problem? You can have a sub-parser include overlapping productions, that's not an issue.

Ah ok, I see - and what if the scenario is that my interface sum needs to be parsed manually?

The same way you would do it normally - implement Parseable.

@alecthomas
Copy link
Owner

BTW re. precedence climbing, the included example also does this by overriding Parseable.

@mccolljr
Copy link
Contributor Author

mccolljr commented Jun 10, 2022

RE:

Ah ok, I see - and what if the scenario is that my interface sum needs to be parsed manually?

The same way you would do it normally - implement Parseable.

I don't think this works - If the production type is itself an interface value (for example type Expr interface { expr() }, there is no concrete implementation of Parseable to be used. I would need some way to provide a function which takes a *lexer.PeekingLexer and return a value of the appropriate interface. This is what the proof-of-concept code enables. For me, the impetus behind this feature request is to avoid large structs like

type Expr struct {
    VariantA *VariantA
    VariantB *VariantC
    VariantC *VariantC
    // ...
    VariantZ *VariantZ
}

Additionally, if I want to encode operator precedence in the parsed AST, it results in a hierarchy that is deeply nested:

type Expr struct {
    PrecLevel BinaryLevelTop
}
type BinaryLevelTop struct {
    Lhs BinaryLevelTopSub1
    Op *string
    Rhs *BinaryLevelTop
}
type BinaryLevelTopSub1 struct {
    Lhs BinaryLevelTopSubN
    Op *string
    Rhs *BinaryLevelTopSub1
}
// ...
type BinaryLevelTopSubN struct {
    Unary Unary
}
type Unary struct { Op *string, Atom Atom }
type Atom struct { /* ... */ }

If I have an expression that consists of only an Atom value, I have to access it as expr.PrecLevel.Lhs.Lhs.Lhs./*...*/.Unary.Atom.

What I would like to be able to do instead is expr.(Atom) (or switch expr.(type) { /* ... */ }).

Which leads me to, RE:

Ah that's unfortunate. In my case, the "grammar" and the "sub-grammar" have overlap in what productions they can produce. I don't think this solution would work for me in this case.

Why is that a problem? You can have a sub-parser include overlapping productions, that's not an issue.

Let's say I have anonymous lambda functions in my expression grammar. My Stmt grammar needs to know how to parse the Expr interface, which requires registering interface parsing in that grammar. The Expr interface needs to know how to parse the Stmt grammar as the body of any inline function. This results in a confusing, cyclic parsing infrastructure that duplicates the parseable nodes multiple times. It would be a much nicer interface to be able to do something like ParseProduction(lex, &funcBody) from within the Parse function for my Expr type.

For a concrete example of the full behavior I'd like to use:

type File struct {
    Statements []Stmt `@@*`
}

type Stmt struct {
    ExprStmt Expr `@@`
    /** other stmt types */
}

// The Expr interface type
type Expr interface{ isExpr() }

// The parser definition
var theParser = participle.MustBuild(&File{}, participle.UseInterface(exprParser))

func exprParser(lex *lexer.PeekingLexer) (Expr, error) { return exprParserWithPrec(lex, 0) }

func exprParserWithPrec(lex *lexer.PeekingLexer, minPrec int) (Expr, error) {
    /* ... example of parsing specific productions ... */
    if lex.Peek().Value == "func" {
        /*... parse funcHeader, etc ...*/
        var funcBody []Stmt
        for lex.Peek().Value != "}" {
              var s Stmt
              if err := theParser.ParseProduction(lex, &s); err != nil { return err }
              funcBody = append(funcBody, s)
        }
        return FuncExpr{ funcHeader, funcBody }
    }
    /* ... */
}

@alecthomas
Copy link
Owner

I don't think this works - If the production type is itself an interface value (for example type Expr interface { expr() }, there is no concrete implementation of Parseable to be used.

Yes... I meant that you would implement Parseable on the concrete elements of the interface. This wouldn't work if your top-level parsing logic is more than @@ | @@ | @@ ..., though. In my experience 99% of the cases where I want a sum type this would suffice, so I would want this use case to be ergonomic. There still might need to be an escape hatch for manually parsing interfaces, do you have any syntax examples?

But eg. for a terminal this would be fine.

type Value interface { val() }

type Num struct {
  Value int64 `@Int`
}

func (Num) val() {}

type String struct {
  Value string `@String`
}
func (String) val() {}

type Lambda struct {
  Params []*Param `"func" "(" @@* ")"`
  Body *Body `"{" @@ "}"`
}
func (Lambda) val() {}

One downside of this approach is non-struct types have to be wrapped in a struct.

ParseProduction(lex, target) could be a nice QoL improvement, but all I was stating is that what you're trying to achieve is currently possible, it's just a bit more work.

@mccolljr
Copy link
Contributor Author

Yes... I meant that you would implement Parseable on the concrete elements of the interface. This wouldn't work if your top-level parsing logic is more than @@ | @@ | @@ ...,

Ah, ok - I misunderstood what you meant initially, I see what you are saying now, and that makes sense.

though. In my experience 99% of the cases where I want a sum type this would suffice, so I would want this use case to be ergonomic. There still might need to be an escape hatch for manually parsing interfaces, do you have any syntax examples?

I agree with the fact that the average case is @@ | @@ | @@ (etc)

In my case, the expression grammar can all be parsed with a single-token lookahead and if I can place this all inside a single function that is just munching tokens and producing Expr nodes, I can avoid all backtracking in expression parsing, with the benefit of only needing a single BinaryExpr definition which can be constructed appropriately in code.

Would you consider it ergonomic enough if there were a separate UseUnion option, which would be a wrapper around the UseInterface option and which would just generate a parsing function to match the union items in the order defined? This would allow the user to escape to manual parsing via UseInterface in the special case, but would allow a simple mechanism for specifying a union of types for the more common case of an @@ | @@ | @@-style union. Implementing this would be simplified if the ParseProduction addition were also included, as it could be a for-loop over the inputs to choose the first match of the given productions using ParseProduction.

@alecthomas
Copy link
Owner

Would you consider it ergonomic enough if there were a separate UseUnion option, which would be a wrapper around the UseInterface option

Yeah, I think that would be totally fine. I think ParseProduction() would be handy too, if you're interested, though my vague memory indicates that might require quite a bit of plumbing.

@mccolljr
Copy link
Contributor Author

mccolljr commented Jun 10, 2022

Yeah, I think that would be totally fine. I think ParseProduction() would be handy too, if you're interested, though my vague memory indicates that might require quite a bit of plumbing.

Based on what I've seen I think you're right about the plumbing. There's a map[reflect.Type]node that gets constructed during the build phase that I'm thinking might be possible to capture into the constructed parser. I will dig in a bit and put together a proof-of-concept in the same branch from before.

If the ergonomics work out there, I can clean up and put together a PR

@mccolljr
Copy link
Contributor Author

I have a working prototype, but there are some interesting cases to consider:

If a user calls ParseProduction from inside a Parseable implementation or a UseInterface function, should those calls appear in the trace? My instinct is yes, they probably should. Enabling this will require a bit of a refactoring of how tracing is currently handled.

There may be a "broken chain" where there are some productions that only get parsed via ParseProduction from within a call to Parseable.Parse or a UseInterface function. These productions would not be detected during parser construction because they are not directly attached to the top-level grammar construct. The workaround is to provide a top-level grammar object that just lists all the available productions... which brings me to my third thing:

I think this API makes the most sense if, rather than a specific ParseProduction, all parse calls can take a target of any production type, and will choose the proper parse node automatically based on this. It allows specifying a top-level grammar object that enumerates specific productions without even being parsed directly itself, and it simplifies the ergonomics a bit in a backwards-compatible way.

I'm going to keep playing with this a bit this morning and will push up the proof-of-concept branch a bit later, but wanted to solicit some thoughts on the above

@alecthomas
Copy link
Owner

I have a working prototype, but there are some interesting cases to consider:

If a user calls ParseProduction from inside a Parseable implementation or a UseInterface function, should those calls appear in the trace? My instinct is yes, they probably should. Enabling this will require a bit of a refactoring of how tracing is currently handled.

There may be a "broken chain" where there are some productions that only get parsed via ParseProduction from within a call to Parseable.Parse or a UseInterface function. These productions would not be detected during parser construction because they are not directly attached to the top-level grammar construct. The workaround is to provide a top-level grammar object that just lists all the available productions... which brings me to my third thing

Valid concern, and obviously already the case with Parseable implementations, but this would likely exacerbate the issue for sure. This brings up a bigger concern for me actually, around the common case of Union() we discussed: it will not be a first-class node type that can be represented by the EBNF() method, or visited. This makes me think that this case should be separated.

Somewhat related, I also had the thought that this solution shouldn't really be explicitly tied to interfaces - it's basically an alternative mechanism for specifying a Parseable, without having to implement it on a type. There doesn't seem to be a reason to tie it specifically to interfaces that I can see, and allowing any type would be a more general solution.

I think this API makes the most sense if, rather than a specific ParseProduction, all parse calls can take a target of any production type, and will choose the proper parse node automatically based on this. It allows specifying a top-level grammar object that enumerates specific productions without even being parsed directly itself, and it simplifies the ergonomics a bit in a backwards-compatible way.

This seems less obvious to me, but also I have a plan to make the parser generic (see #208) which this would break.

@alecthomas
Copy link
Owner

alecthomas commented Jun 11, 2022

BTW breaking the API is not a big concern (assuming it's not a gratuitous change) as v2 is still in alpha.

@mccolljr
Copy link
Contributor Author

mccolljr commented Jun 12, 2022

I'm in agreement with all of your conclusions, but I think it will be easier to discuss specific ergonomic issues etc. once there's some concrete code to talk about. I'm making these adjustments to the proof-of-concept, and once that's done I'll open a draft PR for some feedback and with I think a few additional questions around handling of unions.

@mccolljr
Copy link
Contributor Author

@alecthomas I've put up a draft PR at #233. I tried to add some complex test cases to test & demonstrate the use cases for these two features. I've also left some comments there discussing a few details around unions and custom productions.

Notably, right now this PR does not include the ParseProduction stuff - that might be better suited as a separate PR after we hammer out the details of custom and union productions.

alecthomas pushed a commit that referenced this issue Jun 13, 2022
@mccolljr
Copy link
Contributor Author

Now that #233 has been merged, are you still open to the addition of ParseProduction? During my initial experiments it worked well except for tracing. I'll have to figure out how to add tracing during node construction, but aside from that I think I have a workable approach

@alecthomas
Copy link
Owner

Yeah I think so, as long as it's not crazy intrusive it seems like a reasonable thing to have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants