Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix quoting logic for attribute values #339

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ferdnyc
Copy link
Contributor

@ferdnyc ferdnyc commented Apr 30, 2024

This started off as an effort to simply fix quoting for attribute values containing colons (#258), which was being hosed by the logic that recognizes ID values in the form foo:n as IDs with an optional port value.

That's what made me realize that the rules for quoting IDs and the rules for quoting attributes are totally different, and we shouldn't be trying to make them the same.

So, this PR breaks out quote_if_necessary() into two functions, quote_id_if_necessary() (which keeps the same logic as before), and quote_attr_if_necessary(), which is heavily biased towards quoting. It'll basically quote anything that isn't numeric, HTML, or already double-quoted. Simple string values like shape=box that used to be left unquoted will now be output as shape="box", because it's never wrong to do that. And it avoids a lot of problems.

I also removed the re.UNICODE flag from all the regexps, as in Python3 that's a tautology, and replaced a '"' + s + '"' with fr'"{s}"'. (Though that needn't have been a raw string, come to think of it.)

Three tests (only 3!!) had to be adjusted, so their expected values matched the new quoting rules.

Fixes #258

(Doesn't fix the other issues involving IDs, like #118)

Instead of quote_if_necessary() trying to be all things to all
situations, split into quote_id_if_necessary(), which uses the
same logic as before, and quote_attr_if_necessary(), which is
heavily biased towards quoting, not doing so only if the value
is numeric, HTML, or already double-quoted.
Adjust three expected strings to match the new quoting logic for
attributes, which will quote strings far more often than before.
@ferdnyc
Copy link
Contributor Author

ferdnyc commented Apr 30, 2024

Hmm. It seems GitHub has started running some macos:latest jobs on Apple Silicon machines, and Python for darwin-arm64 didn't show up until 3.8.10.

Python 3.7 is actually EOL as of 2023-06-27, so we could just drop it entirely.

(We could add in 3.13 as a replacement. It won't technically be out until October, but testing on the pre-release versions isn't the worst idea, to ensure compatibility.)

@ferdnyc ferdnyc marked this pull request as draft April 30, 2024 19:26
@ferdnyc
Copy link
Contributor Author

ferdnyc commented Apr 30, 2024

Set to Draft state because I just noticed there's a full grammar for ID strings in the Graphviz docs, including regular expressions, which don't match ours. Seems like a good idea to make them match.

@ferdnyc
Copy link
Contributor Author

ferdnyc commented Apr 30, 2024

So, turns out I was mostly right. The distinction isn't between attribute values and IDs, but between IDs, and node_IDs when they're used to define nodes or edges.

All attribute values are IDs, in their grammar, but only the node_id in a node_stmt1 or on either side of an edge_stmt can have ports and compass directions. So, the magic logic we've been using to detect :-separated ID values should only apply to node IDs. For any other type of ID, colons are just another reason to double-quote.

Technically we could even double-quote node IDs, since all IDs are just strings and it makes no difference to Graphviz whether we write...

# This
strict digraph {
    a [shape=ellipse style=filled fillcolor="#1f77b4"]
    b [shape=polygon style=filled fillcolor="#ff7f0e"]
    a -> b [fillcolor="#a6cee3" color="#1f78b4"]
}

# Or this
strict digraph {
    "a" [shape="ellipse" style="filled" fillcolor="#1f77b4"]
    "b" [shape="polygon" style="filled" fillcolor="#ff7f0e"]
    "a" -> "b" [fillcolor="#a6cee3" color="#1f78b4"]
}

We just have to be careful that...

# This... (OK)
digraph {
  node1:port1 -> node2:port5:nw;
}

# Doesn't get turned into this... (WRONG)
digraph {
  "node1:port1" -> "node2:port5:nw";
}

# But it can be turned into this... (OK)
digraph {
  "node1":"port1" -> "node2":"port5":"nw";
}

Part of me almost thinks we should give the user the ability to set the port and compass_pt separately from the node name, to avoid any ambiguity about whether their colons are separators, or just colons in the name.

IOW, if a user had to define the graph above not with this:

G = pydot.Dot()
G.add_edge(pydot.Edge("node1:port1", "node2:port5:nw"))

But with something like this:

G = pydot.Dot()
node1 = pydot.Node("node1")
node2 = pydot.Node("node2")
ep1 = pydot.Endpoint(node1, port="port1")
ep1 = pydot.Endpoint(node2, port="port5", compass="nw")
G.add_edge(pydot.Edge(ep1, ep2))

Then constructing the Dot syntax would be a whole hell of a lot easier. And if we provided some method of configuring auto-port/compass parsing off for users who don't want it, they could name their nodes things like "01: Math" and "02: Science" without having to worry about them possibly being misinterpreted.

Right now, we attempt to parse out the port from a Node() name, but we don't allow those values to be set unambiguously. We even have a get_port() for Node objects, but no set_port()!

Even worse, if you set your graph up like this:

G = pydot.Dot()
node1 = pydot.Node("node1:port1")
node2 = pydot.Node("node2:port5:nw")
G.add_edge(pydot.Edge(node1, node2))

You'll get this output:

digraph G {
node1 -> node2;
}

We lose everything except the name, when a Node object is used as an Edge endpoint.

And that node2? Here's how we parse it:

>>> node2.get_port()
':port5:nw'

Whoops.

Maybe I'll work on adding set_port(), get_compass(), and set_compass() to Nodes (with fixed parsing), and an Endpoint class as an optional no-ambiguity method of specifying complex edges.

Heck, we could even provide an enum for the valid compass points, to be used like:

ep1 = pydot.Endpoint(
    node2, port="port5", compass=pydot.Compass.nw)

Notes

  1. The graphviz grammar says that node statements start with node_id values, so I'm taking it on faith that it's useful to create and output node statements like this:
    digraph {
      node2:port5:nw [label="Node2!", color="red"];
      node1:port1 -> node2:port5:nw;
    }
    But I'm not really clear on why that would ever be useful, and I don't see any actual examples of it in their docs so far. Afaict it's completely identical to this:
    digraph {
      node2 [label="Node2!", color="red"];
      node1:port1 -> node2:port5:nw;
    }
    I've posted a question to the Graphviz forum about that.

@ferdnyc
Copy link
Contributor Author

ferdnyc commented May 3, 2024

  1. I've posted a question to the Graphviz forum about that.

So, no less than Stephen North hisself has confirmed:

digraph {
# Complex Node statements like...
nodeName:portname:nw [attributes=values];

# Are meaningless, and always equivalent to...
nodeName [attributes=values];
}

He says it was a "small mistake", that the grammar was specified to make the former legal syntax. The node:port:compass form was only intended to be used for edge endpoints.

Obviously we can't just disregard the grammar entirely, but IMHO (and as I argued in that discussion), I feel this does give us the leeway to parse Node calls like:

pydot.Node("01:Math", color="red")

As if the user actually wrote:

pydot.Node("\"01:Math\"", color="red")

And we can therefore assume they expect this as the resulting .dot code:

"01:Math" [color=red];

As opposed to the original, un-double-doublequoted pydot.Node() call defining a node named 01 with a port named Math, which is how we currently interpret it.

@lkk7
Copy link
Member

lkk7 commented May 3, 2024

That's a long journey.

BTW the port definition is weird, there's another port in form of "<...>", but in a different context (https://graphviz.org/doc/info/shapes.html – section "Record-based nodes").

@ferdnyc
Copy link
Contributor Author

ferdnyc commented May 3, 2024

That's a long journey.

BTW the port definition is weird, there's another port in form of "<...>", but in a different context (https://graphviz.org/doc/info/shapes.html – section "Record-based nodes").

Yeah, that goes inside the label attribute for the destination node, it's how you provide targets for the edge-endpoint ports. Technically without those labels, any ports attached to endpoints do nothing; with labels defined, the node:port syntax can target individual cells of the record-based nodes.

Apparently that whole "thing" is now discouraged in favor of HTML-like labels anyway. (Which can also define ports, by adding a port="name" attribute to targeted HTML entities.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String attributes containing colon are not quoted
2 participants