Fix quoting logic for attribute values #339

ferdnyc · 2024-04-30T09:41:36Z

This started off as an effort to simply fix quoting for attribute values containing colons (#258), which was being hosed by the logic that recognizes ID values in the form foo:n as IDs with an optional port value.

That's what made me realize that the rules for quoting IDs and the rules for quoting attributes are totally different, and we shouldn't be trying to make them the same.

So, this PR breaks out quote_if_necessary() into two functions, quote_id_if_necessary() (which keeps the same logic as before), and quote_attr_if_necessary(), which is heavily biased towards quoting. It'll basically quote anything that isn't numeric, HTML, or already double-quoted. Simple string values like shape=box that used to be left unquoted will now be output as shape="box", because it's never wrong to do that. And it avoids a lot of problems.

I also removed the re.UNICODE flag from all the regexps, as in Python3 that's a tautology, and replaced a '"' + s + '"' with fr'"{s}"'. (Though that needn't have been a raw string, come to think of it.)

Three tests (only 3!!) had to be adjusted, so their expected values matched the new quoting rules.

Fixes #258

(Doesn't fix the other issues involving IDs, like #118)

Instead of quote_if_necessary() trying to be all things to all situations, split into quote_id_if_necessary(), which uses the same logic as before, and quote_attr_if_necessary(), which is heavily biased towards quoting, not doing so only if the value is numeric, HTML, or already double-quoted.

Adjust three expected strings to match the new quoting logic for attributes, which will quote strings far more often than before.

ferdnyc · 2024-04-30T09:49:55Z

Hmm. It seems GitHub has started running some macos:latest jobs on Apple Silicon machines, and Python for darwin-arm64 didn't show up until 3.8.10.

Python 3.7 is actually EOL as of 2023-06-27, so we could just drop it entirely.

(We could add in 3.13 as a replacement. It won't technically be out until October, but testing on the pre-release versions isn't the worst idea, to ensure compatibility.)

Also, fix some docstrings.

ferdnyc · 2024-04-30T19:27:42Z

Set to Draft state because I just noticed there's a full grammar for ID strings in the Graphviz docs, including regular expressions, which don't match ours. Seems like a good idea to make them match.

ferdnyc · 2024-04-30T23:59:22Z

So, turns out I was mostly right. The distinction isn't between attribute values and IDs, but between IDs, and node_IDs when they're used to define nodes or edges.

All attribute values are IDs, in their grammar, but only the node_id in a node_stmt¹ or on either side of an edge_stmt can have ports and compass directions. So, the magic logic we've been using to detect :-separated ID values should only apply to node IDs. For any other type of ID, colons are just another reason to double-quote.

Technically we could even double-quote node IDs, since all IDs are just strings and it makes no difference to Graphviz whether we write...

# This
strict digraph {
    a [shape=ellipse style=filled fillcolor="#1f77b4"]
    b [shape=polygon style=filled fillcolor="#ff7f0e"]
    a -> b [fillcolor="#a6cee3" color="#1f78b4"]
}

# Or this
strict digraph {
    "a" [shape="ellipse" style="filled" fillcolor="#1f77b4"]
    "b" [shape="polygon" style="filled" fillcolor="#ff7f0e"]
    "a" -> "b" [fillcolor="#a6cee3" color="#1f78b4"]
}

We just have to be careful that...

# This... (OK)
digraph {
  node1:port1 -> node2:port5:nw;
}

# Doesn't get turned into this... (WRONG)
digraph {
  "node1:port1" -> "node2:port5:nw";
}

# But it can be turned into this... (OK)
digraph {
  "node1":"port1" -> "node2":"port5":"nw";
}

Part of me almost thinks we should give the user the ability to set the port and compass_pt separately from the node name, to avoid any ambiguity about whether their colons are separators, or just colons in the name.

IOW, if a user had to define the graph above not with this:

G = pydot.Dot()
G.add_edge(pydot.Edge("node1:port1", "node2:port5:nw"))

But with something like this:

G = pydot.Dot()
node1 = pydot.Node("node1")
node2 = pydot.Node("node2")
ep1 = pydot.Endpoint(node1, port="port1")
ep1 = pydot.Endpoint(node2, port="port5", compass="nw")
G.add_edge(pydot.Edge(ep1, ep2))

Then constructing the Dot syntax would be a whole hell of a lot easier. And if we provided some method of configuring auto-port/compass parsing off for users who don't want it, they could name their nodes things like "01: Math" and "02: Science" without having to worry about them possibly being misinterpreted.

Right now, we attempt to parse out the port from a Node() name, but we don't allow those values to be set unambiguously. We even have a get_port() for Node objects, but no set_port()!

Even worse, if you set your graph up like this:

G = pydot.Dot()
node1 = pydot.Node("node1:port1")
node2 = pydot.Node("node2:port5:nw")
G.add_edge(pydot.Edge(node1, node2))

You'll get this output:

digraph G {
node1 -> node2;
}

We lose everything except the name, when a Node object is used as an Edge endpoint.

And that node2? Here's how we parse it:

>>> node2.get_port()
':port5:nw'

Whoops.

Maybe I'll work on adding set_port(), get_compass(), and set_compass() to Nodes (with fixed parsing), and an Endpoint class as an optional no-ambiguity method of specifying complex edges.

Heck, we could even provide an enum for the valid compass points, to be used like:

ep1 = pydot.Endpoint(
    node2, port="port5", compass=pydot.Compass.nw)

Notes

The graphviz grammar says that node statements start with node_id values, so I'm taking it on faith that it's useful to create and output node statements like this:
```
digraph {
  node2:port5:nw [label="Node2!", color="red"];
  node1:port1 -> node2:port5:nw;
}
```
But I'm not really clear on why that would ever be useful, and I don't see any actual examples of it in their docs so far. Afaict it's completely identical to this:
```
digraph {
  node2 [label="Node2!", color="red"];
  node1:port1 -> node2:port5:nw;
}
```
I've posted a question to the Graphviz forum about that.

ferdnyc · 2024-05-03T03:23:32Z

I've posted a question to the Graphviz forum about that.

So, no less than Stephen North hisself has confirmed:

digraph {
# Complex Node statements like...
nodeName:portname:nw [attributes=values];

# Are meaningless, and always equivalent to...
nodeName [attributes=values];
}

He says it was a "small mistake", that the grammar was specified to make the former legal syntax. The node:port:compass form was only intended to be used for edge endpoints.

Obviously we can't just disregard the grammar entirely, but IMHO (and as I argued in that discussion), I feel this does give us the leeway to parse Node calls like:

pydot.Node("01:Math", color="red")

As if the user actually wrote:

pydot.Node("\"01:Math\"", color="red")

And we can therefore assume they expect this as the resulting .dot code:

"01:Math" [color=red];

As opposed to the original, un-double-doublequoted pydot.Node() call defining a node named 01 with a port named Math, which is how we currently interpret it.

lkk7 · 2024-05-03T08:56:08Z

That's a long journey.

BTW the port definition is weird, there's another port in form of "<...>", but in a different context (https://graphviz.org/doc/info/shapes.html – section "Record-based nodes").

ferdnyc · 2024-05-03T21:52:16Z

That's a long journey.

BTW the port definition is weird, there's another port in form of "<...>", but in a different context (https://graphviz.org/doc/info/shapes.html – section "Record-based nodes").

Yeah, that goes inside the label attribute for the destination node, it's how you provide targets for the edge-endpoint ports. Technically without those labels, any ports attached to endpoints do nothing; with labels defined, the node:port syntax can target individual cells of the record-based nodes.

Apparently that whole "thing" is now discouraged in favor of HTML-like labels anyway. (Which can also define ports, by adding a port="name" attribute to targeted HTML entities.)

ferdnyc added 3 commits April 30, 2024 05:23

Tests: Adjust for new quoting logic

f45c94e

Adjust three expected strings to match the new quoting logic for attributes, which will quote strings far more often than before.

Apply black fixes

8ad4ced

Quoting: Ensure boolean values are lowercased

cb4c19a

Also, fix some docstrings.

ferdnyc marked this pull request as draft April 30, 2024 19:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix quoting logic for attribute values #339

Fix quoting logic for attribute values #339

ferdnyc commented Apr 30, 2024

ferdnyc commented Apr 30, 2024 •

edited

ferdnyc commented Apr 30, 2024

ferdnyc commented Apr 30, 2024

ferdnyc commented May 3, 2024 •

edited

lkk7 commented May 3, 2024

ferdnyc commented May 3, 2024 •

edited

Fix quoting logic for attribute values #339

Are you sure you want to change the base?

Fix quoting logic for attribute values #339

Conversation

ferdnyc commented Apr 30, 2024

ferdnyc commented Apr 30, 2024 • edited

ferdnyc commented Apr 30, 2024

ferdnyc commented Apr 30, 2024

Notes

ferdnyc commented May 3, 2024 • edited

lkk7 commented May 3, 2024

ferdnyc commented May 3, 2024 • edited

ferdnyc commented Apr 30, 2024 •

edited

ferdnyc commented May 3, 2024 •

edited

ferdnyc commented May 3, 2024 •

edited