Fix #312, #317, Response*Visitor rewrite #406

Thinkscape
Contributor

This fixes #312 and allows for traversal of XML documents including nested attributes and namespaces.

  • Remove implicit conversion of SimpleXMLElement to array
  • Add traversal of XML document in XMLVisitor.
  • Investigate ways of working around OperationResponseParser::visitAdditionalProperties() limitations
  • Add XML namespace support traversal.
  • Add combination tests for OperationResponseParser
  • Add complex nested XML structures' tests with attributes and various additionalProperties stances.
  • Protect xml and json structures in Visitors
  • Optimize XmlVisitor::xmlToArray()
  • Incorporate support for top-level arrays and model location definition.
  • Fix Response\HeaderVisitor with additionalProperties

@Thinkscape
Contributor Author

Mike, I'm having trouble understanding some decisions in OperationResponseParser. For some reason, the responsibility of scanning additional properties is transferred to the parser, while it should be a responsibility of a Visitor.

For example, in the revised XmlVisitor the additionalProperties Parameter option is now scanned on each traversal deeper into the structure - because each nested Schema might define its own value (true/false/Parameter).
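To illustrate the per-level lookup described above, here is a minimal sketch on plain PHP arrays. The function name `filterBySchema`, the schema layout, and the default of keeping unknown keys are assumptions made for this example; this is not the actual XmlVisitor code (PHP 7.1+ syntax).

```php
<?php
// Hypothetical helper, not Guzzle's API: filters $data against a schema array,
// consulting 'additionalProperties' again at every nesting level.
function filterBySchema(array $schema, array $data): array
{
    $properties = $schema['properties'] ?? [];
    // Assumption for this sketch: unknown keys are kept when the option is absent.
    $additional = $schema['additionalProperties'] ?? true;
    $result = [];

    foreach ($data as $key => $value) {
        if (isset($properties[$key])) {
            $child = $properties[$key];
            // Known property: recurse so the nested schema applies its own setting.
            $result[$key] = is_array($value) && ($child['type'] ?? '') === 'object'
                ? filterBySchema($child, $value)
                : $value;
        } elseif ($additional === true) {
            // additionalProperties => true: keep unknown keys as they are.
            $result[$key] = $value;
        } elseif (is_array($additional)) {
            // additionalProperties given as a schema: apply it to unknown keys.
            $result[$key] = is_array($value) ? filterBySchema($additional, $value) : $value;
        }
        // additionalProperties => false: unknown keys are dropped.
    }

    return $result;
}

$schema = [
    'type' => 'object',
    'additionalProperties' => false,
    'properties' => [
        'user' => [
            'type' => 'object',
            'additionalProperties' => true,
            'properties' => ['id' => ['type' => 'integer']],
        ],
    ],
];
$data = ['user' => ['id' => 1, 'nick' => 'x'], 'debug' => 'dropped'];
$filtered = filterBySchema($schema, $data);
```

Here the top level drops `debug` while the nested `user` schema keeps `nick`, because each level reads its own `additionalProperties` value.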

Currently the ::visitAdditionalProperties() method is responsible for poking visitors and forcing them to visit individual "members of array" that would come from the ::before() method. There are a lot of assumptions there and a lot of "thinking" inside the Parser which is duplicated by Visitors.

In the newest XmlVisitor no array is produced by ::before() so those assumptions will fail - however the visitor is perfectly capable of handling additionalProperties with schemas or otherwise. I believe the best direction here would be to deprecate ::visitAdditionalProperties() and leave this kind of processing to Visitors.

ps: ::vAP() seems to use the "topmost" location value, however it still loops through all visitors, which is kind of weird because such mixture of formats should not really happen :\ I believe it's for pluggability however it might cause some clashing (if topmost location does not match and conflicts with additionalParameters/location).

@mtdowling
Member

I agree that it looks like there's a lot of code duplication. I'm open to changes as long as they aren't BC changes.

The OperationResponseParser does, however, need to know which visitors were triggered in order to call the appropriate before and after methods of each visitor (regardless of if it's a no-op or not).

> ps: ::vAP() seems to use the "topmost" location value

Yeah, locations are only relevant for top-level properties.

> however it still loops through all visitors, which is kind of weird because such mixture of formats should not really happen :\ I believe it's for pluggability however it might cause some clashing (if topmost location does not match and conflicts with additionalParameters/location).

With the current setup, the before() method is expected to create an array response based on each found location. Not all locations do this sort of thing, it's only really found in parts of an entity body (e.g., json, xml). If this were changed (as you are proposing), then this would no longer make sense. However, we will always need to trigger before and after methods for each visited visitor before we actually visit anything. As long as that still happens, then changing things around is fine.

Note that you can have defined parameters that all have a location different than the location of an additionalProperties parameter (real parameters with a location of header, while AP has a location of json).

- prevent visitors from resetting the resulting array (allowing for custom visitors performing different operations on resulting structures).
- make visitor processing non-conflicting with other visitors (locations).
- make JSON and XML visitors process their respective structures internally, picking out necessary properties.
- move processing of additionalProperties out of OperationResponseParser and into visitors keeping BC.
@Thinkscape
Contributor Author

Ok, I'm almost finished. I'll add NS support for XML... I'll also add some more complex combination tests and XML tests. I'll also work on coverage because I've found there is a lot of functionality that is described in docs but not tested at all in unit tests. I'm off for tonight. Cheers!

/**
* @param array $json
*/
public function setJson(array $json)
Member

Does setJson and getJson need to be public?

Contributor Author

Yeah, they are used for unit tests and can be handy for add-ons.

For example, you might want to have a visitor that re-visits JSON (or XML) to fetch some additional data off of them. getXml() and getJson() allow such visitors to exist and work on the original document.

Member

I see what you mean, but I'd rather the internals of the visitor not be exposed (e.g. SimpleXMLElement) in case the implementation changes (e.g. moving to XmlReader or something).

Contributor Author

n/p.

Subclass for tests ?

Member

Subclassing sounds fine. You could also try PHPUnit's $this->readAttribute() method.

Contributor Author

Unfortunately there's no writeAttribute() so subclassing it'll have to be (to keep SOC in tests)

Contributor Author

Done.
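For reference, the subclass-for-tests approach agreed on in this review thread could look roughly like the sketch below. Both class names are hypothetical stand-ins and do not reflect the real visitor classes; the point is only that the production class keeps its document protected while a test-only subclass widens access (PHP 7.1+ syntax).

```php
<?php
// Hypothetical stand-in for a visitor that keeps its parsed document protected.
class ExampleJsonVisitor
{
    /** @var array Parsed JSON document, hidden from outside callers. */
    protected $json = [];

    /** Returns the value stored under $key, or null when it is missing. */
    public function visit(string $key)
    {
        return $this->json[$key] ?? null;
    }
}

// Test-only subclass: exposes the protected state without touching the
// production class, so the internal representation can still change later.
class TestableJsonVisitor extends ExampleJsonVisitor
{
    public function setJson(array $json): void
    {
        $this->json = $json;
    }

    public function getJson(): array
    {
        return $this->json;
    }
}

$visitor = new TestableJsonVisitor();
$visitor->setJson(['foo' => 1]);
```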

@Thinkscape
Contributor Author

I've been testing this branch on some real-life data and it seems that my current implementation of XmlVisitor::xmlToArray() is sturdy but quite expensive on large XML structures (10,000+ nodes). I'll consider replacing it with something speedier.

- universal support for visiting top-level array in service description.
- support for retrieving top-level array as unnamed or named model property.
- support for 'location' definition within 'items'.
- support for 'location' definition on model itself.
- drop the requirement of using array( 'jsonFlatten' => 1).
@Thinkscape
Contributor Author

I was able to rewrite OperationResponseParser which fixes #317 in an elegant way. location is now supported in items and in model itself. A type=>"array" is now recognised on the top-level model and will behave accordingly.

  • In case a name is provided (i.e. name => "foo") the flat array will be read and put under $model->foo.
  • If a name is not provided, the flat array will be read and put for direct traversal.

@Thinkscape
Contributor Author

@mtdowling I've got one thing to crack, usability-wise.

Because I've dropped the requirement array('jsonFlatten'=>true) I have reached a small caveat.

  • In case type=>"array" and sentAs is not provided (empty), JsonVisitor infers that we're parsing a list (numerical array)
  • In case type=>"array" and sentAs is provided, JsonVisitor looks for the list (numeric array) inside $json[$sentAs]

Make sense, right ?

The case when this doesn't work is when sentAs is not provided AND name is provided. This will happen when using it the old way - so we have a model of type object, properties that use different visitors, one of them is Json and the property name goes inside Parameter::$name. Now when the visitor tries to get sentAs, the getWireName() function infers sentAs from name. This will break the above behavior and JsonVisitor will attempt to look for the flat-array inside $json['something'] as opposed to reading it as top-level array.

Here's a workaround I'm currently using.. I couldn't use false because of the way getWireName() works. I'd go bananas if I had to use "magic" value, so I decided to try with true. A true in this case means that the value is sent but does not have a property name, hence it should be retrieved as a list.

Phew. I hope you're still with me.

I don't know how I could make this easier. I really like dropping that jsonFlatten weirdness and it just-makes-sense™ now - look at how cute the top-level-array model is now.

Feedback much appreciated ;-)

@mtdowling
Member

You've been busy!

I liked jsonFlatten because it was analogous to xmlFlatten. IMO, having to set "sentAs" => true to grab a top-level array and place it under a key is magical too.

@Thinkscape
Contributor Author

Have you seen those example models in test cases ?

@mtdowling
Member

I did look at the sentAs=>true examples. I don't see any jsonFlatten examples.

@Thinkscape
Contributor Author

Consider the following two examples.
Both work with the same incoming json:

[{"foo":1},{"foo":2},{"foo":3}]
// this one is an "array" schema with JSON location that will
// read the above JSON into a model that can be iterated
// with foreach($model as $item) { ... } 
'models'     => array(
    'Foo' => array(
        'type'     => 'array',
        'location' => 'json',
        'items'    => array(
            'type' => 'object'
        )
    )
)
// this one is an "object" schema which uses 2 different 
// locations. It will read the above JSON into a
// model under "result", so one can iterate via
//  foreach($model['result'] as $item) { ... } 
'models'     => array(
    'Foo' => array(
        'type' => 'object',
        'properties' => array(
            'result' => array(
                'type'     => 'array',
                'location' => 'json',
                'items'    => array(
                    'type' => 'object'
                )
            ),
            'code' => array(
                'location' => 'statusCode'
            )
        )
    )
)

@Thinkscape
Contributor Author

The problem is: in example 2, how does the user tell that the json array (put inside $model['result']) is supposed to be an unnamed array ?

@mtdowling
Member

Right. So the first example works, which is great. For the second example, I'm suggesting we go back to adding a jsonFlattened data attribute to pull the root level array of JSON data into a named parameter.

Alternatively, what if we just don't support example #2 with the json location, and force people to use the jsonPath location? That is starting to make more sense to me. I'm also starting to think that instead of calling it jsonPath, it should be jmesPath.

@Thinkscape
Contributor Author

Hmm. A few questions then.

If example #2 required a path, what would the path be ? / , * or something else ?
jmesPath name would be more puzzling to users than jsonPath, however it'd help with adoption of the term ;-) (it's still a mouthful, looks like "James" but it's not... and why James anyways? )

@mtdowling
Member

The second example would be [*] to pull out a wildcard index result (see: https://github.com/boto/jmespath/tree/develop/tests/compliance).

We're still finalizing the last bit of the grammar and I'm still working on the PHP implementation (with a lexer, parser, and AST). It should be released in the next few weeks.

> jmesPath name would be more puzzling to users than jsonPath, however it'd help with adoption of the term ;-) (it's still a mouthful, looks like "James" but it's not... and why James anyways? )

I agree, but I also wouldn't want to confuse people into thinking that the visitor accepts jsonPath expressions (JmesPath is similar to jsonPath, but not 1:1).

> it's still a mouthful, looks like "James" but it's not... and why James anyways?

I know, and I agree with you :). I don't have a good explanation for the name.

@Thinkscape
Contributor Author

Ok, here's what I propose: because of the limitations as described in example 2, I will replace true with * (single asterisk). This will hint the Visitor, that the user is expecting to flatten the incoming result and still shove it into a property with some name => "...".

This is 100% BC, does not mess with example 1 (being the simplest case) and is easy to explain.

We will be then able to merge this one and start work on xPath and JmesPath locations (with an optional jsonPath using this implementation or our own).

Final example no. 2:

// this one is an "object" schema which uses 2 different 
// locations. It will read the above JSON into a
// model under "result", so one can iterate via
//  foreach($model['result'] as $item) { ... } 
'models'     => array(
    'Foo' => array(
        'type' => 'object',
        'properties' => array(
            'result' => array(
                'type'     => 'array',
                'location' => 'json',
                'sentAs'   => '*',
                'items'    => array(
                    'type' => 'object'
                )
            ),
            'code' => array(
                'location' => 'statusCode'
            )
        )
    )
)
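The lookup rule proposed here can be sketched as a small standalone function. `extractJsonValue` and its parameter layout are hypothetical and only mirror the behavior discussed in this thread (sentAs falls back to name, and `*` means "take the whole document"); it is not the actual JsonVisitor implementation.

```php
<?php
// Hypothetical helper illustrating the proposed rule; not JsonVisitor code.
function extractJsonValue(array $json, array $param)
{
    // Mirrors the fallback described in the thread: with no explicit sentAs,
    // the wire name is inferred from the parameter name.
    $sentAs = $param['sentAs'] ?? ($param['name'] ?? null);

    // '*' (or no name at all) means: use the top-level document itself.
    if ($sentAs === '*' || $sentAs === null) {
        return $json;
    }

    // Otherwise look for the value under the wire name.
    return $json[$sentAs] ?? null;
}

$json = json_decode('[{"foo":1},{"foo":2},{"foo":3}]', true);

// Example 1: unnamed top-level array model.
$topLevel = extractJsonValue($json, ['type' => 'array']);
// Example 2 with the '*' hint: whole document goes under the "result" property.
$flattened = extractJsonValue($json, ['name' => 'result', 'sentAs' => '*']);
// Example 2 without the hint: the lookup goes to $json['result'] and finds nothing.
$missing = extractJsonValue($json, ['name' => 'result']);
```

The last call shows the caveat from the discussion: once a name is present and no hint is given, the list is looked up under that name and is not read from the top level.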

@mtdowling
Member

This looks really great! I've gone through and optimized / cleaned up a few things. Can you take a look at 99d3508?

  1. I split out the large parts of OperationResponseParser::visitResult() into multiple methods for readability
  2. Only check for AP when the outer value is an object
  3. Optimized HeaderVisitor
  4. Optimized JsonVisitor. Not tracking knownProperties, but rather unsetting known properties from the input array as they are parsed. The leftover array is then used as the AP array which is merged in if appropriate.
  5. JsonVisitor merging is handled using array unions (+) rather than array_merge. I believe a union is more appropriate when merging associative arrays and it ensures that two values of the same name do not create an array.
  6. Optimized / tweaked some of the XmlVisitor.
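Points 4 and 5 can be illustrated with plain arrays. This is a minimal sketch with made-up keys and values, not the code from the referenced commit; it shows the unset-as-you-go pattern and how the `+` union differs from `array_merge_recursive` when both sides share a key.

```php
<?php
// Made-up input document and list of properties the schema knows about.
$json  = ['id' => 7, 'name' => 'queue', 'extra' => 'x', 'more' => 'y'];
$known = ['id', 'name'];

$result = [];
foreach ($known as $key) {
    if (array_key_exists($key, $json)) {
        $result[$key] = $json[$key];
        // Remove the key once parsed; whatever remains is "additional".
        unset($json[$key]);
    }
}

// $json now holds only the leftover (additional) properties.
$leftover = $json;

// Union: keys already present on the left side are never overwritten.
$merged = $result + $leftover;

// Contrast on a shared key: the union keeps the left value, while
// array_merge_recursive collects both values into a nested array.
$union     = ['a' => 1] + ['a' => 2];
$recursive = array_merge_recursive(['a' => 1], ['a' => 2]);
```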

I have one question though: Why are you setting and checking if $context is true in the JsonVisitor? I noticed true is passed into the $context argument when traversing additional properties and array items in OperationResponseParser. I removed those settings and checks, and the tests still passed. Is it ok to remove? If so, I can do it in my branch and merge into master.

@mtdowling
Member

Looks like the rewritten XmlVisitor is not performing the same parsing as the XmlVisitor in master. I'm not sure yet why, but almost all of the xml location tests in the AWS SDK for PHP fail using this branch. For example, https://github.com/aws/aws-sdk-php/blob/master/src/Aws/Sqs/Resources/sqs-2012-11-05.php#L867 does not parse correctly. Here's some example XML: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryGetQueueAttributes.html

Note: the xmlFlattened attribute doesn't actually do anything (I should probably cleanup our model generation scripts to scrub that attribute).

I believe this is the issue: https://github.com/guzzle/guzzle/blob/response-visitors/src/Guzzle/Service/Command/LocationVisitor/Response/XmlVisitor.php#L96. If the type is array and there is only one value, then the node will have no children. In this case (and possibly others?) you really want to iterate over the node, not the node's children.

@Thinkscape
Contributor Author

> I have one question though: Why are you setting and checking if $context is true in the JsonVisitor? I noticed true is passed into the $context argument when traversing additional properties and array items in OperationResponseParser. I removed those settings and checks, and the tests still passed. Is it ok to remove? If so, I can do it in my branch and merge into master.

yes. It was branching because of the difference on how name works on the top level and on >=2 level. $context=true meant that it's a top-level operation.

> If the type is array and there is only one value, then the node will have no children. In this case (and possibly others?) you really want to iterate over the node, not the node's children.

Isn't that exactly the same thing ?
This code fragment is referring to sentAs of items - which means, the user has explicitly specified, that items (xml child nodes) have a node name of "whatever". If there are no children - there's no iteration. If there's a single child node, it gets iterated. Not sure what you're getting at.

> I'm not sure yet why, but almost all of the xml location tests in the AWS SDK for PHP fail using this branch.

I believe I know the answer.
The behavior that has changed a lot and has not been tested much in Unit Tests (which is why I bragged about it) is the xml reading and conversion.

With the "magical" xml to array conversion through json, on the top level, attributes were shoved under @attributes and discarded on deeper levels. The new implementation, xmlToArray() to be precise, is merging attributes with the node itself. In case there are both xml attributes and a text node (AKA node value) then the value is stored under value. I believe that some of the AWS code relies on additionalAttributes and some implicit data structures that were created during the previous conversion process.
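The difference between the two conversions can be reproduced with a tiny document. In the sketch below, the first part uses the common `json_encode`/`json_decode` round trip on a SimpleXMLElement, which is assumed here to be the "magical" conversion referred to above; `nodeToArray` is a hypothetical simplification of the merging approach and is not the real xmlToArray() (for brevity, repeated child names simply overwrite each other).

```php
<?php
$xml = simplexml_load_string('<root id="1"><item code="a">text</item></root>');

// Round trip through JSON: top-level attributes land under '@attributes',
// while the attribute on the text-only <item> node is lost.
$viaJson = json_decode(json_encode($xml), true);

// Hypothetical simplified merge: attributes become regular keys and the
// text of a node that also has attributes is stored under 'value'.
function nodeToArray(SimpleXMLElement $node)
{
    $out = [];
    foreach ($node->attributes() as $name => $value) {
        $out[$name] = (string) $value;
    }
    foreach ($node->children() as $name => $child) {
        // Simplification: a repeated child name overwrites the previous one.
        $out[$name] = nodeToArray($child);
    }

    $text = trim((string) $node);
    if ($out === []) {
        return $text; // plain text node without attributes or children
    }
    if ($text !== '') {
        $out['value'] = $text;
    }

    return $out;
}

$merged = nodeToArray($xml);
```

Code written against the first shape (looking for `@attributes`, or expecting `item` to be a plain string) will not find what it expects in the second shape, which is one way the behavior change could surface as failing tests.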

I'll try adding a test case based on what you sent me.

Although it seems that you have been cherry-picking commits from this PR, which means I don't have your recent changes and we're "forking apart" from each other... this can lead to strange things.

@mtdowling
Member

> yes. It was branching because of the difference on how name works on the top level and on >=2 level. $context=true meant that it's a top-level operation

Hm ok. I wonder if $context should become an associative array so that in the future we could possibly use it for other things.

> Isn't that exactly the same thing?

It should be, but I think the visiting is recursing incorrectly. When I changed that particular line from iterating children to iterating the node itself, almost everything started working.

> With the "magical" xml to array conversion through json, on the top level, attributes were shoved under @attributes and discarded on deeper levels. The new implementation, xmlToArray() to be precise, is merging attributes with the node itself. In case there are both xml attributes and a text node (AKA node value) then the value is stored under value. I believe that some of the AWS code relies on additionalAttributes and some implicit data structures that were created during the previous conversion process.

Maybe we should emulate the old implementation of placing attributes in @attributes to prevent a breaking change? I don't think many AWS responses use attributes though. I believe most of the problems are coming from iterating arrays with the XmlVisitor.

> Although it seems that you have been cherry-picking commits from this PR, which means I don't have your recent changes and we're "forking apart" from each other... this can lead to strange things.

I merged your entire branch onto master and then made a commit. Can you not use that?

@Thinkscape
Contributor Author

> It should be, but I think the visiting is recursing incorrectly. When I changed that particular line from iterating children to iterating the node itself, almost everything started working.

I still don't get that. Throw some code at me.

@Thinkscape
Contributor Author

> I merged your entire branch onto master and then made a commit. Can you not use that?

Ah, so you skipped github... that's why it's still open and there's no trace of it. Ok.

@mtdowling
Member

> I still don't get that. Throw some code at me.

Ok. If you get rid of the $sentAs switch on line 94 and just use the foreach on line 102 (https://github.com/guzzle/guzzle/blob/response-visitors/src/Guzzle/Service/Command/LocationVisitor/Response/XmlVisitor.php#L102) then everything works:

foreach ($node as $child) {
    $result[] = $this->recursiveProcess($items, $child);
}

> Ah, so you skipped github... that's why it's still open and there's no trace of it. Ok.

Is there another way I can get the code to you that would be better? Should I fork your repo and submit a PR to your branch? Let me know what works best for you.

@Thinkscape
Contributor Author

> Ok. If you get rid of the $sentAs switch on line 94 and just use the foreach on line 102

Line 94... change:

if ($sentAs) {

to:

if ($items->getSentAs()) {

> Is there another way I can get the code to you that would be better? Should I fork your repo and submit a PR to your branch? Let me know what works best for you.

When working with Zend Framework we tend to always follow the basic PR flow. In case there's a need to augment a PR, contributors send PRs against the branch of the original author, which he quickly merges. After the whole batch of work is done, the PR is merged as a whole.

@mtdowling
Member

Ok. I'll send a PR to your branch.

@Thinkscape
Contributor Author

> Maybe we should emulate the old implementation of placing attributes in @attributes to prevent a breaking change? I don't think many AWS responses use attributes though. I believe most of the problems are coming from iterating arrays with the XmlVisitor.

I was thinking about that. We could do that, although merging allows for much "cleaner" means of working with attributes. It's easy to iterate over them, and attribute-heavy xml documents are much easier to work with.

But there's the BC problem I acknowledge :-\ Your call...

@mtdowling
Member

Let's go with adding attributes to @attributes to lessen the potential for breaking users. It is slightly less desirable, but perhaps more explicit that a value came from an attribute.

@Thinkscape
Contributor Author

I'll merge with current master (hope it doesn't explode) and do the work.

@Thinkscape
Contributor Author

btw: did my fix work for you ?

@mtdowling
Member

That appears to have fixed some issues, but many others still remain. For example, it looks like top-level additionalProperties aren't being picked up for some reason.

@Thinkscape
Contributor Author

Mike, I'd like to have this merged as it fixes a lot of bugs. Tests pass.
Can you elaborate on what doesn't work for you ?

@Thinkscape
Contributor Author

Ok, I'm going through failed tests on aws-php-sdk.

@Thinkscape
Contributor Author

Ok, I know where the problem lies. Here's a result description from the AWS SDK (excerpt):

 'DeleteObjectsOutput' => array(
            'type' => 'object',
            'additionalProperties' => true,
            'properties' => array(
                'Deleted' => array(
                    'type' => 'array',
                    'location' => 'xml',
                    'data' => array(
                        'xmlFlattened' => true,
                    ),
                    'items' => array(
                        'type' => 'object',
                        'properties' => array(
                            'Key' => array(
                                'type' => 'string',
                            ),
                            'VersionId' => array(
                                'type' => 'string',
                            ),
                            'DeleteMarker' => array(
                                'type' => 'boolean',
                            ),
                            'DeleteMarkerVersionId' => array(
                                'type' => 'string',
                            ),
                        ),
                    ),
                ),
// [...]

Now that tells me, that we're expecting an xml document with Deleted node, which contains 0 or more child nodes (type=array) of any name, with child-nodes as defined in items.

Here's the document that we're testing against:

<DeleteResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Deleted>
    <Key>sample1.txt</Key>
  </Deleted>
  <Error>
    <Key>sample2.txt</Key>
    <Code>AccessDenied</Code>
    <Message>Access Denied</Message>
  </Error>
</DeleteResult>

This document does not fit the description - <Deleted> is clearly not an array, but more of an object (a single object containing property <Key>). Because we've set type=array the visitor is transforming this object into an array.

That's probably why you've been using the xmlFlatten gizmo everywhere to silently ignore this scenario and march on.

How would you like to approach this ?

@Thinkscape
Contributor Author

Here's another problem I've found with aws definitions:

'Errors' => array(
    'type' => 'array',
    'location' => 'xml',
    'sentAs' => 'Error',
    'data' => array(
        'xmlFlattened' => true,
    ),
    'items' => array(
        'type' => 'object',
        'sentAs' => 'Error',
        'properties' => array(
            'Key' => array(
                'type' => 'string',
            ),
            'VersionId' => array(
                'type' => 'string',
            ),
            'Code' => array(
                'type' => 'string',
            ),
            'Message' => array(
                'type' => 'string',
            ),
        ),
    ),
),

Given the result XML mentioned above, the intention is to have Errors variable in the resulting model which will always be a list (numeric array) of objects, obtained by reading <Error></Error> node. The problem with this description is, it contains sentAs twice - first on the main parameter, then inside items. After removing the second "sentAs" the test succeeds and we're ending up with an array of objects (as expected).

I'm not sure what to make of it. The whole visitor thing is already quite foggy for me, as there are a lot of inconsistencies (i.e. same params meaning something different depending on location/nesting) and code duplication in each visitor. I'm becoming quite confused and annoyed by the whole thing.

My main motivation is to-make-it-work™ - and by that I mean response parser and visitors were not able to parse simple remote api's (xml and Json) that I have to use for my real-world™ app - because of issues that are listed on the top of this ticket. But now, deeper into the rabbit hole, I'm having trouble getting out :-) I'm tempted to either dump the project altogether and end up using hand-crafted transformers and hydrators with a decent http lib, or work through the above, or rewrite the whole bloody visitor/processing mess.

Any friendly advice greatly appreciated.

@mtdowling
Member

I want to get this merged as well. As you can see though, there are a lot of changes going on, so I don't want to rush it. I apologize for not being able to spend that much time on it.

The first example you are referencing is how we handle repeated keys under the same container. In this case, the "DeletedContainer" would be repeated underneath its parent node. However, we want to represent the "Deleted" nodes in an array the same way if there is one value or many values. Consider this example:

<DeleteResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Deleted>
    <Key>sample1.txt</Key>
  </Deleted>
  <Deleted>
    <Key>sample2.txt</Key>
  </Deleted>
  <!-- ... -->
</DeleteResult>

When processed, the containing result will have a "Deleted" key that has an array of associative arrays containing each parsed "Deleted" XML node.
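A minimal sketch of that behavior with SimpleXML follows. The helper `collectNodes` is hypothetical and is not the XmlVisitor code, and the namespace declaration from the sample document is left out to keep the example short. Iterating a named child yields every sibling with that name, so one node and many nodes both end up as a list.

```php
<?php
// Hypothetical helper: collects all children called $name into a list of
// associative arrays, whether there are zero, one, or many of them.
function collectNodes(SimpleXMLElement $parent, string $name): array
{
    $items = [];
    // Iterating $parent->{$name} walks every sibling element with that name.
    foreach ($parent->{$name} as $node) {
        $item = [];
        foreach ($node->children() as $childName => $child) {
            $item[$childName] = (string) $child;
        }
        $items[] = $item;
    }

    return $items;
}

$one = simplexml_load_string(
    '<DeleteResult><Deleted><Key>sample1.txt</Key></Deleted></DeleteResult>'
);
$two = simplexml_load_string(
    '<DeleteResult>'
    . '<Deleted><Key>sample1.txt</Key></Deleted>'
    . '<Deleted><Key>sample2.txt</Key></Deleted>'
    . '</DeleteResult>'
);

$single = collectNodes($one, 'Deleted');
$double = collectNodes($two, 'Deleted');
```

Both results are lists, so calling code can always use `foreach` without checking how many nodes the response contained.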

I'm not sure about the second example. It looks like there's an error with how we are generating the description, and only one of those sentAs attributes is required. It looks like you've deduced that the "sentAs" on the item is redundant. This seems like a bug in the description more than an issue with the current implementation. I'll be able to look into this early next week.

> I'm tempted to either dump the project altogether and end up using hand-crafted transformers and hydrators with a decent http lib

If you choose to go this route, Guzzle is an HTTP library too (the best IMO).

> or work through the above

I like this option :)

> or rewrite the whole bloody visitor/processing mess

You basically have done that at this point. Let's finish this off. I'd definitely like to hear your thoughts on how it could be improved though even after your refactoring. We could chat about it on IRC or something whenever you'd like.

@Thinkscape
Contributor Author

Let's tackle the first one.

> When processed, the containing result will have a "Deleted" key that has an array of associative arrays containing each parsed "Deleted" XML node.

So there's a duality, branching within the parsing logic. Basically:

  • In case there's a single <Deleted/> node, create an object
  • In case there are 2 or more <Deleted/> nodes, create an array of objects.

The thing is, that might or might not be specific to xml.


Here's an example in JSON:

Specimen 1

{
    "result": {
        "ok" : true
    }
}

In order to parse it correctly, we have to set result->type = "object", right? Makes sense, we'd have properties to understand the ok key.

Specimen 2

{
    "result": [{
        "ok" : true
    },{
        "ok" : "somewhat"
    }]
}

The above does not fit the description of an object, so it would not be processed - i.e. empty/null. In order to process it, we'd have to have an array with items set as type="object" with ok property.


Following the logic from aws and <Deleted>, we should be able to have a similar "logic" inside json visitor:

  • If result is an object, return a single result with the contents of this object
  • If result is an array of objects, process it as array of objects.

The problem I have with that is it's a slippery slope. It's "wiggly", "relaxed specification" or rather recommendation. Note how this behavior (pre-PR) is not symmetrical - xml visitor does this "magic", while json visitor won't.

While I think about it, what would make sense to me is parameter alternatives. I'm not sure yet how, but it would make sense to define the parameter as being dual in nature - either an array (list) or a single item.

Unless I got this all wrong... what do you think ?

@Thinkscape
Contributor Author

btw: problem with current xml visitor duality - you can't rely on the resulting model... i.e.

echo $model->deleted->key; // <-- error in case there are 2 errors
foreach($model->deleted as $del){
    echo $del->key; // <-- error in case there is only 1 error
}

My gut tells me (model) specifications and api schemas are there for a reason. I'm either expecting an array or an object and the rest of the code follows the spec - nothing unexpected can happen with that.
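To make the cost of such a dual shape concrete, this is the kind of guard calling code would need if a value could arrive either as a single object or as a list. `asList` is a hypothetical helper written for this illustration and is not part of Guzzle.

```php
<?php
// Hypothetical helper: returns lists unchanged and wraps a single
// associative array into a one-element list.
function asList(array $value): array
{
    if ($value === []) {
        return [];
    }
    // A list has consecutive integer keys starting at 0.
    $isList = array_keys($value) === range(0, count($value) - 1);

    return $isList ? $value : [$value];
}

$single = asList(['key' => 'sample1.txt']);
$many   = asList([['key' => 'sample1.txt'], ['key' => 'sample2.txt']]);
```

With a schema that always yields a list, this guard is unnecessary, which is the consistency argument made above.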

@mtdowling
Member

Parsing XML into a JSON-like structure (PHP arrays) requires more configuration options than parsing JSON into a PHP array.

For example, with this XML, parsing should give the same structure regardless of whether there is one node or multiple nodes. We have flattened XML attributes because this is something that can happen in XML responses. It can't happen in JSON, so there's no need to have any JSON flattening. This is because JSON has arrays and objects, while XML can have many permutations: repeated nodes with the same name and no container, the same with a container, attributes, etc.

The whole goal of parsing responses is to represent the data received over the wire in a PHP assoc array that will provide consistent results of translating the over the wire format to the array. This is more difficult with XML, hence the extra data attributes.

I don't consider this magic in the XML visitor, but simply a necessity of parsing XML into an associative array.

On Oct 17, 2013, at 1:04 PM, Artur Bodera notifications@github.com wrote:


@Thinkscape
Contributor Author

That's not entirely true. It's a form of a processing pipe. You push XML/JSON from one side, a model comes out of the other side.

Input formats differ in semantics, grammar and capabilities. Key differences here are:

  1. XML has attributes and namespaces, json doesn't.
  2. JSON has arrays and objects, XML doesn't (everything's a node)

The first difference is solvable by paths and scoping (this PR adds support for namespace scoping).
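As a rough illustration of what namespace scoping means at the SimpleXML level (this is not the PR's actual XmlVisitor code), the sketch below reads namespaced and un-namespaced attributes and children from the same node. The document, the `urn:example:meta` namespace URI and all element names are made up for the example.

```php
<?php
// Illustrative document; the namespace URI and element names are made up.
$xml = <<<XML
<response xmlns:m="urn:example:meta">
  <item m:id="42" status="ok">
    <m:note>scoped</m:note>
    <note>plain</note>
  </item>
</response>
XML;

$doc  = simplexml_load_string($xml);
$item = $doc->item;

// Un-namespaced attributes and children are visible by default.
$status    = (string) $item['status'];
$plainNote = (string) $item->note;

// Namespaced ones have to be requested by namespace URI.
$ns         = 'urn:example:meta';
$id         = (string) $item->attributes($ns)->id;
$scopedNote = (string) $item->children($ns)->note;
```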

The second difference is moot in my opinion. It's a characteristic of the input format. It's not even a real feature - xml just repeats nodes and doesn't call that a list/array, but a node with child nodes. It's basically the same thing.

So the first step is reading, ingesting the data in a particular format. That's the least important part, as it's just Guzzle's implementation detail. What is important here is what comes out of Guzzle. The example above with JSON exchanging arrays and objects is not without merit. You could have the very same case with JSON, where you could get a response in the form of an array or a single object. The grammar differs, the principle doesn't - it's 1 or more objects.

I have a problem with the output not being symmetrical, not the input. The input doesn't matter. Many APIs allow you to switch data formats between XML/SOAP/JSON/array etc. Because Guzzle supplies an API model abstraction layer, I'm expecting Guzzle to take care of interpreting the input and giving me the same (predictable) output every single time... regardless of headers, chunking, ssl, revalidation, caching, data format, text encoding format et al.

The symmetry argument is very important when it comes to using and consuming the incoming model. As mentioned in the example above, if my userland code can't rely on the incoming model from Guzzle, then what's the point of using it? I could just load the XML and foreach() through the nodes. In that particular aws-php-sdk example, it seems feeble. For example, that s3 delete items thing relies on a response-checking method that checks the Error key of the model. What if AWS sends back 2 errors? The lib would then do something unexpected, such as return a (silent) success (!!!), while items were actually not purged from s3... (because a numeric array does not contain an Error key).
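The shape change being described can be reproduced with a naive conversion. The sketch below is an assumption for illustration only: it round-trips SimpleXML through JSON and is not Guzzle's visitor or the SDK's actual error check. The `naiveXmlToArray()` helper and the sample payloads are hypothetical.

```php
<?php
// Naive conversion, NOT Guzzle's visitor: round-trip SimpleXML through JSON.
function naiveXmlToArray($xml)
{
    return json_decode(json_encode(simplexml_load_string($xml)), true);
}

$oneError  = '<DeleteResult><Error><Key>a.txt</Key></Error></DeleteResult>';
$twoErrors = '<DeleteResult><Error><Key>a.txt</Key></Error>'
           . '<Error><Key>b.txt</Key></Error></DeleteResult>';

$one = naiveXmlToArray($oneError);
$two = naiveXmlToArray($twoErrors);

// One node: 'Error' is an associative array.
var_dump(isset($one['Error']['Key']));    // bool(true)
// Two nodes: 'Error' becomes a numerically indexed list.
var_dump(isset($two['Error']['Key']));    // bool(false)
var_dump(isset($two['Error'][0]['Key'])); // bool(true)
```

A check written against the first shape silently misses the second, which is the kind of failure the symmetry argument is about.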

@mtdowling
Member

I appreciate you taking the time to convey your thoughts on this.

> I have a problem with the output not being symmetrical, not the input. The input doesn't matter. Many APIs allow you to switch data formats between XML/SOAP/JSON/array etc. Because Guzzle supplies an API model abstraction layer, I'm expecting Guzzle to take care of interpreting the input and giving me the same (predictable) output every single time... regardless of headers, chunking, ssl, revalidation, caching, data format, text encoding format et al.

For better or worse, as of right now, Guzzle's service descriptions are closely tied to the serialization format used over the wire (e.g., xml, json, etc). This is evident in the special care that needs to be taken to parse XML responses into the normalized array-like syntax of Guzzle's response models. The current service description format was not designed to work across different serialization formats for the same API. The reason is that the end user of a web service client doesn't care about the over-the-wire format of the web service, because it is abstracted away by the service description. Service descriptions are written with a specific serialization format in mind from the start to work around this problem entirely. This is something that we can definitely look at for a future version of Guzzle (maybe Guzzle service descriptions become versioned or something).

> As mentioned in the example above, if my userland code can't rely on the incoming model from Guzzle, then what's the point of using it? I could just load the XML and foreach() through the nodes. In that particular aws-php-sdk example, it seems feeble. For example, that s3 delete items thing relies on a response-checking method that checks the Error key of the model. What if AWS sends back 2 errors? The lib would then do something unexpected, such as return a (silent) success (!!!), while items were actually not purged from s3... (because a numeric array does not contain an Error key).

That is incorrect. One of the purposes of the xmlFlattened attribute is to account for the case of the API returning a single node for something that is supposed to be a collection of nodes. Take the following example for the response of a DeleteObjects operation: https://github.com/aws/aws-sdk-php/blob/master/src/Aws/S3/Resources/s3-2006-03-01.php#L3024.

$subsetOftheReallyBigServiceDescription = array(
        'DeleteObjectsOutput' => array(
            'type' => 'object',
            'additionalProperties' => true,
            'properties' => array(
                'Deleted' => array(
                    'type' => 'array',
                    'location' => 'xml',
                    'data' => array(
                        'xmlFlattened' => true,
                    ),
                    'items' => array(
                        'type' => 'object',
                        'properties' => array(
                            'Key' => array(
                                'type' => 'string',
                            ),
                            'VersionId' => array(
                                'type' => 'string',
                            ),
                            'DeleteMarker' => array(
                                'type' => 'boolean',
                            ),
                            'DeleteMarkerVersionId' => array(
                                'type' => 'string',
                            ),
                        ),
                    ),
                ),
                'Errors' => array(
                    'type' => 'array',
                    'location' => 'xml',
                    'sentAs' => 'Error',
                    'data' => array(
                        'xmlFlattened' => true,
                    ),
                    'items' => array(
                        'type' => 'object',
                        'sentAs' => 'Error',
                        'properties' => array(
                            'Key' => array(
                                'type' => 'string',
                            ),
                            'VersionId' => array(
                                'type' => 'string',
                            ),
                            'Code' => array(
                                'type' => 'string',
                            ),
                            'Message' => array(
                                'type' => 'string',
                            ),
                        ),
                    ),
                ),
                'RequestId' => array(
                    'location' => 'header',
                    'sentAs' => 'x-amz-request-id',
                ),
            ),
        )
);

Let's say we get this response back from the service:

<?xml version="1.0" encoding="UTF-8"?>
<DeleteResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Deleted>
    <Key>sample1.txt</Key>
  </Deleted>
  <Error>
    <Key>sample2.txt</Key>
    <Code>AccessDenied</Code>
    <Message>Access Denied</Message>
  </Error>
</DeleteResult>

Regardless of whether the XML response returned a single Deleted node or multiple Deleted nodes, the response model will always contain a "Deleted" key that contains an array of objects. The above XML parsed using the service description I linked will create the following response model (represented as JSON just to type it quickly):

{
  "Deleted": [{"Key": "string", "VersionId": "string", ... }, {...}, ...],
  "Errors": [{"Key": "string", "VersionId": "string", ... }, {...}, ...],
}

(You can also see the structure in the API docs of the SDK for this method: http://docs.aws.amazon.com/aws-sdk-php/latest/class-Aws.S3.S3Client.html#_deleteObjects).

Collections of XML nodes can be represented in various ways while a collection of JSON values can only be represented in an array.

Wrapping node (this is not flattened because "things" is the collection of nodes (an array type) and "thing" is a node stored in the collection):

<response>
  <things>
    <thing>...</thing>
    <thing>...</thing>
  </things>
</response>

No wrapping node (flattened):

<response>
  <thing>...</thing>
  <thing>...</thing>
</response>
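A minimal sketch of how the two layouts can be read into the same list shape, assuming plain SimpleXML rather than Guzzle's XmlVisitor. The `collectNodes()` helper and its parameters are hypothetical names; the optional wrapper argument plays roughly the role that the absence of xmlFlattened plays in a service description.

```php
<?php
// Hypothetical helper: collect repeated <$item> nodes into a list, whether
// they sit directly under the parent (flattened) or inside a <$wrapper> node.
function collectNodes(SimpleXMLElement $parent, $item, $wrapper = null)
{
    $container = $wrapper === null ? $parent : $parent->{$wrapper};
    $result = array();
    foreach ($container->{$item} as $node) {
        $result[] = (string) $node;
    }

    return $result;
}

$wrapped   = simplexml_load_string(
    '<response><things><thing>a</thing><thing>b</thing></things></response>'
);
$flattened = simplexml_load_string(
    '<response><thing>a</thing><thing>b</thing></response>'
);
$single    = simplexml_load_string('<response><thing>a</thing></response>');

$fromWrapped   = collectNodes($wrapped, 'thing', 'things'); // array('a', 'b')
$fromFlattened = collectNodes($flattened, 'thing');         // array('a', 'b')
$fromSingle    = collectNodes($single, 'thing');            // array('a')
```

Because the helper always iterates, a single node still comes back as a one-element list, which is the consistency the flattened setting is meant to provide.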

@Thinkscape
Contributor Author

Gotcha. So I have to reincarnate xmlFlattened and make it behave similarly to how it used to.

@bendavies

any news on this?

@mtdowling
Member

The service description layer of Guzzle 4 has been rewritten. Many of the design decisions for the rewrite were taken from this issue. You can find the service description layer for Guzzle 4 at https://github.com/guzzle/guzzle-services.

Guzzle 3 development has moved to https://github.com/guzzle/guzzle3. Feel free to reopen this PR on the Guzzle 3 repository (though I feel this is a risky PR due to the amount of changes and considering that 4.0 is now being released).

@mtdowling mtdowling closed this Mar 15, 2014
@Thinkscape
Contributor Author

👍 ... a lot of hours went into that, but I'm happy that at least it was sorta inspiration for refactor.
