Writes compressed output of given format #2221

fenilsuchak · 2019-03-16T13:11:01Z

Fixes #2216

Changes made in this Pull Request:

Preferring file_format argument over filename in get_writer_for .
The compressed form of atomgroups can be written as ag.write(filename.writer.gz)
Tests are written

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

codecov · 2019-03-16T13:31:46Z

Codecov Report

Merging #2221 into develop will decrease coverage by 1.14%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop    #2221      +/-   ##
===========================================
- Coverage    90.71%   89.57%   -1.15%     
===========================================
  Files           15      158     +143     
  Lines         1982    19686   +17704     
  Branches         0     2769    +2769     
===========================================
+ Hits          1798    17633   +15835     
- Misses         184     1458    +1274     
- Partials         0      595     +595

Impacted Files	Coverage Δ
package/MDAnalysis/coordinates/GRO.py	`93.87% <100%> (ø)`
package/MDAnalysis/core/groups.py	`98.25% <100%> (ø)`
package/MDAnalysis/analysis/nuclinfo.py	`90.14% <100%> (ø)`
package/MDAnalysis/lib/util.py	`88.47% <100%> (ø)`
package/MDAnalysis/coordinates/base.py	`93.3% <0%> (ø)`
package/MDAnalysis/analysis/helanal.py	`85.19% <0%> (ø)`
package/MDAnalysis/topology/MinimalParser.py	`100% <0%> (ø)`
package/MDAnalysis/visualization/__init__.py	`100% <0%> (ø)`
package/MDAnalysis/core/qcprot.py	`100% <0%> (ø)`
... and 138 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9428866...2c92916. Read the comment docs.

fenilsuchak · 2019-03-16T15:36:02Z

Not sure why codecov is failing

fenilsuchak · 2019-03-17T05:34:46Z

I guess the test is already written for this in Class TestGetWriterFor in test_util.py, heres the test

def test_compressed_extension(self):
        for ext in ('.gz', '.bz2'):
            fn = 'test.gro' + ext
            writer = mda.coordinates.core.get_writer_for(filename=fn)
            assert writer == mda.coordinates.GRO.GROWriter
            # Make sure ``get_writer_for`` works with compressed file file names
def test_compressed_extension_fail(self):
        for ext in ('.gz', '.bz2'):
            fn = 'test.unk' + ext
            # Make sure ``get_writer_for`` fails if an unknown format is compressed
            with pytest.raises(TypeError):
                mda.coordinates.core.get_writer_for(filename=fn)

jbarnoud · 2019-03-18T14:56:25Z

Thank you for your contribution!

Codecov complains because you added code which is not tested. You need to add a test that calls ag.write with a file name ending on .gz to go through the lines you added.

From what I see, your change would indeed allow to write a gziped file. Though, as you noticed, get_writer_for is already dealing with compressed file extensions. Therefore, I think a better solution would be to get rid of the redundant format logic in write rather than adding to it like you did.

@richardjgowers, it seems you wrote the lines around

mdanalysis/package/MDAnalysis/core/groups.py

Line 3086 in 04056e3

format = os.path.splitext(filename)[1][1:] # strip initial dot!

. Do you remember why you extracted the format there rather than leaving get_writer_for deal with it?

fenilsuchak · 2019-03-18T16:06:05Z

@jbarnoud, ohh ok, I'll add such a test. from what I understood get_writer_for gets the relevant writer( PDB etc.), based on the file extension that is stored in format which is provided as an argument to it. I think format logic would be necessary as it needs to handle three cases

Usual File Extension (Which has a writer)
GZipped File
No File extension
format I guess is a single variable for selecting between these. I couldn't think of a better solution given the code structure. But if there is something I'm missing, I'm willing to change the code accordingly to remove redundancy, I'll add tests after the changes are confirmed.

jbarnoud · 2019-03-18T17:15:15Z

@Fenil3510 The format discovery logic has to exist, indeed. My point is that it is already implemented in get_writer_for that is called in ag.write. In its current version, ag.write overwrite the format discovery logic and explicitly ask for a file format, instead of letting get_writer_for figure out what format to use.

The user should still be able to ask for a specific format by providing the format keyword argument to ag.write, though. But just passing the format argument untouched should be enough.

fenilsuchak · 2019-03-19T06:43:23Z

Oh, I get it @jbarnoud you are right. get_writer_for does handle compression thing with function calls get_ext and check_compressed_format, making the thing I did redundant. We can just pass format to writer(filename, format = None) and its done. I guess there is no need for format variable then.
I've made changes. Thanks

jbarnoud · 2019-03-19T09:15:23Z

This is what I had in mind.

But, you are breaking the file_format keyword argument of ag.write. This keyword argument should allow to bypass the format detection, and it is rather handy from time to time. The default value is "PDB", which is confusing because we actually default to guessing... The default should probably be None. This change should be documented, and we need to test what happens when the format cannot be guessed.

fenilsuchak · 2019-03-19T09:36:48Z

But, you are breaking the file_format keyword argument of ag.write. This keyword argument should allow to bypass the format detection, and it is rather handy from time to time.

@jbarnoud yes that makes sense. Currently when the user has given argument filename.extenstion and if the extension is valid then the file_format keyword would be redundant and preference would be given to the former to identify the format. This is what is happening currently if I'm right. So giving preference to file_format over extraction from filename is what you mean? ie. reversing the preference.

The issue then will be that if neither file_format nor extraction from filename can guess the format, in that case, we could use a default value like PDB?

jbarnoud · 2019-03-19T11:02:05Z

Look at the signature of get_writer_for.

mdanalysis/package/MDAnalysis/core/_get_readers.py

Line 198 in a0bc631

def get_parser_for(filename, format=None):

It takes an optional format. The guessing is only done if that format is not explicitly provided.

mdanalysis/package/MDAnalysis/core/_get_readers.py

Line 221 in a0bc631

# Only guess if format is not provided

This is good practice because, if the user tells the function to do something, then the function should do what the user asked.

What you did in your last commit is to disregard an explicit format that the user may have passed with the file_format argument of ag.write.

In the ideal scenario for ag.write:

the format is guessed by default
if a format is explicitly provided, then that format is used

fenilsuchak · 2019-03-19T11:24:39Z

Ok, @jbarnoud I understand what you say but my confusion is in get_writer_for vs get_selection_writer_for.
Supposedly user provides ag.write with arguments filename(without extension) , file_format = GRO so this calls get_writer_for which throws an error (current implmentation) so get_selection_writer_for gets the control and writes in GRO format. So if it is done such that get_writer_for handles all possible scenarios, then get_selection_writer_for would be redundant, wouldn't it? Or is get_selection_writer_for for some different purpose?

jbarnoud · 2019-03-19T13:26:46Z

get_selection_writer_for is about getting a selection writer. It is a different things, but ag.write can write either a structure (writer) or a selection (selection writer).

fenilsuchak · 2019-03-19T14:41:06Z

In the ideal scenario for ag.write:

the format is guessed by default

if a format is explicitly provided, then that format is used

Ok, thanks for the clarification about get_selection_writer. I will make changes as quoted above.

fenilsuchak · 2019-03-20T13:31:48Z

@jbarnoud a doubt, if the user writes ag.write(filename.pdb.gz, file_format = 'GRO') this results in a conflict in the things I'm about to change. Do these cases need to be take care of? If file_format is to be given preference as you said then it must be saved as filename.gro.gz ? or we throw an error?

jbarnoud · 2019-03-20T14:35:23Z

As a general rule, it is very frustrating for users when a software tries to outsmart them. Indeed, the example you give presents a contradiction, but the request is clear: the user wants to write a GRO file named "filename.pdb.gz"; therefore it is what we should generate.

richardjgowers · 2019-03-20T15:28:08Z

format is meant to overrule everything, the filename is just a (often very good) hint at the format.

I would have thought that you couldn't compress the file frame by frame, but instead you'd need to see the entire file before compressing. It would be useful to see a test that writes a 'test.pdb.gz', then a 'test.pdb' which gets compressed and check the relative file sizes? Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

fenilsuchak · 2019-03-20T17:09:25Z

As a general rule, it is very frustrating for users when a software tries to outsmart them. Indeed, the example you give presents a contradiction, but the request is clear: the user wants to write a GRO file named "filename.pdb.gz"; therefore it is what we should generate.

Noted @jbarnoud, thanks

fenilsuchak · 2019-03-20T17:10:04Z

format is meant to overrule everything, the filename is just a (often very good) hint at the format.

Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

Ok @richardjgowers . I'll write the test as suggested.

fenilsuchak · 2019-03-21T07:37:38Z

@jbarnoud @richardjgowers I've made a fix. But this is by no means robust. There will be a bit of an issue in the compression part. How would the user indicate that a compressed format is needed? This would be via extraction filename itself, right?. The file_format argument is only to specify writers according to the current docstring. Should we extend the argument file_format to take values like pdb.gz and gro.gz?
Currently if filename = name.pdb.gz and file_format = pdb then gz(extracted from filename) file is created, but as a user, I think its a bit confusing or can we change the argument name from file_format to file_writer, because it essentially selects writer currently.

fenilsuchak · 2019-03-22T14:50:29Z

The build is failing as get_selection_writer_for is expecting file_format to be a default one(previously 'PDB'). As there is no default_format now, should I change get_selection_writer_for to use PDB if file_format is None or let it throw error?

orbeckst · 2019-03-27T19:24:38Z

I haven't reviewed the code but I note that although the tests pass now, the coverage is still low. Do you still have to write tests?

fenilsuchak · 2019-03-28T04:37:43Z

@orbeckst, yes haven't added the tests yet. Will do as soon as changes are reviewed. There are some issues that need to be addressed that I've mentioned in the previous comments before tests be written.

fenilsuchak · 2019-04-02T10:37:42Z

@Fenil3510 write the tests for what you have – people don't always have time to do reviews, and when they can make time then it's better that everything is already there, even if that means that you have to change things after the review.

I'm sorry, will follow it. I'm writing tests.

fenilsuchak · 2019-04-02T17:02:14Z

It would be useful to see a test that writes a 'test.pdb.gz', then a 'test.pdb' which gets compressed and check the relative file sizes? Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

Tests were written as per this and from what I understood. Are any other tests expected?

I got a weird failure in PR MDAnalysis#2231 in Travis job https://travis-ci.com/MDAnalysis/mdanalysis/jobs/189553940 : Exception occurred: File "/home/travis/miniconda/envs/test/lib/python3.5/site-packages/sphinx/ext/napoleon/docstring.py", line 703, in _parse_raises_section m = self._name_rgx.match(_type).groupdict() AttributeError: 'NoneType' object has no attribute 'groupdict' and the docs build find locally (under Python 3.6)

orbeckst · 2019-04-02T20:30:58Z

The failure of the doc build https://travis-ci.com/MDAnalysis/mdanalysis/jobs/189731062 is not your fault, I just had the same thing in PR #2231 . Maybe my patch there will fix this.

If you haven't added your name yet to AUTHORS then do that, too.

Also add an entry to CHANGELOG.

orbeckst

Thanks for adding tests. See comments (formatting, tests for capitalized extensions). Also add yoursel to AUTHORS and and entry to CHANGELOG.

orbeckst · 2019-04-02T20:31:27Z

package/MDAnalysis/coordinates/GRO.py

@@ -326,7 +326,7 @@ def __init__(self, filename, convert_units=None, n_atoms=None, **kwargs):
               w.write(u.atoms)

        """
-        self.filename = util.filename(filename, ext='gro')
+        self.filename = util.filename(filename, ext='gro', keep = True)


PEP8; no space around = in args.

Are we also restricting the number of characters per line to 80/79 as per pep8?

See the Style Guide: yes, we try to keep within 79 characters. (Not all our code rigorously adheres to it, but that's what we're striving for)

orbeckst · 2019-04-02T20:31:49Z

package/MDAnalysis/core/groups.py

@@ -3054,8 +3054,7 @@ def write(self, filename=None, file_format="PDB",
        if filename is None:
            trjname, ext = os.path.splitext(os.path.basename(trj.filename))
            filename = filenamefmt.format(trjname=trjname, frame=trj.frame)
-        filename = util.filename(filename, ext=file_format.lower(), keep=True)
-
+        filename = util.filename(filename, ext= file_format or 'PDB', keep=True)


Did you test that you can omit .lower()??

I don't like file_format or 'PDB'. Better be explicit so that only None really gets defaulted:

file_format if file_format is not None else 'PDB'

package/MDAnalysis/core/groups.py

orbeckst · 2019-04-02T20:36:29Z

testsuite/MDAnalysisTests/core/test_atomgroup.py

@@ -167,7 +167,27 @@ def test_write_frame_none(self, u, tmpdir, extension):
            u.atoms.positions[None, ...], new_positions, decimal=2
        )

+    @pytest.mark.parametrize('extension', ('xtc', 'dcd', 'pdb', 'xyz'))


Se should capitalized extensions, too.

I'm afraid I don't understand. Should I capitalize all the extensions?. I've used the parameters from the previous test already written.

I've added a test for a capitalized extension 'PDB'.

Sorry, clearly I should proof-read what I am typing... yes, test for capitalized extensions was what I was looking for.

testsuite/MDAnalysisTests/core/test_atomgroup.py

orbeckst · 2019-04-02T20:43:02Z

@jbarnoud can you please also have a quick peek? Thanks.

fixed 2'-hydroxyl dihedral calculation

…o Fenil3510-develop

orbeckst · 2019-04-03T01:35:50Z

I merged the latest develop into your branch, which should fix the doc build failure (via 4f8dc17).

fix MDAnalysis#2233

orbeckst · 2019-04-03T04:00:39Z

oops, I killed travis #2233 ... fixing (so that the tests ACTUALLY run)

* pin the version of sphinx used in Travis doc builds until upstream issues are resolved in version 2.0

…2233 fix travis file (tab -> space)

fenilsuchak · 2019-04-04T04:26:48Z

oops, I killed travis #2233 ... fixing (so that the tests ACTUALLY run)

Would need to commit again for Travis to build? As I see it is fixed now.

…o Fenil3510-develop

orbeckst · 2019-04-04T20:16:33Z

Merged develop to get the travis fix from PR #2234.

orbeckst · 2019-04-05T18:02:43Z

Congratulations @Fenil3510 to your first commit. Thank you for your effort.

orbeckst · 2019-04-05T18:05:22Z

@Fenil3510 I sent you a GitHub invite so that you can be part of @MDAnalysis/contributors.

fenilsuchak · 2019-04-05T19:02:17Z

Happy to be a part of the team!

@tbengtsen

Update of AUTHORS and CHANGELOG with inferred author contributions. * Removed duplicate mattwthompson in 0.20.0 changelog entry.: mattwthompson was placed twice by accident, this removes this duplication. * Addition of missing authors. An retrospective analysis of the authors found via `git shortlog -s -n --email --branches="develop"` found several commits by authors which were not present in the `AUTHORS.md` file. - Zhenbo Li: commited via PR: Started tests for gnm. #803 and Make Travis run tests on OSX. #771, - Jenna M. Swarthout via PR Update CoC according to suggestions from current CoC committee #4289 and Point to new reporting form link (owned by conduct@mdanalysis.org) #4298, - Bradley Dice via PR Fix GSD Windows compatibility #2384 , - David Minh via PR #2668 There seemed to be no indications in the above mentioned PRs that the author did not want to be in the authors file, it looks like it was just forgotten. * Addition of missing entries from the changelog Continued cross referencing of the git shortlog output but also accounting for the changelog identified several individuals that had not been included in the changelog entries for the release they contributed under. They were added to the relevant entry of the changelog based on the merge commit date. Please note that for Tone Bengsten, we a) had no github handle (so they were assigned @tbengtsen), and b) no specific commit. Given we know that this individual was including alongside the encore merge, they were assigned to the 0.16.0 release. * Update CHANGELOG * PR #1218 * PR #1284 and #1408 * PR #4109 * based on git history * PRs #803 and #771 (author addition, changelog addition) * PR #2255 and #2221 * PR #1225 * PR #4289 and #4298 * PR #4031 * PR #4085 * PR #3635 * PR #2356 * PR #2559 * No GH handle - Encore author addition @tbengtsen * PR #4184 * PR #2614 * PR #2219 * PR #2384 * PR #2668 * Add missing entry for Jenna Thanks to @fiona-naughton for helping out with digging into this data :) Co-authored-by: Fiona Naughton <fiona@mdanalysis.org> Co-authored-by: Oliver Beckstein <orbeckst@mdanalysis.org>

fix1

04056e3

removed redundant format

e7d8164

file_format_precedence

62601a3

fenilsuchak added 2 commits March 21, 2019 22:11

fixed changes due to file_format being none

1b23c61

fixed redundant

94ccf13

fixed util.filename by default PDB

70a314b

fenilsuchak added 2 commits March 29, 2019 15:01

some logical fix

a718617

GRO Zipping

cc4a013

fenilsuchak force-pushed the develop branch from 0f5d024 to cc4a013 Compare April 2, 2019 11:52

tests_added

92634f0

orbeckst requested changes Apr 2, 2019

View reviewed changes

orbeckst requested a review from jbarnoud April 2, 2019 20:42

orbeckst added 2 commits April 2, 2019 21:30

Merge pull request MDAnalysis#2231 from MDAnalysis/issue-2218-2-OH

6a60c8a

fixed 2'-hydroxyl dihedral calculation

Merge branch 'develop' of https://github.com/Fenil3510/mdanalysis int…

7d7fcbe

…o Fenil3510-develop

fix travis file (tab -> space)

e88265d

fix MDAnalysis#2233

orbeckst mentioned this pull request Apr 3, 2019

fix travis file (tab -> space) #2234

Merged

4 tasks

fenilsuchak and others added 5 commits April 3, 2019 10:13

made_changes

755e246

changes_fixed

28298b1

added changelog and authors

d84232f

TST: pin sphinx version

bb54223

* pin the version of sphinx used in Travis doc builds until upstream issues are resolved in version 2.0

Merge pull request MDAnalysis#2234 from MDAnalysis/fix-broken-travis-…

7dfb65a

…2233 fix travis file (tab -> space)

fenilsuchak and others added 2 commits April 4, 2019 16:20

updated changelog

941e3cd

Merge branch 'develop' of https://github.com/Fenil3510/mdanalysis int…

2c92916

…o Fenil3510-develop

orbeckst approved these changes Apr 5, 2019

View reviewed changes

orbeckst merged commit 1be9f73 into MDAnalysis:develop Apr 5, 2019

Writes compressed output of given format #2221

Writes compressed output of given format #2221

Conversation

fenilsuchak commented Mar 16, 2019 • edited

PR Checklist

codecov bot commented Mar 16, 2019 • edited

Codecov Report

fenilsuchak commented Mar 16, 2019

fenilsuchak commented Mar 17, 2019

jbarnoud commented Mar 18, 2019

fenilsuchak commented Mar 18, 2019 • edited

jbarnoud commented Mar 18, 2019

fenilsuchak commented Mar 19, 2019 • edited

jbarnoud commented Mar 19, 2019

fenilsuchak commented Mar 19, 2019 • edited

jbarnoud commented Mar 19, 2019

fenilsuchak commented Mar 19, 2019 • edited

jbarnoud commented Mar 19, 2019

fenilsuchak commented Mar 19, 2019 • edited

fenilsuchak commented Mar 20, 2019 • edited

jbarnoud commented Mar 20, 2019

richardjgowers commented Mar 20, 2019

fenilsuchak commented Mar 20, 2019

fenilsuchak commented Mar 20, 2019

fenilsuchak commented Mar 21, 2019 • edited

fenilsuchak commented Mar 22, 2019

orbeckst commented Mar 27, 2019

fenilsuchak commented Mar 28, 2019

fenilsuchak commented Apr 2, 2019

fenilsuchak commented Apr 2, 2019 • edited

orbeckst commented Apr 2, 2019

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orbeckst commented Apr 2, 2019

orbeckst commented Apr 3, 2019

orbeckst commented Apr 3, 2019

fenilsuchak commented Apr 4, 2019

orbeckst commented Apr 4, 2019

orbeckst commented Apr 5, 2019

orbeckst commented Apr 5, 2019

fenilsuchak commented Apr 5, 2019

fenilsuchak commented Mar 16, 2019 •

edited

codecov bot commented Mar 16, 2019 •

edited

fenilsuchak commented Mar 18, 2019 •

edited

fenilsuchak commented Mar 19, 2019 •

edited

fenilsuchak commented Mar 19, 2019 •

edited

fenilsuchak commented Mar 19, 2019 •

edited

fenilsuchak commented Mar 19, 2019 •

edited

fenilsuchak commented Mar 20, 2019 •

edited

fenilsuchak commented Mar 21, 2019 •

edited

fenilsuchak commented Apr 2, 2019 •

edited