Show who tests what #170

Closed · nedbat opened this issue Mar 21, 2012 · 49 comments
Labels: enhancement (New feature or request), report


nedbat commented Mar 21, 2012

Originally reported by andrea crotti (Bitbucket: andrea_crotti, GitHub: Unknown)


I was just using the awesome HTML report to see my test coverage and I had the following thought.

Wouldn't it be nice to be able to easily see which parts of the test suite are actually testing my code?

I guess this information is collected while doing the annotation, right?

This way we could very easily see whether the tests are actually good, which is especially important when working with other people's code.



nedbat commented Mar 21, 2012

This is a very interesting idea, one that figleaf pioneered with "sections". Right now we don't collect this information. The trace function would have to be modified to walk up the stack to identify "the test", then that information would have to be stored somehow. Then the reporting would have to be changed to somehow display the information.

That's three significant problems, but only three! Do you have ideas how to do them?
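
To make the first problem concrete, here is a minimal sketch (illustrative only, not coverage.py's actual tracer, and ignoring performance) of a trace function that walks up the stack to find an enclosing test and records it per line:

from collections import defaultdict

# Hypothetical store: (filename, lineno) -> set of tests that executed it.
who_tests_what = defaultdict(set)

def find_enclosing_test(frame):
    # Walk outward through the callers until something looks like a test.
    while frame is not None:
        if frame.f_code.co_name.startswith("test"):
            return "%s::%s" % (frame.f_code.co_filename, frame.f_code.co_name)
        frame = frame.f_back
    return None

def trace(frame, event, arg):
    if event == "line":
        test = find_enclosing_test(frame)
        if test is not None:
            who_tests_what[(frame.f_code.co_filename, frame.f_lineno)].add(test)
    return trace

# sys.settrace(trace) would be installed before the suite runs.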


nedbat commented Mar 21, 2012

Original comment by andrea crotti (Bitbucket: andrea_crotti, GitHub: Unknown)


Well, I need to dive more into the internals to suggest something that
makes sense; however, I see that the stats struct is this:

struct {
    unsigned int calls;
    unsigned int lines;
    unsigned int returns;
    unsigned int exceptions;
    unsigned int others;
    unsigned int new_files;
    unsigned int missed_returns;
    unsigned int stack_reallocs;
    unsigned int errors;
} stats;

which probably doesn't help my idea much, because I think we would need
to associate every line with a list of the lines that are testing it.

So for example

silly_module.py:

def silly_func():
    foobar()

silly_test.py:

assert silly_func()

silly_test2.py:

assert silly_func()

I should have something like:

silly_func:0 = [silly_test.py:0, silly_test2.py:0]
silly_func:1 = [silly_test.py:0, silly_test2.py:0]

I'm afraid that it would be an awful lot of information to store if the
project gets really big, though.

For the reporting, I imagine just adding a clickable button near every
line that opens up a page collecting the different tests that run
that line, with some context around it.

That should probably be the easier part, even if I'm not really a good
web developer at the moment.


nedbat commented Mar 21, 2012

I don't think we need to collect all the lines that test product lines; we need to collect the tests that test product lines. That reduces the data collection a bit, but it will still be a challenge.


nedbat commented Mar 21, 2012

Original comment by andrea crotti (Bitbucket: andrea_crotti, GitHub: Unknown)


For the tests, do you mean the code object of the test function?

In that case I agree, because it should keep track of the original file / line where it's defined, if I remember correctly.

Anyway, another possible use case of this feature is checking whether unit tests are really unit tests.

If I see, for example, that module a.py is tested by test_a.py but also
by test_z.py, which has almost nothing to do with a.py, then there is
something wrong, and a system to visualize that would be nice.


nedbat commented Nov 10, 2012

Issue #185 was marked as a duplicate of this issue.


nedbat commented Jun 24, 2013

Original comment by Kevin Qiu (Bitbucket: kevinjqiu, GitHub: kevinjqiu)


I have a prototype that does just as the OP described: https://github.com/kevinjqiu/nostrils

It's currently a nose plugin, but I'd love to see coverage.py do this.


nedbat commented Jun 5, 2014

Issue #311 was marked as a duplicate of this issue.


nedbat commented Jun 13, 2014

Original comment by Thomas Güttler (Bitbucket: thomas-guettler, GitHub: Unknown)


I guess you need this data structure to implement this:

I use the Django ORM since it is what I know best, but SQLAlchemy might be a better solution
for this. Storing to SQLite would be enough.

from django.db import models

class Line(models.Model):
    file_name = models.CharField(max_length=1024)
    line_number = models.IntegerField()

class StackFrame(models.Model):
    executed_line = models.ForeignKey(Line, on_delete=models.CASCADE,
                                      related_name="executions")
    lines_of_stack = models.ManyToManyField(Line, related_name="stacks")

This structure needs to be filled in for every line that gets executed by coverage.

An HTML report could be created from this data.

I guess this is really slow... but who cares? For me it would be enough to run
this once every week in a cron/batch job.


nedbat commented Jun 16, 2014

Original comment by Alexander Walters (Bitbucket: sdamon, GitHub: sdamon)


Using an ORM in the trace function... I would think that would turn a 30-second test suite into a 45-minute-to-an-hour test suite.


nedbat commented Jun 16, 2014

Original comment by andrea crotti (Bitbucket: andrea_crotti, GitHub: Unknown)


Yes, agreed, that would probably be too slow.
Anyway, there is a project that might solve the problem in a different way:

https://pypi.python.org/pypi/nose-knows/0.1.1


nedbat commented Jun 16, 2014

Original comment by Alexander Walters (Bitbucket: sdamon, GitHub: sdamon)


Be very careful using that with coverage. ...In fact, it may be impossible, depending on how nice nose-knows and coverage are about setting their trace functions.


nedbat commented Jun 16, 2014

Original comment by Thomas Güttler (Bitbucket: thomas-guettler, GitHub: Unknown)


About the ORM: Linus Torvalds once said that good programmers care about data structures. That's why I would implement this first.
And the DB structure in this case is quite simple. The implementation could use pure SQLite without an ORM.

Yes, the execution time would increase a lot. But I don't think an alternative to SQLite would be much faster.

And: this is not intended to be run every time. We can optimize later.
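
For example, the same two-table structure in plain sqlite3, without an ORM (table and column names are only illustrative):

import sqlite3

conn = sqlite3.connect("who_tests_what.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS line (
        id INTEGER PRIMARY KEY,
        file_name TEXT,
        line_number INTEGER,
        UNIQUE (file_name, line_number)
    );
    CREATE TABLE IF NOT EXISTS stack_frame (
        executed_line_id INTEGER REFERENCES line (id),
        stack_line_id INTEGER REFERENCES line (id)
    );
""")
conn.commit()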


nedbat commented Jun 17, 2014

If you are going to record which line was tested by each test, what will you do as the code shifts around due to insertion and deletion of lines?


nedbat commented Sep 21, 2015

Original comment by xcombelle (Bitbucket: xcombelle, GitHub: xcombelle)


Instead of inspecting the stack at each call to the trace function, I thought of something which could be faster:

at the start of a test, record which test we are in; during each call to the trace function, check which test is current; and at the end of the test, forget the current test.
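
A rough sketch of that idea (the hook names are invented; a real test runner would have to call them):

from collections import defaultdict

lines_per_test = defaultdict(set)   # test id -> set of (filename, lineno)
current_test = None                 # set by the runner, read by the tracer

def start_test(test_id):            # runner hook: called before each test
    global current_test
    current_test = test_id

def end_test():                     # runner hook: called after each test
    global current_test
    current_test = None

def trace(frame, event, arg):
    # No stack walking: just look up whichever test is current.
    if event == "line" and current_test is not None:
        lines_per_test[current_test].add((frame.f_code.co_filename, frame.f_lineno))
    return trace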


nedbat commented Sep 21, 2015

Original comment by Florian Bruhin (Bitbucket: The-Compiler, GitHub: The-Compiler)


@xcombelle that would work with extensible test frameworks (like pytest and nose), but how are you going to do this with e.g. unittest?


nedbat commented Sep 21, 2015

Original comment by Tibor (Bitbucket: tibor_arpas, GitHub: Unknown)


I think the conceptual problem here is that coverage.py has avoided the concepts of "test case" and "test". It's the job of a test runner to define, discover, instantiate, and execute them. And each test runner has a slightly different definition of what is a test and what is not.

e.g. unittest has this definition of a test: methods of a unittest.TestCase subclass whose names begin with the letters "test".

Other test runners have different definitions... E.g. pytest is very flexible and you can configure almost anything to be a test.

The practical solution might be that coverage.py:
a) detects tests at runtime if executed under unittest, and
b) provides an API for annotating phases (sections/tests) for the rest.

@nedbat do you see b) as a challenge also, or were you referring to a) as not being easy? :)


nedbat commented Sep 21, 2015

Original comment by Ronny Pfannschmidt (Bitbucket: RonnyPfannschmidt, GitHub: RonnyPfannschmidt)


Providing a context manager to record test setup/execution/teardown would be nice.

Then all test runners could build upon it.
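
Something along these lines, perhaps (set_context is an assumed method here, not an API coverage.py offered at the time):

from contextlib import contextmanager

@contextmanager
def recording(cov, label):
    # Tell the coverage object which test/phase is running, then clear it.
    cov.set_context(label)          # assumed API
    try:
        yield
    finally:
        cov.set_context(None)

# A runner could wrap each phase:
#   with recording(cov, "tests/test_foo.py::test_bar|call"):
#       run_the_test()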


nedbat commented Sep 22, 2015

Original comment by Paul Sargent (Bitbucket: PaulS, GitHub: PaulS)


I do this kind of analysis in my day job all the time with other tools. What we normally do is store separate coverage results files for each test, and then we can do various bits of analysis like:

  • Calculate the union of all the coverage to get a total (obvious).
  • Find the top test which gets the most coverage
    • ...then rank the rest of the tests by how much additional coverage they each contribute.
  • Construct sets of tests, which are quick to run, but give high coverage because they're the top 10 (for example) ranked tests. Useful for commit checks when your full test suite takes an hour.
  • Identify tests which don't contribute any unique coverage.

It all starts with having identifiable coverage for each test.
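
As an illustration of the ranking steps above: given per-test coverage stored as sets of (filename, lineno) pairs, the "additional coverage" ordering is a simple greedy loop.

def rank_by_additional_coverage(per_test):
    # per_test: dict mapping test name -> set of (filename, lineno) it covered.
    covered = set()
    ranking = []
    remaining = dict(per_test)
    while remaining:
        # Pick the test that adds the most not-yet-covered lines.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break                   # the rest add no unique coverage
        ranking.append((best, gain))
        covered |= remaining.pop(best)
    return ranking, covered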


nedbat commented Sep 23, 2015

Original comment by Alexander Walters (Bitbucket: sdamon, GitHub: sdamon)


This route would require coverage to turn into a test runner, wouldn't it?


nedbat commented Sep 24, 2015

Original comment by Paul Sargent (Bitbucket: PaulS, GitHub: PaulS)


Not necessarily, it could just be as simple as giving the test runner a mechanism to control where to store the coverage data.


nedbat commented Sep 24, 2015

Original comment by Alexander Walters (Bitbucket: sdamon, GitHub: sdamon)


This would also mean there would have to be a way for coverage to start and stop covering, too...


nedbat commented Sep 28, 2015

Original comment by Thomas Güttler (Bitbucket: thomas-guettler, GitHub: Unknown)


@PaulS you do this type of analysis in your day job?

How do you do this?


nedbat commented Oct 24, 2015

Original comment by Laurens Timmermans (Bitbucket: lauwe, GitHub: lauwe)


@tibor_arpas: A while back I made a small proof of concept which basically does what you described under 'b'.

I've uploaded the (extended) htmlcov of this proof of concept here. It basically provides a count ('covered by how many unique test cases') and a heatmap-style visualization to get an idea of which parts of your code are touched most. The dropdown (called 'label') at the top and the mouse-over in the column on the right allow selection/highlighting of test cases.

The test suite and the unittest.TestCase-derived class which produced these results can be found here. The changes I made in coverage.py to support this are not there since they are really hacky and incomplete, but if anyone is interested, let me know.


nedbat commented Oct 25, 2015

I think my preference would be to provide a plugin interface that would let a plugin define the boundaries between tests. In fact, it need not be "tests" at all. Perhaps someone wants to distinguish not between specific tests, but between directories of tests, or between unit and integration tests. Figleaf implemented a feature like this and called it generically, "sections".

So the plugin could demarcate the tests (runners? callers? regions? sections? what's a good name?) any way it liked. Coverage.py can ship with a simple one that looks for test_* methods, for the common case.

Any ideas about how to present the data? I'd like it to scale to 10k tests...
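
A very rough sketch of what such a plugin hook might look like, with invented names just to make the idea concrete (this is not a real coverage.py interface):

class TestFunctionSections:
    # Hypothetical plugin: given the frame that just started executing,
    # return a section label, or None to leave the current section unchanged.
    def section_for_frame(self, frame):
        name = frame.f_code.co_name
        if name.startswith("test_"):
            return "%s.%s" % (frame.f_globals.get("__name__", "?"), name)
        return None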


nedbat commented Oct 25, 2015

@RonnyPfannschmidt I'd rather not rely on the test runners updating to add the feature, though if the test runners want to add support, it'd be good to offer it to them in a way that provides the best support.


nedbat commented Oct 26, 2015

Original comment by Thomas Güttler (Bitbucket: thomas-guettler, GitHub: Unknown)


@nedbat " Perhaps someone wants to distinguish not between specific tests, but between directories of tests, or between unit and integration tests"

I think doing "separation of concerns" here would be nice:

First, collect the data in as much detail as possible.
Second, aggregate the data.

This way both can be done: distinguish between test methods, and distinguish between directories/sections.


nedbat commented Oct 26, 2015

Original comment by Paul Sargent (Bitbucket: PaulS, GitHub: PaulS)


@thomas-guettler So my day job is verification of hardware designs, but really the fact that it's hardware is not important. We have tests and we have code under test. The analysis is done with the commercial hardware design tools we use, but the principles of what's done are relatively straightforward.

Rather than put a lot of detail here, I've written a snippet.


nedbat commented Oct 26, 2015

@thomas-guettler I agree about separation of concerns. That's one of the reasons I'm leaning toward a plugin approach: it isn't even clear to me that "test methods" is always the finest granularity we need. Some people use coverage.py without a test suite at all, and they may have their own idea about where interesting slices begin and end.

BTW: I like the name "slice" for this concept. It's the same word as "string slicing", but I don't think that collision is a problem. "Segment" is similar, but not as nice.


nedbat commented Oct 26, 2015

Original comment by Thomas Güttler (Bitbucket: thomas-guettler, GitHub: Unknown)


@nedbat coverage.py usage without tests.... good catch. Yes, that was not on my mind.

You are right, it should be flexible.

Method: stacktrace_to_??? (unsure what to call it)

Input: a stack trace (the list of nested method calls).
Output: ???. Maybe just a string. Example for the use case "store which line was executed in which test": myapp.tests.test_foo.test_foo_with_bar

The above use case would walk down the stack trace until it sees a method whose name starts with "test_...".


nedbat commented Oct 26, 2015

I'd have to play around with possible plugin semantics. The challenge will be to support it in a way that doesn't require invoking a Python function too often, as that will kill performance.


nedbat commented Oct 27, 2015

I appreciate the "make it work, then make it fast" approach. In the case of designing a plugin API, though, the details of the API could have a big effect on the speed. But I hear you: it could be fine for this to be only enabled occasionally, and slow is fine.

@lauwe Hmm, "context" is a good (if boring!) word... :)


nedbat commented Jul 20, 2016

Original comment by Chris Beaumont (Bitbucket: chris_beaumont, GitHub: Unknown)


Hey there. I've been thinking about this issue lately, and thought it might be worth leaving some notes here. I've been working on a coverage wrapper called smother (https://github.com/chrisbeaumont/smother) based on the ideas I've seen on this ticket, @kevinjqiu's nostrils repo, and the experimental WTW code in coverage.py's source. A quick summary of smother's approach:

  • Relies on using a test runner (currently pytest or nose are supported) to hook into the start of every test. Uses coverage to actually collect coverage information, and saves the information separately for each test. This is similar to nostrils except it uses coverage.py for tracing (so it's fast and robust). It's also like the WTW code in coverage.py, but relies on the test runner instead of the tracer to detect test boundaries (so it's fast and robust).
  • It builds a JSON file whose shape is {test_name: {file_name: [list_of_lines]}}
  • The CLI has 4 main commands: lookup (given a function name or line range, list which tests visit this code section), diff (given a git diff, list which tests might be affected), csv (dump a CSV of test/source code pairs for exploration), and to_coverage (build a vanilla coverage report)
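
For instance, an entry in that JSON file is shaped roughly like this (file and test names invented):

{
    "tests/test_api.py::test_create_user": {
        "myapp/api.py": [10, 11, 12, 27],
        "myapp/models.py": [5, 6]
    }
}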

In answer to some of the questions in this thread:

If you are going to record which line was tested by each test, what will you do as the code shifts around due to insertion and deletion of lines?

This is primarily relevant for smother diff, where the smother data has been generated for an old version of code and then queried against a set of modifications. Smother takes the approach of mapping each line of code to a "semantic region" (essentially the smallest function or class block that contains that line). smother diff converts the set of modified lines from a changeset into a set of semantic regions, converts that to a set of line ranges on the old version of the code, and matches those line numbers to what's in the smother report. This conservative approach will match some tests that may not have actually touched a specific modification (if the modification was in an unevaluated branch, say), but has the benefit that these "semantic regions" are much more stable across changesets than line numbers are.

Any ideas about how to present the data? I'd like it to scale to 10k tests...

The inspiration for smother was an 11K-test suite for a 100K-line legacy codebase, and it is reasonably performant (negligible time overhead, a somewhat-ungainly 100MB data file that could easily be optimized for size, and ~5 sec query times). I've experimented with different visualizations of smother's CSV output, but ultimately found that the lookup and diff commands are most useful -- for other exploration coverage's normal HTML report is sufficient. In other words, "who tests what" feels most useful in the context of specific questions (what tests might I have just broken?).


nedbat commented Jul 26, 2016

Original comment by Tibor (Bitbucket: tibor_arpas, GitHub: Unknown)


@chris_beaumont For reference, I'll also link http://testmon.org here.
pytest-testmon is a py.test plug-in which automatically selects and re-executes only tests affected by recent changes.

I haven't had time to look at smother yet. testmon uses a notion of "python code blocks" (probably something similar to smother's "semantic regions"). pytest-testmon also takes into account holes in the blocks, which is described in the second half of https://github.com/tarpas/pytest-testmon/wiki/Determining-affected-tests

My answer to the question:

If you are going to record which line was tested by each test, what will you do as the code shifts around due to insertion and deletion of lines?

would be that I think coverage.py doesn't need to care, but if it really wants to, it can store checksums of code blocks as implemented in testmon.
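
The general idea of checksumming code blocks, as a sketch (not testmon's actual implementation; ast.get_source_segment needs Python 3.8+):

import ast
import hashlib

def block_checksums(source):
    # Hash each function/class block so per-test coverage can be matched to it
    # even after unrelated lines elsewhere in the file shift around.
    checksums = {}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            checksums[node.name] = hashlib.sha1(segment.encode("utf-8")).hexdigest()
    return checksums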


nedbat commented Dec 11, 2016

I wrote a blog post laying out the challenges: http://nedbatchelder.com/blog/201612/who_tests_what.html


nedbat commented Dec 11, 2016

Original comment by xcombelle (Bitbucket: xcombelle, GitHub: xcombelle)


I don't get how you arrive at the figure of C/4 more information to store. (I don't know how it is stored now, either.) As I understand it, now you have to store all the lines executed. With the new way you would also have to store the contexts in which each line is executed, so O(n) more information, where n is the average number of simultaneous contexts.


nedbat commented Dec 11, 2016

The way I've implemented the contexts so far, there is a separate data structure for each context. So I don't store a list of contexts for each line. Instead, each context has a subset of the line data. So the question is, what fraction of the full product's coverage will a single context be. I took a crude guess at 25%. Hence, C/4.
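
As a rough worked example of that estimate (numbers purely illustrative): if the combined coverage data for the whole suite has size C, and a typical per-test context covers about a quarter of the product, then T contexts store roughly T * (C/4) in total. With, say, T = 400 tests that is about 100 * C, i.e. two orders of magnitude more data than a single combined run.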


nedbat commented Dec 11, 2016

Original comment by xcombelle (Bitbucket: xcombelle, GitHub: xcombelle)


I realize both ways of storing the data are equivalent, and that how much of the full project is covered by one test depends heavily on the granularity of the test. For a unit test, only a small part of the codebase is exercised by each test, but an integration test covers a bigger part. So you are totally right that two orders of magnitude more data might be necessary.


nedbat commented Jan 5, 2017

Original comment by Loic Dachary (Bitbucket: dachary, GitHub: dachary)


For the record, some discussion on this topic is at https://bitbucket.org/ned/coveragepy/pull-requests/120/wip-list-of-contexts-instead-of-none/diff


nedbat commented Jan 5, 2017

Original comment by Loic Dachary (Bitbucket: dachary, GitHub: dachary)


For the record, a failed hack can be found at https://bitbucket.org/ned/coveragepy/pull-requests/121/wtw-draft/diff


nedbat commented Feb 5, 2018

Original comment by Tibor (Bitbucket: tibor_arpas, GitHub: Unknown)


How about collecting, for every line, the filename and line number of the previously executed line? This would be a set, of course, which would grow only when there is a new occurrence - a new caller. No idea how much slower this would be. At the measurement stage it would be enough to just work with a hash of the filename, or an index into an array, so hopefully there is no difference from the current "wtw/context" drafts. Advantages:

  • an IDE could use the information to determine the real runtime call hierarchy (goto definition, find usages)
  • it's test-runner agnostic (the test/context is not set at runtime; it can be calculated from the stored data afterwards)
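
A sketch of that collection (again ignoring speed; a real implementation would intern filenames or use indexes, as suggested above):

from collections import defaultdict

# (filename, lineno) -> set of (filename, lineno) executed just before it.
predecessors = defaultdict(set)
last_line = None

def trace(frame, event, arg):
    global last_line
    if event == "line":
        here = (frame.f_code.co_filename, frame.f_lineno)
        if last_line is not None:
            predecessors[here].add(last_line)   # grows only for a new predecessor/caller
        last_line = here
    return trace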


nedbat commented May 8, 2018

Original comment by Tibor (Bitbucket: tibor_arpas, GitHub: Unknown)


There is now a good UI designed for this feature: http://bit.ly/livetestdemo


nedbat commented May 9, 2018

@tibor_arpas Do you have information about who made the UI, and what it is built on?


massich commented Aug 13, 2018

What is the status of this?


nedbat commented Aug 15, 2018

I'm currently working on switching to SQLite data storage, as part of this work: https://nedbatchelder.com/blog/201808/sqlite_data_storage_for_coveragepy.html .


nedbat commented Oct 9, 2018

@zooko @massich @schettino72 @tarpas @xcombelle @ChrisBeaumont @lauwe @PaulS @sdamon @guettli @RonnyPfannschmidt @kevinjqiu This is now available in v5.0a3: https://nedbatchelder.com/blog/201810/who_tests_what_is_here.html
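
For anyone trying it out: the blog post boils down to a small configuration change. Something like this in .coveragerc should record per-test contexts and show them in the HTML report (check the 5.0 docs for the exact option names):

[run]
dynamic_context = test_function

[html]
show_contexts = True

Then run the suite under coverage as usual and rebuild the HTML report; each line can then list the tests that executed it.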

Please let me know what you think!


MarcoGorelli commented Dec 27, 2022

Please let me know what you think!

I think it's beyond awesome! It took me hours to figure out how to get it working, though.

Would you consider a PR to expand the documentation for this, with some examples?

EDIT: realistically, not going to get a chance to do this any time soon I'm afraid


nedbat commented Dec 27, 2022

I will gladly take a pull request for that. It's hard sometimes to step back and see things as a newcomer would, so you have expertise I don't. Thanks.
