Differentiating organic vs automated installations #5499

mahmoud · 2018-06-13T08:39:25Z

What's the problem this feature will solve?

Currently, pip installation statistics are aggregated in Google Cloud and made available on libraries.io and pepy.tech. A lot of effort has gone into these numbers, but because of automation, they mean less now than they did a few years ago.

CI and other automation, combined with maybe a bit too much reliance on PyPI's central infrastructure, have inflated the download numbers and diluted the signal with noise.

Describe the solution you'd like

We could detect when pip is being used interactively (by checking if stdin is a tty or some other mechanism), and include that in the pip install request headers, to be included in the statistics generated by the server.
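A minimal sketch of the tty check described above, assuming the result would be sent as a request header. The header name `X-Pip-Interactive` and both function names are hypothetical and are not part of pip or Warehouse.

```python
import io
import sys


def is_interactive(stream=None):
    """Return True if the stream (default: sys.stdin) is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    try:
        return stream.isatty()
    except (AttributeError, ValueError):
        # stdin can be None or closed; treat those cases as non-interactive.
        return False


def install_mode_header(stream=None):
    """Build a header dict carrying the result of the tty check."""
    # "X-Pip-Interactive" is a made-up header name, used only for illustration.
    return {"X-Pip-Interactive": "1" if is_interactive(stream) else "0"}


# An in-memory stream is never a terminal, so it is classified as non-interactive.
print(install_mode_header(io.StringIO("")))
```

Whether a tty check alone is a reliable signal is a separate question from how to transmit it.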

This would provide us with much cleaner data for highlighting actual community activity, instead of drowning in automation trends, overly favoring professionalized sectors of Python. Specifically, a library being manually installed 100 times may well indicate something much more interesting than a CI (or, unfortunately, a production) fleet installing a package 10,000 times.

Additional context

I wasn't sure whether to file this on pip or on Warehouse, it seems kind of 🐔 / 🥚 to me.
I'm not really sure if/how other package indexes solve this, but would be very interested in hearing.
As an arbitrary example, I happen to know Mozilla uses PyPI for quite a few relatively-internal packages. Granted, they're open-source and I'm happy to see some infrastructure synergy. But, picking at random, mozlog actually ranks ok for downloads, even though it's not a very broadly-useful package, and I'm pretty sure the data will show it's mostly Mozilla infrastructure downloading it.

Thanks for your attention and keep up the good work!

mhsmith · 2018-06-13T14:11:56Z

Note that because of pypi/linehaul#30, the numbers on Google BigQuery may already be meaningless for answering some questions.

pradyunsg · 2018-06-13T16:33:40Z

I agree. It would definitely be useful to have separation of automated vs direct usage.

Maybe pypa/packaging-problems would be a good place for it?

mahmoud · 2018-06-15T02:11:16Z

@mhsmith like Nathaniel pointed out, the lossage should be fairly uniform, so I think the numbers would still be somewhat representative, if we were collecting them on top of the leaky linehaul, that is :)

@pradyunsg Glad you agree! Given that I suspect (and suggested) a straightforward pip enhancement, I'd like to keep this issue open. That said, I may cross-post this there, if you think it would improve the visibility. Let me know if so!

dstufft · 2018-06-15T15:15:40Z

I think the fundamental problem here is that I don't think you can actually detect this reasonably. For instance, if someone manually runs a bash script (or even a tox command), we'd probably want that not to be flagged as automated, but by default those things will not have a tty. On the flip side, you have things like Travis CI, which I believe mimics a tty, so Travis CI will look like a manual install instead of an automated one.
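To illustrate one way a human-launched wrapper can hide the terminal, the sketch below starts a child Python process with its stdin connected to a pipe and asks the child what `isatty()` reports. The function name is hypothetical, and whether a given wrapper (a shell script, tox, and so on) actually pipes stdin depends on that tool.

```python
import subprocess
import sys

# Code executed by the child process: report whether its stdin is a terminal.
CHILD_CODE = "import sys; print(sys.stdin.isatty())"


def child_sees_tty_with_piped_stdin():
    """Run a child Python with stdin attached to a pipe and return its isatty()."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input="",  # passing input makes subprocess connect stdin to a pipe
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"
```

Even if a person typed the outer command, the child in this setup sees a pipe rather than a terminal, so a tty-based check would classify it as automated.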

On a theoretical level, I don't have any problem with the idea. I just have never been able to think of a good way of actually differentiating the types of use automatically.

njsmith · 2018-07-22T04:41:14Z

If we want to detect running under CI, I think that's actually fairly easy, because CI systems tend to advertise that fact in the environment. Just checking for "CI" in os.environ or "BUILD_ID" in os.environ or "BUILD_BUILDID" in os.environ would probably catch 95% of cases (including at least Travis CI, AppVeyor, CircleCI, Jenkins, VSTS).
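The check described above can be sketched as follows. The three variable names come from the comment; the function and constant names are illustrative, and the optional `environ` parameter is an assumption added so the check can be exercised without touching the real environment.

```python
import os

# Variable names taken from the comment above.
CI_HINT_VARIABLES = ("CI", "BUILD_ID", "BUILD_BUILDID")


def looks_like_ci(environ=None):
    """Return True if any of the CI marker variables is present."""
    environ = os.environ if environ is None else environ
    # Only presence is checked; the value of the variable is ignored.
    return any(name in environ for name in CI_HINT_VARIABLES)
```

For example, `looks_like_ci({"CI": "true"})` returns `True` and `looks_like_ci({})` returns `False`.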

Or if you want to get fancier, it looks like the ci-info package (2.5 million weekly downloads) has a fairly comprehensive list of envvars to check for: https://github.com/watson/ci-info/blob/master/index.js
(Looks like they're missing VSTS though.)

hroncok · 2019-02-14T18:06:06Z

See https://github.com/The-Compiler/pytest-vw for a Python project that can detect CI.

pradyunsg · 2019-02-15T06:37:00Z

Yeah, it isn't difficult to detect whether you're running in CI on most CI services, or for that matter even which one you're running on. We likely still won't know what percentage of the non-CI runs are not automated, but having a separation between CI and non-CI is a good start.

I don't know if we'd want to have any distinction between various CI services (logging NULL if we don't have the information, otherwise a string like "travis" representing the service).
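A rough sketch of the per-service variant, returning `None` (which would be logged as NULL) when no service is recognized. The marker variable names below are assumptions based on what each service is commonly documented to set, the list is not exhaustive, and the function name is hypothetical.

```python
import os

# (service label, marker variable) pairs. The variable names are assumptions
# drawn from each service's public documentation.
SERVICE_MARKERS = (
    ("travis", "TRAVIS"),
    ("appveyor", "APPVEYOR"),
    ("circleci", "CIRCLECI"),
    ("jenkins", "JENKINS_URL"),
)


def detect_ci_service(environ=None):
    """Return a short service label, or None if no known marker is present."""
    environ = os.environ if environ is None else environ
    for service, marker in SERVICE_MARKERS:
        if marker in environ:
            return service
    return None
```

The first matching marker wins, so the order of the pairs matters if an environment sets more than one of them.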

cjerdonek · 2019-02-16T23:43:06Z

I posted #6273 to start addressing this.

…-agent Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI

cjerdonek · 2019-02-24T22:30:51Z

I'm going to leave this open for now, rather than letting it auto-close, so we can discuss whether an additional key-value pair should be added to store the value of isatty(). The PR that was just merged stores a different piece of information: whether pip is known to be running in CI.
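As a sketch of what carrying both values in the User-Agent could look like, the function below serializes a small JSON payload after a `pip/<version>` prefix. The payload layout is simplified, and the `stdin_is_tty` key is hypothetical: it stands in for the additional key-value pair under discussion and is not something the merged PR added.

```python
import json
import sys


def build_user_agent(pip_version, in_ci, stdin_is_tty=None):
    """Compose a User-Agent string with a JSON payload describing the run."""
    data = {
        "installer": {"name": "pip", "version": pip_version},
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        # True when CI is detected, otherwise None (serialized as JSON null).
        "ci": True if in_ci else None,
    }
    if stdin_is_tty is not None:
        # Hypothetical extra key for the isatty() value discussed above.
        data["stdin_is_tty"] = bool(stdin_is_tty)
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return "pip/{} {}".format(pip_version, payload)
```

Keeping the two values in separate keys would let whoever analyzes the data treat "known to be in CI" and "stdin is a terminal" as independent signals.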

pradyunsg · 2019-02-25T05:51:00Z

FWIW, I pinged on #zuul on Freenode, to see if anyone there has inputs on how to detect running within Zuul. That said, better detection of that is not a blocker in any form.

theacodes · 2019-04-29T07:24:38Z

I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

mahmoud · 2019-04-29T16:51:45Z

@theacodes if one can set an environment variable, wouldn't setting CI=true achieve just that? Or will that have an impact on other parts of the CI?

theacodes · 2019-04-29T17:38:46Z

Yeah, it might have unintended consequences.


cjerdonek · 2019-05-07T03:20:21Z

> I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

I think it'd be fine (and low maintenance) to support this. The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable here:

pip/src/pip/_internal/download.py, line 80 in 5a00ac4:

CI_ENVIRONMENT_VARIABLES = (
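A sketch of the change being proposed, assuming the tuple is consulted with a simple presence check. The tuple contents shown here are reconstructed from the variables named earlier in this thread rather than copied from the referenced file, and `environment_for_custom_ci` is a hypothetical helper showing how a custom CI system might opt in.

```python
import os

# Reconstructed from the variables discussed in this thread; the real tuple
# in pip may differ.
CI_ENVIRONMENT_VARIABLES = (
    "BUILD_BUILDID",
    "BUILD_ID",
    "CI",
    # Proposed explicit opt-in for CI systems that set none of the above.
    "PIP_IS_CI",
)


def looks_like_ci(environ=None):
    """Return True if any listed variable is present, whatever its value."""
    environ = os.environ if environ is None else environ
    return any(name in environ for name in CI_ENVIRONMENT_VARIABLES)


def environment_for_custom_ci(base=None):
    """Return a copy of the environment with PIP_IS_CI set, for launching pip."""
    env = dict(os.environ if base is None else base)
    env["PIP_IS_CI"] = "1"
    return env
```

Because only presence is checked in this sketch, `PIP_IS_CI=0` would also count as CI; a real implementation might choose to parse the value instead.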

pradyunsg · 2019-05-19T13:31:23Z

> The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable

@theacodes If you could file a PR for this, that'd be great!

methane · 2019-05-20T01:36:46Z

Is PIP_IS_CI recommended for non-CI automated installations?
For example, provisioning server via cloud-init or ansible.

cjerdonek · 2019-05-23T16:47:53Z

> Is PIP_IS_CI recommended for non-CI automated installations?

It seems like it should be for any automated runs, but I'm not the one using this data. Is it worth making the environment variable name more descriptive (e.g. PIP_IS_AUTOMATED)? I'm also not sure to what extent this should be publicized / recommended for others to use.

cjerdonek · 2019-05-23T18:55:39Z

Reflecting a bit more on this, to @methane's implicit point, if we're going to expose an environment variable I'm thinking it would be better to call it something like PIP_IS_AUTOMATED. That would document the intent more clearly.

njsmith · 2019-05-23T20:58:39Z

I think there are several different things we might be trying to track here.

Test vs non-test: installs for testing are "subsidiary" to "real" installs: they don't directly solve someone's problem; their purpose is just to make sure things are working for later when someone tries to use the code for its primary purpose. If you want to count how many installs are intended to use the code for its primary purpose, then you want to eliminate test installs. But if someone installs on a big fleet of production boxes, that's real usage.

Automated versus interactive: if you want to count how many people actually typed "pip install mypackage", then that's a different question, and automated installs *shouldn't* count.

In principle, maybe we should track both of these separately. More data allows you to do more :-).

In practice, I don't think we have any technical mechanism to track automated vs interactive installs. Even if everyone on this thread goes off and manually updates their deployment system to set some magic envvar, I'm guessing the vast majority of automated installs *won't* set that envvar, and that will make the data really hard to interpret.

A field for "is this running in CI?" is also hard to interpret or connect to what we really want to know, like how many users our project has. But it's at least technically feasible, and it's easy to communicate what it does and doesn't mean to people trying to interpret the data.

So I'm inclined to say, let's just keep it as a CI flag for now. And we can always revisit once we see the data :-)


cjerdonek · 2019-05-23T21:43:32Z

Okay, that's fine with me. And that would mean then that the answer to @methane's original question ("Is PIP_IS_CI recommended for non-CI automated installations?") is no.

pradyunsg added the "state: needs discussion" and "type: feature request" labels Jun 15, 2018

cjerdonek mentioned this issue Feb 16, 2019

Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI #6273

Merged

cjerdonek closed this as completed in #6273 Feb 24, 2019

cjerdonek added a commit that referenced this issue Feb 24, 2019

Merge pull request #6273 from cjerdonek/issue-5499-detect-ci-for-user…

821247d

…-agent Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI

cjerdonek reopened this Feb 24, 2019

cjerdonek added the "type: enhancement" label and removed the "type: feature request" label Feb 24, 2019

theacodes mentioned this issue May 22, 2019

Check for explicit PIP_IS_CI environment variable to report automated installs to Warehouse. #6522

Merged

cjerdonek closed this as completed in #6522 May 24, 2019

lock bot added the "auto-locked" label Jun 23, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 23, 2019
