
Ensure all attributes are documented #14312

Closed
amueller opened this issue Jul 12, 2019 · 82 comments · Fixed by #14320
Labels
Documentation, Easy (Well-defined and straightforward way to resolve), good first issue (Easy with clear instructions to resolve), help wanted

Comments

@amueller
Member

amueller commented Jul 12, 2019

As discussed in #13385, we need to ensure all attributes are documented.

If you want to work on this, pick a specific submodule and fix all the attribute documentation mismatches in that submodule.

Here's a script to find the remaining ones (there might be some false positives):

import numpy as np
from sklearn.base import clone
from sklearn.utils.testing import all_estimators
from sklearn.utils.estimator_checks import pairwise_estimator_convert_X, enforce_estimator_tags_y
from numpydoc import docscrape

ests = all_estimators()

for name, Est in ests:
    try:
        estimator_orig = Est()
    except Exception:
        # Skip estimators that cannot be constructed with default parameters.
        continue
    rng = np.random.RandomState(0)
    X = pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig)
    X = X.astype(object)
    y = (X[:, 0] * 4).astype(int)
    est = clone(estimator_orig)
    y = enforce_estimator_tags_y(est, y)
    try:
        est.fit(X, y)
    except Exception:
        # Skip estimators that cannot be fitted on this toy data.
        continue
    # Public fitted attributes end with a single trailing underscore.
    fitted_attrs = [(x, getattr(est, x, None))
                    for x in est.__dict__.keys()
                    if x.endswith("_") and not x.startswith("_")]
    doc = docscrape.ClassDoc(type(est))
    doc_attributes = []
    incorrect = []
    for att_name, type_definition, param_doc in doc['Attributes']:
        if not type_definition.strip():
            if ':' in att_name and att_name[:att_name.index(':')][-1:].strip():
                incorrect += [name +
                              ' There was no space between the param name and '
                              'colon (%r)' % att_name]
            elif att_name.rstrip().endswith(':'):
                incorrect += [name +
                              ' Parameter %r has an empty type spec. '
                              'Remove the colon' % (att_name.lstrip())]

        if '*' not in att_name:
            doc_attributes.append(att_name.split(':')[0].strip('` '))
    assert incorrect == []
    fitted_attrs_names = [x[0] for x in fitted_attrs]

    # Symmetric difference: attributes that are fitted but undocumented,
    # or documented but never set during fit.
    bad = sorted(set(fitted_attrs_names) ^ set(doc_attributes))
    if len(bad) > 0:
        msg = '{}\n'.format(name) + '\n'.join(bad)
        print("Docstring Error: Attribute mismatch in " + msg)
amueller added the Easy (Well-defined and straightforward way to resolve), Documentation, good first issue (Easy with clear instructions to resolve), help wanted, and Sprint labels on Jul 12, 2019
@alexitkes
Contributor

alexitkes commented Jul 12, 2019

I have already found at least one mismatch in attribute documentation in the NMF class description. I think I can take on some of this work. I am almost ready to propose some changes within the decomposition and random_projection submodules.

@thomasjpfan
Member

thomasjpfan commented Jul 13, 2019

Missing attribute docstrings for each estimator

Reference this issue in your PR

  • ARDRegression, [intercept_]
  • AdaBoostClassifier, [base_estimator_]
  • AdaBoostRegressor, [base_estimator_]
  • AdditiveChi2Sampler, [sample_interval_]
  • AgglomerativeClustering, [n_components_] (deprecated)
  • BaggingClassifier, [n_features_]
  • BaggingRegressor, [base_estimator_, n_features_]
  • BayesianGaussianMixture, [mean_precision_prior, mean_precision_prior_]
  • BayesianRidge, [X_offset_, X_scale_]
  • BernoulliNB, [coef_, intercept_]
  • BernoulliRBM, [h_samples_]
  • Birch, [fit_, partial_fit_]
  • CCA, [coef_, x_mean_, x_std_, y_mean_, y_std_]
  • CheckingClassifier, [classes_]
  • ComplementNB, [coef_, intercept_]
  • CountVectorizer, [stop_words_, vocabulary_]
  • DecisionTreeRegressor, [classes_, n_classes_]
  • DictVectorizer, [feature_names_, vocabulary_]
  • DummyClassifier, [output_2d_]
  • DummyRegressor, [output_2d_]
  • ElasticNet, [dual_gap_]
  • ElasticNetCV, [dual_gap_]
  • EllipticEnvelope, [dist_, raw_covariance_, raw_location_, raw_support_]
  • ExtraTreeClassifier, [feature_importances_]
  • ExtraTreeRegressor, [classes_, feature_importances_, n_classes_]
  • ExtraTreesClassifier, [base_estimator_]
  • ExtraTreesRegressor, [base_estimator_]
  • FactorAnalysis, [mean_]
  • FeatureAgglomeration, [n_components_]
  • GaussianProcessClassifier, [base_estimator_]
  • GaussianRandomProjection, [components_]
  • GradientBoostingClassifier, [max_features_, n_classes_, n_features_, oob_improvement_]
  • GradientBoostingRegressor, [max_features_, n_classes_, n_estimators_, n_features_, oob_improvement_]
  • HistGradientBoostingClassifier, [bin_mapper_, classes_, do_early_stopping_, loss_, n_features_, scorer_]
  • HistGradientBoostingRegressor, [bin_mapper_, do_early_stopping_, loss_, n_features_, scorer_]
  • IncrementalPCA, [batch_size_]
  • IsolationForest, [base_estimator_, estimators_features_, n_features_]
  • IsotonicRegression, [X_max_, X_min_, f_]
  • IterativeImputer, [random_state_]
  • KNeighborsClassifier, [classes_, effective_metric_, effective_metric_params_, outputs_2d_]
  • KNeighborsRegressor, [effective_metric_, effective_metric_params_]
  • KernelCenterer, [K_fit_all_, K_fit_rows_]
  • KernelDensity, [tree_]
  • KernelPCA, [X_transformed_fit_, dual_coef_]
  • LabelBinarizer, [classes_, sparse_input_, y_type_]
  • LabelEncoder, [classes_]
  • LarsCV, [active_]
  • Lasso, [dual_gap_]
  • LassoLarsCV, [active_]
  • LassoLarsIC, [alphas_]
  • LatentDirichletAllocation, [bound_, doc_topic_prior_, exp_dirichlet_component_, random_state_, topic_word_prior_]
  • LinearDiscriminantAnalysis, [covariance_]
  • LinearRegression, [rank_, singular_]
  • LinearSVC, [classes_]
  • LocalOutlierFactor, [effective_metric_, effective_metric_params_]
  • MDS, [dissimilarity_matrix_, n_iter_]
  • MLPClassifier, [best_loss_, loss_curve_, t_]
  • MLPRegressor, [best_loss_, loss_curve_, t_]
  • MinMaxScaler, [n_samples_seen_]
  • MiniBatchDictionaryLearning, [iter_offset_]
  • MiniBatchKMeans, [counts_, init_size_, n_iter_]
  • MultiLabelBinarizer, [classes_]
  • MultiTaskElasticNet, [dual_gap_, eps_, sparse_coef_]
  • MultiTaskElasticNetCV, [dual_gap_]
  • MultiTaskLasso, [dual_gap_, eps_, sparse_coef_]
  • MultiTaskLassoCV, [dual_gap_]
  • NearestCentroid, [classes_]
  • NearestNeighbors, [effective_metric_, effective_metric_params_]
  • NeighborhoodComponentsAnalysis, [random_state_]
  • NuSVC, [class_weight_, fit_status_, probA_, probB_, shape_fit_]
  • NuSVR, [class_weight_, fit_status_, n_support_, probA_, probB_, shape_fit_]
  • OAS, [location_]
  • OneClassSVM, [class_weight_, fit_status_, n_support_, probA_, probB_, shape_fit_]
  • OneVsOneClassifier, [n_classes_]
  • OneVsRestClassifier, [coef_, intercept_, n_classes_]
  • OrthogonalMatchingPursuit, [n_nonzero_coefs_]
  • PLSCanonical, [coef_, x_mean_, x_std_, y_mean_, y_std_]
  • PLSRegression, [x_mean_, x_std_, y_mean_, y_std_]
  • PLSSVD, [x_mean_, x_std_, y_mean_, y_std_]
  • PassiveAggressiveClassifier, [loss_function_, t_]
  • PassiveAggressiveRegressor, [t_]
  • Perceptron, [loss_function_]
  • QuadraticDiscriminantAnalysis, [classes_, covariance_]
  • RBFSampler, [random_offset_, random_weights_]
  • RFE, [classes_]
  • RFECV, [classes_]
  • RadiusNeighborsClassifier, [classes_, effective_metric_, effective_metric_params_, outputs_2d_]
  • RadiusNeighborsRegressor, [effective_metric_, effective_metric_params_]
  • RandomForestClassifier, [oob_decision_function_, oob_score_]
  • RandomForestRegressor, [oob_prediction_, oob_score_]
  • RandomTreesEmbedding, [base_estimator_, feature_importances_, n_features_, n_outputs_, one_hot_encoder_]
  • RidgeCV, [cv_values_]
  • RidgeClassifier, [classes_]
  • RidgeClassifierCV, [cv_values_]
  • SGDClassifier, [classes_, t_]
  • SGDRegressor, [average_coef_, average_intercept_]
  • SVC, [class_weight_, shape_fit_]
  • SVR, [class_weight_, fit_status_, n_support_, probA_, probB_, shape_fit_]
  • SelectKBest, [pvalues_, scores_]
  • ShrunkCovariance, [shrinkage]
  • SkewedChi2Sampler, [random_offset_, random_weights_]
  • SparseRandomProjection, [components_, density_]
  • SpectralEmbedding, [n_neighbors_]
  • TfidfVectorizer, [stop_words_, vocabulary_]

thomasjpfan added this to "To do" in Sprint Scipy 2019 on Jul 13, 2019
@mepa
Contributor

mepa commented Jul 13, 2019

I can take up the tree submodule attribute documentation mismatches, which include:

  • DecisionTreeRegressor, [classes_, n_classes_]
  • ExtraTreeClassifier, [classes_, max_features_, n_classes_, n_features_, n_outputs_, tree_]
  • ExtraTreeRegressor, [classes_, max_features_, n_classes_, n_features_, n_outputs_, tree_]

@wendyhhu
Contributor

I'm working on LinearRegression, [rank_, singular_].

@wendyhhu
Contributor

I'm working on LinearSVC, [n_iter_] and LinearSVR, [n_iter_]

@matsmaiwald

I'll take up gradient boosting, i.e.

  • GradientBoostingClassifier [base_estimator_, max_features_, n_classes_, n_features_]
  • GradientBoostingRegressor [base_estimator_, classes_, max_features_, n_estimators_, n_features_]

TomDLT reopened this on Jul 14, 2019
@matsmaiwald

Never mind, I misread where attributes are missing and where they are not.

@alexitkes
Contributor

It looks like the classes_ attribute is also undocumented for the classifiers in the naive_bayes submodule. I have started to fix it.

@mandalbiswadip
Contributor

I will work on TfidfVectorizer, [fixed_vocabulary_]

@rcwoolston
Contributor

rcwoolston commented Jul 14, 2019

I will work on:

  • RandomForestClassifier, [base_estimator_]
  • RandomForestRegressor, [base_estimator_, n_classes_]
  • ExtraTreesClassifier, [base_estimator_]
  • ExtraTreesRegressor, [base_estimator_, n_classes_]

@wendyhhu
Contributor

wendyhhu commented Jul 14, 2019

I'm working on:

  • SGDClassifier, [average_coef_, average_intercept_, standard_coef_, standard_intercept_]
  • SGDRegressor, [standard_coef_, standard_intercept_]

EDIT: opened an issue to change these attributes from public to private (reference: #14364)

@SwordKnight6216
Contributor

I am working on:

  • KernelCenterer, [K_fit_all_, K_fit_rows_]
  • MinMaxScaler, [n_samples_seen_]

@rcwoolston
Contributor

I will work on:

  • RandomTreesEmbedding, [base_estimator_, classes_, feature_importances_, n_classes_, n_features_, n_outputs_, one_hot_encoder_]

@marenwestermann
Member

I'm working on Lasso.

@marenwestermann
Member

I'm now working on adding the attribute sparse_coef_ to MultiTaskElasticNet and MultiTaskLasso.

@marenwestermann
Member

I'm working on LarsCV.

@marenwestermann
Member

@thomasjpfan the docstrings of the SVR and OneClassSVM classes say:
"The probA_ attribute is deprecated in version 0.23 and will be removed in version 0.25." and
"The probB_ attribute is deprecated in version 0.23 and will be removed in version 0.25."

Therefore, these attributes probably don't need documentation anymore, right?
Going from here, will these two attributes also be deprecated in the class NuSVR?

@marenwestermann
Member

The attributes classes_ and n_classes_ for ExtraTreeRegressor are false positives.

@thomasjpfan
Member

Therefore, these attributes probably don't need documentation anymore, right?
Going from here, will these two attributes also be deprecated in the class NuSVR?

Since we are deprecating them, I would say we do not need to document them.

The attributes classes_ and n_classes_ for ExtraTreeRegressor are false positives.

Yup, those should be deprecated and then removed if they are not already.
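For reference, a hedged sketch of the deprecation pattern typically used for fitted attributes in scikit-learn; the class and attribute here are hypothetical, and the decorator's import path has varied between releases:

from sklearn.utils import deprecated

class SomeRegressor:
    # @deprecated turns the property into one that warns on access; the message
    # text mirrors the wording quoted from the docstrings above.
    @deprecated("Attribute classes_ was deprecated in version 0.22 and "
                "will be removed in 0.24.")
    @property
    def classes_(self):
        return self._classes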

@Abilityguy

The DecisionTreeRegressor class says:
"the n_classes_ attribute is to be deprecated from version 0.22 and will be removed in 0.24."
"the classes_ attribute is to be deprecated from version 0.22 and will be removed in 0.24."

So these attributes also don't need documentation, right?

@cmarmo
Member

cmarmo commented Sep 16, 2020

So these attributes also don't need documentation, right?

Right @Abilityguy, thanks for pointing that out.

@mynkdsi1011

I can see the below mismatch in RidgeGCV:

Docstring Error: Attribute mismatch in RidgeGCV
alpha
best_score
coef_
dual_coef_
intercept_
n_features_in_

and in BaseRidgeCV:

Docstring Error: Attribute mismatch in BaseRidgeCV
alpha
best_score
coef_
intercept_
n_features_in_

Can I take it up? I am a first-timer and want to contribute.

@srivathsa729

srivathsa729 commented Sep 26, 2020

@marenwestermann in the class FeatureAgglomeration it is said that, in version 0.21, n_connected_components_ was added to replace n_components_, so n_components_ would be a false positive, right?

@marenwestermann
Member

@srivathsa729 from my understanding, yes. However, it would be good if one of the core developers could double-check.

@disha4u

disha4u commented Oct 5, 2020

I will take up ElasticNet

@marenwestermann
Member

marenwestermann commented Nov 4, 2020

Documentation of the attributes X_offset_ and X_scale_ for BayesianRidge has been added with #18607 .

@marenwestermann
Member

The attribute output_2d_ is deprecated in DummyClassifier and DummyRegressor (see #14933).

@marenwestermann
Member

I ran the script provided by @amueller at the top of this issue (the code needs to be slightly modified because things have moved around; see the sketch below). I couldn't find any more attributes that need to be documented, with the exception of n_features_in_, which I see was introduced in #16112. This attribute is undocumented in, I think, all classes it was introduced to. Should it be documented?
ping @NicolasHug
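In case it helps anyone re-running the check, a rough sketch of how the imports can be updated for a newer scikit-learn; the private helper names below are assumptions and may differ between releases:

# all_estimators is now public in sklearn.utils; the estimator-check helpers
# became private (leading underscore). The exact names are assumptions here.
from sklearn.utils import all_estimators
from sklearn.utils.estimator_checks import (
    _enforce_estimator_tags_y,      # formerly enforce_estimator_tags_y
    _pairwise_estimator_convert_X,  # formerly pairwise_estimator_convert_X
)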

@ShyamDesai
Contributor

Hello. I wanted to take this on as a first issue, but it seems that all attributes have already been documented?

@cmarmo
Member

cmarmo commented Feb 23, 2021

Thanks @marenwestermann for checking! This is very helpful.
n_features_in_ documentation is now tracked in #19333.

@cmarmo
Member

cmarmo commented Feb 23, 2021

It turns out that all detections from the script in the description are false positives, so I'm closing this one. Thanks to all the contributors for their helpful work!
