ENH: inter_rater.fleiss_kappa p-values and confidence interval #9207

Open
josef-pkt opened this issue Apr 14, 2024 · 1 comment · May be fixed by #9241

@josef-pkt
Member

https://stackoverflow.com/questions/78323943/statistic-values-of-fleiss-kappa-using-statsmodels-stats-inter-rater/78324041#78324041

Note that our fleiss_kappa also includes Randolph's kappa, i.e. we would need p-values for that variant as well.

(This needs a reference; I have not looked at it in a long time.)
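
For reference, a minimal sketch of the two variants exposed by the existing method argument of fleiss_kappa (the count table here is made up for illustration; rows are subjects, columns are categories):

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# made-up table: 4 subjects, 6 raters each, 3 categories
table = np.array([
    [4, 1, 1],
    [0, 6, 0],
    [2, 2, 2],
    [1, 0, 5],
])

# same table, two definitions of chance agreement
kappa_fleiss = fleiss_kappa(table, method="fleiss")      # marginal category frequencies
kappa_randolph = fleiss_kappa(table, method="randolph")  # uniform category distribution
print(kappa_fleiss, kappa_randolph)

Presumably any p-value machinery would need a variance matched to each definition of chance agreement.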

Copied from the answer:

import numpy as np
import pandas as pd
from statsmodels.stats.inter_rater import fleiss_kappa
from scipy.stats import norm

np.random.seed(42)

# simulate 15 items, each rated by 30 raters into one of 3 categories
data = {
    f'Item{i+1}': np.random.choice([0, 1, 2], size=30, p=[0.33, 0.33, 0.34]) for i in range(15)
}
df = pd.DataFrame(data)

# count table: one row per item, one column per category; each row sums to 30
formatted_data = {
    f"Category {cat}": [(df[item] == cat).sum() for item in df] for cat in range(3)
}
formatted_df = pd.DataFrame(formatted_data)

kappa = fleiss_kappa(formatted_df.values)

# note: axis=1 gives per-item totals (each equal to 30); category totals would be axis=0
category_totals = formatted_df.sum(axis=1)
p = np.sum((category_totals / (30 * 15))**2)  # sum of squared proportions

n = 15      # number of items
k = 3       # number of categories
N = n * 30  # total number of ratings

# ad hoc large-sample variance from the answer (unreferenced, see the note above)
variance = (1 / (N * (n - 1))) * (N * p * (1 - p) + (n * (k - 1) * (p - (1 / k)**2)))
if variance > 0:
    z_value = kappa / np.sqrt(variance)            # Wald-type z statistic
    p_value = 2 * (1 - norm.cdf(np.abs(z_value)))  # two-sided p-value
    z_critical = norm.ppf(0.975)
    margin_of_error = z_critical * np.sqrt(variance)
    lower_bound = kappa - margin_of_error
    upper_bound = kappa + margin_of_error

    print("Fleiss' kappa:", kappa)
    print("Z-value:", z_value)
    print("P-value:", p_value)
    print("Confidence interval (95%):", (lower_bound, upper_bound))
else:
    print("Variance calculation error: Non-positive variance", variance)
Output:

Fleiss' kappa: -0.008536683290635389
Z-value: -0.1312124600755962
P-value: 0.8956072394628303
Confidence interval (95%): (-0.13605194965657783, 0.11897858307530704)
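
Not part of the answer above, but a resampling alternative that sidesteps the unreferenced variance formula: bootstrap over subjects, i.e. resample rows of the count table with replacement and take percentile bounds of the recomputed kappas. A sketch, with bootstrap_kappa_ci a hypothetical helper name:

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

def bootstrap_kappa_ci(table, n_boot=2000, alpha=0.05, seed=0):
    # percentile bootstrap CI: resample subjects (rows) with replacement
    rng = np.random.default_rng(seed)
    n = table.shape[0]
    kappas = np.array([
        fleiss_kappa(table[rng.integers(0, n, size=n)])
        for _ in range(n_boot)
    ])
    return np.percentile(kappas, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(bootstrap_kappa_ci(formatted_df.values))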
@jseabold
Member

jseabold commented May 8, 2024

I needed this today as well, coincidentally, so I coded something up based on Equation 3 of Fleiss, Nee, and Landis (1979), "Large sample variance of kappa in the case of different sets of raters" (a paper that itself says don't do it). This is what Stata uses. If the number of raters is not the same for each subject, Stata does not produce anything for inference.

import numpy as np

def fleiss_standard_error(table):
    # large-sample standard error of Fleiss' kappa under H0 (no agreement),
    # Equation 3 of Fleiss, Nee, and Landis (1979); requires the same
    # number of ratings for every subject
    n, k = table.shape  # n_subjects, n_choices
    m = table.sum(axis=1)[0]  # assume all subjects have the same ratings count
    p_bar = table.sum(axis=0) / (n * m)  # overall category proportions
    q_bar = 1 - p_bar

    return (
        (2 ** .5 / (p_bar.dot(q_bar) * np.sqrt(n * m * (m - 1))))
        * (
            (p_bar.dot(q_bar) ** 2) - np.sum(p_bar * q_bar * (q_bar - p_bar))
        ) ** .5
    )

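For usage, a sketch of the normal-theory test this standard error supports. Since the SE is derived under H0: kappa = 0, it is better suited to a significance test than to a confidence interval; the count table here is made up:

import numpy as np
from scipy.stats import norm
from statsmodels.stats.inter_rater import fleiss_kappa

table = np.array([[4, 1, 1], [0, 6, 0], [2, 2, 2], [1, 0, 5]])
kappa = fleiss_kappa(table)
se = fleiss_standard_error(table)
z = kappa / se                 # Wald z statistic under H0: kappa = 0
p_value = 2 * norm.sf(abs(z))  # two-sided; Stata's kap reports a one-sided Prob > Z
print(kappa, se, z, p_value)
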
jseabold added a commit to jseabold/statsmodels that referenced this issue on May 9, 2024.

jseabold linked pull request #9241 on May 9, 2024 that will close this issue.