Bayesian test for the difference of two proportions

In [17]:
import pymc3
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Difference of proportions with PyMC3

Introduction

A closed-form solution for estimating the distribution of the difference of two binomial proportions with Bayesian statistics is described in detail in (Pham-Gia et al. 2017).

Here I attempt the same analysis, comparing two proportions with the Bayesian statistics package PyMC3. The closed-form solution (built from infinite series, which do converge) is rather involved; implementing it would give exact results. Even so, it is a very instructive exercise to solve the problem approximately using Monte Carlo simulation with PyMC3.

Method

The model is quite simple:

In [74]:
n1, k1, n2, k2 = 50, 10, 40, 5   # example data (the counts used in the first test below)
model = pymc3.Model()
with model:
    th1 = pymc3.Beta('th1', alpha=0.5, beta=0.5)
    p1 = pymc3.Binomial('p1', n1, p=th1, observed=k1)
    th2 = pymc3.Beta('th2', alpha=0.5, beta=0.5)
    p2 = pymc3.Binomial('p2', n2, p=th2, observed=k2)
    d = pymc3.Deterministic("d", th1 - th2)
model
Out[74]:
$$ \begin{array}{rcl} \text{th1} &\sim & \text{Beta}(\mathit{alpha}=0.5,~\mathit{beta}=0.5)\\\text{th2} &\sim & \text{Beta}(\mathit{alpha}=0.5,~\mathit{beta}=0.5)\\\text{d} &\sim & \text{Deterministic}(\text{Constant},~\text{Constant},~\text{th1_logodds__},~\text{Constant},~\text{Constant},~\text{Constant},~\text{th2_logodds__},~\text{Constant})\\\text{p1} &\sim & \text{Binomial}(\mathit{n}=50,~\mathit{p}=\text{th1})\\\text{p2} &\sim & \text{Binomial}(\mathit{n}=40,~\mathit{p}=\text{th2}) \end{array} $$

There are two Beta prior distributions, two Binomial distributions that model the observed counts, and the deterministic term for the difference of proportions:

$d = \theta_1 - \theta_2$

I built two alternative models. The first uses a Beta(0.5, 0.5) prior, known as the Jeffreys prior, which places most of the prior mass near 0 and 1. The second alternative uses Beta(1, 1), which encodes the prior belief that the proportion is uniformly distributed over $[0, 1]$.
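Because the Beta prior is conjugate to the Binomial likelihood, each posterior is available in closed form: after observing $k_i$ successes in $n_i$ trials, $\theta_i \sim \text{Beta}(\alpha + k_i,\ \beta + n_i - k_i)$. The posterior of $d$ can therefore also be approximated by direct sampling from the two Beta posteriors, with no MCMC at all, which makes a useful cross-check on the PyMC3 results. A minimal NumPy sketch (the function name is mine, and it reports an equal-tailed interval rather than an HPD interval):

def diffDirectSample(n1, k1, n2, k2, alpha=0.5, beta=0.5, draws=100000):
    # Conjugacy: Beta(alpha, beta) prior + Binomial(n, theta) likelihood
    # gives a Beta(alpha + k, beta + n - k) posterior for each proportion.
    th1 = np.random.beta(alpha + k1, beta + n1 - k1, draws)
    th2 = np.random.beta(alpha + k2, beta + n2 - k2, draws)
    d = th1 - th2
    # 95% equal-tailed credible interval (not an HPD interval).
    return np.percentile(d, [2.5, 97.5])

print(diffDirectSample(50, 10, 40, 5))  # should be close to the HPD for d reported below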

In [71]:
def plotTrace(trace, varname):
    # Plot a normalized histogram of the posterior samples of one variable.
    samples = trace[varname]
    hist, bins = np.histogram(samples,
                              bins=np.linspace(np.min(samples), np.max(samples), 20),
                              density=True)
    width = 0.7 * (bins[1] - bins[0])
    center = (bins[:-1] + bins[1:]) / 2
    plt.bar(center, hist, align='center', width=width)

def propsDiff(n1, k1, n2, k2):
    # Model with Jeffreys priors Beta(0.5, 0.5) on both proportions.
    model = pymc3.Model()
    with model:
        th1 = pymc3.Beta('th1', alpha=0.5, beta=0.5)
        p1 = pymc3.Binomial('p1', n1, p=th1, observed=k1)
        th2 = pymc3.Beta('th2', alpha=0.5, beta=0.5)
        p2 = pymc3.Binomial('p2', n2, p=th2, observed=k2)
        d = pymc3.Deterministic("d", th1 - th2)
    print(model)
    map_estimate = pymc3.find_MAP(model=model)
    print("MAP quick Estimates:", map_estimate)
    print("Difference, according to data = ", k1/n1-k2/n2)
    with model:
        # draw 10000 posterior samples
        p_trace = pymc3.sample(10000)
    # Posterior histograms for th1, th2 and the difference d.
    plt.rcParams["figure.figsize"] = (12,3)
    plt.subplot(1,3,1)
    plotTrace(p_trace, "th1")
    plt.subplot(1,3,2)
    plotTrace(p_trace, "th2")
    plt.subplot(1,3,3)
    plotTrace(p_trace, "d")
    print("HPD: ", pymc3.stats.hpd(p_trace))
    
    
def propsDiffB(n1, k1, n2, k2):
    # Same model, but with uniform priors Beta(1, 1) on both proportions.
    model = pymc3.Model()
    with model:
        th1 = pymc3.Beta('th1', alpha=1, beta=1)
        p1 = pymc3.Binomial('p1', n1, p=th1, observed=k1)
        th2 = pymc3.Beta('th2', alpha=1, beta=1)
        p2 = pymc3.Binomial('p2', n2, p=th2, observed=k2)
        d = pymc3.Deterministic("d", th1 - th2)
    print(model)
    map_estimate = pymc3.find_MAP(model=model)
    print("MAP quick Estimates:", map_estimate)
    print("Difference, according to data = ", k1/n1-k2/n2)
    with model:
        # draw 10000 posterior samples
        p_trace = pymc3.sample(10000)
    # Posterior histograms for th1, th2 and the difference d.
    plt.rcParams["figure.figsize"] = (12,3)
    plt.subplot(1,3,1)
    plotTrace(p_trace, "th1")
    plt.subplot(1,3,2)
    plotTrace(p_trace, "th2")
    plt.subplot(1,3,3)
    plotTrace(p_trace, "d")
    print("HPD: ", pymc3.stats.hpd(p_trace))

Results

Two tests of each model are shown below: one on a population with few cases and one on a population with considerably more cases.

In the first test, two groups are compared, one with 50 cases and the other with 40, with proportions 0.2 (10/50) and 0.125 (5/40) respectively. The difference is 0.075.

In the second, the groups have 500 and 400 cases, with the same proportions as above and therefore the same difference.

As expected, in the first test the highest posterior density (HPD) interval of the estimated difference includes zero, so the null hypothesis that the proportions are equal cannot be rejected. In the second test zero falls outside the HPD interval, so we can conclude that there is a credible difference between the proportions.
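In code, this decision rule is just a check of whether zero lies inside the HPD interval of the samples of $d$. A sketch, given a trace p_trace like the ones produced above, and assuming that pymc3.stats.hpd applied to a 1-D array of samples returns the [lower, upper] interval in this PyMC3 version:

lo, hi = pymc3.stats.hpd(p_trace['d'])
credible_difference = not (lo <= 0 <= hi)  # True when zero is outside the HPD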

Each run prints the quick MAP estimates and the difference implied by the data, draws a modest sample (10000 draws), and reports the HPD intervals. The posterior distributions of $\theta_1$, $\theta_2$ and $d$ are then plotted.

The MAP estimates under the Jeffreys prior land slightly away from the data proportions, while the flat prior recovers them exactly (up to numerical precision).
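This is exactly what the posterior mode predicts. With a Beta$(\alpha, \beta)$ prior and $k$ successes in $n$ trials, the posterior is Beta$(\alpha + k,\ \beta + n - k)$, whose mode is

$$\hat{\theta}_{\mathrm{MAP}} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}$$

For Beta(1, 1) this reduces to $k/n$, the data proportion itself; for the Jeffreys prior Beta(0.5, 0.5) it is $(k - 0.5)/(n - 1)$, e.g. $9.5/49 \approx 0.1939$ for $k_1 = 10$, $n_1 = 50$, which matches the MAP output below.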

Test with a small sample and a Beta(0.5, 0.5) prior

In [72]:
propsDiff(50,10, 40, 5)
<pymc3.model.Model object at 0x7f5361fe3828>
logp = -26.837, ||grad|| = 21.213: 100%|██████████| 10/10 [00:00<00:00, 3203.72it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
MAP quick Estimates: {'th1_logodds__': array(-1.42500895), 'th2_logodds__': array(-2.03688232), 'th1': array(0.19387754), 'th2': array(0.11538458), 'd': array(0.07849296)}
Difference, according to data =  0.07500000000000001
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [th2, th1]
Sampling 2 chains: 100%|██████████| 21000/21000 [00:06<00:00, 3317.74draws/s]
HPD:  {0: {'th1_logodds__': array([-2.09294563, -0.70250104]), 'th2_logodds__': array([-2.88691161, -1.02638267]), 'th1': array([0.10075487, 0.31781437]), 'th2': array([0.04062569, 0.24022526]), 'd': array([-0.08034746,  0.22168089])}, 1: {'th1_logodds__': array([-2.09188145, -0.72587994]), 'th2_logodds__': array([-2.88527632, -1.00449275]), 'th1': array([0.10375017, 0.31767871]), 'th2': array([0.04019962, 0.23813412]), 'd': array([-0.07932138,  0.22192351])}}

Test with a large sample and a Beta(0.5, 0.5) prior

In [73]:
propsDiff(500,100, 400, 50)
<pymc3.model.Model object at 0x7f536254b828>
logp = -229.75, ||grad|| = 212.13: 100%|██████████| 10/10 [00:00<00:00, 3020.74it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
MAP quick Estimates: {'th1_logodds__': array(-1.39005615), 'th2_logodds__': array(-1.954531), 'th1': array(0.19939879), 'th2': array(0.12406014), 'd': array(0.07533865)}
Difference, according to data =  0.07500000000000001
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [th2, th1]
Sampling 2 chains: 100%|██████████| 21000/21000 [00:06<00:00, 3337.08draws/s]
HPD:  {0: {'th1_logodds__': array([-1.6000989 , -1.16843006]), 'th2_logodds__': array([-2.24587141, -1.66452981]), 'th1': array([0.16625677, 0.23508809]), 'th2': array([0.09550201, 0.15884491]), 'd': array([0.02771922, 0.12135239])}, 1: {'th1_logodds__': array([-1.61078082, -1.16898728]), 'th2_logodds__': array([-2.23953405, -1.65293047]), 'th1': array([0.16633756, 0.23685864]), 'th2': array([0.0948786 , 0.15876037]), 'd': array([0.02935635, 0.12390263])}}

Test with a small sample and a Beta(1, 1) prior

In [69]:
propsDiffB(50,10, 40, 5)
<pymc3.model.Model object at 0x7f53631a2390>
logp = -25.934, ||grad|| = 21.213: 100%|██████████| 10/10 [00:00<00:00, 3166.95it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
MAP quick Estimates: {'th1_logodds__': array(-1.38629438), 'th2_logodds__': array(-1.94591024), 'th1': array(0.2), 'th2': array(0.12499999), 'd': array(0.07500001)}
Difference, according to data =  0.07500000000000001
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [th2, th1]
Sampling 2 chains: 100%|██████████| 21000/21000 [00:06<00:00, 3124.00draws/s]
The acceptance probability does not match the target. It is 0.8789693509149036, but should be close to 0.8. Try to increase the number of tuning steps.
HPD:  {0: {'th1_logodds__': array([-2.03146286, -0.69860728]), 'th2_logodds__': array([-2.74116031, -0.94995942]), 'th1': array([0.10906388, 0.321899  ]), 'th2': array([0.04750216, 0.24919993]), 'd': array([-0.08242316,  0.22322617])}, 1: {'th1_logodds__': array([-2.02084349, -0.66106498]), 'th2_logodds__': array([-2.7279132 , -0.94411687]), 'th1': array([0.10743923, 0.32472466]), 'th2': array([0.04675944, 0.25078614]), 'd': array([-0.08117984,  0.22080431])}}

Test with a large sample and a Beta(1, 1) prior

In [70]:
propsDiffB(500,100, 400, 50)
<pymc3.model.Model object at 0x7f536251f198>
logp = -228.84, ||grad|| = 212.13: 100%|██████████| 10/10 [00:00<00:00, 3172.70it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
MAP quick Estimates: {'th1_logodds__': array(-1.38629438), 'th2_logodds__': array(-1.94591024), 'th1': array(0.2), 'th2': array(0.12499999), 'd': array(0.07500001)}
Difference, according to data =  0.07500000000000001
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [th2, th1]
Sampling 2 chains: 100%|██████████| 21000/21000 [00:06<00:00, 3228.58draws/s]
HPD:  {0: {'th1_logodds__': array([-1.60132406, -1.17403385]), 'th2_logodds__': array([-2.23112866, -1.63514405]), 'th1': array([0.1677513 , 0.23607541]), 'th2': array([0.09471379, 0.1599122 ]), 'd': array([0.02514927, 0.12069278])}, 1: {'th1_logodds__': array([-1.59595389, -1.16578966]), 'th2_logodds__': array([-2.23561169, -1.64956192]), 'th1': array([0.16801928, 0.23702813]), 'th2': array([0.09508156, 0.15911868]), 'd': array([0.02489183, 0.1208492 ])}}

References

Pham-Gia, T., Thin, N.V. and Doan, P.P. (2017) Inferences on the Difference of Two Proportions: A Bayesian Approach. Open Journal of Statistics, 7, 1-15. https://doi.org/10.4236/ojs.2017.71001
