Proposed Debiasing Method for Inferring High-Dimensional Linear Models (Example Code via `Debias-Infer`)

[1]:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
import scipy.stats
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import CalibratedClassifierCV

from debiasing.debias_prog import ScaledLasso, DebiasProg, DualObj, DualCD, DebiasProgCV

Simulate Random Sample

We generate the i.i.d. data \(\{(Y_i,R_i,X_i)\}_{i=1}^n \subset \mathbb{R}\times \{0,1\} \times\mathbb{R}^d\) from the following linear model

\[Y_i=X_i^T \beta_0 + \epsilon_i \quad \text{ with } \quad X_i \perp\!\!\!\perp \epsilon_i, \quad Y_i \perp\!\!\!\perp R_i|X_i, \quad X_i \sim \mathcal{N}_d(\mathbf{0},\Sigma), \quad \text{ and } \quad \epsilon_i \sim \mathcal{N}(0,1),\]

where \(d=1000\) and \(n=900\).

We adopt the circulant symmetric matrix \(\Sigma^{\mathrm{cs}}\) in Javanmard and Montanari (2014) that is defined as \(\Sigma_{jj}=1\), \(\Sigma_{jk}=0.1 \text{ when } j+1\leq k\leq j+5 \text{ or } j+d-5\leq k \leq j+d-1\) with \(\Sigma_{jk}=0\) elsewhere for \(j\leq k\), and \(\Sigma_{jk}=\Sigma_{kj}\).

The true regression coefficient is \(\beta_0^{sp}=\left(\underbrace{\sqrt{5},...,\sqrt{5}}_5,0,...,0\right)^T \in \mathbb{R}^d\), and the query point is \(x=\left(1,\frac{1}{2},\frac{1}{4},0,0,0,\frac{1}{2},\frac{1}{8},0,...,0\right)^T\in\mathbb{R}^d\) to infer the joint effects of a few components of \(\beta_0\).

[2]:

d = 1000
n = 900

# Circulant symmetric covariance
Sigma = np.zeros((d,d)) + np.eye(d)
rho = 0.1
for i in range(d):
    for j in range(i+1, d):
        if (j < i+6) or (j > i+d-6):
            Sigma[i,j] = rho
            Sigma[j,i] = rho
sig = 1

# True regression coefficient
s_beta = 5
beta_0 = np.zeros((d,))
beta_0[:s_beta] = np.sqrt(5)

X_sim = np.random.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=n)
eps_err_sim = sig*np.random.randn(n)
Y_sim = np.dot(X_sim, beta_0) + eps_err_sim

[3]:

# Query point
x = np.zeros((d,))
x[0] = 1
x[1] = 1/2
x[2] = 1/4
x[6] = 1/2
x[7] = 1/8

# True regression function
m_true = np.dot(x, beta_0)

To increase the complexity of estimating the propensity score, we generate the missing indicators \(R_i,i=1,...,n\) for the outcome variables \(Y_i,i=1,...,n\) as:

\[\mathrm{P}(R_i=1|X_i) = \Phi\left(-4+\sum_{k=1}^K Z_{ik} \right),\]

where \(\Phi(\cdot)\) is the CDF of \(\mathcal{N}(0,1)\) and the vector \((Z_{i1},...,Z_{iK})\) contains all polynomial combinations of the first eight components \(X_{i1},...,X_{i8}\) of the covariate vector \(X_i\) with degrees less than or equal to two (i.e., including the linear, quadratic, and one-way interaction terms).

[4]:

# MAR
inter_mat = PolynomialFeatures(degree=2, interaction_only=False,
                               include_bias=False).fit_transform(X_sim[:,:8])
obs_prob = scipy.stats.norm.cdf(-4 + np.sum(inter_mat, axis=1))
R = np.ones((n,))
R[np.random.rand(n) >= obs_prob] = 0

Detailed Procedures of Applying Our Proposed Debiasing Method

1. Lasso pilot estimate:

To select the regularization parameter \(\lambda>0\) in a data adaptive way, we adopt the scaled Lasso (Sun and Zhang, 2012) with its universal regularization parameter \(\lambda_0=\sqrt{\frac{2\log d}{n}}\) as the initialization. Specifically, it provides an iterative algorithm to obtain the Lasso estimate \(\hat{\beta}\) and a consistent estimator \(\hat{\sigma}_{\epsilon}\) of the noise level \(\sigma_{\epsilon}\) from the following jointly convex optimization problem:

\[\left(\hat{\beta}(\tilde{\lambda}), \hat{\sigma}_{\epsilon}(\tilde{\lambda})\right) = \arg\min_{\beta\in \mathbb{R}^d, \sigma_{\epsilon} >0} \left[\frac{1}{2n\sigma_{\epsilon}} \sum_{i=1}^n R_i\left(Y_i - X_i^T\beta\right)^2 + \frac{\sigma_{\epsilon}}{2}+\tilde{\lambda}\|\beta\|_1\right],\]

where the regularization parameter \(\tilde{\lambda} >0\) is updated iteratively as well.

[5]:

# Lasso pilot estimator
beta_pilot, sigma_hat = ScaledLasso(X=X_sim[R == 1,:], Y=Y_sim[R == 1], lam0='univ')

2. Propensity Score Estimation:

We consider both the oracle/true propensity scores and the estimated propensity scores by the Lasso-type logistic regression, two-layer neural network (with and without the Platt’s logistic calibration (Platt, 1999)).

[6]:

## Propensity score estimation (Oracle, LR, MLP neural network)
for non_met in ['Oracle', 'LR', 'NN', 'NNcal']:
    if non_met == 'Oracle':
        prop_score1 = obs_prob.copy()
        MAE_prop = np.mean(abs(prop_score1 - obs_prob))

    if non_met == 'LR':
        zeta2 = np.logspace(-1, np.log10(300), 40)*np.sqrt(np.log(d)/n)
        lr2 = LogisticRegressionCV(Cs=1/zeta2, cv=5, penalty='l1', scoring='neg_log_loss',
                                   solver='liblinear', tol=1e-6, max_iter=10000).fit(X_sim, R)
        prop_score2 = lr2.predict_proba(X_sim)[:,1]
        MAE_prop = np.mean(abs(prop_score2 - obs_prob))

    if non_met == 'NN':
        lr2_NN = MLPClassifier(hidden_layer_sizes=(80,50,), activation='relu',
                               random_state=None, learning_rate='adaptive',
                               learning_rate_init=0.001, max_iter=1000).fit(X_sim, R)
        prop_score3 = lr2_NN.predict_proba(X_sim)[:,1]
        MAE_prop = np.mean(abs(prop_score3 - obs_prob))

    if non_met == 'NNcal':
        NN_base = MLPClassifier(hidden_layer_sizes=(80,50,), activation='relu', random_state=None,
                                learning_rate='adaptive', learning_rate_init=0.001, max_iter=1000)
        lr2_NN = CalibratedClassifierCV(NN_base, method='sigmoid', cv=5).fit(X_sim, R)
        prop_score4 = lr2_NN.predict_proba(X_sim)[:,1]
        MAE_prop = np.mean(abs(prop_score4 - obs_prob))

3. Solving Our Debiasing Program

We solve for the weights through our debiasing program, where the tuning parameter \(\frac{\gamma}{n}>0\) is selected via 5-fold cross-validations.

[7]:

## Oracle
w_1se1, ll_1se1, gamma_n_1se1 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score1, gamma_lst=None,
                                             cv_fold=5, cv_rule='1se')
# w_minfeas1, ll_minfeas1, gamma_n_minfeas1 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score1,
#                                                          gamma_lst=None, cv_fold=5, cv_rule='minfeas')

The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!

[8]:

## LR
w_1se2, ll_1se2, gamma_n_1se2 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score2, gamma_lst=None,
                                             cv_fold=5, cv_rule='1se')

The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!

[9]:

## NN
w_1se3, ll_1se3, gamma_n_1se3 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score3, gamma_lst=None,
                                             cv_fold=5, cv_rule='1se')

The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The coordinate descent algorithm has reached its maximum number of iterations: 5000! Reiterate one more times without small perturbations to the scaled design matrix...
The strong duality between primal and dual programs does not satisfy when \gamma/n=0.051!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!

[10]:

## NNcal
w_1se4, ll_1se4, gamma_n_1se4 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score4, gamma_lst=None,
                                             cv_fold=5, cv_rule='1se')

The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.001!
The primal debiasing program is infeasible!
The primal debiasing program for this fold of the data is not feasible when \gamma/n=0.026!

If the tuning parameter \(\frac{\gamma}{n}>0\) is pre-selected, then we can run the primal/dual debiasing program without cross-validations.

[11]:

w_1se1_new = DebiasProg(X=X_sim.copy(), x=x, Pi=np.diag(prop_score1), gamma_n=gamma_n_1se1)
ll_1se1_new = DualCD(X=X_sim.copy(), x=x, Pi=np.diag(prop_score1), gamma_n=gamma_n_1se1, ll_init=None,
                 eps=1e-9, max_iter=5000)

4. Construct the Debiased Estimator and its Asymptotic 95% Confidence Interval

[13]:

print('The true regression function is '+str(m_true)+'.\n')

## Oracle
m_deb1 = np.dot(x, beta_pilot) + np.sum(w_1se1 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)
asym_sd1 = np.sqrt(np.sum(prop_score1 * w_1se1**2)/n)
print('The 95% confidence interval of our debiasing method under the oracle propensity scores is '\
      +str([m_deb1 - asym_sd1*sigma_hat*scipy.stats.norm.ppf(1-0.05/2),
            m_deb1 + asym_sd1*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\n')

## LR
m_deb2 = np.dot(x, beta_pilot) + np.sum(w_1se2 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)
asym_sd2 = np.sqrt(np.sum(prop_score2 * w_1se2**2)/n)
print('The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by '\
      'the Lasso-type logistic regression is \n'\
      +str([m_deb2 - asym_sd2*sigma_hat*scipy.stats.norm.ppf(1-0.05/2),
            m_deb2 + asym_sd2*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\n')

## NN
m_deb3 = np.dot(x, beta_pilot) + np.sum(w_1se3 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)
asym_sd3 = np.sqrt(np.sum(prop_score3 * w_1se3**2)/n)
print('The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by '\
      'the two-layer neural network is \n'\
      +str([m_deb3 - asym_sd3*sigma_hat*scipy.stats.norm.ppf(1-0.05/2),
            m_deb3 + asym_sd3*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\n')

## NNcal
m_deb4 = np.dot(x, beta_pilot) + np.sum(w_1se4 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)
asym_sd4 = np.sqrt(np.sum(prop_score4 * w_1se4**2)/n)
print("The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by "\
      "the two-layer neural network with Platt's logistic calibration is \n"\
      +str([m_deb4 - asym_sd4*sigma_hat*scipy.stats.norm.ppf(1-0.05/2),
            m_deb4 + asym_sd4*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\n')

The true regression function is 3.913118960624632.

The 95% confidence interval of our debiasing method under the oracle propensity scores is [3.8227718343812698, 3.9852952726087865].

The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the Lasso-type logistic regression is
[3.8531350930448305, 4.029273820154903].

The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the two-layer neural network is
[3.824591171878295, 3.9872310918920717].

The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the two-layer neural network with Platt's logistic calibration is
[3.8490969609870653, 4.02376413689852].

Proposed Debiasing Method for Inferring High-Dimensional Linear Models (Example Code via Debias-Infer)