{ "cells": [ { "cell_type": "markdown", "id": "c5233c68", "metadata": {}, "source": [ "# Proposed Debiasing Method for Inferring High-Dimensional Linear Models (Example Code via ``Debias-Infer``)" ] }, { "cell_type": "code", "execution_count": 1, "id": "182569fb", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:52:09.618389Z", "start_time": "2023-09-05T04:52:07.973125Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.preprocessing import PolynomialFeatures\n", "import scipy.stats\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.calibration import CalibratedClassifierCV\n", "\n", "from debiasing.debias_prog import ScaledLasso, DebiasProg, DualObj, DualCD, DebiasProgCV" ] }, { "cell_type": "markdown", "id": "76583f3b", "metadata": {}, "source": [ "## Simulate Random Sample\n", "\n", "We generate the i.i.d. data $\\{(Y_i,R_i,X_i)\\}_{i=1}^n \\subset \\mathbb{R}\\times \\{0,1\\} \\times\\mathbb{R}^d$ from the following linear model \n", "$$Y_i=X_i^T \\beta_0 + \\epsilon_i \\quad \\text{ with } \\quad X_i \\perp\\!\\!\\!\\perp \\epsilon_i, \\quad Y_i \\perp\\!\\!\\!\\perp R_i|X_i, \\quad X_i \\sim \\mathcal{N}_d(\\mathbf{0},\\Sigma), \\quad \\text{ and } \\quad \\epsilon_i \\sim \\mathcal{N}(0,1),$$\n", "where $d=1000$ and $n=900$.\n", "\n", "We adopt the circulant symmetric matrix $\\Sigma^{\\mathrm{cs}}$ in Javanmard and Montanari (2014) that is defined as $\\Sigma_{jj}=1$, $\\Sigma_{jk}=0.1 \\text{ when } j+1\\leq k\\leq j+5 \\text{ or } j+d-5\\leq k \\leq j+d-1$ with $\\Sigma_{jk}=0$ elsewhere for $j\\leq k$, and $\\Sigma_{jk}=\\Sigma_{kj}$.\n", "\n", "The true regression coefficient is $\\beta_0^{sp}=\\left(\\underbrace{\\sqrt{5},...,\\sqrt{5}}_5,0,...,0\\right)^T \\in \\mathbb{R}^d$, and the query point is $x=\\left(1,\\frac{1}{2},\\frac{1}{4},0,0,0,\\frac{1}{2},\\frac{1}{8},0,...,0\\right)^T\\in\\mathbb{R}^d$ to infer the joint effects of a few components of $\\beta_0$.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "247a292b", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:52:10.411191Z", "start_time": "2023-09-05T04:52:09.621748Z" } }, "outputs": [], "source": [ "d = 1000\n", "n = 900\n", "\n", "# Circulant symmetric covariance\n", "Sigma = np.zeros((d,d)) + np.eye(d)\n", "rho = 0.1\n", "for i in range(d):\n", " for j in range(i+1, d):\n", " if (j < i+6) or (j > i+d-6):\n", " Sigma[i,j] = rho\n", " Sigma[j,i] = rho\n", "sig = 1\n", "\n", "# True regression coefficient\n", "s_beta = 5\n", "beta_0 = np.zeros((d,))\n", "beta_0[:s_beta] = np.sqrt(5)\n", "\n", "X_sim = np.random.multivariate_normal(mean=np.zeros(d), cov=Sigma, size=n)\n", "eps_err_sim = sig*np.random.randn(n)\n", "Y_sim = np.dot(X_sim, beta_0) + eps_err_sim" ] }, { "cell_type": "code", "execution_count": 3, "id": "3f60924f", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:52:10.418099Z", "start_time": "2023-09-05T04:52:10.414153Z" } }, "outputs": [], "source": [ "# Query point\n", "x = np.zeros((d,))\n", "x[0] = 1\n", "x[1] = 1/2\n", "x[2] = 1/4\n", "x[6] = 1/2\n", "x[7] = 1/8\n", "\n", "# True regression function\n", "m_true = np.dot(x, beta_0)" ] }, { "cell_type": "markdown", "id": "e32bb7ec", "metadata": {}, "source": [ "To increase the complexity of estimating the propensity score, we generate the missing indicators $R_i,i=1,...,n$ for the outcome variables $Y_i,i=1,...,n$ as:\n", "$$\\mathrm{P}(R_i=1|X_i) = \\Phi\\left(-4+\\sum_{k=1}^K Z_{ik} \\right),$$\n", "where $\\Phi(\\cdot)$ is the CDF of $\\mathcal{N}(0,1)$ and the vector $(Z_{i1},...,Z_{iK})$ contains all polynomial combinations of the first eight components $X_{i1},...,X_{i8}$ of the covariate vector $X_i$ with degrees less than or equal to two (_i.e._, including the linear, quadratic, and one-way interaction terms)." ] }, { "cell_type": "code", "execution_count": 4, "id": "4234ec77", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:52:10.439262Z", "start_time": "2023-09-05T04:52:10.421696Z" } }, "outputs": [], "source": [ "# MAR\n", "inter_mat = PolynomialFeatures(degree=2, interaction_only=False, \n", " include_bias=False).fit_transform(X_sim[:,:8])\n", "obs_prob = scipy.stats.norm.cdf(-4 + np.sum(inter_mat, axis=1))\n", "R = np.ones((n,))\n", "R[np.random.rand(n) >= obs_prob] = 0" ] }, { "cell_type": "markdown", "id": "34aab858", "metadata": {}, "source": [ "## Detailed Procedures of Applying Our Proposed Debiasing Method\n", "### 1. Lasso pilot estimate:\n", "\n", "To select the regularization parameter $\\lambda>0$ in a data adaptive way, we adopt the scaled Lasso (Sun and Zhang, 2012) with its universal regularization parameter $\\lambda_0=\\sqrt{\\frac{2\\log d}{n}}$ as the initialization. Specifically, it provides an iterative algorithm to obtain the Lasso estimate $\\hat{\\beta}$ and a consistent estimator $\\hat{\\sigma}_{\\epsilon}$ of the noise level $\\sigma_{\\epsilon}$ from the following jointly convex optimization problem:\n", "$$\\left(\\hat{\\beta}(\\tilde{\\lambda}), \\hat{\\sigma}_{\\epsilon}(\\tilde{\\lambda})\\right) = \\arg\\min_{\\beta\\in \\mathbb{R}^d, \\sigma_{\\epsilon} >0} \\left[\\frac{1}{2n\\sigma_{\\epsilon}} \\sum_{i=1}^n R_i\\left(Y_i - X_i^T\\beta\\right)^2 + \\frac{\\sigma_{\\epsilon}}{2}+\\tilde{\\lambda}\\|\\beta\\|_1\\right],$$\n", "where the regularization parameter $\\tilde{\\lambda} >0$ is updated iteratively as well." ] }, { "cell_type": "code", "execution_count": 5, "id": "d59fc2e9", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:52:10.625477Z", "start_time": "2023-09-05T04:52:10.448482Z" } }, "outputs": [], "source": [ "# Lasso pilot estimator\n", "beta_pilot, sigma_hat = ScaledLasso(X=X_sim[R == 1,:], Y=Y_sim[R == 1], lam0='univ')" ] }, { "cell_type": "markdown", "id": "50eecfff", "metadata": {}, "source": [ "### 2. Propensity Score Estimation:\n", "\n", "We consider both the oracle/true propensity scores and the estimated propensity scores by the Lasso-type logistic regression, two-layer neural network (with and without the Platt's logistic calibration (Platt, 1999))." ] }, { "cell_type": "code", "execution_count": 6, "id": "502d2903", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T04:53:11.445322Z", "start_time": "2023-09-05T04:52:10.635035Z" } }, "outputs": [], "source": [ "## Propensity score estimation (Oracle, LR, MLP neural network)\n", "for non_met in ['Oracle', 'LR', 'NN', 'NNcal']:\n", " if non_met == 'Oracle':\n", " prop_score1 = obs_prob.copy()\n", " MAE_prop = np.mean(abs(prop_score1 - obs_prob))\n", " \n", " if non_met == 'LR':\n", " zeta2 = np.logspace(-1, np.log10(300), 40)*np.sqrt(np.log(d)/n)\n", " lr2 = LogisticRegressionCV(Cs=1/zeta2, cv=5, penalty='l1', scoring='neg_log_loss', \n", " solver='liblinear', tol=1e-6, max_iter=10000).fit(X_sim, R)\n", " prop_score2 = lr2.predict_proba(X_sim)[:,1]\n", " MAE_prop = np.mean(abs(prop_score2 - obs_prob))\n", " \n", " if non_met == 'NN':\n", " lr2_NN = MLPClassifier(hidden_layer_sizes=(80,50,), activation='relu', \n", " random_state=None, learning_rate='adaptive', \n", " learning_rate_init=0.001, max_iter=1000).fit(X_sim, R)\n", " prop_score3 = lr2_NN.predict_proba(X_sim)[:,1]\n", " MAE_prop = np.mean(abs(prop_score3 - obs_prob))\n", " \n", " if non_met == 'NNcal':\n", " NN_base = MLPClassifier(hidden_layer_sizes=(80,50,), activation='relu', random_state=None, \n", " learning_rate='adaptive', learning_rate_init=0.001, max_iter=1000)\n", " lr2_NN = CalibratedClassifierCV(NN_base, method='sigmoid', cv=5).fit(X_sim, R)\n", " prop_score4 = lr2_NN.predict_proba(X_sim)[:,1]\n", " MAE_prop = np.mean(abs(prop_score4 - obs_prob))" ] }, { "cell_type": "markdown", "id": "2d7de6f1", "metadata": {}, "source": [ "### 3. Solving Our Debiasing Program\n", "\n", "We solve for the weights through our debiasing program, where the tuning parameter $\\frac{\\gamma}{n}>0$ is selected via 5-fold cross-validations." ] }, { "cell_type": "code", "execution_count": 7, "id": "1c393741", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:01:47.285487Z", "start_time": "2023-09-05T04:53:11.449227Z" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n" ] } ], "source": [ "## Oracle\n", "w_1se1, ll_1se1, gamma_n_1se1 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score1, gamma_lst=None, \n", " cv_fold=5, cv_rule='1se')\n", "# w_minfeas1, ll_minfeas1, gamma_n_minfeas1 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score1, \n", "# gamma_lst=None, cv_fold=5, cv_rule='minfeas')" ] }, { "cell_type": "code", "execution_count": 8, "id": "907bf3f4", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:09:39.372875Z", "start_time": "2023-09-05T05:01:47.287520Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n" ] } ], "source": [ "## LR\n", "w_1se2, ll_1se2, gamma_n_1se2 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score2, gamma_lst=None, \n", " cv_fold=5, cv_rule='1se')" ] }, { "cell_type": "code", "execution_count": 9, "id": "c429d227", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:21:17.341091Z", "start_time": "2023-09-05T05:09:39.375884Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The coordinate descent algorithm has reached its maximum number of iterations: 5000! Reiterate one more times without small perturbations to the scaled design matrix...\n", "The strong duality between primal and dual programs does not satisfy when \\gamma/n=0.051!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n" ] } ], "source": [ "## NN\n", "w_1se3, ll_1se3, gamma_n_1se3 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score3, gamma_lst=None, \n", " cv_fold=5, cv_rule='1se')" ] }, { "cell_type": "code", "execution_count": 10, "id": "7fef6558", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:29:11.466656Z", "start_time": "2023-09-05T05:21:17.344207Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.001!\n", "The primal debiasing program is infeasible!\n", "The primal debiasing program for this fold of the data is not feasible when \\gamma/n=0.026!\n" ] } ], "source": [ "## NNcal\n", "w_1se4, ll_1se4, gamma_n_1se4 = DebiasProgCV(X=X_sim.copy(), x=x, prop_score=prop_score4, gamma_lst=None, \n", " cv_fold=5, cv_rule='1se')" ] }, { "cell_type": "markdown", "id": "e3b2da26", "metadata": {}, "source": [ "If the tuning parameter $\\frac{\\gamma}{n}>0$ is pre-selected, then we can run the primal/dual debiasing program without cross-validations." ] }, { "cell_type": "code", "execution_count": 11, "id": "08db4ad7", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:29:15.135929Z", "start_time": "2023-09-05T05:29:11.468898Z" } }, "outputs": [], "source": [ "w_1se1_new = DebiasProg(X=X_sim.copy(), x=x, Pi=np.diag(prop_score1), gamma_n=gamma_n_1se1)\n", "ll_1se1_new = DualCD(X=X_sim.copy(), x=x, Pi=np.diag(prop_score1), gamma_n=gamma_n_1se1, ll_init=None, \n", " eps=1e-9, max_iter=5000)" ] }, { "cell_type": "markdown", "id": "123735a1", "metadata": {}, "source": [ "### 4. Construct the Debiased Estimator and its Asymptotic 95% Confidence Interval" ] }, { "cell_type": "code", "execution_count": 13, "id": "80f3aa97", "metadata": { "ExecuteTime": { "end_time": "2023-09-05T05:41:54.721344Z", "start_time": "2023-09-05T05:41:54.680104Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The true regression function is 3.913118960624632.\n", "\n", "The 95% confidence interval of our debiasing method under the oracle propensity scores is [3.8227718343812698, 3.9852952726087865].\n", "\n", "The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the Lasso-type logistic regression is \n", "[3.8531350930448305, 4.029273820154903].\n", "\n", "The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the two-layer neural network is \n", "[3.824591171878295, 3.9872310918920717].\n", "\n", "The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by the two-layer neural network with Platt's logistic calibration is \n", "[3.8490969609870653, 4.02376413689852].\n", "\n" ] } ], "source": [ "print('The true regression function is '+str(m_true)+'.\\n')\n", "\n", "## Oracle\n", "m_deb1 = np.dot(x, beta_pilot) + np.sum(w_1se1 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)\n", "asym_sd1 = np.sqrt(np.sum(prop_score1 * w_1se1**2)/n)\n", "print('The 95% confidence interval of our debiasing method under the oracle propensity scores is '\\\n", " +str([m_deb1 - asym_sd1*sigma_hat*scipy.stats.norm.ppf(1-0.05/2), \n", " m_deb1 + asym_sd1*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\\n')\n", "\n", "## LR\n", "m_deb2 = np.dot(x, beta_pilot) + np.sum(w_1se2 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)\n", "asym_sd2 = np.sqrt(np.sum(prop_score2 * w_1se2**2)/n)\n", "print('The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by '\\\n", " 'the Lasso-type logistic regression is \\n'\\\n", " +str([m_deb2 - asym_sd2*sigma_hat*scipy.stats.norm.ppf(1-0.05/2), \n", " m_deb2 + asym_sd2*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\\n')\n", "\n", "## NN\n", "m_deb3 = np.dot(x, beta_pilot) + np.sum(w_1se3 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)\n", "asym_sd3 = np.sqrt(np.sum(prop_score3 * w_1se3**2)/n)\n", "print('The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by '\\\n", " 'the two-layer neural network is \\n'\\\n", " +str([m_deb3 - asym_sd3*sigma_hat*scipy.stats.norm.ppf(1-0.05/2), \n", " m_deb3 + asym_sd3*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\\n')\n", "\n", "## NNcal\n", "m_deb4 = np.dot(x, beta_pilot) + np.sum(w_1se4 * R * (Y_sim - np.dot(X_sim, beta_pilot)))/np.sqrt(n)\n", "asym_sd4 = np.sqrt(np.sum(prop_score4 * w_1se4**2)/n)\n", "print(\"The 95% confidence interval of our debiasing method when the oracle propensity scores are estimated by \"\\\n", " \"the two-layer neural network with Platt's logistic calibration is \\n\"\\\n", " +str([m_deb4 - asym_sd4*sigma_hat*scipy.stats.norm.ppf(1-0.05/2), \n", " m_deb4 + asym_sd4*sigma_hat*scipy.stats.norm.ppf(1-0.05/2)])+'.\\n')" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }