**178** Publications


Biostatistics 2022 Jul 21. Epub 2022 Jul 21.

Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA.

An endeavor central to precision medicine is predictive biomarker discovery: predictive biomarkers define patient subpopulations that stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher-than-expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development, and patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is doubly robust and asymptotically linear under loose conditions on the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized controlled trials with moderate- and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.

DOI: http://dx.doi.org/10.1093/biostatistics/kxac029

July 2022

Int J Biostat 2022 Jul 15. Epub 2022 Jul 15.

Division of Biostatistics, University of California, Berkeley, USA.

We consider estimation of a functional parameter of a realistically modeled data distribution based on independent and identically distributed observations. The highly adaptive lasso (HAL) estimator of the functional parameter is defined as the minimizer of the empirical risk over a class of càdlàg functions with finite sectional variation norm, where the functional parameter is parametrized in terms of such a class of functions. In this article we establish that this HAL estimator yields an asymptotically efficient estimator of any smooth feature of the functional parameter under a global undersmoothing condition. It is formally shown that the sectional-variation-norm restriction in HAL does not obstruct it from solving the score equations along paths that do not enforce this restriction. Therefore, from an asymptotic point of view, the only reason for undersmoothing is that the true target function might not be complex, so that the HAL fit leaves out key basis functions needed to span the desired efficient influence curve of the smooth target parameter. Nonetheless, in practice undersmoothing appears to be beneficial, and a simple targeted method is proposed and verified in practice to perform well. We demonstrate our general result for the HAL estimator of a treatment-specific mean and of the integrated squared density. We also present simulations for these two examples confirming the theory.

DOI: http://dx.doi.org/10.1515/ijb-2019-0092

July 2022

Biometrics 2022 Jul 15. Epub 2022 Jul 15.

Division of Biostatistics, School of Public Health and Department of Statistics, University of California, Berkeley, California, USA.

Inverse-probability-weighted estimators are the oldest and perhaps most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudopopulation in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse-probability-weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at a nearly $n^{-1/3}$-rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse-probability-weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.
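
The pseudopopulation idea can be illustrated with a toy simulation. The sketch below uses the generic Horvitz–Thompson form of IPW with a *known* weighting mechanism; the paper's contribution is to estimate that mechanism nonparametrically via an undersmoothed highly adaptive lasso, which this minimal example does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observational data: confounder W, treatment A, outcome Y,
# with a true average treatment effect of 1.0.
n = 5000
W = rng.normal(size=n)
g = 1.0 / (1.0 + np.exp(-0.5 * W))      # true weighting mechanism P(A=1 | W)
A = rng.binomial(1, g)
Y = A + W + rng.normal(size=n)

# Weighting by 1/g (treated) and 1/(1-g) (control) constructs a
# pseudopopulation in which W no longer predicts A, removing confounding.
ate_ipw = np.mean(A * Y / g) - np.mean((1 - A) * Y / (1 - g))
print(ate_ipw)  # approximately 1.0
```

In practice the weighting mechanism `g` is unknown and must be estimated, which is where the choice of estimator (parametric model vs. undersmoothed HAL) matters.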

DOI: http://dx.doi.org/10.1111/biom.13719

July 2022

Int J Biostat 2022 Jun 16. Epub 2022 Jun 16.

Divisions of Biostatistics and Epidemiology, University of California Berkeley, Berkeley, USA.

The optimal dynamic treatment rule (ODTR) framework offers an approach for understanding which kinds of patients respond best to specific treatments - in other words, treatment effect heterogeneity. Recently, there has been a proliferation of methods for estimating the ODTR. One such method is an extension of the SuperLearner algorithm - an ensemble method to optimally combine candidate algorithms extensively used in prediction problems - to ODTRs. Following the "causal roadmap," we causally and statistically define the ODTR and provide an introduction to estimating it using the ODTR SuperLearner. Additionally, we highlight practical choices when implementing the algorithm, including choice of candidate algorithms, metalearners to combine the candidates, and risk functions to select the best combination of algorithms. Using simulations, we illustrate how estimating the ODTR using this SuperLearner approach can uncover treatment effect heterogeneity more effectively than traditional approaches based on fitting a parametric regression of the outcome on the treatment, covariates, and treatment-covariate interactions. We investigate the implications of choices in implementing an ODTR SuperLearner at various sample sizes. Our results show the advantages of: (1) including a combination of both flexible machine learning algorithms and simple parametric estimators in the library of candidate algorithms; (2) using an ensemble metalearner to combine candidates rather than selecting only the best-performing candidate; and (3) using the mean outcome under the rule as the risk function. Finally, we apply the ODTR SuperLearner to the "Interventions" study, an ongoing randomized controlled trial, to identify which justice-involved adults with mental illness benefit most from cognitive behavioral therapy to reduce criminal re-offending.

DOI: http://dx.doi.org/10.1515/ijb-2020-0127

June 2022

Int J Biostat 2022 Jun 6. Epub 2022 Jun 6.

Divisions of Biostatistics and Epidemiology, University of California Berkeley, Berkeley, USA.

Given an (optimal) dynamic treatment rule, it may be of interest to evaluate that rule - that is, to ask the causal question: what is the expected outcome had every subject received treatment according to that rule? In this paper, we study the performance of estimators that approximate the true value of: (1) a known dynamic treatment rule; (2) the true, unknown optimal dynamic treatment rule (ODTR); and (3) an estimated ODTR, a so-called "data-adaptive parameter," whose true value depends on the sample. Using simulations of point-treatment data, we specifically investigate: (1) the impact of increasingly data-adaptive estimation of nuisance parameters and/or of the ODTR on performance; (2) the potential for improved efficiency and bias reduction through the use of semiparametric efficient estimators; and (3) the importance of sample splitting based on the cross-validated targeted maximum likelihood estimator (CV-TMLE) for accurate inference. In the simulations considered, there was very little cost and many benefits to using CV-TMLE to estimate the value of the true and estimated ODTR; importantly, and in contrast to non-cross-validated estimators, the performance of CV-TMLE was maintained even when highly data-adaptive algorithms were used to estimate both nuisance parameters and the ODTR. In addition, we apply these estimators for the value of the rule to the "Interventions" study, an ongoing randomized controlled trial, to identify whether assigning cognitive behavioral therapy (CBT) to criminal justice-involved adults with mental illness using an ODTR significantly reduces the probability of recidivism, compared to assigning CBT in a non-individualized way.

DOI: http://dx.doi.org/10.1515/ijb-2020-0128

June 2022

Am J Epidemiol 2022 May 4. Epub 2022 May 4.

Department of Biostatistics, UC Berkeley, Berkeley, California, United States.

Inverse probability weighting (IPW) and targeted maximum likelihood estimation (TMLE) are methodologies that adjust for confounding and selection bias and are often used for causal inference. Both estimators rely on the positivity assumption that, within strata of confounders, there is a positive probability of receiving treatment at all levels under consideration. Practical applications of IPW require finite IP weights, and TMLE requires that propensity scores (PS) be bounded away from zero and one. Although truncation can improve variance and finite-sample bias, this artificial distortion of the IP-weight and PS distributions introduces asymptotic bias. As sample size grows, truncation-induced bias eventually swamps variance, rendering nominal confidence interval coverage and hypothesis tests invalid. We present a simple truncation strategy based on the sample size, $n$, that sets the upper bound on IP weights at $\sqrt{n}\ln n/5$; for TMLE, the lower bound on the PS should be set to $5/\left(\sqrt{n}\ln n\right)$. Our strategy was designed to optimize the mean squared error (MSE) of the parameter estimate. It naturally extends to data structures with missing outcomes. Simulation studies and a data analysis demonstrate our strategy's ability to minimize both bias and MSE compared to other common strategies, including the popular, but flawed, quantile-based heuristic.
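
The proposed bounds are simple closed-form functions of the sample size; a small illustrative helper makes their scaling concrete (function names here are ours, not the paper's):

```python
import math

def ipw_weight_bound(n: int) -> float:
    """Upper bound on IP weights from the proposed strategy: sqrt(n) * ln(n) / 5."""
    return math.sqrt(n) * math.log(n) / 5.0

def ps_lower_bound(n: int) -> float:
    """Matching lower truncation bound on the propensity score: 5 / (sqrt(n) * ln(n))."""
    return 5.0 / (math.sqrt(n) * math.log(n))

# The bounds relax as n grows, so the bias introduced by truncation vanishes
# asymptotically while variance is still controlled in small samples.
for n in (100, 10_000, 1_000_000):
    print(n, round(ipw_weight_bound(n), 2), round(ps_lower_bound(n), 5))
```

Note that the two bounds are reciprocals of one another, reflecting that an IP weight is the inverse of a propensity score.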

DOI: http://dx.doi.org/10.1093/aje/kwac087

May 2022

Biometrics 2022 Mar 25. Epub 2022 Mar 25.

Division of Biostatistics, University of California at Berkeley, Berkeley, California, USA.

DOI: http://dx.doi.org/10.1111/biom.13640

March 2022

Stat Med 2022 05 16;41(12):2132-2165. Epub 2022 Feb 16.

Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA.

Several recently developed methods have the potential to harness machine learning in the pursuit of target quantities inspired by causal inference, including inverse weighting, doubly robust estimating equations, and substitution estimators like targeted maximum likelihood estimation. There are even more recent augmentations of these procedures that can increase robustness by adding a layer of cross-validation (cross-validated targeted maximum likelihood estimation and double machine learning, as applied to substitution and estimating-equation approaches, respectively). While these methods have been evaluated individually on simulated and experimental data sets, a comprehensive analysis of their performance across real-data-based simulations has yet to be conducted. In this work, we benchmark multiple widely used methods for estimating the average treatment effect using data from ten different nutrition intervention studies. A nonparametric regression method, the undersmoothed highly adaptive lasso, is used to generate the simulated distribution, which preserves important features of the observed data and reproduces a set of true target parameters. For each simulated dataset, we apply the methods above to estimate the average treatment effects as well as their standard errors and resulting confidence intervals. Based on the analytic results, a general recommendation is put forth for use of the cross-validated variants of both substitution and estimating-equation estimators. We conclude that the additional layer of cross-validation helps avoid unintentional overfitting of nuisance parameter functionals and leads to more robust inferences.

DOI: http://dx.doi.org/10.1002/sim.9348

May 2022

Biostatistics 2022 Feb 1. Epub 2022 Feb 1.

Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine, 402 E. 67th Street, New York, NY 10065, USA

Causal mediation analysis has historically been limited in two important ways: (i) a focus has traditionally been placed on binary exposures and static interventions and (ii) direct and indirect effect decompositions have been pursued that are only identifiable in the absence of intermediate confounders affected by exposure. We present a theoretical study of an (in)direct effect decomposition of the population intervention effect, defined by stochastic interventions jointly applied to the exposure and mediators. In contrast to existing proposals, our causal effects can be evaluated regardless of whether an exposure is categorical or continuous and remain well-defined even in the presence of intermediate confounders affected by exposure. Our (in)direct effects are identifiable without a restrictive assumption on cross-world counterfactual independencies, allowing for substantive conclusions drawn from them to be validated in randomized controlled trials. Beyond the novel effects introduced, we provide a careful study of nonparametric efficiency theory relevant for the construction of flexible, multiply robust estimators of our (in)direct effects, while avoiding undue restrictions induced by assuming parametric models of nuisance parameter functionals. To complement our nonparametric estimation strategy, we introduce inferential techniques for constructing confidence intervals and hypothesis tests, and discuss open-source software, the $\texttt{medshift}$ $\texttt{R}$ package, implementing the proposed methodology. Application of our (in)direct effects and their nonparametric estimators is illustrated using data from a comparative effectiveness trial examining the direct and indirect effects of pharmacological therapeutics on relapse to opioid use disorder.

DOI: http://dx.doi.org/10.1093/biostatistics/kxac002

February 2022

J Am Stat Assoc 2021 23;116(535):1254-1264. Epub 2020 Jan 23.

Division of Biostatistics, University of California, Berkeley.

Mediation analysis is critical to understanding the mechanisms underlying exposure-outcome relationships. In this paper, we use randomization of an instrument to identify a direct effect of the exposure on the outcome that does not operate through the mediator. We call this estimand the complier stochastic direct effect (CSDE). To our knowledge, such an estimand has not previously been considered or estimated. We propose and evaluate several estimators for the CSDE: a ratio of inverse-probability-of-treatment-weighted (IPTW) estimators, a ratio of estimating-equation (EE) estimators, a ratio of targeted minimum loss-based estimators (TMLE), and a TMLE that targets the CSDE directly. These estimators are applicable to a variety of study designs, including randomized encouragement trials, like the Moving to Opportunity housing voucher experiment we consider as an illustrative example, treatment discontinuities, and Mendelian randomization. We found the IPTW estimator to be the most sensitive to finite-sample bias, resulting in bias of over 40% even when all models were correctly specified, at a sample size of N=100. In contrast, the EE estimator and the TMLE that targets the CSDE directly were far less sensitive. The EE and TML estimators also have advantages in terms of efficiency and reduced reliance on correct parametric model specification.

DOI: http://dx.doi.org/10.1080/01621459.2019.1704292
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8439556

January 2020

Addiction 2021 08 22;116(8):2094-2103. Epub 2021 Jan 22.

Department of Psychiatry, School of Medicine, Columbia University and New York State Psychiatric Institute, New York, NY, USA.

DOI: http://dx.doi.org/10.1111/add.15377

August 2021

Biometrics 2021 12 28;77(4):1241-1253. Epub 2020 Sep 28.

Department of Biostatistics & Computational Biology, Rollins School of Public Health, Emory University, Atlanta, Georgia.

The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including human immunodeficiency virus (HIV), have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the usage of two-phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.

DOI: http://dx.doi.org/10.1111/biom.13375
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016405

December 2021

Int J Biostat 2020 08 13;17(1):7-21. Epub 2020 Aug 13.

Department of Biostatistics, University of California, Berkeley, Berkeley, USA.

We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each group's association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.
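
One simple instance of this general idea - a stand-in sketch, not the authors' exact estimand or inference procedure - is a cross-validated, R²-style association measure computed from a flexible learner's out-of-fold predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)

# Outcome depends on the covariates only through a nonlinear interaction,
# which a linear summary measure would largely miss.
n = 2000
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.5, size=n)

# Cross-validated predictions avoid rewarding overfit in-sample fits.
pred = cross_val_predict(
    RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5
)

# R^2-style association: share of outcome variance explained out-of-fold.
psi = 1.0 - np.mean((y - pred) ** 2) / np.var(y)
```

Here `psi` is positive despite the covariate-outcome relationship being purely nonlinear, which is the kind of signal a classical linear summary measure would miss.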

DOI: http://dx.doi.org/10.1515/ijb-2019-0061

August 2020

Sci Rep 2020 07 2;10(1):10939. Epub 2020 Jul 2.

Global Health Group, University of California, San Francisco, San Francisco, USA.

The identification of disease hotspots is an increasingly important public health problem. While geospatial modeling offers an opportunity to predict the locations of hotspots using suitable environmental and climatological data, little attention has been paid to optimizing the design of surveys used to inform such models. Here we introduce an adaptive sampling scheme optimized to identify hotspot locations where prevalence exceeds a relevant threshold. Our approach incorporates ideas from Bayesian optimization theory to adaptively select sample batches. We present an experimental simulation study based on survey data of schistosomiasis and lymphatic filariasis across four countries. Results across all scenarios explored show that adaptive sampling outperforms random sampling, and suggest that adaptive designs can match the performance of random sampling with a fraction of the sample size.
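
As a toy illustration of the threshold-exceedance idea - with simple independent Beta-binomial site posteriors standing in for the paper's geospatial model, and uncertainty sampling standing in for its Bayesian optimization criterion - one can repeatedly direct testing to the sites whose hotspot status is least certain:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sites, threshold, batch_size = 50, 0.2, 5

true_prev = rng.uniform(0.0, 0.5, n_sites)   # unknown site-level prevalence
pos = np.zeros(n_sites)                      # positives observed per site
tests = np.zeros(n_sites)                    # individuals tested per site

def p_exceeds(threshold):
    # Monte Carlo posterior probability that prevalence exceeds the threshold,
    # under independent Beta(1 + positives, 1 + negatives) site posteriors.
    draws = rng.beta(1 + pos, 1 + tests - pos, size=(4000, n_sites))
    return (draws > threshold).mean(axis=0)

for _ in range(20):                          # 20 adaptive batches
    p_hot = p_exceeds(threshold)
    # Sample the batch of sites whose exceedance probability is nearest 0.5,
    # i.e., where hotspot classification is currently most uncertain.
    pick = np.argsort(np.abs(p_hot - 0.5))[:batch_size]
    for i in pick:
        tests[i] += 20
        pos[i] += rng.binomial(20, true_prev[i])

hot_calls = p_exceeds(threshold) > 0.5       # final hotspot classifications
```

The adaptive loop concentrates the testing budget on borderline sites, which is what lets such designs classify hotspots with fewer total samples than a uniform random survey.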

DOI: http://dx.doi.org/10.1038/s41598-020-67666-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7331748

July 2020

Epidemiology 2020 07;31(4):e34

Department of Health Care Policy, Harvard Medical School, Boston, MA.

DOI: http://dx.doi.org/10.1097/EDE.0000000000001190

July 2020

Epidemiology 2020 09;31(5):620-627

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA.

DOI: http://dx.doi.org/10.1097/EDE.0000000000001215
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8105880

September 2020

Biometrics 2021 03 27;77(1):329-342. Epub 2020 Apr 27.

Division of Research, Kaiser Permanente Northern California, Oakland, California.

In studies based on electronic health records (EHR), the frequency of covariate monitoring can vary by covariate type, across patients, and over time, which can limit the generalizability of inferences about the effects of adaptive treatment strategies. In addition, monitoring is a health intervention in itself, with costs and benefits, and stakeholders may be interested in the effect of monitoring when adopting adaptive treatment strategies. This paper demonstrates how to exploit nonsystematic covariate monitoring in EHR-based studies both to improve the generalizability of causal inferences and to evaluate the health impact of monitoring when evaluating adaptive treatment strategies. Using a real-world, EHR-based, comparative effectiveness research (CER) study of patients with type II diabetes mellitus, we illustrate how the evaluation of joint dynamic treatment and static monitoring interventions can improve CER evidence, and we describe two alternate estimation approaches based on inverse probability weighting (IPW). First, we demonstrate the poor performance of the standard estimator of the effects of joint treatment-monitoring interventions, due to a large decrease in data support and concerns over finite-sample bias from near-violations of the positivity assumption (PA) for the monitoring process. Second, we detail an alternate IPW estimator using a no-direct-effect assumption. We demonstrate that this estimator can improve efficiency, but at the potential cost of an increase in bias from violations of the PA for the treatment process.

DOI: http://dx.doi.org/10.1111/biom.13271

March 2021

Biometrics 2021 03 4;77(1):197-211. Epub 2020 May 4.

Division of Biostatistics, University of California, Berkeley, California.

Transported mediation effects may contribute to understanding how interventions work differently when applied to new populations. However, we are not aware of any estimators for such effects. Thus, we propose two doubly robust, efficient estimators of transported stochastic (also called randomized interventional) direct and indirect effects. We demonstrate their finite sample properties in a simulation study. We then apply the preferred substitution estimator to longitudinal data from the Moving to Opportunity Study, a large-scale housing voucher experiment, to transport stochastic indirect effect estimates of voucher receipt in childhood on subsequent risk of mental health or substance use disorder mediated through parental employment across sites, thereby gaining understanding of drivers of the site differences.

DOI: http://dx.doi.org/10.1111/biom.13274
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7664994

March 2021

Proc Natl Acad Sci U S A 2020 03 18;117(9):4571-4577. Epub 2020 Feb 18.

Department of Radiation Oncology, University of California, San Francisco, CA 94143.

Machine learning is proving invaluable across disciplines. However, its success is often limited by the quality and quantity of available data, while its adoption is limited by the level of trust afforded by given models. Human vs. machine performance is commonly compared empirically to decide whether a certain task should be performed by a computer or an expert. In reality, the optimal learning strategy may involve combining the complementary strengths of humans and machines. Here, we present expert-augmented machine learning (EAML), an automated method that guides the extraction of expert knowledge and its integration into machine-learned models. We used a large dataset of intensive-care patient data to derive 126 decision rules that predict hospital mortality. Using an online platform, we asked 15 clinicians to assess the relative risk of the subpopulation defined by each rule compared to the total sample. We compared the clinician-assessed risk to the empirical risk and found that, while clinicians agreed with the data in most cases, there were notable exceptions where they overestimated or underestimated the true risk. Studying the rules with greatest disagreement, we identified problems with the training data, including one miscoded variable and one hidden confounder. Filtering the rules based on the extent of disagreement between clinician-assessed risk and empirical risk, we improved performance on out-of-sample data and were able to train with less data. EAML provides a platform for automated creation of problem-specific priors, which help build robust and dependable machine-learning models in critical applications.

DOI: http://dx.doi.org/10.1073/pnas.1906831117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7060733

March 2020

Electron J Stat 2020 15;14(2):3032-3069. Epub 2020 Aug 15.

Department of Biostatistics, University of Washington, Seattle, Washington, USA.

In many problems, a sensible estimator of a possibly multivariate monotone function may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain. We demonstrate that this corrected estimator has no worse supremal estimation error than the initial estimator, and that analogously corrected confidence bands contain the true function whenever the initial bands do, at no loss to band width. Additionally, we demonstrate that the corrected estimator is asymptotically equivalent to the initial estimator if the initial estimator satisfies a stochastic equicontinuity condition and the true function is Lipschitz and strictly monotone. We provide simple sufficient conditions in the special case that the initial estimator is asymptotically linear, and illustrate the use of these results for estimation of a G-computed distribution function. Our stochastic equicontinuity condition is weaker than standard uniform stochastic equicontinuity, which has been required for alternative correction procedures. This allows us to apply our results to the bivariate correction of the local linear estimator of a conditional distribution function known to be monotone in its conditioning argument. Our experiments suggest that the projection step can yield significant practical improvements.
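
In one dimension, projection onto functions monotone over a grid is exactly isotonic regression (the pool-adjacent-violators algorithm). A minimal scikit-learn sketch of the univariate case - the paper's multivariate setting is more involved - illustrates the no-worse-sup-error property when the true function is monotone:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)

# A noisy initial estimate of a monotone function on a finite grid;
# the estimate itself need not be monotone.
grid = np.linspace(0.0, 1.0, 50)
truth = grid ** 2
initial = truth + rng.normal(scale=0.05, size=grid.size)

# L2 projection onto monotone-nondecreasing functions over the grid
# (1-D isotonic regression via pool-adjacent-violators).
corrected = IsotonicRegression(increasing=True).fit_transform(grid, initial)

# The corrected estimate is monotone, and its sup-norm distance to the
# monotone truth is no larger than that of the initial estimate.
sup_initial = np.max(np.abs(initial - truth))
sup_corrected = np.max(np.abs(corrected - truth))
```

The projection step is cheap relative to the initial estimation, which is why applying it as a post-hoc correction is attractive in practice.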

DOI: http://dx.doi.org/10.1214/20-ejs1740
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8112587

August 2020

J Am Stat Assoc 2020 21;115(532):1917-1932. Epub 2019 Oct 21.

Graduate Group in Biostatistics, University of California, Berkeley.

When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. We provide a general study of cross-validation-based estimators that highlights the source of this poor performance, and propose an alternative framework for estimation using techniques from the efficiency theory literature. We provide a theorem establishing the weak convergence of our estimators. The general theorem is applied in detail to two specific examples and we discuss possible extensions to other parameters of interest. For the two explicit examples that we consider, our estimators demonstrate remarkable finite-sample improvements over standard approaches.

DOI: http://dx.doi.org/10.1080/01621459.2019.1668794
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7954141

October 2019

Epidemiology 2020 05;31(3):e31

Department of Health Care Policy, Harvard Medical School, Boston, MA.

DOI: http://dx.doi.org/10.1097/EDE.0000000000001157

May 2020

Biometrics 2020 09 28;76(3):722-733. Epub 2019 Nov 28.

Division of Biostatistics, University of California, Berkeley, CA, USA.

Researchers in observational survival analysis are interested not only in estimating the survival curve nonparametrically but also in obtaining statistical inference for the parameter. We consider right-censored failure-time data where we observe n independent and identically distributed observations of a vector random variable consisting of baseline covariates, a binary treatment at baseline, a survival time subject to right censoring, and the censoring indicator. We allow the baseline covariates to affect both treatment and censoring, so that an estimator that ignores covariate information would be inconsistent. The goal is to use these data to estimate the counterfactual average survival curve of the population if all subjects were assigned the same treatment at baseline. Existing observational survival analysis methods do not yield monotone survival curve estimators, which is undesirable and may lose efficiency by not constraining the shape of the estimator using prior knowledge of the estimand. In this paper, we present a one-step targeted maximum likelihood estimator (TMLE) for estimating the counterfactual average survival curve. We show that this new TMLE can be executed via recursion in small local updates. We demonstrate the finite-sample performance of this one-step TMLE in simulations and in an application to monoclonal gammopathy data.


http://dx.doi.org/10.1111/biom.13172 | DOI Listing |

September 2020

Biometrics 2020 03 6;76(1):145-157. Epub 2019 Nov 6.

Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, California.

Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, one may estimate and contrast the population mean counterfactual outcome under specific exposure patterns. In such contexts, confounders of the longitudinal treatment-outcome association are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive model selection for this type of causal parameter were limited to the single time-point setting. We develop a longitudinal extension of a collaborative targeted minimum loss-based estimation (C-TMLE) algorithm that can be applied to perform variable selection in the models for the probability of treatment with the goal of improving the estimation of the population mean counterfactual outcome under a fixed exposure pattern. We investigate the properties of this method through a simulation study, comparing it to G-Computation and inverse probability of treatment weighting. We then apply the method in a real-data example to evaluate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma. The data for this study were obtained from the linkage of electronic health databases in the province of Quebec, Canada. The C-TMLE covariate selection approach allowed for a reduction of the set of potential confounders, which included baseline and longitudinal variables.


http://dx.doi.org/10.1111/biom.13135 | DOI Listing |

March 2020

Biometrics 2020 03 30;76(1):109-118. Epub 2019 Oct 30.

Division of Biostatistics, University of California, Berkeley, California.

Many estimators of the average effect of a treatment on an outcome require estimation of the propensity score, the outcome regression, or both. It is often beneficial to utilize flexible techniques, such as semiparametric regression or machine learning, to estimate these quantities. However, optimal estimation of these regressions does not necessarily lead to optimal estimation of the average treatment effect, particularly in settings with strong instrumental variables. A recent proposal addressed these issues via the outcome-adaptive lasso, a penalized regression technique for estimating the propensity score that seeks to minimize the impact of instrumental variables on treatment effect estimators. However, a notable limitation of this approach is that its application is restricted to parametric models. We propose a more flexible alternative that we call the outcome highly adaptive lasso. We discuss the large sample theory for this estimator and propose closed-form confidence intervals based on the proposed estimator. We show via simulation that our method offers benefits over several popular approaches.


http://dx.doi.org/10.1111/biom.13121 | DOI Listing |

March 2020

Stat Med 2019 07 25;38(16):3073-3090. Epub 2019 Apr 25.

Division of Research, Kaiser Permanente, Northern California, Oakland, California.

Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length (eg, quarterly or monthly intervals). The feasibility and practical impact of analyzing EHR data at a granular scale have not previously been evaluated. We begin to fill this gap by leveraging large-scale EHR data from a diabetes study to develop a scalable targeted learning approach that allows analyses with small intervals. We then study the practical effects of selecting different coarsening intervals on inferences by reanalyzing data from the same large-scale pool of patients. Specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15-, and 5-day intervals. We apply a semiparametric and doubly robust estimation approach, longitudinal Targeted Minimum Loss-Based Estimation (TMLE), to estimate the causal effects of four dynamic treatment rules with each dataset and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the "long-format TMLE," and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
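The mapping from daily records to coarser analytic intervals can be illustrated with a minimal sketch. The binning rule and last-observation summary below are assumptions chosen for illustration; the paper's long-format TMLE is a separate contribution not shown here.

```python
from collections import defaultdict

def coarsen(daily, width):
    """Partition one subject's (day, exposure) records into consecutive
    `width`-day intervals, summarizing each interval by its last
    observed exposure value."""
    bins = defaultdict(list)
    for day, exposure in daily:
        bins[day // width].append(exposure)
    # Keep the final exposure observed within each interval
    return {b: values[-1] for b, values in sorted(bins.items())}

# Daily records mapped to 30-day analytic intervals:
daily = [(0, "A"), (10, "A"), (35, "B"), (61, "A")]
# coarsen(daily, 30) -> {0: "A", 1: "B", 2: "A"}
```

Widening the interval (e.g. 90 days) collapses more records into each bin, which is exactly the trade-off between computational cost and granularity the paper examines.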


http://dx.doi.org/10.1002/sim.8164 | DOI Listing |

July 2019

J R Stat Soc Series B Stat Methodol 2019 Feb 2;81(1):75-99. Epub 2018 Nov 2.

Division of Biostatistics, University of California, Berkeley, Berkeley, CA, USA.

We present a novel family of nonparametric omnibus tests of the hypothesis that two unknown but estimable functions are equal in distribution when applied to the observed data structure. We develop these tests, which generalize the maximum mean discrepancy tests described in Gretton et al. [2006], using recent developments from the higher-order pathwise differentiability literature. Despite their complex derivation, the associated test statistics can be expressed rather simply as U-statistics. We study the asymptotic behavior of the proposed tests under the null hypothesis and under both fixed and local alternatives. We provide examples to which our tests can be applied and show that they perform well in a simulation study. As an important special case, our proposed tests can be used to determine whether an unknown function, such as the conditional average treatment effect, is equal to zero almost surely.


http://dx.doi.org/10.1111/rssb.12299 | DOI Listing |

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6476331 | PMC |

February 2019

Epidemiology 2019 05;30(3):334-341

Department of Health Care Policy, Harvard Medical School, Boston, MA.

We consider the problem of selecting the optimal subgroup to treat when data on covariates are available from a randomized trial or observational study. We distinguish among four settings: (1) treatment selection when resources are constrained; (2) treatment selection when resources are not constrained; (3) treatment selection in the presence of side effects and costs; and (4) treatment selection to maximize effect heterogeneity. We show that, in each of these cases, the optimal treatment selection rule involves treating those for whom the predicted mean difference in outcomes, comparing those with versus without treatment conditional on covariates, exceeds a certain threshold. The threshold varies across these four scenarios, but the form of the optimal treatment selection rule does not. The results suggest a move away from traditional subgroup analysis for personalized medicine. New randomized trial designs are proposed so as to implement and make use of optimal treatment selection rules in healthcare practice.
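The common form of the rule can be written down directly; only the threshold is scenario-specific. The helper below is a hypothetical illustration of that thresholding form, not code from the paper.

```python
def treat(predicted_effects, threshold):
    """Optimal selection rule: treat exactly those subjects whose
    predicted conditional mean difference in outcomes (treated minus
    untreated, given covariates) exceeds the given threshold."""
    return [effect > threshold for effect in predicted_effects]

# With no resource constraints, side effects, or costs, the threshold
# is zero; side effects, costs, or scarcity raise it above zero.
effects = [0.5, -0.2, 0.1]
# treat(effects, 0.0) -> [True, False, True]
# treat(effects, 0.3) -> [True, False, False]
```

Raising the threshold shrinks the treated subgroup, which is how the four scenarios differ while sharing one rule.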


http://dx.doi.org/10.1097/EDE.0000000000000991 | DOI Listing |

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456380 | PMC |

May 2019

Biometrics 2019 09 3;75(3):768-777. Epub 2019 Apr 3.

Division of Biostatistics, University of California at Berkeley, Berkeley, California.

The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved with either the aid of a known underlying network, or the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this article, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly-robust, semiparametric efficient, and continues to allow for incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application where we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.


http://dx.doi.org/10.1111/biom.13034 | DOI Listing |

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6679813 | PMC |

September 2019

J Appl Stat 2019 22;46(12):2216-2236. Epub 2019 Feb 22.

Division of Biostatistics, University of California, Berkeley.

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a "library" of candidate prediction models. While SL has been widely studied in a number of settings, it has not been thoroughly evaluated in the large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we applied SL and evaluated its ability to predict the propensity score (PS), the conditional probability of treatment assignment given baseline covariates, using three electronic healthcare databases. We considered a library of algorithms that consisted of both nonparametric and parametric models. We also proposed a novel strategy for prediction modeling that combines SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining the SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
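The core mechanism, cross-validation over a library of candidates, can be sketched in its simplest "discrete" form, which returns the single library member with the best cross-validated negative log-likelihood. This is an illustrative sketch only; the full Super Learner instead forms an optimally weighted combination of the candidates, and the learner names below are assumptions.

```python
import math
import random

def discrete_super_learner(xs, ys, library, k=5, seed=0):
    """Cross-validation selector over `library`, a dict mapping a
    learner name to a (fit, predict_prob) pair of callables. Each
    candidate is scored by its cross-validated negative log-likelihood
    for a binary outcome; the best-scoring name is returned."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = {}
    for name, (fit, predict_prob) in library.items():
        nll = 0.0
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                # Clip probabilities away from 0/1 for a finite log-loss
                p = min(max(predict_prob(model, xs[i]), 1e-12), 1 - 1e-12)
                nll -= ys[i] * math.log(p) + (1 - ys[i]) * math.log(1 - p)
        scores[name] = nll / len(ys)
    return min(scores, key=scores.get), scores
```

For example, with a hypothetical two-member library of a marginal-mean learner and a constant 0.5 learner, an imbalanced binary outcome lets the marginal learner win on cross-validated log-loss.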


http://dx.doi.org/10.1080/02664763.2019.1582614 | DOI Listing |

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7444746 | PMC |

February 2019
