Using estimatr in the Tidyverse

ggplot2 Bootstrap · Published 2018-10-28 10:20:41

estimatr provides fast OLS and IV regression with robust standard errors. This article shows how estimatr integrates with RStudio's tidyverse suite of packages.

Getting tidy

The first step toward the tidyverse is to turn model output into data we can manipulate. The tidy function converts an lm_robust object into a data.frame.

library(estimatr)
fit <- lm_robust(Fertility ~ Agriculture + Catholic, data = swiss)
tidy(fit)
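
Before manipulating the result, it can help to look at its shape. The following is a minimal sketch that refits the same model and inspects the tidied output; the name tidy_fit is introduced here for illustration only, and the exact set of columns may vary with the estimatr version.

```r
library(estimatr)

# Same model as above, fit on the built-in swiss data
fit <- lm_robust(Fertility ~ Agriculture + Catholic, data = swiss)
tidy_fit <- tidy(fit)

class(tidy_fit)   # the tidied result is a data.frame
names(tidy_fit)   # column names such as term, estimate, std.error, ...
nrow(tidy_fit)    # one row per coefficient, including the intercept
```

Because each coefficient is a row and each statistic is a column, the usual data.frame tools apply directly.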

Data manipulation with dplyr

Once the regression fit is a data.frame, you can use any of dplyr's "verbs" for data manipulation, such as mutate, filter, select, summarise, group_by, and arrange (see the dplyr documentation for more).

library(tidyverse)

# lm_robust and filter
fit %>% tidy %>% filter(term == "Agriculture")

# lm_robust and select
fit %>% tidy %>% select(term, estimate, std.error)

# lm_robust and mutate
fit %>% tidy %>% mutate(t_stat = estimate/std.error, significant = p.value <= 0.05)
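
The other verbs chain in the same way. As a small sketch (the object name ranked is an illustrative choice, not part of estimatr), the following drops the intercept and sorts the remaining terms by the absolute size of their estimates:

```r
library(estimatr)
library(dplyr)

fit <- lm_robust(Fertility ~ Agriculture + Catholic, data = swiss)

# Drop the intercept, then sort the remaining terms by the size of their estimate
ranked <- fit %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(abs(estimate))) %>%
  select(term, estimate, std.error, p.value)

ranked
```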

Data visualization with ggplot2

ggplot2 offers a number of data visualization tools that are compatible with estimatr.

1 Make a coefficient plot

fit %>% tidy %>% filter(term != "(Intercept)") %>%
  ggplot(aes(y = term, x = estimate)) +
  geom_vline(xintercept = 0, linetype = 2) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.1) +
  theme_bw()

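2 Confidence intervals based on robust variance estimates (rather than "classical" variance estimates) can be added to plots with the geom_smooth and stat_smooth functions.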

library(ggplot2)
ggplot(swiss, aes(x = Agriculture, y = Fertility)) + geom_point() + geom_smooth(method = "lm_robust") + 
theme_bw()
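
To see numerically how robust and classical intervals can differ, the sketch below fits the same bivariate model twice using the se_type argument of lm_robust. It assumes "classical" and "HC2" are accepted values (HC2 being, as far as I know, the default when no clusters are given); the names classical, robust, and ci_compare are illustrative.

```r
library(estimatr)
library(dplyr)

# se_type = "classical" gives the usual OLS standard errors;
# "HC2" is a heteroskedasticity-robust alternative
classical <- lm_robust(Fertility ~ Agriculture, data = swiss, se_type = "classical")
robust    <- lm_robust(Fertility ~ Agriculture, data = swiss, se_type = "HC2")

# Stack the two tidied fits and keep the slope on Agriculture
ci_compare <- bind_rows(
  classical = tidy(classical),
  robust    = tidy(robust),
  .id = "variance"
) %>%
  filter(term == "Agriculture") %>%
  select(variance, estimate, std.error, conf.low, conf.high)

ci_compare
```

The point estimates are identical across the two rows; only the standard errors and the interval endpoints change with the variance estimator.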

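Note that the functional form can include polynomials. For example, if the model is a cubic polynomial in x, $E[y \mid x] = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, we can fit it like this: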

library(ggplot2)
ggplot(swiss, aes(x = Agriculture, y = Fertility)) + geom_point() + geom_smooth(method = "lm_robust", 
formula = y ~ poly(x, 3, raw = TRUE)) + theme_bw()
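
The plot shows the fitted curve but not the coefficients behind it. As a sketch, the same cubic specification can be passed to lm_robust directly and tidied; cubic_fit and cubic_tidy are names chosen here for illustration.

```r
library(estimatr)

# Cubic polynomial in Agriculture, using raw (non-orthogonal) powers
cubic_fit <- lm_robust(Fertility ~ poly(Agriculture, 3, raw = TRUE), data = swiss)
cubic_tidy <- tidy(cubic_fit)

# One row for the intercept plus one per power of Agriculture
cubic_tidy[, c("term", "estimate", "std.error", "conf.low", "conf.high")]
```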

Bootstrap using rsample

The rsample package provides tools for bootstrapping:

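library(rsample)
library(knitr)  # for kable()

# Refit the model on each of 500 bootstrap resamples and stack the tidied results
boot_out <- bootstraps(data = swiss, times = 500)$splits %>%
  map(~ lm_robust(Fertility ~ Catholic + Agriculture, data = analysis(.))) %>%
  map(tidy) %>%
  bind_rows(.id = "bootstrap_replicate")
kable(head(boot_out))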

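boot_out is a data.frame that contains the estimates from each bootstrapped sample. We can then use dplyr functions to summarize the bootstraps, tidyr functions to reshape the estimates, and GGally::ggpairs to visualize them.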

boot_out %>% group_by(term) %>% summarise(boot_se = sd(estimate))

library(GGally)
boot_out %>% select(bootstrap_replicate, term, estimate) %>% spread(key = term, 
value = estimate) %>% select(-bootstrap_replicate) %>% ggpairs(lower = list(continuous = wrap("points", 
alpha = 0.1))) + theme_bw()
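
Beyond the bootstrap standard error, the same draws can give percentile intervals. The sketch below is self-contained and uses 200 resamples and an arbitrary seed to keep it quick and reproducible; both choices, and the names boot_est and boot_summary, are assumptions made for illustration.

```r
library(estimatr)
library(rsample)
library(dplyr)
library(purrr)

set.seed(42)  # arbitrary seed so the resamples are reproducible

# Fit the model on each bootstrap resample and stack the tidied estimates
boot_est <- bootstraps(swiss, times = 200)$splits %>%
  map(~ tidy(lm_robust(Fertility ~ Catholic + Agriculture, data = analysis(.x)))) %>%
  bind_rows(.id = "bootstrap_replicate")

# Bootstrap standard error and 95% percentile interval for each term
boot_summary <- boot_est %>%
  group_by(term) %>%
  summarise(
    boot_se = sd(estimate),
    ci_low  = quantile(estimate, 0.025),
    ci_high = quantile(estimate, 0.975)
  )

boot_summary
```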

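Multiple models using purrr

purrr provides tools for performing the same operation on every element of a vector. For example, we may want to estimate a model on different subsets of the data. We can use the map function to do this.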

library(purrr)

# Running the same model for highly educated and less educated cantons/districts

two_subsets <-
  swiss %>%
  mutate(HighlyEducated = as.numeric(Education > 8)) %>%
  split(.$HighlyEducated) %>%
  map( ~ lm_robust(Fertility ~ Catholic, data = .)) %>%
  map(tidy) %>%
  bind_rows(.id = "HighlyEducated")

kable(two_subsets, digits = 2)

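Alternatively, we might want to regress different dependent variables on the same independent variable. map can be used together with estimatr functions for this as well.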

three_outcomes <- c("Fertility", "Education", "Agriculture") %>% map(~formula(paste0(., 
" ~ Catholic"))) %>% map(~lm_robust(., data = swiss)) %>% map_df(tidy)

kable(three_outcomes, digits = 2)

Using ggplot2, we can make a coefficient plot:

three_outcomes %>% filter(term == "Catholic") %>% ggplot(aes(x = estimate, y = outcome)) + 
geom_vline(xintercept = 0, linetype = 2) + geom_point() + geom_errorbarh(aes(xmin = conf.low, 
xmax = conf.high, height = 0.1)) + ggtitle("Slopes with respect to `Catholic`") + 
theme_bw()

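Concluding thoughts

Using estimatr functions in the tidyverse is easy once the model output has been turned into a data.frame. We do this with the tidy function. After that, many possibilities for summarizing and visualizing open up. Happy tidying!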

Original article: https://declaredesign.org/r/estimatr/articles/estimatr-in-the-tidyverse.html

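Copyright notice: the author reserves all rights. Modification is prohibited; when reposting, please cite the link to the original article.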
