How to webscrape secured pages in R (https links) using readHTMLTable from the XML package?

XML Https · Posted 2018-10-14 20:03:40

There are good answers about how to use readHTMLTable from the XML package, and I have done that with regular http pages, but I cannot solve my problem with https pages.

I am trying to read the table on this website (url string):

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File https://ned.nih.gov/search/Vi… does not exist.

I tried to get past the https problem with the first two lines below, based on a solution I found through Google (such as this one: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/ ).

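This trick lets me see more of the page, but every attempt to extract the table fails. Any advice is appreciated. I need the table fields such as Organization, Organizational Title, and Manager.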

# Attempt to get past the https problem (getURL() comes from the RCurl package)
raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...
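
One possible reading of the two errors above is that htmlParse() is treating the downloaded string as a file name rather than as HTML. The XML package has an asText argument that forces the input to be parsed as content. The sketch below illustrates this on a small made-up HTML string that starts with the same "\r\n" as the real response; the table id "demo" and its contents are assumptions for illustration, not data from the NIH page.

```r
library(XML)

# Stand-in for the string returned by getURL(); the markup is made up.
raw <- paste0(
  "\r\n<!DOCTYPE html>\n<html><body>",
  "<table id=\"demo\">",
  "<tr><td>Legal Name:</td><td>Jane Doe</td></tr>",
  "<tr><td>Phone:</td><td>000-000-0000</td></tr>",
  "</table></body></html>"
)

# asText = TRUE tells the parser that `raw` is HTML content, not a path or URL.
doc <- htmlParse(raw, asText = TRUE)

# readHTMLTable() on a parsed document returns a list with one entry per <table>.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
print(tables)
```

Whether this alone is enough for the real page is not confirmed here; it only shows that a string can be parsed without touching the file system.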

The new package httr provides a wrapper around RCurl that makes it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are simpler ways of doing it.

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
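# content(page, as = "text") replaces text_content(), which was deprecated in later httr releases
x <- content(page, as = "text")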
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

Result:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                       [email protected]
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â
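
As one possibly simpler alternative to the regex step, the table could be selected with an XPath query after parsing the page text. The sketch below is a self-contained illustration on a made-up HTML string. Only the class name "grid" is taken from the regex above; the id "person", the second table, and all cell values are hypothetical.

```r
library(XML)

# Made-up page with a layout table and the table of interest (class "grid").
html <- paste0(
  "<html><body>",
  "<table class=\"layout\"><tr><td>navigation</td></tr></table>",
  "<table class=\"grid\" id=\"person\">",
  "<tr><td>Legal Name:</td><td>Jane Doe</td></tr>",
  "<tr><td>Phone:</td><td>000-000-0000</td></tr>",
  "</table>",
  "</body></html>"
)

# Parse the string as HTML content rather than as a file name.
doc <- htmlParse(html, asText = TRUE)

# Select only the <table> elements whose class attribute is exactly "grid".
nodes <- getNodeSet(doc, "//table[@class='grid']")

# Convert the first matching node to a data frame.
person <- readHTMLTable(nodes[[1]], stringsAsFactors = FALSE)
print(person)
```

With the real page, the string from the GET() request would take the place of the made-up html value. Compared with the regex, the XPath query does not depend on where the closing table tag falls in the text, though it does assume the class attribute contains nothing but "grid".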

Get httr here: http://cran.r-project.org/web/packages/httr/index.html

EDIT: A useful page of frequently asked questions about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html

http://stackoverflow.com/questions/10692066/how-to-webscrape-secured-pages-in-r-https-links-using-readhtmltable-from-xml
