Getting webpage source code with Python
⑴ Looking for Python code to fetch a web page
In Python 3.x, use the urllib.request module to fetch page source: urllib.request.urlopen() opens the URL and returns a stream-like response, read() pulls out the raw bytes, and decode() turns those bytes into text. (Find the encoding in the page source itself; e.g. <meta http-equiv="content-type" content="text/html;charset=gbk" /> means the page is gbk-encoded, as in the example below.) The result is the page's source code.
For example, to fetch a page's source:
import urllib.request

# The original example's URL was lost in extraction; substitute your target page.
html = urllib.request.urlopen('http://example.com').read().decode('gbk')  # note: decode with the page's declared encoding
print(html)
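A slightly more robust sketch (not from the original answer) reads the charset from the Content-Type response header instead of hardcoding it, falling back to utf-8 when none is declared:

import urllib.request

with urllib.request.urlopen('http://example.com') as resp:  # placeholder URL
    # resp.headers is an email.message.Message; get_content_charset()
    # returns the charset named in the Content-Type header, or None.
    charset = resp.headers.get_content_charset() or 'utf-8'
    html = resp.read().decode(charset)
print(html)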
The urllib.request.urlopen function is documented as follows:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.
data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object, in which case a Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in the Content-Type header may be used to specify the encoding. If the charset parameter is not sent with the Content-Type header, the server, following the HTTP 1.1 recommendation, may assume that the data is encoded in ISO-8859-1. It is advisable to use a charset parameter matching the encoding used in the Content-Type header of the Request.
The urllib.request module uses HTTP/1.1 and includes a Connection:close header in its HTTP requests.
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.
The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().
The cadefault parameter is ignored.
For http and https URLs, this function returns an http.client.HTTPResponse object, which has the methods documented under HTTPResponse Objects.
For ftp, file, and data URLs, and requests explicitly handled by the legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object, which can work as a context manager and has methods such as:
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)
getcode() — return the HTTP status code of the response.
Raises URLError on errors.
Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.
The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.
Changed in version 3.2: cafile and capath were added.
Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).
New in version 3.2: data can be an iterable object.
Changed in version 3.3: cadefault was added.
Changed in version 3.4.3: context was added.
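To make the data parameter behavior described above concrete, here is a minimal POST sketch (the endpoint URL is a placeholder, not from the original):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'q': 'python', 'page': 1})
data = params.encode('utf-8')  # urlencode returns str; urlopen wants bytes

# Supplying data turns the request into a POST.
with urllib.request.urlopen('http://example.com/search', data=data, timeout=10) as resp:
    body = resp.read().decode(resp.headers.get_content_charset() or 'utf-8')
print(body)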
⑵ Why do Chinese characters display incorrectly in page source fetched with requests in Python?
Check the page's encoding; if it is gbk, for example, set r.encoding = 'gbk'. The following is excerpted from the requests documentation:
requests automatically decodes content from the server. Most unicode character sets are decoded seamlessly.
After a request is sent, requests makes an educated guess about the response's encoding based on the HTTP headers. When you access r.text, requests uses its guessed text encoding. You can find out what encoding requests is using, and change it, via the r.encoding attribute:
>>> r.encoding
'utf-8'
>>> r.encoding = 'iso-8859-1'
If you change the encoding, requests will use the new value of r.encoding whenever you access r.text. You might want to do this wherever you can apply special logic to work out what the encoding of the content will be: for example, HTML and XML can specify their encoding in their own body. In that case, you should use r.content to find the encoding, then set r.encoding to it. This lets you parse r.text with the correct encoding.
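A minimal sketch of that workflow (the URL is a placeholder; r.apparent_encoding is requests' guess derived from the raw bytes of the body):

import requests

r = requests.get('http://example.com')  # placeholder URL

# r.encoding comes from the HTTP headers and may be missing or wrong;
# r.apparent_encoding is guessed from the raw bytes in r.content.
if r.encoding is None or r.encoding.lower() == 'iso-8859-1':
    r.encoding = r.apparent_encoding

print(r.text)  # decoded with the corrected encoding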
⑶ How to get the source code of an iframe in a page with Python
Here is a simple example; you can change the frame path yourself. When the iframe points at another site, its source cannot be read: that touches on security (same-origin) restrictions, so the path must be a legitimate, accessible one.

<script>
function GetFrameInnerHtml(objIFrame) {
    var iFrameHTML = "";
    if (objIFrame.contentDocument) {
        // Netscape / standards-compliant browsers
        iFrameHTML = objIFrame.contentDocument.innerHTML;
    } else if (objIFrame.contentWindow) {
        // IE5.5 and IE6
        iFrameHTML = objIFrame.contentWindow.document.body.innerHTML;
    } else if (objIFrame.document) {
        // IE5
        iFrameHTML = objIFrame.document.body.innerHTML;
    }
    return iFrameHTML;
}
</script>
<iframe id="ifa" src="1.html"></iframe>
<input type="button" value="click" onclick="alert(GetFrameInnerHtml(document.getElementById('ifa')))"/>
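The snippet above is JavaScript running in the page itself. In Python, a common approach (a sketch, assuming the iframe's src is directly fetchable) is to find the iframe's src in the outer page and request it separately:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'http://example.com/page.html'  # placeholder URL
html = requests.get(page_url).text

soup = BeautifulSoup(html, 'html.parser')
iframe = soup.find('iframe')
if iframe and iframe.get('src'):
    # Resolve a relative src against the page URL, then fetch it.
    iframe_url = urljoin(page_url, iframe['src'])
    print(requests.get(iframe_url).text)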
⑷ Why do Chinese characters display incorrectly in page source fetched with requests in Python?
The editor's encoding does not match the page's encoding.
Editors generally standardize on UTF-8; decode the page with its own encoding, then encode to UTF-8.
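A one-line sketch of that conversion (assuming the fetched bytes are gbk-encoded):

raw = b'...'  # bytes fetched from a gbk-encoded page (placeholder)
text = raw.decode('gbk')           # bytes -> str using the page's encoding
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes for saving or printing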
⑸ The page source fetched with requests differs from what right-click View Source shows; please help!!! The code is below; not sure what is wrong.
After requesting url = 'https://www..com/s?wd=周傑倫', print(res.text) prints only the response body returned by that one request.
The page source you see via right-click, by contrast, is the final page formed by the requested URL's HTML plus the page's other requests: JS requests, static resources such as images, CSS, and so on. That is why the two differ.
⑹ Python 3.9: how to convert Chinese-character codes scraped from page source back into Chinese characters
I used to think crawlers were something grand, like stealing other people's data. Now I know that a crawler can only fetch what is visible on the page; in other words, whatever others let you see.
A crawler first fetches the page's source code and then filters out the resources you want from it: images, videos, even the text on the page. Next, let's use Python to crawl the images on a page.
First, fetch the site's source code.
Then parse the resources you want out of the mass of source; here I want the site's images.
Personally, I think what this kind of crawler really tests is your regular-expression skills: writing regexes good enough to pull out every resource you want. The rest is relatively simple; a sketch follows below.
(The original post showed some of the crawled images here.)
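A minimal sketch of the approach described above (the URL and the image pattern are illustrative assumptions, not from the original post):

import os
import re
import requests

page_url = 'http://example.com/gallery.html'  # placeholder URL
html = requests.get(page_url).text

# Naive pattern for src attributes ending in common image extensions;
# real pages often need a more careful regex or an HTML parser.
img_urls = re.findall(r'src="(https?://[^"]+\.(?:jpg|jpeg|png|gif))"', html)

os.makedirs('images', exist_ok=True)
for i, img_url in enumerate(img_urls):
    data = requests.get(img_url).content
    ext = img_url.rsplit('.', 1)[-1]
    with open(os.path.join('images', str(i) + '.' + ext), 'wb') as f:
        f.write(data)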
⑺ How to grab the current page's HTML content with Python
Python is quite good for data processing, and if you want to write a crawler, Python is a fine choice: it comes with many ready-made packages, and a call or two completes fairly complex tasks. Everything in this answer builds on the BeautifulSoup package.
1. Getting a page's content (i.e., its source code) in Python
import urllib2  # Python 2; in Python 3 use urllib.request instead

url = 'http://example.com'  # placeholder address
page = urllib2.urlopen(url)
contents = page.read()
# contents now holds the entire page, i.e., its source code
print(contents)

Here url is the address, contents is the source code served at that address, and urllib2 is the required package; these few lines fetch a page's entire source.
2. Getting the content you want from a page (first fetch the source, then analyze it, find the matching tags, and extract the content inside them), as in the sketch below.
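A minimal sketch of step 2 with BeautifulSoup (Python 3 here; the URL and the <a> tag are illustrative):

import urllib.request
from bs4 import BeautifulSoup

url = 'http://example.com'  # placeholder URL
contents = urllib.request.urlopen(url).read()

soup = BeautifulSoup(contents, 'html.parser')
# Find the tags you care about and pull out their attributes and text.
for link in soup.find_all('a'):
    print(link.get('href'), link.get_text(strip=True))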
⑻ How to read HTML code with requests in Python
An example working toward using Python 3's requests module to grab page source and save it to a file (the snippet below first demonstrates just the file I/O half):
import requests  # imported for the page-fetching step; unused in this file-copy demo

ff = open('testt.txt', 'w', encoding='utf-8')
with open('test.txt', encoding='utf-8') as f:
    for line in f:
        ff.write(line)
ff.close()
This demonstrates reading a txt file line by line and saving the lines into another txt file.
Printing each line on the command line raised encoding errors for Chinese, so each line is written to another file instead to verify that reading works. (Note: specify the encoding when calling open.)
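For the requests half the question actually asks about, a minimal sketch (placeholder URL) that fetches a page and saves its source:

import requests

r = requests.get('http://example.com')  # placeholder URL
r.encoding = r.apparent_encoding  # guard against a mis-detected encoding

with open('page.html', 'w', encoding='utf-8') as f:
    f.write(r.text)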
⑼ How to get the entire <body> content from page source in Python
Generally it goes like this: fetch the HTML with the requests library, then extract the content, for instance with a parser such as BeautifulSoup (as below) or with regular expressions. For example:
import requests
from bs4 import BeautifulSoup

txt = requests.get("https://www.gov.cn/").text  # fetch the page
a = BeautifulSoup(txt, 'html.parser')  # build the parser
print(a.body)  # get the <body> content; a.title or other tags work too
⑽ Python scrape returns no page source
If the scraped page comes back without source, you can use driver.current_url (Selenium) to get the current window's URL, issue the get again, and after an appropriate delay the source can be fetched correctly.
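A sketch of that advice with Selenium (the driver setup, URL, and wait time are assumptions; the answer names only driver.current_url, a repeated get, and a delay):

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a local ChromeDriver is available
driver.get('http://example.com')  # placeholder URL

# Re-request the URL the window actually landed on, then wait for
# the dynamic content to render before reading the source.
driver.get(driver.current_url)
time.sleep(3)  # crude fixed delay; an explicit wait would be more robust
print(driver.page_source)
driver.quit()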