
Python webpage source code

Published: 2023-06-03 16:37:39

❶ How can I get the HTML source of the currently opened page when using Selenium's webdriver package in Python?

  1. You can use the browser's built-in F12 developer tools.

  2. Or right-click the page and choose Inspect Element to view the current HTML source.
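
Neither answer above actually uses the webdriver API. For completeness, a minimal sketch of getting the rendered source directly from Selenium, assuming a Chrome driver is installed and with a placeholder URL:

from selenium import webdriver

driver = webdriver.Chrome()           # assumes chromedriver is available on PATH
driver.get("https://example.com")     # placeholder URL
html = driver.page_source             # HTML of the page after the browser has rendered it
print(html)
driver.quit()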

❷ Python: the page source fetched with requests doesn't match the source shown by right-click → View Page Source. Please help!!! The code is below; I can't tell what's wrong.

After requests fetches url = 'https://www..com/s?wd=周傑倫', print(res.text) prints only the response body returned for that single request to url = 'https://www..com/s?wd=周傑倫'.

The source you see with right-click → View Page Source, on the other hand, is the final page formed by your requested URL plus all the additional requests the page makes (JS, images and other static resources, CSS, etc.), as the screenshot shows. That is why the two differ.
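
A minimal sketch of the call in question, with a placeholder standing in for the elided domain:

import requests

url = 'https://example.com/s?wd=周傑倫'   # placeholder; the domain in the original question is elided
res = requests.get(url)
print(res.text)   # only the initial HTML document; the JS, CSS and image requests are never made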


❸ How can a Python crawler get the source of a dynamic web page?

A month ago my internship supervisor assigned me the task of using a web crawler to fetch the rainfall data published by the Shenzhen Meteorological Bureau. The page looks like this:

I figured crawling wouldn't be too hard; back when zjb and I scraped Jandan's 無聊圖 board, how smug I was. But after taking the task I had a month of exams and homework piled up, my supervisor didn't push, and I didn't hurry either.

Still, the fact that my supervisor was willing to wait a whole month for me to write it suggested it had to be fairly hard... Today I opened the page and, sure enough, it is. The site is built on Ajax and fetches its data dynamically, so you cannot simply download the source and parse it.

I got my inspiration from a certain delinquent's example of scraping Taobao MM: in cases like this, you can generally build and drive a browser yourself. PhantomJS and CasperJS are both good choices.

My supervisor wanted hourly rainfall for every station in every Shenzhen district over the past year. That requires the history query shown above, i.e. querying by a time value, and that time is stored in an input tag of type hidden. You can, of course, change it to type text with a JS statement and then do send_keys and the like. However, I failed: the time could be modified, but the result was as shown below.
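
For reference, the hidden-input trick described above would look roughly like this; the element id, URL and time format are hypothetical, and as noted it did not work on this particular site:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")   # placeholder for the weather site
# Switch the hidden input to a visible text input via JavaScript (the id is hypothetical)
driver.execute_script("document.getElementById('queryTime').type = 'text';")
time_input = driver.find_element_by_id("queryTime")
time_input.clear()
time_input.send_keys("2014-06-01 08:00")   # hypothetical time format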

So I settled for scraping only the real-time data, using Python's Selenium to drive a browser and simulate the clicks and other actions a person would perform, so that the data gets generated and can be collected. A big advantage of Selenium is that it can return the page source after rendering, i.e. the source after the actions have been executed. The ordinary approach of parsing a page from its URL can only fetch the data it is given and cannot interact with the page. Selenium gets the rendered source and offers a rich set of lookup tools; in my view the most useful is find_element_by_xpath("xxx"). Once an element is found this way you can click it, type into it, and so on, which sends requests to the server and produces the data you need.

#coding=utf-8
from testString import *
from selenium import webdriver
import string
import os
from selenium.webdriver.common.keys import Keys
import time
import sys

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

district_navs = ['nav2', 'nav1', 'nav3', 'nav4', 'nav5', 'nav6', 'nav7', 'nav8', 'nav9', 'nav10']
district_names = ['福田區', '羅湖區', '南山區', '鹽田區', '寶安區', '龍崗區', '光明新區', '坪山新區', '龍華新區', '大鵬新區']

flag = 1
while flag > 0:
    driver = webdriver.Chrome()
    driver.get("hianCe/")  # URL truncated in the archived post
    # Select the rainfall view
    driver.find_element_by_xpath("//span[@id='fenqu_H24R']").click()
    filename = time.strftime("%Y%m%d%H%M", time.localtime(time.time())) + '.txt'
    # Create the output file
    output_file = open(filename, 'w')
    # Iterate over the administrative districts
    for i in range(len(district_navs)):
        driver.find_element_by_xpath("//div[@id='" + district_navs[i] + "']").click()
        # print driver.page_source
        timeElem = driver.find_element_by_id("time_shikuang")
        # Write the time and the district name
        output_file.write(timeElem.text + ',')
        output_file.write(district_names[i] + ',')
        elems = driver.find_elements_by_xpath("//span[@onmouseover='javscript:changeTextOver(this)']")
        # Write each station's data: station name, one-hour rainfall, cumulative rainfall for the day
        for elem in elems:
            output_file.write(AMonitorRecord(elem.get_attribute("title")) + ',')
        output_file.write(' ')
    output_file.close()
    driver.close()
    time.sleep(3600)

The testString module imported above only adjusts the output format and extracts the valid data:
#Encoding=utf-8
def OnlyCharNum(s, oth=''):
    s2 = s.lower()
    fomart = ',.'
    # every character of s that is not listed in fomart is stripped
    for c in s2:
        if c not in fomart:
            s = s.replace(c, '')
    return s

def AMonitorRecord(str):
    str = str.split(":")
    return str[0] + "," + OnlyCharNum(str[1])


Data is scraped once an hour; the results are as follows:

❹ How do I crawl a web page's source code with Python?

#!/usr/bin/env python3
#-*- coding=utf-8 -*-

import urllib3

if __name__ == '__main__':
    http = urllib3.PoolManager()
    r = http.request('GET', 'IP')  # 'IP' stands in for the target URL, which is elided in the original
    print(r.data.decode("gbk"))

This fetches the page correctly. You need to have urllib3 installed; the Python version used here is 3.4.3.
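
Since 'IP' in the snippet above stands in for an elided URL, a self-contained variant of the same approach with a placeholder address (assuming the page is UTF-8 encoded) would be:

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'https://example.com/')   # placeholder URL
print(r.data.decode('utf-8'))                     # decode with the page's actual charset (gbk in the answer above)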

❺ How can I use Python or R to scrape a web page's hidden source code?

Hidden source code? I'm not sure exactly what you mean. I can read it two ways: (1) content that is not shown on the front end but is present when you view the source, and (2) asynchronously loaded content that appears neither on the front end nor in the source. The first is easy to handle, so presumably you mean the second. There are three ways to deal with it:

  1. Simulate a browser and fetch the page dynamically, using the heavy artillery: Selenium.

    With this approach, anything you can see you can scrape, including mouse-over content and asynchronously loaded content, because Selenium behaves exactly like a browser. It is also the slowest approach, so I don't recommend it unless nothing else works.

  2. Execute the JS code yourself.

    Run the asynchronously loading JS from Python to obtain things like mouse-over content or "load more" results. But modern sites contain a huge amount of JS, finding the exact code you need to execute is difficult and time-consuming, and Python's JS compatibility is not great either, so I don't recommend this one.

  3. Find the asynchronously loaded JSON file. This is the most common, most convenient and most effective method, the one I use most often for dynamic, asynchronously loaded sites; it solves 99% of my problems. Open the browser's developer tools, switch to the Network tab, reload the page, and in the request list locate the JSON file that is loaded asynchronously during page load. Taking JD (京東) as an example, as in the screenshots, the first shows the asynchronously loaded stock-information JSON and the second the asynchronously loaded comments JSON. (A sketch of this approach follows below.)

More detailed instructions can be found by searching Google or elsewhere online.
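
A minimal sketch of method 3, with a hypothetical endpoint, parameters and field names standing in for whatever the Network tab actually reveals for the target site:

import requests

# Hypothetical JSON endpoint discovered in the browser's Network tab; the real
# URL, query parameters and response structure depend on the site being scraped.
api_url = 'https://example.com/api/comments'
params = {'productId': '12345', 'page': 1}
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(api_url, params=params, headers=headers)
data = resp.json()   # the dynamically loaded content, already structured
print(data)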

❻ Looking for Python code that fetches a web page

In Python 3.x you fetch page source with the urllib.request module: urllib.request.urlopen opens the page and returns a stream, read() reads out the raw bytes, and decode() converts those bytes to text using the page's encoding (the encoding can be found in the page source, e.g. <meta http-equiv="content-type" content="text/html;charset=gbk" />; the example below uses gbk). That gives you the page's source code.

For example, to fetch this page's source:

import urllib.request

html = urllib.request.urlopen('
    ).read().decode('gbk')  # note: after fetching, decode according to the page's encoding
print(html)
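
The URL argument above was stripped when this answer was archived. A self-contained variant of the same approach, with a placeholder URL, would be:

import urllib.request

url = 'https://example.com/'   # placeholder; substitute the page you want to fetch
html = urllib.request.urlopen(url).read().decode('utf-8')   # use the charset declared in the page's <meta> tag (gbk in the answer above)
print(html)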

The urllib.request.urlopen function is documented as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)


Open the URL url, which can be either a string or a Request object.


data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.


data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.


The urllib.request module uses HTTP/1.1 and includes a Connection:close header
in its HTTP requests.


The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and ftp connections.


If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.


The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().


The cadefault parameter is ignored.


For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.


For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as


geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.


Raises URLError on errors.


Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).


In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.


The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.



Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

❼ A question about parsing web page source tags with Python's beautifulsoup4 library; urgently need an answer

Take Baidu as an example:

# -*- coding: utf-8 -*-
import requests
import urlparse
import os
from bs4 import BeautifulSoup

def process(url):
    headers = {'content-type': 'application/json',
               'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    pageSourse = requests.get(url, headers=headers).text
    page_soup = BeautifulSoup(pageSourse)
    a_all = page_soup.findAll("a")
    link_urls = [i.get('href') for i in a_all]  # some hrefs are javascript event handlers; filter them yourself
    img_all = page_soup.findAll("img")
    img_urls = [i.get("src") for i in img_all]
    print link_urls, img_urls
    return (link_urls, img_urls)

process("https://www..com")

The output is:

[u'/',u'javascript:;',u'javascript:;',u'javascript:;',u'/',u'javascript:;',u'https://passport..com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww..com%2F',u'http://www.nuomi.com/?cid=002540',u'http://news..com',u'http://www.hao123.com',u'http://map..com',u'http://v..com',u'http://tieba..com',u'https://passport..com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww..com%2F',u'http://www..com/gaoji/preferences.html',u'http://www..com/more/',u'http://news..com/ns?cl=2&rn=20&tn=news&word=',u'http://tieba..com/f?kw=&fr=wwwt',u'http://..com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt',u'http://music..com/search?fr=ps&ie=utf-8&key=',u'http://image..com/search/index?tn=image&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=',u'http://v..com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=',u'http://map..com/m?word=&fr=ps01000',u'http://wenku..com/search?word=&lm=0&od=0&ie=utf-8',u'//www..com/more/',u'/',u'//www..com/cache/sethelp/help.html',u'http://home..com',u'http://ir..com',u'http://www..com/ty/',u'http://jianyi..com/'][u'//www..com/img/bd_logo1.png',u'//www..com/img/_jgylogo3.gif']

Feel free to point out any problems; please accept the answer if it helps.
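
The answer's code is Python 2. A rough Python 3 equivalent, assuming bs4's html.parser backend and a placeholder URL:

import requests
from bs4 import BeautifulSoup

def process(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    page_source = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page_source, 'html.parser')
    link_urls = [a.get('href') for a in soup.find_all('a')]      # filter out javascript: entries as needed
    img_urls = [img.get('src') for img in soup.find_all('img')]
    print(link_urls, img_urls)
    return link_urls, img_urls

process('https://example.com')   # placeholder URL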
