
Getting web page source code with Python

Published: 2023-06-03 16:37:39

❶ In Python, how do I get the HTML source of the page that Selenium's webdriver has opened?

  1. You can use the browser's built-in F12 developer tools.

  2. Or right-click the page and choose "Inspect element" to see the current HTML source.
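Beyond the browser tricks above, if the page was opened through Selenium itself, the rendered HTML can be read straight from the driver via its page_source attribute. A minimal sketch (the URL is only illustrative):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/")   # illustrative URL
html = driver.page_source            # HTML after the browser has rendered the page
print(html)
driver.quit()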

❷ In Python, the page source fetched with requests differs from the source shown by right-click "View page source". Why? The code is below; I can't see what is wrong.

After requests fetches url = 'https://www..com/s?wd=周杰伦', print(res.text) prints only the response body returned by that single request.

The page source you see by right-clicking, however (as in your screenshot), is the final document after the browser has also issued the page's other requests for JavaScript, images, CSS and other static resources, which is why the two differ.
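A minimal sketch of the behaviour described above (the domain is elided here exactly as in the original post):

import requests

url = 'https://www..com/s?wd=周杰伦'   # domain elided as in the original post
res = requests.get(url)
print(res.text)   # only the initial HTML document, before any JavaScript has run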


❸ How can a Python crawler get the source of a dynamically generated web page?

A month ago my internship supervisor gave me the task of using a web crawler to collect the rainfall data published by the Shenzhen Meteorological Bureau; the page looked like this:

I figured crawlers weren't that hard; back when zjb and I scraped the "meizi" picture board on jandan.net, how smug I was. In the month after taking the task I was buried in exams and homework, my supervisor didn't push, and I didn't hurry either.

But if my supervisor was willing to wait a whole month for me to write it, the thing had to be fairly hard... and opening the page today confirmed it. The site is built on Ajax and loads its data dynamically, so the data cannot be obtained simply by downloading and parsing the page source.

Inspired by an example someone wrote for scraping Taobao MM pages, the usual approach in this situation is to drive a browser yourself; PhantomJS and CasperJS are both good choices.

My supervisor's requirement was hourly rainfall for every station in every district of Shenzhen over the past year. That has to go through the history query shown in the figure above, i.e. querying by a time value that is stored in an input tag of type hidden. In principle you can change it to type text with a JavaScript statement and then use send_keys and similar calls. However, that failed for me: the time could be set, but the result was as shown below.

So I settled for grabbing only the real-time data, using Python's Selenium to drive a browser and simulate clicks and other user actions so that the data gets generated and can be collected. A big advantage of Selenium is that it gives you the page source after rendering, i.e. after the actions have been performed; plain URL-based parsing only gets the data served initially and cannot interact with the page. Once Selenium has the rendered source, its lookup tools (in my view find_element_by_xpath("xxx") is the most convenient) let you locate an element and fire click, input and other events on it, which sends the corresponding requests to the server and produces the data you need.


#coding=utf-8
from testString import *
from selenium import webdriver
import string
import os
from selenium.webdriver.common.keys import Keys
import time
import sys

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

district_navs = ['nav2', 'nav1', 'nav3', 'nav4', 'nav5', 'nav6', 'nav7', 'nav8', 'nav9', 'nav10']
district_names = ['福田区', '罗湖区', '南山区', '盐田区', '宝安区', '龙岗区', '光明新区', '坪山新区', '龙华新区', '大鹏新区']

flag = 1
while flag > 0:
    driver = webdriver.Chrome()
    driver.get("hianCe/")  # truncated URL as given in the original post
    # select the rainfall tab
    driver.find_element_by_xpath("//span[@id='fenqu_H24R']").click()
    filename = time.strftime("%Y%m%d%H%M", time.localtime(time.time())) + '.txt'
    # create the output file
    output_file = open(filename, 'w')
    # iterate over the districts
    for i in range(len(district_navs)):
        driver.find_element_by_xpath("//div[@id='" + district_navs[i] + "']").click()
        # print driver.page_source
        timeElem = driver.find_element_by_id("time_shikuang")
        # write the timestamp and the district name
        output_file.write(timeElem.text + ',')
        output_file.write(district_names[i] + ',')
        elems = driver.find_elements_by_xpath("//span[@onmouseover='javscript:changeTextOver(this)']")
        # write each station's data: station name, one-hour rainfall, cumulative rainfall for the day
        for elem in elems:
            output_file.write(AMonitorRecord(elem.get_attribute("title")) + ',')
        output_file.write(' ')
    output_file.close()
    driver.close()
    time.sleep(3600)

The testString module referenced in the script only reformats the output and extracts the useful values.

# Encoding=utf-8
def OnlyCharNum(s, oth=''):
    # strip everything except digits and decimal separators from s
    s2 = s.lower()
    fomart = '0123456789.,'  # characters to keep (assumed; the original listing showed only ',.')
    for c in s2:
        if not c in fomart:
            s = s.replace(c, '')
    return s

def AMonitorRecord(str):
    # turn "station: value" into "station,value"
    str = str.split(":")
    return str[0] + "," + OnlyCharNum(str[1])


The script captures data once an hour; the results look like this:

❹ How do I crawl a web page's source code with Python?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import urllib3

if __name__ == '__main__':
    http = urllib3.PoolManager()
    r = http.request('GET', 'IP')   # 'IP' stands for the target URL, which was elided in the original post
    print(r.data.decode("gbk"))

This fetches the page correctly. urllib3 needs to be installed; the Python version used was 3.4.3.
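The example above hard-codes gbk decoding. As a hedged variant, the charset declared in the response's Content-Type header can be used instead (example.com is a placeholder URL):

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'https://example.com/')   # placeholder URL

# prefer the charset declared in the Content-Type header, falling back to utf-8
content_type = r.headers.get('Content-Type', '')
charset = 'utf-8'
if 'charset=' in content_type:
    charset = content_type.split('charset=')[-1].split(';')[0].strip()
print(r.data.decode(charset, errors='replace'))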

❺ How can I use Python or R to scrape source code that is hidden on a web page?

Hidden source code? I'm not sure exactly what you mean. I can read it two ways: (1) content that is not displayed on the page but is present when you view the source, or (2) asynchronously loaded content that appears in neither the page nor the source. The first is easy to solve, so presumably you mean the second, for which there are three approaches:

  1. Simulate a browser and fetch the content dynamically, for example with the heavyweight tool Selenium.

    With this approach, anything you can see in the browser can be scraped, including mouse-over effects and asynchronously loaded content, because it behaves exactly like a real browser. It is also the least efficient method, so it is not recommended unless nothing else works.

  2. Execute the JavaScript yourself.

    Run the asynchronously loading JavaScript inside Python to obtain content such as mouse-over effects or "scroll to load more" data. Modern sites contain a huge amount of JavaScript, however, so locating the exact code you need to run is difficult and time-consuming, and Python's JavaScript support is not great either, so this is also not recommended.

  3. Find the asynchronously loaded JSON file. This is the most common, most convenient and, in my experience, best method; it is what I normally use for dynamically loaded sites and it solves 99% of my problems. Open the browser's developer tools, switch to the Network tab, reload the page, and look through the request list for the asynchronously loaded JSON files. Taking JD.com as an example (see the screenshots), the first one found is the JSON file with the stock information and the second is the JSON file with the comments; a sketch of replaying such a request directly follows below.

For more detail, search on Google or Baidu.
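As a hedged illustration of option 3: once the JSON request has been located in the Network panel, it can usually be replayed directly with requests. The endpoint and parameters below are hypothetical placeholders, not JD.com's real API:

import requests

# hypothetical endpoint found in the browser's Network panel; the real URL,
# parameters and response fields depend on the site being scraped
api_url = 'https://example.com/get_comments.json'
params = {'productId': '12345', 'page': 1}

resp = requests.get(api_url, params=params, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()   # the asynchronously loaded JSON, parsed directly
print(data)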

❻ Looking for Python code that fetches a web page

In Python 3.x, use the urllib.request module to fetch a page: urllib.request.urlopen opens the URL and returns a stream, read() pulls out the raw bytes, and decode() turns those bytes into text using the page's encoding (you can find the encoding in the page source, e.g. <meta http-equiv="content-type" content="text/html;charset=gbk" /> means gbk, as in the example below). The result is the page's source code.

The following example fetches the source of this very page:

import urllib.request

# the URL was left blank in the original post; decode with the page's declared encoding
html = urllib.request.urlopen('').read().decode('gbk')
print(html)
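As a hedged extension of the idea above, the charset can also be picked out of the page's <meta> tag automatically rather than hard-coded (example.com is a placeholder URL):

import re
import urllib.request

raw = urllib.request.urlopen('https://example.com/').read()   # placeholder URL

# look for charset=... near the top of the document, e.g. in the <meta> tag
m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:2048])
encoding = m.group(1).decode('ascii') if m else 'utf-8'
print(raw.decode(encoding, errors='replace'))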

The documentation of the urllib.request.urlopen function follows:

urllib.request.urlopen(url,
data=None, [timeout, ]*, cafile=None, capath=None,
cadefault=False, context=None)


Open the URL url, which can be either a string or a Request object.


data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.


data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.


The urllib.request module uses HTTP/1.1 and includes a Connection: close header
in its HTTP requests.


The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and ftp connections.


If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.


The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().


The cadefault parameter is ignored.


For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.


For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as


geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.


Raises URLError on errors.


Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).


In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.


The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.



Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

❼ A question about parsing tags in web page source code with the beautifulsoup4 library in Python; urgently need an answer

Using Baidu's homepage as an example

# -*- coding: utf-8 -*-
import requests
import urlparse
import os
from bs4 import BeautifulSoup

def process(url):
    headers = {'content-type': 'application/json',
               'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
    pageSourse = requests.get(url, headers=headers).text
    page_soup = BeautifulSoup(pageSourse)
    a_all = page_soup.findAll("a")
    link_urls = [i.get('href') for i in a_all]  # some hrefs are javascript event triggers; write your own filter
    img_all = page_soup.findAll("img")
    img_urls = [i.get("src") for i in img_all]
    print link_urls, img_urls
    return (link_urls, img_urls)

process("https://www..com")

The result is as follows:

[u'/',u'javascript:;',u'javascript:;',u'javascript:;',u'/',u'javascript:;',u'https://passport..com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww..com%2F',u'http://www.nuomi.com/?cid=002540',u'http://news..com',u'http://www.hao123.com',u'http://map..com',u'http://v..com',u'http://tieba..com',u'https://passport..com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww..com%2F',u'http://www..com/gaoji/preferences.html',u'http://www..com/more/',u'http://news..com/ns?cl=2&rn=20&tn=news&word=',u'http://tieba..com/f?kw=&fr=wwwt',u'http://..com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt',u'http://music..com/search?fr=ps&ie=utf-8&key=',u'http://image..com/search/index?tn=image&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=',u'http://v..com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=',u'http://map..com/m?word=&fr=ps01000',u'http://wenku..com/search?word=&lm=0&od=0&ie=utf-8',u'//www..com/more/',u'/',u'//www..com/cache/sethelp/help.html',u'http://home..com',u'http://ir..com',u'http://www..com/ty/',u'http://jianyi..com/'][u'//www..com/img/bd_logo1.png',u'//www..com/img/_jgylogo3.gif']
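One small caveat: recent BeautifulSoup versions warn when no parser is named, so it is safer to pass one explicitly, for example:

page_soup = BeautifulSoup(pageSourse, 'html.parser')   # explicit parser avoids the warning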

Feel free to point out any problems; if you are satisfied, please accept the answer.
