How to write an XPath that filters out an element


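I'm new to Python and this is a fairly basic question, so please go easy on me.
I need to scrape some data and have an XPath question. In the HTML below,
if a <span class="media-caption__text"></span> tag is present, I want its inner text; if it is absent, I want the inner text of <figcaption></figcaption> instead. In both cases the content of <span class="off-screen"></span> must be filtered out.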

The HTML looks like this:

<figcaption class="media-caption">

<span class="off-screen">Image caption</span>

<span class="media-caption__text"> &#32445;&#32422;&#24066;&#26159;&#32654;&#22269;&#30123;&#24773;&#30340;&#8220;&#38663;&#20013;&#8221;&#12290; </span>

</figcaption>

or

<figcaption class="media-with-caption__caption">

<span class="off-screen"></span>

&#22833;&#19994;&#20013;&#30340;&#32654;&#22269;&#38738;&#24180;&#65306;&#27882;&#27700;&#12289;&#24656;&#24807;&#19982;&#19981;&#23433;

</figcaption>


Answer:

Why not handle this with code logic instead?
Doing it in XPath alone feels pretty ugly:

//figcaption/span[@class="media-caption__text"]/text()[normalize-space()] | //figcaption[not(span[@class="media-caption__text"])]/text()[normalize-space()]
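
The "code logic" alternative suggested above could look roughly like the sketch below. It is only an illustration: the helper names `caption_text` and `extract_captions` are made up here, the sample markup uses short English placeholder captions instead of the original text, and it assumes `lxml` is installed. Each `<figcaption>` is handled on its own, so a page that mixes both layouts still works.

```python
from lxml import etree

# Sample markup modelled on the question; the caption strings are placeholders.
SAMPLE = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> New York City </span>
</figcaption>
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
Unemployed youth
</figcaption>
'''


def caption_text(figcaption):
    """Return the caption of one <figcaption>, ignoring span.off-screen."""
    # Prefer the dedicated caption span when it exists.
    spans = figcaption.xpath('./span[@class="media-caption__text"]')
    if spans:
        return ''.join(spans[0].itertext()).strip()
    # Otherwise take only the text nodes that are direct children of
    # <figcaption>; text inside the off-screen span is not a direct child.
    return ''.join(figcaption.xpath('./text()')).strip()


def extract_captions(html_source):
    """Parse the HTML and return one caption string per <figcaption>."""
    tree = etree.HTML(html_source)
    return [caption_text(fc) for fc in tree.xpath('//figcaption')]


captions = extract_captions(SAMPLE)
print(captions)
```

The fallback relies on `./text()` selecting only direct child text nodes, which is why the off-screen span never needs to be excluded explicitly.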


Answer:

from lxml import etree

text = '''

<figcaption class="media-caption">

<span class="off-screen">Image caption</span>

<span class="media-caption__text"> &#32445;&#32422;&#24066;&#26159;&#32654;&#22269;&#30123;&#24773;&#30340;&#8220;&#38663;&#20013;&#8221;&#12290; </span>

</figcaption>

<figcaption class="media-with-caption__caption">

<span class="off-screen"></span>

&#22833;&#19994;&#20013;&#30340;&#32654;&#22269;&#38738;&#24180;&#65306;&#27882;&#27700;&#12289;&#24656;&#24807;&#19982;&#19981;&#23433;

</figcaption>

'''

html = etree.HTML(text)

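# Keep non-blank text nodes, skipping any whose parent is <span class="off-screen">
result = html.xpath('//figcaption//text()[normalize-space() and not(parent::span[@class="off-screen"])]')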

print(result)
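
Note that `text()` returns the raw strings, so the leading and trailing whitespace inside each tag comes along with them. A minimal sketch of cleaning that up is shown below. It assumes the off-screen span is excluded with a `parent::` test, and the sample caption is a placeholder, not the original text.

```python
from lxml import etree

# Placeholder markup in the same shape as the question's first example.
SAMPLE = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text">  Caption   inside span </span>
</figcaption>
'''

tree = etree.HTML(SAMPLE)

# Non-blank text nodes under <figcaption>, minus those inside span.off-screen.
raw = tree.xpath(
    '//figcaption//text()'
    '[normalize-space() and not(parent::span[@class="off-screen"])]'
)

# Collapse runs of whitespace and trim both ends of every string.
cleaned = [' '.join(s.split()) for s in raw]
print(cleaned)
```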

Source link: utcz.com/p/937808.html
