Is it possible to get plain text from raw HTML data with Scrapy?

For example:

    scrapy shell http://scrapy.org/

    content = hxs.select('//*[@id="content"]').extract()[0]
    print content

Then I get the following raw HTML:

    <div id="content">
    <h2>Welcome to Scrapy</h2>
    <h3>What is Scrapy?</h3>
    <p>Scrapy is a fast high-level screen scraping and web crawling
    framework, used to crawl websites and extract structured data from their
    pages. It can be used for a wide range of purposes, from data mining to
    monitoring and automated testing.</p>
    <h3>Features</h3>
    <dl>
    <dt>Simple</dt>
    <dt>
    </dt>
    <dd>Scrapy was designed with simplicity in mind, by providing the features
    you need without getting in your way
    </dd>
    <dt>Productive</dt>
    <dd>Just write the rules to extract the data from web pages and let Scrapy
    crawl the entire web site for you
    </dd>
    <dt>Fast</dt>
    <dd>Scrapy is used in production crawlers to completely scrape more than
    500 retailer sites daily, all in one server
    </dd>
    <dt>Extensible</dt>
    <dd>Scrapy was designed with extensibility in mind and so it provides
    several mechanisms to plug new code without having to touch the framework
    core
    </dd>
    <dt>Portable, open-source, 100% Python</dt>
    <dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>
    <dt>Batteries included</dt>
    <dd>Scrapy comes with lots of functionality built in. Check <a
    href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
    section</a> of the documentation for a list of them.
    </dd>
    <dt>Well-documented &amp; well-tested</dt>
    <dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
    with <a href="http://static.scrapy.org/coverage-report/">very good code
    coverage</a></dd>
    <dt><a href="/community">Healthy community</a></dt>
    <dd>
    1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
    700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
    850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
    200 messages per month on mailing list (<a
    href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
    40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
    </dd>
    <dt><a href="/support">Commercial support</a></dt>
    <dd>A few companies provide Scrapy consulting and support</dd>
    <p>Still not sure if Scrapy is what you're looking for?. Check out <a
    href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
    glance</a>.
    </p>
    <h3>Companies using Scrapy</h3>
    <p>Scrapy is being used in large production environments, to crawl
    thousands of sites daily. Here is a list of <a href="/companies/">Companies
    using Scrapy</a>.</p>
    <h3>Where to start?</h3>
    <p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
    then <a href="/download/">download Scrapy</a> and follow the <a
    href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.
    </p></dl>
    </div>

But I want the plain text directly from Scrapy.

I don't want to use any XPath selectors to extract the p, h2, h3, ... tags one by one, because I'm crawling a website whose main content is embedded in table/tbody elements; recursively working out the XPath could be a tedious task.

Can this be done with a built-in function in Scrapy, or do I need an external tool for the conversion? I've read through all of Scrapy's documentation but haven't found anything.

Answer:

Scrapy does not have such functionality built in. html2text is what you're looking for.

Here is a sample spider that scrapes Wikipedia's Python page, gets the first paragraph with XPath, and then converts the HTML to plain text using html2text:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    import html2text

    class WikiSpider(BaseSpider):
        name = "wiki_spider"
        allowed_domains = ["www.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            print(converter.handle(sample))  # Python 3 print syntax

This prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
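If an external dependency is undesirable after all, a crude plain-text pass can be written with nothing but the standard library's html.parser module. This loses html2text's Markdown formatting and list handling, and TextExtractor below is a hypothetical helper, not part of Scrapy:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)


print(html_to_text('<div id="content"><h2>Welcome to Scrapy</h2>'
                   '<p>A fast framework.</p></div>'))
# -> Welcome to Scrapy A fast framework.
```

Joining on a single space flattens all block structure; html2text remains the better choice when readable paragraph breaks matter.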

Source: utcz.com/qa/426104.html
