Using htmlq to Extract Content and Data from HTML Files

Z时代
2024-01-10
Category: General

linux

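When we need to pull data out of a JSON document, we reach for jq, which makes extracting JSON data quick. You might wonder whether a similar command exists for quickly searching, slicing, filtering, and extracting data from HTML pages. Common tools such as sed, awk, and grep may come to mind, but they work on lines of text rather than HTML elements. The htmlq command is built for this job.

htmlq is like jq, but for HTML. It uses CSS selectors to extract the parts of an HTML file we need. A CSS selector identifies the HTML elements to pull data from. For example, to extract the URLs of images or links, pass htmlq a CSS selector for those elements together with the name of the attribute to print (src for img, href for a).

In this tutorial we show how to use htmlq to search, slice, filter, and extract content or data from HTML files. This covers installing htmlq (shown here on Linux), making it available through the PATH environment variable, slicing out an HTML fragment with a CSS ID selector, extracting attribute values, extracting text with a class selector, removing nodes, pretty-printing output, and syntax highlighting.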

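Installing htmlq

The simplest way to install htmlq is to download a prebuilt binary from GitHub, which is the approach used here. At the time of writing, the latest version of htmlq is v0.4.0. To install a different version, replace v0.4.0 in the URL below with the release tag you need.

Run the following wget command to download the tar.gz archive and pipe it to tar for extraction: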

wget -q -O -  https://github.com/mgdm/htmlq/releases/download/v0.4.0/htmlq-x86_64-linux.tar.gz | tar xz
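
To switch versions without editing the URL by hand, the release tag can be kept in one place. The following is a minimal sketch, assuming other releases follow the same URL pattern as v0.4.0 and also ship an asset named htmlq-x86_64-linux.tar.gz. The function name htmlq_release_url is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the download URL for a given release tag.
# Assumes every release uses the same asset name as v0.4.0.
htmlq_release_url() {
    version="$1"
    printf 'https://github.com/mgdm/htmlq/releases/download/%s/htmlq-x86_64-linux.tar.gz\n' "$version"
}

# Print the URL for the version used in this tutorial.
htmlq_release_url v0.4.0

# To download and unpack it, pipe wget into tar as before:
# wget -q -O - "$(htmlq_release_url v0.4.0)" | tar xz
```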

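Once the download and extraction finish, the htmlq binary will be in your current working directory. Run it with the following command to verify that it works: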

./htmlq --help

This prints the htmlq help text, similar to the following:

htmlq 0.4.0
Michael Maclean <[email protected]>
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] [--] [selector]...

FLAGS:
    -B, --detect-base          Try to detect the base URL from the <base> tag in the document. If not found,
                               default to the value of --base, if supplied
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute <attribute>         Only return this attribute (if present) from selected elements
    -b, --base <base>                   Use this URL as the base for links
    -f, --filename <FILE>               The input file. Defaults to stdin
    -o, --output <FILE>                 The output file. Defaults to stdout
    -r, --remove-nodes <SELECTOR>...    Remove nodes matching this expression before output. May be specified
                                        multiple times

ARGS:
    <selector>...    The CSS expression to select [default: html]

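At this point htmlq can only be run from the current directory, or by giving its full path. To run it from anywhere as plain htmlq, either add its directory to the PATH environment variable or move the binary into a directory that PATH already includes.

We will use the second approach and move htmlq into /usr/bin/. Run the following mv command: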

sudo mv htmlq /usr/bin
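
The other option mentioned above, extending PATH rather than moving the binary, could look roughly like this. It is a sketch that assumes htmlq was unpacked into $HOME/bin (a hypothetical location). The helper name path_prepend is also made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of PATH, but only once.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumed location of the unpacked htmlq binary.
path_prepend "$HOME/bin"

# Report whether the directory is now part of PATH.
case ":$PATH:" in
    *":$HOME/bin:"*) echo "PATH contains $HOME/bin" ;;
    *) echo "PATH does not contain $HOME/bin" ;;
esac
```

The change only lasts for the current shell session. To make it permanent, the same export can go in a shell startup file such as ~/.profile.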

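htmlq is now installed. The sections below show some example commands.

Extracting a fragment with a CSS ID selector

The following command downloads the HTML of the Rust website's home page with curl and pipes (|) it to htmlq with the CSS selector #get-help: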

curl --silent https://www.rust-lang.org/ | htmlq '#get-help'

The output is similar to the following:

<div class="flex flex-column mw8 w-100 measure-wide-l pv2 pv5-m pv2-ns ph4-m ph4-l" id="get-help">
        <h4>Get help!</h4>
        <ul>
          <li><a href="/learn">Documentation</a></li>
          <li><a href="http://forge.rust-lang.org">Rust Forge (Contributor Documentation)</a></li>
          <li><a href="https://users.rust-lang.org">Ask a Question on the Users Forum</a></li>
        </ul>
        <div class="languages">
            <label class="hidden" for="language-footer">Language</label>
            <select id="language-footer">
                <option title="English (en-US)" value="en-US">English (en-US)</option>
<option title="Español (es)" value="es">Español (es)</option>
<option title="Français (fr)" value="fr">Français (fr)</option>
<option title="Italiano (it)" value="it">Italiano (it)</option>
<option title="日本語 (ja)" value="ja">日本語 (ja)</option>
<option title="Português (pt-BR)" value="pt-BR">Português (pt-BR)</option>
<option title="Русский (ru)" value="ru">Русский (ru)</option>
<option title="Türkçe (tr)" value="tr">Türkçe (tr)</option>
<option title="简体中文 (zh-CN)" value="zh-CN">简体中文 (zh-CN)</option>
<option title="正體中文 (zh-TW)" value="zh-TW">正體中文 (zh-TW)</option>
            </select>
        </div>
      </div>

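The --silent option to curl is optional. Without it, curl prints a progress meter when its output is piped. The meter is written to standard error, so it appears in the terminal alongside the htmlq result but is not part of the HTML passed to htmlq. It looks similar to this: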

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19392  100 19392    0     0   4698      0  0:00:04  0:00:04 --:--:--  4698

....
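
A pipe carries only standard output, which is why a progress meter can show up on screen without corrupting the extracted HTML. Here is a small sketch that needs no network access, using a made-up function fake_download as a stand-in for curl:

```shell
#!/bin/sh
# Stand-in for curl: payload goes to stdout, progress text goes to stderr.
fake_download() {
    echo '<p id="x">payload</p>'
    echo '100 19392 progress meter' >&2
}

# Only stdout travels through the pipe, so the variable holds just the payload.
# 2>/dev/null hides the progress text, much like curl --silent does.
captured=$(fake_download 2>/dev/null | cat)
printf '%s\n' "$captured"
```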

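Extracting href attribute values

To extract attribute values from elements in an HTML page, pass htmlq the --attribute option followed by the attribute name, then a selector for the elements. For example, to extract the href attribute of every a element on the page, run: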

curl --silent https://www.rust-lang.org/ | htmlq --attribute href a

This prints the href attribute of every a element on the page, that is, every link on the page. The output is similar to the following:

/
/tools/install
/learn
https://play.rust-lang.org/
/tools
/governance
/community
https://blog.rust-lang.org/
/learn/get-started

....
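
The same extraction works on a file on disk through the --filename option listed in the help output. The following is a minimal sketch, assuming a throwaway file named sample.html (a hypothetical name with made-up contents). It skips the htmlq calls if the binary is not on PATH.

```shell
#!/bin/sh
# Create a small sample document (file name and contents are made up).
cat > sample.html <<'EOF'
<html><body>
  <a href="/learn">Learn</a>
  <a href="https://play.rust-lang.org/">Playground</a>
  <img src="/logo.png" alt="logo">
</body></html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Print the href attribute of every a element.
    htmlq --filename sample.html --attribute href a
    # Any attribute works the same way, for example src on img elements.
    htmlq --filename sample.html --attribute src img
else
    echo "htmlq is not installed; skipping" >&2
fi
```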

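Extracting text with a CSS class selector

To extract the text of an element and its descendants, pass htmlq the --text option along with a CSS selector that locates the element. For example, to get all the text inside the element matched by the class selector .main, run: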

curl --silent https://nixos.org/nixos/about.html | htmlq  --text .main

The output contains no HTML tags, only text:

          About NixOS
NixOS is a GNU/Linux distribution that aims to
improve the state of the art in system configuration management.  In
existing distributions, actions such as upgrades are dangerous:
upgrading a package can cause other packages to break, upgrading an
entire system is much less reliable than reinstalling from scratch,
you can’t safely test what the results of a configuration change will
be, you cannot easily undo changes to the system, and so on.  We want
to change that.  NixOS has many innovative features:[...]
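
Text output often carries stray indentation and blank lines from the page markup. The help output lists an --ignore-whitespace flag that drops text nodes made up entirely of whitespace. Below is a hedged sketch on a local file, where about.html and its contents are made up for illustration:

```shell
#!/bin/sh
# Made-up sample document with a .main container and an unrelated footer.
cat > about.html <<'EOF'
<html><body>
  <div class="main">
    <h1>About</h1>
    <p>First paragraph.</p>
  </div>
  <div class="footer">Footer text</div>
</body></html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Text of .main and its descendants, skipping whitespace-only text nodes.
    htmlq --filename about.html --text --ignore-whitespace .main
else
    echo "htmlq is not installed; skipping" >&2
fi
```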

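Removing nodes before output

To strip unwanted elements from the extracted data or HTML fragment, use the --remove-nodes option. Place it after the main CSS selector and follow it with a CSS selector for the nodes to delete. For example, to remove all svg elements from the element matched by the class selector .whynix, run:

curl --silent https://nixos.org/ | htmlq '.whynix' --remove-nodes svg

The output no longer contains svg elements: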

<ul class="whynix">
      <li>
        <h2>Reproducible</h2>
        <p>
          Nix builds packages in isolation from each other. This ensures that they
          are reproducible and don't have undeclared dependencies, so <span>if a
            package works on one machine, it will also work on another</span>.
        </p>
      </li>
      <li>
        <h2>Declarative</h2>
        <p>
          Nix makes it <span>trivial to share development and build
            environments</span> for your projects, regardless of what programming
          languages and tools you’re using.
        </p>
      </li>
      <li>
        <h2>Reliable</h2>
        <p>
          Nix ensures that installing or upgrading one package <span>cannot
            break other packages</span>. It allows you to <span>roll back to
            previous versions</span>, and ensures that no package is in an
          inconsistent state during an upgrade.
        </p>
      </li>    </ul>
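
To try node removal without depending on a live site, the same command shape can be run against a local file. This is a sketch only: cards.html and its contents are made up, and the selector is placed before --remove-nodes, following the order described above.

```shell
#!/bin/sh
# Made-up sample list where each item contains an svg icon.
cat > cards.html <<'EOF'
<ul class="whynix">
  <li><svg><circle r="4"></circle></svg><h2>Reproducible</h2></li>
  <li><svg><circle r="4"></circle></svg><h2>Declarative</h2></li>
</ul>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Select the list, then drop every svg node before printing.
    htmlq --filename cards.html '.whynix' --remove-nodes svg
else
    echo "htmlq is not installed; skipping" >&2
fi
```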

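Pretty-printing HTML output

Besides extracting data, elements, attributes, and fragments from HTML pages, htmlq can also pretty-print the HTML it outputs. For example, the following command pretty-prints the fragment matched by the selector #posts: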

curl --silent https://mgdm.net | htmlq --pretty '#posts'

<section id="posts">
  <h2>I write about...
  </h2>
  <ul class="post-list">
    <li>
      <time datetime="2019-04-29 00:%i:1556496000" pubdate="">
        29/04/2019</time><a href="/weblog/nettop/">
        <h3>Debugging network connections on macOS with nettop
        </h3></a>
      <p>Using nettop to find out what network connections a program is trying to make.
      </p>
    </li>[...]

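HTML syntax highlighting

bat is a command-line tool with syntax highlighting, installed separately from htmlq. It can highlight the syntax of a large number of languages in the terminal, which makes data and markup easier to read. Because htmlq writes its result to standard output, we can pipe that result to bat to highlight the tags and attributes of the HTML fragment.

For example, to extract the entire body element with htmlq, including HTML tags and attributes rather than just text, and highlight it with bat, run: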

curl --silent example.com | htmlq 'body' | bat --language html

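Conclusion

You should now be familiar with using htmlq to extract, search, slice, and strip data and content from HTML pages. For more help, run htmlq --help. If you have any questions, feel free to leave them in the comments.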

Source: utcz.com/z/507739.html