Pattern Modifiers in Regular Expressions (Part 3)

Z时代
2024-01-10
Category: General

U (PCRE_UNGREEDY)

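This modifier inverts the "greediness" of quantifiers: they become non-greedy by default, and become greedy only when followed by ?. This is not compatible with Perl. The same behavior can be switched on inside the pattern with the inline option (?U). Without U, an individual quantifier can be marked non-greedy by a trailing question mark (for example .*?). In other words, once U is set, a quantifier that would normally be greedy becomes non-greedy, and a quantifier that would normally be non-greedy becomes greedy.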


<?php
$str = "<p>This is not an example</p></p>";
// Greedy mode (the default)
$pattern_greedy = '/^<p>.*<\/p>/';
// With `U`, the same pattern becomes non-greedy
$pattern = '/^<p>.*<\/p>/U';
$res = preg_match($pattern_greedy,$str,$matches_greedy);
$res = preg_match($pattern,$str,$matches);
print_r($matches_greedy);
print_r($matches);

Without U, the pattern above is greedy, so it matches all the way to the last </p> in the string. With U, matching stops at the first </p> in the string.


// Without `U`
Array
(
    [0] => <p>This is not an example</p></p>
)
// With `U`
Array
(
    [0] => <p>This is not an example</p>
)

Now let's look at another example, in which the pattern is non-greedy without U and becomes greedy once U is added.


<?php
$str = "<p>This is not an example</p></p>";
// Non-greedy mode (explicit `?` after the quantifier)
$pattern = '/^<p>.*?<\/p>/';
// With `U`, the same pattern becomes greedy
$pattern_greedy = '/^<p>.*?<\/p>/U';
$res = preg_match($pattern,$str,$matches);
$res = preg_match($pattern_greedy,$str,$matches_greedy);
print_r($matches);
print_r($matches_greedy);

The results are as follows:


// Without `U`
Array
(
    [0] => <p>This is not an example</p>
)
// With `U`
Array
(
    [0] => <p>This is not an example</p></p>
)
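
The description of U also mentions the inline option (?U), which the examples above do not exercise. As a minimal sketch (the variable names are illustrative and not from the original examples), the following compares the trailing U modifier, the inline (?U) option, and an explicit non-greedy quantifier. All three should stop at the first </p>.

```php
<?php
$str = "<p>This is not an example</p></p>";

// Trailing modifier: every quantifier in the pattern becomes non-greedy.
preg_match('/^<p>.*<\/p>/U', $str, $by_modifier);

// Inline option: (?U) at the start applies to the rest of the pattern.
preg_match('/(?U)^<p>.*<\/p>/', $str, $by_inline);

// No U at all: only this one quantifier is marked non-greedy with "?".
preg_match('/^<p>.*?<\/p>/', $str, $by_quantifier);

print_r([$by_modifier[0], $by_inline[0], $by_quantifier[0]]);
```

Because U flips the meaning of every quantifier in the pattern, marking individual quantifiers with ? is usually easier to read when only one of them needs to be non-greedy.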

X (PCRE_EXTRA)

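This modifier turns on additional PCRE functionality that is incompatible with Perl. When it is set, a backslash in the pattern followed by a letter that has no special meaning, such as \T or \q, causes an error. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as that literal letter. For example: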


<?php
$str = "<p>This is not an example</p></p>";
$pattern = '/^<p>\T.*<\/p>/';
$res = preg_match($pattern,$str,$matches);
print_r($matches);

Without X, the pattern above still matches:


Array
(
    [0] => <p>This is not an example</p></p>
)

With X set, however, an error is produced.


<?php
$str = "<p>This is not an example</p></p>";
$pattern = '/^<p>\T.*<\/p>/X';
$res = preg_match($pattern,$str,$matches);
print_r($matches);

Output:

PHP Warning: preg_match(): Compilation failed: unrecognized character follows \ at offset 5 in

...

This is currently the modifier's only function; it has no other use.
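
Since a failed compilation surfaces as a warning, it can be handy to test a pattern before using it. The sketch below assumes a small helper, pattern_compiles(), which is hypothetical and not part of PHP; it relies only on the fact that preg_match() returns false on failure. What the two calls print for \T may depend on the PHP and PCRE versions in use, so no output is shown here.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether a pattern compiles.
// preg_match() returns false on failure; "@" suppresses the E_WARNING.
function pattern_compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// The same pattern as in the article, without and with the X modifier.
var_dump(pattern_compiles('/^<p>\T.*<\/p>/'));
var_dump(pattern_compiles('/^<p>\T.*<\/p>/X'));
```

If the literal letter is what was intended, writing T without the backslash avoids the question entirely.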

J (PCRE_INFO_JCHANGED)

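The inline option (?J) changes the local PCRE_DUPNAMES option, which allows subpatterns (groups) to share the same name. The Chinese edition of the official manual contains the following note: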

(Translator's note: this can only be set through the inline option; an external /J produces an error.)

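I tested this with a program: an external /J does not raise an error, so J is also a pattern modifier. (The English edition of the PHP manual notes that J is supported as a modifier as of PHP 7.2.0.)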


<?php
$str = "<p>This is not an example</p></p>";
$pattern = '/^<p>(?<k>This)(?<k>.*)<\/p>/';
$res = preg_match($pattern,$str,$matches);
print_r($matches);

Without the J modifier, both groups are named k, so an error is raised:

PHP Warning: preg_match(): Compilation failed: two named subpatterns have the same name at offset 18 in ......

With the J modifier, duplicate group names are allowed:


<?php
$str = "<p>This is not an example</p></p>";
$pattern = '/^<p>(?J)(?<k>This)(?<k>.*)<\/p>/';
// Or: $pattern = '/^<p>(?<k>This)(?<k>.*)<\/p>/J'; both forms work
$res = preg_match($pattern,$str,$matches);
print_r($matches);

It runs normally, with the following result:


Array
(
    [0] => <p>This is not an example</p></p>
    [k] =>  is not an example</p>
    [1] => This
    [2] =>  is not an example</p>
)

So the role of J is to control whether subpatterns (groups) may share a name.
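
In the result above, the name k appears only once even though two groups carry it. A minimal sketch, assuming a PHP version that accepts J as a trailing modifier (as verified earlier in this section), shows that the numeric indices can still be used to tell the two groups apart:

```php
<?php
$str = "<p>This is not an example</p></p>";

// J as a trailing modifier; (?J) inside the pattern is the inline form.
preg_match('/^<p>(?<k>This)(?<k>.*)<\/p>/J', $str, $m);

// Numeric indices stay distinct even though both groups are named k.
echo $m[1], "\n";  // first group
echo $m[2], "\n";  // second group

// The shared name k maps to a single array entry.
var_dump(array_key_exists('k', $m));
```

When both captures are needed, reading them by number is the safer choice, since the named key can hold only one of them.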

u (PCRE_UTF8)

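This modifier turns on additional PCRE functionality that is incompatible with Perl. When it is set, both the pattern and the subject string are treated as UTF-8. Its effects are: an invalid subject string causes nothing to be matched, and an invalid pattern triggers an E_WARNING-level error. As of PHP 5.3.4 (resp. PCRE 7.3, 2007-08-28), five- and six-byte UTF-8 sequences are regarded as invalid; before that they were regarded as valid UTF-8.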
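
Before turning to invalid input, it may help to see what "treated as UTF-8" means for valid input. The following is a small sketch (the variable names are illustrative): without u the dot matches one byte, while with u it matches one UTF-8 character.

```php
<?php
// "é" written as its two UTF-8 bytes, so the example does not depend on file encoding.
$str = "\xc3\xa9";

// Without u, "." matches a single byte, so one "." cannot span the whole string.
$without_u = preg_match('/^.$/', $str);

// With u, "." matches one UTF-8 character (here, both bytes together).
$with_u = preg_match('/^.$/u', $str);

var_dump($without_u, $with_u);
```

The first call returns 0 (no match) and the second returns 1.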

First, let's look at the case of an invalid subject string.


<?php
$str = "\xf8\xa1\xa1\xa1\xa1";
$pattern = "/.*/";
$pattern_u = '/.*/u';
$res = preg_match($pattern,$str,$matches);
$res_u = preg_match($pattern_u,$str,$matches_u);
print_r($matches);
print_r($matches_u);

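$pattern matches because, without u, the dot . simply matches the raw bytes one at a time. $pattern_u has the u modifier set, so an invalid subject string means nothing is matched.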


// Without `u`
Array
(
    [0] => �����  // The encoding is invalid, so the output is displayed as garbage.
)
// With `u`: nothing is matched
Array
(
)
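
An empty $matches array alone does not distinguish "no match" from "the subject was rejected". As a minimal sketch using the same invalid subject, the return value and preg_last_error() can be checked to tell the two apart:

```php
<?php
$str = "\xf8\xa1\xa1\xa1\xa1";  // the same invalid UTF-8 subject as above

$res = preg_match('/.*/u', $str, $matches);

// preg_match() returns false on error, which is different from 0 (no match).
var_dump($res);

// preg_last_error() reports why the last PCRE call failed.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Comparing the result with === false matters here, because a plain truthiness check would treat an error and a non-match the same way.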

If the pattern itself is invalid, setting u does not simply result in nothing being matched; a Warning is raised instead.


<?php
$str = "Hello example!";
$pattern = "/\xf8\xa1\xa1\xa1\xa1/";
$pattern_u = "/\xf8\xa1\xa1\xa1\xa1/u";
$res_u = preg_match($pattern_u,$str,$matches_u);
$res = preg_match($pattern,$str,$matches);
print_r($matches_u);
print_r($matches);

Output:


// With u
Warning: preg_match(): Compilation failed: invalid UTF-8 string at offset 0 in ...
// Without u: nothing is matched
Array
(
)

This article is reposted from: 迹忆客 (https://www.jiyik.com)

