How to filter (or replace) Unicode characters that take more than 3 bytes in UTF-8?

Z时代
2024-01-10
Category: Q&A

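I'm using Python and Django, but I'm running into a problem caused by a MySQL limitation. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters through utf8mb4, and at some point in the future utf8 might support them as well.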

But my server is not ready to upgrade to MySQL 5.5, so I'm limited to UTF-8 characters that take 3 bytes or fewer.
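To make the 3-byte limit concrete, here is a minimal sketch that prints how many bytes a few boundary code points need in UTF-8. It assumes Python 3 (the question itself uses Python 2), and the helper name `utf8_length` is made up for this illustration.

```python
# Boundary code points around the 1-, 2-, 3- and 4-byte UTF-8 ranges.
samples = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]


def utf8_length(code_point):
    """Return how many bytes the given code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


for cp in samples:
    print("U+%06X -> %d byte(s)" % (cp, utf8_length(cp)))
```

Everything up to U+FFFF fits in 3 bytes or fewer; code points from U+10000 upward are the ones MySQL's 3-byte utf8 cannot store.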

My question is:

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter).
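For reference, this is roughly the behavior being alluded to, sketched in Python 3 with an ASCII target codec chosen purely for illustration: every character the codec cannot represent is replaced by `?`, and the result is a byte string rather than text.

```python
# 'replace' substitutes '?' for each character the target codec cannot encode.
text = "caf\u00e9 \U0001F600"
encoded = text.encode("ascii", "replace")

print(encoded)        # a bytes object, no longer a text string
print(type(encoded))
```

The goal in the question is the same kind of substitution, but applied only to characters above U+FFFF and without leaving the text type.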

EDIT: I want behavior similar to encode(), but I don't want to actually encode the string. I still want to have a unicode string after filtering.

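I don't want to escape the characters before storing them in MySQL, because that would mean I'd need to unescape every string I get back from the database, which is very annoying and not feasible.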

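See also:

"Incorrect string value" warning when saving some unicode characters to MySQL (in the Django ticket system)

'????' is not a valid unicode character, but in the unicode character set? (at Stack Overflow)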

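[EDIT] Added tests of the proposed solutions

So far I've got good answers. Thanks, everyone! Now, in order to choose one of them, I did a quick test to find the simplest and fastest one.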

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

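Results:

filter_using_re() made 515 function calls (0.138 CPU seconds in the sub() built-in)

filter_using_python() made 2097923 function calls (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)

I did not test anything using itertools because... well... that solution, although interesting, was quite big and complex.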

Conclusion

The RegEx solution was, by far, the fastest one.

Answer:

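Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes (or fewer) in UTF8. The \uD800-\uDFFF range is for multibyte UTF16. I don't know python, but you should be able to set up a regular expression that matches anything outside those ranges.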

pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
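As a small illustration of why the \uD800-\uDFFF range matters, the sketch below splits a character into its UTF-16 code units. It assumes Python 3, and `utf16_code_units` is a hypothetical helper written for this example, not part of any library.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units used to encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character above U+FFFF becomes a pair of surrogates from D800-DFFF.
units = utf16_code_units("\U0001F600")
print([hex(u) for u in units])

# A character inside the BMP stays a single code unit.
print([hex(u) for u in utf16_code_units("\u20ac")])
```

On Python builds that store strings as UTF-16 (narrow Python 2 builds, for example), a 4-byte character shows up as such a pair, which is likely why the first pattern above looks for a surrogate followed by one more character.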

Edit, adding the Python from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
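For readers on Python 3, a rough port of the same idea might look like the sketch below. The function name `replace_4byte_chars` and its `replacement` parameter are made up for this example. The pattern is a raw string so that the `re` module itself interprets the `\uXXXX` escapes.

```python
import re

# Matches anything outside U+0000-U+D7FF and U+E000-U+FFFF, i.e. every
# code point that needs 4 bytes in UTF-8, plus any stray surrogate.
_NOT_3_BYTE_SAFE = re.compile(r"[^\u0000-\uD7FF\uE000-\uFFFF]")


def replace_4byte_chars(text, replacement="\uFFFD"):
    """Return text with every matching character swapped for replacement."""
    return _NOT_3_BYTE_SAFE.sub(replacement, text)


print(replace_4byte_chars("a\U0001F600b\u20ac"))
print(replace_4byte_chars("a\U0001F600b\u20ac", replacement="?"))
```

The result is still a text string, so it can be handed to Django or MySQL as usual without a separate decoding step.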

Source link: utcz.com/qa/428367.html
