How do I truncate a Java string so that it fits in a given number of bytes once UTF-8 encoded?

How can I truncate a Java String so that I know it will fit in a given number of bytes of storage once it has been encoded as UTF-8?

Answer:

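Here is a simple loop that counts how large the UTF-8 representation will be and truncates the string once the limit would be exceeded: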

    public static String truncateWhenUTF8(String s, int maxBytes) {
        int b = 0; // UTF-8 bytes counted so far
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);

            // ranges from http://en.wikipedia.org/wiki/UTF-8
            int skip = 0; // extra chars consumed by this code point
            int more;     // bytes this code point needs in UTF-8
            if (c <= 0x007f) {
                more = 1;
            } else if (c <= 0x07FF) {
                more = 2;
            } else if (c <= 0xd7ff) {
                more = 3;
            } else if (c <= 0xDFFF) {
                // surrogate area, consume next char as well
                // (assumes the input contains only well-formed pairs)
                more = 4;
                skip = 1;
            } else {
                more = 3;
            }

            if (b + more > maxBytes) {
                // this code point does not fit: cut just before it
                return s.substring(0, i);
            }
            b += more;
            i += skip;
        }
        return s;
    }

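This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence rather than two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string that fits. If you ignore surrogate pairs in the implementation, the truncated string may be shorter than it needs to be.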
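The byte counts used in the loop can be checked against the JDK directly. The following is a minimal sketch, not part of the original answer: the class name Utf8LengthDemo and the helper naiveLength are hypothetical, and naiveLength only illustrates what happens when surrogate pairs are ignored.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Actual encoded size, as reported by the JDK's UTF-8 encoder.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical naive estimate that ignores surrogate pairs and
    // counts every char above U+07FF as 3 bytes.
    static int naiveLength(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            n += (c <= 0x007F) ? 1 : (c <= 0x07FF) ? 2 : 3;
        }
        return n;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two chars
        System.out.println(utf8Length("a"));      // 1
        System.out.println(utf8Length("\u0080")); // 2
        System.out.println(utf8Length("\u0800")); // 3
        System.out.println(utf8Length(clef));     // 4
        System.out.println(naiveLength(clef));    // 6
    }
}
```

With a budget of 4 or 5 bytes, the naive estimate of 6 would reject the pair even though its real encoding fits, which is the "shorter than it needs to be" case described above.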

I haven't tested this code extensively, but here are some preliminary tests:

    // requires: import java.nio.charset.Charset;
    private static void test(String s, int maxBytes, int expectedBytes) {
        String result = truncateWhenUTF8(s, maxBytes);
        byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
        if (utf8.length > maxBytes) {
            System.out.println("BAD: our truncation of " + s + " was too big");
        }
        if (utf8.length != expectedBytes) {
            System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
        }
        System.out.println(s + " truncated to " + result);
    }

    public static void main(String[] args) {
        // 1-byte characters only
        test("abcd", 0, 0);
        test("abcd", 1, 1);
        test("abcd", 2, 2);
        test("abcd", 3, 3);
        test("abcd", 4, 4);
        test("abcd", 5, 4);

        // 2-byte character in the middle
        test("a\u0080b", 0, 0);
        test("a\u0080b", 1, 1);
        test("a\u0080b", 2, 1);
        test("a\u0080b", 3, 3);
        test("a\u0080b", 4, 4);
        test("a\u0080b", 5, 4);

        // 3-byte character in the middle
        test("a\u0800b", 0, 0);
        test("a\u0800b", 1, 1);
        test("a\u0800b", 2, 1);
        test("a\u0800b", 3, 1);
        test("a\u0800b", 4, 4);
        test("a\u0800b", 5, 5);
        test("a\u0800b", 6, 5);

        // surrogate pairs
        test("\uD834\uDD1E", 0, 0);
        test("\uD834\uDD1E", 1, 0);
        test("\uD834\uDD1E", 2, 0);
        test("\uD834\uDD1E", 3, 0);
        test("\uD834\uDD1E", 4, 4);
        test("\uD834\uDD1E", 5, 4);
    }
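One way to cross-check the expectations in the tests above is to let the JDK's own encoder decide where to stop. The sketch below is an alternative technique, not the loop from this answer: the class name EncoderTruncate and the method truncateWithEncoder are hypothetical. It assumes well-formed input and relies on CharsetEncoder.encode() stopping before any character that does not fit entirely in the output buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Encode into a buffer of exactly maxBytes and see how many chars
    // the encoder managed to consume before running out of room.
    static String truncateWithEncoder(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        // The CoderResult is ignored here: on overflow the input
        // position is left at the first char that was not encoded.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        // Lengths are in chars, matching the cut points of the tests above.
        System.out.println(truncateWithEncoder("abcd", 5).length());         // 4
        System.out.println(truncateWithEncoder("a\u0800b", 3).length());     // 1
        System.out.println(truncateWithEncoder("\uD834\uDD1E", 3).length()); // 0
        System.out.println(truncateWithEncoder("\uD834\uDD1E", 4).length()); // 2
    }
}
```

Comparing the two methods over a range of inputs and budgets would give broader coverage than the hand-written cases. Input with unpaired surrogates is outside the scope of this sketch.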

Updated: the code example has been modified and now handles surrogate pairs.

