Decoding byte streams in Python (decode bytes)

I ran into all sorts of bugs while decoding binary byte streams in Python, so this post summarizes and records the problems I hit.

I. Decoding error: 'utf-8' codec can't decode byte

1. Reproducing the bug

The byte stream is known to have been produced with UTF-8 encoding, yet decoding it unexpectedly fails, as shown below:

text = b'\x00\x00\t\x00\x00\x002\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00[\x00\x00\x00n\x00\x00\x00p\x0b\xe2\x01\x00\x00\x00\x00 [9\x0b`\x7f\x00\x00'
text = text.decode()
print("text:", text)

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-115-e6b2c457ec24> in <module>()
      1 text = b'\x00\x00\t\x00\x00\x002\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00[\x00\x00\x00n\x00\x00\x00p\x0b\xe2\x01\x00\x00\x00\x00 [9\x0b`\x7f\x00\x00'
----> 2 text = text.decode()
      3 print("text:", text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 36: invalid continuation byte
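
To see exactly which byte the decoder rejected, the exception object itself can be inspected. The sketch below uses a shortened, made-up byte string (not the original data) that contains the same 0xe2 0x01 pair; describe_decode_error is a hypothetical helper name, while start, end, and reason are standard attributes of UnicodeDecodeError.

```python
def describe_decode_error(raw, encoding="utf-8"):
    """Return None if raw decodes cleanly, else (offset, hex of rejected bytes, reason)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start:exc.end].hex(), exc.reason
    return None


# Shortened sample for illustration: 0xe2 sits at offset 2 and is followed by 0x01
sample = b'p\x0b\xe2\x01\x00'
print(describe_decode_error(sample))
```

In UTF-8, 0xe2 announces a three-byte sequence whose following bytes must lie in the range 0x80 to 0xBF. Here the next byte is 0x01, hence "invalid continuation byte".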

2. bytes.decode() syntax

Syntax of the decode() method (in Python 3 it belongs to bytes, not str):

bytes.decode(encoding='utf-8', errors='strict')

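Parameters:
encoding -- the encoding to use, "utf-8" by default; other text encodings include "gbk", "unicode_escape", "ascii", and so on. ("base64" is a bytes-to-bytes codec in Python 3 and has to go through codecs.decode() instead.)

errors -- how decoding errors are handled. The default is 'strict', which raises a UnicodeDecodeError (a subclass of UnicodeError).
Other values usable for decoding are 'ignore', 'replace', 'backslashreplace', and any name registered via codecs.register_error(). 'xmlcharrefreplace' and 'namereplace' only apply when encoding.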
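
As a rough illustration of the two parameters, the sketch below decodes one made-up byte string with the three most common decoding handlers, and shows that "base64" is reached through codecs.decode() rather than bytes.decode(). The sample bytes and variable names are assumptions chosen for the example.

```python
import codecs

# Made-up sample: ASCII text around one invalid byte (0xe2 without continuation bytes)
raw = b'ab\xe2\x01c'

ignored = raw.decode("utf-8", errors="ignore")            # drops the bad byte silently
replaced = raw.decode("utf-8", errors="replace")          # substitutes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # keeps it as a \xNN escape

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))

# "base64" is not a text encoding, so bytes.decode() refuses it in Python 3
try:
    b'aGVsbG8='.decode("base64")
    base64_via_method = "accepted"
except LookupError:
    base64_via_method = "rejected"

decoded_b64 = codecs.decode(b'aGVsbG8=', "base64")  # bytes in, bytes out
print(base64_via_method, decoded_b64)
```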

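The code below encodes a string containing a non-ASCII character to ASCII with different errors values. Uncomment the lines one at a time to see the difference; the active 'strict' line raises UnicodeEncodeError.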

txt = "My name is Ståle"

# print(txt.encode(encoding="ascii",errors="backslashreplace"))
# print(txt.encode(encoding="ascii",errors="ignore"))
# print(txt.encode(encoding="ascii",errors="namereplace"))
# print(txt.encode(encoding="ascii",errors="replace"))
# print(txt.encode(encoding="ascii",errors="xmlcharrefreplace"))
print(txt.encode(encoding="ascii",errors="strict"))
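
For readers who would rather see all the results at once, here is a small sketch that loops over the handlers and catches the exception raised by 'strict'. The results dictionary and the strict_error variable are names introduced only for this example, and å is written as the escape \u00e5 so the snippet does not depend on the file's encoding.

```python
txt = "My name is St\u00e5le"  # same string as above, with å written as an escape

results = {}
for handler in ("backslashreplace", "ignore", "namereplace", "replace", "xmlcharrefreplace"):
    results[handler] = txt.encode(encoding="ascii", errors=handler)
    print(f"{handler:18} {results[handler]!r}")

# 'strict' does not produce a result; it raises instead
strict_error = None
try:
    txt.encode(encoding="ascii", errors="strict")
except UnicodeEncodeError as exc:
    strict_error = exc
    print(f"{'strict':18} raises {type(exc).__name__}")
```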

3. Fixing the decoding error

With this in mind, the bug above is easy to fix, at least on the surface: just pass a different errors handler.

text = b'\x00\x00\t\x00\x00\x002\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00[\x00\x00\x00n\x00\x00\x00p\x0b\xe2\x01\x00\x00\x00\x00 [9\x0b`\x7f\x00\x00'
text = text.decode("utf8", "ignore")
print("text:", text)
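
The fix is only superficial because errors="ignore" silently discards the rejected byte. A more cautious variant, sketched here under the assumption that losing data quietly is undesirable, tries a strict decode first and falls back to 'replace' while reporting what went wrong. decode_lenient is a hypothetical helper, not part of the standard library.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode raw strictly if possible; otherwise fall back to errors='replace'.

    Returns (text, problem), where problem is None after a clean decode or a
    short description of the first rejected byte.
    """
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        problem = f"byte 0x{raw[exc.start]:02x} at position {exc.start}: {exc.reason}"
        return raw.decode(encoding, errors="replace"), problem


# Made-up sample containing the same 0xe2 0x01 pair as the stream above
text, problem = decode_lenient(b'n\x00p\x0b\xe2\x01')
print(repr(text))
print(problem)
```

With 'replace', each rejected byte leaves a visible U+FFFD marker in the result, so the damage can still be noticed downstream.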

II. Removing padding characters after decoding

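Python offers several ways to strip unwanted characters such as NUL (\x00) padding, and on the surface they do not differ much. I tried a few of them today and am recording the results here.
First, the result before any stripping: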

text = b'2020-06-02 13:49:13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x86\x00\x00\x00\x00\x00\x00\x00\xf0\xa9'
text = text.decode("utf8", "ignore")
print(len(text), text, sep=">>>", end=">>>")

---------------------------------------------------------------------------
47>>>2020-06-02 13:49:13>>>
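
The printed text looks clean because a terminal usually renders NUL characters as nothing, yet the length of 47 shows that the padding is still there. A quick way to make it visible is repr(); the sketch below uses a short made-up sample rather than the data above.

```python
# Made-up sample: a short value followed by NUL padding
text = b'13:49:13\x00\x00\x00'.decode("utf8", "ignore")

print(len(text))           # the NUL characters are counted too
print(repr(text))          # repr escapes them as \x00 so they become visible
print(text.count("\x00"))  # number of NUL characters in the string
```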

1. strip

text = b'2020-06-02 13:49:13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x86\x00\x00\x00\x00\x00\x00\x00\xf0\xa9'
text = text.decode("utf8", "ignore").strip("\x00")
print(len(text), text, sep=">>>", end=">>>")

---------------------------------------------------------------------------
19>>>2020-06-02 13:49:13>>>
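
One thing to keep in mind is that strip("\x00") only trims from the two ends of the string. The sketch below uses a made-up sample with a NUL between two fields to show what is left behind.

```python
# Made-up sample: two fields separated by a NUL, with NUL padding on both ends
text = "\x00date\x00time\x00\x00"

both_ends = text.strip("\x00")    # trims both ends, keeps the interior NUL
right_only = text.rstrip("\x00")  # trims only the trailing padding

print(repr(both_ends))
print(repr(right_only))
```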

2. replace

text = b'2020-06-02 13:49:13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x86\x00\x00\x00\x00\x00\x00\x00\xf0\xa9'
text = text.decode("utf8", "ignore").replace("\x00", "")
print(len(text), text, sep=">>>", end=">>>")

---------------------------------------------------------------------------
19>>>2020-06-02 13:49:13>>>

3. split

text = b'2020-06-02 13:49:13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x86\x00\x00\x00\x00\x00\x00\x00\xf0\xa9'
text = ''.join(text.decode("utf8", "ignore").split("\x00"))
print(len(text), text, sep=">>>", end=">>>")

---------------------------------------------------------------------------
19>>>2020-06-02 13:49:13>>>
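
When the NUL characters act as separators between fields rather than as padding, joining the pieces back together throws that structure away. A hedged alternative, assuming a made-up record with two NUL-separated fields, is to keep the non-empty pieces as a list:

```python
# Made-up sample: two NUL-separated fields followed by NUL padding
text = "2020-06-02\x0013:49:13\x00\x00"

pieces = text.split("\x00")           # the padding shows up as empty strings
fields = [p for p in pieces if p]     # keep only the non-empty fields

print(pieces)
print(fields)
```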

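When the input is well formed, all three approaches appear to work and give the desired result, so I am recording them here for now. Note that strip only removes NULs at the two ends of the string, while replace and split/join remove every NUL in it.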
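
If the data is a NUL-terminated string sitting in a fixed-size buffer (an assumption about how these bytes were produced, suggested by the trailing garbage after the padding), another option is to cut the bytes at the first NUL before decoding. That way errors="ignore" is not needed at all. read_c_string is a hypothetical helper, and the sample is a shortened version of the data above.

```python
def read_c_string(raw, encoding="utf-8"):
    """Decode only the bytes before the first NUL; the rest is treated as padding."""
    head, _sep, _rest = raw.partition(b"\x00")
    return head.decode(encoding)


# Shortened sample: text, NUL padding, then leftover bytes that are not valid UTF-8
raw = b'2020-06-02 13:49:13' + b'\x00' * 4 + b'\x86\x00\x00\xf0\xa9'
text = read_c_string(raw)
print(len(text), text, sep=">>>")
```

Because the decode is strict, a genuinely corrupt byte inside the text part would still raise an error instead of being dropped unnoticed.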
