2-Base64 Encoding

How the Encoding Works

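Base64 encoding represents every 3 bytes (3 x 8 bits) of binary data as 4 printable ASCII characters. To encode, the 24 bits are regrouped into four 6-bit values; if the last group has fewer than 6 bits, zeros are appended on the right. Each 6-bit value is then prefixed with two zero bits to form a full byte, and the decimal value of that byte is used as an index into the encoding table to obtain the output character. The encoded data is therefore roughly 4/3 the length of the source data. Standard Base64 requires the output length to be a multiple of 4 characters; output that falls short is padded with the equals sign "=", so n input bytes always produce 4 x ceil(n / 3) characters. The encoding table is as follows: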

 0 A     1 B     2 C     3 D     4 E     5 F     6 G     7 H    
 8 I     9 J    10 K    11 L    12 M    13 N    14 O    15 P    
16 Q    17 R    18 S    19 T    20 U    21 V    22 W    23 X    
24 Y    25 Z    26 a    27 b    28 c    29 d    30 e    31 f    
32 g    33 h    34 i    35 j    36 k    37 l    38 m    39 n    
40 o    41 p    42 q    43 r    44 s    45 t    46 u    47 v    
48 w    49 x    50 y    51 z    52 0    53 1    54 2    55 3    
56 4    57 5    58 6    59 7    60 8    61 9    62 +    63 /
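
The length rule described above can be checked with a short calculation. The following is a minimal sketch, assuming padded output without line breaks; the class name Base64Length and the method encodedLength are illustrative only, and java.util.Base64 (JDK 8 or later) is used purely to cross-check the formula.

```java
import java.util.Base64;

// Illustrative helper (not part of any library): length of padded Base64 output.
class Base64Length {
    // Every started group of 3 input bytes yields 4 output characters.
    static int encodedLength(int inputLength) {
        return 4 * ((inputLength + 2) / 3);
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 6; n++) {
            byte[] input = new byte[n];
            // The JDK encoder is used only to compare against the formula.
            int actual = Base64.getEncoder().encode(input).length;
            System.out.println(n + " bytes -> " + encodedLength(n)
                    + " chars (JDK encoder: " + actual + ")");
        }
    }
}
```

The integer expression (inputLength + 2) / 3 is the usual way to round up a division by 3 without floating-point arithmetic.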

For example, the ASCII character A is encoded as follows:

Character:        A
ASCII code:       65
Binary:           0100 0001
Regrouped:        010000 01
Low bits padded:  010000 010000
High bits padded: 00010000 00010000
Decimal:          16       16
Characters:       Q        Q
With padding:     Q        Q        =        =
Final result:     QQ==
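
The walk-through above can be reproduced with a few bit operations. This is a minimal sketch that handles exactly one input byte; the class EncodeSingleByte and its method encodeOneByte are hypothetical names chosen for illustration.

```java
// Illustrative walk-through of encoding a single byte (names are hypothetical).
class EncodeSingleByte {
    // The 64-character encoding table, indexed by 6-bit value.
    static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes exactly one byte into two Base64 characters plus two '=' characters.
    static String encodeOneByte(int value) {
        int b = value & 0xff;
        int first = b >>> 2;           // top 6 bits of the byte
        int second = (b << 4) & 0x3f;  // low 2 bits, zero-padded on the right to 6 bits
        return "" + TABLE.charAt(first) + TABLE.charAt(second) + "==";
    }

    public static void main(String[] args) {
        System.out.println(encodeOneByte('A'));  // prints QQ==
    }
}
```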

Implementation

Using Bouncy Castle

The code below uses the open-source library Bouncy Castle, version 1.56, to encode and decode Base64.

import java.io.UnsupportedEncodingException;
import org.bouncycastle.util.encoders.Base64;
public class Base64TestBC {
    public static void main(String[] args)
            throws UnsupportedEncodingException {
        // Encode (the charset is given explicitly instead of relying on the platform default)
        byte[] data = "A".getBytes("UTF-8");
        byte[] encodeData = Base64.encode(data);
        String encodeStr = Base64.toBase64String(data);
        System.out.println(new String(encodeData, "UTF-8"));
        System.out.println(encodeStr);
        // Decode
        byte[] decodeData = Base64.decode(encodeData);
        byte[] decodeData2 = Base64.decode(encodeStr);
        System.out.println(new String(decodeData, "UTF-8"));
        System.out.println(new String(decodeData2, "UTF-8"));
    }
}

The program prints:

QQ==
QQ==
A
A

Using Apache Commons Codec

The code below uses the open-source library Apache Commons Codec, version 1.10, to encode and decode Base64.

import java.io.UnsupportedEncodingException;
import org.apache.commons.codec.binary.Base64;
public class Base64TestCC {
    public static void main(String[] args)
            throws UnsupportedEncodingException {
        // Encode (the charset is given explicitly instead of relying on the platform default)
        byte[] data = "A".getBytes("UTF-8");
        byte[] encodeData = Base64.encodeBase64(data);
        String encodeStr = Base64.encodeBase64String(data);
        System.out.println(new String(encodeData, "UTF-8"));
        System.out.println(encodeStr);
        // Decode
        byte[] decodeData = Base64.decodeBase64(encodeData);
        byte[] decodeData2 = Base64.decodeBase64(encodeStr);
        System.out.println(new String(decodeData, "UTF-8"));
        System.out.println(new String(decodeData2, "UTF-8"));
    }
}

The output is the same as for the Bouncy Castle version: QQ== twice, then A twice.

Source Code Analysis

The Bouncy Castle Implementation

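Bouncy Castle implements Base64 encoding and decoding in much the same way as its Hex encoding and decoding. The source is the org.bouncycastle.util.encoders.Base64Encoder class. For encoding, it first defines an encoding table and a padding character: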

protected final byte[] encodingTable =
{
    (byte)'A', (byte)'B', (byte)'C', (byte)'D', 
    (byte)'E', (byte)'F', (byte)'G', (byte)'H',
    (byte)'I', (byte)'J', (byte)'K', (byte)'L',
    (byte)'M', (byte)'N', (byte)'O', (byte)'P',
    (byte)'Q', (byte)'R', (byte)'S', (byte)'T', 
    (byte)'U', (byte)'V', (byte)'W', (byte)'X',
    (byte)'Y', (byte)'Z', (byte)'a', (byte)'b', 
    (byte)'c', (byte)'d', (byte)'e', (byte)'f', 
    (byte)'g', (byte)'h', (byte)'i', (byte)'j', 
    (byte)'k', (byte)'l', (byte)'m', (byte)'n',
    (byte)'o', (byte)'p', (byte)'q', (byte)'r', 
    (byte)'s', (byte)'t', (byte)'u', (byte)'v',
    (byte)'w', (byte)'x', (byte)'y', (byte)'z',
    (byte)'0', (byte)'1', (byte)'2', (byte)'3', 
    (byte)'4', (byte)'5', (byte)'6', (byte)'7',
    (byte)'8', (byte)'9', (byte)'+', (byte)'/'
}; 
protected byte    padding = (byte)'=';

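The encoding code follows. It first processes the data three bytes at a time, because every 3 input bytes convert exactly into 4 output bytes. It then handles the tail, which has three cases. If the length is a multiple of 3, nothing remains and nothing more is done. If 1 byte remains, two padding characters are appended. If 2 bytes remain, one padding character is appended.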

public int encode(
    byte[]          data,
    int             off,
    int             length,
    OutputStream    out) 
    throws IOException
{
    int modulus = length % 3;
    int dataLength = (length - modulus);
    int a1, a2, a3;
    
    for (int i = off; i < off + dataLength; i += 3)
    {
        a1 = data[i] & 0xff;
        a2 = data[i + 1] & 0xff;
        a3 = data[i + 2] & 0xff;
        out.write(encodingTable[(a1 >>> 2) & 0x3f]);
        out.write(encodingTable[((a1 << 4) | (a2 >>> 4)) & 0x3f]);
        out.write(encodingTable[((a2 << 2) | (a3 >>> 6)) & 0x3f]);
        out.write(encodingTable[a3 & 0x3f]);
    }
    /*
     * process the tail end.
     */
    int    b1, b2, b3;
    int    d1, d2;
    switch (modulus)
    {
    case 0:        /* nothing left to do */
        break;
    case 1:
        d1 = data[off + dataLength] & 0xff;
        b1 = (d1 >>> 2) & 0x3f;
        b2 = (d1 << 4) & 0x3f;
        out.write(encodingTable[b1]);
        out.write(encodingTable[b2]);
        out.write(padding);
        out.write(padding);
        break;
    case 2:
        d1 = data[off + dataLength] & 0xff;
        d2 = data[off + dataLength + 1] & 0xff;
        b1 = (d1 >>> 2) & 0x3f;
        b2 = ((d1 << 4) | (d2 >>> 4)) & 0x3f;
        b3 = (d2 << 2) & 0x3f;
        out.write(encodingTable[b1]);
        out.write(encodingTable[b2]);
        out.write(encodingTable[b3]);
        out.write(padding);
        break;
    }
    return (dataLength / 3) * 4 + ((modulus == 0) ? 0 : 4);
}
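
To see the three tail cases in isolation, the same bit operations can be applied to inputs of different lengths. The sketch below mirrors the logic of the method above but is not the Bouncy Castle code: it writes to a StringBuilder instead of an OutputStream, and the class name TailCases is hypothetical.

```java
// Standalone sketch of the three tail cases (not the Bouncy Castle source).
class TailCases {
    static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int modulus = data.length % 3;            // bytes left over after full groups
        int dataLength = data.length - modulus;   // length covered by full 3-byte groups
        for (int i = 0; i < dataLength; i += 3) {
            int a1 = data[i] & 0xff, a2 = data[i + 1] & 0xff, a3 = data[i + 2] & 0xff;
            out.append(TABLE.charAt(a1 >>> 2));
            out.append(TABLE.charAt(((a1 << 4) | (a2 >>> 4)) & 0x3f));
            out.append(TABLE.charAt(((a2 << 2) | (a3 >>> 6)) & 0x3f));
            out.append(TABLE.charAt(a3 & 0x3f));
        }
        if (modulus == 1) {
            // One byte left: two characters plus two padding characters.
            int d1 = data[dataLength] & 0xff;
            out.append(TABLE.charAt(d1 >>> 2));
            out.append(TABLE.charAt((d1 << 4) & 0x3f));
            out.append("==");
        } else if (modulus == 2) {
            // Two bytes left: three characters plus one padding character.
            int d1 = data[dataLength] & 0xff, d2 = data[dataLength + 1] & 0xff;
            out.append(TABLE.charAt(d1 >>> 2));
            out.append(TABLE.charAt(((d1 << 4) | (d2 >>> 4)) & 0x3f));
            out.append(TABLE.charAt((d2 << 2) & 0x3f));
            out.append('=');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[] {'A'}));            // prints QQ==
        System.out.println(encode(new byte[] {'A', 'B'}));       // prints QUI=
        System.out.println(encode(new byte[] {'A', 'B', 'C'}));  // prints QUJD
    }
}
```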

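Decoding likewise starts by building a decoding table. The table is a 128-element byte array indexed by ASCII code, and the value at each index is the position of that character in the encoding table. For example, the character for value 0 in the encoding table is A, whose ASCII code is 65, so the entry at index 65 of the decoding table is 0. Entries for characters that do not appear in the encoding table are -1 (stored as (byte)0xff). The source that builds the decoding table is: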

protected final byte[] decodingTable = new byte[128];
protected void initialiseDecodingTable()
{
    for (int i = 0; i < decodingTable.length; i++)
    {
        decodingTable[i] = (byte)0xff;
    }
    
    for (int i = 0; i < encodingTable.length; i++)
    {
        decodingTable[encodingTable[i]] = (byte)i;
    }
}

The resulting decoding table looks like this (non-printable characters are shown as blanks):

  -1      -1      -1      -1      -1      -1      -1      -1    
  -1      -1      -1      -1      -1      -1      -1      -1    
  -1      -1      -1      -1      -1      -1      -1      -1    
  -1      -1      -1      -1      -1      -1      -1      -1    
  -1    ! -1    " -1    # -1    $ -1    % -1    & -1    ' -1    
( -1    ) -1    * -1    + 62    , -1    - -1    . -1    / 63    
0 52    1 53    2 54    3 55    4 56    5 57    6 58    7 59    
8 60    9 61    : -1    ; -1    < -1    = -1    > -1    ? -1    
@ -1    A  0    B  1    C  2    D  3    E  4    F  5    G  6    
H  7    I  8    J  9    K 10    L 11    M 12    N 13    O 14    
P 15    Q 16    R 17    S 18    T 19    U 20    V 21    W 22    
X 23    Y 24    Z 25    [ -1    \ -1    ] -1    ^ -1    _ -1    
` -1    a 26    b 27    c 28    d 29    e 30    f 31    g 32    
h 33    i 34    j 35    k 36    l 37    m 38    n 39    o 40    
p 41    q 42    r 43    s 44    t 45    u 46    v 47    w 48    
x 49    y 50    z 51    { -1    | -1    } -1    ~ -1      -1
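
A few entries of this table can be checked directly. The sketch below rebuilds the table from the encoding alphabet in the same way as initialiseDecodingTable; the class name DecodingTableDemo and the method buildDecodingTable are illustrative and not part of Bouncy Castle.

```java
import java.util.Arrays;

// Standalone sketch that rebuilds the decoding table (names are hypothetical).
class DecodingTableDemo {
    static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static byte[] buildDecodingTable() {
        byte[] table = new byte[128];
        Arrays.fill(table, (byte) -1);          // -1 marks "not a Base64 character"
        for (int i = 0; i < TABLE.length(); i++) {
            table[TABLE.charAt(i)] = (byte) i;  // index = ASCII code, value = 6-bit number
        }
        return table;
    }

    public static void main(String[] args) {
        byte[] table = buildDecodingTable();
        System.out.println(table['A']);  // prints 0
        System.out.println(table['/']);  // prints 63
        System.out.println(table['=']);  // prints -1
    }
}
```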

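Decoding takes four consecutive characters, looks up each one in the decoding table, and drops the top two bits of each value, which leaves 24 bits. These 24 bits are then regrouped into 3 bytes of output. The last four characters are checked for padding characters and handled accordingly. The source is: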

public int decode(
    byte[]          data,
    int             off,
    int             length,
    OutputStream    out)
    throws IOException
{
    byte    b1, b2, b3, b4;
    int     outLen = 0;
    
    int     end = off + length;
    
    while (end > off)
    {
        if (!ignore((char)data[end - 1]))
        {
            break;
        }
        
        end--;
    }
    
    int  i = off;
    int  finish = end - 4;
    
    i = nextI(data, i, finish);
    while (i < finish)
    {
        b1 = decodingTable[data[i++]];
        
        i = nextI(data, i, finish);
        
        b2 = decodingTable[data[i++]];
        
        i = nextI(data, i, finish);
        
        b3 = decodingTable[data[i++]];
        
        i = nextI(data, i, finish);
        
        b4 = decodingTable[data[i++]];
        if ((b1 | b2 | b3 | b4) < 0)
        {
            throw new IOException("invalid "
                    + "characters encountered in base64 data");
        }
        
        out.write((b1 << 2) | (b2 >> 4));
        out.write((b2 << 4) | (b3 >> 2));
        out.write((b3 << 6) | b4);
        
        outLen += 3;
        
        i = nextI(data, i, finish);
    }
    outLen += decodeLastBlock(out, (char)data[end - 4], 
            (char)data[end - 3], (char)data[end - 2], 
            (char)data[end - 1]);
    
    return outLen;
}
private boolean ignore(char c)
{
    return (c == '\n' || c =='\r' || c == '\t' || c == ' ');
}
private int nextI(byte[] data, int i, int finish)
{
    while ((i < finish) && ignore((char)data[i]))
    {
        i++;
    }
    return i;
}
private int decodeLastBlock(OutputStream out, char c1, 
        char c2, char c3, char c4) throws IOException
{
    byte    b1, b2, b3, b4;
    
    if (c3 == padding)
    {
        b1 = decodingTable[c1];
        b2 = decodingTable[c2];
        if ((b1 | b2) < 0)
        {
            throw new IOException("invalid characters "
                    + "encountered at end of base64 data");
        }
        out.write((b1 << 2) | (b2 >> 4));
        
        return 1;
    }
    else if (c4 == padding)
    {
        b1 = decodingTable[c1];
        b2 = decodingTable[c2];
        b3 = decodingTable[c3];
        if ((b1 | b2 | b3) < 0)
        {
            throw new IOException("invalid characters"
                    + " encountered at end of base64 data");
        }
        
        out.write((b1 << 2) | (b2 >> 4));
        out.write((b2 << 4) | (b3 >> 2));
        
        return 2;
    }
    else
    {
        b1 = decodingTable[c1];
        b2 = decodingTable[c2];
        b3 = decodingTable[c3];
        b4 = decodingTable[c4];
        if ((b1 | b2 | b3 | b4) < 0)
        {
            throw new IOException("invalid characters"
                    + " encountered at end of base64 data");
        }
        
        out.write((b1 << 2) | (b2 >> 4));
        out.write((b2 << 4) | (b3 >> 2));
        out.write((b3 << 6) | b4);
        
        return 3;
    } 
}

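As the code shows, decoding skips whitespace (\n, \r, \t and space) at the start of the input, at the end of the input, and between characters. Note that the last four characters are passed to decodeLastBlock directly, so whitespace inside that final block is not skipped.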
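
The effect of skipping whitespace can be tried out with a small standalone decoder. The sketch below is not the Bouncy Castle algorithm: instead of reading four characters at a time it accumulates bits one character at a time, it skips whitespace everywhere (including the final block), and it stops at the first '=' without validating the padding. The class name LenientDecoder is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Simplified bit-accumulator decoder (an illustration, not the Bouncy Castle code).
class LenientDecoder {
    static final String TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    static final byte[] DECODING_TABLE = new byte[128];

    static {
        Arrays.fill(DECODING_TABLE, (byte) -1);
        for (int i = 0; i < TABLE.length(); i++) {
            DECODING_TABLE[TABLE.charAt(i)] = (byte) i;
        }
    }

    // Same whitespace set as the ignore() method shown above.
    static boolean ignore(char c) {
        return c == '\n' || c == '\r' || c == '\t' || c == ' ';
    }

    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;  // bits collected so far
        int bits = 0;    // number of valid bits in buffer
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (ignore(c)) {
                continue;             // whitespace is skipped anywhere in the input
            }
            if (c == '=') {
                break;                // padding marks the end of the data
            }
            if (c >= 128 || DECODING_TABLE[c] < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            buffer = (buffer << 6) | DECODING_TABLE[c];
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                out.write((buffer >> bits) & 0xff);  // emit the top 8 collected bits
                buffer &= (1 << bits) - 1;           // keep only the leftover bits
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] decoded = decode(" QQ\n==\r\n");
        System.out.println(new String(decoded));  // prints A
    }
}
```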

The Apache Commons Codec Implementation

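The Apache Commons Codec implementation is more complex. It factors out an abstract class, BaseNCodec, so that Base32 and Base64 can share one framework; the Base64 implementation is org.apache.commons.codec.binary.Base64. Encoding is again table-driven, and because the Base64 class also supports URL-safe Base64, it defines two encoding tables. This article does not analyze that code.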

Chunking Base64 Output

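The MIME form of Base64 (RFC 2045) limits each encoded line to at most 76 characters and terminates lines with CRLF (\r\n); RFC 4648 on its own does not require line breaks. In chunked output, every line ends with CRLF, including a last line shorter than 76 characters. The Bouncy Castle Base64 class does not implement chunking, whereas Apache Commons Codec does (for example through Base64.encodeBase64Chunked).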
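
The chunking rule can be sketched in a few lines. The example below uses neither library discussed in this article: it relies on java.util.Base64 from the JDK (8 or later) for the encoding and then splits the result by hand, so that every line, including the last, ends with CRLF as described above. The class name ChunkedBase64 and the method encodeChunked are illustrative.

```java
import java.util.Base64;

// Hand-written chunking on top of the JDK encoder (an illustrative sketch).
class ChunkedBase64 {
    static final int LINE_LENGTH = 76;

    static String encodeChunked(byte[] data) {
        String encoded = Base64.getEncoder().encodeToString(data);
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < encoded.length(); start += LINE_LENGTH) {
            int end = Math.min(start + LINE_LENGTH, encoded.length());
            // Every line, full or partial, is terminated with CRLF.
            out.append(encoded, start, end).append("\r\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 100 bytes encode to 136 characters: one full line of 76 and one line of 60.
        String chunked = encodeChunked(new byte[100]);
        System.out.println(chunked.length());  // prints 140 (136 characters + 2 CRLF pairs)
    }
}
```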