A comparison of three approaches to nearest-neighbor search:
- The simplest (brute-force) approach is to compute the distance from the query point to every other point and take the closest one (a short sketch follows this list).
- kd-trees store the training data in a binary tree structure to reduce search time; the drawback is that they perform poorly in high-dimensional spaces, and the computational cost is still considerable.
- LSH (Locality Sensitive Hashing) hashes the whole dataset by placing random hyperplanes in the space the data lives in; the goal, again, is to reduce search time.
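As a point of reference for the brute-force approach, here is a minimal sketch with NumPy (the function and array names are illustrative, not part of the course code):
import numpy as np

def brute_force_nearest(query, points):
    # Compute the Euclidean distance from `query` to every row of `points`
    # and return the index of the closest row. Cost is O(n * d) per query.
    dists = np.linalg.norm(points - query, axis=1)
    return np.argmin(dists), dists.min()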
Implementing LSH:
Compiled from the University of Washington Machine Learning specialization course.
# necessary packages
import numpy as np
import graphlab  # the machine learning package used in the course, similar in spirit to sklearn
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances
from itertools import combinations  # used later to enumerate nearby bins
import time
from copy import copy
import matplotlib.pyplot as plt
# load the dataset
# the dataset has the columns: URL, name, text
wiki = graphlab.SFrame('people_wiki.gl/')
wiki = wiki.add_row_number()
# compute tf-idf
# the new column stores each document's tf-idf weights as a dictionary
wiki['tf_idf'] = graphlab.text_analytics.tf_idf(wiki['text'])
Convert the tf-idf dictionaries extracted from the documents into a sparse matrix:
def sframe_to_scipy(column):
    """
    Convert a dict-typed SArray into a SciPy sparse matrix.

    Returns
    -------
    mat : a SciPy sparse matrix where mat[i, j] is the value of word j for document i.
    mapping : a dictionary where mapping[j] is the word whose values are in column j.
    """
    # Create triples of (row_id, feature_id, count).
    x = graphlab.SFrame({'X1': column})
    # Add a row number.
    x = x.add_row_number()
    # stack will transform x to have a row for each unique (row, key) pair.
    x = x.stack('X1', ['feature', 'value'])
    # Map words into integers using a OneHotEncoder feature transformation.
    f = graphlab.feature_engineering.OneHotEncoder(features=['feature'])
    # We first fit the transformer using the above data.
    f.fit(x)
    # The transform method will add a new column that is the transformed version
    # of the 'word' column.
    x = f.transform(x)
    # Get the feature mapping.
    mapping = f['feature_encoding']
    # Get the actual word id.
    x['feature_id'] = x['encoded_features'].dict_keys().apply(lambda x: x[0])
    # Create numpy arrays that contain the data for the sparse matrix.
    i = np.array(x['id'])          # document ids
    j = np.array(x['feature_id'])  # word ids after one-hot encoding
    v = np.array(x['value'])       # tf-idf value of each word
    width = x['id'].max() + 1
    height = x['feature_id'].max() + 1
    # Create the sparse matrix: rows are documents, columns are words.
    mat = csr_matrix((v, (i, j)), shape=(width, height))
    # mapping: dictionary from each word to its column position in mat
    return mat, mapping
# Build the sparse matrix.
# corpus is a sparse matrix whose row index is the document id and whose column index is the word id;
# mapping records the correspondence between word ids and words.
corpus, mapping = sframe_to_scipy(wiki['tf_idf'])
corpus.shape == (59071, 547979)
Training the LSH model:
Step 1: randomly generate hyperplanes whose entries follow a Gaussian distribution; the dimension of each hyperplane matches the number of columns of the sparse matrix.
Step 2: take the dot product of each document vector with this set of hyperplanes to get a one-dimensional vector; set the elements that are greater than or equal to 0 to 1 and the rest to 0. The resulting bit vectors hash the dataset (a toy illustration follows).
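A toy illustration of the two steps; the tiny matrices here are made up purely for illustration:
import numpy as np

np.random.seed(1)
toy_data = np.array([[1.0, 0.0, 2.0],      # 3 "documents" with 3 features each
                     [0.0, 1.0, 0.5],
                     [2.0, 2.0, 0.0]])
toy_planes = np.random.randn(3, 2)         # step 1: 2 random hyperplanes of dimension 3
toy_bits = toy_data.dot(toy_planes) >= 0   # step 2: the sign of each projection gives the hash bits
print toy_bits                             # each row is the 2-bit hash of one document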
def generate_random_vectors(num_vector, dim):
    # Generate a dim x num_vector matrix with Gaussian-distributed entries.
    # num_vector: number of random vectors
    # dim: dimension of each random vector
    # Note the argument order: the random vectors are the columns of the result.
    return np.random.randn(dim, num_vector)
Testing it:
# Generate 3 random vectors of dimension 5, arranged into a single 5 x 3 matrix.
np.random.seed(0) # set seed=0 for consistent results
generate_random_vectors(num_vector=3, dim=5)
# i.e. 3 column vectors of dimension 5 each
>>> array([[ 1.76405235,  0.40015721,  0.97873798],
           [ 2.2408932 ,  1.86755799, -0.97727788],
           [ 0.95008842, -0.15135721, -0.10321885],
           [ 0.4105985 ,  0.14404357,  1.45427351],
           [ 0.76103773,  0.12167502,  0.44386323]])
Testing the hash bits on the corpus (the 16 random vectors must first be generated with a dimension equal to the number of columns of corpus):
np.random.seed(0)
random_vectors = generate_random_vectors(num_vector=16, dim=547979)
corpus[0:2].dot(random_vectors) >= 0 # compute bit indices of the first two documents
>>> array([[ True,  True, False, False, False,  True,  True, False,  True,
             True,  True, False, False,  True, False,  True],
           [ True, False, False, False,  True,  True, False,  True,  True,
            False,  True, False,  True, False, False,  True]], dtype=bool)
Going one step further, each such bit vector can be uniquely represented by an integer:
Bin index                              integer
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] => 0
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1] => 1
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0] => 2
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1] => 3
...
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0] => 65532
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1] => 65533
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0] => 65534
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] => 65535 (= 2^16-1)
So a single integer is enough to identify the bin a document vector falls into.
doc = corpus[0, :] # first document
index_bits = (doc.dot(random_vectors) >= 0)
powers_of_two = (1 << np.arange(15, -1, -1))
print index_bits
print powers_of_two
print index_bits.dot(powers_of_two)
>>> [[ True  True False False False  True  True False  True  True  True False
      False  True False  True]]
    [32768 16384  8192  4096  2048  1024   512   256   128    64    32    16
         8     4     2     1]
    [50917]
# now for the entire corpus
index_bits = corpus.dot(random_vectors) >= 0
index_bits.dot(powers_of_two)
>>> array([50917, 36265, 19365, ..., 52983, 27589, 41449])
You can think of the regions carved out by the hyperplanes as buckets (bins): the dataset is scattered across these bins, and the array above is exactly the bin ID of each document. When we need to find the points nearest to a query point, we only have to search inside the bin that the query falls into, which reduces the cost of the computation.
One thing to watch out for: bins with adjacent IDs are not adjacent in space. (Bins whose IDs differ by a power of two, i.e. whose bit vectors differ in exactly one bit, are the ones that are actually adjacent; a quick check follows.)
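A quick check of that remark; the 4-bit vector here is only for illustration:
import numpy as np

bits = np.array([1, 0, 1, 1])              # bin [1,0,1,1] has index 11
powers_of_two = 1 << np.arange(3, -1, -1)  # [8, 4, 2, 1]
print bits.dot(powers_of_two)              # 11
flipped = bits.copy()
flipped[1] = 1 - flipped[1]                # flip one bit -> a spatially adjacent bin
print flipped.dot(powers_of_two)           # 15, i.e. 11 + 2**2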
Training the model:
def train_lsh(data, num_vector=16, seed=None):
    dim = data.shape[1]
    if seed is not None:
        np.random.seed(seed)
    random_vectors = generate_random_vectors(num_vector, dim)
    # Use bit shifts to compute the powers of two: [2^(num_vector-1), ..., 2, 1]
    powers_of_two = 1 << np.arange(num_vector-1, -1, -1)
    table = {}
    # Partition data points into bins
    bin_index_bits = (data.dot(random_vectors) >= 0)
    # Encode bin index bits into integers
    bin_indices = bin_index_bits.dot(powers_of_two)
    # Update `table` so that `table[i]` is the list of document ids with bin index equal to i.
    for data_index, bin_index in enumerate(bin_indices):
        if bin_index not in table:
            # If no list yet exists for this bin, assign the bin an empty list.
            table[bin_index] = []
        # Fetch the list of document ids associated with the bin and add the document id to the end.
        table[bin_index].append(data_index)
    model = {'data': data,
             'bin_index_bits': bin_index_bits,
             'bin_indices': bin_indices,
             'table': table,
             'random_vectors': random_vectors,
             'num_vector': num_vector}
    return model
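A minimal usage sketch, using num_vector=16 and seed=143 (the same values that appear in the search_nearby_bins docstring below); the bin contents will of course depend on the seed:
model = train_lsh(corpus, num_vector=16, seed=143)
table = model['table']
# Documents that hashed into the same bin as document 0:
print table[model['bin_indices'][0]]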
Defining the distance:
def norm(x):
    sum_sq = x.dot(x.T)
    norm = np.sqrt(sum_sq)
    return norm

def cosine_distance(x, y):
    xy = x.dot(y.T)
    dist = xy / (norm(x) * norm(y))
    return 1 - dist
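A quick sanity check against the pairwise_distances call that the query function uses below; documents 0 and 1 are arbitrary, and depending on the SciPy version cosine_distance may return a 1x1 matrix rather than a scalar, in which case index it with [0, 0]:
x = corpus[0, :]
y = corpus[1, :]
print cosine_distance(x, y)                            # hand-rolled cosine distance
print pairwise_distances(x, y, metric='cosine')[0, 0]  # sklearn's value; should agree up to floating point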
Searching for nearest neighbors with the LSH model
Search strategy:
- Let the bit vector L be the bin the query document falls into
- First search all vectors inside this bin
- Then search the bins that differ from this bin by 1 bit
- Then search the bins that differ from this bin by 2 bits
Use itertools.combinations to enumerate the candidate bins (a small example follows):
- Fix a search radius r
- Flip r bits of the bin's bit array according to the search radius
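For example, with 4 hash bits and search radius 2, combinations enumerates every pair of bit positions to flip (the numbers here are purely illustrative):
from itertools import combinations

for different_bits in combinations(range(4), 2):
    print different_bits
# (0, 1) (0, 2) (0, 3) (1, 2) (1, 3) (2, 3) -- one candidate bin per pair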
def search_nearby_bins(query_bin_bits, table, search_radius=2, initial_candidates=set()):
    """
    For a given query vector and trained LSH model, return all candidate neighbors for
    the query among all bins within the given search radius.

    Example usage
    -------------
    >>> model = train_lsh(corpus, num_vector=16, seed=143)
    >>> q = model['bin_index_bits'][0] # vector for the first document
    >>> candidates = search_nearby_bins(q, model['table'])
    """
    num_vector = len(query_bin_bits)
    powers_of_two = 1 << np.arange(num_vector-1, -1, -1)
    # Allow the user to provide an initial set of candidates.
    candidate_set = copy(initial_candidates)
    for different_bits in combinations(range(num_vector), search_radius):
        # Flip the bits (n_1, n_2, ..., n_r) of the query bin to produce a new bit vector.
        alternate_bits = copy(query_bin_bits)
        for i in different_bits:
            # Flip the bit at position i.
            alternate_bits[i] = 1 - alternate_bits[i]
        # Convert the new bit vector to an integer index.
        nearby_bin = alternate_bits.dot(powers_of_two)
        # Fetch the list of documents belonging to the bin indexed by the new bit vector,
        # and add those documents to candidate_set (only if the bin exists in the table).
        if nearby_bin in table:
            candidate_set.update(table[nearby_bin])
    return candidate_set
def query(vec, model, k, max_search_radius):
    # vec: the tf-idf vector of the query document
    # model: the trained LSH model
    # k: number of nearest neighbors to return
    # max_search_radius: maximum search radius
    data = model['data']
    table = model['table']
    random_vectors = model['random_vectors']
    num_vector = random_vectors.shape[1]
    # Compute the bin index for the query vector, in bit representation.
    bin_index_bits = (vec.dot(random_vectors) >= 0).flatten()
    # Search nearby bins and collect candidates.
    candidate_set = set()
    for search_radius in xrange(max_search_radius+1):
        candidate_set = search_nearby_bins(bin_index_bits, table, search_radius,
                                           initial_candidates=candidate_set)
    # Sort candidates by their true distances from the query.
    nearest_neighbors = graphlab.SFrame({'id': candidate_set})
    candidates = data[np.array(list(candidate_set)), :]
    nearest_neighbors['distance'] = pairwise_distances(candidates, vec, metric='cosine').flatten()
    return nearest_neighbors.topk('distance', k, reverse=True), len(candidate_set)
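Putting it together, a query might look like the following sketch; the document id 35817 and the parameter values k=10, max_search_radius=3 are illustrative, not fixed by the code above:
model = train_lsh(corpus, num_vector=16, seed=143)
result, num_candidates = query(corpus[35817, :], model, k=10, max_search_radius=3)
print result          # the k nearest documents found among the candidate bins
print num_candidates  # how many candidates were actually examined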
The above covers all of the main functions of the LSH implementation; for full details, see my personal git repository.