
Embedding

The Embedding module stores embedding values on disk with RocksDB, using a key-value layout. The key is the feature hash (a uint64), and the value is the embedding's list of floats plus a few bookkeeping fields.

Key

All features are discretized and represented by a unique uint64 value. Features of the same kind are managed under the same feature group. This uint64 value is used directly as the lookup key in RocksDB.
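
For illustration only, here is a minimal sketch of turning a raw feature string into a uint64 key with a 64-bit hash. The actual hashing scheme used upstream of this module is not specified here, so feature_to_key and its exact construction are assumptions.

import hashlib
import numpy as np

def feature_to_key(group: int, raw_feature: str) -> np.uint64:
    # Hash the group and raw feature together and keep the first 8 bytes
    # as an unsigned 64-bit integer. Hypothetical scheme for illustration only.
    digest = hashlib.sha256(f"{group}:{raw_feature}".encode()).digest()
    return np.uint64(int.from_bytes(digest[:8], "little"))

key = feature_to_key(1, "user_id=42")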

Group

Features are split into different groups. Each group can be configured with its own dimension (width), its own initializer, and its own optimizer.
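
As a sketch of per-group configuration (constructor arguments are elided with ..., exactly as in the usage section below; the group choices and dimensions here are hypothetical), two groups might be set up like this:

import damo

storage = damo.PyStorage(...)

# Hypothetical setup: group 0 uses an 8-dimensional embedding,
# group 1 a 32-dimensional one, each with its own optimizer and initializer.
user_emb = damo.PyEmbedding(storage, damo.PyOptimizer(...), damo.PyInitializer(...), 8, 0)
item_emb = damo.PyEmbedding(storage, damo.PyOptimizer(...), damo.PyInitializer(...), 32, 1)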

Value

struct MetaData {
  int group;            // feature group id
  int64_t key;          // feature key
  int64_t update_time;  // time of the last update
  int64_t update_num;   // how many times this key has been updated
  float data[];         // embedding weights (flexible array member, `dimension` floats)
};
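
As a rough illustration of how the stored value grows with the embedding dimension, the sketch below estimates the value size, assuming typical 8-byte alignment of the struct above; the exact on-disk layout may differ.

def value_size_bytes(dim: int) -> int:
    # group (4 bytes, padded to 8 under typical alignment) + key +
    # update_time + update_num, followed by dim float32 weights.
    header = 8 + 8 + 8 + 8
    return header + 4 * dim

print(value_size_bytes(16))  # 96 bytes for a 16-dimensional embedding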

TTL

Features that have not been updated for a long time can be deleted by setting a TTL, a capability RocksDB supports natively. This also helps keep the model size down.
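
Conceptually, the staleness criterion looks like the check below. This is only an illustration, not the RocksDB implementation or a damo API.

import time

def is_stale(update_time: int, ttl_seconds: int) -> bool:
    # A record is considered stale once it has gone longer than the TTL
    # without an update; stale records are dropped during compaction.
    return time.time() - update_time > ttl_seconds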

Usage

How to Create an Embedding

The arguments are listed below:

  1. storage: damo.PyStorage type

  2. optimizer: damo.PyOptimizer type

  3. initializer: damo.PyInitializer type

  4. dimension: int type, dim of embedding

  5. group: int type, [0, 256), default: 0

import damo
storage = damo.PyStorage(...)
optimizer = damo.PyOptimizer(...)
initializer = damo.PyInitializer(...)
dimension = 16
group = 1
embedding = damo.PyEmbedding(storage, optimizer, initializer, dimension, group)

Member Functions of Embedding

Embedding has two member functions; neither returns a value. They are listed below:

lookup: pull weights from the embedding

The arguments are listed below:

  1. keys: numpy.ndarray type, one dimension, dtype MUST BE np.uint64

  2. weights: numpy.ndarray type, one dimension

    1. dtype MUST BE np.float32

    2. size == embedding_dimension * keys.shape[0]

    The lookup result will be written into this buffer.

import numpy as np

# example
n = 8
keys = np.zeros(n).astype(np.uint64)
for i in range(n):
    keys[i] = i + 1

# array([1, 2, 3, 4, 5, 6, 7, 8], dtype=uint64)

weight = np.zeros(n*dimension).astype(np.float32)

embedding.lookup(keys, weight)

# it is easy to extract each key's weight
tmp = weight.reshape((n, dimension))
weight_dict = {k: v for k,v in zip(keys, tmp)}

apply_gradients: push gradients to the embedding

The arguments are listed below:

  1. keys: same as lookup, numpy.ndarray type, one dimension, dtype MUST BE np.uint64

  2. gradients: numpy.ndarray type, one dimension

    1. dtype MUST BE np.float32

    2. size == embedding_dimension * keys.shape[0]

import numpy as np

gradients = np.random.random(n*dimension).astype(np.float32)

embedding.apply_gradients(keys, gradients)
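
Putting the two calls together, here is a minimal end-to-end sketch reusing n, dimension, keys, and embedding from the snippets above. The gradient computation is a hypothetical toy objective; in practice the gradients come from your training framework.

import numpy as np

weight = np.zeros(n * dimension, dtype=np.float32)
embedding.lookup(keys, weight)

# Toy objective 0.5 * ||w||^2, whose gradient is simply w itself.
gradients = weight.copy()

embedding.apply_gradients(keys, gradients)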