https://github.com/megagonlabs/ditto
Dataset configuration
In our experiments, we evaluated Ditto on two benchmarks:
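- The ER_Magellan benchmark used in the DeepMatcher paper. It contains 13 datasets in 3 categories, Structured, Dirty, and Textual, which represent different dataset characteristics.
- The WDC product matching benchmark. It contains e-commerce product offer pairs from 4 domains: cameras, computers, shoes, and watches. The training data of each domain is also subsampled into different sizes (small, medium, large, and xlarge) to test the label efficiency of the models.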
We provide serialized versions of these datasets in data/.
The dataset configurations in configs.json are as follows:
[
{
"//": "the WDC product matching benchmark"
},
{
"name": "wdc_all_small",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/all/train.txt.small",
"validset": "data/wdc/all/valid.txt.small",
"testset": "data/wdc/all/test.txt"
},
{
"name": "wdc_cameras_small",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/cameras/train.txt.small",
"validset": "data/wdc/cameras/valid.txt.small",
"testset": "data/wdc/cameras/test.txt"
},
{
"name": "wdc_computers_small",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/computers/train.txt.small",
"validset": "data/wdc/computers/valid.txt.small",
"testset": "data/wdc/computers/test.txt"
},
{
"name": "wdc_shoes_small",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/shoes/train.txt.small",
"validset": "data/wdc/shoes/valid.txt.small",
"testset": "data/wdc/shoes/test.txt"
},
{
"name": "wdc_watches_small",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/watches/train.txt.small",
"validset": "data/wdc/watches/valid.txt.small",
"testset": "data/wdc/watches/test.txt"
},
{
"name": "wdc_all_medium",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/wdc/all/train.txt.medium",
"validset": "data/wdc/all/valid.txt.medium",
"testset": "data/wdc/all/test.txt"
},
{
"//": "the remaining medium, large, and xlarge entries follow the same pattern"
},
{
"name": "Dirty/DBLP-ACM",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/er_magellan/Dirty/DBLP-ACM/train.txt",
"validset": "data/er_magellan/Dirty/DBLP-ACM/valid.txt",
"testset": "data/er_magellan/Dirty/DBLP-ACM/test.txt"
},
{
"name": "Dirty/DBLP-GoogleScholar",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/er_magellan/Dirty/DBLP-GoogleScholar/train.txt",
"validset": "data/er_magellan/Dirty/DBLP-GoogleScholar/valid.txt",
"testset": "data/er_magellan/Dirty/DBLP-GoogleScholar/test.txt"
},
{
"name": "Dirty/iTunes-Amazon",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/er_magellan/Dirty/iTunes-Amazon/train.txt",
"validset": "data/er_magellan/Dirty/iTunes-Amazon/valid.txt",
"testset": "data/er_magellan/Dirty/iTunes-Amazon/test.txt"
},
{
"name": "Dirty/Walmart-Amazon",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/er_magellan/Dirty/Walmart-Amazon/train.txt",
"validset": "data/er_magellan/Dirty/Walmart-Amazon/valid.txt",
"testset": "data/er_magellan/Dirty/Walmart-Amazon/test.txt"
},
{
"//": "the Structured and Textual entries follow the same pattern"
}
]
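To make the role of the name field concrete, here is a minimal sketch of how a task could be looked up in such a configuration list. The helper names index_by_name and load_task_configs are hypothetical and are not taken from the Ditto code base; the sketch also assumes that comment-only entries (those with just a "//" key) should be skipped.

```python
import json


def index_by_name(entries):
    """Map each task name to its config entry.

    Entries without a "name" key (for example the "//" comment
    entries shown above) are skipped. Hypothetical helper.
    """
    return {entry["name"]: entry for entry in entries if "name" in entry}


def load_task_configs(path="configs.json"):
    """Read the JSON list at `path` and index it by task name."""
    with open(path, encoding="utf-8") as f:
        return index_by_name(json.load(f))


# Small inline sample in the same shape as the entries above.
sample = json.loads("""
[
  {"//": "the WDC product matching benchmark"},
  {"name": "wdc_all_small",
   "task_type": "classification",
   "vocab": ["0", "1"],
   "trainset": "data/wdc/all/train.txt.small",
   "validset": "data/wdc/all/valid.txt.small",
   "testset": "data/wdc/all/test.txt"}
]
""")

configs = index_by_name(sample)
task = configs["wdc_all_small"]
print(task["trainset"], task["validset"], task["testset"])
```

With the real file, load_task_configs() would be called instead of parsing the inline sample, and the value passed to --task would be used as the lookup key.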
The WDC data is organized into five groups: all, cameras, computers, shoes, and watches. Each group has training sets and validation sets in four sizes (small, medium, large, and xlarge) and a single test set that is shared across sizes.
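The WDC entries follow a regular naming pattern, so the full set can be sketched programmatically. The snippet below is only an illustration: the function wdc_config is hypothetical, and the names for the large and xlarge sizes are extrapolated from the small and medium entries shown above rather than copied from configs.json.

```python
DOMAINS = ["all", "cameras", "computers", "shoes", "watches"]
SIZES = ["small", "medium", "large", "xlarge"]


def wdc_config(domain, size):
    """Build one config entry for a WDC domain and training-set size.

    Train and validation files carry the size suffix; the test file
    does not, because it is shared by all sizes of a domain.
    """
    base = f"data/wdc/{domain}"
    return {
        "name": f"wdc_{domain}_{size}",
        "task_type": "classification",
        "vocab": ["0", "1"],
        "trainset": f"{base}/train.txt.{size}",
        "validset": f"{base}/valid.txt.{size}",
        "testset": f"{base}/test.txt",
    }


# One entry per (size, domain) pair: 4 sizes x 5 groups = 20 entries.
wdc_configs = [wdc_config(d, s) for s in SIZES for d in DOMAINS]

for cfg in wdc_configs[:3]:
    print(cfg["name"], cfg["trainset"])
```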
Running
Training the model
python train_ditto.py --task Structured/DBLP-ACM --batch_size 64 --max_len 64 --lr 3e-5 --n_epochs 40 --lm distilbert --fp16 --da del --dk product --summarize --save_model
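When training on several tasks, the same command line can be assembled from a task name. The sketch below only builds the command string and does not run it; train_command is a hypothetical helper, and it uses only the flags and default values that appear in the command above.

```python
import shlex


def train_command(task, lm="distilbert", batch_size=64, max_len=64,
                  lr="3e-5", n_epochs=40):
    """Return the training command for `task` as a shell-safe string."""
    args = [
        "python", "train_ditto.py",
        "--task", task,
        "--batch_size", str(batch_size),
        "--max_len", str(max_len),
        "--lr", lr,
        "--n_epochs", str(n_epochs),
        "--lm", lm,
        "--fp16",
        "--da", "del",
        "--dk", "product",
        "--summarize",
        "--save_model",
    ]
    return shlex.join(args)


# Task names must match the "name" field of an entry in configs.json.
for task in ["Structured/DBLP-ACM", "Dirty/DBLP-ACM"]:
    print(train_command(task))
```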
Testing the model
python matcher.py --task Structured/DBLP-ACM --input_path input/input_small.jsonl --output_path output/output_small.jsonl --lm distilbert --max_len 64 --use_gpu --fp16 --checkpoint_path checkpoints/