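In AIDD (AI-driven drug design) workflows, datasets mostly come as CSV files, whereas CADD (computer-aided drug design) workflows mostly use SDF files. This post shares commonly used code for converting between the two formats.
1. Converting an SDF file to a CSV file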
from rdkit import Chem
import pandas as pd
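# Read the SDF file; the supplier can be indexed and iterated over
suppl = Chem.SDMolSupplier('my_project/demo_cpds.sdf')
# Inspect the properties of the first compound as a sample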
suppl[0].GetPropsAsDict()
output:
{'ID': '001',
'Mw': 469.4,
'logP': 4.6,
'HBD': 2,
'HBA': 7,
'TPSA': 99.8,
'RotB': 4,
'QED': 0.55,
'chiral_center': 1,
'aromatic_rings': 3
}
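One caveat when indexing the supplier directly: RDKit returns None for any record it cannot parse, so calling GetPropsAsDict() on such an entry fails. The sketch below illustrates this with a small in-memory SDF; the record contents and the ID value cpd-A are made up for illustration and are not part of the demo file.

```python
from rdkit import Chem

# Build a tiny two-record SDF in memory: one valid record and one
# deliberately broken record (both are hypothetical examples).
good = Chem.MolToMolBlock(Chem.MolFromSmiles('CCO')) + '> <ID>\ncpd-A\n\n$$$$\n'
bad = 'broken\n\n\nnot a counts line\nM  END\n$$$$\n'

suppl_demo = Chem.SDMolSupplier()
suppl_demo.SetData(good + bad)

# Keep only the records that RDKit could parse
mols = [m for m in suppl_demo if m is not None]

n_records = len(suppl_demo)   # number of records in the SDF text
n_parsed = len(mols)          # number of records that became molecules
first_id = mols[0].GetProp('ID')
print(n_records, n_parsed, first_id)
```

Filtering out None up front keeps the later loop from failing on a single bad record.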
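Next, iterate over the whole SDF file and collect the records into a DataFrame
cpd_list = []
for i in range(len(suppl)):
    try:
        mol = suppl[i]
        # The supplier returns None for records RDKit cannot parse
        if mol is None:
            print(i, 'could not be parsed, skipped')
            continue
        temp_dict = mol.GetPropsAsDict()
        temp_dict['SMILES'] = Chem.MolToSmiles(mol)
        cpd_list.append(temp_dict)
    except Exception as e:
        print(i, e)
        continue
df = pd.DataFrame(cpd_list)
print(df.shape)
df.head(1)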
output:
(5,11)
ID Mw logP HBD HBA TPSA RotB QED chiral_center aromatic_rings SMILES
0 001 469.4 4.6 2 7 99.8 4 0.55 1 3 N#CC1NC(=O)c2cc(-c3cnn(C4CC4)c3)cc(NC(=O)c3cc(F)cc(C(F)(F)F)c3)c21
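Then simply save the DataFrame as a CSV file. Note that conformer-related information (such as 3D coordinates) is hard to carry in a CSV, so if you need it, it is better to keep the data in SDF format.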
df.to_csv('my_project/demo.csv', index=False)
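To make the conformer caveat concrete, here is a minimal sketch showing that a SMILES string does not carry coordinates. It uses ethanol with an embedded 3D conformer as a stand-in; the molecule and the random seed are assumptions chosen for illustration only.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Build a molecule with one 3D conformer (ethanol as a stand-in)
mol_3d = Chem.AddHs(Chem.MolFromSmiles('CCO'))
AllChem.EmbedMolecule(mol_3d, randomSeed=42)
n_before = mol_3d.GetNumConformers()

# Going through SMILES, as the CSV route does, keeps only the graph
smi = Chem.MolToSmiles(mol_3d)
mol_from_smi = Chem.MolFromSmiles(smi)
n_after = mol_from_smi.GetNumConformers()

print(n_before, n_after)
```

The molecule rebuilt from SMILES has no conformer, which is why docked poses or optimized geometries should stay in SDF.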
2. Converting a CSV file to an SDF file
Conversely, compound information stored in a CSV file can easily be converted to SDF format
df=pd.read_csv('my_project/demo.csv')
print (df.shape)
df.head(1)
output:
(5,11)
ID Mw logP HBD HBA TPSA RotB QED chiral_center aromatic_rings SMILES
0 001 469.4 4.6 2 7 99.8 4 0.55 1 3 N#CC1NC(=O)c2cc(-c3cnn(C4CC4)c3)cc(NC(=O)c3cc(F)cc(C(F)(F)F)c3)c21
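One thing worth checking after reading the CSV back: by default pandas infers column types, so an identifier written with leading zeros may be parsed as an integer. The sketch below uses a two-column in-memory CSV (a hypothetical excerpt, not the demo file) to show how passing dtype for the ID column keeps it as text.

```python
import io
import pandas as pd

# A hypothetical two-column excerpt of a compound table
csv_text = 'ID,Mw\n001,469.4\n'

# Default type inference turns the ID into an integer
df_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the ID column to str preserves the leading zeros
df_str_id = pd.read_csv(io.StringIO(csv_text), dtype={'ID': str})

print(df_default['ID'][0], df_str_id['ID'][0])
```

Whether this matters depends on how the IDs are used downstream, but it avoids mismatches when joining against the original SDF properties.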
Next, prepare a small helper function, out_sdf, which takes a list of compounds and writes them to the given path
def out_sdf(lig_list, filename):
    # Write every molecule in lig_list to an SDF file at filename
    writer = Chem.SDWriter(filename)
    for mol in lig_list:
        writer.write(mol)
    writer.close()
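Build the list of compounds, attaching all the information in the table to each molecule
cpd_list = []
for idx, row in df.iterrows():
    if idx % 5000 == 0:
        print(idx, 'have been processed')
    try:
        mol = Chem.MolFromSmiles(row['SMILES'])
        # MolFromSmiles returns None for an invalid SMILES
        if mol is None:
            print(idx, 'invalid SMILES, skipped')
            continue
        # Attach every column of the row as a string property
        for prop in df.columns:
            mol.SetProp(prop, str(row[prop]))
        cpd_list.append(mol)
    except Exception as e:
        print(idx, e)
len(cpd_list)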
output:
5
Finally, write the compounds to an SDF file
out_sdf(cpd_list, 'my_project/demo.sdf')
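As a quick sanity check, the written file can be read back and compared with what went in. The sketch below does this for a single stand-in molecule in a temporary directory; the file name roundtrip.sdf and the ID value are hypothetical. It also calls Compute2DCoords before writing, an optional step, since molecules built from SMILES have no coordinates of their own.

```python
import os
import tempfile
from rdkit import Chem
from rdkit.Chem import AllChem

# A stand-in molecule with one property attached
mol = Chem.MolFromSmiles('CCO')
mol.SetProp('ID', 'cpd-A')
AllChem.Compute2DCoords(mol)  # optional: generate 2D depiction coordinates

# Write it to a temporary SDF file
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.sdf')
writer = Chem.SDWriter(path)
writer.write(mol)
writer.close()

# Read the file back, skipping any record that failed to parse
back = [m for m in Chem.SDMolSupplier(path) if m is not None]
print(len(back), back[0].GetProp('ID'), Chem.MolToSmiles(back[0]))
```

Comparing the record count and a few properties after the round trip is a cheap way to catch rows that were silently skipped.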