Reading Open-Source Projects, Part 1: Some Basic Python Syntax and Methods

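I have been reading the source code of some open-source Python bioinformatics projects, and this post records the key syntax and packages I came across. Syntax does not need to be memorized, but it is still worth writing down, so some very basic things are noted here as well.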

Python classes

#!/usr/bin/env python3

class Employee:
    """Base class for all employees."""
    empCount = 0  # class attribute shared by all instances

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary
        Employee.empCount += 1

    def displayCount(self):
        print("Total Employee %d" % Employee.empCount)

    def displayEmployee(self):
        print("Name : ", self.name, ", Salary: ", self.salary)

# Create the first Employee object
emp1 = Employee("Zara", 2000)
# Create the second Employee object
emp2 = Employee("Manni", 5000)
emp1.displayEmployee()
emp2.displayEmployee()
print("Total Employee %d" % Employee.empCount)

#Name :  Zara , Salary:  2000
#Name :  Manni , Salary:  5000
#Total Employee 2

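The enumerate() function

enumerate() takes an iterable (such as a list, tuple or string) and yields (index, item) pairs, so the data and its index are available together. It is mostly used in for loops.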

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons))
#[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
list(enumerate(seasons, start=1))       # index starts at 1
#[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]

# The same kind of indexed loop written by hand, without enumerate()
i = 0
seq = ['one', 'two', 'three']
for element in seq:
    print(i, seq[i])
    i += 1

#0 one
#1 two
#2 three
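
For comparison, here is a minimal sketch of the same loop using enumerate(), which removes the manual counter. The `lines` list exists only to make the result easy to inspect.

```python
seq = ['one', 'two', 'three']

# enumerate() yields (index, element) pairs, so no manual counter is needed
lines = [f"{i} {element}" for i, element in enumerate(seq)]
for line in lines:
    print(line)
```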

The zip() function

a = [1,2,3]
b = [4,5,6]
c = [4,5,6,7,8]
zipped = list(zip(a,b))     # pack into a list of tuples (in Python 3, zip() itself returns an iterator)
## [(1, 4), (2, 5), (3, 6)]
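
The list c above is longer than a. The sketch below shows what zip() does in that case, and how to undo a zip with the * operator.

```python
a = [1, 2, 3]
c = [4, 5, 6, 7, 8]

# zip() stops at the shortest input, so the extra items in c are dropped
pairs = list(zip(a, c))
print(pairs)

# zip(*pairs) reverses the operation and returns one tuple per original input
first, second = zip(*pairs)
print(first, second)
```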

itertools.combinations

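itertools --- Functions creating iterators for efficient looping - Python 3.10.1 documentation
combinations() generates every combination of a given length, i.e. unordered selections. For ordered arrangements, itertools.permutations is the counterpart.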

from itertools import combinations
test_data = ['a1', 'a2', 'a3', 'b']
for i in combinations(test_data, 2):
    print (i)

#('a1', 'a2')
#('a1', 'a3')
#('a1', 'b')
#('a2', 'a3')
#('a2', 'b')
#('a3', 'b')

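## To see only the comparisons between a1 and the others
## (this works because combinations() emits the pairs containing the first element first)
z=0
for i in combinations(test_data, 2):
    if z < (len(test_data) - 1):
        print (i)
    z+=1
#('a1', 'a2')
#('a1', 'a3')
#('a1', 'b')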
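
The counter-based filter can be written more directly. This is a sketch of two alternatives, not what the original project does: itertools.islice takes the first few combinations, and a list comprehension pairs the first element with the rest without generating all combinations.

```python
from itertools import combinations, islice

test_data = ['a1', 'a2', 'a3', 'b']

# Option 1: take only the first len(test_data) - 1 combinations
first_pairs = list(islice(combinations(test_data, 2), len(test_data) - 1))

# Option 2: pair the first element with everything after it
direct_pairs = [(test_data[0], other) for other in test_data[1:]]

print(first_pairs)
print(first_pairs == direct_pairs)
```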

The mappy package

This is the Python binding for minimap2.

lh3/minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (github.com)

This class describes an alignment. An object of this class has the following properties:

ctg: name of the reference sequence the query is mapped to
ctg_len: total length of the reference sequence
r_st and r_en: start and end positions on the reference
q_st and q_en: start and end positions on the query
strand: +1 if on the forward strand; -1 if on the reverse strand
mapq: mapping quality
blen: length of the alignment, including both alignment matches and gaps but excluding ambiguous bases.
mlen: length of the matching bases in the alignment, excluding ambiguous base matches.
NM: number of mismatches, gaps and ambiguous positions in the alignment
trans_strand: transcript strand. +1 if on the forward strand; -1 if on the reverse strand; 0 if unknown
is_primary: if the alignment is primary (typically the best and the first to generate)
read_num: read number that the alignment corresponds to; 1 for the first read and 2 for the second read
cigar_str: CIGAR string
cigar: CIGAR returned as an array of shape (n_cigar,2). The two numbers give the length and the operator of each CIGAR operation.
MD: the MD tag as in the SAM format. It is an empty string unless the MD argument is applied when calling mappy.Aligner.map().
cs: the cs tag.

mappy · PyPI

pip install --user mappy

import mappy as mp
a = mp.Aligner("test/MT-human.fa")  # load or build index
if not a: raise Exception("ERROR: failed to load/build index")
s = a.seq("MT_human", 100, 200)     # retrieve a subsequence from the index
print(mp.revcomp(s))                # reverse complement
for name, seq, qual in mp.fastx_read("test/MT-orang.fa"): # read a fasta/q sequence
        for hit in a.map(seq): # traverse alignments
                print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))
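
The loop above prints hit.cigar_str. As a sketch of how such a string can be inspected without mappy installed, the helpers below (parse_cigar and reference_span are names made up for this example) split a CIGAR string into operations and count the reference bases it covers. The example string is invented, not real aligner output. For a well-formed alignment the span is expected to equal r_en - r_st.

```python
import re

# One CIGAR operation: a length followed by a single operator letter
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar_str):
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar_str)]

def reference_span(cigar_str):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar_str) if op in "MDN=X")

example = "10M2I5M3D20M"       # made-up value for illustration
print(parse_cigar(example))
print(reference_span(example))
```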

google.protobuf

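Google protobuf is an excellent open-source tool. In a project it can serve as the interface for exchanging data between services, for example RPC services or data file transfer. protobuf provides standard serialization and deserialization methods for the objects defined in a .proto file, which makes it easy to parse and convert pb objects.
Introduction to and usage of Protobuf - Jianshu (jianshu.com)
Tips and experience using Google protobuf - 张巩武 - cnblogs (cnblogs.com)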

hashlib

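Python's hashlib provides common digest algorithms such as MD5 and SHA1.
What is a digest algorithm? Also called a hash algorithm, it uses a function to convert data of arbitrary length into a fixed-length string (usually written in hexadecimal).
A digest can reveal whether data has been tampered with because the digest function is one-way: computing f(data) is easy, but recovering data from the digest is very hard. Moreover, changing even a single bit of the original data yields a completely different digest.
hashlib - Liao Xuefeng's official website (liaoxuefeng.com)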

import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

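MD5 produces a 128-bit result, usually written as a 32-character hexadecimal string.
SHA1 produces a 160-bit result, usually written as a 40-character hexadecimal string.
There are also SHA256 and SHA512; the more secure algorithms are slower and produce longer digests.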
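
A small sketch to check these lengths, and to see that a one-character change gives a different digest. The input strings are arbitrary.

```python
import hashlib

data = b"how to use md5 in python hashlib?"

# Each hex character encodes 4 bits, so hex length = digest bits / 4
lengths = {name: len(hashlib.new(name, data).hexdigest())
           for name in ("md5", "sha1", "sha256", "sha512")}
print(lengths)

# Changing a single character produces a different digest
d1 = hashlib.sha256(b"hello world").hexdigest()
d2 = hashlib.sha256(b"hello worle").hexdigest()
print(d1 != d2)
```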

import hashlib

sha1 = hashlib.sha1()
sha1.update('how to use sha1 in '.encode('utf-8'))   # update() requires bytes in Python 3
sha1.update('python hashlib?'.encode('utf-8'))
print(sha1.hexdigest())

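This module implements a common interface to many different secure hash and message digest algorithms. Included are the FIPS secure hash algorithms SHA1, SHA224, SHA256, SHA384 and SHA512 (defined in FIPS 180-2) as well as RSA's MD5 algorithm (defined in internet RFC 1321). The terms "secure hash" and "message digest" are interchangeable. Older algorithms were called message digests; the modern term is secure hash.
hashlib --- Secure hashes and message digests - Python 3.10.1 documentation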

import hashlib
m = hashlib.sha256()
m.update(b"Nobody inspects")
m.update(b" the spammish repetition")
m.digest()

m.digest_size

m.block_size

os.path.expanduser

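Note: it expands a leading ~ into the user's home directory, so the code does not hard-code one particular user name and keeps working when the project moves to another account. It does not turn a relative path into an absolute one; os.path.abspath does that.

os.path --- Common pathname manipulations - Python 3.10.1 documentation
On Unix and Windows, return the argument with an initial component of ~ or ~user replaced by that user's home directory.

import os
path = os.path.expanduser('~/Project')
path
##/home/username/Project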
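
To make the difference between expanding ~ and making a path absolute concrete, here is a short sketch. The path data/reads.fa is a made-up example and does not need to exist.

```python
import os

# expanduser() only touches a leading "~"
expanded = os.path.expanduser("~/Project")
print(expanded)

# A path without "~" is returned unchanged, even if it is relative
unchanged = os.path.expanduser("data/reads.fa")
print(unchanged)

# abspath() is what resolves a relative path against the current directory
absolute = os.path.abspath("data/reads.fa")
print(absolute)
```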

The math.inf constant

Python math.inf constant - CJavaPy

# Import math Library
import math

# Print positive infinity
print (math.inf)

# Print negative infinity
print (-math.inf)
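
A typical use of math.inf is as the starting value when searching for a minimum. The sketch below uses made-up scores.

```python
import math

scores = [12.5, 3.2, 8.9]      # made-up values

best = math.inf                # every finite number compares smaller than +inf
for s in scores:
    if s < best:
        best = s

print(best)
print(math.isinf(math.inf), -math.inf < 0 < math.inf)
```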

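The difference between array and list in Python

The difference between array and list in Python - Zhihu (zhihu.com)

In practice "array" here usually means a NumPy array, which is very similar to an array in R: homogeneous data that can have several dimensions. (The standard-library array.array is a typed, one-dimensional sequence.)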

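`collections` --- Container datatypes

OrderedDict
A dict subclass that remembers the order in which entries were added
defaultdict
A dict subclass that calls a factory function to supply a default value for missing keys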
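
A minimal sketch of both classes. The contig and read names are invented for the example.

```python
from collections import OrderedDict, defaultdict

# defaultdict(list) creates an empty list the first time a key is looked up
reads_by_contig = defaultdict(list)
for contig, read in [("chr1", "r1"), ("chr2", "r2"), ("chr1", "r3")]:
    reads_by_contig[contig].append(read)
print(dict(reads_by_contig))

# OrderedDict keeps insertion order and can reorder entries explicitly
od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
od.move_to_end("a")
print(list(od))
```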


Counter

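Somewhat like table() in R.

A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts may be any integer value, including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

# Tally occurrences of words in a list
from collections import Counter
cnt = Counter()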
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)
## Counter({'blue': 3, 'red': 2, 'green': 1})


# Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)

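`logging` --- Logging facility for Python

Loggers have the following attributes and methods. Note that loggers should never be instantiated directly, but always through the module-level function logging.getLogger(name). Multiple calls to getLogger() with the same name always return a reference to the same Logger object.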

import logging
logger = logging.getLogger(__name__)
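
A short sketch showing that two getLogger() calls with the same name return the same object. The logger name "demo.aligner" and the message are made up for the example.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(name)s %(levelname)s %(message)s")

logger = logging.getLogger("demo.aligner")        # hypothetical logger name
same_logger = logging.getLogger("demo.aligner")

print(logger is same_logger)
logger.info("loaded %d sequences", 3)
```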

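`functools` --- Higher-order functions and operations on callable objects

@functools.cache(user_function)

from functools import cache   # available since Python 3.9

@cache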
def factorial(n):
    return n * factorial(n-1) if n else 1

>>> factorial(10)      # no previously cached result, makes 11 recursive calls
3628800
>>> factorial(5)       # just looks up cached value result
120
>>> factorial(12)      # makes two new recursive calls, the other 10 are cached
479001600
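
The comments about cached calls can be checked with cache_info(). This sketch uses functools.lru_cache(maxsize=None), which behaves like @cache and also exists on Python versions before 3.9.

```python
from functools import lru_cache

@lru_cache(maxsize=None)       # unbounded cache
def factorial(n):
    return n * factorial(n - 1) if n else 1

factorial(10)                              # 11 calls (n = 10 .. 0), all misses
info_first = factorial.cache_info()

factorial(5)                               # answered from the cache
info_second = factorial.cache_info()

print(info_first)
print(info_second)
```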

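`pickle` --- Python object serialization

Much like saving objects in R.

To serialize an object hierarchy, simply call the dumps() function. Similarly, to deserialize a data stream, call the loads() function. If you want more control over serialization and deserialization, you can create a Pickler or an Unpickler object, respectively.
Python data storage: using the pickle module - coffee_cream's blog, CSDN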
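
A minimal round trip with dumps() and loads(). The record is made-up sample data.

```python
import pickle

record = {"sample": "S1", "counts": [3, 0, 7], "meta": {"paired": True}}

blob = pickle.dumps(record)          # serialize to bytes
restored = pickle.loads(blob)        # rebuild an equal but independent object

print(type(blob).__name__, restored == record, restored is record)
```

Only unpickle data from a source you trust, because loading a pickle can execute arbitrary code.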

tqdm

Adds a progress bar
tqdm/tqdm: A Fast, Extensible Progress Bar for Python and CLI (github.com)

seq 9999999 | tqdm --bytes | wc -l

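The glob() function

glob is a file-handling module that ships with Python. It finds files matching a pattern, much like file search on Windows, and supports three wildcards: *, ? and []. * matches zero or more characters, ? matches a single character, and [] matches any character in the given range, e.g. [0-9] matches a digit. Its two main functions are described below.
What Python's glob() function does and how to use it - xjp_xujiping's blog, CSDN

The glob method

In effect it is a simple form of pattern matching (shell-style wildcards rather than full regular expressions).

The main function of the glob module is glob(), which returns a list of all matching file paths. It takes one argument, the pattern string, which may be an absolute or a relative path. By default the result only contains entries in the directory named by the pattern, not files in subfolders; to descend into subdirectories, use ** together with recursive=True.

glob.glob(r'c:*.txt')

# root and basename come from the surrounding project code
filenames = glob.glob(os.path.join(root, "**", basename), recursive=True)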

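The iglob method:

Returns an iterator that yields the matching paths one at a time. The difference from glob.glob() is that glob.glob() collects all matches at once, whereas glob.iglob() produces one match at a time.

import glob

f = glob.iglob(r'../*.py')
print(f)
# prints a generator object, e.g. <generator object ... at 0x00B9FF80> (the exact repr depends on the Python version)

for py in f:
    print(py)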
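
The sketch below builds a small throwaway directory tree so that the difference between a plain pattern and a recursive ** pattern can be seen without touching real files. The file names are made up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: root/a.txt, root/b.py, root/sub/c.txt
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", "b.py", os.path.join("sub", "c.txt")):
        open(os.path.join(root, rel), "w").close()

    # Plain pattern: only the top-level directory is searched
    top_only = sorted(glob.glob(os.path.join(root, "*.txt")))
    # "**" with recursive=True also descends into subdirectories
    recursive = sorted(glob.glob(os.path.join(root, "**", "*.txt"), recursive=True))

    top_names = [os.path.relpath(p, root) for p in top_only]
    all_names = [os.path.relpath(p, root) for p in recursive]

print(top_names)
print(all_names)
```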

pytest

A tool for quickly testing Python scripts; it reports clearly whether each test passed or failed.

Quick start - learning-pytest 1.0 documentation

Python Pytest tutorial | geek-docs.com
Pytest user manual - learning-pytest 1.0 documentation
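
As a sketch of what a pytest test file looks like: pytest collects files named test_*.py and runs the functions in them whose names start with test_, treating a failed assert as a test failure. The revcomp helper and the file name are hypothetical and exist only for this example.

```python
# contents of a hypothetical test_revcomp.py

def revcomp(seq):
    """Reverse complement of a DNA string (illustrative helper)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))

def test_revcomp_simple():
    assert revcomp("ACGT") == "ACGT"      # this sequence is its own reverse complement
    assert revcomp("AAGC") == "GCTT"

def test_revcomp_empty():
    assert revcomp("") == ""
```

Running `pytest` in the directory containing this file would execute both test functions.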

pyro: a probabilistic programming language (PPL) built on PyTorch

pyro-ppl/pyro: Deep universal probabilistic programming with Python and PyTorch (github.com)

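References

Pyro Documentation - Pyro documentation
Introduction to inference in Pyro - Pyro Tutorials (Chinese translation of the official Pyro tutorials)

Besides pyro.condition, which incorporates observed data, Pyro also contains pyro.do, an implementation of Pearl's do-operator for causal inference, with an interface identical to pyro.condition. condition and do can be mixed and composed freely, making Pyro a powerful tool for model-based causal inference.

Pyro: from getting started to heading out the door - Zhihu (zhihu.com)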

isinstance

Python isinstance() function | Runoob tutorial (runoob.com)

assert isinstance(counts, dict)
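
A short sketch of how isinstance() behaves with subclasses and with a tuple of types. The Alignment classes are hypothetical.

```python
class Alignment:
    pass

class SplicedAlignment(Alignment):
    pass

hit = SplicedAlignment()
counts = {"chr1": 3}

print(isinstance(counts, dict))         # a dict is a dict
print(isinstance(hit, Alignment))       # instances of a subclass also match
print(type(hit) is Alignment)           # an exact type comparison does not
print(isinstance(3.5, (int, float)))    # a tuple means "any of these types"
```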

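`typing` --- Support for type hints

What the typing module is for:
Type checking: static checkers and IDEs can flag argument and return types that do not match before the code runs.
Documentation: the hints tell callers which types to pass in and what comes back.
Adding hints does not change how the program runs. Python itself raises no error on a mismatch; the tools only warn.

Python module: typing - 1024sou, a search engine for programmers (1024sou.com)

typing: Python's library for type annotations - lynskylate - cnblogs (cnblogs.com)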
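
A minimal sketch of annotated code. count_bases is a made-up function; the second call shows that the interpreter does not enforce the hints at run time.

```python
from typing import Dict, List

def count_bases(seqs: List[str]) -> Dict[str, int]:
    """Count how often each character occurs across all sequences."""
    counts: Dict[str, int] = {}
    for seq in seqs:
        for base in seq:
            counts[base] = counts.get(base, 0) + 1
    return counts

print(count_bases(["ACG", "AAT"]))

# A tuple is accepted at run time even though List[str] was declared
print(count_bases(("GG",)))
```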

assert (assertions)

>>> assert True     # condition is true: execution continues
>>> assert False    # condition is false: an exception is raised
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert 1==1    # condition is true: execution continues
>>> assert 1==2    # condition is false: an exception is raised
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

>>> assert 1==2, '1 is not equal to 2'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: 1 is not equal to 2
>>>

import sys
assert ('linux' in sys.platform), "This code only runs on Linux"

Exception handling

Python3 errors and exceptions | Runoob tutorial (runoob.com)

import sys

try:
    f = open('myfile.txt')
    s = f.readline()
    i = int(s.strip())
except OSError as err:
    print("OS error: {0}".format(err))
except ValueError:
    print("Could not convert data to an integer.")
except:
    print("Unexpected error:", sys.exc_info()[0])
    raise

for arg in sys.argv[1:]:
    try:
        f = open(arg, 'r')
    except IOError:
        print('cannot open', arg)
    else:
        print(arg, 'has', len(f.readlines()), 'lines')
        f.close()
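
The two snippets above can be combined into one function that uses except, else and finally together. This is a sketch; read_first_int is a name invented for the example.

```python
def read_first_int(path):
    """Return the integer on the first line of path, or None if that fails."""
    try:
        f = open(path)
    except OSError as err:
        print("cannot open", path, "-", err.__class__.__name__)
        return None
    else:
        # Runs only when open() succeeded
        try:
            return int(f.readline().strip())
        except ValueError:
            print("first line is not an integer")
            return None
        finally:
            f.close()          # runs on every path out of the inner try

print(read_first_int("no_such_file_here.txt"))
```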

@property

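In effect it exposes a method as an attribute and lets the class control access to it, so that other code cannot freely overwrite the class's internal values.

[Python tutorial] What is @property? Use cases and usage | Max行銷誌 (maxlist.xyz)

Feature 1: turn a class method into a read-only attribute

class Bank_acount:
    @property
    def password(self):
        return 'Password: 123'

First instantiate the class: andy = Bank_acount(). print(andy.password) returns Password: 123. Trying to assign to andy.password raises AttributeError: can't set attribute (newer Python versions word the message differently). This is the read-only behaviour of a property.

andy = Bank_acount()

print(andy.password)
>>> Password: 123

andy.password = 'Password: 456'
>>> AttributeError: can't set attribute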

If it is read-only, how can it be modified?

Feature 2 shows the property's setter, getter and deleter methods.

Property feature 2:

class Bank_acount:
    def __init__(self):
        self._password = 'default password 0000'

    @property
    def password(self):
        return self._password

    @password.setter
    def password(self, value):
        self._password = value

    @password.deleter
    def password(self):
        del self._password
        print('del complete')

getter

andy = Bank_acount()
print(andy.password)
>>> default password 0000

setter

andy.password = '1234'
print(andy.password)
>>> 1234

deleter

del andy.password
>>> del complete
print(andy.password)
>>> AttributeError: 'Bank_acount' object has no attribute '_password'

Why is @property needed?

@property is one way to implement encapsulation in object-oriented design.

Using @property - Liao Xuefeng's official website (liaoxuefeng.com)

class Student(object):

    def get_score(self):
        return self._score

    def set_score(self, value):
        if not isinstance(value, int):
            raise ValueError('score must be an integer!')
        if value < 0 or value > 100:
            raise ValueError('score must between 0 ~ 100!')
        self._score = value

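Is there a way to validate the argument and still access the class's variable as simply as an attribute? For a Python programmer who strives for perfection, this is a must!
Remember that decorators can add functionality to functions dynamically? They work on class methods as well. Python's built-in @property decorator turns a method into an attribute access.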

class Student(object):

    @property
    def score(self):
        return self._score

    @score.setter
    def score(self, value):
        if not isinstance(value, int):
            raise ValueError('score must be an integer!')
        if value < 0 or value > 100:
            raise ValueError('score must between 0 ~ 100!')
        self._score = value

>>> s = Student()
>>> s.score = 60 # OK, actually calls the score setter (the equivalent of s.set_score(60))
>>> s.score # OK, actually calls the score getter (the equivalent of s.get_score())
60
>>> s.score = 9999
Traceback (most recent call last):
  ...
ValueError: score must between 0 ~ 100!

This article explains it most clearly:
Usage and meaning of Python @property - 昨天丶今天丶明天的's blog, CSDN

base64

Base64 is a way of representing arbitrary binary data using 64 characters.
The length of standard (padded) Base64 output is always a multiple of 4.
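
A short sketch with the standard-library base64 module that checks the multiple-of-4 property. The input bytes are arbitrary.

```python
import base64

raw = b"ACGT\x00\xff"                 # arbitrary bytes, including non-printable ones
encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

print(encoded, len(encoded) % 4 == 0, decoded == raw)

# Inputs of 1, 2 and 3 bytes all encode to 4 characters; "=" pads the short ones
short = {chunk: base64.b64encode(chunk) for chunk in (b"a", b"ab", b"abc")}
print(short)
```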

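`shutil` --- High-level file operations

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'junxi'

import shutil

# Copy the contents of one file object into another
with open('old.txt', 'r') as src, open('new.txt', 'w') as dst:
    shutil.copyfileobj(src, dst)

# Copy a file (contents only)
shutil.copyfile('old.txt', 'old1.txt')

# Copy the permission bits only; contents, group and owner are left unchanged
shutil.copymode('old.txt', 'old1.txt')

# Copy permission bits, last access time and last modification time
shutil.copystat('old.txt', 'old1.txt')

# Copy a file to another file or into a directory
shutil.copy('old.txt', 'old2.txt')

# Like copy(), but also preserves the last access and modification times
shutil.copy2('old.txt', 'old2.txt')

# Copy the tree olddir to newdir. If the third argument (symlinks) is True, symbolic links are kept as links; if it is False, the linked files are copied as real files
shutil.copytree('C:/Users/xiaoxinsoso/Desktop/aaa', 'C:/Users/xiaoxinsoso/Desktop/bbb')

# Move a directory or file
shutil.move('C:/Users/xiaoxinsoso/Desktop/aaa', 'C:/Users/xiaoxinsoso/Desktop/bbb') # moves directory aaa into directory bbb

# Delete a directory tree
shutil.rmtree('C:/Users/xiaoxinsoso/Desktop/bbb') # deletes directory bbb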

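`subprocess` --- Subprocess management

Python subprocess module usage examples explained - Tencent Cloud community (tencent.com)

>>> import subprocess
# To have Python pass the arguments itself, give the command as a list with one item per argument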
>>> subprocess.run(["df","-h"])
Filesystem      Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-LogVol00
           289G  70G 204G 26% /
tmpfs         64G   0  64G  0% /dev/shm
/dev/sda1       283M  27M 241M 11% /boot
CompletedProcess(args=['df', '-h'], returncode=0)
# To let the Linux shell parse the command, pass a command string and shell=True
>>> subprocess.run("df -h|grep /dev/sda1",shell=True)
/dev/sda1       283M  27M 241M 11% /boot
CompletedProcess(args='df -h|grep /dev/sda1', returncode=0)
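
In scripts the output usually needs to be captured rather than printed. A minimal sketch, assuming Python 3.7+ for capture_output and text; the child command simply runs the current interpreter so the example does not depend on df or grep being available.

```python
import subprocess
import sys

# sys.executable is the running Python interpreter
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)

print(result.returncode, repr(result.stdout))
```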

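Why is there an arrow after a Python function name?

7.3 Attaching informational metadata to function arguments - python3-cookbook 3.0.0 documentation

The arrow only annotates the expected type of the return value, just as the annotations on the parameters describe the inputs.
It is there to make the code easier for programmers to read.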
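
A minimal sketch of such a function. The annotations are stored on the function object in __annotations__, where tools and readers can look them up.

```python
def add(x: int, y: int) -> int:
    """Return the sum of x and y."""
    return x + y

print(add(2, 3))
print(add.__annotations__)
```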
