187. Repeated DNA Sequences (Medium)

Description:

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]


Solution:

No masking and re-encoding:

NOTE:

  • 没想到竟然不是用DP解
  • 注意输出结果要去重
  • range(l-9) 不是 range(l-10)
from collections import defaultdict
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        l = len(s)
        if l <= 10:
            return []
        
        dic = defaultdict(int)
        ret = []
        for i in range(l-9):
            cache = s[i:i+10]
            if dic[cache] == 1:
                ret.append(cache)
            dic[cache] += 1
        
        return ret

Runtime: 88 ms, faster than 14.75% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 27.5 MB, less than 8.11% of Python3 online submissions for Repeated DNA Sequences.

Remove l<=10 check

from collections import defaultdict
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        l = len(s)
        
        dic = defaultdict(int)
        ret = []
        for i in range(l-9):
            cache = s[i:i+10]
            if dic[cache] == 1:
                ret.append(cache)
            dic[cache] += 1
        
        return ret

Runtime: 72 ms, faster than 53.01% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 27.5 MB, less than 8.11% of Python3 online submissions for Repeated DNA Sequences.

sample 52 ms submission

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        container = set()
        res = set()
        for i in range(len(s) - 9):
            sub = s[i:i+10]
            if sub not in container:
                container.add(sub)
            else:
                res.add(sub)
        return res

CPP solutions using the mask: https://www.cnblogs.com/grandyang/p/4284205.html

Python solution with mask:

NOTE: 每一个碱基用2bit表示,10个一共是20bit,所以mask是0xfffff,5个f

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        if len(s) <= 10:
            return []
        
        to_bit = {"A":0b0,"C":0b1,"G":0b10,"T":0b11}
        cache = 0
        for i in range(9):
            cache = cache << 2 | to_bit[s[i]]
        
        mask = 0xfffff
        dic = set()
        ret = set()
        for i in range(len(s)-9):
            cache = (cache << 2 | to_bit[s[i+9]]) & mask
            if cache in dic:
                ret.add(s[i:i+10])
            else:
                dic.add(cache)
                
        return ret

Runtime: 64 ms, faster than 83.06% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 23.1 MB, less than 100.00% of Python3 online submissions for Repeated DNA Sequences.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容