Description:
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Solution:
No masking and re-encoding:
NOTE:
- 没想到竟然不是用DP解
- 注意输出结果要去重
- range(l-9) 不是 range(l-10)
from collections import defaultdict
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
l = len(s)
if l <= 10:
return []
dic = defaultdict(int)
ret = []
for i in range(l-9):
cache = s[i:i+10]
if dic[cache] == 1:
ret.append(cache)
dic[cache] += 1
return ret
Runtime: 88 ms, faster than 14.75% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 27.5 MB, less than 8.11% of Python3 online submissions for Repeated DNA Sequences.
Remove l<=10
check
from collections import defaultdict
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
l = len(s)
dic = defaultdict(int)
ret = []
for i in range(l-9):
cache = s[i:i+10]
if dic[cache] == 1:
ret.append(cache)
dic[cache] += 1
return ret
Runtime: 72 ms, faster than 53.01% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 27.5 MB, less than 8.11% of Python3 online submissions for Repeated DNA Sequences.
sample 52 ms submission
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
container = set()
res = set()
for i in range(len(s) - 9):
sub = s[i:i+10]
if sub not in container:
container.add(sub)
else:
res.add(sub)
return res
CPP solutions using the mask: https://www.cnblogs.com/grandyang/p/4284205.html
Python solution with mask:
NOTE: 每一个碱基用2bit表示,10个一共是20bit,所以mask是0xfffff,5个f
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
if len(s) <= 10:
return []
to_bit = {"A":0b0,"C":0b1,"G":0b10,"T":0b11}
cache = 0
for i in range(9):
cache = cache << 2 | to_bit[s[i]]
mask = 0xfffff
dic = set()
ret = set()
for i in range(len(s)-9):
cache = (cache << 2 | to_bit[s[i+9]]) & mask
if cache in dic:
ret.add(s[i:i+10])
else:
dic.add(cache)
return ret
Runtime: 64 ms, faster than 83.06% of Python3 online submissions for Repeated DNA Sequences.
Memory Usage: 23.1 MB, less than 100.00% of Python3 online submissions for Repeated DNA Sequences.