问题
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
例子
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
分析
首先要明确,一个长度为n的字符串有且仅有n-9个长度为10的子字符串。
定义两个unordered_set: allSet, findSet. allSet用来保存所有遍历过的长度为10的子字符串,
findSet用来保存已经找到的出现次数不小于2次的长度为10的子字符串。findSet是用来防止结果列表里出现重复的子字符串。
要点
- 防止检测到重复的子字符串;
- vector的构造函数可以传入unordered_set的首尾迭代器(迭代器实在是太强大了!)。
时间复杂度
O(n)
空间复杂度
O(n)
代码
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
if (s.length() < 11) return vector<string>();
unordered_set<string> allSet, findSet;
for (int i = 0; i <= s.length() - 10; i++) {
string substring = s.substr(i, 10);
if (allSet.find(substring) != allSet.end() &&
findSet.find(substring) == findSet.end())
findSet.insert(substring);
else
allSet.insert(substring);
}
return vector<string>(findSet.begin(), findSet.end());
}
};