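Monica Rogati, a data scientist at LinkedIn, lists ten common mistakes to avoid when mining data.
- Assuming the data is clean. Data cleaning usually takes up most of the working time, and even simple cleaning steps often reveal important patterns. Ask questions such as “Is an instrumentation bug causing 30% of the values to be NULL?” or “Are there really that many customers in the 90210 zip code?” Check the data as soon as you receive it to make sure it is valid and useful.
- Not normalizing. Suppose you are compiling a list of popular wedding destinations. You could count the people flying to each place for a wedding, but unless you also consider the total number of travellers going there, your list will merely be a list of cities with busy airports.
- Excluding outliers. Suppose 21 people use your product a thousand times a day. They may be your biggest fans, or they may just be bots crawling your site. Whoever they are, you should not casually discard them.
- Including outliers. Those 21 people using your product 1,000 times a day are interesting in one respect, because they can show you things you did not expect. But they are a poor basis for a general model, so exclude them when building certain features. Otherwise your recommendation feature may show everyone the same items: the ones your hardcore fans wanted.
- Ignoring seasonality. You look at the data and marvel that “intern” is the fastest-growing job title of the year, then realize it is June. Ignoring time of day, day of week, and month when looking for patterns leads to bad decisions.
- Ignoring size when reporting growth. Context matters: when you have only just started, your dad signing up is enough to double your user base.
- Data vomit. A dashboard is of little use if you do not know what to look at.
- Metrics that cry wolf. You set up many alerts so that you can fix problems the moment they appear, but if the thresholds are too sensitive, the alerts become like the boy who cried wolf and you gradually start ignoring them.
- The “Not Collected Here” syndrome. Combining your data with data from other sources can produce valuable insights, for example “Do your best customers come from areas with many sushi restaurants?” Questions like this can suggest good next steps and may even influence your growth strategy.
- Focusing on noise. We humans find patterns even where there are none. Set aside vanity metrics, step back, and look at the bigger picture.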
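The normalization point is easy to see with numbers. The sketch below is only an illustration: the city names and traveller counts are invented, not data from the book, and the two ranking helpers are hypothetical names. It ranks the same cities once by raw wedding-traveller count and once by wedding travellers as a share of all arrivals.

```python
# Hypothetical figures: (wedding travellers, all air travellers) per city.
arrivals = {
    "Big Hub": (5_000, 10_000_000),
    "Island Resort": (2_000, 200_000),
    "Mid City": (1_000, 1_000_000),
}


def rank_by_raw_count(data):
    """Rank cities by the absolute number of wedding travellers."""
    return sorted(data, key=lambda city: data[city][0], reverse=True)


def rank_by_share(data):
    """Rank cities by wedding travellers as a share of all travellers."""
    return sorted(
        data,
        key=lambda city: data[city][0] / data[city][1],
        reverse=True,
    )


raw_ranking = rank_by_raw_count(arrivals)
normalized_ranking = rank_by_share(arrivals)

print("By raw count:", raw_ranking)
print("By share:    ", normalized_ranking)
```

With these made-up numbers the busiest airport tops the raw ranking but falls to last place once the counts are divided by total traffic, which is the distortion the list item warns about.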
How to Think Like a Data Scientist
Monica Rogati, a data scientist at LinkedIn, gave us the following 10 common pitfalls that entrepreneurs should avoid as they dig into the data their startups capture.
- Assuming the data is clean. Cleaning the data you capture is often most of the work, and the simple act of cleaning it up can often reveal important patterns. “Is an instrumentation bug causing 30% of your numbers to be null?” asks Monica. “Do you really have that many users in the 90210 zip code?” Check your data at the door to be sure it’s valid and useful.
- Not normalizing. Let’s say you’re making a list of popular wedding destinations. You could count the number of people flying in for a wedding, but unless you consider the total number of air travellers coming to that city as well, you’ll just get a list of cities with busy airports.
- Excluding outliers. Those 21 people using your product more than a thousand times a day are either your biggest fans, or bots crawling your site for content. Whichever they are, ignoring them would be a mistake.
- Including outliers. While those 21 people using your product a thousand times a day are interesting from a qualitative perspective, because they can show you things you didn’t expect, they’re not good for building a general model. “You probably want to exclude them when building data products,” cautions Monica. “Otherwise, the ‘you may also like’ feature on your site will have the same items everywhere—the ones your hardcore fans wanted.”
- Ignoring seasonality. “Whoa, is ‘intern’ the fastest-growing job of the year? Oh, wait, it’s June.” Failure to consider time of day, day of week, and monthly changes when looking at patterns leads to bad decision making.
- Ignoring size when reporting growth. Context is critical. Or, as Monica puts it, “When you’ve just started, technically, your dad signing up does count as doubling your user base.”
- Data vomit. A dashboard isn’t much use if you don’t know where to look.
- Metrics that cry wolf. You want to be responsive, so you set up alerts to let you know when something is awry in order to fix it quickly. But if your thresholds are too sensitive, they get “whiny”— and you’ll start to ignore them.
- The “Not Collected Here” syndrome. “Mashing up your data with data from other sources can lead to valuable insights,” says Monica. “Do your best customers come from zip codes with a high concentration of sushi restaurants?” This might give you a few great ideas about what experiments to run next, or even influence your growth strategy.
- Focusing on noise. “We’re hardwired (and then programmed) to see patterns where there are none,” Monica warns. “It helps to set aside the vanity metrics, step back, and look at the bigger picture.”
Excerpted from Lean Analytics by Alistair Croll and Benjamin Yoskovitz.
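Two of the pitfalls above, ignoring seasonality and ignoring size when reporting growth, come down to which baseline a growth figure is compared against. The following is a minimal sketch under stated assumptions: the monthly counts and years are invented for illustration, and `growth` and `describe_change` are hypothetical helper names.

```python
# Hypothetical counts of new "intern" job titles, keyed by (year, month).
counts = {
    (2022, 5): 200,
    (2022, 6): 1000,
    (2023, 5): 220,
    (2023, 6): 1100,
}


def growth(previous, current):
    """Relative change from previous to current, e.g. 0.10 for +10%."""
    if previous == 0:
        raise ValueError("relative growth is undefined for a zero baseline")
    return (current - previous) / previous


def describe_change(previous, current):
    """Return (absolute change, relative change) so size is never hidden."""
    return current - previous, growth(previous, current)


# Comparing June with May mixes real growth with the seasonal spike.
month_over_month = growth(counts[(2023, 5)], counts[(2023, 6)])

# Comparing June with the previous June holds the season fixed.
year_over_year = growth(counts[(2022, 6)], counts[(2023, 6)])

print(f"June vs May:       {month_over_month:+.0%}")
print(f"June vs last June: {year_over_year:+.0%}")

# One extra signup on a base of one user is +100%, but only +1 user.
print(describe_change(1, 2))
```

In this invented data the month-over-month figure is dominated by the June spike, while the year-over-year figure compares like with like. Returning the absolute change next to the percentage is one simple way to keep a tiny base from making growth look dramatic.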