Higher Dimensionality 2

Andrews Curves
An Andrews plot, also known as an Andrews curve, helps you visualize higher-dimensional, multivariate data by plotting each of your dataset's observations as a curve. The feature values of the observation act as the coefficients of the curve, so observations with similar characteristics tend to group closer together. Because of this, Andrews curves have some use in outlier detection.
Just as with Parallel Coordinates, every plotted feature must be numeric, since the curve is essentially the dot product of the observation's feature vector and the vector (1/sqrt(2), sin(t), cos(t), sin(2t), cos(2t), sin(3t), cos(3t), ...), which creates a Fourier series.
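To make that concrete, here is a minimal sketch of how one observation's feature vector becomes a curve; the andrews_curve helper below is hypothetical, written purely for illustration:

import numpy as np

def andrews_curve(x, t):
    # Hypothetical helper, for illustration only.
    # Start with the constant term: x1 / sqrt(2)
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    # Each remaining feature multiplies the next sin/cos harmonic
    for i, coef in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic number: 1, 1, 2, 2, 3, 3, ...
        curve += coef * (np.sin(k * t) if i % 2 == 1 else np.cos(k * t))
    return curve

t = np.linspace(-np.pi, np.pi, 200)
y = andrews_curve(np.array([5.1, 3.5, 1.4, 0.2]), t)  # one Iris observation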


(Figure: Andrews curves)

The Pandas implementation requires that you once again specify a GroupBy feature, which is then used to color-code the curves as well as produce a chart legend:

from sklearn.datasets import load_iris
from pandas.plotting import andrews_curves

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead

# Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target_names'] = [data.target_names[i] for i in data.target]

# Andrews Curves Start Here:
plt.figure()
andrews_curves(df, 'target_names')
plt.show()

One of the current weaknesses of the Pandas implementation (and this goes for Parallel Coordinates as well) is that every single observation is charted. In the MATLAB version, you can specify a quantile or probability distribution cutoff, so that only the mean feature values for a specific group are plotted, with a transparent boundary around the cutoffs. If you feel up to the challenge, a straightforward bonus assignment is to take the existing Pandas Andrews curves implementation and extend it with this functionality.

(Figure: Desired MATLAB Andrews implementation)

Imshow

One last higher-dimensionality visualization technique you should know how to use is MatPlotLib's .imshow() method. This command generates an image based on the normalized values stored in a matrix, or rectangular array of float64s. The properties of the generated image depend on the dimensions and contents of the array passed in (a sketch follows the list below):
An [X, Y] shaped array results in a grayscale image
An [X, Y, 3] shaped array results in a full-color image: one channel each for red, green, and blue
An [X, Y, 4] shaped array results in a full-color image as before, with an extra channel for alpha
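Here is a minimal sketch of the three cases above, using random arrays just for illustration. One caveat: for a 2D array, MatPlotLib maps values through a colormap, so pass a gray colormap to actually see grayscale:

import numpy as np
import matplotlib.pyplot as plt

gray = np.random.rand(64, 64)       # [X, Y]    -> single-channel image
rgb  = np.random.rand(64, 64, 3)    # [X, Y, 3] -> red, green, blue channels
rgba = np.random.rand(64, 64, 4)    # [X, Y, 4] -> RGB plus an alpha channel

plt.imshow(gray, cmap=plt.cm.gray)  # 2D values are mapped through the colormap
plt.show()

plt.imshow(rgb)                     # 3D float values are read directly as [0, 1] RGB
plt.show()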
Besides being a straightforward way to display .PNG and other images, the .imshow() method has quite a few other use cases. When you call the .corr() method on your dataset, Pandas calculates a correlation matrix for you, which measures how close to linear the relationship between any two features in your dataset is. Correlation values range from -1 to 1, where 1 means the two features are perfectly positively correlated: a linear relationship with a positive slope. A value of -1 means they are perfectly negatively correlated: again linear, but with a negative slope. Values close to 0 mean there is little to no linear relationship between the two variables at all (e.g., pizza sales and plant growth), so the further away from 0 the value is, the stronger the relationship between the features:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> df.corr()

          a         b         c         d         e
a  1.000000  0.007568  0.014746  0.027275 -0.029043
b  0.007568  1.000000 -0.039130 -0.011612  0.082062
c  0.014746 -0.039130  1.000000  0.025330 -0.028471
d  0.027275 -0.011612  0.025330  1.000000 -0.002215
e -0.029043  0.082062 -0.028471 -0.002215  1.000000
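One property worth verifying: the correlation is invariant to linear rescaling of a feature. A quick check, reusing the random df from above:

>>> df_cm = df.copy()
>>> df_cm['a'] *= 2.54                    # pretend 'a' was inches; convert to cm
>>> np.allclose(df.corr(), df_cm.corr())  # the correlation matrix is unchanged
True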

The matrix is symmetric because the correlation between any two features X and Y is, of course, identical to that between Y and X. It is also invariant to scale, as the check above shows: even if one feature is measured in inches and another in centimeters, it makes no difference. This matrix, and others like the covariance matrix, are useful for inspecting how the variance of one feature is explained by the variance of another, and for verifying how much new information each feature provides. But even looking at this little 5x5 matrix makes me dizzy, so you can imagine how easy it is to get lost in a higher dimensionality dataset. You can get around this by visualizing your correlation matrix, plotting it with .imshow():

import matplotlib.pyplot as plt

# Render the correlation matrix as a heat map; 'nearest' avoids pixel smoothing
plt.imshow(df.corr(), cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()

# Label both axes with the feature names
tick_marks = range(len(df.columns))
plt.xticks(tick_marks, df.columns, rotation='vertical')
plt.yticks(tick_marks, df.columns)

plt.show()
(Figure: Correlation matrix rendered with .imshow())

.imshow() can help you any time you have a square matrix you want to visualize. Other matrices you might want to visualize include the covariance matrix and the confusion matrix. In the future, once you learn how to use certain machine learning algorithms that generate clusters living in your feature-space, you'll also be able to use .imshow() to peek into the brain of your algorithms as they run, so long as your features represent a rectangular image!
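For instance, here's a minimal sketch of charting a confusion matrix the same way; the y_true and y_pred labels are hypothetical stand-ins for a real classifier's output:

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

y_true = [0, 0, 1, 1, 2, 2]  # hypothetical ground-truth class labels
y_pred = [0, 1, 1, 1, 2, 0]  # hypothetical predicted class labels

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_true, y_pred)

plt.imshow(cm, cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()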

Dive Deeper
Sound data is the cornerstone of data science; not much can be done without it. Having learned how to collect and manipulate your data, in this module you experimented with numerous visualization techniques for making sure the data you've collected is sound: scatter plots, histograms, and other higher dimensionality methods. You also probably learned more about wheat kernels than you ever wanted to. We hope you've taken scrupulous notes about the best use cases for each of these plotting mechanisms and will be able to apply them on demand as needed!
The time has come for you to start applying real machine learning to your data. If you have some extra time, take a look at the following list of additional resources so that your visualization toolbox has all the tools you need to continue marching forward!

Basic Visualizations
Pandas Visualization with MatPlotLib
Radar Charts
Scatter-Histogram 2-Variable Distribution

Higher Dimensionality
Andrews Plot
Parallel Coordinates on Wikipedia
More on Parallel Coordinates Usage
Parallel Coords with Different Axes in MatPlotLib

Extras
MatPlotLib Markers
MatPlotLib ColorMaps
