The Lab (2)

Discover the Lab

In the previous Basics course, and thus far in this course, all of your work creating a data pipeline has been done in the Flow. Now that you have been introduced to the Lab, let’s test out how it works.

  • In this tutorial, you are going to see how you can perform preliminary work on data outside the Flow in a dedicated environment called the Lab.
  • More specifically, you will create a Visual analysis in the Lab.
  • Of the three main tabs in a Visual analysis (Script, Charts, and Models), we’ll cover the first two and leave the Models tab for a future course.

Returning to your Basics 103 project, we want to create a visual analysis for the customers_orders_joined dataset.

  • Select or open the customers_orders_joined dataset.
  • From the Actions sidebar, click the blue Lab button.
  • You can also click the last icon on the right sidebar to go directly there.
  • Create a New analysis.

You will be prompted to specify a name for your analysis. Leave the default name Analyze customers_orders_joined.

Interactively Prepare Your Data

The screen looks much like a Prepare recipe, but there are important distinctions, which are discussed in the concept video.

Hint
Screencasts at the end of sections mirror the actions described.

First, let’s parse the birthdate column.

  • Open the column dropdown and select Parse date.
  • Accept the “yyyy/MM/dd” date format.
  • Clear the output column in the Script step so that the parsed date replaces the original birthdate column.

With a customer’s birth date, and the date on which they made their first order, we can compute their age on the date of their first order.

  • From the birthdate column dropdown, choose Compute time since.
  • Choose “until” to be Another date column.
  • Choose first_order_date to be the “Other column”.
  • Change “Output time unit” to Years.
  • Then edit the Output column name to age_first_order.
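The date parsing and age computation above can be sketched in pandas (a hypothetical stand-in for the visual steps, not what DSS runs internally; the column names match the tutorial, but the sample rows are invented):

```python
import pandas as pd

# Invented sample rows with the same column names as the tutorial dataset.
df = pd.DataFrame({
    "birthdate": ["1980/03/15", "1995/07/01"],
    "first_order_date": ["2017/01/10", "2017/06/20"],
})

# Parse both columns with the yyyy/MM/dd pattern (%Y/%m/%d in Python).
df["birthdate"] = pd.to_datetime(df["birthdate"], format="%Y/%m/%d")
df["first_order_date"] = pd.to_datetime(df["first_order_date"], format="%Y/%m/%d")

# Time since birthdate until first_order_date, expressed in years.
df["age_first_order"] = (df["first_order_date"] - df["birthdate"]).dt.days / 365.25
```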

From the new column age_first_order header dropdown, select Analyze in order to see if the distribution of ages looks okay. As it turns out, there are a number of outliers with ages well over 120. These are indicative of bad data.

  • Within the Analyze dialog, click the Actions button.
  • Choose to clear values outside 1.5 IQR (interquartile range). This will set those values to missing.

Now the distribution looks more reasonable, but there are still a few suspicious values over 100. We can remove these by setting an upper bound limit. Close the Analyze dialog.

  • In the Script, click the new step, Clear values outside [x,y] in age_first_order, to expand it.
  • Set the upper bound to 100.
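The two cleaning steps above amount to the following logic, sketched here in pandas on an invented series (DSS applies the same idea through the Analyze dialog and the script step):

```python
import pandas as pd

# Illustrative ages, including two implausible outliers.
ages = pd.Series([25.0, 32.0, 41.0, 38.0, 29.0, 35.0, 150.0, 480.0])

# The 1.5 IQR fence: values outside it are set to missing,
# as the Analyze dialog action does.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
ages = ages.mask((ages < lower) | (ages > upper))

# The additional fixed upper bound of 100 from the script step.
ages = ages.mask(ages > 100)
```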

Lastly, now that we’ve computed age_first_order, we won’t need birthdate or first_order_date anymore, so let’s remove them from the script.

  • For both columns, open the column dropdown and select Delete.
  • This adds a Remove step to the script.

The following video goes through what we just covered.

Leveraging the User Agent

Let’s now enrich the data by processing the user_agent and ip_address columns.

The user_agent column contains information about the browser and operating system, and we want to pull this information out into separate columns so that it’s possible to use it in further analyses.

  • Note that Dataiku DSS has inferred the meaning of the user_agent column to be User-Agent.
  • Accordingly, its column dropdown is able to suggest specific actions. Select Classify User-Agent.
  • This adds a new step to the script and seven new columns to the dataset.
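To see what such a classification does, here is a rough sketch using simple substring checks. This is only an illustration of the idea; DSS uses a proper user-agent parser, and the function name below is hypothetical:

```python
def classify_user_agent(ua: str) -> dict:
    """Naive browser/OS extraction from a user-agent string (illustrative only)."""
    ua_low = ua.lower()
    # Order matters: Edge UAs also contain "chrome", and Chrome UAs contain "safari".
    if "edge" in ua_low:
        brand = "Edge"
    elif "chrome" in ua_low:
        brand = "Chrome"
    elif "safari" in ua_low:
        brand = "Safari"
    elif "firefox" in ua_low:
        brand = "Firefox"
    else:
        brand = "Other"
    if "windows" in ua_low:
        os_name = "Windows"
    elif "mac os" in ua_low or "macintosh" in ua_low:
        os_name = "MacOS"
    elif "linux" in ua_low:
        os_name = "Linux"
    else:
        os_name = "Other"
    return {"user_agent_brand": brand, "user_agent_os": os_name}

example = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
# classify_user_agent(example)
# -> {"user_agent_brand": "Chrome", "user_agent_os": "Windows"}
```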

For this tutorial, we are only interested in the browser and the operating system, so we will remove the columns we don’t need.

  • To do this quickly, change from the Table view to the Columns view using the icon near the top right of the screen.
  • Select all of the columns beginning with user_agent_, except user_agent_brand and user_agent_os.
  • Click the Actions button and select Delete.

The following video goes through what we just covered.

Leveraging the IP Address

Dataiku DSS has inferred the meaning of the ip_address column to be an IP address. Just like with user_agent, we’ll have meaning-specific actions in the column dropdown.

  • Open the column header dropdown for the ip_address column.
  • Select Resolve GeoIP.

This adds a new step to the script and seven new columns to the dataset that tell us about the general geographic location of each IP address.

For this tutorial, we are only interested in the country and GeoPoint (approximate longitude and latitude of the IP address).

  • In the Script step, deselect Extract country code, Extract region, and Extract city.
  • Finally, delete the original ip_address column.

The following video goes through what we just covered.

Using Formulas

In Basics 102, we saw how to use Formulas in a Prepare recipe. We can use the same Formulas here.

Let’s create a new column to act as a label on the customers generating a lot of revenue. We’ll consider customers with a value of total orders in excess of 300 as “high revenue” customers.

  • Click +Add a New Step in the script and select Formula.
  • Type high_revenue as the output column name.
  • Click the Edit button to open the expression editor.
  • Type if(total_sum > 300, "True", "False") as the expression.
  • Hit Save.

Note
The syntax for the Formula Processor can be found in the reference documentation.
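The formula if(total_sum > 300, "True", "False") is equivalent to the following pandas expression (a sketch with invented sample values, not the recipe itself):

```python
import pandas as pd

# Invented total_sum values; 300 itself is not strictly greater than 300.
df = pd.DataFrame({"total_sum": [120.0, 450.0, 300.0]})

# Same logic as the DSS formula: label customers above 300 as "True".
df["high_revenue"] = df["total_sum"].gt(300).map({True: "True", False: "False"})
```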

Visualize Your Data with Charts

Now let’s move from the Script tab to the Charts tab. A screencast below will recap all of the actions taken in this section.

Popular User Agents

Since we extracted the browsers used by customers from user_agent, it’s natural to want to know which browsers are most popular. A common way to visualize parts of a whole is with a pie or donut chart.

  • On the Charts tab, click the chart type tool.
  • In the Basics category, select Donut.
  • Click and drag user_agent_brand to the By box, and Count of records to the Show box.

This shows that nearly 3/4 of customers who have placed orders use Chrome. While the donut chart does a nice job of showing the relative share of each browser to the total, we’d also like to include the OS in the visualization.

  • Click the chart type tool again.
  • In the Basics category, select Stacked bar chart instead of the Donut.
  • Click and drag user_agent_os to the And box.
  • Click on user_agent_brand to adjust the sorting from “Natural ordering” to “Count of records, descending”.

Adding OS gives us further insight into the data. As expected, IE and Edge are only available on Windows, and Safari is only on MacOS. What is enlightening is that there are approximately double the number of customers using Chrome on MacOS as Safari and Firefox combined. There is a similar relationship between use of Chrome versus Firefox on Linux.
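The counts behind both charts are simple aggregations. As a sketch, the same numbers could be computed in pandas on a few invented rows:

```python
import pandas as pd

# Invented rows standing in for the prepared customer data.
df = pd.DataFrame({
    "user_agent_brand": ["Chrome", "Chrome", "Firefox", "Chrome", "Safari"],
    "user_agent_os":    ["Windows", "MacOS", "Linux", "Windows", "MacOS"],
})

# Donut chart: count of records by browser, sorted descending.
by_brand = df["user_agent_brand"].value_counts()

# Stacked bar chart: count of records by browser and OS.
by_brand_os = df.groupby(["user_agent_brand", "user_agent_os"]).size()
```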

Sales by Age and Campaign

There are a number of insights we can glean from the combined Haiku T-shirts data that we couldn’t from the individual datasets. For a start, let’s see if there is a relationship between a customer’s age, whether that customer is part of a Haiku T-Shirt campaign, and how much they spend.

Click +Chart at the bottom center of the screen.

  • From the chart type tool, go to the Scatters category and select the Scatter Plot chart.
  • Select age_first_order as the X axis, and total_sum as the Y axis.
  • Select campaign as the column to color bubbles in the plot.
  • Select count as the column to set the size of bubbles.
  • From the size dropdown to the left of the count field, change the Base radius from 5 to 1 to reduce overlapping bubbles.

The scatterplot shows that older customers, and those who are part of the campaign, tend to spend the most. The bubble sizes show that some of the moderately valued customers are those who have made a lot of small purchases, while others have made a few larger purchases.

Sales by Geography

Since we extracted locations from ip_address, it’s also natural to want to know where Haiku T-Shirt’s customers come from. We can visualize this with a map.

  • Click +Chart to make a third chart.
  • Select the Maps category and the Scatter Map plot.
  • Select ip_address_geopoint as the Geo field.
  • Select campaign as the column to color bubbles by.
  • Select total_sum as the column to set the size of bubbles.
  • From the size dropdown, change the base radius from 5 to 2 to reduce overlapping bubbles.

This looks much better, and you can quickly get a feel for which customers are located where. If we then want to focus on the largest sales:

  • Drag total_sum to the Filters box.
  • Click the number for the lower bound to edit it, and type 300 as the lower bound. This filters all customers who have spent less than 300 from the map.

The following video goes through what we just covered.

Deploy work in the Lab to the Flow

While you work on charts in a visual analysis, recall that these are built on a sample of your data. You can change the sample in the Sampling and Engine tab in the left panel, but since Dataiku DSS has to re-apply the latest preparation each time, it will not be very efficient for very large datasets.

Moreover, if you want to share these charts with your team on a dashboard, you will first need to deploy your script.

  • From any tab in the visual analysis, go to the top right corner of the screen and click on Deploy Script. A dialog appears to deploy the script as a Prepare recipe.
  • Note that, by default, charts created in the Lab will be carried over to the new dataset so that you can view them on the whole output data, rather than a sample.
  • Rename the output dataset customers_labelled.
  • Click Deploy to create the recipe.
  • Save the recipe and go to the Flow.

As suggested by the white square instead of the normal blue one, we haven’t yet actually built the dataset. Only the instructions for doing so have been added to the Flow.

  • Open the dataset and see that it is empty. This is because we have not yet run the recipe to build the full output dataset.
  • Click Build.

This opens a dialog that asks whether you want to build just this dataset (non-recursive) or reconstruct datasets leading to this dataset (recursive). Since the input dataset is up-to-date, a non-recursive build is sufficient.

  • Click Build Dataset (leaving non-recursive selected).

While the job executes, you are taken to the detailed activity log.

  • When the job completes, click Output dataset to view a sample of the output dataset.

Let’s configure the stacked bar chart to use the entire dataset.

  • Go to the Charts tab of the customers_labelled dataset.
  • Click Sampling & Engine from the left panel.
  • Uncheck Use same sample as explore.
  • Select No sampling (whole data) as the sampling method.
  • Click Save and Refresh Sample.

The following video goes through what we just covered.

What’s next?

Congratulations! You successfully deployed a visual analysis script from the Lab to the Flow. Be sure to review the concept materials for greater discussion on the differences between these two.

Now that the orders and customers datasets are joined, cleaned, and prepared, you would be ready to build a model to predict customer value. This is a task for a future course!
