Prepare Your Data(2)

Hands-On: Preparing Data

Note
To complete this lesson, you can continue with the project you created in the Basics 101 course.
Alternatively, you can create a new starter project picking up where Basics 101 left off. From the Dataiku DSS homepage select +New Project > DSS Tutorials > Core Designer / Basics > Basics 102.

At the end of the last hands-on lesson, we realized our categories of t-shirts were not consistently named. We can fix this kind of problem with a Prepare recipe.

First Steps

If it’s not already open, click to open the orders dataset. No matter what tab you are looking at, you’ll find an Actions button and a plus sign near the top right of the screen. Click on either of these to expand the right sidebar.

  • In the Actions sidebar, choose Prepare from the section of Visual recipes.

When creating a recipe, you must provide an input dataset and the name of the output dataset, which the recipe will produce.

  • Accept the default output dataset name of orders_prepared. Click Create Recipe.

Note
You can also set the value of “Store into” to decide where the output data will live. In this example, the output is written to the local filesystem, but the output could be written to a relational database or a distributed filesystem, if the infrastructure exists.

The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset. The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.

  • Drag the order_id column in front of the pages_visited column. Note how a step describing this action is added to the recipe’s Script.

In order to standardize the categories of tshirt_category, let’s recode the values.

  • Click on the column name tshirt_category, which opens a dropdown, and select Analyze.

The Analyze window provides a quick summary of the data in the column (for the sample of data that is being displayed by Dataiku DSS). You can also perform various data cleansing actions.

  • Select White T-Shirt M and Wh Tshirt M. From the “Mass Actions” dropdown choose Merge selected.
  • Choose to replace the values with White T-Shirt M and click Merge.
  • Repeat this process for other categories until six remain.

When all necessary replacements have been made, close the Analyze window and see that a “Replace” step has been added to the Prepare script.

Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the Analyze dialog provides a quick and intuitive shortcut to build the step.
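For readers who think in code, the effect of the “Replace” step can be sketched in plain Python. This is only an analogy, not Dataiku DSS's own API: the function name `recode_categories`, the sample rows, and the "Black T-Shirt F" value are made up for illustration, and only the "Wh Tshirt M" variant is taken from the text above.

```python
# Map each variant spelling to its canonical category name.
# Only "Wh Tshirt M" comes from the tutorial; a real mapping would list
# every inconsistent spelling found in the column.
REPLACEMENTS = {
    "Wh Tshirt M": "White T-Shirt M",
}


def recode_categories(rows, column="tshirt_category", mapping=REPLACEMENTS):
    """Return recoded copies of the rows plus the number of rows modified."""
    recoded = []
    changed = 0
    for row in rows:
        value = row[column]
        new_value = mapping.get(value, value)  # unknown values pass through
        if new_value != value:
            changed += 1
        recoded.append({**row, column: new_value})
    return recoded, changed


# Hypothetical sample rows, standing in for the sample shown in the recipe.
sample = [
    {"order_id": 1, "tshirt_category": "White T-Shirt M"},
    {"order_id": 2, "tshirt_category": "Wh Tshirt M"},
    {"order_id": 3, "tshirt_category": "Black T-Shirt F"},
]
recoded, changed = recode_categories(sample)
```

The `changed` counter plays the same role as the "affects 517 rows" figure reported by the step: it counts the rows whose value was actually rewritten.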

At this point, you will see some values of the tshirt_category in blue. This is because you are in Step preview mode. In this mode, you can see what the step changes. The values in blue are those that were modified by the “Replace” step.

If you want to see your data as it will appear after processing, click on the Disable preview (blue eye) button in the top bar.

More Preparation

Now, let’s deal with the order_date. At this point, the storage type of the order_date column is a “string”, but its meaning inferred by Dataiku DSS is an unparsed date. Let’s parse it so that we can treat it as a proper date.

  • Open the order_date column dropdown and select Parse date.
  • The Smart Date dialog shows the most likely formats for our dates and what the dates would look like once parsed, using some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format. Select this format, and see that a “Parse date” step has been added to the script.

By default, this step creates a new column, order_date_parsed. Note that both its storage type and its meaning are now a date. We could leave the name of the output column empty to parse the column in place. Instead, let’s delete the original date column and rename the new one.

  • Click on the order_date column header dropdown. Choose Delete.
  • Click on the order_date_parsed column header dropdown. Choose Rename. Give the name order_date.
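The three steps above (parse, delete the original, rename) amount to converting the column in place. A rough plain-Python equivalent is sketched below. It is not how Dataiku DSS implements the step: the function name and sample rows are hypothetical, `%Y/%m/%d` is the Python spelling of the yyyy/MM/dd format, and turning unparsable values into `None` is an assumption of this sketch.

```python
from datetime import datetime


def parse_order_date(rows, column="order_date", fmt="%Y/%m/%d"):
    """Return copies of the rows with `column` converted from text to a date."""
    parsed = []
    for row in rows:
        new_row = dict(row)
        try:
            new_row[column] = datetime.strptime(row[column], fmt).date()
        except (TypeError, ValueError):
            # Assumption of this sketch: values that do not match become empty.
            new_row[column] = None
        parsed.append(new_row)
    return parsed


# Hypothetical sample rows.
sample = [
    {"order_id": 1, "order_date": "2017/03/21"},
    {"order_id": 2, "order_date": "not a date"},
]
parsed = parse_order_date(sample)
```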

Finally, let’s compute the value of each order. The orders dataset includes the number of t-shirts in each order and the price per t-shirt. We are going to use a Formula step to compute the dollar value of each order. Dataiku DSS formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.

This time, we will not add the step by clicking on a column header, but instead use the processors library which references all 90+ data preparation processors.

  • Click the yellow +Add a New Step button near the bottom left of the page.
  • Select Formula (you can search for it).
  • Type total as the name of the new column.
  • In the expression, type tshirt_price * tshirt_quantity (you can also click Edit to bring up the advanced formula editor, which will autocomplete column names)
  • Click anywhere and see the new total column appear.
  • Remove the columns tshirt_price and tshirt_quantity by clicking on the column header and choosing Delete.
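As a point of comparison, the Formula step followed by the two deletions could be written in plain Python roughly as follows. The column names come from the tutorial; the function name `add_total` and the sample row are hypothetical, and this is an analogy rather than the Dataiku DSS formula language.

```python
def add_total(rows):
    """Add total = tshirt_price * tshirt_quantity, then drop the two inputs."""
    dropped = ("tshirt_price", "tshirt_quantity")
    result = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key not in dropped}
        new_row["total"] = row["tshirt_price"] * row["tshirt_quantity"]
        result.append(new_row)
    return result


# Hypothetical sample row.
sample = [{"order_id": 1, "tshirt_price": 20.0, "tshirt_quantity": 3}]
with_total = add_total(sample)
```

Note that the order matters in the same way as in the Prepare script: the total has to be computed before the price and quantity columns are removed.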

Recall that the data visible in the recipe is merely a sample, meant to give you immediate visual feedback on the design of your Prepare script. With our data preparation finished, we must now run the recipe on the whole input dataset.

  • Click Run in the lower left corner of the page. Here, Dataiku DSS runs the recipe with its own engine, but depending on your infrastructure and the type of recipe, you can choose where the computation takes place.

You will also be asked to update the schema. A dataset’s schema is the list of its columns, along with each column’s storage type and meaning. Because we created the column total, removed the columns tshirt_price and tshirt_quantity, and changed the type of order_date, we need to allow Dataiku DSS to update the schema.

  • Update the schema.
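To make the idea of a schema update concrete, the sketch below compares two simplified schemas, each a mapping of column name to storage type, and reports what changed. Everything here is illustrative: the function `schema_changes` is hypothetical, and apart from order_date going from string to date, the storage types shown are assumptions rather than values taken from the dataset.

```python
def schema_changes(before, after):
    """Compare two {column: storage_type} mappings."""
    added = [col for col in after if col not in before]
    removed = [col for col in before if col not in after]
    retyped = [col for col in after if col in before and before[col] != after[col]]
    return added, removed, retyped


# Simplified input schema; types other than order_date are assumed.
before = {
    "order_id": "string",
    "order_date": "string",
    "tshirt_price": "double",
    "tshirt_quantity": "bigint",
}
# Simplified output schema after the Prepare steps in this lesson.
after = {
    "order_id": "string",
    "order_date": "date",
    "total": "double",
}
added, removed, retyped = schema_changes(before, after)
```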

When the job completes, click Explore dataset orders_prepared to view the output dataset. You can also return to the Flow and see your progress.

What’s next?

Congratulations on completing your first Prepare recipe! However, there’s much more data exploration and cleaning to be done.
