Hands-On: Preparing Data
Note
To complete this lesson, you can continue with the project you created in the Basics 101 course.
Alternatively, you can create a new starter project picking up where Basics 101 left off. From the Dataiku DSS homepage select +New Project > DSS Tutorials > Core Designer / Basics > Basics 102.
At the end of the last hands-on lesson, we realized our categories of t-shirts were not consistently named. We can fix this kind of problem with a Prepare recipe.
First Steps
Hint
In addition to the written instructions and screenshots, you’ll also find several short screencasts recording the actions described in each section.
If it’s not already open, click to open the orders dataset. No matter what tab you are looking at, you’ll find an Actions button and a plus sign near the top right of the screen. Click on either of these to expand the right sidebar.
- In the Actions sidebar, choose Prepare from the section of Visual recipes.
When creating a recipe, you must provide an input dataset and a name for the output dataset that the recipe will produce.
- Accept the default output dataset name of orders_prepared. Click Create Recipe.
Note
You can also set the value of “Store into” to decide where the output data will live. In this example, the output is written to the local filesystem, but the output could be written to a relational database or a distributed filesystem, if the infrastructure exists.
The Prepare recipe allows you to define a series of steps, or actions, to take on the dataset. The types of steps you can add to a Prepare recipe are wide-ranging and powerful. One example is reordering columns.
- Drag the order_id column in front of the pages_visited column. Note how a step describing this action is added to the recipe’s Script.
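Purely for reference, and not part of the visual tutorial: a minimal pandas sketch of the same reordering, using a small stand-in DataFrame whose columns and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the orders sample; columns and values are assumptions.
df = pd.DataFrame({
    "pages_visited": [3, 5],
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
})

# Move order_id in front of pages_visited, keeping all other columns in place.
cols = [c for c in df.columns if c != "order_id"]
cols.insert(cols.index("pages_visited"), "order_id")
df = df[cols]
print(df.columns.tolist())  # ['order_id', 'pages_visited', 'customer_id']
```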
To standardize the categories in tshirt_category, let’s recode the values.
- Click on the column name tshirt_category, which opens a dropdown, and select Analyze.
The Analyze window provides a quick summary of the data in the column (for the sample of data that is being displayed by Dataiku DSS). You can also perform various data cleansing actions.
- Select White T-Shirt M and Wh Tshirt M. From the “Mass Actions” dropdown choose Merge selected.
- Choose to replace the values with White T-Shirt M and click Merge.
- Repeat this process for the other categories until only six distinct categories remain.
When all necessary replacements have been made, close the Analyze window and see that a “Replace” step has been added to the Prepare script.
Replacing the four values in this step affects 517 rows in the sample. We could have created this step explicitly in the script, but the Analyze dialog provides a quick and intuitive shortcut for building it.
At this point, you will see some values of tshirt_category in blue. This is because you are in Step preview mode, which shows what the current step changes. The values in blue are those modified by the “Replace” step.
If you want to see your data as it will appear after processing, click on the Disable preview (blue eye) button in the top bar.
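For readers who like to see the transformation in code, here is a minimal pandas sketch of the same recoding. The stand-in DataFrame and its values are assumptions, and only the one merge we actually built above is mapped.

```python
import pandas as pd

# Hypothetical stand-in for the tshirt_category column of the sample.
df = pd.DataFrame({"tshirt_category": ["White T-Shirt M", "Wh Tshirt M", "Black T-Shirt F"]})

# The "Replace" step built by the Analyze dialog boils down to a value mapping.
# Only the merge made above is shown; any other misspelled labels would be added here.
replacements = {"Wh Tshirt M": "White T-Shirt M"}
df["tshirt_category"] = df["tshirt_category"].replace(replacements)

print(df["tshirt_category"].value_counts())
```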
The following video goes through what we just covered.
More Preparation
Now, let’s deal with order_date. At this point, the storage type of the order_date column is a string, but Dataiku DSS infers its meaning to be an unparsed date. Let’s parse it so that we can treat it as a proper date.
- Open the order_date column dropdown and select Parse date.
- The Smart Date dialog shows the most likely formats for our dates and how they would look once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format. Select this format, and see that a “Parse date” step has been added to the script.
By default, this step creates a new column, order_date_parsed. Note how both its storage type and meaning are now date. We could have left the name of the output column empty in order to parse the column in place. Instead, let’s delete the original date column and rename the new one.
- Click on the order_date column header dropdown. Choose Delete.
- Click on the order_date_parsed column header dropdown. Choose Rename. Give the name order_date.
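As a rough illustration of what “Parse date” plus the rename amounts to, here is a small pandas sketch. The stand-in date values are made up; %Y/%m/%d is the pandas equivalent of the yyyy/MM/dd format selected above.

```python
import pandas as pd

# Hypothetical stand-in for the unparsed order_date column (stored as strings).
df = pd.DataFrame({"order_date": ["2017/04/10", "2017/04/11"]})

# Equivalent of the "Parse date" step with the yyyy/MM/dd format,
# followed by deleting the original column and renaming the parsed one.
df["order_date_parsed"] = pd.to_datetime(df["order_date"], format="%Y/%m/%d")
df = df.drop(columns=["order_date"]).rename(columns={"order_date_parsed": "order_date"})

print(df.dtypes)  # order_date is now datetime64[ns]
```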
The following video goes through what we just covered.
Finally, let’s compute the value of each order. The orders dataset includes the number of t-shirts in each order and the price per t-shirt. We are going to use a Formula step to compute the dollar value of each order. The Dataiku DSS formula language is a powerful expression language for performing calculations, manipulating strings, and much more.
This time, we will not add the step by clicking on a column header, but instead use the processors library, which catalogs all 90+ data preparation processors.
- Click the yellow +Add a New Step button near the bottom left of the page.
- Select Formula (you can search for it).
- Type total as the name of the new column.
- In the expression, type tshirt_price * tshirt_quantity (you can also click Edit to bring up the advanced formula editor, which will autocomplete column names).
- Click anywhere and see the new total column appear.
- Remove the columns tshirt_price and tshirt_quantity by clicking on the column header and choosing Delete.
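If it helps to see the arithmetic spelled out, the following pandas sketch mirrors the Formula expression and the two column deletions; the stand-in prices and quantities are made-up values, not taken from the orders dataset.

```python
import pandas as pd

# Hypothetical stand-in for the price and quantity columns of the sample.
df = pd.DataFrame({"tshirt_price": [16.99, 24.99], "tshirt_quantity": [2, 1]})

# Equivalent of the Formula step tshirt_price * tshirt_quantity,
# followed by deleting the two source columns.
df["total"] = df["tshirt_price"] * df["tshirt_quantity"]
df = df.drop(columns=["tshirt_price", "tshirt_quantity"])

print(df)
```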
The following video goes through what we just covered.
Recall that the data visible in the recipe is merely a sample, meant to give you immediate visual feedback on the design of your Prepare script. With our data preparation finished, we must now run the recipe on the whole input dataset.
- Click Run in the lower left corner of the page. Dataiku DSS uses its own engine to run this recipe, but depending on your infrastructure and the type of recipe, you can choose where the computation takes place.
You will also be asked to update the schema. A dataset’s schema is the list of its columns, along with their storage types and meanings. Because we created the total column, removed columns such as tshirt_price and tshirt_quantity, and changed the type of order_date, we need to allow Dataiku DSS to update the schema.
- Update the schema.
When the job completes, click Explore dataset orders_prepared to view the output dataset. You can also return to the Flow and see your progress.
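Optionally, once the job has run, the output can also be inspected programmatically. The sketch below assumes you are working in a Python notebook inside the same DSS project, where the dataiku package is available; this is beyond the scope of this visual tutorial.

```python
import dataiku

# Assumes this runs inside a Dataiku DSS Python notebook or recipe in the same project.
orders_prepared = dataiku.Dataset("orders_prepared")
df = orders_prepared.get_dataframe()

# Quick sanity checks on the updated schema.
print(df.columns.tolist())  # expect order_date and total, with tshirt_price/quantity gone
print(df.dtypes)
```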
What’s next?
Congratulations on completing your first Prepare recipe! However, there’s much more data exploration and cleaning to be done.