The Lab (2)

Discover the Lab

In the previous Basics course, and thus far in this course, all of your work creating a data pipeline has been done in the Flow. Now that you have been introduced to the Lab, let’s test out how it works.

  • In this tutorial, you are going to see how you can perform preliminary work on data outside the Flow in a dedicated environment called the Lab.
  • More specifically, you will create a Visual analysis in the Lab.
  • Of the three main tabs in a Visual analysis (Script, Charts, and Models), we’ll cover the first two and leave the Models tab for a future course.

Returning to your Basics 103 project, we want to create a visual analysis for the customers_orders_joined dataset.

  • Select or open the customers_orders_joined dataset.
  • From the Actions sidebar, click the blue Lab button.
  • You can also click the last icon on the right sidebar to go directly there.
  • Create a New analysis.

You will be prompted to specify a name for your analysis. Leave the default name Analyze customers_orders_joined.

Interactively Prepare Your Data

The screen looks much like a Prepare recipe, but there are important distinctions, which are discussed in the concept video.

Hint
Screencasts at the end of sections mirror the actions described.

First, let’s parse the birthdate column.

  • Open the column dropdown and select Parse date.
  • Accept the “yyyy/MM/dd” date format.
  • Clear the output column in the Script step so that the parsed date replaces the original birthdate column.

With a customer’s birth date, and the date on which they made their first order, we can compute their age on the date of their first order.

  • From the birthdate column dropdown, choose Compute time since.
  • Choose “until” to be Another date column.
  • Choose first_order_date to be the “Other column”.
  • Change “Output time unit” to Years.
  • Then edit the Output column name to age_first_order.
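The date parsing and age computation above can be sketched in pandas (a hypothetical stand-in for the visual steps, not what DSS runs internally; the column names match the tutorial, but the sample rows are invented):

```python
import pandas as pd

# Invented sample rows with the same column names as the tutorial dataset.
df = pd.DataFrame({
    "birthdate": ["1980/03/15", "1995/07/01"],
    "first_order_date": ["2017/01/10", "2017/06/20"],
})

# Parse both columns with the yyyy/MM/dd pattern (%Y/%m/%d in Python).
df["birthdate"] = pd.to_datetime(df["birthdate"], format="%Y/%m/%d")
df["first_order_date"] = pd.to_datetime(df["first_order_date"], format="%Y/%m/%d")

# Time since birthdate until first_order_date, expressed in years.
df["age_first_order"] = (df["first_order_date"] - df["birthdate"]).dt.days / 365.25
```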

From the new column age_first_order header dropdown, select Analyze in order to see if the distribution of ages looks okay. As it turns out, there are a number of outliers with ages well over 120. These are indicative of bad data.

  • Within the Analyze dialog, click the Actions button.
  • Choose to clear values outside 1.5 IQR (interquartile range). This will set those values to missing.

Now the distribution looks more reasonable, but there are still a few suspicious values over 100. We can remove these by setting an upper bound limit. Close the Analyze dialog.

  • In the Script, click the new step, Clear values outside [x,y] in age_first_order, to expand it.
  • Set the upper bound to 100.
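The two cleaning steps above amount to the following logic, sketched here in pandas on an invented series (DSS applies the same idea through the Analyze dialog and the script step):

```python
import pandas as pd

# Illustrative ages, including two implausible outliers.
ages = pd.Series([25.0, 32.0, 41.0, 38.0, 29.0, 35.0, 150.0, 480.0])

# The 1.5 IQR fence: values outside it are set to missing,
# as the Analyze dialog action does.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
ages = ages.mask((ages < lower) | (ages > upper))

# The additional fixed upper bound of 100 from the script step.
ages = ages.mask(ages > 100)
```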

Lastly, now that we’ve computed age_first_order, we won’t need birthdate or first_order_date anymore, so let’s remove them from the script.

  • For both columns, open the column dropdown and select Delete.
  • This adds a Remove step to the script.

The following video goes through what we just covered.

Leveraging the User Agent

Let’s now enrich the data by processing the user_agent and ip_address columns.

The user_agent column contains information about the browser and operating system, and we want to pull this information out into separate columns so that it’s possible to use it in further analyses.

  • Note that Dataiku DSS has inferred the meaning of the user_agent column to be User-Agent.
  • Accordingly, its column dropdown is able to suggest specific actions. Select Classify User-Agent.
  • This adds a new step to the script and seven new columns to the dataset.
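To see what such a classification does, here is a rough sketch using simple substring checks. This is only an illustration of the idea; DSS uses a proper user-agent parser, and the function name below is hypothetical:

```python
def classify_user_agent(ua: str) -> dict:
    """Naive browser/OS extraction from a user-agent string (illustrative only)."""
    ua_low = ua.lower()
    # Order matters: Edge UAs also contain "chrome", and Chrome UAs contain "safari".
    if "edge" in ua_low:
        brand = "Edge"
    elif "chrome" in ua_low:
        brand = "Chrome"
    elif "safari" in ua_low:
        brand = "Safari"
    elif "firefox" in ua_low:
        brand = "Firefox"
    else:
        brand = "Other"
    if "windows" in ua_low:
        os_name = "Windows"
    elif "mac os" in ua_low or "macintosh" in ua_low:
        os_name = "MacOS"
    elif "linux" in ua_low:
        os_name = "Linux"
    else:
        os_name = "Other"
    return {"user_agent_brand": brand, "user_agent_os": os_name}

example = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
# classify_user_agent(example)
# -> {"user_agent_brand": "Chrome", "user_agent_os": "Windows"}
```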

For this tutorial, we are only interested in the browser and the operating system, so we will remove the columns we don’t need.

  • To do this quickly, change from the Table view to the Columns view using the icon near the top right of the screen.
  • Select all of the columns beginning with user_agent_, except user_agent_brand and user_agent_os.
  • Click the Actions button and select Delete.

The following video goes through what we just covered.

Leveraging the IP Address

Dataiku DSS has inferred the meaning of the ip_address column to be an IP address. Just like with user_agent, we’ll have meaning-specific actions in the column dropdown.

  • Open the column header dropdown for the ip_address column.
  • Select Resolve GeoIP.

This adds a new step to the script and seven new columns to the dataset that tell us about the general geographic location of each IP address.

For this tutorial, we are only interested in the country and GeoPoint (approximate longitude and latitude of the IP address).

  • In the Script step, deselect Extract country code, Extract region, and Extract city.
  • Finally, delete the original ip_address column.

The following video goes through what we just covered.

Using Formulas

In Basics 102, we saw how to use Formulas in a Prepare recipe. We can use the same Formulas here.

Let’s create a new column to act as a label on the customers generating a lot of revenue. We’ll consider customers with a value of total orders in excess of 300 as “high revenue” customers.

  • Click +Add a New Step in the script and select Formula.
  • Type high_revenue as the output column name.
  • Click the Edit button to open the expression editor.
  • Type if(total_sum > 300, "True", "False") as the expression.
  • Hit Save.

Note
The syntax for the Formula Processor can be found in the reference documentation.
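The formula if(total_sum > 300, "True", "False") is equivalent to the following pandas expression (a sketch with invented sample values, not the recipe itself):

```python
import pandas as pd

# Invented total_sum values; 300 itself is not strictly greater than 300.
df = pd.DataFrame({"total_sum": [120.0, 450.0, 300.0]})

# Same logic as the DSS formula: label customers above 300 as "True".
df["high_revenue"] = df["total_sum"].gt(300).map({True: "True", False: "False"})
```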

Visualize Your Data with Charts

Now let’s move from the Script tab to the Charts tab. A screencast below will recap all of the actions taken in this section.

Popular User Agents

Since we extracted the browsers used by customers from user_agent, it’s natural to want to know which browsers are most popular. A common way to visualize parts of a whole is with a pie or donut chart.

  • On the Charts tab, click the chart type tool.
  • In the Basics category, select Donut.
  • Click and drag user_agent_brand to the By box, and Count of records to the Show box.

This shows that nearly 3/4 of customers who have placed orders use Chrome. While the donut chart does a nice job of showing the relative share of each browser to the total, we’d also like to include the OS in the visualization.

  • Click the chart type tool again.
  • In the Basics category, select Stacked bar chart instead of the Donut.
  • Click and drag user_agent_os to the And box.
  • Click on user_agent_brand to adjust the sorting from “Natural ordering” to “Count of records, descending”.

Adding OS gives us further insight into the data. As expected, IE and Edge are only available on Windows, and Safari is only on MacOS. What is enlightening is that there are approximately double the number of customers using Chrome on MacOS as Safari and Firefox combined. There is a similar relationship between use of Chrome versus Firefox on Linux.
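The counts behind both charts are simple aggregations. As a sketch, the same numbers could be computed in pandas on a few invented rows:

```python
import pandas as pd

# Invented rows standing in for the prepared customer data.
df = pd.DataFrame({
    "user_agent_brand": ["Chrome", "Chrome", "Firefox", "Chrome", "Safari"],
    "user_agent_os":    ["Windows", "MacOS", "Linux", "Windows", "MacOS"],
})

# Donut chart: count of records by browser, sorted descending.
by_brand = df["user_agent_brand"].value_counts()

# Stacked bar chart: count of records by browser and OS.
by_brand_os = df.groupby(["user_agent_brand", "user_agent_os"]).size()
```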

Sales by Age and Campaign

There are a number of insights we can glean from the combined Haiku T-shirts data that we couldn’t from the individual datasets. For a start, let’s see if there is a relationship between a customer’s age, whether that customer is part of a Haiku T-Shirt campaign, and how much they spend.

Click +Chart at the bottom center of the screen.

  • From the chart type tool, go to the Scatters category and select the Scatter Plot chart.
  • Select age_first_order as the X axis, and total_sum as the Y axis.
  • Select campaign as the column to color bubbles in the plot.
  • Select count as the column to set the size of bubbles.
  • From the size dropdown to the left of the count field, change the Base radius from 5 to 1 to reduce overlapping bubbles.

The scatterplot shows that older customers, and those who are part of the campaign, tend to spend the most. The bubble sizes show that some of the moderately valued customers are those who have made a lot of small purchases, while others have made a few larger purchases.

Sales by Geography

Since we extracted locations from ip_address, it’s also natural to want to know where Haiku T-Shirt’s customers come from. We can visualize this with a map.

  • Click +Chart to make a third chart.
  • Select the Maps category and the Scatter Map plot.
  • Select ip_address_geopoint as the Geo field.
  • Select campaign as the column to color bubbles by.
  • Select total_sum as the column to set the size of bubbles.
  • From the size dropdown, change the base radius from 5 to 2 to reduce overlapping bubbles.

This looks much better, and you can quickly get a feel for which customers are located where. If we then want to focus on the largest sales:

  • Drag total_sum to the Filters box.
  • Click the number for the lower bound to edit it, and type 300 as the lower bound. This filters all customers who have spent less than 300 from the map.

The following video goes through what we just covered.

Deploy work in the Lab to the Flow

While you work on charts in a visual analysis, recall that these are built on a sample of your data. You can change the sample in the Sampling and Engine tab in the left panel, but since Dataiku DSS has to re-apply the latest preparation each time, it will not be very efficient for very large datasets.

Moreover, if you want to share these charts with your team on a dashboard, you will first need to deploy your script.

  • From any tab in the visual analysis, go to the top right corner of the screen and click on Deploy Script. A dialog appears to deploy the script as a Prepare recipe.
  • Note that, by default, charts created in the Lab will be carried over to the new dataset so that you can view them on the whole output data, rather than a sample.
  • Rename the output dataset customers_labelled.
  • Click Deploy to create the recipe.
  • Save the recipe and go to the Flow.

As suggested by the white square instead of the normal blue one, we haven’t yet actually built the dataset. Only the instructions for doing so have been added to the Flow.

  • Open the dataset and see that it is empty. This is because we have not yet run the recipe to build the full output dataset.
  • Click Build.

This opens a dialog that asks whether you want to build just this dataset (non-recursive) or reconstruct datasets leading to this dataset (recursive). Since the input dataset is up-to-date, a non-recursive build is sufficient.

  • Click Build Dataset (leaving non-recursive selected).

While the job executes, you are taken to the detailed activity log.

  • When the job completes, click Output dataset to view a sample of the output dataset.

Let’s configure the stacked bar chart to use the entire dataset.

  • Go to the Charts tab of the customers_labelled dataset.
  • Click Sampling & Engine from the left panel.
  • Uncheck Use same sample as explore.
  • Select No sampling (whole data) as the sampling method.
  • Click Save and Refresh Sample.

The following video goes through what we just covered.

What’s next?

Congratulations! You successfully deployed a visual analysis script from the Lab to the Flow. Be sure to review the concept materials for greater discussion on the differences between these two.

Now that the orders and customers datasets are joined, cleaned, and prepared, you would be ready to build a model to predict customer value. This is a task for a future course!
