Tableau Prep on stdin

Tableau Prep: The Power of Composability

Isaac — Wed, 09 May 2018 00:00:00 +0000

When we built Tableau Prep, we put a premium on ensuring composability of operations: you can take the operations Prep supports and string them together in any combination you need. There are no restrictions based on where the data came from, or what operations came before.

This means that you never need to think about whether a particular operation is supported in your particular situation: if Prep supports it ever, Prep supports it always. Moreover, this gives you a lot of power to do what you need to with your data.

In the rest of this post, we’ll walk through a Superstore example that highlights this power.

The Problem

Let’s start with the sample Superstore data from Tableau Desktop. This data set is a list of order details: each row represents one item from an order, with multiple line items accruing to each order.

Given these data, let’s try to fulfill what seems like a simple request:

Get the order details for customers with fewer than the median number of orders.

This seems relatively straightforward… or is it? In cases like this, I often find it helpful to think backwards to come up with a solution:

Step 4: If we had a list of customers with fewer than the median number of orders, we could cull the order details down to just those from customers on the list. But we don’t have a list of these sub-median customers.

Step 3: If we knew the median number of orders, we could prune the list of customers down to those with fewer than the median. But we don’t have the median number of orders.

Step 2: If we knew the count of orders for each customer, we could aggregate it to find the median number of orders over all customers. But we don’t have the number of orders for each customer.

Step 1: If we had the list of orders for each customer, we could aggregate to get the count for each customer. And we do have the order list!

Now we have a plan: we’ll start with the order details we have, and climb the ladder outlined above to get to the solution.

The Solution

We start by loading the Superstore data:

As we’ve already observed, these are order details. Each order has a distinct Order ID, but may have more than one line.

Step 1

Following our plan, the first thing we need to do is get the count of orders for each customer. To do this we introduce an aggregate: we group by customer and count the distinct number of Order IDs:

The distinct makes it so repeated Order IDs — which come from having more than one order detail line per order — are only counted once.
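For readers who think in code, here is a rough sketch of that aggregate in plain Python. It only illustrates the logic and says nothing about how Prep implements it; the sample rows and the `count_orders` helper are made up for the example.

```python
from collections import defaultdict

# Hypothetical order-detail rows: one tuple per line item.
order_details = [
    ("Ann", "O-1", "Chair"),
    ("Ann", "O-1", "Desk"),   # second line of the same order
    ("Ann", "O-2", "Lamp"),
    ("Bob", "O-3", "Pen"),
    ("Cy", "O-4", "Ink"),
    ("Cy", "O-5", "Pad"),
    ("Cy", "O-6", "Mug"),
]


def count_orders(rows):
    """Group by customer and count distinct Order IDs."""
    order_ids = defaultdict(set)
    for customer, order_id, _item in rows:
        order_ids[customer].add(order_id)  # a set keeps each Order ID once
    return {customer: len(ids) for customer, ids in order_ids.items()}


orders_per_customer = count_orders(order_details)
print(orders_per_customer)
```

In this toy data Ann has three line items but only two orders, which is exactly the case the distinct count is there to handle.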

So we don’t confuse ourselves later, we’ll rename Order ID to Number of Orders:

Step 2

Now that we have the number of orders for each customer, we can aggregate again to find the median number of orders per customer:

This aggregate is a little funny: There’s no grouping field, so we don’t partition the table at all. The result is an odd little table with one row and one column, but this record represents the median over all customers we were looking for.
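As a rough Python analogy (not Prep's implementation), an aggregate with no grouping field just reduces the whole column to a single value. The customer counts and the `Median Orders` column name below are invented for illustration.

```python
from statistics import median

# Hypothetical output of the first aggregate: customer -> number of orders.
orders_per_customer = {"Ann": 2, "Bob": 1, "Cy": 3, "Di": 4}

# No grouping field: the whole table collapses into one row with one column.
median_table = [{"Median Orders": median(orders_per_customer.values())}]
print(median_table)
```

With an even number of customers the median is the average of the two middle counts, so it can come out as a non-integer even though every input is a whole number.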

We’ll rename this once again:

Step 3

With the median number of orders in hand, we can join it with our list of customers and order counts to filter down that list. I.e., we’ll join it with the result of our first aggregate:

Note the join clause here: we’re doing an inner join, but matching when the median is greater than the customer’s order count. We also have an error: the types don’t match because the result of the median is a floating-point number, not an integer.

If we correct the type, we get our list of customers with fewer than the median number of orders:
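The same join-as-filter idea can be sketched in a few lines of Python. This is an illustration under assumed data, not a description of Prep's join engine: `inner_join` is a hypothetical helper, and the explicit `float(...)` stands in for the type correction described above.

```python
def inner_join(left, right, condition):
    """Keep every left/right pairing for which the join clause holds."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if condition(left_row, right_row)
    ]


# Hypothetical results of the two aggregates.
customer_counts = [
    {"Customer": "Ann", "Number of Orders": 2},
    {"Customer": "Bob", "Number of Orders": 1},
    {"Customer": "Cy", "Number of Orders": 3},
    {"Customer": "Di", "Number of Orders": 4},
]
median_table = [{"Median Orders": 2.5}]

# Join clause: the median is greater than the customer's order count.
below_median = inner_join(
    customer_counts,
    median_table,
    lambda c, m: m["Median Orders"] > float(c["Number of Orders"]),
)
print([row["Customer"] for row in below_median])
```

Because the median table has exactly one row, each customer is paired with it at most once, so the join can only drop rows and never duplicates them.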

Step 4

Now that we have our customer list, we’re ready to cull the line items. We’ll again use a join as a filter, but this time we’re joining our latest table with the original input:

You can see that a bunch of records drop out from the right: those belong to customers with at least the median number of orders. What remain are the line items we care about:
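In Python terms, this last join behaves like a lookup that keeps a line item only when its customer is on the list. The rows and the `join_on_customer` helper are assumptions made for this sketch, not anything taken from Prep or from the real Superstore data.

```python
# Hypothetical original input: one row per line item.
order_details = [
    {"Customer": "Ann", "Order ID": "O-1", "Item": "Chair"},
    {"Customer": "Ann", "Order ID": "O-2", "Item": "Lamp"},
    {"Customer": "Bob", "Order ID": "O-3", "Item": "Pen"},
    {"Customer": "Cy", "Order ID": "O-4", "Item": "Ink"},
    {"Customer": "Di", "Order ID": "O-7", "Item": "Mug"},
]

# Hypothetical customer list with fewer than the median number of orders.
below_median = [
    {"Customer": "Ann", "Number of Orders": 2},
    {"Customer": "Bob", "Number of Orders": 1},
]


def join_on_customer(details, customers):
    """Inner join on Customer; unmatched line items drop out."""
    by_name = {c["Customer"]: c for c in customers}
    return [
        {**detail, **by_name[detail["Customer"]]}
        for detail in details
        if detail["Customer"] in by_name
    ]


kept = join_on_customer(order_details, below_median)
print([row["Order ID"] for row in kept])
```

The extra `Number of Orders` column rides along on each kept row, which is the kind of leftover column you may want to clean up before output.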

Wrapping Up

At this point, you might want to clean up a few of the columns we created along the way, but our data are ready to output to Tableau or anywhere else you want to take them.

This may seem a little complex (and it’s clearly stretching our flow layout algorithm), but it makes a perfectly fine flow. There was no operator that solved our problem out of the box, but composability made it possible to mix and match the operations present to build a computational machine for our task.

We certainly aren’t done adding operations to Prep, but there’s a rich set already present. And with a little composition, you can make them do some pretty cool tricks.

Tableau Prep: The Flow

Isaac — Mon, 07 May 2018 00:00:00 +0000

I’ve been a bit quiet lately, but Tableau Prep is out the door and it’s time to make a little noise.

Clark recently wrote an excellent post on the basic UX architecture of Prep. Here I’d like to cover a key concept underlying Prep that may be a bit foreign to people coming from Tableau: the flow.

This isn’t the most glamorous part of Prep, but it is one of the most fundamental concepts in the tool, so it seems worth spending some quality time on.

Strap on your life jacket and read on for more.

Data In; Data Out

To understand flows, we start with steps, which are the conceptual unit of work in Tableau Prep. Every time you take an action on your data in Prep, you’re adding a step. For example, if we take the world consumer price index data included with the product and add a filter, we find that a new step is added to the flow:

Each item in the flow pane represents a step, and each step works in the same way: data come in from the left, are modified by the step, and leave to the right:

Some steps — cleaning steps — may have multiple sub-steps, or changes. These are just like steps in the flow, but are smaller increments of work. They flow top to bottom:

We group these together to help conceptually simplify the flow, but each change acts just like any other step: rows come in, they’re modified, and they go out.

Some steps — such as joins — have multiple inputs, but they work the same way: two sets of data come in from the left, they’re put together, and the result leaves to the right:

And where do they go? On to the next step! Some steps may even have multiple outputs, with the data going to multiple targets:

Step-by-step we build up a flow: an ordered sequence of steps that does what we want.
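One way to picture this in code: treat each step as a function from rows to rows, and a flow as a list of such functions applied in order. This is only a mental model with made-up data and hypothetical step names (`keep_food`, `round_values`, `run_flow`), not how Prep represents flows internally.

```python
from functools import reduce


def keep_food(rows):
    """A filter step: rows in, fewer rows out."""
    return [row for row in rows if row["Index"] == "Food"]


def round_values(rows):
    """A cleaning step: rows in, modified rows out."""
    return [{**row, "Value": round(row["Value"])} for row in rows]


def run_flow(rows, steps):
    """Feed the output of each step into the next one, left to right."""
    return reduce(lambda data, step: step(data), steps, rows)


# Hypothetical CPI rows.
cpi = [
    {"Country": "A", "Index": "Food", "Value": 101.4},
    {"Country": "A", "Index": "General", "Value": 99.6},
    {"Country": "B", "Index": "Food", "Value": 103.6},
]

result = run_flow(cpi, [keep_food, round_values])
print(result)
```

Because every step has the same shape, any step can follow any other, which is what makes the steps easy to string together.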

Clarity and Control

That ordering is a key aspect of flows. If you’re coming from Tableau, you may be aware that it performs operations in a particular order, but the system doesn’t advertise this, and generally you don’t need to think about it.

But order sometimes matters, and we designed Prep with those times in mind. The CPI data contain both a food index and a general index. Let’s say that we’ve pivoted the data, and now want to compare each country’s CPI to the global average for each year — except we only care about the food index.

To do this, we’ll first filter to keep only the food index:

And then we’ll aggregate by year:

Order matters: if we did the aggregate first, we would have folded in the general CPI as well.
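A small Python sketch makes the difference concrete. The numbers are invented, and `average_by_year` is a hypothetical stand-in for the aggregate step.

```python
from statistics import mean

# Hypothetical pivoted CPI rows: (year, index, value).
cpi = [
    (2017, "Food", 100.0),
    (2017, "Food", 110.0),
    (2017, "General", 120.0),
    (2017, "General", 130.0),
]


def average_by_year(rows):
    """Group by year and average the values."""
    values = {}
    for year, _index, value in rows:
        values.setdefault(year, []).append(value)
    return {year: mean(vals) for year, vals in values.items()}


# Filter first, then aggregate: only the food index contributes.
food_only = [row for row in cpi if row[1] == "Food"]
filter_then_aggregate = average_by_year(food_only)

# Aggregate first: the general index is folded into the average.
aggregate_first = average_by_year(cpi)

print(filter_then_aggregate, aggregate_first)
```

Note also that once the aggregate has run, the index column is gone from its output, so there is nothing left to filter on afterwards.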

This kind of ordering is explicit in Prep. You don’t have to guess, and you don’t need to coax the system into doing what you want: you just build your flow in the order that fits your problem.

And with Prep, you can always go back and see your data at any point along the flow. Just click back and look. This way you can see and control what the flow is doing to your data every step along the way.

Prep is a Competent Cook

We can add another metaphor: think of a flow as a recipe, and let’s take a moment to bake some cookies.

We’ve already mixed the wet ingredients — the eggs, the vanilla, the butter — when we get to this part of the recipe:

…
Measure 1.5 cups flour
Add 1/4 teaspoon salt
Add 1/2 teaspoon baking powder
Mix thoroughly
Add dry ingredients to wet ingredients
…

A competent cook would mix these dry ingredients before adding them to the wet, but they would take the liberty of combining them in any convenient order: they know it’s irrelevant.

Tableau Prep is a competent cook. It can figure out many cases where the order won’t matter, and can rearrange them to make your flow run more efficiently. But it will only do this when the reordering won’t affect the results that you intended.

So while the flow gives the operations a conceptual order, they may not actually be run in that order at all. The result is that you can ignore order when it doesn’t matter, but rely on it when it does.
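As a toy illustration of a reordering that is safe, consider two filters on different fields: they can run in either order and produce the same rows. This sketch uses invented data and says nothing about which rewrites Prep actually performs.

```python
# Hypothetical CPI rows.
rows = [
    {"Country": "A", "Year": 2016, "Value": 98.0},
    {"Country": "A", "Year": 2017, "Value": 101.0},
    {"Country": "B", "Year": 2017, "Value": 104.0},
]


def only_2017(data):
    return [row for row in data if row["Year"] == 2017]


def only_country_a(data):
    return [row for row in data if row["Country"] == "A"]


# The order written in the flow...
as_written = only_country_a(only_2017(rows))
# ...and a swapped order an optimizer might prefer.
reordered = only_2017(only_country_a(rows))

print(as_written == reordered)
```

Contrast this with a filter followed by an aggregate, where swapping the two changes the answer; that is the kind of ordering a system has to preserve.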

More than Just Flows

The notion of a flow is not unique to Tableau Prep, and it isn’t Prep’s most distinguishing feature. The way that Prep uses samples to give you immediate feedback, the way we use analytics to help you see what needs to be done, and the direct manipulation of data all contribute more directly to what makes Prep special.

But understanding flows is central to understanding how to make Prep do exactly what you want, and it can be a bit of a leap for folks coming from Tableau Desktop. I hope this helps make that leap a little easier.

Happy hacking!