Posts on stdin

Lessons From the Signal Leak

Isaac — Mon, 31 Mar 2025 00:00:00 +0000

I’ve found a lot of the coverage about the Trump administration’s accidental leak of Signal messages to The Atlantic frustrating. To read most of the coverage, the major mistake was Mike Waltz’s inclusion of Jeffrey Goldberg in the conversation, and the big questions are about the impact of this particular leak. I have two alternative takeaways, one about the general security attitude, and one about Signal itself.

First the broad one.

Two of the things to consider when evaluating the security of a system are the capability and motivation of the assumed attacker. Most of us are worried about relatively unsophisticated adversaries that don’t actually care that much about us in particular. We want to guard against the cyber-criminal out to get a credit card number, or a telco that wants to sell our info to advertisers. If we’re harder to hack than the next guy, they’ll just move on.

It’s clearly a different case when, for example, law enforcement gets interested in you specifically: the adversary (“The Law”) is now motivated to expend significant, directed effort, and can bring in reasonably sophisticated resources, such as the FBI, to break into your device and messages. If this is your adversary, your job is much harder, and their rate of success goes up substantially.

But we’re talking about people like the Vice President, Secretary of Defense, and Director of National Intelligence: people who would be at the top of any US adversary’s “to bug” list. The motivation is extreme. And, particularly with Russia and China, we’re talking about highly sophisticated attackers.

And so it is a reasonable assumption that any commodity device, like those running Signal, owned by these individuals has been compromised, and that every conversation they have on it is being scooped up by Beijing and Moscow. For the same reason, it’s a reasonable assumption that their personal laptops, cars, and homes have all been bugged.

This is why the government has separate systems and physical locations to hold this kind of conversation. An isolated, stripped-down, purpose-built system would be much easier to secure than even a minimal Android or iOS device.

We know about this particular case because Mr. Waltz accidentally included Mr. Goldberg on the conversation, leaking the whole thing to The Atlantic. But this was a minor snafu in the grand scheme of things. The big mistake, made by all of the people in the group, was having the conversation on a commodity platform to begin with. And while we know about this particular conversation, we don’t know how many others these individuals have broadcast to America’s adversaries.

This was dumb – potentially criminally dumb – behavior by officials who should have known better, and should disqualify all of these individuals from handling classified information in the future.

Beyond this, I think there is a lesson to be learned from the accidental inclusion of Mr. Goldberg on this conversation, and it’s not that Mr. Waltz is an idiot (even if he may be): it’s about user interfaces and Signal’s security model. I don’t know whether Mr. Waltz included the wrong Jeffrey Goldberg in the conversation, or just fat-fingered his contacts list, but either way, Signal couldn’t have warned him that what he was doing was dumb, because Signal doesn’t have any notion of an organization or its security boundaries: people are just people.

Indeed, if this group had used Slack for the conversation, or if they were jointly editing a Google Doc, the system would have almost certainly been locked down to avoid the accidental inclusion of any individual outside of the organization. Adding a person from The Atlantic would have at least provided a warning that this was a bad idea, and Mr. Waltz would have almost certainly not made the error.

I’m a fan of (and a donor to) Signal, but the lack of these organizational boundaries is a good argument against its organizational use. And I suspect that for Signal, this is just fine: that’s not the use case they’re targeting.

Five Eights

Isaac — Tue, 14 May 2024 00:00:00 +0000

With the recent news that Adam Selipsky is stepping down from AWS, I thought I’d share my funny Adam story.

When Adam was at the helm of Tableau, I was a PM on the Prep team. I never knew Adam closely, but I did see him in meetings a fair bit. My impression was of a very capable senior exec who had a clear idea of what he wanted and how things should be run. He had high expectations for those around him, and exuded a geeky awkwardness. I liked him, but he was definitely no teddy bear.

I can’t remember the details, but we were meeting with Adam to discuss some product direction. It was a small group – fewer than ten of us – and Adam was late. So the rest of us were cooling our heels.

Eventually Adam showed up and apologized. He explained that he was at a meeting with the Tableau Online team about service stability, and that it had run long. I knew that this was a major problem, and didn’t begrudge him taking the time. Then I opened my trap.

“Oh, are we up to five eights yet?” I joked.

And panicked. What had my big mouth done? I could hear the silence.

But it didn’t last more than a beat before Adam picked it up.

“Ooh, I like that”, he said, turning to an imagined customer and going into salesman mode. “Would you rather have three nines or five eights?”

He laughed and said, “I’m going to use that.”

I exhaled.

I recall that we got torn to shreds in that meeting, but it was for good reason, not my joke. I had a lot of respect for Adam, but seeing his sense of humor made me like him more. I wish him the best.

Two Unequal Products

Isaac — Mon, 03 Oct 2022 00:00:00 +0000

I’ve been watching some of Timothy Gowers’ videos in which he documents his attempts to solve various mathematics problems. Gowers’ goal is to provide some examples of the mathematical thought process for others to study. I don’t have any deep insights on this to share, but watching the mental process of a serious mathematician as he tackles a problem is certainly interesting. And the problems are interesting themselves.

The second problem Gowers tackles is the topic of this post. He solves it, but the solution doesn’t feel particularly satisfying. It doesn’t feel satisfying to him, either, so he tries another path towards a simpler solution that doesn’t pan out. Here, I take a pass.

The Problem

Here’s Gowers’ statement of the problem:

Prove that for every positive integer $n$, there do not exist positive integers $a$, $b$, $c$, $d$ with $ad=bc$ and $n^2 < a < b < c < d < (n+1)^2$.

I suggest that you take some time to think this through and go watch Gowers’ videos before reading on. Below is my solution. I took a lot longer to get to this than Gowers, but the result seems reasonably elegant.

Some Intuition

Before jumping into it, I want to say a few words about my intuition for the problem. Clearly, if the numbers $a$, $b$, $c$, and $d$ were arbitrary reals or rationals, then it would be easy to come up with values that make this work. So for this to fail, we’re going to have to make use of properties that are special to the integers.

In particular, I want to use the inequality to generate some extra space that I can use to show that the gap between $n^2$ and $(n+1)^2$ isn’t large enough to hold our numbers. My initial attempts were to observe that over the integers, $a>n^2$ means that $a\geq n^2+1$, that $b\geq n^2+2$, etc. But I wasn’t able to use this by itself to generate a large enough gap for the proposition to fail.

The other property of integers is that they factor. And putting this together with the observation above does generate enough space. Let’s see how this works.

My Solution

Assume that such integers exist for some $n$; we’ll derive a contradiction. Given that $ad=bc$, we can write

$$\tag{1} {ad \over b} = c $$

Since these are all positive integers, we can expand out $a$ and $d$ as products of (not necessarily distinct) primes: $a = p_1 p_2 \ldots p_m$ and $d = q_1 q_2 \ldots q_r$. And since the result of the division is an integer, we can see that $b$ must be the product of a subset of these $p$ and $q$ values, with $c$ being the product of the remaining factors. Explicitly, we can rewrite equation (1) as:

$$ { { p_1 p_2 \ldots p_m q_1 q_2 \ldots q_r } \over { p_{\alpha_1}\ldots p_{\alpha_k} q_{\beta_1}\ldots q_{\beta_l} } } = { p_{\gamma_1}\ldots p_{\gamma_i} q_{\delta_1}\ldots q_{\delta_j} } $$

Where the $p_\alpha$s and $p_\gamma$s account for all of the $p_1,\ldots,p_m$ and $q_\beta$s and $q_\delta$s account for all of the $q_1,\ldots,q_r$. If we collect up all of the $p$ terms used to create $b$ as $a_1$, and the leftover ones as $a_2$, and do likewise for the $q$ terms to create $d_1$ and $d_2$, we can rewrite the whole thing as:

$$\tag{2} { {a_1 a_2 d_1 d_2} \over {a_1 d_1} } = a_2 d_2 \quad\text{where}\quad \begin{cases} a = a_1 a_2\\ b = a_1 d_1\\ c = a_2 d_2\\ d = d_1 d_2 \end{cases} $$

All of these terms are still positive integers (possibly 1), but we now have:

$$ n^2 < \overbrace{a_1 a_2 < \underbrace{a_1 d_1} } < a_2 d_2 < \underbrace{d_1 d_2} < (n+1)^2 $$

Comparing the indicated terms, we can extract:

$$ \begin{align}\tag{3} d_1 > a_2 &\implies d_1 \geq a_2 +1\\ d_2 > a_1 &\implies d_2 \geq a_1 +1 \end{align} $$

These implications make use of the fact that the terms are all integers. Now we can see that:

$$ \begin{aligned} \boxed{n^2 + 2n + 1} = (n+1)^2 &> d \\ &= d_1 d_2 \\ &\geq (a_2 + 1)(a_1 + 1) \\ &= a_1 a_2 + a_1 + a_2 + 1 \\ &> \boxed{n^2 + a_1 + a_2 + 1} \end{aligned} $$

Has this forced enough space to generate a contradiction? Together, the boxed terms tell us that:

$$ \begin{aligned} 2n &> a_1 + a_2 \\ 4n^2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\ 4a_1 a_2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\ 0 &> a_1^2 - 2a_1 a_2 + a_2^2 \\ 0 &> (a_1 - a_2)^2 \end{aligned} $$

The third line uses $a_1 a_2 = a > n^2$ to replace $4n^2$ with the larger $4a_1 a_2$. And the last statement cannot hold, since a square is never negative, so our assumption that such integers exist must fail.

$\blacksquare$
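To make the factorization in step (2) concrete, here is a small worked example. The numbers are chosen purely for illustration and are not from Gowers’ videos:

$$ a = 6,\quad b = 8,\quad c = 9,\quad d = 12, \qquad ad = 72 = bc $$

One valid split (assuming we take $a_1 = \gcd(a, b)$, which is one way to pick the shared factors) is:

$$ a_1 = 2,\quad a_2 = 3,\quad d_1 = 4,\quad d_2 = 3 \quad\implies\quad \begin{cases} a = a_1 a_2 = 2\cdot 3\\ b = a_1 d_1 = 2\cdot 4\\ c = a_2 d_2 = 3\cdot 3\\ d = d_1 d_2 = 4\cdot 3 \end{cases} $$

Here $d_1 = 4 \geq a_2 + 1$ and $d_2 = 3 \geq a_1 + 1$, and in fact $d = 12 = (a_2+1)(a_1+1) = a + a_1 + a_2 + 1$. The quadruple starts above $2^2 = 4$ but ends above $3^2 = 9$, so it does not fit between consecutive squares, which is what the argument predicts.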

Discussion

Making use of a few properties of the integers – factorization and discreteness – pays off. By cleanly factoring them in step (2), and developing an inequality on the factors in step (3), we’re able to then amplify the difference of the product enough to generate a contradiction.
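A brute-force check for small $n$ is a cheap way to gain confidence in the statement before trusting the algebra. The sketch below is only a sanity check, not a proof, and the class and method names are made up for this illustration.

```java
// Brute-force sanity check of the claim for small n (not a proof).
// Class and method names are illustrative, not from the original post.
final class UnequalProductsCheck {

    // True if some lo <= a < b < c < d <= hi satisfies a*d == b*c.
    static boolean existsQuadruple(long lo, long hi) {
        for (long a = lo; a <= hi; a++) {
            for (long b = a + 1; b <= hi; b++) {
                for (long c = b + 1; c <= hi; c++) {
                    long product = b * c;
                    // d is forced by ad = bc, so no fourth loop is needed.
                    if (product % a != 0) continue;
                    long d = product / a;
                    if (d > c && d <= hi) return true;
                }
            }
        }
        return false;
    }

    // Searches strictly between n^2 and (n+1)^2.
    static boolean hasCounterexample(long n) {
        return existsQuadruple(n * n + 1, (n + 1) * (n + 1) - 1);
    }

    public static void main(String[] args) {
        // Without the squares constraint, solutions exist: 6 * 12 == 8 * 9.
        System.out.println("Quadruple in [6, 12]: " + existsQuadruple(6, 12));
        for (long n = 1; n <= 60; n++) {
            if (hasCounterexample(n)) {
                System.out.println("Counterexample found for n = " + n);
                return;
            }
        }
        System.out.println("No counterexample for n = 1..60");
    }
}
```

The first line of output confirms that the search does find solutions when the window is wide enough, so an empty result between consecutive squares is meaningful.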

Au Revoir, Snowflake!

Isaac — Tue, 06 Sep 2022 00:00:00 +0000

Just reading this blog, you might guess that all I do is leave jobs. First leaving Tableau, and now, four years later, departing Snowflake.

I’m incredibly proud of what we accomplished at Snowflake, particularly with Snowpark. Snowpark not only expands what customers and partners can do with the platform, but also provides a lot of flexibility for Snowflake itself. I expect this to pay dividends for a long time.

Moreover, the Snowpark team – and Snowflake engineering in general – was absolutely top notch and a joy to work with.

So why leave?

Certainly not because of the people or for lack of interesting work. Nor for doubts in the company: Snowflake is absolutely crushing it. (And as a stockholder, I look forward to them continuing to crush it.)

This was a much more personal decision. I’ve had a longstanding ambivalence towards the software industry. Software has provided me with a lot of interesting, worthwhile problems to solve, and smart, engaging people to solve them with. And it has paid the bills quite handsomely.

On the other hand, I’ve always found myself drawn to the less practical side of computing, mathematics, and the sciences – maybe it runs in the family. I was in academia once: a graduate student for all the wrong reasons, and a poor one as a result. Now I’m in a position to explore again, this time with a bit more perspective.

Exactly how will this exploration play out? I have some ideas, but the truth is that I’m not yet entirely sure.

In the short term, my plans are to take a little time off, get a little more involved in my kids’ schools, and start thinking about the future. I’ll also try to write a bit more about non-employment topics here, as well as get some pictures posted on our new family blog.

Stay tuned!

Iterating Over Metadata With Snowpark

Isaac — Tue, 17 Aug 2021 00:00:00 +0000

(This was ported from my original Medium post.)

Hi Folks,

Last time we saw how to create simple Java functions to detect and mask personally identifying information (PII). For example, we could take a table containing some email messages and mask out the PII in the bodies with a simple query:

But let’s say we wanted to mask out all of the PII. And let’s say that we had many more fields like you might find in something like survey results.

In this case, masking out the PII would be easy, but tedious: we’d have to apply the function manually to each column. And if the schema of our table were to change – or if we wanted to run this masking routine on a different table – we’d have to rewrite the query.

What we’ve run into is a pretty fundamental limitation in SQL: the query is very tied to the underlying schema. There’s no way to pass a type parameter to the query or iterate over metadata. Snowpark doesn’t have this limitation: we can write code to inspect metadata and dynamically generate queries based on what we find.

To get started with Snowpark, you can follow the instructions on how to get it set up in your existing Scala development environment. Or you can follow the nice directions Zohar Nissare-Houssen has outlined here to get going using Docker.

Now using Snowpark for Scala, we can write a fully generic PII masking function:

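import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// Apply the registered maskpii UDF to every string column of df.
val maskAllPii = (df: DataFrame) => {
   // Names of all columns whose type is String
   val toMask = df.schema
      .filter(_.dataType.typeName == "String")
      .map(_.name)
   // Replace each of those columns with its masked version
   df.withColumns(toMask,
      toMask.map(c => callUDF("maskpii", df.col(c))))
}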

This function takes in a DataFrame, inspects the schema, and applies the PII masking function we already have registered in Snowflake to each string column it finds, leaving non-string columns untouched. The result is just another DataFrame.

Now we can very easily run this on our email data…

val df = maskAllPii(sess.table("emails"))

…and fetch the results:

df.show(3,100)  // get the first three lines, format wide

As you can see, the maskAllPii() call has touched all of the String columns. Under the covers, Snowpark has dynamically generated a plan that corresponds to a SQL query:

SELECT "ID", 
       maskpii("SENDER") AS "SENDER", 
       maskpii("SUBJECT") AS "SUBJECT", 
       maskpii("BODY") AS "BODY" 
FROM ( SELECT  *  FROM (emails))

When show() runs, it generates and issues the SQL, wrapping this in an outer LIMIT clause and pretty-printing the result – that’s what show() does.
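To make the idea of schema-driven query generation concrete, here is a rough sketch in plain Java. It is not the Snowpark API and says nothing about how Snowpark builds its plans internally: the class name, the schema map, and the assumption that ID is a non-string column are all invented for illustration. It only shows that, once column names and types are available as data, producing the SELECT list is a simple loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of schema-driven query generation.
// NOT the Snowpark API: the schema map and class name are hypothetical.
final class MaskQueryBuilder {

    // Builds a SELECT that wraps every String column in maskpii(...)
    // and passes all other columns through unchanged.
    static String buildMaskQuery(String table, Map<String, String> schema) {
        String columns = schema.entrySet().stream()
            .map(e -> {
                String quoted = "\"" + e.getKey() + "\"";
                return e.getValue().equals("String")
                    ? "maskpii(" + quoted + ") AS " + quoted
                    : quoted;
            })
            .collect(Collectors.joining(", "));
        return "SELECT " + columns + " FROM " + table;
    }

    public static void main(String[] args) {
        // Column name -> type name, kept in table order by LinkedHashMap.
        // The type of ID is an assumption made for this example.
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("ID", "Long");
        schema.put("SENDER", "String");
        schema.put("SUBJECT", "String");
        schema.put("BODY", "String");
        System.out.println(buildMaskQuery("emails", schema));
    }
}
```

If the schema gains or loses a string column, the generated text changes with it, which is the property the hand-written SQL lacks.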

Of course, this query isn’t a hard one to write, though doing so does start to get a bit tedious as the column count goes up. And you have to do it again for each table or query you want to mask. Moreover, writing this yourself means more chances to make a mistake and miss a column.

In contrast, the Snowpark alternative is simple, robust, and reusable. And as a simple exercise, you can retool the example above to take a different function — or better yet, take an arbitrary function as a parameter.

Happy hacking!

Basic PII Detection and Masking in Snowflake Using Java

Isaac — Wed, 28 Jul 2021 00:00:00 +0000

(This was ported from my original Medium post.)

Hi Folks,

For my first foray into Medium, I wanted to share some code that I’ve used previously in demos. The examples here do basic detection and masking of personally-identifying information (PII) using Java’s built-in regular expression support.

Now, I make no assertion that these routines are good: if you really want to do robust PII detection, you probably want something more sophisticated than a few regexes. Snowflake is even working on data classification as a built-in feature.

But I like these examples because they do a good job of illustrating the basic pattern of Snowflake’s Java functions. And they’re pretty malleable: you should be able to modify these examples to work for any situation where you need to detect or mask based on a set of regexes.

Let’s start with the code and then tear it apart. If you’re running on Snowflake and have Java functions enabled – any AWS account, for now – then you can define them right inline using this create function command:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'
as
$$
import java.util.regex.*;
import java.util.*;

public class PIIDetector {
    
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };    
    
    ArrayList<Pattern> patterns;    
    
    public PIIDetector() {
        patterns = new ArrayList<Pattern>();
        for(String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }    
    
    public boolean hasPII(String s) {
        for(Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }
}
$$

With this in hand, anyone with permissions on the function can issue queries that use it without any knowledge of Java:

select id, haspii(body)
from emails

So let’s take the definition apart. The first section defines how the function will show up in SQL:

create function haspii(s string)
returns boolean
language java
returns null on null input
handler = 'PIIDetector.hasPII'

Most of this is pretty self-explanatory: it’s a function that takes a string and returns a Boolean, and the language is Java. The null on null input bit lets me skip any null handling in my routine: null inputs will be handled without calling into Java at all.

The handler directive is new, and specifies where in the Java code to actually make a call. You may have many potential entry points, but in this case, Snowflake is going to call the hasPII method defined on the PIIDetector class.

The actual Java code is contained between the pairs of dollar signs. After a little boilerplate, we see a few regular expressions:

static final String[] TARGETS = {
    "\\d{3}-\\d{2}-\\d{4}",                 // SSN
    "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
    "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
};

These (highly USA-centric) expressions match the basic forms of Social Security numbers, email addresses, and phone numbers. You can very easily augment this list with more patterns to match your definition of PII.

Next, we see some initialization code:

ArrayList<Pattern> patterns;

public PIIDetector() {
    patterns = new ArrayList<Pattern>();
    for(String s : TARGETS) {
        patterns.add(Pattern.compile(s));
    }
}

Our handler points to an instance method in the PIIDetector class. When Snowflake runs a query that requires an instance of this class, Snowflake will look for a default constructor to use to generate this instance. This provides a really easy way to do one-time initialization: in this case we compile up the regular expressions so they’re ready to go once per query, rather than doing so on each invocation – it should be much faster.

Finally, we have the actual method we’re binding to:

public boolean hasPII(String s) {
    for(Pattern p : patterns) {
        if (p.matcher(s).find()) {
            return true;
        }
    }
    return false;
}

This just loops over the patterns and fires if any match. Easy peasy!
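Because the handler is ordinary Java, the same logic can be tried on a local JVM before it is registered in Snowflake. The sketch below copies the patterns into a stand-alone class; the class name PIIDetectorDemo and the sample strings are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone copy of the detector, runnable without Snowflake.
// The class name and the sample inputs are illustrative only.
final class PIIDetectorDemo {

    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };

    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each regex once, when the instance is created.
    PIIDetectorDemo() {
        for (String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }

    // True if any pattern matches anywhere in the input.
    boolean hasPII(String s) {
        for (Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PIIDetectorDemo d = new PIIDetectorDemo();
        System.out.println(d.hasPII("My SSN is 123-45-6789"));         // true
        System.out.println(d.hasPII("Write to jane.doe@example.com")); // true
        System.out.println(d.hasPII("Call 206-555-0100 tomorrow"));    // true
        System.out.println(d.hasPII("Nothing sensitive here"));        // false
    }
}
```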

And there we have it: a simple PII detection routine that you can customize to your requirements (and local phone-number formats). But really, this is good for any situation where you have a number of regular expressions to match.

And with a little tweaking, you can mask out these matches instead. Here’s the code; I’ll let you dig into the details.

create function maskpii(s string)
returns string
language java
returns null on null input
handler = 'PIIDetector.maskPII'
as
$$
import java.util.regex.*;
import java.util.*;

public class PIIDetector {
    
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };    
    
    static final String MASK = "###";    

    ArrayList<Pattern> patterns;    
    
    public PIIDetector() {
        patterns = new ArrayList<Pattern>();
        for(String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }    
    
    public String maskPII(String s) {
        for(Pattern p : patterns) {
            s = p.matcher(s).replaceAll(MASK);
        }
        return s;
    }
}
$$
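As with detection, the masking logic can be exercised locally. The sketch below is a stand-alone copy with a made-up class name (MaskDemo) and made-up inputs. The second input shows one limit of the simple email regex: the final `{2,4}` only consumes up to four characters of a longer top-level domain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone copy of the masking logic, runnable without Snowflake.
// The class name MaskDemo and the sample strings are illustrative only.
final class MaskDemo {

    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };

    static final String MASK = "###";

    private final List<Pattern> patterns = new ArrayList<>();

    MaskDemo() {
        for (String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }

    // Applies each pattern in turn, replacing every match with MASK.
    String maskPII(String s) {
        for (Pattern p : patterns) {
            s = p.matcher(s).replaceAll(MASK);
        }
        return s;
    }

    public static void main(String[] args) {
        MaskDemo m = new MaskDemo();
        // prints: Contact ### or ###; SSN ###.
        System.out.println(m.maskPII(
            "Contact jane.doe@example.com or 206-555-0100; SSN 123-45-6789."));
        // prints: ###um  (only "muse" of ".museum" is consumed)
        System.out.println(m.maskPII("curator@gallery.museum"));
    }
}
```

Widening the last quantifier, or adding patterns for other formats, is the kind of tweak the TARGETS array is meant to make easy.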

Happy hacking!

A leopard can't change his spots. (But he may change jobs.)

Isaac — Sun, 15 Jul 2018 00:00:00 +0000

I won’t bury the lede: My last day at Tableau was July 6th, and tomorrow I start a new gig at Snowflake.

I joined Tableau in June of 2015, and spent most of my three years there starting, building, and ultimately shipping Tableau Prep. I’m incredibly proud of the Prep team, the product we put together, and the awesome functionality yet to come.

As I move on, I’ve been thinking a bit about the past projects that really excited me. In addition to Prep, my favorites were probably StreamInsight, which was a system for dealing with time-aware queries and streaming data, and the spatial types in SQL Server. (Those types are still going strong and motivating new integrations ten years later.)

A common theme through all of these projects has been making it easy to do complex things with data. And Snowflake is most certainly out to do that with data warehousing. It feels like a wonderful match.

I’m going to miss Tableau — it’s a wonderful company — and I’m going to miss Prep. But I’m incredibly excited to be starting at Snowflake. (And a special thanks to those Preppies that slipped Snowflake support into the latest Prep release. That should save me some awkward moments.)

I’ll try to keep writing here — maybe with a broader set of topics, and hopefully with a bit more regularity. So do please check in and drop me a note.

Tableau Prep: The Power of Composability

Isaac — Wed, 09 May 2018 00:00:00 +0000

When we built Tableau Prep, we put a premium on ensuring composability of operations: you can take the operations Prep supports and string them together in any combination you need. There are no restrictions based on where the data came from, or what operations came before.

This means that you never need to think about whether a particular operation is supported in your particular situation: if Prep supports it ever, Prep supports it always. Moreover, this gives you a lot of power to do what you need to with your data.

In the rest of this post, we’ll walk through a Superstore example that highlights this power.

The Problem

Let’s start with the sample Superstore data from Tableau Desktop. This data set is a list of order details: each row represents one item from an order, with multiple line items accruing to each order.

Given these data, let’s try to fulfill what seems like a simple request:

Get the order details for customers with fewer than the median number of orders.

This seems relatively straightforward… or is it? In cases like this, I often find it helpful to think backwards to come up with a solution:

Step 4: If we had a list of customers with fewer than the median number of orders, we could cull the order details down to just those from customers on the list. But we don’t have a list of these sub-median customers.

Step 3: If we knew the median number of orders, we could prune the list of customers down to those with fewer than the median. But we don’t have the median number of orders.

Step 2: If we knew the count of orders for each customer, we could aggregate it to find the median number of orders over all customers. But we don’t have the number of orders for each customer.

Step 1: If we had the list of orders for each customer, we could aggregate to get the count for each customer. And we do have the order list!

Now we have a plan: we’ll start with the order details we have, and climb the ladder outlined above to get to the solution.
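The same four-step plan can be sketched outside Prep in plain Java. This is only an illustration of the logic, not of how Prep executes a flow: the OrderLine class, the sample data, and the median convention (average of the two middle values for an even count) are all assumptions made for this example.

```java
import java.util.*;
import java.util.stream.Collectors;

// Plain-Java sketch of the four-step plan. All names and data are made up.
final class SubMedianCustomers {

    static final class OrderLine {
        final String customer, orderId, item;
        OrderLine(String customer, String orderId, String item) {
            this.customer = customer; this.orderId = orderId; this.item = item;
        }
    }

    // Step 1: count distinct order IDs per customer.
    static Map<String, Long> orderCounts(List<OrderLine> lines) {
        Map<String, Set<String>> ids = new TreeMap<>();
        for (OrderLine l : lines) {
            ids.computeIfAbsent(l.customer, k -> new HashSet<>()).add(l.orderId);
        }
        Map<String, Long> counts = new TreeMap<>();
        ids.forEach((customer, set) -> counts.put(customer, (long) set.size()));
        return counts;
    }

    // Step 2: median of the counts; note the result is a double.
    static double median(Collection<Long> values) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
            ? sorted.get(n / 2)
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Steps 3 and 4: keep customers below the median, then keep their lines.
    static List<OrderLine> subMedianLines(List<OrderLine> lines) {
        Map<String, Long> counts = orderCounts(lines);
        double med = median(counts.values());
        Set<String> keep = counts.entrySet().stream()
            .filter(e -> e.getValue() < med)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
        return lines.stream()
            .filter(l -> keep.contains(l.customer))
            .collect(Collectors.toList());
    }

    // Tiny hypothetical data set: Ann has 3 orders, Bob 1, Cy 2.
    static List<OrderLine> sampleLines() {
        return Arrays.asList(
            new OrderLine("Ann", "O1", "Chair"), new OrderLine("Ann", "O1", "Desk"),
            new OrderLine("Ann", "O2", "Lamp"),  new OrderLine("Ann", "O3", "Paper"),
            new OrderLine("Bob", "O4", "Binder"), new OrderLine("Bob", "O4", "Pens"),
            new OrderLine("Cy", "O5", "Phone"),  new OrderLine("Cy", "O6", "Table"));
    }

    public static void main(String[] args) {
        for (OrderLine l : subMedianLines(sampleLines())) {
            System.out.println(l.customer + " " + l.orderId + " " + l.item);
        }
    }
}
```

With this sample the median is 2, so only Bob’s two line items survive. Each method consumes the output of the one before it, which is the same composition the flow expresses visually.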

The Solution

We start by loading the Superstore data:

As we’ve already observed, these are order details. Each order has a distinct Order ID, but may have more than one line.

Step 1

Following our plan, the first thing we need to do is get the count of orders for each customer. To do this we introduce an aggregate: we group by customer and count the distinct number of Order IDs:

The distinct makes it so repeated Order IDs — which come from having more than one order detail line per order — are only counted once.

So we don’t confuse ourselves later, we’ll rename Order ID to Number of Orders:

Step 2

Now that we have the number of orders for each customer, we can aggregate again to find the median number of orders per customer:

This aggregate is a little funny: There’s no grouping field, so we don’t partition the table at all. The result is an odd little table with one row and one column, but this record represents the median over all customers we were looking for.

We’ll rename this once again:

Step 3

With the median number of orders in hand, we can join it with our list of customers and order counts to filter down that list. I.e., we’ll join it with the result of our first aggregate:

Note the join clause here: we’re doing an inner join, but matching when the median is greater than the customer’s order count. We also have an error: the types don’t match because the result of the median is a floating-point number, not an integer.

If we correct the type, we get our list of customers with fewer than the median number of orders:

Step 4

Now that we have our customer list, we’re ready to cull the line items. We’ll again use a join as a filter, but this time we’re joining our latest table with the original input:

You can see that there are a bunch of records dropping out from the right: those were the customers with more than the median number of orders. What remain are the line items we care about:

Wrapping Up

At this point, you might want to clean up a few of the columns we created along the way, but our data are ready to output to Tableau or anywhere else you want to take them.

This may seem a little complex – and it’s clearly stretching our flow layout algorithm – but it makes a perfectly fine flow. There was no operator that solved our problem out of the box, but composability made it possible to mix-and-match the operations present to build a computational machine for our task.

We certainly aren’t done adding operations to Prep, but there’s a rich set already present. And with a little composition, you can make them do some pretty cool tricks.

Tableau Prep: The Flow

Isaac — Mon, 07 May 2018 00:00:00 +0000

I’ve been a bit quiet lately, but Tableau Prep is out the door and it’s time to make a little noise.

Clark recently wrote an excellent post on the basic UX architecture of Prep. Here I’d like to cover a key concept underlying Prep that may be a bit foreign to people coming from Tableau: the flow.

This isn’t the most glamorous part of Prep, but it is one of the most fundamental concepts in the tool, so it seems worth spending some quality time on.

Strap on your life jacket and read on for more.

Data In; Data Out

To understand flows, we start with steps, which are the conceptual unit of work in Tableau Prep. Every time you take an action on your data in Prep, you’re adding a step. For example, if we take the world consumer price index data included with the product and add a filter, we find that a new step is added to the flow:

Each item in the flow pane represents a step, and each step works in the same way: data come in from the left, are modified by the step, and leave to the right:

Some steps — cleaning steps — may have multiple sub-steps, or changes. These are just like steps in the flow, but are smaller increments of work. They flow top to bottom:

We group these together to help conceptually simplify the flow, but each change acts just like any other step: rows come in, they’re modified, and they go out.

Some steps — such as joins — have multiple inputs, but they work the same way: two sets of data come in from the left, they’re put together, and the result leaves to the right:

And where do they go? On to the next step! Some steps may even have multiple outputs, with the data going to multiple targets:

Step-by-step we build up a flow: an ordered sequence of steps that does what we want.

Clarity and Control

That ordering is a key aspect of flows. If you’re coming from Tableau, you may be aware that it performs operations in a particular order, but the system doesn’t advertise this, and generally you don’t need to think about it.

But order sometimes matters, and we designed Prep with those times in mind. The CPI data contain both a food index and a general index. Let’s say that we’ve pivoted the data, and now want to compare each country’s CPI to the global average for each year — except we only care about the food index.

To do this, we’ll first filter to keep only the food index:

And then we’ll aggregate by year:

Order matters: if we did the aggregate first, we would have folded in the general CPI as well.
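The effect of ordering is easy to see with a few numbers. The sketch below uses made-up CPI rows and plain Java rather than anything from Prep; the Row class, the index labels, and the values are all hypothetical.

```java
import java.util.*;
import java.util.stream.Collectors;

// Shows that "filter, then aggregate" and "aggregate everything" differ.
// The rows and names are invented for illustration.
final class FilterThenAggregate {

    static final class Row {
        final int year; final String index; final double value;
        Row(int year, String index, double value) {
            this.year = year; this.index = index; this.value = value;
        }
    }

    // Keep only rows for the food index.
    static List<Row> foodOnly(List<Row> rows) {
        return rows.stream()
            .filter(r -> r.index.equals("Food"))
            .collect(Collectors.toList());
    }

    // Average value per year over whatever rows are passed in.
    static Map<Integer, Double> averageByYear(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            r -> r.year, TreeMap::new, Collectors.averagingDouble(r -> r.value)));
    }

    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row(2016, "Food", 110.0), new Row(2016, "Food", 130.0),
            new Row(2016, "General", 100.0), new Row(2016, "General", 104.0));
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows();
        // Filter first: only food rows reach the aggregate.
        System.out.println(averageByYear(foodOnly(rows))); // {2016=120.0}
        // Aggregate first: general CPI is folded in, and the index column
        // no longer exists afterwards, so it cannot be filtered out.
        System.out.println(averageByYear(rows));           // {2016=111.0}
    }
}
```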

This kind of ordering is explicit in Prep. You don’t have to guess, and you don’t need to coax the system into doing what you want: you just build your flow in the order that fits your problem.

And with Prep, you can always go back and see your data at any point along the flow. Just click back and look. This way you can see and control what the flow is doing to your data every step along the way.

Prep is a Competent Cook

We can add another metaphor: think of a flow as a recipe, and let’s take a moment to bake some cookies.

We’ve already mixed the wet ingredients — the eggs, the vanilla, the butter — when we get to this part of the recipe:

…
Measure 1.5 cups flour
Add 1/4 teaspoon salt
Add 1/2 teaspoon baking powder
Mix thoroughly
Add dry ingredients to wet ingredients
…

A competent cook would mix these dry ingredients before adding them to the wet, but they would take the liberty of combining them in any convenient order: they know it’s irrelevant.

Tableau Prep is a competent cook. It can figure out many cases where the order won’t matter, and can rearrange them to make your flow run more efficiently. But it will only do this when the reordering won’t affect the results that you intended.

So while the flow gives a conceptual order to the operations, they may not actually be executed in that order at all. The result is that you can ignore order when it doesn’t matter, but rely on it when it does.

More than Just Flows

The notion of a flow is not unique to Tableau Prep, and it isn’t Prep’s most distinguishing feature. The way that Prep uses samples to give you immediate feedback, the way we use analytics to help you see what needs to be done, and the direct manipulation all more directly contribute to what makes Prep special.

But understanding flows is central to understanding how to make Prep do exactly what you want, and it can be a bit of a leap for folks coming from Tableau Desktop. I hope this helps make that leap a little easier.

Happy hacking!

When Live Beats an Extract

Isaac — Wed, 14 Mar 2018 00:00:00 +0000

When using Tableau, taking an extract is always better than using a live query, right?

Well, no.

Of course. Obviously, when your data are changing and you want to get all of the latest updates in your viz, you’ll want to use a live query. But if that’s not the case, then an extract is clearly better, especially with Hyper in 10.5, right?

Well, no!

Shoot! This is complicated? When will live beat an extract? Let’s take a look at a few cases.

A Few Basics

To understand what’s going on, you should have a basic understanding of how live and extracted data sources are used by the system. If you feel a bit shaky here, I’d recommend my previous post on live vs extracts. But in a nutshell:

When you’re using an extract, the query defined by the data source is run and the whole resulting table is persisted in either a TDE (in Tableau 10.4 or before) or a Hyper database (in 10.5 and later). The queries produced by your workbook are then run against this table.
When you’re running live, the queries from your workbook are composed with the data source query. In simple cases, at least, this will result in a single query that is pushed down to the target database system, and only the results needed for the viz are returned.
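If it helps, here is a rough sketch of those two modes in Python. Everything in it is an assumption made for illustration: SQLite stands in for both the remote database and the extract, and the table contents, query text, and variable names are invented. It only aims to show where the rows travel in each mode.

```python
import sqlite3

# Stand-in for the remote database (assumption: SQLite, not a real server).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE history (ticker TEXT, date TEXT, close REAL)")
remote.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("AAA", "2017-01-03", 10.0),
        ("AAA", "2017-01-04", 12.0),
        ("BBB", "2017-01-03", 20.0),
        ("BBB", "2017-01-04", 22.0),
    ],
)

DATA_SOURCE_QUERY = "SELECT ticker, date, close FROM history"
# The workbook query is written against the data source, whatever backs it.
WORKBOOK_QUERY = (
    "SELECT ticker, AVG(close) FROM ({source}) GROUP BY ticker ORDER BY ticker"
)

# Extract: run the data source query once, persist every row locally,
# then answer workbook queries from the local copy.
extract_rows = remote.execute(DATA_SOURCE_QUERY).fetchall()
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE extract (ticker TEXT, date TEXT, close REAL)")
local.executemany("INSERT INTO extract VALUES (?, ?, ?)", extract_rows)
extract_result = local.execute(
    WORKBOOK_QUERY.format(source="SELECT * FROM extract")
).fetchall()

# Live: compose the workbook query with the data source query and push the
# whole thing to the remote system; only the aggregated rows come back.
live_result = remote.execute(
    WORKBOOK_QUERY.format(source=DATA_SOURCE_QUERY)
).fetchall()

rows_moved_for_extract = len(extract_rows)  # the whole table
rows_moved_for_live = len(live_result)      # one row per ticker
```

Both modes give the same answer; what differs is how many rows cross the wire and which engine does the aggregation.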

We’re going to look at a few cases where live can do better than an extract. As we look at them, pay particular attention to:

The time to run the remote query,
The time to transfer the data, and
The time to run the local query.

These aren’t rigorous perf numbers, but to give you a sense of scale, here’s my setup:

Tableau 10.5 (with Hyper) running on an i5-2500 with 8GB of RAM.
SQL Server 2017 Express Edition running on an i7-3770 with 16GB of RAM.
All wired together over gigabit Ethernet.

So nothing too grand. In any case, the lessons here should carry over to other hardware.

The data set is a stock history set from Kaggle that records daily stats for a large number of stocks and ETFs. The schema looks like:

history(ticker, type, date, open, high, low, close, volume, openInt)

Loaded into SQL Server and indexed on (ticker, date), this results in 17.4M rows and about 1.5GB of storage. (I have no idea what the provenance or accuracy of these data is, but for this work only the size is relevant.)

Let’s try to beat an extract!

Nail the Index

Let’s start with an easy case: let’s find the yearly average close for Tableau’s stock. I’ll drag the ticker into filters, years into columns, and Avg(Close) into rows. It’s an award-worthy viz:

This is also an almost ideal query for our SQL Server database: it makes excellent use of the index, so the query is exceptionally fast to run; and because the aggregation happens remotely, there are almost no results to send over the wire. By looking in the log, I find that it takes a whole 0.006 seconds to run this query and fetch the results. How can we possibly beat that?
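For the curious, the query behind this viz has roughly the following shape. This is a sketch under assumptions, not the SQL Tableau actually generated: it uses SQLite via Python instead of SQL Server, a handful of made-up rows, a hypothetical index name, and 'DATA' as the ticker value. The point is only that the filter lines up with the leading column of the (ticker, date) index and the aggregation collapses the rows before anything is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (ticker TEXT, type TEXT, date TEXT, open REAL,"
    " high REAL, low REAL, close REAL, volume INTEGER, openInt INTEGER)"
)
# Same index shape as in the post; the index name is an invention.
conn.execute("CREATE INDEX idx_history_ticker_date ON history (ticker, date)")
conn.executemany(
    "INSERT INTO history (ticker, type, date, close) VALUES (?, ?, ?, ?)",
    [
        ("DATA", "stock", "2016-03-01", 50.0),
        ("DATA", "stock", "2016-09-01", 54.0),
        ("DATA", "stock", "2017-03-01", 60.0),
        ("DATA", "stock", "2017-09-01", 70.0),
        ("OTHER", "stock", "2017-03-01", 5.0),
    ],
)

YEARLY_AVG = """
    SELECT strftime('%Y', date) AS year, AVG(close) AS avg_close
    FROM history
    WHERE ticker = ?
    GROUP BY year
    ORDER BY year
"""

# One row per year comes back, however many daily rows the ticker has.
yearly = conn.execute(YEARLY_AVG, ("DATA",)).fetchall()

# Ask the engine how it plans to run the query; the last column of each
# plan row is a human-readable description of the access path.
plan = conn.execute("EXPLAIN QUERY PLAN " + YEARLY_AVG, ("DATA",)).fetchall()
plan_details = [row[-1] for row in plan]
```

Printing `plan_details` is a quick way to check whether the engine searches the index or scans the whole table. SQL Server has its own execution plan tooling for the same purpose.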

Indeed, if we recreate the same viz with an extract, Hyper takes more like 0.2 seconds to compute the viz.

So SQL Server is faster than Hyper? Well, in this case it is, but we’ve almost cheated by practically tuning it to answer this query quickly. Hyper, on the other hand, doesn’t require (and doesn’t allow) us to tune its setup. So we’re comparing the best case for SQL Server to an untuned case for Hyper.

But the lesson is still sound: if your query (a) lines up well with the setup of your remote database, and (b) transfers very little data, then a live query can actually beat a Hyper extract.

Be Truly Ad Hoc

Let’s try to avoid pandering to SQL Server quite so much and just ask for the number of records my data set has each year:

Now SQL Server takes a bit longer: 5.53 seconds. Trying this against the extract shows what Hyper can do: 0.193 seconds. In this case, both engines have to do roughly the same amount of work, but with its column-based, in-memory execution, Hyper is the clear winner!

Except that we haven’t taken into account the cost of generating the extract. When we refresh it, we find that it takes us 67.8 seconds to generate a 435MB extract. If we add that in, SQL Server starts looking pretty good:

Applying a little algebra (67.8 / (5.53 - 0.193) ≈ 12.7), that means that to recoup the cost of our extract, we’d need to run our viz query about 13 times. Often this will be worth it, but if the query is truly a one-off, I’d rather spend 5.53 seconds than 68.
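The algebra is simple enough to sketch in a few lines of Python. The function and parameter names are mine, and the model is deliberately crude: it assumes every run of the viz query costs the same and that the extract is built exactly once. The exact count depends on which timings you plug in.

```python
def break_even_runs(extract_build_s, extract_query_s, live_query_s):
    """Runs n at which build + n * extract_query equals n * live_query."""
    saving_per_run = live_query_s - extract_query_s
    if saving_per_run <= 0:
        raise ValueError("the extract never pays for itself if live is as fast")
    return extract_build_s / saving_per_run

# Timings quoted in this section, all in seconds.
runs = break_even_runs(
    extract_build_s=67.8,   # time to generate the extract
    extract_query_s=0.193,  # viz query against the extract
    live_query_s=5.53,      # viz query run live against SQL Server
)
```

Anything below `runs` executions favors live; anything above it favors the extract.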

Blow Up the Extract

Let’s try something more horrible. Let’s say that in addition to the historical stock prices, we have a table of customer holdings. We’ll keep it simple; our customers have static holdings that look like:

customerholdings(customer, ticker, amount)

(I don’t actually have any customers, so I randomly generated 20 holdings for each of 20,000 imaginary customers.)

We want to do things like look at the total value of all customers’ holdings over time, so we join the holdings to the price history.

We then create a calc to compute the value of each customer’s holdings and make a viz:

In case you’re interested, that giant spike is caused by a few odd stocks like DryShips Inc. (DRYS), which somehow peaked at $1,442,048,636.45 in 2007. I don’t comprehend. The graph looks funny, but again, this doesn’t matter for our analysis.

What we care about is that this query takes 133 seconds to run—it’s a fair bit of work for SQL Server to do. How about the extract?

Well, let’s do a little back of the envelope computation. If we execute the full join in SQL Server and don’t aggregate anything down, instead of the 17 million records in our history table, the result set will have about 441 million records. And these records are larger than the history rows because they have customer information as well.
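Here is a toy version of that blow-up, sketched in Python with SQLite standing in for SQL Server. The rows are invented, and the calc is assumed to be amount times close, which is my guess at the obvious definition. The unaggregated join has one row per holding per trading day, while the aggregated viz query returns only one row per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (ticker TEXT, date TEXT, close REAL)")
conn.execute(
    "CREATE TABLE customerholdings (customer TEXT, ticker TEXT, amount REAL)"
)
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    ("AAA", "2017-01-03", 10.0), ("AAA", "2017-01-04", 11.0),
    ("BBB", "2017-01-03", 20.0), ("BBB", "2017-01-04", 21.0),
])
conn.executemany("INSERT INTO customerholdings VALUES (?, ?, ?)", [
    ("alice", "AAA", 5), ("alice", "BBB", 1),
    ("bob", "AAA", 2),
    ("carol", "AAA", 3), ("carol", "BBB", 4),
])

JOIN = """
    FROM customerholdings AS c
    JOIN history AS h ON h.ticker = c.ticker
"""

history_rows = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]

# What an extract would have to store: one row per (holding, trading day).
joined_rows = conn.execute("SELECT COUNT(*) " + JOIN).fetchone()[0]

# What the live viz query returns: one row per date, aggregated remotely.
viz_rows = conn.execute(
    "SELECT h.date, SUM(c.amount * h.close) AS total_value "
    + JOIN
    + " GROUP BY h.date ORDER BY h.date"
).fetchall()
```

Even in this tiny case the join is larger than the history table it started from, and the gap grows with every customer that holds a given ticker.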

Optimistically, this will end up being something like 10 gigabytes of data that I have to move over the wire, and store in a local extract. And that’s all before I even get to ask my query. So unless I’m doing this a lot, I’m simply not going to bother.
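One way to sanity-check that envelope, sketched in Python: scale the per-row size of the 435MB extract up to the joined row count. This assumes the joined rows compress about as well as the history rows and that the bytes on the wire are no bigger than the bytes stored, both of which are generous. The transfer figure also assumes the gigabit link runs at full line rate with no protocol overhead.

```python
EXTRACT_BYTES = 435e6    # size of the extract generated earlier
EXTRACT_ROWS = 17.4e6    # rows in the history table
JOINED_ROWS = 441e6      # estimated rows in the unaggregated join
LINK_BITS_PER_S = 1e9    # gigabit Ethernet, ignoring overhead

# Average stored bytes per history row in the existing extract.
bytes_per_row = EXTRACT_BYTES / EXTRACT_ROWS

# Optimistic size of an extract holding the full join.
joined_bytes = JOINED_ROWS * bytes_per_row
joined_gb = joined_bytes / 1e9

# Lower bound on the time just to move that much data over the wire.
min_transfer_s = joined_bytes * 8 / LINK_BITS_PER_S
```

That lands in the same neighborhood as my estimate, and it leaves out the time SQL Server needs to produce the rows and the time Hyper needs to write them.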

Wrapping Up

So we’ve seen a few cases where live queries may be preferable to extracts, leaving aside the obvious cases where you simply want the most current data.

One thing we didn’t talk about is federated queries: queries that span multiple data sources. As a general rule, federation makes extracts look better relative to live, because live starts to look worse. Live works best when the engine can push operations that reduce data volumes off to the remote system—operations like aggregations and filters—and federation tends to interfere with that pushdown.

But that’s another ball of wax. I’ll write more on federation soon.