<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Posts on stdin</title><link>https://stdin.org/posts/</link><description>Recent content in Posts on stdin</description><generator>Hugo -- 0.161.1</generator><language>en</language><copyright>Isaac Kunen</copyright><lastBuildDate>Mon, 31 Mar 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://stdin.org/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Lessons From the Signal Leak</title><link>https://stdin.org/lessons-from-the-signal-leak/</link><pubDate>Mon, 31 Mar 2025 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/lessons-from-the-signal-leak/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I&rsquo;ve found a lot of the coverage about the Trump administration&rsquo;s <a href="https://www.theatlantic.com/politics/archive/2025/03/signal-group-chat-attack-plans-hegseth-goldberg/682176/">accidental leak</a> of Signal messages to The Atlantic frustrating. To read most of the coverage, the major mistake was Mike Waltz&rsquo;s inclusion of Jeffrey Goldberg in the conversation, and the big questions are about the impact of this particular leak.
I have two alternative takeaways, one about the general security attitude, and one about Signal itself.</p>
<p>First the broad one.</p>
<p>Two of the things to consider when evaluating the security of a system are the capability and motivation of the assumed attacker. Most of us are worried about relatively unsophisticated adversaries that don&rsquo;t actually care that much about <em>us</em> in particular. We want to guard against the cyber-criminal out to get a credit card number, or a telco that wants to sell our info to advertisers. If we&rsquo;re harder to hack than the next guy, they&rsquo;ll just move on.</p>
<p>It&rsquo;s clearly a different case when, for example, law enforcement gets interested in <em>you</em> specifically: the adversary (&ldquo;The Law&rdquo;) is now motivated to expend significant, directed effort, and can bring in reasonably sophisticated resources, such as the FBI, to break into your device and messages.
If this is your adversary, your job is much harder, and their rate of success goes up substantially.</p>
<p>But we&rsquo;re talking about people like the Vice President, Secretary of Defense, and Director of National Intelligence:
people who would be at the top of any US adversary&rsquo;s &ldquo;to bug&rdquo; list. The motivation is extreme. And, particularly with Russia and China, we&rsquo;re talking about highly sophisticated attackers.</p>
<p>And so it is a reasonable assumption that <em>any</em> commodity device, like those running Signal, owned by these individuals has been compromised, and that <em>every</em> conversation they have on them is being scooped up by Beijing and Moscow.
For the same reason, it&rsquo;s a reasonable assumption that their personal laptops, cars,
and homes have all been bugged.</p>
<p>This is why the government has separate systems and physical locations to hold this kind of conversation. An isolated, stripped-down, purpose-built system would be much easier to secure than even a minimal Android or iOS device.</p>
<p>We know about this particular case because Mr. Waltz accidentally included Mr. Goldberg on the conversation, leaking the whole thing to The Atlantic. But this was a minor snafu in the grand scheme of things. <em>The big mistake, made by all of the people in the group, was having the conversation on a commodity platform to begin with.</em> And while we know about this particular conversation, we don&rsquo;t know how many others these individuals have broadcast to America&rsquo;s adversaries.</p>
<p>This was dumb &ndash; potentially criminally dumb &ndash; behavior by officials who should have known better, and should disqualify all of these individuals from handling classified information in the future.</p>
<p>Beyond this, I think there is a lesson to be learned from the accidental inclusion of Mr. Goldberg on this conversation, and it&rsquo;s not that Mr. Waltz is an idiot (even if he may be): it&rsquo;s about user interfaces and
Signal&rsquo;s security model.
I don&rsquo;t know whether Mr. Waltz included the <em>wrong</em> Jeffrey Goldberg in the conversation, or just fat-fingered his contacts list, but either way, Signal couldn&rsquo;t have warned him that what he was doing was dumb, because Signal doesn&rsquo;t have any notion of an organization or its security boundaries: people are just people.</p>
<p>Indeed, if this group had used Slack for the conversation, or if they were jointly editing a Google Doc, the system would have almost certainly been locked down to avoid the accidental inclusion of any individual outside of the organization. Adding a person from The Atlantic would have <em>at least</em> provided a warning that this was a bad idea, and Mr. Waltz would have almost certainly not made the error.</p>
<p>I’m a fan of (and a donor to) Signal, but the lack of these organizational boundaries is a good argument against its organizational use. And I suspect that for Signal, this is just fine: that&rsquo;s not the use case they&rsquo;re targeting.</p>]]></content></item><item><title>Five Eights</title><link>https://stdin.org/five-eights/</link><pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/five-eights/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>With the recent <a href="https://www.aboutamazon.com/news/company-news/leadership-update-aws-adam-selipsky-matt-garman">news</a>
that Adam Selipsky is stepping down from AWS, I thought I&rsquo;d share my funny Adam story.</p>
<p>When Adam was at the helm of <a href="https://www.tableau.com/">Tableau</a>,
I was a PM on the <a href="https://www.tableau.com/products/prep">Prep</a> team.
I never knew Adam closely, but I did see him in meetings a fair bit.
My impression was of a very capable senior exec who had a
clear idea of what he wanted and how things should be run.
He had high expectations for those around him, and exuded a geeky awkwardness.
I liked him, but he was definitely no teddy bear.</p>
<p>I can&rsquo;t remember the details, but we were meeting with Adam
to discuss some product direction. It was a small group &ndash; fewer than
ten of us &ndash; and Adam was late. So the rest of us were cooling our heels.</p>
<p>Eventually Adam showed up and apologized. He
explained that he was at a meeting with the Tableau Online team about
service stability, and that it had run long.
I knew that this was a major problem, and didn&rsquo;t begrudge him taking the time.
Then I opened my trap.</p>
<p>&ldquo;Oh, are we up to <a href="https://en.wikipedia.org/wiki/High_availability#Percentage_calculation">five eights</a> yet?&rdquo; I joked.</p>
<p>And panicked. What had my big mouth done? I could hear the silence.</p>
<p>But it didn&rsquo;t last more than a beat before Adam picked it up.</p>
<p>&ldquo;Ooh, I like that&rdquo;, he said, turning to an imagined customer and going into salesman mode.
&ldquo;Would you rather have three nines or <em>five eights</em>?&rdquo;</p>
<p>He laughed and said, &ldquo;I&rsquo;m going to use that.&rdquo;</p>
<p>I exhaled.</p>
<p>I recall that we got torn to shreds in that meeting,
but it was for good reason, not my joke.
I had a lot of respect for Adam, but seeing his sense of humor made me <em>like</em> him more.
I wish him the best.</p>]]></content></item><item><title>Two Unequal Products</title><link>https://stdin.org/two-unequal-products/</link><pubDate>Mon, 03 Oct 2022 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/two-unequal-products/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I&rsquo;ve been watching some of <a href="https://www.youtube.com/c/TimothyGowers0/videos">Timothy Gowers&rsquo; videos</a>
in which he documents his attempts to solve various mathematics problems.
Gowers&rsquo; goal is to provide
some examples of the mathematical thought process for others to study. I don&rsquo;t
have any deep insights on this to share, but watching the mental process of a
<a href="https://en.wikipedia.org/wiki/Timothy_Gowers">serious mathematician</a>
as he tackles a problem is certainly interesting.
And the problems are interesting themselves.</p>
<p>The <a href="https://youtu.be/NmEVwJ_lJ1A">second problem</a>
Gowers tackles is the topic of this post. He solves
it, but the solution doesn&rsquo;t feel particularly satisfying. It doesn&rsquo;t feel
satisfying to him, either, so he tries
<a href="https://youtu.be/vsRw6oLIUT4">another path</a>
towards a simpler solution that doesn&rsquo;t pan out. Here, I take a pass at it myself.</p>
<h2 id="the-problem">The Problem<a href="#the-problem" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Here&rsquo;s Gowers&rsquo; statement of the problem:</p>
<blockquote>
<p>Prove that for every positive integer \(n\), there do not exist positive
integers \(a\), \(b\), \(c\), \(d\) with \(ad=bc\) and \(n^2 < a < b < c < d < (n+1)^2\).</p>
</blockquote>
<p>I suggest that you take some time to think this through and go watch
Gowers&rsquo; videos before reading on. Below is my solution. I took a lot
longer to get to this than Gowers, but the result seems reasonably elegant.</p>
<h2 id="some-intuition">Some Intuition<a href="#some-intuition" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Before jumping into it, I want to say a few words about my intuition
for the problem. Clearly, if the numbers \(a\), \(b\), \(c\), and \(d\) were
arbitrary reals or rationals, then it would be easy to come
up with values that make this work. So for this to <em>fail,</em> we&rsquo;re going
to have to make use of properties that are special to the integers.</p>
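<p>To make that concrete, here is one such choice. The numbers are my own illustration (they are not from Gowers&rsquo; videos), taking \(n = 1\) and allowing rationals:</p>
$$
a = \tfrac{3}{2},\quad b = 2,\quad c = \tfrac{9}{4},\quad d = 3
\quad\implies\quad
ad = \tfrac{9}{2} = bc
\quad\text{and}\quad
1 < a < b < c < d < 4
$$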
<p>In particular, I want to use the inequality to generate some extra
space that I can use to show that the gap between \(n^2\) and \((n+1)^2\)
isn&rsquo;t large enough to hold our numbers. My initial attempts were to
observe that over the integers, \(a>n^2\) means that \(a\geq n^2+1\), that
\(b\geq n^2+2\), etc. But I wasn&rsquo;t able to use this by itself to generate
a large enough gap for the proposition to fail.</p>
<p>The other property of integers is that they factor. And putting this
together with the observation above does generate enough space. Let&rsquo;s see
how this works.</p>
<h2 id="my-solution">My Solution<a href="#my-solution" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Assume that the statement were false, so that such \(a\), \(b\), \(c\), and \(d\) exist; we&rsquo;ll derive a contradiction.
Given that \(ad=bc\), we can write</p>
$$\tag{1} 
{ad \over b} = c
$$<p>Since these are all
positive integers, we can expand out \(a\) and \(d\) as products of
(non-distinct) primes:
\(a = p_1 p_2 \ldots p_m\) and \(d = q_1 q_2 \ldots q_r\).
And since the result of the division is an integer, unique factorization tells us that
\(b\) must be the product of a
subset of these \(p\) and \(q\) values, with \(c\) being the
product of the remaining factors.
Explicitly, we can rewrite equation (1) as:</p>
$$
{
{ p_1 p_2 \ldots p_m q_1 q_2 \ldots q_r }
\over
{ p_{\alpha_1}\ldots p_{\alpha_k} q_{\beta_1}\ldots q_{\beta_l} }
}
= { p_{\gamma_1}\ldots p_{\gamma_i} q_{\delta_1}\ldots q_{\delta_j} }
$$<p>Where the \(p_\alpha\)s and \(p_\gamma\)s account for all of the
\(p_1,\ldots,p_m\) and \(q_\beta\)s and \(q_\delta\)s account
for all of the \(q_1,\ldots,q_r\). If we collect up
all of the \(p\) terms used to create \(b\) as \(a_1\), and the leftover ones
as \(a_2\), and do likewise for the \(q\) terms to create \(d_1\) and \(d_2\),
we can rewrite the whole thing as:</p>
$$\tag{2}
{ {a_1 a_2 d_1 d_2}
 \over
 {a_1 d_1} }
= a_2 d_2
\quad\text{where}\quad
\begin{cases}
    a = a_1 a_2\\
    b = a_1 d_1\\
    c = a_2 d_2\\
    d = d_1 d_2
\end{cases}
$$<p>All of these terms are still positive integers (possibly 1), but we now have:</p>
$$
n^2 < \overbrace{a_1 a_2 < \underbrace{a_1 d_1} } < a_2 d_2 < \underbrace{d_1 d_2} < (n+1)^2
$$<p>Comparing the indicated terms, we can extract:</p>
$$
\begin{align}\tag{3}
  d_1 > a_2 &\implies d_1 \geq a_2 +1\\
  d_2 > a_1 &\implies d_2 \geq a_1 +1
\end{align}
$$<p>These implications make use of the fact that the terms are all integers.
Now we can see that:</p>
$$ 
\begin{aligned}
\boxed{n^2 + 2n + 1} = (n+1)^2 &> d \\
                               &= d_1 d_2 \\
                               &\geq (a_2 + 1)(a_1 + 1) \\
                               &= a_1 a_2 + a_1 + a_2 + 1 \\
                               &> \boxed{n^2 + a_1 + a_2 + 1}
\end{aligned}
$$<p>Has this forced enough space to generate a contradiction? Together, the
boxed terms tell us that:</p>
$$ 
\begin{aligned}
2n &> a_1 + a_2 \\
4n^2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\
4a_1 a_2 &> a_1^2 + 2a_1 a_2 + a_2^2 \\
0 &> a_1^2 - 2a_1 a_2 + a_2^2 \\
0 &> (a_1 - a_2)^2 
\end{aligned}
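$$<p>The second line squares both sides, which are positive; the third uses \(a_1 a_2 = a > n^2\). The last statement cannot hold, since the square of an integer is never negative, so
our assumption that such \(a\), \(b\), \(c\), and \(d\) exist must fail.</p>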
<div style="text-align: right">\(\blacksquare\)</div>
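<p>As a sanity check on the argument (not a substitute for it), the claim can also be brute-forced for small \(n\). The sketch below is my own addition: the names <code>UnequalProductsCheck</code> and <code>counterexamples</code> and the cutoff <code>maxN</code> are arbitrary choices, and it simply enumerates every \(a < b < c < d\) strictly between \(n^2\) and \((n+1)^2\), keeping any with \(ad = bc\).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scala" data-lang="scala">object UnequalProductsCheck {

  // All (a, b, c, d) with n^2 &lt; a &lt; b &lt; c &lt; d &lt; (n+1)^2 and a*d == b*c.
  def counterexamples(n: Int): Seq[(Int, Int, Int, Int)] = {
    val lo = n * n + 1             // smallest integer above n^2
    val hi = (n + 1) * (n + 1) - 1 // largest integer below (n+1)^2
    for {
      a &lt;- lo to hi
      b &lt;- (a + 1) to hi
      c &lt;- (b + 1) to hi
      d &lt;- (c + 1) to hi
      if a.toLong * d == b.toLong * c // Long arithmetic avoids overflow for larger n
    } yield (a, b, c, d)
  }

  def main(args: Array[String]): Unit = {
    val maxN = 30 // arbitrary cutoff; the search grows quickly with n
    val found = (1 to maxN).flatMap(counterexamples)
    println(s&#34;n = 1..$maxN: ${found.size} counterexamples&#34;)
  }
}
</code></pre></div><p>If the proof above is right, the list of counterexamples should come back empty for any cutoff you choose.</p>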
<h2 id="discussion">Discussion<a href="#discussion" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Making use of a few properties of the integers &ndash; factorization and
discreteness &ndash; pays off. By cleanly factoring them in step (2), and developing
an inequality on the factors in step (3), we&rsquo;re able to then amplify
the difference of the product enough to generate a contradiction.</p>
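<p>To see that amplification on concrete numbers, here is a small worked example. The values are my own illustration, chosen only because they satisfy \(ad = bc\); they are not from Gowers&rsquo; videos. Take \(a=6\), \(b=8\), \(c=9\), \(d=12\), so that \(ad = 72 = bc\). Splitting the prime factors as in step (2) gives:</p>
$$
a = \underbrace{2}_{a_1}\cdot\underbrace{3}_{a_2},\quad
b = \underbrace{2}_{a_1}\cdot\underbrace{4}_{d_1},\quad
c = \underbrace{3}_{a_2}\cdot\underbrace{3}_{d_2},\quad
d = \underbrace{4}_{d_1}\cdot\underbrace{3}_{d_2}
$$
<p>Step (3) then reads \(d_1 = 4 \geq a_2 + 1\) and \(d_2 = 3 \geq a_1 + 1\), and so:</p>
$$
d = 12 \geq (a_2+1)(a_1+1) = 12 = a + a_1 + a_2 + 1
$$
<p>With \(n = 2\) we have \(n^2 = 4 < a\), but \(d = 12\) lands well past \((n+1)^2 = 9\): these four numbers cannot share a gap between consecutive squares.</p>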
]]></content></item><item><title>Au Revoir, Snowflake!</title><link>https://stdin.org/au-revoir-snowflake/</link><pubDate>Tue, 06 Sep 2022 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/au-revoir-snowflake/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>Just reading this blog, you might guess
that all I do is leave jobs. First leaving
<a href="https://stdin.org/a-leopard-cant-change-his-spots-but-he-may-change-jobs/">Tableau</a>,
and now, four years later, departing
<a href="https://www.snowflake.com/">Snowflake</a>.</p>
<p>I&rsquo;m incredibly proud of what we accomplished at Snowflake, particularly with
<a href="https://www.snowflake.com/snowpark/">Snowpark</a>. Snowpark not only expands what
customers and partners can do with the platform, but also
provides a lot of flexibility
for Snowflake itself. I expect this to pay dividends for a long time.</p>
<p>Moreover,
the Snowpark team &ndash; and Snowflake engineering in general &ndash; was absolutely
top notch and a joy to work with.</p>
<p>So why leave?</p>
<p>Certainly not because of the people or for lack of interesting work.
Nor because of doubts about the company: Snowflake
is absolutely
<a href="https://www.cnbc.com/2022/08/24/snowflake-shares-soar-following-revenue-beat.html">crushing it</a>.
(And as a stockholder, I look forward to them continuing to crush it.)</p>
<p>This was a much more personal decision. I&rsquo;ve had a longstanding
ambivalence towards the software industry. Software
has provided me with a lot of interesting,
worthwhile problems to solve, and smart, engaging people to solve them with.
And it has paid the bills quite handsomely.</p>
<p>On the other hand, I&rsquo;ve always found myself drawn to the less practical side
of computing, mathematics, and the sciences &ndash; maybe it runs in
<a href="https://en.wikipedia.org/wiki/Kenneth_Kunen">the family</a>.
I was in academia once: a graduate student for all the wrong reasons,
and a poor one as a result. Now I&rsquo;m in a position to explore again, this time
with a bit more perspective.</p>
<p>Exactly how will this exploration play out? I have some ideas, but the
truth is that I&rsquo;m not yet entirely sure.</p>
<p>In the short term, my plans are to take a little time off, get a little
more involved in my kids&rsquo; schools, and start thinking about the future.
I&rsquo;ll also try to write a bit more about non-employment topics here,
as well as get some pictures posted on our new
<a href="https://kunen.net">family blog</a>.</p>
<p>Stay tuned!</p>
]]></content></item><item><title>Iterating Over Metadata With Snowpark</title><link>https://stdin.org/iterating-over-metadata-with-snowpark/</link><pubDate>Tue, 17 Aug 2021 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/iterating-over-metadata-with-snowpark/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p><em>(This was ported from my original <a href="https://medium.com/snowflake/iterating-over-metadata-with-snowpark-aa59598169bf">Medium post</a>.)</em></p>
<p>Hi Folks,</p>
<p><a href="/basic-pii-detection-using-java/">Last time</a>
we saw how to create simple Java functions to detect and mask personally identifying information (PII). For example, we could take a table containing some email messages and mask out the PII in the bodies with a simple query:</p>
<p><img src="/assets/2021/08/iterating_1.png" alt="one masked column"></p>
<p>But let’s say we wanted to mask out all of the PII. And let’s say that we had many more fields like you might find in something like survey results.</p>
<p>In this case, masking out the PII would be easy, but tedious: we’d have to apply the function manually to each column. And if the schema of our table were to change &ndash; or if we wanted to run this masking routine on a different table &ndash; we’d have to rewrite the query.</p>
<p>What we’ve run into is a pretty fundamental limitation in SQL: the query is tightly coupled to the underlying schema. There’s no way to pass a type parameter to the query or iterate over metadata.
<a href="https://docs.snowflake.com/en/developer-guide/snowpark/index.html">Snowpark</a>
doesn’t have this limitation: we can write code to inspect metadata and dynamically generate queries based on what we find.</p>
<p>To get started with Snowpark, you can follow the instructions on how to get it set up in your existing Scala development environment. Or you can follow the nice directions
<a href="https://medium.com/snowflake/from-zero-to-snowpark-in-5-minutes-72c5f8ec0b55">Zohar Nissare-Houssen has outlined here</a>
to get going using Docker.</p>
<p>Now using Snowpark for Scala, we can write a fully generic PII masking function:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scala" data-lang="scala"><span class="line"><span class="cl"><span class="k">val</span> <span class="n">maskAllPii</span> <span class="k">=</span> <span class="o">(</span><span class="n">df</span><span class="k">:</span> <span class="kt">DataFrame</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span></span><span class="line"><span class="cl">   <span class="k">val</span> <span class="n">toMask</span> <span class="k">=</span> <span class="n">df</span><span class="o">.</span><span class="n">schema</span>
</span></span><span class="line"><span class="cl">      <span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">dataType</span><span class="o">.</span><span class="n">typeName</span> <span class="o">==</span> <span class="s">&#34;String&#34;</span><span class="o">)</span>
</span></span><span class="line"><span class="cl">      <span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
</span></span><span class="line"><span class="cl">   <span class="n">df</span><span class="o">.</span><span class="n">withColumns</span><span class="o">(</span><span class="n">toMask</span><span class="o">,</span> 
</span></span><span class="line"><span class="cl">      <span class="n">toMask</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">c</span> <span class="k">=&gt;</span> <span class="n">callUDF</span><span class="o">(</span><span class="s">&#34;maskpii&#34;</span><span class="o">,</span> <span class="n">df</span><span class="o">.</span><span class="n">col</span><span class="o">(</span><span class="n">c</span><span class="o">))))</span>
</span></span><span class="line"><span class="cl"><span class="o">}</span>
</span></span></code></pre></div><p>This function takes in a DataFrame, inspects the schema, and applies the PII masking function we already have registered in Snowflake to each string column it finds, leaving non-string columns untouched. The result is just another DataFrame.</p>
<p>Now we can very easily run this on our email data…</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scala" data-lang="scala"><span class="line"><span class="cl"><span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">maskAllPii</span><span class="o">(</span><span class="n">sess</span><span class="o">.</span><span class="n">table</span><span class="o">(</span><span class="s">&#34;emails&#34;</span><span class="o">))</span>
</span></span></code></pre></div><p>…and fetch the results:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scala" data-lang="scala"><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span><span class="mi">100</span><span class="o">)</span>  <span class="c1">// get the first three lines, format wide
</span></span></span></code></pre></div><p><img src="/assets/2021/08/iterating_2.png" alt="all masked columns"></p>
<p>As you can see, the <code>maskAllPii()</code> call has touched all of the String columns. Under the covers, Snowpark has dynamically generated a plan that corresponds to a SQL query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="s2">&#34;ID&#34;</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">maskpii</span><span class="p">(</span><span class="s2">&#34;SENDER&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s2">&#34;SENDER&#34;</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">maskpii</span><span class="p">(</span><span class="s2">&#34;SUBJECT&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s2">&#34;SUBJECT&#34;</span><span class="p">,</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">maskpii</span><span class="p">(</span><span class="s2">&#34;BODY&#34;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s2">&#34;BODY&#34;</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="k">SELECT</span><span class="w">  </span><span class="o">*</span><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="n">emails</span><span class="p">))</span><span class="w">
</span></span></span></code></pre></div><p>When <code>show()</code> runs, it generates and issues the SQL,
wrapping this in an outer <code>LIMIT</code> clause and pretty-printing the result &ndash; that’s what <code>show()</code> does.</p>
<p>Of course, this query isn’t a hard one to write, though doing so does start to get a bit tedious as the column count goes up. And you have to do it again for each table or query you want to mask. Moreover, writing this yourself means more chances to make a mistake and miss a column.</p>
<p>In contrast, the Snowpark alternative is simple, robust, and reusable. And as a simple exercise, you can retool the example above to take a different function — or better yet, take an arbitrary function as a parameter.</p>
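<p>Here is one possible take on that exercise, as a sketch rather than a tested implementation. The helper name <code>applyToStringColumns</code> and its <code>udfName</code> parameter are made up for illustration, the imports assume the standard Snowpark for Scala package layout, and the Snowpark calls are the same ones <code>maskAllPii</code> already uses. It also assumes the named function is already registered in Snowflake and accepts a single string argument.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-scala" data-lang="scala">import com.snowflake.snowpark.DataFrame
import com.snowflake.snowpark.functions.callUDF

// Hypothetical helper: apply the named, already-registered UDF to every string column.
def applyToStringColumns(udfName: String)(df: DataFrame): DataFrame = {
   // Same schema inspection as maskAllPii: keep only the string columns
   val targets = df.schema
      .filter(_.dataType.typeName == &#34;String&#34;)
      .map(_.name)
   // Replace each string column with the UDF&#39;s output; other columns pass through
   df.withColumns(targets,
      targets.map(c =&gt; callUDF(udfName, df.col(c))))
}

// The original function is then just one specialization of the helper
val maskAllPii: DataFrame =&gt; DataFrame = applyToStringColumns(&#34;maskpii&#34;)
</code></pre></div><p>Currying the function name keeps the call site looking the same as before, so <code>maskAllPii(sess.table(&#34;emails&#34;))</code> should still read naturally.</p>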
<p>Happy hacking!</p>
]]></content></item><item><title>Basic PII Detection and Masking in Snowflake Using Java</title><link>https://stdin.org/basic-pii-detection-using-java/</link><pubDate>Wed, 28 Jul 2021 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/basic-pii-detection-using-java/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p><em>(This was ported from my original <a href="https://medium.com/snowflake/basic-pii-detection-and-masking-in-snowflake-using-java-1689ae63aa69">Medium post</a>.)</em></p>
<p>Hi Folks,</p>
<p>For my first foray into Medium, I wanted to share some code that I’ve used previously in demos. The examples here do basic detection and masking of personally-identifying information (PII) using Java’s built-in regular expression support.</p>
<p>Now, I make no assertion that these routines are good: if you really want to do robust PII detection, you probably want something more sophisticated than a few regexes. Snowflake is even working on
<a href="https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit/">data classification</a>
as a built-in feature.</p>
<p>But I like these examples because they do a good job of illustrating the basic pattern of Snowflake’s
<a href="https://docs.snowflake.com/en/developer-guide/udf/java/udf-java.html">Java functions</a>.
And they’re pretty malleable: you should be able to modify these examples to work for any situation where you need to detect or mask based on a set of regexes.</p>
<p>Let’s start with the code and then tear it apart. If you’re running on Snowflake and have Java functions enabled &ndash; any AWS account, for now &ndash; then you can define them right inline using this <code>create function</code>
command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="n">haspii</span><span class="p">(</span><span class="n">s</span><span class="w"> </span><span class="n">string</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="nb">boolean</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">language</span><span class="w"> </span><span class="n">java</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">input</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">handler</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;PIIDetector.hasPII&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">as</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">import</span><span class="w"> </span><span class="n">java</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">regex</span><span class="p">.</span><span class="o">*</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">import</span><span class="w"> </span><span class="n">java</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="o">*</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">public</span><span class="w"> </span><span class="k">class</span><span class="w"> </span><span class="n">PIIDetector</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="k">final</span><span class="w"> </span><span class="n">String</span><span class="p">[]</span><span class="w"> </span><span class="n">TARGETS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;\\d{3}-\\d{2}-\\d{4}&#34;</span><span class="p">,</span><span class="w">                 </span><span class="o">//</span><span class="w"> </span><span class="n">SSN</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}&#34;</span><span class="p">,</span><span class="w">  </span><span class="o">//</span><span class="w"> </span><span class="n">email</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;[2-9]\\d{2}-\\d{3}-\\d{4}&#34;</span><span class="w">             </span><span class="o">//</span><span class="w"> </span><span class="n">phone</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="p">;</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="w"> </span><span class="n">patterns</span><span class="p">;</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">public</span><span class="w"> </span><span class="n">PIIDetector</span><span class="p">()</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">patterns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">for</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">TARGETS</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="n">patterns</span><span class="p">.</span><span class="k">add</span><span class="p">(</span><span class="n">Pattern</span><span class="p">.</span><span class="n">compile</span><span class="p">(</span><span class="n">s</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">public</span><span class="w"> </span><span class="nb">boolean</span><span class="w"> </span><span class="n">hasPII</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">for</span><span class="p">(</span><span class="n">Pattern</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">patterns</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">matcher</span><span class="p">(</span><span class="n">s</span><span class="p">).</span><span class="n">find</span><span class="p">())</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">return</span><span class="w"> </span><span class="k">true</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">return</span><span class="w"> </span><span class="k">false</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">$$</span><span class="w">
</span></span></span></code></pre></div><p>With this in hand, anyone with permissions on the function can issue queries that use it without any knowledge of Java:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">haspii</span><span class="p">(</span><span class="n">body</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">from</span><span class="w"> </span><span class="n">emails</span><span class="w">
</span></span></span></code></pre></div><p>So let’s take the definition apart. The first section defines how the function will show up in SQL:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="n">haspii</span><span class="p">(</span><span class="n">s</span><span class="w"> </span><span class="n">string</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="nb">boolean</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">language</span><span class="w"> </span><span class="n">java</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">input</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">handler</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;PIIDetector.hasPII&#39;</span><span class="w">
</span></span></span></code></pre></div><p>Most of this is pretty self-explanatory: it’s a function that takes a string and returns a Boolean, and the language is Java. The <code>returns null on null input</code> clause lets me skip any null handling in my routine: null inputs are handled without calling into Java at all.</p>
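<p>As a rough illustration of what that clause saves, here is a sketch of the guard the handler would otherwise need. The class name <code>NullSafeDetector</code> and the single SSN pattern are assumptions made to keep the example small. Passing null to <code>Pattern.matcher</code> throws a <code>NullPointerException</code>, so when the clause is absent (or the code runs outside Snowflake) the check has to live in Java:</p>

```java
import java.util.regex.Pattern;

// Hypothetical variant for illustration; not part of the function defined above.
class NullSafeDetector {
    // A single SSN-style pattern, just to keep the sketch short.
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Returns null for null input, mirroring SQL's "null in, null out" behavior.
    static Boolean hasPII(String s) {
        if (s == null) {
            return null;
        }
        return SSN.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasPII(null));
        System.out.println(hasPII("SSN: 123-45-6789"));
        System.out.println(hasPII("nothing to see here"));
    }
}
```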
<p>The <code>handler</code> directive is new, and specifies where in the Java code to actually make a call. You may have many potential entry points, but in this case, Snowflake is going to call the <code>hasPII</code> method defined on the <code>PIIDetector</code> class.</p>
<p>The actual Java code is contained between the pairs of dollar signs. After a little boilerplate, we see a few regular expressions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-java" data-lang="java"><span class="line"><span class="cl"><span class="kd">static</span><span class="w"> </span><span class="kd">final</span><span class="w"> </span><span class="n">String</span><span class="o">[]</span><span class="w"> </span><span class="n">TARGETS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s">&#34;\\d{3}-\\d{2}-\\d{4}&#34;</span><span class="p">,</span><span class="w">                 </span><span class="c1">// SSN</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s">&#34;[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}&#34;</span><span class="p">,</span><span class="w">  </span><span class="c1">// email</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="s">&#34;[2-9]\\d{2}-\\d{3}-\\d{4}&#34;</span><span class="w">             </span><span class="c1">// phone</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">};</span><span class="w">
</span></span></span></code></pre></div><p>These (highly USA-centric) expressions match the basic forms of Social Security numbers, email addresses, and phone numbers. You can very easily augment this list with more patterns to match your definition of PII.</p>
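<p>As a sketch of what augmenting the list might look like, the version below adds a fourth entry. The class name <code>ExtendedTargets</code> and the dash-separated card-number pattern are assumptions for illustration only; substitute whatever your own definition of PII calls for.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: ExtendedTargets and the card pattern are not from the original function.
class ExtendedTargets {
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}",            // phone
        "\\d{4}-\\d{4}-\\d{4}-\\d{4}"           // assumed addition: dash-separated 16-digit card number
    };

    // Compiled once when the class is loaded, then reused by every call.
    static final List<Pattern> PATTERNS = compileAll();

    private static List<Pattern> compileAll() {
        List<Pattern> compiled = new ArrayList<>();
        for (String s : TARGETS) {
            compiled.add(Pattern.compile(s));
        }
        return compiled;
    }

    static boolean hasPII(String s) {
        for (Pattern p : PATTERNS) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasPII("card 1234-5678-9012-3456"));
        System.out.println(hasPII("no digits here"));
    }
}
```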
<p>Next, we see some initialization code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-java" data-lang="java"><span class="line"><span class="cl"><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="w"> </span><span class="n">patterns</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="kd">public</span><span class="w"> </span><span class="nf">PIIDetector</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">patterns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">for</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">TARGETS</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">patterns</span><span class="p">.</span><span class="na">add</span><span class="p">(</span><span class="n">Pattern</span><span class="p">.</span><span class="na">compile</span><span class="p">(</span><span class="n">s</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>Our handler points to an instance method in the <code>PIIDetector</code> class. When Snowflake runs a query that requires an instance of this class, it will look for a default constructor to use to create that instance. This provides a really easy way to do one-time initialization: in this case we compile the regular expressions once per query, rather than on each invocation, which should be much faster.</p>
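<p>To make the "construct once, invoke per row" idea concrete, here is a loose analogy written with plain Java reflection. It is not Snowflake's actual mechanism, which this post does not describe; <code>CountingDetector</code>, <code>HandlerLifecycleSketch</code>, and <code>runQuery</code> are hypothetical names, and the counter exists only so the single construction is visible.</p>

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for the handler class, with a counter showing how often it is constructed.
class CountingDetector {
    static int constructions = 0;
    private final List<Pattern> patterns = new ArrayList<>();

    public CountingDetector() {
        constructions++;
        patterns.add(Pattern.compile("\\d{3}-\\d{2}-\\d{4}")); // SSN only, for brevity
    }

    public boolean hasPII(String s) {
        for (Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }
}

// Hypothetical host: an analogy only, not how Snowflake is implemented.
class HandlerLifecycleSketch {
    static List<Boolean> runQuery(List<String> rows) {
        try {
            // One-time initialization: locate the default constructor and build one instance.
            Constructor<CountingDetector> ctor = CountingDetector.class.getDeclaredConstructor();
            CountingDetector instance = ctor.newInstance();
            Method handler = CountingDetector.class.getMethod("hasPII", String.class);

            // Per-row invocation: the same instance (and its compiled patterns) is reused.
            List<Boolean> results = new ArrayList<>();
            for (String row : rows) {
                results.add((Boolean) handler.invoke(instance, row));
            }
            return results;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not call the handler", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runQuery(Arrays.asList("id 123-45-6789", "no match")));
        System.out.println("constructions: " + CountingDetector.constructions);
    }
}
```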
<p>Finally, we have the actual method we’re binding to:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-java" data-lang="java"><span class="line"><span class="cl"><span class="kd">public</span><span class="w"> </span><span class="kt">boolean</span><span class="w"> </span><span class="nf">hasPII</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">for</span><span class="p">(</span><span class="n">Pattern</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">patterns</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="na">matcher</span><span class="p">(</span><span class="n">s</span><span class="p">).</span><span class="na">find</span><span class="p">())</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">return</span><span class="w"> </span><span class="kc">true</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">return</span><span class="w"> </span><span class="kc">false</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>This just loops over the patterns and returns true as soon as any of them matches. Easy peasy!</p>
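<p>One detail worth spelling out: the method uses <code>find()</code>, which looks for a match anywhere in the string, rather than <code>matches()</code>, which requires the whole string to match. The small sketch below contrasts the two; <code>FindVersusMatches</code> and its helper methods are made-up names for this example.</p>

```java
import java.util.regex.Pattern;

// Illustrative helper class; only the SSN pattern is taken from the function above.
class FindVersusMatches {
    static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // True if the pattern occurs anywhere inside the text.
    static boolean containsSsn(String s) {
        return SSN.matcher(s).find();
    }

    // True only if the entire text is an SSN-shaped string.
    static boolean isExactlySsn(String s) {
        return SSN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String body = "My SSN is 123-45-6789.";
        System.out.println(containsSsn(body));
        System.out.println(isExactlySsn(body));
        System.out.println(isExactlySsn("123-45-6789"));
    }
}
```

<p>For scanning free text such as an email body, the substring behavior of <code>find()</code> is the one we want.</p>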
<p>And there we have it: a simple PII detection routine that you can customize to your requirements (and local phone-number formats). But really, this is good for any situation where you have a number of regular expressions to match.</p>
<p>And with a little tweaking, you can mask out these matches instead. Here’s the code; I’ll let you dig into the details.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="n">maskpii</span><span class="p">(</span><span class="n">s</span><span class="w"> </span><span class="n">string</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="n">string</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">language</span><span class="w"> </span><span class="n">java</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">returns</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="k">null</span><span class="w"> </span><span class="k">input</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">handler</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;PIIDetector.maskPII&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">as</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">import</span><span class="w"> </span><span class="n">java</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">regex</span><span class="p">.</span><span class="o">*</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">import</span><span class="w"> </span><span class="n">java</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="o">*</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">public</span><span class="w"> </span><span class="k">class</span><span class="w"> </span><span class="n">PIIDetector</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="k">final</span><span class="w"> </span><span class="n">String</span><span class="p">[]</span><span class="w"> </span><span class="n">TARGETS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;\\d{3}-\\d{2}-\\d{4}&#34;</span><span class="p">,</span><span class="w">                 </span><span class="o">//</span><span class="w"> </span><span class="n">SSN</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}&#34;</span><span class="p">,</span><span class="w">  </span><span class="o">//</span><span class="w"> </span><span class="n">email</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="s2">&#34;[2-9]\\d{2}-\\d{3}-\\d{4}&#34;</span><span class="w">             </span><span class="o">//</span><span class="w"> </span><span class="n">phone</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="p">;</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="k">final</span><span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">MASK</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;###&#34;</span><span class="p">;</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="w"> </span><span class="n">patterns</span><span class="p">;</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">public</span><span class="w"> </span><span class="n">PIIDetector</span><span class="p">()</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">patterns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">Pattern</span><span class="o">&gt;</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">for</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">TARGETS</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="n">patterns</span><span class="p">.</span><span class="k">add</span><span class="p">(</span><span class="n">Pattern</span><span class="p">.</span><span class="n">compile</span><span class="p">(</span><span class="n">s</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">public</span><span class="w"> </span><span class="n">String</span><span class="w"> </span><span class="n">maskPII</span><span class="p">(</span><span class="n">String</span><span class="w"> </span><span class="n">s</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">for</span><span class="p">(</span><span class="n">Pattern</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">patterns</span><span class="p">)</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">.</span><span class="n">matcher</span><span class="p">(</span><span class="n">s</span><span class="p">).</span><span class="n">replaceAll</span><span class="p">(</span><span class="n">MASK</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">return</span><span class="w"> </span><span class="n">s</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="err">$$</span><span class="w">
</span></span></span></code></pre></div><p>Happy hacking!</p>
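<p>P.S. If you want to poke at the masking logic before creating the function, a small local driver is one way to do it. This is only a sketch: <code>MaskDemo</code> is a made-up class name, the sample sentence is invented, and the patterns and mask are copied from the function above.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Local copy of the masking logic for experimentation; MaskDemo is a hypothetical name.
class MaskDemo {
    static final String[] TARGETS = {
        "\\d{3}-\\d{2}-\\d{4}",                 // SSN
        "[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,4}",  // email
        "[2-9]\\d{2}-\\d{3}-\\d{4}"             // phone
    };

    static final String MASK = "###";

    private final List<Pattern> patterns = new ArrayList<>();

    MaskDemo() {
        for (String s : TARGETS) {
            patterns.add(Pattern.compile(s));
        }
    }

    // Applies each pattern in turn, replacing every match with the fixed mask.
    String maskPII(String s) {
        for (Pattern p : patterns) {
            s = p.matcher(s).replaceAll(MASK);
        }
        return s;
    }

    public static void main(String[] args) {
        MaskDemo demo = new MaskDemo();
        // prints "Call ### or mail ###"
        System.out.println(demo.maskPII("Call 425-555-1234 or mail bob@example.com"));
    }
}
```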
]]></content></item><item><title>A leopard can't change his spots. (But he may change jobs.)</title><link>https://stdin.org/a-leopard-cant-change-his-spots-but-he-may-change-jobs/</link><pubDate>Sun, 15 Jul 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/a-leopard-cant-change-his-spots-but-he-may-change-jobs/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I won&rsquo;t bury the lede: My last day at <a href="https://www.tableau.com/">Tableau</a> was July 6th, and tomorrow I start a new gig at <a href="https://www.snowflake.net/">Snowflake</a>.</p>
<p><img src="happysad.png" alt="HappySad"></p>
<p>I joined Tableau in June of 2015, and spent most of my three years there starting, building, and ultimately shipping <a href="https://www.tableau.com/products/prep">Tableau Prep</a>. I&rsquo;m incredibly proud of the Prep team, the product we put together, and the awesome functionality yet to come.</p>
<p>As I move on, I&rsquo;ve been thinking a bit about the past projects that really excited me. In addition to Prep, my favorites were probably StreamInsight, which was a system for dealing with time-aware queries and streaming data, and the spatial types in SQL Server. (Those types are still going strong and <a href="https://www.tableau.com/about/blog/visualize-spatial-data-directly-sql-server-tableau-20181-87377">motivating new integrations</a> ten years later.)</p>
<p>A common theme through all of these projects has been making it easy to do complex things with data. And Snowflake is most certainly out to do that with data warehousing. It feels like a wonderful match.</p>
<p>I&rsquo;m going to miss Tableau — it&rsquo;s a wonderful company — and I&rsquo;m going to miss Prep. But I&rsquo;m incredibly excited to be starting at Snowflake. (And a special thanks to those Preppies that slipped Snowflake support into <a href="https://www.tableau.com/about/blog/2018/7/announcing-tableau-prep-easy-enterprise-deployments-and-more-data">the latest Prep release</a>. That should save me some awkward moments.)</p>
<p>I&rsquo;ll try to keep writing here — maybe with a broader set of topics, and hopefully with a bit more regularity. So do please check in and drop me a note.</p>
]]></content></item><item><title>Tableau Prep: The Power of Composability</title><link>https://stdin.org/tableau-prep-the-power-of-composability/</link><pubDate>Wed, 09 May 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/tableau-prep-the-power-of-composability/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>When we built <a href="https://www.tableau.com/products/prep">Tableau Prep</a>, we put a premium on ensuring  <em>composability</em> of operations: you can take the operations Prep supports and string them together in any combination you need. There are  <em>no</em> restrictions based on where the data came from, or what operations came before.</p>
<p>This means that you never need to think about whether a particular operation is supported in your particular situation: if Prep supports it ever, Prep supports it always. Moreover, this gives you a lot of  <em>power</em> to do what you need to with your data.</p>
<p><img src="youhavethepower.jpg" alt="youhavethepower"></p>
<p>In the rest of this post, we&rsquo;ll walk through a Superstore example that highlights this power.</p>
<h1 id="the-problem">The Problem<a href="#the-problem" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h1>
<p>Let&rsquo;s start with the sample Superstore data from Tableau Desktop. This data set is a list of order details: each row represents one item from an order, with multiple line items accruing to each order.</p>
<p>Given these data, let&rsquo;s try to fulfill what seems like a simple request:</p>
<p>Get the order details for customers with fewer than the median number of orders.</p>
<p>This seems relatively straightforward&hellip; or is it? In cases like this, I often find it helpful to think backwards to come up with a solution:</p>
<p><strong>Step 4</strong>
If we had a list of customers with fewer than the median number of orders, we could cull the order details down to just those from customers on the list. But we don&rsquo;t have a list of these sub-median customers.</p>
<p><strong>Step 3</strong>
If we knew the median number of orders, we could prune the list of customers down to those with fewer than the median. But we don&rsquo;t have the median number of orders.</p>
<p><strong>Step 2</strong>
If we knew the count of orders for each customer, we could aggregate it to find the median number of orders over all customers. But we don&rsquo;t have the number of orders for each customer.</p>
<p><strong>Step 1</strong>
If we had the list of orders for each customer, we could aggregate to get the count for each customer. <em>And we do have the order list!</em></p>
<p>Now we have a plan: we&rsquo;ll start with the order details we have, and climb the ladder outlined above to get to the solution.</p>
<h1 id="the-solution">The Solution<a href="#the-solution" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h1>
<p>We start by loading the Superstore data:<img src="step0.png" alt="Step0"></p>
<p>As we&rsquo;ve already observed, these are order details. Each order has a distinct Order ID, but may have more than one line.</p>
<h2 id="step-1"><strong>Step 1</strong><a href="#step-1" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Following our plan, the first thing we need to do is get the count of orders for each customer. To do this we introduce an aggregate: we group by customer and count the number of <em>distinct</em> Order IDs:<img src="step1-1.png" alt="Step1-1"></p>
<p>The distinct ensures that repeated Order IDs, which come from having more than one order detail line per order, are counted only once.</p>
<p>So we don&rsquo;t confuse ourselves later, we&rsquo;ll rename Order ID to Number of Orders:<img src="step1-2-anno.png" alt="Step1-2-anno"></p>
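<p>For readers who think in code, the aggregate in this step can be sketched in a few lines of Java. This is an analogy rather than anything Prep runs internally; the <code>Line</code> class, its field names, and the sample rows are assumptions.</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DistinctOrderCount {
    // One order-detail row, reduced to the two fields this step needs (names are assumptions).
    static final class Line {
        final String customer;
        final String orderId;

        Line(String customer, String orderId) {
            this.customer = customer;
            this.orderId = orderId;
        }
    }

    // Group by customer and count distinct order IDs, so multi-line orders count once.
    static Map<String, Integer> ordersPerCustomer(List<Line> lines) {
        Map<String, Set<String>> idsByCustomer = new TreeMap<>();
        for (Line l : lines) {
            idsByCustomer.computeIfAbsent(l.customer, k -> new HashSet<>()).add(l.orderId);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : idsByCustomer.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
            new Line("Ann", "O1"), new Line("Ann", "O1"),  // two lines of the same order
            new Line("Ann", "O2"), new Line("Bob", "O3"));
        System.out.println(ordersPerCustomer(lines));
    }
}
```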
<h2 id="step-2"><strong>Step 2</strong><a href="#step-2" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Now that we have the count of orders for each customer, we can aggregate <em>again</em> to find the median number of orders per customer:<br>
<img src="step2.png" alt="Step2"></p>
<p>This aggregate is a little funny: There&rsquo;s no grouping field, so we don&rsquo;t partition the table at all. The result is an odd little table with one row and one column, but this record represents the median over all customers we were looking for.</p>
<p>We&rsquo;ll rename this once again:<br>
<img src="step2-2-anno.png" alt="Step2-2-anno"></p>
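<p>In code terms, this ungrouped aggregate collapses the whole column of counts into a single number. The Java sketch below assumes the usual convention of averaging the two middle values when the number of customers is even; whether Prep uses exactly that convention is not something this post establishes. <code>MedianOfCounts</code> is a made-up name.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MedianOfCounts {
    // Median over all values with no grouping: the whole input collapses to one number.
    static double median(List<Integer> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("median of an empty list is undefined");
        }
        List<Integer> sorted = new ArrayList<>(counts);
        Collections.sort(sorted);
        int n = sorted.size();
        int mid = n / 2;
        if (n % 2 == 1) {
            return sorted.get(mid);                            // odd count: the middle value
        }
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;  // even count: average the middle pair
    }

    public static void main(String[] args) {
        System.out.println(median(Arrays.asList(3, 1, 2)));
        System.out.println(median(Arrays.asList(1, 2, 3, 4)));
    }
}
```

<p>Note that the result is a floating-point value even though every input is an integer.</p>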
<h2 id="step-3"><strong>Step 3</strong><a href="#step-3" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>With the median number of orders in hand, we can join it with our list of customers and order counts to filter down that list. I.e., we&rsquo;ll join it with the result of our first aggregate:</p>
<p><img src="step31.png" alt="Step3"></p>
<p>Note the join clause here: we&rsquo;re doing an inner join, but matching when the median is greater than the customer&rsquo;s order count. We also have an error: the types don&rsquo;t match because the result of the median is a floating-point number, not an integer.</p>
<p>If we correct the type, we get our list of customers with fewer than the median number of orders:<img src="step3-21.png" alt="Step3-2"></p>
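<p>The join-as-filter in this step amounts to comparing every customer&rsquo;s count against the one median value. A minimal Java sketch of that idea follows; <code>SubMedianCustomers</code> and <code>belowMedian</code> are hypothetical names, and the explicit widening to <code>double</code> plays the role of the type correction described above.</p>

```java
import java.util.Map;
import java.util.TreeMap;

class SubMedianCustomers {
    // Keep only the customers whose order count is strictly below the median.
    static Map<String, Integer> belowMedian(Map<String, Integer> ordersPerCustomer, double median) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : ordersPerCustomer.entrySet()) {
            // Widen the integer count to double so both sides of the comparison share a type.
            double count = e.getValue();
            if (median > count) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("Ann", 2);
        counts.put("Bob", 1);
        counts.put("Cy", 3);
        System.out.println(belowMedian(counts, 2.0));
    }
}
```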
<h2 id="step-4"><strong>Step 4</strong><a href="#step-4" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Now that we have our customer list, we&rsquo;re ready to cull the line items. We&rsquo;ll again use a join as a filter, but this time we&rsquo;re joining our latest table with the original input:<br>
<img src="step41.png" alt="Step4"></p>
<p>You can see that there are a bunch of records dropping out from the right: those are the line items for customers with at least the median number of orders. What remain are the line items we care about:<img src="step51.png" alt="Step5.PNG"></p>
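<p>This last join behaves like a membership filter: a detail row survives only if its customer is on the list from Step 3. Here is a small Java sketch of that behavior, with an assumed <code>Line</code> class and invented sample rows (it is an analogy, not a description of how Prep executes the join).</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterDetails {
    // One order-detail row; the field names are assumptions for this sketch.
    static final class Line {
        final String customer;
        final String orderId;
        final String product;

        Line(String customer, String orderId, String product) {
            this.customer = customer;
            this.orderId = orderId;
            this.product = product;
        }
    }

    // Keep only the detail rows whose customer appears in the given set.
    static List<Line> keepCustomers(List<Line> details, Set<String> customers) {
        List<Line> kept = new ArrayList<>();
        for (Line l : details) {
            if (customers.contains(l.customer)) {
                kept.add(l);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Line> details = Arrays.asList(
            new Line("Ann", "O1", "Pens"),
            new Line("Bob", "O3", "Paper"),
            new Line("Ann", "O2", "Ink"));
        Set<String> subMedian = new HashSet<>(Arrays.asList("Bob"));
        for (Line l : keepCustomers(details, subMedian)) {
            System.out.println(l.customer + " " + l.orderId + " " + l.product);
        }
    }
}
```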
<h1 id="wrapping-up">Wrapping Up<a href="#wrapping-up" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h1>
<p>At this point, you might want to clean up a few of the columns we created along the way, but our data are ready to output to Tableau or anywhere else you want to take them.</p>
<p>This may seem a little complex, and it&rsquo;s clearly stretching our flow layout algorithm, but it makes a perfectly fine flow. There was no operator that solved our problem out of the box, but composability made it possible to mix and match the operations present to build a computational machine for our task.</p>
<p>We certainly aren&rsquo;t done adding operations to Prep, but there&rsquo;s a rich set already present. And with a little composition, you can make them do some pretty cool tricks.</p>
]]></content></item><item><title>Tableau Prep: The Flow</title><link>https://stdin.org/tableau-prep-the-flow/</link><pubDate>Mon, 07 May 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/tableau-prep-the-flow/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I&rsquo;ve been a bit quiet lately, but Tableau Prep is out the door and it&rsquo;s time to make a little noise.</p>
<p>Clark recently wrote an <a href="https://www.tableau.com/about/blog/2018/4/ux-notebook-designing-tableau-preps-coordinated-workspace-85846">excellent post</a> on the basic UX architecture of Prep. Here I&rsquo;d like to cover a key concept underlying Prep that may be a bit foreign to people coming from Tableau: the <em>flow</em>.</p>
<p><img src="1flow.png" alt="1flow"></p>
<p>This isn&rsquo;t the most glamorous part of Prep, but it is one of the most fundamental concepts in the tool, so it seems worth spending some quality time on.</p>
<p>Strap on your life jacket and read on for more.</p>
<p><strong>Data In; Data Out</strong></p>
<p>To understand flows, we start with <em>steps</em>, which are the conceptual unit of work in Tableau Prep. Every time you take an action on your data in Prep, you&rsquo;re adding a step. For example, if we take the world consumer price index data included with the product and add a filter, we find that a new step is added to the flow:</p>
<p><img src="2step.png" alt="2step"></p>
<p>Each item in the flow pane represents a step, and each step works in the same way: data come in from the left, are modified by the step, and leave to the right:</p>
<p><img src="3inandout-annotated.png" alt="3inandout-annotated"></p>
<p>Some steps — <em>cleaning steps</em> — may have multiple sub-steps, or <em>changes.</em> These are just like steps in the flow, but are smaller increments of work. They flow top to bottom:</p>
<p><img src="5cleaning-annotated.png" alt="5cleaning-annotated"></p>
<p>We group these together to help conceptually simplify the flow, but each change acts just like any other step: rows come in, they&rsquo;re modified, and they go out.</p>
<p>Some steps — such as joins — have multiple inputs, but they work the same way: two sets of data come in from the left, they&rsquo;re put together, and the result leaves to the right:</p>
<p><img src="4join-annotated.png" alt="4join-annotated"></p>
<p>And where do they go? On to the next step! Some steps may even have multiple outputs, with the data going to multiple targets:</p>
<p><img src="6twooutannotated.png" alt="6twooutannotated"></p>
<p>Step by step we build up a flow: an <em>ordered</em> sequence of steps that does what we want.</p>
<p><img src="1flow.png" alt="1flow"></p>
<p><strong>Clarity and Control</strong></p>
<p>That ordering is a key aspect of flows. If you&rsquo;re coming from Tableau, you may be aware that it performs operations in a particular order, but the system doesn&rsquo;t advertise this, and generally you don&rsquo;t need to think about it.</p>
<p>But order sometimes matters, and we designed Prep with those times in mind. The CPI data contain both a food index and a general index. Let&rsquo;s say that we&rsquo;ve pivoted the data, and now want to compare each country&rsquo;s CPI to the global average for each year — except we only care about the food index.</p>
<p>To do this, we&rsquo;ll first filter to keep only the food index:</p>
<p><img src="filter-annotated.png" alt="filter-annotated"></p>
<p>And <em>then</em> we&rsquo;ll aggregate by year:</p>
<p><img src="agg-annotated.png" alt="agg-annotated"></p>
<p>Order matters: if we did the aggregate first, we would have folded in the general CPI as well.</p>
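<p>To make the difference concrete, here is a small Python sketch with made-up CPI values (the countries, years, and numbers are purely illustrative). It runs the same aggregate with and without the filter in front of it.</p>

```python
from collections import defaultdict

# Hypothetical pivoted CPI rows: (country, index type, year, value).
rows = [
    ("A", "food", 2015, 100.0),
    ("B", "food", 2015, 110.0),
    ("A", "general", 2015, 130.0),
    ("B", "general", 2015, 140.0),
]

def average_by_year(data):
    """Aggregate step: average the value column for each year."""
    groups = defaultdict(list)
    for _country, _kind, year, value in data:
        groups[year].append(value)
    return {year: sum(values) / len(values) for year, values in groups.items()}

# Filter first, then aggregate: only the food index reaches the average.
food_only = [row for row in rows if row[1] == "food"]
filter_then_aggregate = average_by_year(food_only)

# Aggregate first: the general index is folded into the average, and the
# index type column is gone, so there is nothing left to filter on.
aggregate_first = average_by_year(rows)
```

<p>With these numbers the food-only average for 2015 is 105.0, while aggregating first gives 120.0.</p>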
<p>This kind of ordering is explicit in Prep. You don&rsquo;t have to guess, and you don&rsquo;t need to coax the system into doing what you want: you just build your flow in the order that fits your problem.</p>
<p>And with Prep, you can always go back and see your data at any point along the flow. Just click back and look. This way you can see and control what the flow is doing to your data every step along the way.</p>
<p><strong>Prep is a Competent Cook</strong></p>
<p>We can add another metaphor: think of a flow as a recipe, and let&rsquo;s take a moment to bake some cookies.</p>
<p><img src="julia-spoon.jpg" alt="julia-spoon"></p>
<p>We&rsquo;ve already mixed the wet ingredients — the eggs, the vanilla, the butter — when we get to this part of the recipe:</p>
<ol start="6">
<li>&hellip;</li>
<li>Measure 1.5 cups flour</li>
<li>Add 1/4 teaspoon salt</li>
<li>Add 1/2 teaspoon baking powder</li>
<li>Mix thoroughly</li>
<li>Add dry ingredients to wet ingredients</li>
<li>…</li>
</ol>
<p>A competent cook would mix these dry ingredients before adding them to the wet, but they would take the liberty of combining them in any convenient order: they know it&rsquo;s irrelevant.</p>
<p>Tableau Prep is a competent cook. It can figure out many cases where the order won&rsquo;t matter, and can rearrange them to make your flow run more efficiently. But it will only do this when the reordering won&rsquo;t affect the results that <em>you</em> intended.</p>
<p>So while the flow gives a <em>conceptual</em> order to the operations, they may not actually be executed in that order at all. The result is that you can ignore order when it doesn&rsquo;t matter, but rely on it when it does.</p>
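<p>As an illustration of the kind of case where order can&rsquo;t matter, consider two row-level filters. The sketch below is plain Python with hypothetical data and helper names; it says nothing about how Prep actually decides what to reorder, only that swapping these two steps cannot change the result.</p>

```python
# Hypothetical rows: (country, index type, year).
rows = [
    ("A", "food", 2014),
    ("A", "food", 2016),
    ("B", "general", 2016),
    ("B", "food", 2016),
]

def keep_food(data):
    """Filter step: keep only the food index."""
    return [row for row in data if row[1] == "food"]

def keep_recent(data):
    """Filter step: keep only rows from 2015 onward."""
    return [row for row in data if row[2] >= 2015]

# Two row-level filters commute, so an engine may run them in either order.
as_written = keep_recent(keep_food(rows))
reordered = keep_food(keep_recent(rows))
```

<p>Both orders leave the same two rows, which is what makes this sort of rearrangement safe.</p>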
<p><strong>More than Just Flows</strong></p>
<p>The notion of a flow is not unique to Tableau Prep, and it isn&rsquo;t Prep&rsquo;s most distinguishing feature. The way that Prep uses samples to give you immediate feedback, the way we use analytics to help you see what needs to be done, and the direct manipulation all more directly contribute to what makes Prep special.</p>
<p>But understanding flows is central to understanding how to make Prep do exactly what you want, and it can be a bit of a leap for folks coming from Tableau Desktop. I hope this helps make that leap a little easier.</p>
<p>Happy hacking!</p>
]]></content></item><item><title>When Live Beats an Extract</title><link>https://stdin.org/when-live-beats-an-extract/</link><pubDate>Wed, 14 Mar 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/when-live-beats-an-extract/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>When using Tableau, taking an extract is always better than using a live query, right?</p>
<p>Well, no.</p>
<p>Of course. Obviously, when your data are changing and you want to get all of the latest updates in your viz, you&rsquo;ll want to use a live query. But if that&rsquo;s not the case, then an extract is clearly better, especially with Hyper in 10.5, right?</p>
<p>Well, no!</p>
<p>Shoot! This is complicated? When <em>will</em> live beat an extract? Let&rsquo;s take a look at a few cases.</p>
<h2 id="a-few-basics">A Few Basics<a href="#a-few-basics" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>To understand what&rsquo;s going on, you should have a basic understanding of how live and extracted data sources are used by the system. If you feel a bit shaky here, I&rsquo;d recommend my previous post on <a href="http://blog.stdin.org/2018/01/12/tableau-data-sources-live-vs-extract/">live vs extracts</a>. But in a nutshell:</p>
<ul>
<li>When you&rsquo;re using an extract, the query defined by the data source is run and the whole resulting table is persisted in either a TDE (in Tableau 10.4 or before) or a Hyper database (in 10.5 and later). The queries produced by your workbook are then run against this table.</li>
<li>When you&rsquo;re running live, the queries from your workbook are <em>composed</em> with the data source query. In simple cases, at least, this will result in a single query that is pushed down to the target database system, and only the results needed for the viz are returned.</li>
</ul>
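<p>The two modes can be mimicked with SQLite from Python. This is only a sketch of the idea: the tiny table, the query strings, and the <code>extract_table</code> name are all made up, and Tableau&rsquo;s real query generation is considerably more involved.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (ticker TEXT, date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("DATA", "2017-01-03", 40.0),
        ("DATA", "2017-01-04", 44.0),
        ("MSFT", "2017-01-03", 60.0),
    ],
)

# Hypothetical data source query and viz query template.
source_query = "SELECT ticker, date, close FROM history"
viz_query = "SELECT AVG(close) FROM {src} WHERE ticker = 'DATA'"

# Extract: run the data source query once and persist the whole result,
# then run the viz query against that stored table.
conn.execute(f"CREATE TABLE extract_table AS {source_query}")
extract_result = conn.execute(viz_query.format(src="extract_table")).fetchone()[0]

# Live: compose the viz query around the data source query and send the
# single combined query to the database, which returns only the answer.
live_sql = viz_query.format(src=f"({source_query}) AS src")
live_result = conn.execute(live_sql).fetchone()[0]
```

<p>Both paths return the same average; what differs is where the work happens and how much data has to be copied before the viz query can run.</p>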
<p>We&rsquo;re going to look at a few cases where live can do better than an extract. As we look at them, pay particular attention to:</p>
<ul>
<li>The time to run the remote query,</li>
<li>The time to transfer the data, and</li>
<li>The time to run the local query.</li>
</ul>
<p>These aren&rsquo;t rigorous perf numbers, but to give you a sense of scale, here&rsquo;s my setup:</p>
<ul>
<li>Tableau 10.5 (with Hyper) running on an i5-2500 with 8GB of RAM.</li>
<li>SQL Server 2017 Express Edition running on an i7-3770 with 16GB of RAM.</li>
<li>All wired together over gigabit Ethernet.</li>
</ul>
<p>So nothing too grand. In any case, the lessons here should carry over to other hardware.</p>
<p>The data set is a <a href="https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs">stock history set from Kaggle</a> that records daily stats for a large number of stocks and <a href="https://en.wikipedia.org/wiki/Exchange-traded_fund">ETFs</a>. The schema looks like:</p>
<pre tabindex="0"><code>history(ticker, type, date, open, high, low, close, volume, openInt)
</code></pre><p>Loaded into SQL Server and indexed on (ticker, date), this results in 17.4M rows and about 1.5GB of storage. (I have no idea what the provenance or accuracy of these data is, but for this work only the size is relevant.)</p>
<p>Let&rsquo;s try to beat an extract!</p>
<h2 id="nail-the-index">Nail The Index<a href="#nail-the-index" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s start with an easy case: let&rsquo;s find the yearly average close for Tableau&rsquo;s stock. I&rsquo;ll drag the ticker into filters, years into columns, and Avg(Close) into rows. It&rsquo;s an award-worthy viz:</p>
<p><img src="data.png" alt="data"></p>
<p>This is also an almost ideal query for our SQL Server database: it makes excellent use of the index, so the query is exceptionally fast to run; and because the aggregation happens remotely, there are almost no results to send over the wire. By <a href="http://blog.stdin.org/2018/02/13/the-query-behind-the-viz/">looking in the log</a>, I find that it takes a whole 0.006 seconds to run this query and fetch the results. How can we possibly beat that?</p>
<p>Indeed, if we recreate the same viz with an extract, Hyper takes more like 0.2 seconds to compute the viz.</p>
<p><img src="bestcase.png" alt="bestcase"></p>
<p>So SQL Server is faster than Hyper? Well, in this case it is, but we&rsquo;ve almost cheated by practically tuning it to answer <em>this query</em> quickly. Hyper, on the other hand, doesn&rsquo;t require (and doesn&rsquo;t allow) us to tune its setup. So we&rsquo;re comparing the <em>best case</em> for SQL Server to <em>a</em> case for Hyper.</p>
<p>But the lesson is still sound: if your query (a) lines up well with the setup of your remote database, and (b) transfers very little data, then we can actually beat a Hyper extract.</p>
<h2 id="be-truly-ad-hoc">Be Truly Ad Hoc<a href="#be-truly-ad-hoc" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s try to avoid pandering to SQL Server quite so much and just ask for the number of records my data set has each year:</p>
<p><img src="yearly.png" alt="yearly.PNG"></p>
<p>Now SQL Server takes a bit longer: 5.53 seconds. Trying this against the extract shows what Hyper can do: 0.193 seconds. In this case, both engines have to do roughly the same amount of work, but with its column-based, in-memory execution, Hyper is the clear winner!</p>
<p>Except that we haven&rsquo;t taken into account the cost of generating the extract. When we refresh it, we find that it takes us 67.8 seconds to generate a 435MB extract. If we add that in, SQL Server starts looking pretty good:</p>
<p><img src="adhoc.png" alt="adhoc"></p>
<p>Applying a little algebra, that means that to recoup the cost of our extract, we&rsquo;d need to run our viz query about 13 times (67.8 / (5.53 - 0.193) is roughly 12.7). Often this will be worth it, but if the query is truly one-off, I&rsquo;d rather spend 5.53 seconds than 68.</p>
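<p>The break-even point can be checked directly from the three timings quoted above. A minimal sketch, with variable names of my own choosing:</p>

```python
extract_build_s = 67.8   # one-time cost to generate the extract
live_query_s = 5.53      # each run of the viz query against SQL Server
extract_query_s = 0.193  # each run of the viz query against the extract

# The extract wins once n * live exceeds build + n * extract, that is
# once n exceeds build / (live - extract).
break_even_runs = extract_build_s / (live_query_s - extract_query_s)
```

<p>With these figures the break-even lands at roughly 12.7 runs. These are single, unrigorous measurements, so treat the result as a ballpark rather than a precise threshold.</p>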
<h2 id="blow-up-the-extract">Blow Up the Extract<a href="#blow-up-the-extract" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s try something more horrible. Let&rsquo;s say that in addition to the historical stock prices, we have a table of customer holdings. We&rsquo;ll keep it simple; our customers have static holdings that look like:</p>
<p><code>customerholdings(customer, ticker, amount)</code></p>
<p>(I don&rsquo;t actually have any customers, so I randomly generated 20 holdings for each of 20,000 imaginary customers.)</p>
<p>We want to do things like look at the total value of all customers&rsquo; holdings over time, so we join the holdings to the price history.</p>
<p><img src="holdingsjoin.png" alt="holdingsJoin"></p>
<p>We then create a calc to compute the value of each customer&rsquo;s holdings and make a viz:</p>
<p><img src="holdingsviz.png" alt="holdingsViz"></p>
<p>In case you&rsquo;re interested, that giant spike is caused by a few odd stocks like DryShips Inc. (<a href="https://finance.yahoo.com/quote/drys?p=drys">DRYS</a>), which somehow peaked at $1,442,048,636.45 in 2007. I don&rsquo;t comprehend. The graph looks funny, but again, this doesn&rsquo;t matter for our analysis.</p>
<p>What we care about is that this query takes 133 seconds to run—it&rsquo;s a fair bit of work for SQL Server to do. How about the extract?</p>
<p>Well, let&rsquo;s do a little back-of-the-envelope computation. If we execute the full join in SQL Server and don&rsquo;t aggregate anything down, instead of the 17 million records in our history table, the result set will have about 441 million records. And these records are larger than the history rows because they have customer information as well.</p>
<p>Optimistically, this will end up being something like 10 gigabytes of data that I have to move over the wire, and store in a local extract. And that&rsquo;s all before I even get to ask my query. So unless I&rsquo;m doing this a lot, I&rsquo;m simply not going to bother.</p>
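<p>One rough way to sanity-check that estimate is to scale up the numbers from the earlier extract. The sketch below leans on two loud assumptions that I haven&rsquo;t tested: that the joined rows compress no better than the history rows did, and that extract time grows linearly with row count.</p>

```python
history_rows = 17.4e6      # rows in the history table
extract_bytes = 435e6      # size of the extract built from those rows
extract_build_s = 67.8     # time taken to build that extract
joined_rows = 441e6        # rows produced by the unaggregated join

# Assumption: the joined rows compress no better than the history rows.
bytes_per_row = extract_bytes / history_rows
joined_extract_gb = joined_rows * bytes_per_row / 1e9

# Assumption: extract time grows linearly with the number of rows.
joined_build_minutes = extract_build_s * (joined_rows / history_rows) / 60
```

<p>Under those assumptions the extract comes out at about 11 GB and takes a little under half an hour to build, which is in the same ballpark as the estimate above, and a long wait before the first query can run.</p>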
<h2 id="wrapping-up">Wrapping Up<a href="#wrapping-up" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>So we&rsquo;ve seen a few cases where live queries may be preferable to extracts, leaving aside the obvious cases where you simply want the most current data.</p>
<p>One thing we didn&rsquo;t talk about is federated queries: queries that span multiple data sources. As a <em>general rule,</em> federation makes extracts look better relative to live, because live starts to look worse. Live works best when the engine can push operations that reduce data volumes off to the remote system—operations like aggregations and filters—and federation tends to interfere with that pushdown.</p>
<p>But that&rsquo;s another ball of wax. I&rsquo;ll write more on federation soon.</p>
]]></content></item></channel></rss>