<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Databases on stdin</title><link>https://stdin.org/tags/databases/</link><description>Recent content in Databases on stdin</description><generator>Hugo -- 0.161.1</generator><language>en</language><copyright>Isaac Kunen</copyright><lastBuildDate>Wed, 14 Mar 2018 00:00:00 +0000</lastBuildDate><atom:link href="https://stdin.org/tags/databases/index.xml" rel="self" type="application/rss+xml"/><item><title>When Live Beats an Extract</title><link>https://stdin.org/when-live-beats-an-extract/</link><pubDate>Wed, 14 Mar 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/when-live-beats-an-extract/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>When using Tableau, taking an extract is always better than using a live query, right?</p>
<p>Well, no.</p>
<p>Of course. Obviously, when your data are changing and you want to get all of the latest updates in your viz, you&rsquo;ll want to use a live query. But if that&rsquo;s not the case, then an extract is clearly better, especially with Hyper in 10.5, right?</p>
<p>Well, no!</p>
<p>Shoot! This is complicated? When <em>will</em> live beat an extract? Let&rsquo;s take a look at a few cases.</p>
<h2 id="a-few-basics">A Few Basics<a href="#a-few-basics" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>To understand what&rsquo;s going on, you should have a basic understanding of how live and extracted data sources are used by the system. If you feel a bit shaky here, I&rsquo;d recommend my previous post on <a href="http://blog.stdin.org/2018/01/12/tableau-data-sources-live-vs-extract/">live vs extracts</a>. But in a nutshell:</p>
<ul>
<li>When you&rsquo;re using an extract, the query defined by the data source is run and the whole resulting table is persisted in either a TDE (in Tableau 10.4 or before) or a Hyper database (in 10.5 and later). The queries produced by your workbook are then run against this table.</li>
<li>When you&rsquo;re running live, the queries from your workbook are <em>composed</em> with the data source query. In simple cases, at least, this will result in a single query that is pushed down to the target database system, and only the results needed for the viz are returned.</li>
</ul>
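<p>As a rough sketch of what that composition looks like, suppose the data source is defined by a query over a hypothetical <code>orders</code> table and the viz asks for a total by region. The table, column, and alias names here are illustrative assumptions, not what Tableau literally emits. Live, the two can be folded into one statement for the remote database:</p>
<pre tabindex="0"><code>-- Inner query: the (hypothetical) data source definition.
-- Outer query: what the viz asks for, composed on top of it.
select ds.region, sum(ds.amount) as total_amount
from (
    select * from orders where year(order_date) = 2017
) ds
group by ds.region
</code></pre>
<p>With an extract, only the inner query runs remotely. Its full result is stored locally, and the outer aggregation runs against that stored table.</p>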
<p>We&rsquo;re going to look at a few cases where live can do better than an extract. As we look at them, pay particular attention to:</p>
<ul>
<li>The time to run the remote query,</li>
<li>The time to transfer the data, and</li>
<li>The time to run the local query.</li>
</ul>
<p>These aren&rsquo;t rigorous perf numbers, but to give you a sense of scale, here&rsquo;s my setup:</p>
<ul>
<li>Tableau 10.5 (with Hyper) running on an i5-2500 with 8GB of RAM.</li>
<li>SQL Server 2017 Express Edition running on an i7-3770 with 16GB of RAM.</li>
<li>All wired together over gigabit Ethernet.</li>
</ul>
<p>So nothing too grand. In any case, the lessons here should carry over to other hardware.</p>
<p>The data set is a <a href="https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs">stock history set from Kaggle</a> that records daily stats for a large number of stocks and <a href="https://en.wikipedia.org/wiki/Exchange-traded_fund">ETFs</a>. The schema looks like:</p>
<pre tabindex="0"><code>history(ticker, type, date, open, high, low, close, volume, openInt)
</code></pre><p>Loaded into SQL Server and indexed on (ticker, date), this results in 17.4M rows and about 1.5GB of storage. (I have no idea what the provenance or accuracy of these data are, but for this work only the size is relevant.)</p>
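<p>For reference, a minimal sketch of how such a table and index might be declared in SQL Server. Only the column names come from the schema above; the column types, the index name, and the choice of a clustered index are all assumptions:</p>
<pre tabindex="0"><code>-- Column types are guesses; open and close are bracketed because
-- they are reserved words in T-SQL.
create table history (
    ticker  varchar(16) not null,
    type    varchar(8)  not null,
    date    date        not null,
    [open]  decimal(18, 4),
    high    decimal(18, 4),
    low     decimal(18, 4),
    [close] decimal(18, 4),
    volume  bigint,
    openInt bigint
);

-- Whether the original index was clustered is not stated; this is one option.
create clustered index ix_history_ticker_date on history (ticker, date);
</code></pre>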
<p>Let&rsquo;s try to beat an extract!</p>
<h2 id="nail-the-index">Nail The Index<a href="#nail-the-index" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s start with an easy case: let&rsquo;s find the yearly average close for Tableau&rsquo;s stock. I&rsquo;ll drag the ticker into filters, years into columns, and Avg(Close) into rows. It&rsquo;s an award-worthy viz:</p>
<p><img src="data.png" alt="data"></p>
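<p>In SQL terms, the viz boils down to something like the following. This is a hand-written sketch rather than the statement from Tableau&rsquo;s log, and the ticker value <code>'DATA'</code> is an assumption about how the data set spells it:</p>
<pre tabindex="0"><code>-- The filter on ticker lets SQL Server seek directly into the
-- (ticker, date) index instead of scanning the whole table.
select year(date) as yr, avg([close]) as avg_close
from history
where ticker = 'DATA'
group by year(date)
order by yr;
</code></pre>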
<p>This is also an almost ideal query for our SQL Server database: it makes excellent use of the index, so the query is exceptionally fast to run; and because the aggregation happens remotely, there are almost no results to send over the wire. By <a href="http://blog.stdin.org/2018/02/13/the-query-behind-the-viz/">looking in the log</a>, I find that it takes a whole 0.006 seconds to run this query and fetch the results. How can we possibly beat that?</p>
<p>Indeed, if we recreate the same viz with an extract, Hyper takes more like 0.2 seconds to compute the viz.</p>
<p><img src="bestcase.png" alt="bestcase"></p>
<p>So SQL Server is faster than Hyper? Well, in this case it is, but we&rsquo;ve almost cheated by practically tuning it to answer <em>this query</em> quickly. Hyper, on the other hand, doesn&rsquo;t require (and doesn&rsquo;t allow) us to tune its setup. So we&rsquo;re comparing the <em>best case</em> for SQL Server to <em>a</em> case for Hyper.</p>
<p>But the lesson is still sound: if your query (a) lines up well with the setup of your remote database, and (b) transfers very little data, then a live connection can actually beat a Hyper extract.</p>
<h2 id="be-truly-ad-hoc">Be Truly Ad Hoc<a href="#be-truly-ad-hoc" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s try to avoid pandering to SQL Server quite so much and just ask for the number of records my data set has each year:</p>
<p><img src="yearly.png" alt="yearly.PNG"></p>
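<p>Again as a hand-written sketch (not the text Tableau actually generates), the query behind this viz looks roughly like the one below. There is no filter on ticker, so the (ticker, date) index can&rsquo;t narrow anything down and every row has to be visited:</p>
<pre tabindex="0"><code>-- Count all rows per calendar year; nothing here can use an index seek.
select year(date) as yr, count(*) as records
from history
group by year(date)
order by yr;
</code></pre>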
<p>Now SQL Server takes a bit longer: 5.53 seconds. Trying this against the extract shows what Hyper can do: 0.193 seconds. In this case, both engines have to do roughly the same amount of work, but with its column-based, in-memory execution, Hyper is the clear winner!</p>
<p>Except that we haven&rsquo;t taken into account the cost of generating the extract. When we refresh it, we find that it takes us 67.8 seconds to generate a 435MB extract. If we add that in, SQL Server starts looking pretty good:</p>
<p><img src="adhoc.png" alt="adhoc"></p>
<p>Applying a little algebra (67.8 / (5.53 - 0.193)), that means that to recoup the cost of our extract, we&rsquo;d need to run our viz query about 13 times. Often this will be worth it, but if the query is truly one-off, I&rsquo;d rather spend 5.53 seconds than 68.</p>
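<p>The break-even arithmetic can be spelled out using only the timings quoted in this section. The exact figure depends on which timings you plug in, so treat this as a sketch of the method rather than a measurement:</p>
<pre tabindex="0"><code>-- Extract: 67.8 s up front, then 0.193 s per query.
-- Live:    5.53 s per query.
-- Break-even n solves 67.8 + 0.193 * n = 5.53 * n.
select 67.8 / (5.53 - 0.193) as break_even_runs;  -- roughly 12.7
</code></pre>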
<h2 id="blow-up-the-extract">Blow Up the Extract<a href="#blow-up-the-extract" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Let&rsquo;s try something more horrible. Let&rsquo;s say that in addition to the historical stock prices, we have a table of customer holdings. We&rsquo;ll keep it simple; our customers have static holdings that look like:</p>
<p><code>customerholdings(customer, ticker, amount)</code></p>
<p>(I don&rsquo;t actually have any customers, so I randomly generated 20 holdings for each of 20,000 imaginary customers.)</p>
<p>We want to do things like look at the total value of all customers&rsquo; holdings over time, so we join the holdings to the price history.</p>
<p><img src="holdingsjoin.png" alt="holdingsJoin"></p>
<p>We then create a calc to compute the value of each customer&rsquo;s holdings and make a viz:</p>
<p><img src="holdingsviz.png" alt="holdingsViz"></p>
<p>In case you&rsquo;re interested, that giant spike is caused by a few odd stocks like DryShips Inc. (<a href="https://finance.yahoo.com/quote/drys?p=drys">DRYS</a>), which somehow peaked at $1,442,048,636.45 in 2007. I don&rsquo;t comprehend. The graph looks funny, but again, this doesn&rsquo;t matter for our analysis.</p>
<p>What we care about is that this query takes 133 seconds to run—it&rsquo;s a fair bit of work for SQL Server to do. How about the extract?</p>
<p>Well, let&rsquo;s do a little back of the envelope computation. If we execute the full join in SQL Server and don&rsquo;t aggregate anything down, instead of the 17 million records in our history table, the result set will have about 441 million records. And these records are larger than the history rows because they have customer information as well.</p>
<p>Optimistically, this will end up being something like 10 gigabytes of data that I have to move over the wire, and store in a local extract. And that&rsquo;s all before I even get to ask my query. So unless I&rsquo;m doing this a lot, I&rsquo;m simply not going to bother.</p>
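<p>To make the contrast concrete, here is a sketch of the two shapes of query involved. The value calc (<code>amount * close</code>) and the grouping by date are assumptions about how the viz was built:</p>
<pre tabindex="0"><code>-- Live: the join and the aggregation both happen in SQL Server,
-- and only one row per date comes back over the wire.
select h.date, sum(c.amount * h.[close]) as total_value
from customerholdings c
join history h on h.ticker = c.ticker
group by h.date;

-- Extract: the un-aggregated join is what would have to be
-- transferred and stored. This counts its rows.
select count_big(*) as joined_rows
from customerholdings c
join history h on h.ticker = c.ticker;
</code></pre>
<p>At 441 million rows, 10 gigabytes works out to a little over 20 bytes per row, which is why that estimate is on the optimistic side.</p>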
<h2 id="wrapping-up">Wrapping Up<a href="#wrapping-up" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>So we&rsquo;ve seen a few cases where live queries may be preferable to extracts, leaving aside the obvious cases where you simply want the most current data.</p>
<p>One thing we didn&rsquo;t talk about is federated queries: queries that span multiple data sources. As a <em>general rule,</em> federation makes extracts look better relative to live, because live starts to look worse. Live works best when the engine can push operations that reduce data volumes off to the remote system—operations like aggregations and filters—and federation tends to interfere with that pushdown.</p>
<p>But that&rsquo;s another ball of wax. I&rsquo;ll write more on federation soon.</p>
]]></content></item><item><title>The Query Behind the Viz</title><link>https://stdin.org/the-query-behind-the-viz/</link><pubDate>Tue, 13 Feb 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/the-query-behind-the-viz/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>Several posts here have explored the queries Tableau generates as it builds your viz, including <a href="http://blog.stdin.org/2018/02/05/custom-sql-in-tableau-subqueries-and-sql-injection/">last week&rsquo;s write-up on custom SQL</a>. This is a trend that will continue: it&rsquo;s much easier to understand a machine when you can see its inner workings.</p>
<p>But how do I get at those queries? I was talking with <a href="https://twitter.com/YvanFornes">Yvan Fornes</a>, and he suggested that I write about how I do it.</p>
<p>Challenge accepted! Except I may have gone overboard: in this post I&rsquo;ll explore <em>three</em> ways to find the queries underlying your viz.</p>
<h2 id="setup">Setup<a href="#setup" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>To illustrate things, I&rsquo;ve built a very simple viz against a very simple table I have in SQL Server. Here&rsquo;s the data source—twelve rows in all their glory.</p>
<p><img src="findingqueries1.png" alt="findingQueries1"></p>
<p>And here&rsquo;s the viz—I&rsquo;m counting students by class:</p>
<p><img src="findingqueries2.png" alt="findingQueries2"></p>
<p>With these, I&rsquo;m going to show how to find out what query is issued when I refresh the view:</p>
<p><img src="findingqueries3.png" alt="findingQueries3"></p>
<p>I&rsquo;m going to walk through these with my setup, but you can do this with your own favorite viz just as well.</p>
<h2 id="method-1-look-to-the-logs">Method 1: Look to the Logs<a href="#method-1-look-to-the-logs" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Tableau logs all of the queries it issues, plus a lot of diagnostic information along the way. In theory, you could read through these logs with a text editor, but you don&rsquo;t want to do that.</p>
<p>Instead, head on over to <a href="https://github.com/tableau/tableau-log-viewer">GitHub</a>, where you can find the Tableau Log Viewer. You can clone the project if you want to build it yourself, or grab a prebuilt copy from the <a href="https://github.com/tableau/tableau-log-viewer/releases">releases folder</a> like a normal human being.</p>
<p>Once you have this installed, go ahead and start it up. You should see a screen like this:</p>
<p><img src="findingqueries4.png" alt="findingQueries4"></p>
<p>You can use this to read through historical logs, but there are a lot of events, and narrowing it down to what we care about can be hard. Instead, we&rsquo;ll use Live mode and capture the events that result from our refresh as they happen.</p>
<p>So let&rsquo;s capture some logs! If you&rsquo;ve pre-loaded your workbook in Tableau, the first thing to do with TLV is to open your logs, which should be in <code>My Tableau Repository\Logs\log.txt</code>. This will load a lot of noise, but once you have that open, you can:</p>
<ol>
<li>Start live mode.<img src="findingqueries5.png" alt="findingQueries5"></li>
<li>Clear out any history that you have, leaving you with a nice, blank starting state. Don&rsquo;t worry: this will only clear what&rsquo;s loaded in TLV, not the log file on disk.<img src="findingqueries6.png" alt="findingQueries6"></li>
<li>Switch over to Tableau and refresh the data source.</li>
<li>Come back to the Log Viewer and turn off live mode.</li>
</ol>
<p>Now we have a mess of log entries. It can be a bit overwhelming. But we&rsquo;re interested in the queries we&rsquo;re issuing, so let&rsquo;s scan for a <code>begin-query</code> event&hellip;</p>
<p><img src="findingqueries7.png" alt="findingQueries7"></p>
<p>Now we can right-click on it and ask to &ldquo;Highlight all events of this type&rdquo;.</p>
<p><img src="findingqueries8.png" alt="findingQueries8"></p>
<p>This will make it easier to see the relevant events. Even so, there will likely be a few queries to wade through, but it shouldn&rsquo;t be too overwhelming. You can double-click on an entry to see more details. As you hunt through the list, you may see a number of entries like this</p>
<p><img src="findingqueries9.png" alt="findingQueries9"></p>
<p>that query the system for metadata needed to understand the server. But you should also have the query for your data source. In my case, this is the second <code>begin-query</code> element:</p>
<p><img src="findingqueries10.png" alt="findingQueries10"></p>
<p>That&rsquo;s it: if you&rsquo;ve read my blog entry on <a href="http://blog.stdin.org/2018/01/07/dimensions-and-measures-a-sql-perspecitive/">dimensions and measures</a>, you should recognize this as the query populating my simple viz.</p>
<h2 id="method-2-tableau-performance-recorder">Method 2: Tableau Performance Recorder<a href="#method-2-tableau-performance-recorder" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Another Tableau-provided tool is the Performance Recorder built right into Tableau. This has the advantage of being built-in and potentially simpler to use than the Log Viewer.</p>
<p>To use the Performance Recorder for our example, we open up our viz, and then:</p>
<ol>
<li>Start the recording by clicking Help→Settings and Performance→Start Performance Recording.<img src="findingqueries11.png" alt="findingQueries11"></li>
<li>Refresh the data source as before.</li>
<li>Stop the recording by clicking Help→Settings and Performance→Stop Performance Recording.</li>
</ol>
<p>Once you stop the recording, Tableau will take a moment to put the report together, and then open up a Performance Recording viz:</p>
<p><img src="findingqueries12.png" alt="findingQueries12"></p>
<p>In our case, there&rsquo;s not much to see, because the query was so fast (a whole 12 rows!) and by default, the viz filters out events that take less than 0.1 seconds. To see what&rsquo;s going on, we have to edit the filter so that we can see our query—let&rsquo;s see everything:</p>
<p><img src="findingqueries13.png" alt="findingQueries13"></p>
<p>Now we can see a set of query events. We can click on one to see its text. Even though it&rsquo;s quick, the longest duration query is the one we want:</p>
<p><img src="findingqueries14.png" alt="findingQueries14"></p>
<p>This was a simpler task to complete, but I find it easier to understand everything that&rsquo;s going on using the Log Viewer. Ultimately, though, this is up to personal preference and the task at hand.</p>
<h2 id="method-3-database-specific-options">Method 3: Database-Specific Options<a href="#method-3-database-specific-options" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>If you&rsquo;ve been reading along, you may notice that in my posts I&rsquo;ve used <em>neither</em> of the approaches above, and have instead used the <a href="https://docs.microsoft.com/en-us/sql/tools/sql-server-profiler/sql-server-profiler">SQL Server Profiler.</a></p>
<p>The primary advantage to the Profiler is familiarity—for me, anyway: I worked on SQL Server for a number of years, and I know its tools reasonably well. Profiler is also useful if you want to understand what&rsquo;s happening on the database side of the conversation, but for most cases this is likely to be irrelevant.</p>
<p>The downside to Profiler is that it only works with SQL Server. If you have another database, you&rsquo;ll have to learn whatever tools it happens to have. So I&rsquo;d recommend using one of the alternatives above, but I&rsquo;ll give a brief show-and-tell for completeness.</p>
<p>I&rsquo;ll assume that Tableau is already open to your workbook. The first thing to do is to start Profiler and connect it to your SQL Server instance. Next, you want to start a trace—the defaults are good:</p>
<p><img src="findingqueries15.png" alt="findingQueries15"></p>
<p>When you want to record your activity:</p>
<ol>
<li>Clear the current trace in Profiler.<img src="findingqueries16.png" alt="findingQueries16"></li>
<li>Switch over to Tableau and refresh the data source.</li>
<li>Switch back to Profiler and pause the trace.<img src="findingqueries17.png" alt="findingQueries17"></li>
</ol>
<p>Now you have the view of the world from SQL Server&rsquo;s point of view. And as with the Log Viewer, there&rsquo;s a lot here:</p>
<p><img src="findingqueries18.png" alt="findingQueries18"></p>
<p>It can be a little hard to separate the wheat from the chaff. But after a little digging, we find our query buried in a <code>sp_prepexec</code>, and the whole statement starts with a <code>declare</code>:</p>
<p><img src="findingqueries19.png" alt="findingQueries19"></p>
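<p>For orientation, a trace entry of this kind generally has the shape sketched below. The inner query is a stand-in written for illustration, not the text from the screenshot, and the handle value will differ from run to run:</p>
<pre tabindex="0"><code>-- Profiler renders the prepared-statement call as a small batch:
-- the handle variable is declared, the statement is prepared and
-- executed in one step, and the handle is returned.
declare @p1 int
set @p1 = 1
exec sp_prepexec @p1 output, NULL, N'select class, count(*) as n from students group by class'
select @p1
</code></pre>
<p>The part to look for is the string passed as the statement argument: that is the query Tableau sent.</p>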
<h2 id="final-thoughts">Final Thoughts<a href="#final-thoughts" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>There you are: more than you ever wanted to know about finding the queries underlying Tableau. Now maybe you can figure out what LOD calcs are really doing. If not, stay tuned and I&rsquo;ll share—once <em>I</em> wrap my head around them properly.</p>
]]></content></item><item><title>Custom SQL in Tableau: Subqueries and SQL Injection</title><link>https://stdin.org/custom-sql-in-tableau-subqueries-and-sql-injection/</link><pubDate>Mon, 05 Feb 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/custom-sql-in-tableau-subqueries-and-sql-injection/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I recently answered <a href="https://community.tableau.com/thread/258004">a question</a> on the Tableau Community forums that arose from confusion over why some (perfectly correct) SQL wasn&rsquo;t working as custom SQL in Tableau. The poster wanted a list of Tableau&rsquo;s supported syntax.</p>
<p>But as it turns out, that&rsquo;s the wrong question: Tableau doesn&rsquo;t  <em>have</em> a list of all the custom SQL syntax it supports because it really is just passing along the SQL code as you&rsquo;ve typed it.</p>
<p>So why would a perfectly reasonable custom query fail? And what&rsquo;s the link to SQL injection? Read on!</p>
<h2 id="the-issue">The Issue<a href="#the-issue" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>To explore this, I&rsquo;m going to use a really simple table and my favorite database system, SQL Server. Here&rsquo;s the schema for the table:</p>
<pre tabindex="0"><code>students(name, class)
</code></pre><p>Now, if I put a very simple query into Tableau, I get exactly what I&rsquo;d expect:</p>
<p><img src="customsql1.png" alt="customsql1"></p>
<p><img src="customsql2.png" alt="customsql2"></p>
<p>But if I try something a wee bit more interesting, Tableau gives me an error:</p>
<p><img src="customsql3.png" alt="customsql3"></p>
<p><img src="customsql4.png" alt="customsql4"></p>
<p>What gives? How is Tableau screwing up my query? This works fine if I run it directly against SQL Server:</p>
<p><img src="customsql5.png" alt="customsql5"></p>
<p>As it happens, the error doesn&rsquo;t come from Tableau: it comes from SQL Server—because the query is wrong.</p>
<h2 id="what-gives">What Gives?<a href="#what-gives" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>When you put custom SQL into Tableau, Tableau passes it along <em>nearly unadulterated</em> to the target system. But by &ldquo;nearly unadulterated&rdquo; I mean &ldquo;wrapped in a subquery&rdquo;.</p>
<p>So, for example, when I enter:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="k">class</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="k">from</span><span class="w"> </span><span class="n">students</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">class</span><span class="w">
</span></span></span></code></pre></div><p>What actually gets passed along to SQL Server is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">TOP</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">select</span><span class="w"> </span><span class="k">class</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">from</span><span class="w"> </span><span class="n">students</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">class</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>You can see our query in there, but it isn&rsquo;t issued by itself: it&rsquo;s wrapped in an outer <code>SELECT</code> and lives on as a subquery creatively named &ldquo;Custom SQL Query&rdquo;.</p>
<p>I&rsquo;ll come back to why this outer query is there in a second. But first, let&rsquo;s take this whole query and try running it in SQL Server directly:</p>
<p><img src="customsql8.png" alt="customsql8"></p>
<p>This error looks familiar: it&rsquo;s telling us that the <code>count(*)</code> expression in the <code>select</code> needs to have a column name if it&rsquo;s going to live in a subquery. If we fix this by giving the count a name&hellip;</p>
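<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">TOP</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">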
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">select</span><span class="w"> </span><span class="k">class</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="n">a_name</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">from</span><span class="w"> </span><span class="n">students</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="k">class</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>&hellip;then the query will run correctly in SQL Server, and the inner bit will work in Tableau:</p>
<p><img src="customsql10.png" alt="customsql10"></p>
<p>One lesson here is that if you&rsquo;re trying to debug why your query isn&rsquo;t working in Tableau, you can wrap it as a subquery and try debugging it in the underlying database directly. Once you have it working in that context, it will probably work as custom SQL in Tableau.</p>
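<p>A minimal debugging template along those lines, to run directly against SQL Server. The inner query and its <code>student_count</code> alias are just an example; swap in your own custom SQL:</p>
<pre tabindex="0"><code>-- Mimic the wrapper Tableau applies, then run this in the database.
select top 1 *
from (
    -- paste your custom SQL here, exactly as entered in Tableau
    select class, count(*) as student_count
    from students
    group by class
) [Custom SQL Query]
</code></pre>
<p>Besides unnamed columns, this wrapper will also flag things like a bare <code>order by</code> at the end of your query, which SQL Server accepts at the top level but rejects inside a subquery unless <code>top</code> or a similar clause is present.</p>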
<p>Oh, and why the <code>SELECT TOP 1 *</code>? The first thing Tableau wants to do when it&rsquo;s faced with custom SQL is to get the resulting schema, and fetching a single row is a good way to do this. If this succeeds, Tableau will use your query in other combinations, but always as a subquery.</p>
<h2 id="a-sql-injection">A SQL Injection?<a href="#a-sql-injection" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>If all of this makes you think of SQL injection, you&rsquo;re not crazy. A SQL injection attack works by letting someone provide code that gets naively splatted into a SQL statement that&rsquo;s sent along to the database—and that&rsquo;s a lot like what&rsquo;s going on here.</p>
<p><img src="exploits_of_a_mom.png" alt=""> <a href="https://xkcd.com/327/">Obligatory XKCD Reference</a></p>
<p>I&rsquo;m sure that someone out there can do something a whole lot slicker (and more nefarious) than this simple example, but what if we put this garbage SQL into our custom SQL?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="n">SUPER_NEFARIOUS_TABLE</span><span class="w"> </span><span class="p">(</span><span class="n">EVIL_COLUMN</span><span class="w"> </span><span class="nb">int</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="w">
</span></span></span></code></pre></div><p>This is clearly malformed SQL all by itself, but remember that it&rsquo;s going to be inserted into another query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">TOP</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="n">SUPER_NEFARIOUS_TABLE</span><span class="w"> </span><span class="p">(</span><span class="n">EVIL_COLUMN</span><span class="w"> </span><span class="nb">int</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>Now this looks more like SQL. In fact, it&rsquo;s three separate SQL commands:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">TOP</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">create</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="n">SUPER_NEFARIOUS_TABLE</span><span class="w"> </span><span class="p">(</span><span class="n">EVIL_COLUMN</span><span class="w"> </span><span class="nb">int</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="p">(</span><span class="k">select</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">foo</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="n">Custom</span><span class="w"> </span><span class="k">SQL</span><span class="w"> </span><span class="n">Query</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>What happens when Tableau executes this? It gets its data from the first query, but the second and third queries run as well, and number two is a little suspicious. That <code>create table</code> statement <em>will</em> actually create a table on your target database (assuming you have the proper permissions) right from custom SQL.</p>
<p>Of course, here I&rsquo;m just creating a table, but I could drop one just as easily.</p>
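<p>To make the splicing concrete, here&rsquo;s a minimal Python sketch of the kind of naive templating described above. The <code>WRAPPER</code> string and the <code>wrap_custom_sql</code> function are hypothetical names, modeled on the wrapped query shown earlier; I&rsquo;m not claiming this is how Tableau builds its queries internally.</p>

```python
# Hypothetical wrapper, modeled on the wrapped query shown above.
# The template Tableau really uses is an assumption here.
WRAPPER = "SELECT TOP 1 *\nFROM (\n{custom_sql}\n) [Custom SQL Query]"


def wrap_custom_sql(custom_sql):
    """Naively splice user-supplied SQL into the wrapper, with no validation."""
    return WRAPPER.format(custom_sql=custom_sql)


# The "garbage" custom SQL from the post: malformed on its own.
payload = "\n".join([
    "select 1 as foo) [Custom SQL Query]",
    "create table SUPER_NEFARIOUS_TABLE (EVIL_COLUMN int)",
    "select 1 from (select 1 as foo",
])

composed = wrap_custom_sql(payload)
print(composed)
```

<p>Nothing here touches a database; it only shows that plain string substitution has no idea where one statement ends and the next begins.</p>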
<h2 id="stay-safe">Stay Safe<a href="#stay-safe" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Here&rsquo;s the good news: I&rsquo;m just not that clever.</p>
<p>While I&rsquo;ve injected some potentially nefarious SQL into my custom SQL, so far all I&rsquo;ve really done is stage a SQL injection attack on myself.</p>
<p>Naturally, this would be harmful if I could get  <em>you</em> to run the query. What if I sent you a present: a nice, simple, harmless, cuddly, nonthreatening .twb&hellip;</p>
<p><img src="drevil_million_dollars.jpg" alt="Drevil_million_dollars"></p>
<p>&hellip;with laser beams! Or better yet, nasty custom SQL.</p>
<p>But (alas!) I&rsquo;m not the first one to figure out this potential attack vector, and some smart engineers at Tableau decided to warn you when custom SQL is present in your .twb. So when you get a suspicious workbook—any workbook with custom SQL—you&rsquo;ll be confronted with a warning like this:</p>
<p><img src="customsql7.png" alt="customsql7"></p>
<p>If you see this warning, you should take note and make sure you understand what that custom SQL is doing before you proceed.</p>
<p>And to be doubly safe, don&rsquo;t accept any .twb from me.</p>
]]></content></item><item><title>Row-Level Security: A Cautionary Tale</title><link>https://stdin.org/row-level-security-a-cautionary-tale/</link><pubDate>Mon, 29 Jan 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/row-level-security-a-cautionary-tale/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>Row-level security is a common requirement for people trying to control access to data. Some systems provide this natively, but when it&rsquo;s not provided, people often roll their own using the tools they have, with mixed results.</p>
<p>In this post we&rsquo;ll explore a common way to implement row-level security on top of a relational database and see why it may not be as secure as it looks.</p>
<h2 id="a-pop-quiz">A Pop Quiz<a href="#a-pop-quiz" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>But before we get to the crux of the issue, here&rsquo;s a quick quiz. I promise it&rsquo;s relevant.</p>
<p>What will each of the following languages do when <code>a</code> is equal to <code>0</code>?</p>
<ol>
<li>C, C++, C#, Java, and most other C-family languages:<br>
<code>if (a != 0 &amp;&amp; 1/a &gt; 0) { /* Do something */ }</code></li>
<li>Pascal:<br>
<code>IF (a &lt;&gt; 0) AND (1/a &gt; 0) THEN (* Do something *)</code></li>
<li>SQL:<br>
<code>SELECT *</code><br>
<code>FROM T</code><br>
<code>WHERE a &lt;&gt; 0 AND 1/a &gt; 0</code></li>
</ol>
<p>Obviously, I&rsquo;m asking about short circuiting behavior. I&rsquo;ll let you ponder and reveal the answers in a moment. But first, back to row-level security.</p>
<h2 id="the-setup">The Setup<a href="#the-setup" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Imagine that we have a table of sensitive customer information:</p>
<pre tabindex="0"><code>customers:
id  ssn        balance
--- ---------- --------
1   123456789  150.00
2   234567890  250.00
3   345678901  350.00
4   456789012  450.00
5   567890123  550.00
6   678901234  650.00
7   789012345  750.00
8   890123456  850.00
9   901234567  950.00
</code></pre><p>(Apologies if I&rsquo;ve exposed your SSN&hellip;)</p>
<p>We want to provide our employees access, but only to  <em>their</em> customers, not the entire set.</p>
<p>A common way to do this on a system that doesn&rsquo;t have built-in row-level security is to (a) add a security table that expresses which rows each user is allowed to see, (b) build a view that uses this security table to restrict the rows that each user sees, and (c) force everyone to access the data through the view.</p>
<p>So, I first create a security table that maps users to the customers they can see. For example:</p>
<pre tabindex="0"><code>access:
uid    cid
------ -----
alice  1
alice  2
alice  3
isaac  4
isaac  5
isaac  6
bob    7
bob    8
bob    9
</code></pre><p>This means, e.g., that <code>isaac</code> should only be able to see customers numbered 4, 5, and 6. To enforce this we create a view:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sec_customers</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">CUSTOMERS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="k">IN</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="n">cid</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">access</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">uid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">USER_NAME</span><span class="p">())</span><span class="w">
</span></span></span></code></pre></div><p>I&rsquo;m showing this with SQL Server, so I&rsquo;m using the built-in function <code>USER_NAME()</code> to dynamically modify the query based on the user who accesses the view. The specifics here will vary system-to-system, but you should be able to accomplish something similar.</p>
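<p>If you don&rsquo;t have a SQL Server instance handy, here&rsquo;s a small sketch of the same setup using Python&rsquo;s built-in <code>sqlite3</code> module. SQLite has no <code>USER_NAME()</code>, so a one-row <code>session</code> table stands in for the current user; that table and the <code>visible_ids</code> helper are assumptions of this sketch, not part of the original setup. SQLite also has no per-user permissions, so nothing here actually blocks access to the base table.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, ssn INTEGER, balance REAL);
    CREATE TABLE access (uid TEXT, cid INTEGER);
    -- Stand-in for USER_NAME(): a one-row table naming the current user.
    CREATE TABLE session (uid TEXT);

    CREATE VIEW sec_customers AS
    SELECT * FROM customers
    WHERE id IN (SELECT cid FROM access
                 WHERE uid = (SELECT uid FROM session));
""")

# Same sample data as the tables shown above.
ssns = [123456789, 234567890, 345678901, 456789012, 567890123,
        678901234, 789012345, 890123456, 901234567]
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, ssn, 50.0 + 100.0 * i) for i, ssn in enumerate(ssns, start=1)],
)
conn.executemany(
    "INSERT INTO access VALUES (?, ?)",
    [("alice", 1), ("alice", 2), ("alice", 3),
     ("isaac", 4), ("isaac", 5), ("isaac", 6),
     ("bob", 7), ("bob", 8), ("bob", 9)],
)


def visible_ids(user):
    """Return the customer ids the given user sees through the view."""
    conn.execute("DELETE FROM session")
    conn.execute("INSERT INTO session VALUES (?)", (user,))
    rows = conn.execute("SELECT id FROM sec_customers ORDER BY id")
    return [row[0] for row in rows]


print(visible_ids("isaac"))
```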
<p>We&rsquo;ll restrict access to the base table, and let users only come in through the view. Now when <code>isaac</code> selects everything from <code>sec_customers</code>, all he sees is:</p>
<pre tabindex="0"><code>id  ssn        balance
--- ---------- --------
4   456789012  450.00
5   567890123  550.00
6   678901234  650.00
</code></pre><h2 id="the-punchline">The Punchline<a href="#the-punchline" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Pretty good! But before we celebrate, let&rsquo;s look at the answers to our quiz:</p>
<ol>
<li>In most C-family languages, compound predicates like this short circuit: the system will check whether <code>a != 0</code> and only execute the <code>1/a &gt; 0</code> bit if <code>a</code> isn&rsquo;t zero. The body of our conditional won&rsquo;t be executed, but life will go on as usual.</li>
<li>In Pascal there is no short circuiting: the system will  <em>always</em> execute all of the parts of the compound predicate. So if <code>a</code> is zero, the system will throw a divide-by-zero error.</li>
<li>In a particularly awesome twist of semantics, SQL short circuits <em>but doesn&rsquo;t guarantee order of operations</em>. So this code  <em>may</em> execute fine if it tests <code>a &lt;&gt; 0</code> first, or it may throw an exception if it tries the division first—you&rsquo;re at the whim of the optimizer.</li>
</ol>
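<p>If you want to poke at the three behaviors yourself, here&rsquo;s a rough Python sketch. Python itself short circuits, so <code>eager</code> and <code>reordered</code> are hand-built stand-ins (my names, not anything from Pascal or SQL) for a language that evaluates both operands and for an optimizer that happens to pick the other order.</p>

```python
def short_circuit(a):
    # Python's `and` evaluates left to right and stops at the first false operand.
    return a != 0 and 1 / a > 0


def eager(a):
    # Evaluate both operands before combining them, as a language
    # without short circuiting would.
    left = a != 0
    right = 1 / a > 0  # raises ZeroDivisionError when a == 0
    return left and right


def reordered(a):
    # Same predicate with the operands swapped, standing in for an
    # optimizer that is free to choose the evaluation order.
    return 1 / a > 0 and a != 0


def outcome(predicate, a):
    """Run a predicate and report a division error instead of raising it."""
    try:
        return predicate(a)
    except ZeroDivisionError:
        return "divide by zero"


for predicate in (short_circuit, eager, reordered):
    print(predicate.__name__, outcome(predicate, 0))
```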
<p>What does this have to do with row-level security? As I mentioned <a href="https://stdin.org/tableau-data-sources-live-vs-extract/">when discussing extract types in Tableau</a>, when you write a query against a (virtual) view in SQL, your query is composed with the view query, and this whole thing is then optimized. But SQL doesn&rsquo;t generally respect the order of the operations you&rsquo;ve written down, and this disregard runs deep. There is no &ldquo;query boundary&rdquo; when you compose queries: your operations can get shuffled around anywhere in the plan.</p>
<p>And that&rsquo;s a problem when it comes to security.</p>
<p>To illustrate, let&rsquo;s try another query against my &ldquo;secured&rdquo; customer table. I&rsquo;m going to guess the Social Security number of a customer that I shouldn&rsquo;t have access to, and see what I can find with a little SQL.</p>
<p>If I guess incorrectly, everything works as we&rsquo;d expect:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">sec_customers</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="n">ssn</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">789012346</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w">
</span></span></span></code></pre></div><pre tabindex="0"><code>id  ssn        balance
--- ---------- --------
4   456789012   450.00
5   567890123   550.00
6   678901234   650.00
</code></pre><p>But if I guess correctly:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">sec_customers</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="n">ssn</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">789012345</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w">
</span></span></span></code></pre></div><pre tabindex="0"><code>id  ssn        balance
--- ---------- --------
4   456789012  450.00
5   567890123  550.00
6   678901234  650.00
Msg 8134, Level 16, State 1, Line 16
Divide by zero error encountered.
</code></pre><p>And now I know that <em>a</em> customer has an SSN of 789012345: I&rsquo;ve leaked information that I shouldn&rsquo;t have leaked. And with a little work, I may be able to narrow this down to a particular customer.</p>
<p>What happened? The query plan makes it clearer:</p>
<p><img src="pushdown1.png" alt="pushdown"></p>
<p>The syntax implies that customers I don&rsquo;t have access to will be filtered out before they hit my query. But the optimizer has reordered the operations: the security filter is enforced by the join, and my predicate has been &ldquo;pushed down&rdquo; and folded into the table scan. The result is that the predicate sees the entire customer table, which results in an information-leaking exception.</p>
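<p>The reordering can be mimicked in plain Python. This is only a toy model of the plan above, not SQL Server&rsquo;s optimizer: <code>as_written</code> applies the security filter first, <code>pushed_down</code> folds the predicate into the scan, and all of the names are invented for illustration.</p>

```python
# (id, ssn) pairs from the customers table, and the ids isaac may see.
CUSTOMERS = [
    (1, 123456789), (2, 234567890), (3, 345678901),
    (4, 456789012), (5, 567890123), (6, 678901234),
    (7, 789012345), (8, 890123456), (9, 901234567),
]
ALLOWED = {4, 5, 6}


def guess_predicate(ssn, guess):
    # Mirrors 1/(ssn - guess) = 0 with integer division that truncates
    # toward zero; raises ZeroDivisionError when ssn == guess.
    return int(1 / (ssn - guess)) == 0


def as_written(guess):
    # Security filter first, then the user's predicate.
    visible = [row for row in CUSTOMERS if row[0] in ALLOWED]
    return [row for row in visible if guess_predicate(row[1], guess)]


def pushed_down(guess):
    # User's predicate folded into the scan of the whole table,
    # with the security filter applied afterwards.
    scanned = [row for row in CUSTOMERS if guess_predicate(row[1], guess)]
    return [row for row in scanned if row[0] in ALLOWED]


print(as_written(789012345))
try:
    print(pushed_down(789012345))
except ZeroDivisionError:
    print("Divide by zero: some customer has that SSN")
```

<p>Both orderings return the same rows when no exception is raised, which is exactly why an optimizer is allowed to pick either one. The difference only shows up in the error.</p>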
<h2 id="coda">Coda<a href="#coda" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>I&rsquo;ve shown this with SQL Server, but this isn&rsquo;t a criticism of that system. Aside from the specifics around identifying the user, the mechanism I&rsquo;ve shown is likely to apply to any database with an optimizer that can rearrange operations, which is to say any database worth its salt.</p>
<p>That said, while SQL Server did add first-class row-level security in SQL Server 2016, it&rsquo;s likely that it&rsquo;s using a similar mechanism under the hood. As <a href="https://docs.microsoft.com/en-us/sql/relational-databases/security/row-level-security">Microsoft notes</a>:</p>
<blockquote>
<p><strong>Carefully crafted queries:</strong> It is possible to cause information leakage through the use of carefully crafted queries. For example, <code>SELECT 1/(SALARY-100000) FROM PAYROLL WHERE NAME='John Doe'</code> would let a malicious user know that John Doe&rsquo;s salary is $100,000. Even though there is a security predicate in place to prevent a malicious user from directly querying other people&rsquo;s salary, the user can determine when the query returns a divide-by-zero exception.</p>
</blockquote>
<p>Is this a worry? Well, I suppose it depends. But I&rsquo;d be reluctant to rely on this mechanism if I had a real concern about exposing data.</p>
]]></content></item><item><title>Tableau Data Sources: Live vs Extract</title><link>https://stdin.org/tableau-data-sources-live-vs-extract/</link><pubDate>Fri, 12 Jan 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/tableau-data-sources-live-vs-extract/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>Continuing <a href="http://blog.stdin.org/2018/01/07/dimensions-and-measures-a-sql-perspecitive/">last week&rsquo;s trend</a>, we&rsquo;ll again take a look at an aspect of Tableau that people often find confusing: the difference between live and extracted data sources. And again, we&rsquo;re going to take a bit of a database perspective to clarify the situation.</p>
<p>The impetus for this post is a number of statements I&rsquo;ve seen along the lines of:</p>
<blockquote>
<p>A live data source is just a real-time extract of your data.</p>
</blockquote>
<p>This is my favorite kind of wrong: subtly wrong.</p>
<p>We&rsquo;ll come back to extracts shortly, but first I want to take a digression through  <em>views</em>. That will be our database perspective for the day.</p>
<h2 id="the-database-perspective">The Database Perspective<a href="#the-database-perspective" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>For the purpose of this example, we&rsquo;re going to imagine that we have two super-creative tables named <em>products</em> and <em>sales</em>. For this post, we&rsquo;re not going to care much about the data, but let&rsquo;s assume a schema for each of these tables:</p>
<pre tabindex="0"><code>products(pid, description, price)
sales(pid, customer, count)
</code></pre><p>Here, pid is the product identifier—and the join key between the two tables.</p>
<p>We might want to ask questions of these data like, &ldquo;how much did each customer spend?&rdquo; If we were going to jump right in and use these in Tableau, we&rsquo;d start by creating a new data source that joins these tables together. After joining the tables, we&rsquo;d also create a new column with the total spend for each sale.</p>
<p>But we&rsquo;re going to stick to database-land, so we&rsquo;ll write this data source as a SQL query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="p">,</span><span class="w"> </span><span class="n">description</span><span class="p">,</span><span class="w"> </span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">customer</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">count</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">price</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total_sale</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">sales</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sales</span><span class="p">.</span><span class="n">pid</span><span class="w">
</span></span></span></code></pre></div><p>To use this in a viz, we need to be able to run other queries against the results of this query. To do that, we need to create a <em>view</em>, which <em>caches</em> the query and gives it a name that we can write other queries against:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">full_sales</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="p">,</span><span class="w"> </span><span class="n">description</span><span class="p">,</span><span class="w"> </span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">customer</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">count</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">price</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total_sale</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">sales</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sales</span><span class="p">.</span><span class="n">pid</span><span class="w">
</span></span></span></code></pre></div><p>Going forward, we&rsquo;ll call this our &ldquo;data-source query&rdquo;.</p>
<p>Now we could use this in Tableau to create a viz that looked at total sales per customer. Using <a href="http://blog.stdin.org/2018/01/07/dimensions-and-measures-a-sql-perspecitive/">what we learned last time</a>, we can also write the query that our viz will generate as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">customer</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">total_sale</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">full_sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">customer</span><span class="w">
</span></span></span></code></pre></div><p>We&rsquo;ll call this our &ldquo;viz query&rdquo;.</p>
<p>So far this has all been a long preface for the real meat of this post. We want to understand what exactly the database engine does with this last query. And the answer depends a lot on what we actually mean by &ldquo;caching a query&rdquo;.</p>
<h2 id="the-meat-of-the-post">The Meat of the Post<a href="#the-meat-of-the-post" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>When we said we&rsquo;d cache a query, we could have meant two very different things. The simplest is that we could cache the  <em>result</em> of the query. I.e., when we create the view, we could effectively create a new table called full_sales, run the data-source query, and populate full_sales with the result.</p>
<p>This is sometimes called a <em>materialized view</em>, and makes it pretty simple to understand what our viz query does: it just runs against the data in the table we&rsquo;ve cached.</p>
<p>But there&rsquo;s another possibility: we could cache the query itself. I.e., when we create the view, we won&rsquo;t actually run the query at all; we&rsquo;ll just stash the SQL and use it to produce results on demand later.</p>
<p>This is generally called a <em>virtual view</em>, and it would be pretty uninteresting if we just reran the data-source query every time we used it. But that&rsquo;s not what databases do.</p>
<p>Instead, when we run our viz query against a virtual view, the database engine will <em>compose</em> the two queries together: it will effectively do an internal rewrite of the query so that it looks something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">customer</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">total_sale</span><span class="p">)</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">SELECT</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="p">,</span><span class="w"> </span><span class="n">description</span><span class="p">,</span><span class="w"> </span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">customer</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">count</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">price</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total_sale</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">sales</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="p">.</span><span class="n">pid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sales</span><span class="p">.</span><span class="n">pid</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">temp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">customer</span><span class="w">
</span></span></span></code></pre></div><p>The full_sales reference has just been rewritten with the subquery that defined it.</p>
<p>There are a few relatively obvious trade-offs between these approaches:</p>
<ul>
<li>A materialized view can take a lot of space, depending on how much data the query produces.</li>
<li>A virtual view takes almost no space at all, since all it stores is the query.</li>
<li>A materialized view may take a while to build, but can ultimately save time if it precomputes data that speed up the queries that are run against it.</li>
<li>A virtual view takes almost no time to create, but also doesn&rsquo;t precompute anything.</li>
</ul>
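<p>As a rough illustration of these trade-offs, here is a minimal sketch using Python and an in-memory SQLite database. The table layouts, the sample rows, and the names full_sales_virtual and full_sales_materialized are assumptions made up for this example. SQLite has no true materialized views, so the materialized case is emulated with a table that stores the query output.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and rows, assumed for illustration only.
    CREATE TABLE products (pid INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales (pid INTEGER, customer TEXT, count INTEGER);
    INSERT INTO products VALUES (1, 2.0), (2, 5.0);
    INSERT INTO sales VALUES (1, 'ann', 3), (2, 'ann', 1), (1, 'bob', 4);

    -- Virtual view: the database stores only the query text.
    CREATE VIEW full_sales_virtual AS
        SELECT customer, count, count * price AS total_sale
        FROM products JOIN sales s ON products.pid = s.pid;

    -- Emulated materialized view: the query output is stored as rows.
    CREATE TABLE full_sales_materialized AS
        SELECT * FROM full_sales_virtual;
""")

# The same (assumed) viz query can run against either one.
viz_query = (
    "SELECT customer, SUM(total_sale) FROM {} "
    "GROUP BY customer ORDER BY customer"
)
virtual_result = conn.execute(viz_query.format("full_sales_virtual")).fetchall()
materialized_result = conn.execute(
    viz_query.format("full_sales_materialized")
).fetchall()

# How each one is stored: a 'view' holds SQL text, a 'table' holds data.
stored_as = dict(conn.execute(
    "SELECT name, type FROM sqlite_master WHERE name LIKE 'full_sales%'"
))
materialized_rows = conn.execute(
    "SELECT COUNT(*) FROM full_sales_materialized"
).fetchone()[0]

print(virtual_result == materialized_result)  # same answer either way
print(stored_as, materialized_rows)
```

<p>Both forms answer the viz query identically. What differs is where the work and the storage go: the materialized copy holds every row the query produced, while the virtual view holds nothing but its definition.</p>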
<p>But this misses the main event: in many cases, the fully-composed query will be <em>much</em> faster to execute than each half individually.</p>
<p>There are two big reasons for this. First, because the database system gets the whole, composed query at once, it can optimize the whole thing globally. To give a simple example, assume that the viz query filters out a lot of the underlying data; maybe the viz only shows information on a single customer. With separate view and viz queries, the data-source query has to produce data for <em>all</em> the customers, just to have most of them filtered out by the viz query. By considering them together, that filter can be <em>pushed down</em> and executed early, short-cutting a lot of the work.</p>
<p>Second, assume that the data-source query was executed on one system (say, a SQL Server database) and the viz query was executed somewhere else (like Tableau). If the data-source query produces a lot of data, all of it needs to be sent over the network. If the queries can be composed and the whole thing can be executed by the source database, then only the final viz query results need to be sent. And through aggregation, Tableau viz queries usually reduce the data <em>substantially</em>.</p>
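<p>To make the pushed-down filter concrete, here is a small sketch in Python with SQLite. The schema, the sample rows, the customer name 'ann', and the SUM(total_sale) aggregate are all assumptions for illustration. It writes the filter in both positions by hand and checks that the answers agree, which is why an optimizer is free to pick the cheaper placement.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and rows, assumed for illustration only.
    CREATE TABLE products (pid INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales (pid INTEGER, customer TEXT, count INTEGER);
    INSERT INTO products VALUES (1, 2.0), (2, 5.0);
    INSERT INTO sales VALUES (1, 'ann', 3), (2, 'ann', 1), (1, 'bob', 4);
""")

# Filter applied on top of the data-source subquery, as the viz states it.
composed = """
    SELECT customer, SUM(total_sale)
    FROM (SELECT customer, count, count * price AS total_sale
          FROM products JOIN sales s ON products.pid = s.pid) AS temp
    WHERE customer = 'ann'
    GROUP BY customer
"""

# Same filter pushed down into the subquery, so rows for other
# customers are discarded before the join output is built.
pushed_down = """
    SELECT customer, SUM(total_sale)
    FROM (SELECT customer, count, count * price AS total_sale
          FROM products JOIN sales s ON products.pid = s.pid
          WHERE customer = 'ann') AS temp
    GROUP BY customer
"""

composed_result = conn.execute(composed).fetchall()
pushed_down_result = conn.execute(pushed_down).fetchall()
print(composed_result == pushed_down_result)
```

<p>The rewrite is only possible when the database sees both halves in one query. If the data-source query had already been run on its own, there would be nothing left to push the filter into.</p>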
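<p>The difference in data volume can be sketched the same way. The example below, again Python with SQLite, uses 1,000 made-up sales spread over 5 made-up customers and compares how many rows each approach would have to ship. The schema and the SUM(total_sale) viz query are assumptions, not the queries Tableau would actually issue.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema, assumed for illustration only.
    CREATE TABLE products (pid INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales (pid INTEGER, customer TEXT, count INTEGER);
    INSERT INTO products VALUES (1, 2.0), (2, 5.0);
""")

# 1,000 made-up sales spread over 5 made-up customers.
sales = [(i % 2 + 1, "customer_{}".format(i % 5), 1) for i in range(1000)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales)

# What would cross the network if the data-source query ran on its own.
data_source_query = """
    SELECT customer, count, count * price AS total_sale
    FROM products JOIN sales s ON products.pid = s.pid
"""

# What crosses the network if the composed query runs at the source.
composed_query = """
    SELECT customer, SUM(total_sale)
    FROM ({}) AS temp
    GROUP BY customer
""".format(data_source_query)

extract_rows = conn.execute(data_source_query).fetchall()
viz_rows = conn.execute(composed_query).fetchall()
print(len(extract_rows), len(viz_rows))
```

<p>With this sample data the data-source query returns one row per sale, while the composed query returns one row per customer.</p>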
<p>For these reasons, running the viz query on top of the virtual view may perform significantly better than taking an extract  <em>and then</em> running the viz query separately.</p>
<h2 id="bringing-this-back-to-tableau">Bringing This Back to Tableau<a href="#bringing-this-back-to-tableau" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>With the heavy lifting complete, the connection back to Tableau is straightforward:</p>
<ul>
<li>An <em>extracted data source</em> is a materialized view: the data-source query is run to pull data from your database, and the data are materialized into a TDE or (with Tableau 10.5 or higher) a Hyper database. Viz queries then run against this materialized extract.</li>
<li>A  <em>live data source</em> is a virtual view: the data source just stores the query, and Tableau composes it on the fly with whatever query your viz generates. This composed query is issued in one shot to your database.</li>
</ul>
<p>Done. But now you can see how the explanation at the top is subtly wrong.</p>
<blockquote>
<p>A live data source is just a real-time extract of your data.</p>
</blockquote>
<p>If this were true, then every time you modified or refreshed your workbook, the system would  <em>first</em> run the data-source query and suck down the results, and  <em>then</em> run the viz query to aggregate it all down. But this striation doesn&rsquo;t happen: with a live data source, the whole view+viz query is composed, and this whole query is then executed. Every time.</p>
<p>In the right circumstances, this can mean the query that&rsquo;s run is more optimized and flows much less data down from the database. And that can perform very well.</p>
<p>Other times, the cost to materialize the extract is worth it because the construction cost is amortized over a whole lot of viz queries, all of which benefit from the extract. But that would be for naught if the extract were re-created for each viz query.</p>
<h2 id="coda">Coda<a href="#coda" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round" class="feather">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>After all of this, it&rsquo;s worth popping up to recognize the obvious benefit to live data sources: liveliness. Since live data sources go directly against your database, they see the changes to your data as those changes occur. This can be valuable unto itself, regardless of the performance trade-offs.</p>
]]></content></item><item><title>Dimensions and Measures: A SQL Perspective</title><link>https://stdin.org/dimensions-and-measures-a-sql-perspecitive/</link><pubDate>Sun, 07 Jan 2018 00:00:00 +0000</pubDate><author>Isaac</author><guid>https://stdin.org/dimensions-and-measures-a-sql-perspecitive/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I thought I&rsquo;d kick this off gently. I remember going through Boot Camp after joining <a href="https://www.tableau.com/">Tableau</a> and learning about dimensions and measures. And I remember finding the descriptions rather confusing.</p>
<p>I don&rsquo;t recall the precise phrasing, but it went something like <a href="https://www.safaribooksonline.com/library/view/tableau-data-visualization/9781849689786/ch01s09.html">this</a>:</p>
<blockquote>
<p>Dimensions are usually those fields that cannot be aggregated; measures, as its [sic] name suggests, are those fields that can be measured, aggregated, or used for mathematical operations.</p>
</blockquote>
<p>Or <a href="https://www.interworks.com/blog/mtreadwell/2013/11/20/tableau-pills-measures-and-dimensions">this</a>:</p>
<blockquote>
<p>Measures are the result of a business process event&hellip; Dimensions are reference variables that give context to measures.</p>
</blockquote>
<p>I don&rsquo;t really mean to criticize these definitions, but to a database guy, they seem rather <em>imprecise</em>. For someone with a little SQL know-how, the actual definition is both crisp and helpful in understanding what Tableau really does under the covers—this helps  <em>predict</em> what actions in the UI will do, so you don&rsquo;t just blindly drag-and-drop until things look right.</p>
<p>The rest of this post is a crisp explanation of dimensions and measures  <em>for someone who knows a little SQL.</em></p>
<p>Tableau does a lot of things, but at its core—or what I like to think of as its core—it&rsquo;s an aggregation engine: you tell it how to slice-and-dice your data and what aggregates to apply, and it does so. Then there is some visualization on top. That&rsquo;s nice, too.</p>
<p>What this means is that so long as we confine ourselves to &ldquo;simple&rdquo; things, and our data are in a table T, then Tableau is going to produce a query that looks like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">some_stuff</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">T</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">some_other_stuff</span><span class="w">
</span></span></span></code></pre></div><p>When I say &ldquo;simple&rdquo;, I mean that we&rsquo;re going to confine ourselves to dragging fields into the Columns, Rows, and Marks shelves:</p>
<p><img src="students1.png" alt="students1"></p>
<p>To illustrate, I swiped some highly confidential student records from the nearby college and dumped them into SQL Server. Here they are:</p>
<p><img src="students01.png" alt="students0"></p>
<p>And now the meat of the post: Tableau&rsquo;s rule for generating a SQL query from the collection of items on each of the shelves. It&rsquo;s very simple—don&rsquo;t blink:</p>
<ul>
<li>If a field is used as a dimension, then it&rsquo;s added to the GROUP BY and SELECT clauses.</li>
<li>If a field is used as a measure, then it&rsquo;s only added to the SELECT clause, with the appropriate aggregation applied.</li>
</ul>
<p>That&rsquo;s it. The order doesn&rsquo;t matter. The exact shelf doesn&rsquo;t matter. All we care about is whether each field is a dimension or a measure.</p>
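<p>The rule is mechanical enough to write down as code. The sketch below is a hypothetical Python helper, build_query, that applies the two bullets above to plain lists of field names. It is not how Tableau is implemented, and it ignores quoting, aliases, and everything beyond the &ldquo;simple&rdquo; case.</p>

```python
def build_query(table, dimensions, measures):
    """Apply the dimension/measure rule to build a simple SQL string.

    dimensions: list of field names, e.g. ["class", "name"]
    measures:   list of (field, aggregate) pairs, e.g. [("class", "COUNT")]
    """
    # Dimensions appear as-is in SELECT; measures appear aggregated.
    select_items = list(dimensions) + [
        "{}({})".format(aggregate, field) for field, aggregate in measures
    ]
    sql = "SELECT {} FROM {}".format(", ".join(select_items), table)
    # Only dimensions are added to GROUP BY.
    if dimensions:
        sql += " GROUP BY {}".format(", ".join(dimensions))
    return sql


# Two dimensions, no measures.
print(build_query("students", ["class", "name"], []))
# One dimension, one measure aggregated with COUNT.
print(build_query("students", ["name"], [("class", "COUNT")]))
```

<p>Note that the helper never looks at which shelf a field came from. The only inputs are the field names and whether each is a dimension or a measure.</p>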
<p>Let&rsquo;s try this out. We&rsquo;ll just use the rows shelf, and drag out both Class and Name <em>as dimensions</em>. According to our rules, we should then generate:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">class</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">students</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">class</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w">
</span></span></span></code></pre></div><p>Let&rsquo;s see what happens. First, here&rsquo;s what things look like in Tableau:</p>
<p><img src="students21.png" alt="students2"></p>
<p>This looks promising. Now let&rsquo;s look at the query that was generated. There are various ways to do this, including digging through Tableau&rsquo;s logs with the <a href="https://github.com/tableau/tableau-log-viewer">Tableau Log Viewer</a>. I&rsquo;m going to use the <a href="https://docs.microsoft.com/en-us/sql/tools/sql-server-profiler/sql-server-profiler">SQL Server Profiler</a> to get the query.</p>
<p>I won&rsquo;t give all the Profiler details. Tableau issues a bunch of little metadata queries, but if we dig a tiny bit, we find the main course:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="k">class</span><span class="p">]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">[</span><span class="k">class</span><span class="p">],</span><span class="w"> 
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="n">name</span><span class="p">]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">[</span><span class="n">name</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">[</span><span class="n">dbo</span><span class="p">].[</span><span class="n">students</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="k">class</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="n">name</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>And there we have it. A few more characters, but exactly the query we predicted. I&rsquo;ve color coded these examples to make it easier to line up the fields.</p>
<p>But what are those &ldquo;Abc&rdquo;s? Well, Tableau wants to put a mark for each value that it&rsquo;s calculating, and we haven&rsquo;t told it what data to show, so it uses &ldquo;Abc&rdquo; as a placeholder. We can improve our viz a little and get rid of these by moving Name to the &ldquo;Text&rdquo; shelf:</p>
<p><img src="students31.png" alt="students3"></p>
<p>A visualization guru might scoff at this. But I&rsquo;m not a visualization guru. What we care about is the query this generates. And although the visualization has changed, the underlying query hasn&rsquo;t: we&rsquo;re still using both Name and Class as dimensions, and each unique pairing of these values shows up once in the viz.</p>
<p>Let&rsquo;s try to use this knowledge to direct the viz we get. Let&rsquo;s say that we want to know how many classes each student is taking. In SQL, we might write:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="k">class</span><span class="p">),</span><span class="w"> </span><span class="n">name</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">students</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">name</span><span class="w">
</span></span></span></code></pre></div><p>How do we get this in Tableau? Reviewing our rules, we see that we should leave Name as a dimension, and make Class a measure with COUNT as its aggregation.</p>
<p>Let&rsquo;s try it. Here I&rsquo;ve put Name on Rows, and COUNT(Class) on Text:</p>
<p><img src="students41.png" alt="students4"></p>
<p>That looks right: we have one mark for each distinct name, and that mark is the count of classes. Just to confirm, let&rsquo;s go to the SQL:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-SQL" data-lang="SQL"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">COUNT_BIG</span><span class="p">([</span><span class="n">students</span><span class="p">].[</span><span class="k">class</span><span class="p">])</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">[</span><span class="n">cnt</span><span class="p">:</span><span class="k">class</span><span class="p">:</span><span class="n">ok</span><span class="p">],</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="n">name</span><span class="p">]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">[</span><span class="n">name</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">[</span><span class="n">dbo</span><span class="p">].[</span><span class="n">students</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="p">[</span><span class="n">students</span><span class="p">].[</span><span class="n">name</span><span class="p">]</span><span class="w">
</span></span></span></code></pre></div><p>Spot on!</p>
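<p>If you want to try the two queries without SQL Server or Tableau at hand, the following sketch runs the simplified forms in Python with an in-memory SQLite database. The rows are invented for this example and are not the records shown in the screenshots, and plain COUNT stands in for SQL Server&rsquo;s COUNT_BIG.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class TEXT)")
# Invented enrollment rows, assumed for illustration only.
conn.executemany("INSERT INTO students VALUES (?, ?)", [
    ("Elsa", "Math"), ("Elsa", "Physics"), ("Elsa", "History"),
    ("Anna", "Math"), ("Olaf", "History"),
])

# Class and Name both used as dimensions: one row per distinct pairing.
pairs = conn.execute(
    "SELECT class, name FROM students GROUP BY class, name"
).fetchall()

# Name as a dimension, Class as a measure aggregated with COUNT.
counts = conn.execute(
    "SELECT COUNT(class), name FROM students GROUP BY name ORDER BY name"
).fetchall()

print(len(pairs))
print(counts)
```

<p>The first query returns one row per name and class pairing, and the second collapses those down to one row per name.</p>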
<p>Again, we can make a better viz—say, a bar chart—but rest assured, it&rsquo;s just another way of presenting the same data from the same query:</p>
<p><img src="students51.png" alt="students5">(I always knew Elsa was an overachiever.)</p>
<p>To recap, dimensions and measures are really very simple. Dimensions are the things that you group by; they show up in both the GROUP BY and SELECT clauses of the underlying query. And measures are the things you&rsquo;re aggregating; they never show up in the GROUP BY clause, only aggregated in the SELECT clause.</p>
<p>Cheers,<br>
-Isaac</p>
]]></content></item></channel></rss>