Power Query M Primer (Part 12): Tables—Table Think I


Why should you concern yourself with how Power Query “thinks” about tables? After all, you write an expression that outputs the table you want, the mashup engine executes it and everyone is happy without you having to think about how the engine does its thing…right? Yes—at least until you encounter performance problems, values change during processing or a firewall error bites—then what do you do?

Understanding how M processes tables is an important asset in developing efficient mashups, avoiding unexpected data variability and keeping the data privacy layer happy. Streaming, query folding, buffering, table keys, native query caching and the firewall—all of these relate to how the interpreter thinks about/processes/handles tables.

There’s so much to cover, we’ll split the list in two. Let’s tackle the first half (streaming, query folding and buffering) in this post and save the remainder (table keys, native query caching and the firewall) for next time.

Let’s get going!

Series Index

How Data Flows

To start, let’s make sure we have a firm grasp of streaming and query folding.

Imagine you are the mashup engine. How would you execute the below?

let
  Source = SomeDataSourceReturningATable,
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  Result = Table.FirstN(Filtered, 3)
in
  Result

A simple way would be to retrieve all the rows returned by SomeDataSourceReturningATable and save them in variable Source. Then, take the contents of that variable, figure out which rows pass the [Office] = "Chicago" test, and save those in variable Filtered. Lastly, grab the first three rows from Filtered, save them in Result and then return Result.

Logical? Yes. Efficient? No. Why not? For one, there’s resource usage: While at most three rows will be output, the system hosting the mashup engine must have enough capacity to store everything returned from the source (which could be billions of rows). Attempting to do this could lead to that system running out of resources (e.g. memory or disk space).
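To picture this simplistic strategy in M terms: it is roughly what you would get if every step's output were fully materialized before the next step ran. Below is a hypothetical sketch that uses Table.Buffer (covered later in this post) to force that behavior; SomeDataSourceReturningATable is the same placeholder used throughout this article.

let
  // Materialize the source's entire output locally (could be billions of rows!)
  Source = Table.Buffer(SomeDataSourceReturningATable),
  // Materialize the filtered rows, too
  Filtered = Table.Buffer(Table.SelectRows(Source, each [Office] = "Chicago")),
  // Only now take the three rows actually needed
  Result = Table.FirstN(Filtered, 3)
in
  Result

Nothing here is syntactically wrong; it is just wasteful, which is exactly why the engine prefers other techniques.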

Thankfully, Power Query doesn’t handle table expressions in this simplistic way. Instead, M uses streaming, query folding or a combination of these two techniques. We learned about both back in Paradigm (part 5); in this post, we’ll try to hone our understanding by delving deeper into their details and walking through several examples. If the general ideas of what streaming and query folding are aren’t sharp in your mind, it’s probably worth jumping back to part 5 for a refresher before continuing on.

Streaming

Let’s say M executes the above expression using streaming…

When Result’s contents are requested, Table.FirstN in the Result step starts by asking the preceding step (Filtered) for one row of data. When Filtered’s Table.SelectRows receives this request, it turns around and asks step Source for a row of data, which Source provides. When Filtered’s SelectRows receives this row, it checks whether it passes Filtered’s [Office] = "Chicago" test. If so, SelectRows returns the row to Result; if not, it discards the row then requests another, repeating this process until it finds one that passes the test, which it returns to step Result. Once Result’s FirstN has received a row, it outputs that row then turns around and asks Filtered for a second row (because it’s looking for a total of three rows). Filtered’s SelectRows then picks back up where it left off with Source, asking for one row at a time until it finds another that passes the [Office] = "Chicago" test, which it then passes to Result’s FirstN, which then outputs it. Lastly, this process is repeated one more time to retrieve the third row FirstN needs to satisfy its request for three rows.

Each step produces rows one at a time, only requesting as many rows as it needs from the preceding step to produce the requested row. By working with just enough data (vs. the hypothetical simplistic approach we started with which stored the entire output of each step in memory), Power Query is able to handle data sets that are too large to be stored locally and doesn’t waste resources storing rows that ultimately are unnecessary to produce the requested output.
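This on-demand behavior is easy to see with lists, which stream the same way as tables. In the sketch below (pure M, no external source needed), List.Generate describes a conceptually huge sequence, yet asking for its first three items causes only a handful of values to actually be produced:

let
  // A lazily generated sequence of one billion numbers; nothing is produced until requested
  BigList = List.Generate(() => 0, each _ < 1000000000, each _ + 1),
  // Only enough items are generated to satisfy this request
  FirstThree = List.FirstN(BigList, 3)
in
  FirstThree  // {0, 1, 2}

Despite BigList nominally describing a billion values, this query returns almost instantly because downstream steps pull only what they need.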

Internal In-Memory Row Storage

Excepting the data provider (which, for performance reasons, might fetch rows in chunks from the external data source), none of the operations in the preceding example held rows in memory. When a function processed a row, it either passed it on or discarded it.

However, this isn’t true for every operation. Let’s say we add a sort step to our example:

let
  Source = SomeDataSourceReturningATable,
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  Sorted = Table.Sort(Filtered,{{"TotalSales", Order.Descending}}),
  Result = Table.FirstN(Sorted, 3)
in
  Result

Sorting (generally) requires retrieving all rows from the previous step so that they can be put in proper order. When the above expression is executed, Result’s Table.FirstN asks Sorted’s Table.Sort for the first row. To figure out which row to return, Sort gets all rows from Filtered, sorts them, saves the sorted rows in memory, then returns the first row from the sorted set. Each time Sort is asked for a subsequent row, it returns the appropriate row from what it has in memory. After Sort returns a row, it will never need it again, so it can remove that row from its memory (whether or not it purges memory like this is an internal implementation detail—but at least at the theoretical level it’s allowed to do this).

This internal storing of rows in memory is not a persistent cache; rather, it is limited in scope to a single method invocation during a single execution of the query. There is no sharing of these held in-memory rowsets when a function is invoked multiple times, like Table.Sort is below (both List.Sum(Top3[TotalSales]) and List.Average(Top3[TotalSales]) end up calling it). 

let
  Source = SomeDataSourceReturningATable,
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  Sorted = Table.Sort(Filtered, {{"TotalSales", Order.Descending}}),
  Top3 = Table.FirstN(Sorted, 3)
in
  { List.Sum(Top3[TotalSales]), List.Average(Top3[TotalSales]) }

“Which operations hold rows internally?,” you might ask. Table-at-a-time operations, such as joins (though not always nested joins), sorts, grouping and pivot/unpivot, are all suspects, as is (obviously) buffering. (Unfortunately, other than for buffering, I’m not aware of documentation officially detailing this, so anecdotal evidence gathered from testing and answers to forum posts is what we have to go by.)
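As a concrete illustration, grouping is one of these row-holding suspects: Table.Group must see every incoming row before it can emit its first output row. In the sketch below (using the same hypothetical source as the article's other examples), placing the filter before the group means the group operation only ever has to hold the Chicago rows:

let
  Source = SomeDataSourceReturningATable,
  // Filter first, so only Chicago rows reach the row-holding group operation
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  // Table.Group holds its input rows internally while computing the aggregates
  Grouped = Table.Group(Filtered, {"Customer"}, {{"TotalSales", each List.Sum([TotalSales]), type number}})
in
  Grouped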

Above, the holding of rows was described as “in memory.” Keep in mind that memory can be paged to disk. Working with memory paged to disk is much, much, much slower than working with memory stored in RAM. The memory usage point that triggers paging to disk is environment specific. In some environments, paging starts when a query’s total memory use exceeds 256 MB.

Performance

The order of operations can have a significant impact on how much data must be kept in memory. To see this significance, let’s contrast two variations of an expression. Both produce the same output but can differ significantly in the local resources used.

let
  Source = SomeDataSourceReturningATable,
  Sorted = Table.Sort(Source, {{"TotalSales", Order.Descending}}),
  Filtered = Table.SelectRows(Sorted, each [Office] = "Chicago"),
  Result = Table.FirstN(Filtered, 3)
in
  Result
versus:

let
  Source = SomeDataSourceReturningATable,
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  Sorted = Table.Sort(Filtered, {{"TotalSales", Order.Descending}}),
  Result = Table.FirstN(Sorted, 3)
in
  Result

With the first expression, sorting occurs directly after Source, so all rows from Source are held in memory by the sort function. The second expression sorts after the Table.SelectRows filter, so only rows that pass that filter are held by the sort. Say Source contains two billion rows, out of which only 500 pass the [Office] = "Chicago" test. With the first version of the expression, all two billion rows are held by the sort, while with the second, only 500 rows are held. What a difference!

Performance

Tip: When streaming is in play, if your query contains steps that hold rows in memory, try placing any applicable filter steps before the row-holding steps. This way, the filtering steps will reduce the quantity of what needs to be in memory.

Query Folding

Streaming can involve pulling lots of rows which are later discarded. In the example we’ve been using, it’s potentially necessary to stream billions of rows from the source to produce the three requested output rows. If, instead, you directly interacted with the source, you could probably tell it exactly what you wanted and it would produce just that.

If the source were a SQL database, you’d write something like:

SELECT TOP 3 *
FROM Customers
WHERE Office = 'Chicago'
ORDER BY TotalSales DESC;

Or if the source is an OData REST API, your request might look like:

GET /Customers?$filter=Office eq 'Chicago'&$top=3&$orderby=TotalSales desc

In either case, you get back at most 3 results. Potentially billions of rows aren’t sent to you for you to sort through to find the three you want; instead, that processing occurs on the external system (utilizing any indexing or caching it may have) and just the final results are sent back (much less data crossing the wire). It should be intuitively obvious which approach is more efficient. 🙂

Thankfully, M’s query folding offers the ability to leverage the performance of native queries without needing to write them yourself.

Quick recap: Query folding takes one or more Power Query steps and translates them into a native request which is then executed on the source system (again, for a refresher, jump back to part 5, if needed).

With query folding, if our source system is a SQL database, it’s almost as though the example:

let
  Source = SomeDataSourceReturningATable,
  Filtered = Table.SelectRows(Source, each [Office] = "Chicago"),
  Sorted = Table.Sort(Filtered, {{"TotalSales", Order.Descending}}),
  Result = Table.FirstN(Sorted, 3)
in
  Result

Is internally replaced by M’s interpreter with something like:

let 
  Result = Value.NativeQuery(SomeDataSourceReturningATable, "SELECT TOP 3 * FROM Customers WHERE Office = 'Chicago' ORDER BY TotalSales DESC;")
in
  Result

(Technically, the internal mechanism used may work a bit differently, but as far as producing rows go, the net effect is approximately the same.)

In the preceding, the native query to use can be statically deduced simply by looking at the expression steps. Power Query’s query folding can also dynamically factor data in when it produces native requests.

For example, take the below expression, which filters data (MainData) pulled from source A using a list of values (FilterData) retrieved from source B.

let
  MainData = GetFromSourceA(),
  FilterData = GetFromSourceB(),
  Result = Table.SelectRows(MainData, each List.Contains(FilterData[ID], [ID]))
in
  Result

At first glance, the expression in step Result may not look like a candidate for query folding because it combines data from two sources. Instead, it may seem necessary for the mashup engine to retrieve all data from both sources then apply the Table.SelectRows filter locally.

However, Power Query can pull data from one source and write that data into the native request it sends to another source. Say FilterData (from source B) contains only a few rows. Power Query might first pull those few rows locally, then decide to push data from those rows into the native request it sends to source A. For example, pretend FilterData contains three rows and those rows’ ID column values are 1, 2 and 3. Power Query’s query folding might execute step Result by first pulling those values from source B then sending a query like the following query to source A:

SELECT *
FROM SomeTableInSourceA
WHERE ID IN (1, 2, 3); -- these values were pulled from FilterData (source B) then written into this query

The above query tells source A just which rows are needed based on filtering data retrieved from source B. By doing this, it avoids fetching rows that ultimately would have been discarded if the Table.SelectRows filter were applied locally.

Pulling data from one source and then pushing it into a request sent to another can provide performance benefits, but it can also pose security concerns. Power Query’s environment has a mechanism for managing the latter, something we’ll explore shortly.

Folding + Streaming

As we discussed in part 5, not all operations can be query folded. Once a non-foldable operation is encountered in an expression chain, any potentially foldable operations that come after it won’t be folded back into the data source that started the chain because the non-foldable operation blocks that folding.

Query folding does not eliminate streaming; rather it folds certain steps into a native request whose results are then streamed to any subsequent steps that weren’t query folded. To put it another way: With M, tables are always either streamed or query folded then streamed, never just query folded without then being streamed.

Performance

This leads to what may be an obvious performance tip: Try to put all foldable operations before any non-foldable operations so that the maximum amount of processing can be offloaded to the data sources.
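For instance, Table.Buffer is never folded, so any steps downstream of it run locally. In the hypothetical sketch below (server, database and table names are made up), the filter placed before the buffer can fold to the source as a WHERE clause, while the sort after the buffer cannot fold:

let
  // Hypothetical SQL source; names are placeholders
  Source = Sql.Database("myserver", "SalesDb"),
  Customers = Source{[Schema = "dbo", Item = "Customers"]}[Data],
  // Foldable: can be sent to the database as a WHERE clause
  Filtered = Table.SelectRows(Customers, each [Office] = "Chicago"),
  // Table.Buffer is not folded; steps after this point execute locally
  Buffered = Table.Buffer(Filtered),
  // This sort happens locally, against the buffered rows
  Sorted = Table.Sort(Buffered, {{"TotalSales", Order.Descending}})
in
  Sorted

Had the filter been placed after Table.Buffer, the source would have been asked for the entire Customers table, with the filtering done locally instead.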

Which steps are query folded, as well as how they are folded, can change as the mashup engine improves, as the quantity of data involved changes, as library/data source functions are revised and as security settings are changed (more on the latter shortly). So, there may be times when you find it advantageous to re-try performance tuning even though you haven’t made any code changes.

Tables Are Not Immutable

A variable that appears to hold a table (or list) actually just holds a handle to the expression that produces the table (or list). When accessed, that handle executes logic which produces the requested data. While the handle is immutable throughout the lifetime of the query’s execution, the data returned when it is invoked is not. This is because that data is produced on demand each time the handle is invoked. The fact that the data returned is not immutable can result in values seeming to change during the execution of an M query.

The below expression returns a pair of tables. One holds all customers associated with the Chicago office; the other contains the three customers with the largest total sales amounts. Both tables are ultimately pulled from step Source. 

let
  Source = SomeDataSourceReturningATable,
  ChicagoOffice = Table.SelectRows(Source, each [Office] = "Chicago"),
  Top3Sales = Table.FirstN(Table.Sort(Source, {{ "TotalSales", Order.Descending }}), 3),
  Result = { ChicagoOffice, Top3Sales }
in
  Result

Let’s pretend you ran the above and are looking at the rows returned in the first table (ChicagoOffice). In them, you find customer ABC:

CustomerID = 123, Customer = ‘ABC’, Office = ‘Chicago’, TotalSales = 50255

Looking at the rows in the second table (Top3Sales), you also find customer ABC (apparently, it’s assigned to the Chicago office and is one of your top customers):

CustomerID = 123, Customer = ‘ABC’, Office = ‘Chicago’, TotalSales = 62199

Wait a minute! What in the world?! The same customer has a row in each table but the data in those rows is different between the two tables. (The value of TotalSales is different between the two.) How could this happen?

In light of the fact that table and list variables really just hold handles to the expression that produces the table or list, this behavior makes sense. Both ChicagoOffice and Top3Sales were invoked to produce results. When invoked, each expression chain called back to the ultimate data source to get data. Total sales for ABC must have changed between those two calls to the data source (perhaps another sale was processed during the intervening moment of time).

To recap: Variables that seem to “hold” (e.g. produce or output) a table or list really just hold an immutable reference to an expression that, when invoked, produces the desired output—the expression is immutable, but the data returned when it is invoked is not. So, when a particular query pulls from the same source multiple times during execution, there is the possibility that the data pulled could change between accesses.

If this possibility isn’t acceptable, there are two options: rework the expression to eliminate the multiple invocations (which may or may not be possible) or manually cache (buffer) the output.

Please don’t go away thinking that Power Query is flawed because it allows this variability. This potential for variability is a necessary side effect of M not always saving all rows in memory (and it’s a good thing it doesn’t do that!). Instead, M puts you in control: if there’s a point where a data set needs to be cached to provide stability, you have the power to do that. By putting you in charge, you control when the associated resource and performance costs are paid.

Buffering

When you need to stabilize a table (or list), how do you do it? By buffering.

BufferedTable = Table.Buffer(SomeExpressionProducingATable)
BufferedList = List.Buffer(SomeExpressionProducingAList)

The first time Table.Buffer is invoked, a buffer is loaded by reading all values from the source and saving them in memory. The data in memory is then used to service any accesses to that buffer that occur during the query’s execution (including when multiple method chains in the query reference the same buffer). However, buffers are not shared across separate executions of the same query.

Adapting our Chicago office + top sales example:

let
  Source = SomeDataSourceReturningATable,
  BufferedSource = Table.Buffer(Source),
  ChicagoOffice = Table.SelectRows(BufferedSource, each [Office] = "Chicago"),
  Top3Sales = Table.FirstN(Table.Sort(BufferedSource, {{ "TotalSales", Order.Descending }}), 3),
  Result = { ChicagoOffice, Top3Sales }
in
  Result

Above, the first time step BufferedSource is accessed, Table.Buffer will pull all rows from Source and store them to memory. This stable snapshot of data will then be used to service both steps ChicagoOffice and Top3Sales. Both steps will be offered the exact same rows, with the exact same row values, in the exact same order. Then, when the query finishes executing, the buffer is discarded. If the query is later executed again, a new buffer will be populated from Source.

Of course, buffering all of Source could pose a resource problem, depending on how much data it contains. Again, the key is that you are in control: you decide when to use buffering, when the benefits it brings are worth the associated cost.

When you buffer, be conscious of how much data you’re buffering. Minimize this quantity where possible. Above, if Source is expected to return more than a small number of rows, see if there is any way to apply filtering before buffering. For example, if you know that all large customers have at least $50,000 in total sales, you could limit buffering to customers with sales above this amount or a Chicago office assignment by changing BufferedSource to:

BufferedSource = Table.Buffer(Table.SelectRows(Source, each [Office] = "Chicago" or [TotalSales] >= 50000))

A small change like this could significantly reduce the quantity buffered, making a vast difference on resource usage.

Performance Thoughts

Now that you hopefully have a solid understanding of streaming, query folding and buffering, let’s test that knowledge by pondering some performance tips. Do the tips below make sense (like why/how each impacts performance)? Does the order they’re in make sense (like why would it be unlikely that you would want to do step 3 before step 1)?

The following are not absolute rules but rather a suggested starting place for addressing performance issues. Please don’t blindly follow these but rather evaluate them in conjunction with an understanding of your context and how Power Query processes tables.

  1. Order steps that are query folded first. This offloads as much processing as possible to the external data sources. To determine which steps are being folded, in many cases, you’ll need to use a trace tool or check logs to see the native requests generated, as the UI doesn’t always reveal whether/how steps are being folded.
  2. Next, for those steps that cannot be folded, try for the following order: first filters, then operations that do not internally hold rows in memory, followed by any remaining operations. This way, you discard unneeded rows before performing additional processing or in-memory storage.
  3. When buffering is necessary, buffer late in the chain of expressions. Why pay the cost for buffering before that data really needs to be buffered?
  4. If the above steps don’t produce the desired performance or cost too much in terms of caching and/or buffering resources (e.g. you run out of memory!), hand-crafting a native query which you then have Power Query execute is an alternative. Power Query’s query folding supports a limited set of possibilities; you may be able to hand-code a native request that incorporates logic that folding doesn’t know how to produce and so achieve performance or eliminate buffering in a way that isn’t possible using automatic query folding.
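A hand-crafted native request might look like the sketch below (server, database and query text are hypothetical). Value.NativeQuery accepts an optional record of parameters, which is safer than splicing values directly into the query text:

let
  // Hypothetical SQL source; names are placeholders
  Source = Sql.Database("myserver", "SalesDb"),
  Result = Value.NativeQuery(
    Source,
    "SELECT TOP 3 * FROM dbo.Customers WHERE Office = @office ORDER BY TotalSales DESC",
    [office = "Chicago"]
  )
in
  Result

Here, the entire filter + sort + top-3 computation runs on the database, regardless of whether automatic folding could have produced an equivalent request.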

Since data, library/data source function, security and environment changes can also affect performance, there may be times when you find it advantageous to re-try performance tuning even though you haven’t made any code changes.

Next Time

This post, we made good progress exploring how M “thinks” about tables at the language and library levels. Next time, we’ll continue by discussing table keys. Along with that, the plan is to explore two environmental factors, native query caching and the data protection layer/the firewall, that impact how M processes tables. We’ll also talk about why native queries may be executed more times than you’d expect.

Until then, have fun mashing up data!

2 thoughts on “Power Query M Primer (Part 12): Tables—Table Think I”

  1. Ivan

    You say this in the last section: “To determine which steps are being folded, in many cases, you’ll need to use a trace tool or check logs to see the native requests generated”.

    Can you explain what trace tools to use and what logs to check?

    1. Ben Gribaudo Post author

      If you’re using Power BI, you could try the new Query Diagnostics feature. Otherwise, try looking in the source system’s logs/trace tools. For example, if the source is Microsoft SQL Server, run a SQL Profiler trace or Extended Events session, then run your Power Query, then check what SQL queries were captured by the Profiler trace/Extended Events session. If the source is a web service, you could look in the web service’s logs or use a tool like Fiddler, Wireshark or Postman to intercept the calls between Power Query and the web service.

