Power Query M Primer (Part 24): Query Folding II

, , , , ,

Last time, we began a deep dive into the inner workings of query folding. We examined how you can implement foldability using Table.View, ending with a firm grasp on why answering the question “what functions fold?” isn’t simple, but rather depends on the Power Query version, the data connector and possibly even some combination of the particular operation’s parameters and the data set being accessed.

But this isn’t the only “moral of the lesson” to be gleaned from our query folding deep dive….

As part of processing an expression, do you think Power Query communicates just once with each external source? For that matter, does Power Query process your expression verbatim and exactly one time? On query folding: Is it guaranteed to be transparent, producing identical results regardless of whether an expression is processed locally by Power Query, partly folded to source or fully folded to source?

These questions, and their answers, will lead us to more morals to be learned from this continuation of our lesson on query folding!

Series Index

Chatting Happily Away

When you think of fetching data from source, what communications do you imagine take place between Power Query and that source?

Maybe something like: First, open the connection. Next, authenticate. Then, send a single “query database” request (or, in the case of a paged web API, a single sequence of page fetch requests). In response, each result row is sent from the source to Power Query exactly one time.  And that’s it. All done.

Sounds reasonable, and may be exactly what would happen if you manually queried the source using a tool like Azure Data Studio or SSMS. It may seem logical to assume Power Query works similarly, but this is not usually the case. Instead, Power Query can be chatty. That is, it may interact with the external source more than you might, at first glance, expect—a behavior which has ramifications you should be aware of.

Think back, if you would, to last time where we learned about Table.View handlers.

For starts, remember that a table view provides table type information (via handler GetType) separately from actual table row data (via handler GetRows). When Power Query retrieves data from a table view, it needs both the table’s type and its data, so invokes both handlers.

When no folding is in play, fetching data using a table view looks like:

  1. Create an instance of the view (that is, either code written by you, or code inside a connector written by someone else, invokes function Table.View).
  2. On this view, invoke handler GetType.
  3. On this view, invoke handler GetRows.

In most cases, this sequence will result in at least two communications with the data source:

  • The first includes whatever information is necessary to compute the type that’s output by GetType. These details could be fetched and internally saved by the connector when it is initialized or fetched on demand by the view’s GetType when it is invoked.

    Depending on the sophistication of the table view and the capabilities of the data source, what’s fetched as part of computing the type can range from just metadata about the expected result set to an entire result set whose type is read and its rows ignored. (Potentially fetching all rows may seem strange when just a type is needed, but remember that not every data source separately exposes metadata about the data it returns. Sometimes, the only way to determine the type of what will be returned is to actually have the source return that data.)
  • The second communication with the server is the actual result set retrieval, which occurs when GetRows is invoked.

Again, reading data from a table view almost always involves at least two communications with the external source. This potentially grows when operations are query folded.

  1. GetType
  2. OnTake
  3. GetType
  4. OnSkip
  5. GetType
  6. GetRows

Communicating More and More

Remember that Power Query configures folding operations by calling On* handlers? Turns out, after calling each On* handler, Power Query has a tendency to also ask for a revised type. The resulting chain of handler invocations often looks like:

To generalize the pattern: Between the initial GetType invocation and the concluding GetRows call (highlighted in yellow; both of which we discussed a moment ago), are a series of On* handler + GetType handler invocation pairs, one pair per operation that’s folded (in this example, the pair for OnTake is highlighted in green; the pair for OnSkip, in purple).

Each of these handlers technically can call to source. Whether they do depends on how they were built, which likely was influenced by the capabilities of the data source.

We could imagine that a row offset or quantity limiting handler (OnSkip or OnTake, respectively) probably won’t call to source. (Technically, there’s no law forbidding a view’s developer from including logic in these handlers that contacts source, but there’s usually no need for this to occur.) However, GetType may contact source if it dynamically computes the type it returns on demand (i.e. instead of by consulting cached information or simply returning a hard-coded value).

In the case of a more complex operation, like a join, again, what happens depends. The view’s OnJoin handler may be able to decide whether to accept the join fold request based on what it already knows (e.g. by consulting information that’s already been cached). Or, when OnJoin is invoked, it may pull (meta)data from the source as part of determining whether to handle the join.

Even more communications could be involved. Imagine a join handled by a table view to a table from a different data source. If the other table contains a moderate number of rows, it’s possible that the view may want to fold join key column values from that other table into the query the view sends to its source, then join these prefiltered results to the other source’s data. If, instead, the other source contains more than a certain quantity of rows, the view may decline to handle the join, leaving it for Power Query to process internally. In this latter case, the view will make a probe fetch to the other source to determine the quantity of rows it contains, then decline the folding request; then, slightly later, Power Query will fetch rows from the other source as part of processing the join. The view’s “possibly prefilter” behavior resulted in an additional communication with a source (the probe fetch) being added into the mix.

All this said, the point here is not the exact specifics of how a particular table view, handler or connector works, nor is it the exact pattern of how Power Query invokes handlers. Rather, it’s to illustrate the fact that table views (and related code) may interact multiple times with sources—which may be more than you’d expect.

Derivative Evaluations

Not only can a single expression evaluation lead to a table view interacting multiple times with a source, but Power Query may evaluate your expression, or variations thereof, multiple times. This can multiply the number of view-to-source communications!

For example, Power Query’s data protection firewall may need information about your expression to help it figure out how to appropriately implement privacy level protections. To obtain the needed details, it may trigger a preliminary run of your expression—but with it wrapped in a Table.FirstN(YourQuery, 1000). The firewall then uses what it receives back to help it revise your original query, wiring in the appropriate privacy gatekeeping code. Moments later, this rewritten query is run to produce the data that Power Query returns as its response to your original “evaluate query” request. You expected your query to be run once, but in actually, your original expression wasn’t run once or verbatim. Instead, multiple variations of it were run! (This isn’t to say that your original expression is never run exactly as you wrote it, but often this isn’t the case.)

Each derivative may result in additional communications with sources. These interactions may not be identical across derivatives, so may be query folded differently and/or result in different quantities of data being streamed. For example, the expression evaluated for the firewall-triggered discovery fetch mentioned a moment ago contains a Table.FirstN, which may result in a “TOP 1000” being added to that expression’s native query. In contrast, the derivative expression that’s run to output actual data won’t have this Table.FirstN added, so its native query won’t reflect it.

Then there’s query development: If you’re working inside Query Editor under Power Query’s default settings, each query step you define may get its own evaluation as part of generating its preview. In Power Query Online, each step may get its own evaluation as part of computing its step folding indicator icon. Again, almost certainly these introduce more ‘to source’ communications.

And this isn’t all: Turns out, for some data sources (like OData, and sometimes ODBC), Power Query can’t always figure out their exact folding capabilities ahead of time. To compensate for this, Power Query starts by folding all the operations it thinks could fold then asks the source to execute the resulting native query. If that fails, Power Query may try again, but with less folding—perhaps even trying multiple variations of less folding—until either it finds a less folded version that works or decides to fall back to a simple, “no folding” native request to source. When present, this dynamic query plan behavior introduces even more communications to source.

As these examples show, producing a single table may result in numerous communications taking place between Power Query and sources. Hopefully by now, the erroneous preconception that, after authentication, Power Query always makes exactly one single call to source (or, in the case of a paged API, a single sequence of calls to source) has been shattered!

For practical purposes, the exact sequence of communications that take place between Power Query and a source should be considered an internal detail, not something to take dependencies on. You might be tempted to thoroughly research a specific scenario to figure out exactly what its sequence of Power Query-to-source interactions looks like. While this may be interesting, don’t assume that what you see is guaranteed to be what will take place next time. What you observed could change without notice for a variety of reasons (e.g. a Power Query update, a data connector update, your migrating the expression to a different Power Query environment, changes in metadata on the source, etc.), so your observations shouldn’t be taken as predictions of what the same behavior will look like next time.

More Morals….

Power Query poly-fetches. Asking Power Query to evaluate an expression may result in multiple communications with external sources, including potentially multiple runs of the actual “produce data” native query—with the exact quantity of these communications being, for practical purposes, not dependably predictable.

In some cases, it may take getting used to to align with this paradigm. Usually, it works reasonably well; sometimes, it bites. Practically, it lends to a number of applications, including….

Moral of the Lesson #2

Do not use Power Query to make data modifications. No INSERTS, UPDATES, DETELEs. No HTTP PUTs or POSTs that change things.

Power Query is generally not suitable for data modification activities. Yes, technically, you have the ability to manually type in arbitrary native queries which Power Query will dutifully execute. However, Power Query’s poly-fetch behavior means that your arbitrary native query/request could be executed multiple times. If that native query is something like “INSERT …” or “UPDATE …“, or is an API call that makes changes, those modifications could end up being performed multiple times. Normally, making data modifications relies on the assumption that the given command is executed exactly once. If, instead, the command is run an unpredictable one or more times, negative ramifications may result (like more data being inserted than expected).

Please don’t try to shoehorn Power Query into being a tool for making data changes on external systems. It’s not meant to serve in capacities where guaranteed, precise control over how many times a native query is executed is important. Trying to make it do so may bite.

(Little known “secret”: Microsoft created an internal, undocumented type, action, along with related functionality, to enable Power Query to perform data modifications, presumably in a way that avoids poly-fetch behavior. The fact that Microsoft felt it needful to add special functionality to handle making data modifications should be a clue to the rest of us: If using public Power Query to make data changes was deemed unsuitable by Microsoft, then our trying to do the same is probably quite unwise.)

Moral of the Lesson #3

Ideally, the data connectors and table views you use will be able to produce type information without fetching actual result sets. A well-built data connector will do this if it is programmatically practical to do so. If it cannot or does not, you may want to consider providing the appropriate type yourself.

Perhaps you’ve specified an arbitrary native query whose complexity exceeds the connector’s ability to successfully perform a “metadata only” discovery. Maybe you’re using a basic web API that doesn’t separately expose metadata information. In cases like these, if you know the type to be returned, you could wrap whatever table-producing function you’re using with a table view that has a GetType handler hand-coded to return the appropriate type.

For example, say you’re doing Value.NativeQuery(SomeSource, SomeQuery) in a case where the connector can’t use a metadata fetch to compute the type for the response, but you know what that type should be. You can provide this information by nesting the Value.NativeQuery call inside a Table.View like this:

Table.View(
  null,
  [
    GetType = () => (your hand-coded type goes here)
    GetRows = () => Value.NativeQuery(SomeSource, SomeQuery)
  ]
)

Of course, context is key. If your query runs almost instantaneously without this change, then you probably shouldn’t sweat over making it. On the other hand, if you know from checking logs that a single query evaluation request from you triggers multiple long-running calls to source, this may be a way to eliminate one of them and so notably speed up processing.

Moral of the Lesson #4

As we learned, your request that Power Query evaluate an expression may lead to multiple adaptations of that expression being run. If your original expression is fully foldable, the chances that these variations also fully fold is greatly increased.

For example, if some Power Query component decides to do a Table.FirstN(YourQuery, 1000) fetch, and YourQuery fully folds, then there’s a good chance that the Table.FirstN-wrapped version of the expression will also fully fold.

On the other hand, say that your original expression doesn’t fully fold and a derivative probe expression with a Table.FirstN limit doesn’t fold either. What happens in this case?

To produce rows for the probe fetch, Power Query may start a full, “all rows” read from source, then abort that after streaming 1,000 rows. This may not be as efficient as when the “TOP 1000” is actually folded back to source. However, so long as the source limits its efforts to processing just those rows that are read from it (or, at least, doesn’t go much beyond that), the performance difference between an aborted read and full folding likely won’t be all that much. If so, special effort solely to make probe/analysis expressions fully fold likely isn’t justified.

The situation may be much different if the source doesn’t limit its processing to just rows that are read. Imagine a stored procedure (or multi-step SQL statement) that starts by loading lots of rows into a temporary table, which it then processes before outputting any data. This preliminary loading and processing must be completed in its entirety before the proc outputs a single row—it takes place regardless of how many rows are read by Power Query; it’s not short-circuited if Power Query stops streaming after row 1,000. In a case like this, effort to make probe expressions fold more efficiently may pay significant dividends. One option here is to use a Table.View which translates folding requests into stored procedure parameters which the proc uses, when appropriate, to adjust the processing it performs (e.g. invoking view handler OnTake could result in a @Max = limitSize parameter being passed to the proc, which it interprets as a signal to constrain its preliminary loading and processing).

Persistent Query Cache

As we’ve learned, asking Power Query to process a single expression can result in numerous communications with external data sources. Some of the resulting native requests may be exact duplicates. Sending the same request to source multiple times is inefficient. Thankfully, Power Query has a persistent query cache which can help deduplicate these communications, speeding up runtime and saving source processing efforts.

We’ve talked in more detail before about this cache, but for a quick recap:

Power Query can watch outbound native requests and cache the responses returned. When it sees a native request that is identical to one it has seen before (same query, same source, etc.), it may intercept the request. Instead of allowing it to be sent to the external source, it may answer it by returning cached results.

If and when the persistent query cache comes into play depends on multiple factors and is complex to predict (factors include data set size and timing of when the multiple identical requests complete). A switch from the cache servicing a particular request to not servicing it, or vice versa, can be one reason why a data load’s performance changes unexpectedly. For example, a particular result set could have grown to the point where it exceeds the cache’s size limit so is no longer being cached, resulting in requests that previously were answered by the cache taking longer to complete because they are now each being sent all the way to source.

Keep in mind that this cache only applies at the native query/request level. The persistent query cache does not cache the results produced by your M expressions—it only helps with results directly returned by external sources (e.g. the results sent back from the database in response to SELECT * FROM SomeTable).

Normally, the cache’s lifetime is limited to a single refresh operation (e.g. a single “Refresh All” in Microsoft Power BI Desktop) or shorter. However, query development tooling may use a longer lifespan cache. In Query Editor, this is why you sometimes see those “This preview may be older than X days” warnings: Query Editor is informing you about its cache’s extended age and offering you an option to trigger a refresh.

For more, hop back to Table Think II (part 13).

Power Query v. Folded to Source: Not Always the Same

Query folding should be transparent, as far as results go. That is, the exact same results should be produced if an M expression’s logic is processed internally by Power Query, partly folded to source or fully folded to source. Processing efficiency may change, but the results output should not change because of query folding.

At least, that’s the theory. But is it the guaranteed reality?

When a table view is used to implement query folding:

  • The view translates the operations it agreed to fold into an appropriate native query/request.
  • The source executes that request, returning the appropriate response.

In order for query folding to be transparent, the combined effects of the above two steps must produce exactly the same results as Power Query would produce if no folding took place.

This does not always occur.

Imagine a simple Microsoft SQL database table containing rows for “Joe” and “joe”. You filter this table using the following expression:

Table.SelectRows(Source, each [FirstName] = "Joe")

If the above is processed locally by Power Query (i.e. not query folded), just the row for “Joe” will be returned because Power Query’s equality operator (=) performs case-sensitive string comparisons.

Result set showing single "Joe" row

Instead, suppose query folding takes place. The data connector will translate the above M logic into a SQL fragment along the lines of the below, which it incorporates into the larger SQL query it sends to the database server.

WHERE FirstName = 'Joe'

So far, so good. This SQL fragment appears to align with the original M expression’s logic.

When the full SQL statement is executed by the database server, what’s returned?

It depends, but likely:

Result set showing "joe" and "Joe"

This is different from what Power Query produced. Why? In a nutshell, what equals meant to the database server was different from what it meant to Power Query.

When Table.SelectRows‘s filter was folded to the database server, its logic switched from being processed using Power Query’s definition of equality to the server’s definition of equality—and those two definitions were not identical. In this case, as is common in the SQL Server world, the database table’s column was configured to have a case-insensitive collation, so FirstName = 'Joe' matched both “Joe” and “joe”.

Due to paradigm differences between Power Query and an external data source, unfortunately, query folding is not always transparent.

The preceding example focused on string comparisons, but this issue has the potential to affect any folded operation. If, between the data connector and the source, an exact match to Power Query’s behaviors is not achieved, the results output when query folding is in play may deviate from what Power Query would output if it evaluated the same logic itself.

Sometimes, folding non-transparency, when it occurs, is glaringly obvious; other times, it may slip by undetected for extended periods of time while still skewing results. You make a seemingly unrelated edit which impacts the foldability of another part of the expression—a part that unbeknown to you doesn’t fold transparently. The resulting variations go unnoticed for some time because they were unexpected and affect only a small percentage of rows. For that matter, maybe you didn’t make any M code edits at all: Instead, a database change or a Power Query upgrade impacted foldability and so led to a results variation—again without you readily noticing.

Please don’t let the lack of guaranteed transparency scare you away from query folding. The Power Query-powered portion of the data world has continued to thrive in spite of the fact that sometimes, in some cases, query folding isn’t transparent. Understanding that it may not always be transparent (and why it may not be transparent) can save countless hours of frustrating troubleshooting. At the same time, thankfully, this issue doesn’t seem to bite nearly as frequently or badly as its bark may make it sound—which is good, because query folding is a key component to achieving optimal performance in the Power Query world.

[Further Reading: Equals Is Not Always Equivalent: When Query Folding Does Not Produce Identical Results & Equals Is Not Always Equivalent: Power Query Joins vs. SQL Joins]

Next Time

Quite a number of functions (principally the standard library and data connectors) come from code you didn’t write. How do they get into the global environment? The answer, at least in part, just might have some relation to extensions. Sounds like a good topic area to explore next time!

Until then, happy M coding!

6 thoughts on “Power Query M Primer (Part 24): Query Folding II

  1. Alex Groberman

    Thanks again for another great post Ben! Can you cover recursion (and perhaps List.Generate as an alternative) in more detail?

    For example, let a = {1} & @a in a returns a circular reference error, but let a = List.Generate(() => 1, each true, each _) in a works just fine. Why?

    Thanks!
    -Alex

    Reply
    1. Ben Gribaudo Post author

      Thanks for the suggestion, Alex! I will keep it in mind.

      On the question: For an authoritative answer on why using recursive references to build lists isn’t supported, you’d need to ask Microsoft. Considering how M is typically used, it could be that there wasn’t enough of a use case for this style of list building to justify supporting it. In contrast, your second example doesn’t use a recursive reference, so isn’t blocked by M’s anti-recursion guard.

      Reply
  2. Lutz

    Just hypothetically (cough) if I make doubly sure that Query folding is disabled and _then_ I (hypothetically) issue a write back command against the source – would that (hypothetically) guarantee that this query is only executed once?

    Reply
    1. Ben Gribaudo Post author

      No. 🙂 The non-folded command still may be executed multiple times (e.g. for a firewall analysis fetch, to get its type, to actually perform the produce data run of the command, etc.).

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *