Power Query M Primer (Part 13): Tables—Table Think II


Last time, we began exploring how Power Query “thinks” about tables, delving more deeply into streaming and query folding. This time, we’ll continue building our understanding of how tables are processed by learning about keys, native query result caching and the data protection layer (firewall). We’ll also explore why native queries may be executed more times than you might expect.

The goal of these two posts is to equip you with a better understanding of the context in which your mashups are executed—knowledge you can use to author more efficient M queries, avoid unexpected data changes during processing and keep the data protection layer (firewall) happy.

Let’s get going!

Series Index

Keys

Keys to the kingdom…well, maybe more like a possible key to better performance.

Out of the box, a Power Query function that processes tables uses an algorithm capable of working with any combination of valid values streamed to it in any order. However, if information about table keys is provided, an operation may be able to internally optimize itself to be more efficient.

For example, take a join operation between tables A and B on column ID. In table A, the operation encounters a row with an ID of 1, so it looks in table B for rows with that ID. It finds one, joins it to the row from A and returns the joined pair. Then, it resumes searching table B, looking for additional ID 1 rows to join to. In contrast, if the operation knows that table B’s ID column contains unique values, it doesn’t need to search for additional ID 1 rows in that table because the column’s uniqueness guarantees that there aren’t any more.

“What exactly is a key?” you might ask. For our purposes, we’ll use the following definition: A key is a column or set of columns whose values identify rows. A unique key is a key whose values uniquely identify a row in the current table. Out of a table’s unique keys, one may be identified as the primary key, which indicates that it is the main identifier being used.

For example, imagine a table holding company information. Among its columns there’s one named CompanyID. Almost certainly, just based on the name, we can guess that CompanyID is a key column—that the column’s values are used to identify company rows in the table. If each CompanyID value identifies exactly one row in the table, CompanyID is a unique key and, based on its name, is likely the table’s primary key (primary unique identifier).

The Power Query language specification defines the ability to annotate a table with key information but—with one exception (primary keys)—does not specify the kinds of keys that can be identified or what significance operations should give to the keys that are identified. Generally, it seems like an operation can infer what it needs about non-unique and foreign keys (keys pointing to rows in other tables) based on what the operation is being asked to do, without those keys being explicitly tagged. Based on this, it would seem like unique keys, such as primary keys, are what we should focus on. However, it may be worth tagging other keys when they’re relevant, just in case the operation uses that information.

Keys can be viewed, defined and replaced using library functions on tables (Table.Keys, Table.AddKey, Table.ReplaceKeys) and on table types (Type.TableKeys, Type.AddTableKey, Type.ReplaceTableKeys).

Here’s an example of defining a key on a table then viewing details about the table’s keys:

let
  Source = #table(
    {"CompanyID", "Name", "Location"}, 
      {
        {1, "ABC Company", "Chicago"}, 
        {2, "ABC Company", "Charlotte"}, 
        {3, "Some Other Company", "Cincinnati"}
      }
    ),
  KeysTagged = Table.AddKey(Source, {"CompanyID"}, true)
in
  Table.Keys(KeysTagged) // returns a list containing the following record: [Columns = {"CompanyID"}, Primary = true]

Data connectors may automatically attach key information to the tables they output. For example, a database data connector might use the database’s indexes and constraints to determine which columns are keys and then tag the outputted table accordingly.

Functions may adjust key information on the tables they return. For example, if a table doesn’t have any keys identified, Table.Distinct tags a primary key on the table it outputs (defined as all columns in the table). This makes sense: After applying Distinct, each row’s set of values is, well, distinct and so uniquely identifies the row.
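A quick way to see this behavior (the output shown in the comment below is an assumption based on the behavior just described, and may vary by library version):

```
let
  Source = #table(
    {"Name", "Location"},
    {
      {"ABC Company", "Chicago"},
      {"ABC Company", "Chicago"}, // duplicate row
      {"Some Other Company", "Cincinnati"}
    }
  ),
  Deduped = Table.Distinct(Source)
in
  Table.Keys(Deduped) // may return a list containing: [Columns = {"Name", "Location"}, Primary = true]
```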

As was mentioned earlier, while the Power Query language provides support for identifying table keys, it’s up to each operation whether to use this information and, if so, how to use it. Unfortunately, there doesn’t seem to be any official operation-specific documentation describing when or how key information is used, so we have to fall back to experimentation to figure out when it’s profitable to identify keys.

To performance tune an operation by providing key information, you might try something like the following:

  • Determine the column(s) used by the function to identify rows. If the operation works with multiple tables, do this for each input table. For example, if the operation is a join, which column from each input table is used to match rows between the two tables?
  • Next, for each input table, check whether the values in the identified column(s) uniquely reference rows in that table. If so, check whether the table has a primary key defined for those column(s). If not, tag those column(s) as the primary key. For good measure, if the identified column(s) do not uniquely identify rows, tag them as a non-primary key, just in case the operation at hand uses that information.
  • After adding keys, rerun the query and see if performance improved.

It only makes sense to try this tuning on operations that identify rows based on column values. Knowledge of keys won’t help a function call like Table.FirstN(Source, 5) know which five rows to return!
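To sketch what this tuning might look like, imagine joining an orders table to a companies table where CompanyID uniquely identifies companies. The table and column names below are hypothetical, and whether the join actually gets faster depends on the operation's internal implementation:

```
let
  Orders = #table({"OrderID", "CompanyID"}, {{100, 1}, {101, 2}, {102, 1}}),
  Companies = #table({"CompanyID", "Name"}, {{1, "ABC Company"}, {2, "Some Other Company"}}),
  // CompanyID uniquely identifies rows in Companies, so tag it as the primary key
  CompaniesKeyed = Table.AddKey(Companies, {"CompanyID"}, true),
  // with the key in place, the join may stop searching for additional matches
  // after finding the first matching Companies row for a given CompanyID
  Joined = Table.NestedJoin(Orders, {"CompanyID"}, CompaniesKeyed, {"CompanyID"}, "Company")
in
  Joined
```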

What library functions do with table key information is an internal implementation detail that can change as the library is updated, so at times you may find it advantageous to re-try this performance tuning even though you haven’t changed your code.

Host Environment

Mashups are executed in the context of an environment—a host program like Microsoft Power BI or Microsoft Excel. This host environment can incorporate functionality that’s specially built to (hopefully!) improve performance or that can have a side effect of reducing performance. While not strictly a part of either the Power Query language or the standard library, these environmental factors are important to understand in order to most effectively work with tables.

Persistent Caching of Native Query Results

Environments can—but do not always—store the results from executing a native query (or native request) in a persistent cache that’s written to disk.

Let’s say you’re refreshing a Power BI data set. During the load, two mashups ask the same data connector to execute the same native query against the same server (e.g. run SELECT * FROM Customers on Server123). Your environment may choose to cache the results from the native query’s first execution, then re-stream those results in response to the second request instead of asking the external system to execute the same native query again.

Again, this caching only applies to native queries (e.g. a database query, a web service call). This caching does not save results produced by Power Query expressions, steps or queries—just data returned in response to native queries.

May cache is the key phrase. Nothing in the language specification mandates that an environment provide this caching. It’s up to the host if, when and how to cache. When caching is done, it may not come into play for all sources (for example, File.Contents requests might not be cached, as reading from disk then caching the same data back to disk doesn’t always make sense). Even for supported data sources, caching may not always be used (such as when the same native query is executed multiple times before any instance of that native query has completed).

Persistent caches, by definition, are intended to be shared. How broadly they are shared is also environment specific. Sharing possibilities include just within a single mashup query during a single execution, across all mashups during a single refresh and across multiple refreshes. Currently, Microsoft Power BI and Microsoft Excel may share these caches between all M queries during the same refresh, while older versions of Excel may only share them within a single M query during a single refresh. In contrast, both Power BI’s and Excel’s query editors can preserve their caches across refreshes (in the query editor, ever notice a warning about the preview being x days old? That’s the persistent cache coming into play).

Technically, persistent caching may protect you from the data variability issue discussed in the previous post (see Tables Are Not Immutable in part 12). However, I’d strongly suggest not to rely on this caching to protect you—since you’re not guaranteed that it will always be in play. If data stability during processing is important, make sure your expressions are coded to provide stable data.
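For example, Table.Buffer takes an in-memory snapshot of a table, so later steps read from that stable copy instead of re-querying a possibly changing external source. (The server, database and table names below are placeholders.)

```
let
  Source = Sql.Database("Server123", "SomeDatabase"),
  Customers = Source{[Schema = "dbo", Item = "Customers"]}[Data],
  // snapshot the rows in memory; every step that follows sees this same, stable data
  Stable = Table.Buffer(Customers)
in
  Stable
```

Keep in mind that buffering blocks query folding from that point forward and holds the entire table in memory, so it trades performance for stability.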

You may not need to worry much about this caching kicking in. If it kicks in, it does so quietly and hopefully gives your mashups a performance boost—a bonus—without you having to do anything. If you’re tracing native requests and are puzzled by why you only see one request where you’d expect to see that request repeated several times, it may be that persistent caching eliminated the need for that native request to go all the way back to source multiple times.

On the other hand, if caching stops where you’re used to it being in play, you may notice a performance loss. If performance mysteriously slows down after your dataset grows past a certain point, it could be that some of the results from the native queries it executes have become too large for the persistent cache, resulting in repeat native query invocations all being sent back to the external source. If this occurs, you might consider increasing the size limit on the persistent cache (if your environment allows that to be configured).

From the security standpoint, persistent caching may result in data being left around even after the report that loaded that data has been deleted. The persistent cache isn’t stored inside the report’s file, so deleting the report doesn’t remove its data from the cache. To guard against this, you’ll need to manually clear your environment’s persistent cache after deleting reports.

To optimize the likelihood of persistent caching, you could disable parallel loading of tables (if your host environment allows this). At the cost of a longer refresh run time, this decreases the likelihood of the same native query being executed simultaneously and so increases the chance that repeated invocations of the same native query will be serviced out of the cache. While not something I’d recommend doing by default, disabling parallel loading is an option to consider if repetitive native queries are incurring significant performance costs.

[Further Reading/References: TechNet forum post 1, TechNet forum post 2, TechNet forum post 3 (all by a Power Query team member)]

Firewall

When first running a query that pulls from multiple data sources, you’re asked to set each source’s privacy level. “What are privacy levels?” you wonder, so you check the documentation, where you learn that these levels control the level of isolation between data sources.

This leaves you puzzled. Aren’t data sources intrinsically isolated? After all, when you use M to pull from multiple data sources, M’s doing the combining, so each source is isolated from all the rest, right?

But what about query folding? With query folding, data returned by one source may be written into the native query (or native request) sent to another source.

To borrow an example from the last post, when executing the below, data could be separately pulled from the two sources then filtered locally or—if query folding comes into play—the appropriate filtering data could be pulled from one source then written into the native query that’s sent to the other source.

let
  TableFromSourceA = GetFromSourceA(),
  TableFromSourceB = GetFromSourceB(),
  Result = Table.SelectRows(TableFromSourceA, each List.Contains(TableFromSourceB[ID], [ID]))
in
  Result

If query folding occurs, a native query similar to the following might be generated for source A (assuming that the source is a SQL database):

SELECT *
FROM SomeTable
WHERE ID IN (1, 2, 3); -- these three values were extracted from the results of executing TableFromSourceB

Above, query folding resulted in data being pulled from one source then pushed to the other. Sometimes this data disclosure is acceptable and the performance benefit is delightful. Other times, depending on factors such as the type of data (confidential healthcare records, trade secrets, etc.) and the trustworthiness of the data sources, quiet leaking of data across sources like this could be a major security issue and so must not be allowed regardless of performance impact.

Privacy levels are the mechanism for controlling the scope of data sharing that’s allowed across data sources during query folding. Privacy levels are not intended to keep you from combining between sources or to stop you from purposefully writing code that pulls data from one source then hands it to another—they exist solely to control query folding of data from one source into native queries sent to another source.

Privacy level public indicates that data from the source can be freely shared with other sources during query folding. Organizational level sources can only have their data exposed to other organizational level sources during folding. Data from private sources cannot be folded to any other source, even to other private sources. [References: Power BI, Excel]

Privacy levels have performance impacts. Allowing data to be shared across sources can bring performance advantages from query folding; blocking sharing can result in the cost of a larger than strictly necessary set of data being fetched and buffered locally before being combined.

Privacy levels also have coding impacts. Behind the scenes, when privacy levels are enabled, the Power Query interpreter divides your code into units called partitions (for now, just think “groupings of my code”). Then, it rewrites any code references that access data from other partitions to pass that data through Power Query’s firewall. This allows the firewall to act as a gatekeeper, controlling the flow of data between partitions. When cross-data source query folding needs to be blocked, the fact that cross-partition data flows through the firewall allows the firewall to buffer that data at the relevant partition boundary. Since buffering blocks query folding at the point where it occurs, this action keeps query folding from occurring across the partition boundary and so prevents leaking data between sources.

The way the data protection layer is currently designed, the following rule must be complied with to ensure that firewall logic can be inserted in the appropriate locations:

Either a partition may contain data sources with compatible privacy levels (i.e. where privacy levels allow folding between the sources) or the partition may reference other partitions—but not both.

If the first part of that rule is violated (meaning that, within a partition, there is more than one data source and the sources do not all have compatible privacy levels), an error along the lines of the following will be returned:

Formula.Firewall: Query 'ImportantData' (step 'Source') is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

On the other hand, if the “not both” part of the rule is violated—that is, if a partition contains a data source and references another partition—an error something like the below will be returned:

Formula.Firewall: Query 'ImportData' (step 'Source') references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.

In either case, don’t worry! You can combine the two sources—the data protection layer just needs your code to be reworked so that it can insert the gatekeeping firewall code in the appropriate places to ensure that the firewall can do its job.

To resolve this, code that combines between sources can’t be in the same partition as those sources, and incompatible sources can’t be together in the same partition. The key to adjusting code to comply with these requirements is understanding where partition boundaries are drawn.

Unfortunately, the rules that define how code is partitioned are complex. A shortcut solution: If you encounter one of these errors, place the code specific to each data source in a separate query (one query per data source) and then reference those queries from another query that combines between them. This will result in partition boundaries that are aligned in a way that works while saving you from wading through the complex specifics of partitioning.
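Sketched out, that shortcut looks something like the following (GetFromSourceA and GetFromSourceB stand in for whatever data access expressions are actually in use):

```
// Query: SourceAData (accesses source A and nothing else)
let
  TableFromSourceA = GetFromSourceA()
in
  TableFromSourceA

// Query: SourceBData (accesses source B and nothing else)
let
  TableFromSourceB = GetFromSourceB()
in
  TableFromSourceB

// Query: Combined (references other queries but accesses no data source directly)
let
  Result = Table.SelectRows(SourceAData, each List.Contains(SourceBData[ID], [ID]))
in
  Result
```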

[Further Reading/Reference: TechNet forum post (by a Power Query team member)]

Disabling the Data Protection Layer

There’s also the option to forget about privacy levels and partitions altogether by disabling the data protection layer.

Technically, if all data sources have a public privacy level, or if all have an organizational privacy level, their data can be freely combined, so the data protection layer isn’t doing any protecting. Disabling it gets it out of the way, avoiding the coding constraints it imposes and any performance overhead it incurs.

Is disabling the data protection layer a good idea? Only you can answer that question. I’d hesitate, though. What if a data source is added down the road where data sharing should be prevented? The person adding the new source correctly configures its privacy level but doesn’t realize that privacy levels are being ignored. Query folding ends up quietly leaking data. Ouch! Leaving privacy level enforcement enabled gives you a valuable future-proofing protection. I’d encourage you to not go around disabling it by default but instead only turn it off when you encounter a problem where disabling is the only reasonable solution and you thoroughly understand the impact (risk) of making that change.

Extra Native Query Invocations

If you trace the native queries (or native requests) sent from Power Query to external sources, you might be surprised. Where it looks like a mashup would execute a particular native query once, you might find that the native request, or variants of it, are invoked multiple times.

“Why?” has to do with internal implementation details of the host environment and/or the functions being used. Perhaps a promote headers operation pulled from source twice—once to get data to derive header names and a second time to stream the result set. An environment might want the schema describing an M query’s results before starting to pull data. Maybe the firewall requested a chunk of rows for analysis to help it decide how to partition code. For various reasons, a native query (or variants of that query) may be executed multiple times where you’d only expect it to be invoked once.

Sometimes the data connector can optimize the “extra” native query invocations so that they only pull a subset of data (e.g. pull a zero-row result set from the database when just schema information is needed). When the connector cannot optimize, these “extra” requests can get costly because the full native request may be executed multiple times in its entirety.

Depending on the connector and external source, native requests involving entities like stored procedures, temporary tables and basic (vs. OData) API calls may not be automatically optimizable (or only partly optimizable) due to limitations of the external source. If lack of automatic optimization is causing performance complications, you may be able to implement your own optimization logic using Table.View.
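As a rough sketch of that idea, Table.View can be used to answer schema requests directly, sparing a stored procedure from being executed just to discover the shape of its results. The server, database, procedure name and result type below are all made up for illustration:

```
let
  // the shape of the stored procedure's result set, declared up front
  ResultType = type table [CustomerID = Int64.Type, Name = text],
  View = Table.View(
    null,
    [
      // answer schema probes locally, without touching the external source
      GetType = () => ResultType,
      // execute the native request only when rows are actually needed
      GetRows = () =>
        Value.NativeQuery(
          Sql.Database("Server123", "SomeDatabase"),
          "EXEC dbo.GetCustomers"
        )
    ]
  )
in
  View
```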

The possibility of extra native query executions is one reason why it is strongly advisable not to use Power Query to execute native queries that modify data. You might think “On a scheduled basis, my report needs to pull some data but, before that can happen, a table needs to be updated. I know what I’ll do—I’ll schedule a report refresh that calls a stored procedure which first performs the update then returns the report’s data.” Not a good idea—your table may end up updated more times than expected because of extra native query invocations.

Next Time

Now that you (hopefully!) have a robust understanding of how M processes tables, next time let’s learn about Power Query’s control structure (no, that’s not a typo—there really is only one control structure!). We’ll also explore how to handle situations when you feel the need to loop or iterate.

Until then, have fun working with data!
