Power Query M Primer (Part 5): Paradigm

, , , ,

To improve our understanding of how Power Query works, let’s step back from the pieces and look at the paradigm—the methodology that M is built around. Conceptually, how does M work? How does it think? What makes it tick? The particulars, like variables, let expressions, functions and identifiers, are all important, but if we only focus on details, we might miss the big picture. We might not see the forest because we’re looking so intently at the trees. Let’s step back and learn about M’s modus operandi.

Without an awareness of M’s methodology, we might spend our days writing M code perplexed by certain behaviors, puzzled by why M seemingly doesn’t let us do certain things we might want to do and perspiring as we waste effort trying to mimic code patterns from other languages that are needless in M.

Series Index

What It Is

Power Query (a.k.a. M) is a query language. It’s intended for building queries that mash up data. Between “querying” and “mashing up,” M can be used to output data that has been cleaned up, filtered, transformed, aggregated and combined from multiple sources and formats. This is powerful!

However, M does not provide the ability to modify the data in data sources. It pulls data from sources but doesn’t write changes or insert new data into those sources. Also, it’s not a general-purpose programming or scripting language. M isn’t intended to do things like draw dialogs on the screen, copy files from one drive to another or print. M is for querying and mashing up data, not for solving all of the world’s computing problems! This tailored focus means that M can try to do one thing exceptionally well vs. trying to do so many different kinds of things that it only manages to do them so-so.

How It Works

On the technical side, in a nutshell, M is a functional language, sharing similarities with languages like F# and Haskell. Specifically, it’s mostly pure, higher-order and partially lazy. If you’re not sure what these terms mean, don’t worry. Just keep reading!

Order of Evaluation

M doesn’t necessarily execute expressions in the order they appear in scripts. M follows dependency ordering, not the physical order expressed in source code.

Think of how formulas in spreadsheets are evaluated. Below, in order to produce the requested output, first the formula in cell A3 must be evaluated. Once its value has been computed, A1 and A2 are processed. Which is executed first doesn’t matter, so the spreadsheet engine gets to choose whether to execute A1 before A2 or the other way around (or maybe it executes both in parallel).

Spreadsheet with formulas displayed

M’s interpreter works similarly. It looks at the expression that produces output and determines the ordering needed to make that output possible. If there are steps where that ordering isn’t dictated by dependencies, M gets to choose.

This implies that you can write M code in any order you’d like, even if that order is out of order—though  in order coding will probably be easier for other humans who read your code. Below, these two expressions are both valid and equivalent in their output even though their child expressions are given in opposite order. In both cases, M figures out the correct execution order to use when it processes them.

let
  Data = { 1, 2, 3},
  Result = List.Transform(Data, each _ * 10)
in
  Result
let
  Result = List.Transform(Data, each _ * 10),
  Data = { 1, 2, 3}
in
  Result

Partially Lazy

When M’s interpreter determines evaluation order, what happens if it encounters an expression that isn’t needed?

let
  Source = { "Bob", "Bill", "Sam" },
  Filtered = List.Select(Source, each Text.StartsWith(_, "B"))
in
  Source

Above, what does the interpreter do with Filtered, which doesn’t need to be evaluated to produce the output required by the in clause? The interpreter skips Filtered. Since it’s unnecessary, it’s not evaluated. But why would someone write needless code?

Applied Steps list showing first step selected

For one, how about testing? In the GUI query editor, you might select a step that comes before the last so that you can inspect its output. When you do this, let’s in clause is set to point to the selected step. However, the skipped over steps are left in the source code. M’s interpreter figures out that evaluating them isn’t required so skips them. This is good—otherwise, time might be wasted evaluating (potentially complex) extraneous expressions!

Another example:

(even as logical) =>
let
  Odd = { 1, 3, 5 },
  Even = { 2, 4, 6 }
in
  if even then Even else Odd

Each time the above method is invoked, either the value of Odd or Even is needed—but not both values. M evaluates only the variable that will be used by in to produce output. In contrast, if M were a procedural or object-oriented language, likely both variables (both Odd and Even) would be evaluated prior to the in clause being executed. Needless work would have been done! Thankfully, M doesn’t operate this way.

We’ve just seen that variable assignments in let expressions are evaluated lazily. This lazy evaluation behavior also holds true for lists, records and tables.

List.Count({ ComputeValue(), ComputeAnotherValue() })

To count the number of items in a list, the M interpreter doesn’t need to evaluate the contents of those items, so it doesn’t. Above, ComputeValue() and ComputeAnotherValue() are never called because of lazy evaluation.

In the case of the record below, Salary is not evaluated because its value isn’t used when producing the requested output.

let
  ComputeWeeklySalary = (wage) => ...,
  PayDetails =
  [
    Wage = 25.10,
    Salary = ComputeWeeklySalary(Wage)
  ]
in
   PayDetails[Wage]

Similar lazy behavior also holds true for tables. However, in M, only let expressions, lists, records and tables are lazy evaluated. Eager evaluation is used for everything else. This is why we say that M is only partly lazy.

For example, arguments to a function call are eagerly evaluated.

let
  Numbers = { 1, 2, 3 },
  Letters = { "A", "B", "C" },
  Choose = (chooseFirst as logical, first, second) => 
    if chooseFirst then first else second
in
  Choose(true, Numbers, Letters)

Since both Numbers and Letters are used as parameters when Choose is called and parameters aren’t lazy evaluated, the expressions for both variables are evaluated even though only one of the two will used by the function.

Higher-Order

Above, both lists are evaluated because they were passed as arguments to a function. What if we really wanted to avoid having both lists materialized only for one to be thrown away? Hmm…

Wait! Remember—we can pass functions around. Instead of passing Choose two lists of values, what if we adapt it so that we pass it functions that produce values when they are invoked? (I don’t meanuse functions in the parameter list to populate the function’s arguments,” I mean “pass the functions as arguments into the function itself.”) Choose can then choose to only execute one of the functions—the  one that produces the output it needs. The other function is not invoked so the list it generates will not be generated.

let
  Numbers = () => { 1, 2, 3 },
  Letters = () => { "A", "B", "C" },
  Choose = (chooseFirst as logical, first, second) => 
    if chooseFirst then first() else second()
in
  Choose(true, Numbers, Letters)

Technically, the M interpreter still evaluates both Numbers and Letters. However, since these names now identify functions, evaluating them simply takes their function expressions and turns them into function values (we talked about this in part 3). The two function values are passed into function Choose which then invokes only one of them. For short, we say that we passed two functions into Choose which chose which of the two to invoke (how’s that for a tongue twister!).

The ability to pass functions into other functions as arguments, as well the ability to return functions from functions (we also talked about this in part 3), makes M a higher-order language.

Streaming Semantics

Directly or indirectly, a M mashup is built by chaining methods. Call a function that fetches data from the source, pass its output to a function that applies a transformation, take the result of that function and pass it to another function that computes an aggregate and so forth.

However, the just-described way of passing data can be inefficient. Implemented literally, it requires that the full output of the first method be assembled before being passed to the next method in the chain. Then, that method processes its entire input, doing whatever it does until it has produced a full set of output. This is then passed to the next method, and so forth. Each step of the chain can involve iterating through the entire data set, once per step.

Why iterate through all the data once for each transform? Instead, why not iterate it once, feeding each item, as it’s encountered, through all the transform steps? Why not take the first item from the source, apply each transformation step to it in turn and then collect it as part of what eventually will be output by the expression as a whole, then repeat the process for the second item, and so forth?

For lists and tables, M supports something like this automatically, without you having to do anything! Streaming semantics allows M to pull items from lists and tables through the chain of transforms. You simply code up M like you always have—by building the chain of methods that define the desired mashup—and M’s interpreter, when appropriate, streams items through that chain instead of executing each transformation step in its entirety on the entire data set before executing the next step.

A moment ago, I mentioned pulling items through the chain of transforms. Why pull? Pulling allows the steps that needs data to control the amount of data that’s processed.

let
  Numbers = List.Generate(()=>1, each _ <= 100000000, each _ + 1),
  Filtered = List.Transform(Numbers, each _ * 2),
  Result = List.FirstN(Filtered, 3)
in
  Result

Above, only the first three items are needed to produce the requested output. Since data is pulled as needed by the methods that needs it, List.FirstN causes only the first three items to flow through the method chain. Since only three items are pulled, List.Generate only generates the first three items. This is true even though List.Generate is coded to produce the sequence of numbers from 1 to 100,000,000. Whew! It would be wasteful to produce a 100,000,000 item list to have all but three of those items thrown away. Thankfully, with M’s streaming semantics, this didn’t happen.

You don’t need to do anything to turn on streaming behavior. You simply assemble the processing steps in the order you want them. M takes care of streaming, when appropriate.

M allows you to code up what should be done (the logic of the computations) without needing to define specifically how that what should be done (the flow of control). M, like other functional languages, is declarative, not imperative. You focus on declaring the intent; M takes care of choosing the technical steps to achieve that intent.

Query Folding

Streaming semantics help with efficiency within M. Often, though, it’s more efficient to move processing outside of M’s interpreter and back to the source.

Imagine this scenario: You have a database table containing one million records, which your mashup filters down to what ends up being a single row. To process this expression, Power Query could fetch all one million records from the database, then apply the filter itself. This would work but would expend network bandwidth and M interpreter processing energies to handle the extra 999,999 records. If M only could tell the database “send me results that are already filtered….”

It can (at least, sometimes)! With query folding, M’s interpreter translates a chain of expressions into a query written in the native language of the data source.

For example, the below expression is translated into SQL that follows it. The net effect is that the database server only sends a single row across the wire to M. Yay!

let
    Source = Sql.Databases("some-server"),
    MillionRowTable = Source{[Schema="dbo",Item="MillionRowTable"]}[Data],
    Filtered = Table.SelectRows(MillionRowTable, each [ID] = 123)
in
    Filtered
select [_].[ID],
    [_].[Value]
from [dbo].[MillionRowTable] as [_]
where [_].[ID] = 123 and [_].[ID] is not null

Not every data source and transform function supports query folding. The level of support is determined by the source and the revision of M being used (changes to sources and library functions introduced in revisions can expand support for query folding).

If a mashup contains both foldable and non-foldable steps, M (for obvious reasons!) processes the non-foldable steps itself. Once a non-foldable step is encountered in a particular chain, steps that logically come after that step in the chain won’t be folded.

Below, the last step won’t be folded even though it supports folding because it comes after a non-foldable step. The steps in green may be folded while the steps in orange will not be folded.

Source (supports folding) -> Transform (supports folding) -> Transform (doesn’t support folding) -> Transform (supports folding)

Similar to streaming semantics, you don’t need to do anything to turn query folding on. However, when working with large data sets, the division between folded and non-folded can have significant performance ramifications. At times, it may be advantageous to encourage M to fold as much as possible by reordering (and, if necessary, reengineering) steps so that as many as possible are foldable and come before any non-foldable steps.

Power Query sometimes helps you out here. If it determines that your expression can be revised to improve efficiency without changing the results that are produced, it may auto-optimize your mashup. The adjustments made can include reordering operations (say by taking the preceding example’s last Transform (supports folding) and moving it before Transform (doesn’t support folding)—a change which opens up the possibility that all of the mashup’s “supports folding” steps will fold).

Right-click menu showing View Native Query menu item

For visibility as to when query folding is being used, sometimes right-clicking on a step in the GUI’s Applied Steps menu helps. If the View Native Query menu item isn’t grayed out, the clicked-on step is being folded.

Unfortunately, not all native queries are displayed by this menu option, so the fact that it is grayed out doesn’t always mean the step isn’t being folded. When it’s grayed out, you may need to use a trace tool or check logs to determine whether query folding is in play.

Immutable

Have you noticed that we’ve assigned initial values to variables but never updated or assigned new values to variables? Ah! This is because we can’t. In M, it’s not possible to change a variable once its value has been evaluated. Instead, when a variable is evaluated for the first time, it’s value is fixed for life. M’s variables are immutable.

Well, that is, except for the one exception. When streaming semantics are involved, each time a variable that represents a streamed source is accessed, the data that’s streamed may be different.

M’s documentation is vague on this point, so here’s my guess as to what this rule means: If a variable exposing a stream is accessed multiple times and the data source changes or orders data differently between those accesses, the data streamed will be different.

Say the starting point is a database. If the first time the variable exposing the stream is accessed, the source database table contains three records and all records are requested, then three records will be streamed. After this access, suppose an external process adds a fourth record to the source table. The next time the variable is accessed with all records requested, the stream will contain four records.

Even if the data source remains unchanged between variable accesses, if the source doesn’t order the data it outputs the same way each time (for example, if the data comes from a SQL query that doesn’t include an ORDER BY clause), the stream exposed by the variable could stream the data items in a different order each time the variable is accessed.

So, in the case of streaming semantics, Power Query’s variables are not immutable because they are not evaluated once then fixed for life.

The designers of M could have made variables always immutable by having M’s interpreter cache the data streamed through each variable on first access, then return the cached data on each subsequent access. However, cached data has to be stored somewhere. Practically, that somewhere is in memory or on disk. Storing a large data set in either location might be prohibitive (the computer might not have enough memory; disk I/O would make processing much slower). Allowing data to be restreamed each time the variable exposing it is accessed avoids this issue.

Mostly Pure

In the world of computer language theory, functions that comply with the following two rules are pure functions.

A function is a pure function if it:

  • Always produces the same output given the same input. For example, calling AddTwo(5), where AddTwo is defined as (x) => x + 2, will output 7 every time it’s called.
  • Does not have side effects. That is, the only effect calling the method has is producing output. The method won’t also, for example, have the side effect of modifying a global variable or saving changes to a file.

In M, the very nature of the language makes any expression you write pure unless it accesses an impure library function or, possibly, encounters an error with one of its sub-steps.

Library functions can access the outside world. The outside world isn’t always deterministic. If you use a library function that reads from a file or database then rerun that expression a few moments later, you might get different results even though you called the method with the exact same arguments. Why? The file or database may have changed. This change in output means that the function doesn’t always produce the same output given the same input so we call it an impure function.

Errors in the sub-steps of an expression can also cause it to be non-deterministic and so to violate the ‘same input produces the same output’ rule. Take the expression a + b as an example. M’s interpreter is free to evaluate a and b in whichever order it chooses because the expression does not dictate a particular evaluation ordering. However, if both sub-expressions (both a and b) return errors, then which is evaluated first will determine the error returned by a + b. If, one time, M executes a first, its error will be the error returned by a + b. The next time, if b is executed first, b’s error will become the error that’s raised. The fact that, given the same input, the error returned could be the error from a or from b means that the expression doesn’t always produce the same output given the same input in this limited case.

So, M is mostly pure. The fact that some library functions aren’t pure is okay. If the language were 100% pure, its usefulness would be limited because it couldn’t read from the outside word (like databases, websites and files). These “impure” functions are what makes accessing the outside world possible. The occasional lack of purity encountered due to errors is usually a non-issue.

Conclusion

Congratulations on making it through a lot of technical information! I hope what you’ve read has given you an understanding of how M works.

“How” has been our focus but “why?” is also an interesting question. In particular, if you come from a procedural or object-oriented background, some of the behaviors discussed may seem unduly restrictive. Why not allow variables to be changed? What’s the big deal about pure functions?

The stipulations we discussed enable M to work the way it does. If, say, functions you write could have side effects, M might be unable to use lazy evaluation because seemingly unneeded expressions might have side effects that you’re counting on. Allowing variables to be changed would limit the ability to use dependency ordering because deviating from literal source code order when executing could cause an intended variable manipulation to happen at the wrong time. These are just a couple examples of the benefits gained by this different way of doing things. In essence, if M worked more like a procedural or object-oriented language, you’d have to do what you do in those languages: imperatively code up the flow of control, instead of simply declaratively stating what you what done—which is what M’s functional nature allows you to do.

Next Time

With a foundation in the fundamentals established, it’s back to delving into the details! Next time, why don’t we examine M’s types? For example, do you know how to do a case-insensitive string comparison or encode a line return into a string?

Curious? See you soon!

Revision History

  • 2019-12-10: Updated section Query Folding to reflect the fact that a grayed out View Native Query menu option does not necessary imply that query folding is not occurring.
  • 2022-03-16: Revised section Query Folding to more accurately reflect how streaming works + added a description of Power Query’s expression auto-optimization functionality.

13 thoughts on “Power Query M Primer (Part 5): Paradigm

  1. Wouter De Raeve

    Thank you for writing this piece. It gives a unique view on the language. I hope to read more in the future.

    Reply
  2. Андрей Минаков

    Hi Ben! Thanks a lot for the great info! And I have a question. I found an error in JoinKind.FullOuter (as well as for RightOuter). Luckily, I workarounded it via Table.Buffer. I expected that I could get the same effect passing the table as a parameter to some intermediate function, but this didn't work. For me it means that function parameters are not eagerly evaluated. Please could you comment on that?

    The code which shows the error:

    let
        Source = Table.NestedJoin(Table.Buffer(#table(type table [k = number, w = text], {{1, "w"}, {2, "w"}})), {"k"}, Table.Buffer(#table(type table [k = number, w = text], {{2, "k"}, {4, "w"}})), {"k"}, "ww", JoinKind.FullOuter),
        #"1.Zombie Values if filter Null" = Table.SelectRows(Source, each ([w] = null)), 
        #"2.Zombie Null Rows when filter not null" = Table.SelectRows(Source, each ([w] <> null))
    in 
        #"2.Zombie Null Rows when filter not null"

    The problem is that you see:
    In step 1 - 2 values in ww Field of the Table instead of 1.
    In step 2 - you can still see row with null values, despite the filter for not null values.

    Using Table.Buffer(Source) solves the problem. But using let f = (x)=>x in f(Source) doesn't. So definitely passing a table as a parameter doesn't mean evaluation of the table. Please could you comment on that?

    Reply
    1. Ben Gribaudo Post author

      Interesting bug! I see on TechNet that you made Microsoft aware. Thank you! Hopefully, this problem can get fixed soon.

      As to what you asked me to comment on, it might help to think about functions for a moment. Consider the function calls DoSomthing(SomeFunction()) and DoSomething(SomeFunction). Since function parameters are eager evaluated, in both cases the parameter is evaluated regardless of whether its value is used inside DoSomething. However, the value produced by that evaluation is different for each example. With the first call, the value is the result produced by invoking SomeFunction. For the second call, the value is not the result of invoking SomeFunction but rather a function value for SomeFunction. (A function value is the result of evaluating, but not invoking, a function. In a sense, it’s a handle or pointer to the function which the recipient can choose to invoke, if desired, at a point and time of its choosing. See Function Values and Passing Functions in part 3 of this series.)

      Now, to tables. In the case of the function call SomeFunction(SomeTable), we know that the argument (SomeTable) will be eager evaluated, but what is the result of that evaluation? I’d suggest thinking of that result similar to a function value, but for tables. My guess is that, in essence, when the expression defining SomeTable is evaluated, the result is a “handle” allowing access to (streaming from/iteration of) the table defined by the table expression. (This is similar to how evaluating a function’s expression produces not the result of the function but rather a function value that can be invoked to produce the result.)

      So, the table variable is eager evaluated, because it is a parameter; however, evaluating that variable does not result in materializing the entire table (e.g. buffering the table by evaluating each cell value’s expression and storing the result in memory). This is a very good thing! If passing a table as a function parameter resulted in it being materialized, then query folding would be pointless because each time the table variable was passed into a transform step it would be fully buffered.

      Reply
    2. Aleksey Semiletov

      Andrey, there’s another way, instead of Table.Buffer you can use try … otherwise or if .. then .. else constructions in the filter expressions:

      let
          Source = Table.NestedJoin(#table(type table [k = number, w = text], {{1, "w"}, {2, "w"}}), {"k"}, #table(type table [k1 = number, w1 = text], {{2, "k"}, {4, "w"}}), {"k1"}, "ww", JoinKind.FullOuter),
          #"1.Zombie Values if filter Null" = Table.SelectRows(Source, each ([w] = null)), 
          #"2.Zombie Null Rows when filter not null" = Table.SelectRows(Source, each try [w]  null  otherwise null)
      in 
          #"2.Zombie Null Rows when filter not null"
      

      Why this works well? It’s all about “promises” 😉 In your case interpreter skiped your last filter (i don’t now why? it’s a bug), but when you add “try” it can’t skip this transformation. So it have to do your filter.
      I found it here: https://datachant.com/2018/02/14/split-camelcase-headers-m/
      That way helps when you work with big tables, or do many iterative calculations and can’t use Table.Buffer.

      Reply
  3. Ben Rogers

    I was always mystified by what “Query Folding” meant prior to this, and I didn’t know that there was an easy way (the right-click trick) to see if it was actively being used. A lot of the concepts in this post were very enlightening! Thanks so much for sharing.

    Reply
    1. Ben Gribaudo Post author

      Thank you for the kind words, Ben! On the query folding right-click check, just keep in mind that folding can be going on even when the “View Native Queries” menu option is grayed out.

      Reply
  4. James Wagman

    Hi Ben – thanks for the explanation. Would you know where/if I could find further info on the history of the “M” language as implemented by Microsoft? For example – why did Microsoft choose to develop this language – is it an extension of another project at Microsoft? I’m not interested at this point in the technical details – more interested in a narrative history.

    Thank you!

    Reply
    1. Ben Gribaudo Post author

      Hi James,

      Interesting question. I don’t know much in this regards. I know that back in the day Power Query used to be called Data Explorer (you can read a little about it on Matt Masson’s blog). Sounds like what you asked could make a good question for Twitter.

      I’d be curious to hear what you learn on this.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *