Power Query M Primer (Part 5): Paradigm

, , , ,

To improve our understanding of how Power Query works, let’s step back from the pieces and look at the paradigm—the methodology that M is built around. Conceptually, how does M work? How does it think? What makes it tick? The particulars, like variables, let expressions, functions and identifiers, are all important, but if we only focus on details, we might miss the big picture. We might not see the forest because we’re looking so intently at the trees. Let’s step back and learn about M’s modus operandi.

Without an awareness of M’s methodology, we might spend our days writing M code perplexed by certain behaviors, puzzled by why M seemingly doesn’t let us do certain things we might want to do and perspiring as we waste effort trying to mimic code patterns from other languages that are needless in M.

Series Index

What It Is

Power Query (a.k.a. M) is a query language. It’s intended for building queries that mash up data. Between “querying” and “mashing up,” M can be used to output data that has been cleaned up, filtered, transformed, aggregated and combined from multiple sources and formats. This is powerful!

However, M does not provide the ability to modify the data in data sources. It pulls data from sources but doesn’t write changes or insert new data into those sources. Also, it’s not a general-purpose programming or scripting language. M isn’t intended to do things like draw dialogs on the screen, copy files from one drive to another or print. M is for querying and mashing up data, not for solving all of the world’s computing problems! This tailored focus means that M can try to do one thing exceptionally well vs. trying to do so many different kinds of things that it only manages to do them so-so.

How It Works

On the technical side, in a nutshell, M is a functional language, sharing similarities with languages like F# and Haskell. Specifically, it’s mostly pure, higher-ordered and partially lazy. If you’re not sure what these terms mean, don’t worry. Just keep reading!

Order of Evaluation

M doesn’t necessarily execute expressions in the order they appear in scripts. M follows dependency ordering, not the physical order expressed in source code.

Think of how formulas in spreadsheets are evaluated. Below, in order to produce the requested output, first the formula in cell A3 must be evaluated. Once its value has been computed, A1 and A2 are processed. Which is executed first doesn’t matter, so the spreadsheet engine gets to choose whether to execute A1 before A2 or the other way around (or maybe it executes both in parallel).

Spreadsheet with formulas displayed

M’s interpreter works similarly. It looks at the expression that produces output and determines the ordering needed to make that output possible. If there are steps where that ordering isn’t dictated by dependencies, M gets to choose.

This implies that you can write M code in any order you’d like, even if that order is out of order—though  in order coding will probably be easier for other humans who read your code. Below, these two expressions are both valid and equivalent in their output even though their child expressions are given in opposite order. In both cases, M figures out the correct execution order to use when it processes them.

let
  Data = { 1, 2, 3},
  Result = List.Transform(Data, each _ * 10)
in
  Result
let
  Result = List.Transform(Data, each _ * 10),
  Data = { 1, 2, 3}
in
  Result

Partially Lazy

When M’s interpreter determines evaluation order, what happens if it encounters an expression that isn’t needed?

let
  Source = { "Bob", "Bill", "Sam" },
  Filtered = List.Select(Source, each Text.StartsWith(_, "B"))
in
  Source

Above, what does the interpreter do with Filtered, which doesn’t need to be evaluated to produce the output required by the in clause? The interpreter skips Filtered. Since it’s unnecessary, it’s not evaluated. But why would someone write needless code?

Applied Steps list showing first step selectedFor one, how about testing? In the GUI query editor, you might select a step that comes before the last so that you can inspect its output. When you do this, let’s in clause is set to point to the selected step. However, the skipped over steps are left in the source code. M’s interpreter figures out that evaluating them isn’t required so skips them. This is good—otherwise, time might be wasted evaluating (potentially complex) extraneous expressions!

Another example:

(even as logical) =>
let
  Odd = { 1, 3, 5 },
  Even = { 2, 4, 6 }
in
  if even then Even else Odd

Each time the above method is invoked, either the value of Odd or Even is needed—but not both values. M evaluates only the variable that will be used by in to produce output. In contrast, if M were a procedural or object-oriented language, likely both variables (both Odd and Even) would be evaluated prior to the in clause being executed. Needless work would have been done! Thankfully, M doesn’t operate this way.

We’ve just seen that variable assignments in let expressions are evaluated lazily. This lazy evaluation behavior also holds true for lists, records and tables.

List.Count({ ComputeValue(), ComputeAnotherValue() })

To count the number of items in a list, the M interpreter doesn’t need to evaluate the contents of those items, so it doesn’t. Above, ComputeValue() and ComputeAnotherValue() are never called because of lazy evaluation.

In the case of the record below, Salary is not evaluated because its value isn’t used when producing the requested output.

let
  ComputeWeeklySalary = (wage) => ...,
  PayDetails =
  [
    Wage = 25.10,
    Salary = ComputeWeeklySalary(Wage)
  ]
in
   PayDetails[Wage]

Similar lazy behavior also holds true for tables. However, in M, only let expressions, lists, records and tables are lazy evaluated. Eager evaluation is used for everything else. This is why we say that M is only partly lazy.

For example, arguments to a function call are eagerly evaluated.

let
  Numbers = { 1, 2, 3 },
  Letters = { "A", "B", "C" },
  Choose = (chooseFirst as logical, first, second) => 
    if chooseFirst then first else second
in
  Choose(true, Numbers, Letters)

Since both Numbers and Letters are used as parameters when Choose is called and parameters aren’t lazy evaluated, the expressions for both variables are evaluated even though only one of the two will used by the function.

Higher-Ordered

Above, both lists are evaluated because they were passed as arguments to a function. What if we really wanted to avoid having both lists materialized only for one to be thrown away? Hmm…

Wait! Remember—we can pass functions around. Instead of passing Choose two lists of values, what if we adapt it so that we pass it functions that produce values when they are invoked? (I don’t meanuse functions in the parameter list to populate the function’s arguments,” I mean “pass the functions as arguments into the function itself.”) Choose can then choose to only execute one of the functions—the  one that produces the output it needs. The other function is not invoked so the list it generates will not be generated.

let
  Numbers = () => { 1, 2, 3 },
  Letters = () => { "A", "B", "C" },
  Choose = (chooseFirst as logical, first, second) => 
    if chooseFirst then first() else second()
in
  Choose(true, Numbers, Letters)

Technically, the M interpreter still evaluates both Numbers and Letters. However, since these names now identify functions, evaluating them simply takes their function expressions and turns them into function values (we talked about this in part 3). The two function values are passed into function Choose which then invokes only one of them. For short, we say that we passed two functions into Choose which chose which of the two to invoke (how’s that for a tongue twister!).

The ability to pass functions into other functions as arguments, as well the ability to return functions from functions (we also talked about this in part 3), makes M a higher-ordered language.

Streaming Semantics

Directly or indirectly, a M mashup is built by chaining methods. Call a function that fetches data from the source, pass its output to a function that applies a transformation, take the result of that function and pass it to another function that computes an aggregate and so forth.

However, the just-described way of passing data can be inefficient. Implemented literally, it requires that the full output of the first method be assembled before being passed to the next method in the chain. Then, that method processes its entire input, doing whatever it does until it has produced a full set of output. This is then passed to the next method, and so forth. Each step of the chain can involve iterating through the entire data set, once per step.

Why iterate through all the data once for each transform? Instead, why not iterate it once, feeding each item, as it’s encountered, through all the transform steps? Why not take the first item from the source, apply each transformation step to it in turn and then collect it as part of what eventually will be output by the expression as a whole, then repeat the process for the second item, and so forth?

For lists and tables, M supports something like this automatically, without you having to do anything! Streaming semantics allows M to pull items from lists and tables through the chain of transforms. You simply code up M like you always have—by building the chain of methods that define the desired mashup—and M’s interpreter, when appropriate, streams items through that chain instead of executing each transformation step in its entirety on the entire data set before executing the next step.

A moment ago, I mentioned pulling items through the chain of transforms. Why pull? Pulling allows the steps that needs data to control the amount of data that’s processed.

let
  Numbers = List.Generate(()=>1, each _ <= 100000000, each _ + 1),
  Filtered = List.Transform(Numbers, each _ * 2),
  Result = List.FirstN(Filtered, 3)
in
  Result

Above, only the first three items are needed to produce the requested output. Since data is pulled as needed by the methods that needs it, List.FirstN causes only the first three items to flow through the method chain. Since only three items are pulled, List.Generate only generates the first three items. This is true even though List.Generate is coded to produce the sequence of numbers from 1 to 100,000,000. Whew! It would be wasteful to produce a 100,000,000 item list to have all but three of those items thrown away. Thankfully, with M’s streaming semantics, this didn’t happen.

You don’t need to do anything to turn on streaming behavior. You simply assemble the processing steps in the order you want them. M takes care of streaming, when appropriate.

M allows you to code up what should be done (the logic of the computations) without needing to define specifically how that what should be done (the flow of control). M, like other functional languages, is declarative, not imperative. You focus on declaring the intent; M takes care of choosing the technical steps to achieve that intent.

Query Folding

Streaming semantics help with efficiency within M. Sometimes, though, it’s more efficient to move processing outside of M’s interpreter and back to the source.

Imagine this scenario: You have a database table containing one million records. All your M expression needs is the first row. Your expression could fetch all one million records from the database, then take the first record. This would work but would expend network bandwidth and M interpreter processing energies to handle the extra 999,999 records. If M only could tell the database “just send me the first record….”

It can (at least, sometimes)! With query folding, M’s interpreter translates a chain of expressions into a query written in the native language of the data source.

For example, the below expression is translated into SQL that follows it. The net effect is that the database server only sends a single row across the wire to M. Yay!

let
    Source = Sql.Databases("some-server"),
    db = Source{[Name="Test"]}[Data],
    MillionRowTable = db{[Schema="dbo",Item="MillionRowTable"]}[Data],
    KeepFirstRow = Table.FirstN(MillionRowTable,1)
in
    KeepFirstRow
select top 1
    [$Table].[Name] as [Name],
    [$Table].[Amount] as [Amount]
from [dbo].[MillionRowTable] as [$Table]

Not every data source and transform function supports query folding. The level of support is determined by the source and the revision of M being used (changes to sources and library functions introduced in revisions can expand support for query folding).

If a mashup contains both foldable and non-foldable steps, M (for obvious reasons!) executes the non-foldable steps itself. Once a non-foldable step is encountered in a particular chain, steps that logically come after that step in the chain won’t be folded.

Below, the last step won’t be folded even though it supports folding because it comes after a non-foldable step. The steps in green may be folded while the steps in orange will not be folded.

Source (supports folding) -> Transform (supports folding) -> Transform (doesn’t support folding) -> Transform (supports folding)

Right-click menu showing View Native Query menu itemSimilar to streaming semantics, you don’t need to do anything to turn query folding on. However, when working with large data sets, where the division between folded and non-folded occurs can have significant performance ramifications. At times, it may be advantageous to encourage M to fold as much as possible by reorder (and, if necessary, reengineer steps) so that as many as possible are foldable and come before any non-foldable steps.

For visibility as to when query folding is being used (and the query being used), try right-clicking on a step in the GUI’s Applied Steps menu. If the View Native Query menu item isn’t grayed out, the clicked-on step is being folded.

Immutable

Have you noticed that we’ve assigned initial values to variables but never updated or assigned new values to variables? Ah! This is because we can’t. In M, it’s not possible to change a variable once its value has been evaluated. Instead, when a variable is evaluated for the first time, it’s value is fixed for life. M’s variables are immutable.

Well, that is, except for the one exception. When streaming semantics are involved, each time a variable that represents a streamed source is accessed, the data that’s streamed may be different.

M’s documentation is vague on this point, so here’s my guess as to what this rule means: If a variable exposing a stream is accessed multiple times and the data source changes or orders data differently between those accesses, the data streamed will be different.

Say the starting point is a database. If the first time the variable exposing the stream is accessed, the source database table contains three records and all records are requested, then three records will be streamed. After this access, suppose an external process adds a fourth record to the source table. The next time the variable is accessed with all records requested, the stream will contain four records.

Even if the data source remains unchanged between variable accesses, if the source doesn’t order the data it outputs the same way each time (for example, if the data comes from a SQL query that doesn’t include an ORDER BY clause), the stream exposed by the variable could stream the data items in a different order each time the variable is accessed.

So, in the case of streaming semantics, Power Query’s variables are not immutable because they are not evaluated once then fixed for life.

The designers of M could have made variables always immutable by having M’s interpreter cache the data streamed through each variable on first access, then return the cached data on each subsequent access. However, cached data has to be stored somewhere. Practically, that somewhere is in memory or on disk. Storing a large data set in either location might be prohibitive (the computer might not have enough memory; disk I/O would make processing much slower). Allowing data to be restreamed each time the variable exposing it is accessed avoids this issue.

Mostly Pure

In the world of computer language theory, functions that comply with the following two rules are pure functions.

A function is a pure function if it:

  • Always produces the same output given the same input. For example, calling AddTwo(5), where AddTwo is defined as (x) => x + 2, will output 7 every time it’s called.
  • Does not have side effects. That is, the only effect calling the method has is producing output. The method won’t also, for example, have the side effect of modifying a global variable or saving changes to a file.

In M, the very nature of the language makes any expression you write pure unless it accesses an impure library function or, possibly, encounters an error with one of its sub-steps.

Library functions can access the outside world. The outside world isn’t always deterministic. If you use a library function that reads from a file or database then rerun that expression a few moments later, you might get different results even though you called the method with the exact same arguments. Why? The file or database may have changed. This change in output means that the function doesn’t always produce the same output given the same input so we call it an impure function.

Errors in the sub-steps of an expression can also cause it to be non-deterministic and so to violate the ‘same input produces the same output’ rule. Take the expression a + b as an example. M’s interpreter is free to evaluate a and b in whichever order it chooses because the expression does not dictate a particular evaluation ordering. However, if both sub-expressions (both a and b) return errors, then which is evaluated first will determine the error returned by a + b. If, one time, M executes a first, its error will be the error returned by a + b. The next time, if b is executed first, b’s error will become the error that’s raised. The fact that, given the same input, the error returned could be the error from a or from b means that the expression doesn’t always produce the same output given the same input in this limited case.

So, M is mostly pure. The fact that some library functions aren’t pure is okay. If the language were 100% pure, its usefulness would be limited because it couldn’t read from the outside word (like databases, websites and files). These “impure” functions are what makes accessing the outside world possible. The occasional lack of purity encountered due to errors is usually a non-issue.

Conclusion

Congratulations on making it through a lot of technical information! I hope what you’ve read has given you an understanding of how M works.

“How” has been our focus but “why?” is also an interesting question. In particular, if you come from a procedural or object-oriented background, some of the behaviors discussed may seem unduly restrictive. Why not allow variables to be changed? What’s the big deal about pure functions?

The stipulations we discussed enable M to work the way it does. If, say, functions you write could have side effects, M might be unable to use lazy evaluation because seemingly unneeded expressions might have side effects that you’re counting on. Allowing variables to be changed would limit the ability to use dependency ordering because deviating from literal source code order when executing could cause an intended variable manipulation to happen at the wrong time. These are just a couple examples of the benefits gained by this different way of doing things. In essence, if M worked more like a procedural or object-oriented language, you’d have to do what you do in those languages: imperatively code up the flow of control, instead of simply declaratively stating what you what done—which is what M’s functional nature allows you to do.

Next Time

With a foundation in the fundamentals established, it’s back to delving into the details! Next time, why don’t we examine M’s types? For example, did you know that M offers several ways to store numbers—and if you choose the wrong way, math on those numbers might not work the way you expect?

Curious? See you soon!

One thought on “Power Query M Primer (Part 5): Paradigm

  1. Wouter De Raeve

    Thank you for writing this piece. It gives a unique view on the language. I hope to read more in the future.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *