Lazy, Streamed, Immutable: Try Building a Table

, , , , ,

Lazy evaluation, streaming and immutability are key Power Query concepts which must be understood to truly grasp how Power Query “thinks.” Want to test your understanding in these foundational areas—and try to grow it—by tackling an assignment?!

Your Task

Code up a single row table containing the following columns:

  • A column of hard-coded data.
  • A couple columns whose data is fetched from an API.
  • A couple columns whose data is fetched from another API.
Example of expected table

Where:

  • Neither API is called if the table’s rows are simply counted.
  • Only the first API is called if the columns containing data from the second API are removed from the table, and vice versa.
  • Each API is called at most once, even when multiple columns that contain data from that API are output.

Without:

  • Using any standard library table functions to build the table (i.e. no “Table.*” functions may be used).

Row Count Without API Calls

Below is a bit of an abnormal table definition. Instead of specifying a list of column values, the expression defining the table’s single row raises an error.

let
  Source = #table(
    {"HardCodedCol1", "ApiACol1", "ApiACol2", "ApiBCol1", "ApiBCol2"},
    {
      error "no column values defined for row"
    }
  )
in
  Table.RowCount(Source) // returns 1

Perhaps surprisingly, neither the row’s absence of column values, nor the presence of the error-raising expression affect Table.RowCount, which dutifully reports that this table contains 1 row.

Why doesn’t Table.RowCount die with an error?

In M, a table isn’t stored in memory as a fully materialized set of row and column values. Rather, a table is represented as a programmatic construct that contains a handle which can be used to request that the promised rows be produced on demand, one at a time (the technical term for this behavior is streaming). The expressions defining those rows and columns will then be evaluated only if their values are actually needed (lazy evaluation).

To perform its duties, Table.RowCount takes the table construct it receives as its input and uses the promise handle it contains to start an enumeration of the table’s rows. Table.RowCount requests the table’s first row, and when it receives it, increments its internal counter by +1.

All that Table.RowCount needs to know is that this row exists, not the row’s value or the column values it contains. As it does not touch anything in the row (i.e. it does not read any of the values that make up the row), the row’s expression is not evaluated. Since it is not evaluated, the error it defines is not raised.

Thanks to lazy evaluation, Power Query has avoided the unnecessary work of evaluating the row’s expression. Table.RowCount‘s job is to count how many total rows are in the table, not how many error free rows are in the table. To do its job, there’s no need to know anything about what a row contains or raises. All that’s needed is simply to know that the row exists, a fact that can be determined without touching the row’s expression.

After receiving the first row and incrementing its internal counter, Table.RowCount moves on to request the next row from the table. As the table only contains one row, this request is answered with a signal that no more rows are available. Table.RowCount then simply exits, returning the value of its internal counter, which in this case is the humble “1”.

To satisfy the criterion that neither API is called when the table’s rows are counted, nothing special needs to be done. Power Query won’t compute the table’s row or column values unless they are needed. The fact that a column’s expression might be defined to pull data from an API is irrelevant: column values are not needed when counting all rows, so their expressions (including any API calls they may make) are not evaluated when performing a simple Table.RowCount.

Only Call API If Its Data Is Needed

The table you’ve been tasked to code up is to include a couple sets of columns, each pulling data from a different API. Not every consumer of the table is guaranteed to read all its columns. Perhaps some are only interested in a subset so will use Table.SelectColumns or Table.RemoveColumns (or another table-reshaping technique) to narrow the table. 

It would be wasteful to pay the time and (potentially) money cost of invoking APIs to compute column values which are not needed. What do you need to do to keep API calls for the removed columns from occurring?

To help us ponder this, let’s consider a table definition that is a bit more normal than our last example:

let
  GetFromA = () => Json.Document(Web.Contents("http://localhost:1234/a")),
  GetFromB = () => Json.Document(Web.Contents("http://localhost:5678/b")),
  Source = #table(
    {"HardCodedCol1", "ApiACol1", "ApiACol2", "ApiBCol1", "ApiBCol2"},
	{
	  { 123, GetFromA()[a1], GetFromA()[a2], GetFromB()[b1], GetFromB()[b2] }
	}
 )
in
  Source

To prevent unnecessary API calls, you don’t need to do anything. This behavior is yours, for free, by default, thanks to lazy evaluation.

An expression like the following reduces the table to just the two columns whose data comes from the first API:

Table.SelectColumns(Source, {"ApiACol1", "ApiACol2"})

Since Power Query only evaluates column expressions if they are needed, the second API won’t be called unless values from it are needed—which isn’t the case as the table has been narrowed to include only “ApiA” columns. Similar holds true if the table were reduced to just “ApiB” columns: in that case, API A won’t be called.

Lazy evaluation again eliminates unnecessary work from taking place.

Invoke Each API At Most Once

Going back to the last table definition example, if you trace the function invocations it makes (perhaps by wrapping the bodies of GetFromA and GetFromB with Diagnostics.Trace), you’ll notice that each is called twice whenever the full table’s output is consumed. Calling each API twice* as part of producing a single table row is undesirable.

(*At least, this is true from the perspective of M. If Power Query’s persistent query cache is in play, it may catch the second native HTTP request and answer it from that cache instead of allowing it to reach the network.)

Why is each called twice? Well, two column expressions invoke GetFromA() as part of computing their values, and similarly two invoke GetFromB(). When each column’s expression is evaluated, the function call it specifies is made. Nothing saves the results from that function call to share it across columns in the same row.

This is wasteful. Each API only needs to be invoked at most once to fetch the needed data. Twice is too much. What are you going to do about this?

How about:

let
  GetFromA = () => Json.Document(Web.Contents("http://localhost:1234/a")),
  GetFromB = () => Json.Document(Web.Contents("http://localhost:5678/b")),
  Source = #table(
    {"HardCodedCol1", "ApiACol1", "ApiACol2", "ApiBCol1", "ApiBCol2"},
    {
	  let
	    a = GetFromA(),
	    b = GetFromB()
	  in 
        { 123, a[a1], a[a2], b[b1], b[b2] }
    }
  )
in
  Source

So simple, yet also easy to miss what is going on.

For the moment, let’s focus on just the pair of columns whose values come from API A. Both column definitions include a reference to let expression variable a. Thanks to lazy evaluation, a let variable’s expression won’t be evaluated unless its value is requested, so a = GetFromA() won’t run unless either the value for column ApiACol1 or ApiACol2 (or both) is needed.

In M, an identifier’s value, once computed, is immutable for the remaining life of that identifier. So a‘s value, once computed, will be saved in that let expression variable. This saved value will then be used to satisfy any subsequent requests for a‘s value.

In our example, if the values of both ApiACol1 and ApiACol2 are needed, the first one to have its column expression evaluated will trigger a‘s value to be computed. That value will be saved in a as well as returned to the requesting column expression. Then, when the second column’s expression asks for a‘s value, the already computed, saved value will be returned instead of a = GetFromA() being evaluated a second time.

So, by saving API A’s value into a variable, it can share it between the two columns that may need it (immutability). At the same time, that variable’s expression won’t be evaluated unless at least one of those columns actually needs its value (lazy evaluation).

Zooming back out to the entire row, all together, each API will be invoked at most one time (thanks to M’s immutability), but only if data from it is truly needed (thanks to M’s laziness).

Keep in mind that this immutability only lasts for the lifespan of the identifier that holds the saved value. Above, nothing saves the row expression’s value, so each time a new request for the row to be streamed out is processed, the row’s expression will be evaluated anew—and this includes the let expression it contains. So, each streaming of the row will be produced using a brand new a and b—which each will have their values computed if and when they are first requested as part of outputting that row.

There you have it. Congratulations! You’ve built a table that achieved the stated objectives—and hopefully learned a bit about lazy evaluation, streaming and immutability in the process!

Leave a Reply

Your email address will not be published. Required fields are marked *