The Data Mashup Binary Stream: How Power Queries Are Stored

, , ,

You create a Power Query and save it. Its written somewhere inside your Microsoft Excel or Microsoft Power BI file—but where and in what format?

First, though, why would understanding the file format used to store Power Queries be of interest? Maybe you’re just curious and like to know how things work. 🙂 Maybe comparing query text between files or hooking up some kind of M version control is important. Maybe you’re looking for automation options, so you can do something like mass update the M code in a folder full of files quickly. While it’s unnecessary to understand the underlying file format to simply use Power Query, there are times and places where knowing about that format can come in handy.

In this post, we’ll lay a foundation on how to get to those saved mashups. We’ll overview the file format used to persist mashups, describing the various pieces of supporting information saved along with them, and point you to where you can learn about the fine, byte-level details of the format (in case you want to write your own parsing logic that reads and even modifies saved M queries!).

Enhanced Zip Files

If you’ve played with Excel or Power BI files, you probably almost immediately figured out that they are zip files (change the xlsx, pbix or pbit extension to zip and you can open the file in your favorite archive program). Yes, they are zip files—but more than just ordinary zip files: they’re structured according to the Open Package Conventions (OPC) format. Notice a file named [Content_Types].xml in the zip file’s top-level directory? That’s a clue that the zip is (probably) an OPC zip.

If you open an OPC zip file using an ordinary zip-extractor tool or API, you’ll see various OPC structures. Depending on what you’re trying to do, these can either be ignored or will need to manually parsed. Instead, if you’re using .Net, you might find it advantageous to use System.IO.Packaging.Package, which understands the OPC format and so can interpret and leverage the extra structures provided by OPC. (For working with Excel files from .Net, there’s even an option to go further: the Open XML SDK provides a strongly-typed set of objects for accessing and manipulating Excel’s specific XML OPC zip content. However, for purposes of getting to Power Queries, this SDK doesn’t add much relevant beyond what Package provides.)

Finding the Data Mashup

So, through one means or the other, you’ve opened an Excel or Power BI OPC zip file. Where are the Power Queries? In the data mashup binary stream!

For Excel, this stream is in what’s called a “custom part” (more or less is a place for add-ons and extensions to store arbitrary data). Custom parts are stored in the Excel OPC zip file in the /customXml subdirectory. Each is in an XML file whose name starts with item and is optionally followed by a number (e.g. item.xml, item1.xml, item2.xml, etc.). Usually, the part containing Power Queries is in item1.xml. However, this is not guaranteed and so should not be relied on. Instead, to locate the data mashup stream, iterate through all the item[/d+]?.xml custom part files, looking for one with a schema of http://schemas.microsoft.com/DataMashup (if the Excel file contains Power Query, there will be exactly one custom part with this schema). In this XML file, there will be <DataMashup> element with some base64-encoded content, and possibly an attribute named sqmid (a telemetry ID). Decode the base64 content and you have the data mashup binary stream.

For Power BI, finding the data mashup binary stream is quite simple. In the pbix or pbit OPC zip file, open the file named DataMashup, and you have it. No iterating XML files or base64 decoding required!

What’s Inside?

The data mashup binary stream’s format, along with the schema for Excel’s <DataMashup> xml container (mentioned above), is defined in the Query Definition File Format (MS-QDEFF).

Technically, MS-QDEFF is only documented as being implemented by Microsoft Excel. Power BI currently also uses the same format. While acknowledged by Microsoft, this fact is not listed in the specification. The net effect is that if you implement code that works with MS-QDEFF for Excel, it will also work (almost) out of the box with Power BI; however, if you encounter trouble using MS-QDEFF with Power BI, Microsoft may decline to provide support.

Inside the binary stream are five components. As we walk through them, you may find it useful to open an Excel or Power BI file in Data Mashup Explorer so that you can visually follow along.

Package Parts (OPC zip file)

Contains the actual Power Query mashup code, plus a little configuration information and (optionally) embedded content.

Specifics:

  • /Config/Package.xml—A simple XML file identifying the client version used to create the MS-QDEFF file, the minimum client version needed to read the file and the culture to use for parsing date/time strings.
  • /Formulas/Section1.m—The actual Power Queries (yay!), saved as a section document.
    • From my testing, several special rules apply to the section document beyond what are laid out in the Power Query language specification:
      • Only a single section is allowed. This section must be named “Section1”.
      • Section member names cannot contain . (periods), " (double quotes), leading/trailing whitespace or tab characters.
      • Section members should all be shared for proper functionality and user experience. Technically, non-shared members may be present; however, Query Editor provides no indication that they are unshared and no option to change them to shared; also, unshared members cannot be loaded/refreshed into Excel or Power BI (attempting to do so results in [Expression.Error] The import {memberName} matches no exports. Did you miss a module reference?). So, while technically allowed, I’d argue that unshared members are not generally intended to be supported and so should not be present.
  • /Content/{guid}—Legacy feature, used to hold embedded content (e.g. content that is consumed using Power Query’s Embedded.Value function). From what I’ve seen, these components are themselves each an OPC zip file with contents similar to a Package Part, though without the /Content subfolder and with the addition of a /Config/metadata.xml file.

Metadata (XML)

Can be divided into two logical sections: a small amount of information applicable to all queries and specific details about each query.

Several of the properties use custom binary formats or repetitious JSON for their values, which increases the complexity of reading metadata. (To help with this, in Data Mashup Explorer, Metadata’s JSON tab renders metadata as a JSON, with the repetitiousness filtered out, the various binary formats translated and the verbosity of the XML syntax hidden.)

Details:

  • AllFormulas (the general information section):
    • Defines query groups (if any are present).
    • Also, if Excel’s Update relationships when refreshing queries loaded to the Data Model option is enabled, that fact along with details about those relationships are stored here.
  • Formulas (contains an entry for each query/formula).
    • Names are in the format of {sectionName}/{formulaName}, where both sectionName and formulaName are URL encoded.
    • A number of properties may be given for each formula. MS-QDEFF lists most (but not all) of the ones that Excel uses. For Power BI, some of the properties output/used are the same as Excel, some are different and, like Excel, not all are listed in MS-QDEFF.
      • Some property values reflect where things stood as of the query’s last execution vs. its last edit. For example, suppose you edit a query, changing the columns it returns, then  close Query Editor before the query’s preview loads and don’t run a refresh in Excel/Power BI. The list of column names stored in metadata for the query may not reflect the change you just made because the edited query has yet to be processed.
    • Not mentioned in the specification, there can be formula entries with names in the form of {sectionName}/{formulaName}/{stepName}. These can be ignored. For some reason, Microsoft tooling emits these, but from what I’m told, both Excel and Power BI don’t do anything with them.

Metadata Content (OPC zip file)

I’ve never seen this used and am not sure what it was intended for. From what I’m told, it’s a legacy component.

Permissions (XML)

Contains three Boolean values:

  • CanEvaluateFuturePackages—Always false and ignored.
  • FirewallEnabled—File-level setting, where true = enforce privacy levels and false = ignore privacy levels.
    • Note: As this is a file-level setting, it can be overridden by application-level settings.
  • WorkbookGroupType—Controls the privacy level used when mashups in the current file read data from the Excel workbook that contains them (i.e. Excel.CurrentWorkbook()). In contrast, the privacy levels for external sources (like external servers, reading from the file system) are stored on a per-user basis on the local system.
    • Note: Setting this element to nil (null) and omitting it altogether are equivalent to setting it to None.

Permission Bindings (binary)

Used to implement a cryptographic checksum to protect the values in Permissions.

Saved Permissions values (i.e. FirewallEnabled and WorkbookGroupType) should only apply if the current user is the user that saved those values and if both those values and Package Parts have not been tampered with since that user saved them. If the user is different or if tampering occurred, saved Permissions should be discarded and replaced with the following defaults:

  • FirewallEnabled = true
  • WorkbookGroupType = null (i.e. xsi:nil=”true”)

How it works: On save, SHA-256 hashes are calculated for Package Parts and Permissions, then combined, encrypted using DPAPI scoped to the user, and written here as the contents of this section. When the file is later read, an attempt is made to decrypt these bindings. If they cannot be decrypted (e.g. because the current user is different from the user who encrypted them) or if the decrypted values do not match the current hash values of Package Parts and Permissions, the saved Permissions are ignored and replaced with the fallback values.

Background: DPAPI is a Windows-specific API which offers the ability to encrypt and decrypt data based on the current authenticated user, saving the user from needing to remember a special encryption key/password.

Support/Questions

Since MS-QDEFF is a part of Microsoft’s Open Specifications program, free support is available for questions related to it by contacting dochelp@microsoft.com. As already mentioned, MS-QDEFF does not officially list Power BI as implementing that standard, so MS-QDEFF questions specific to Power BI are out of scope for this free help.

The Doc Help team has helped me quite a bit with figuring out different parts of MS-QDEFF. Special kudos to team members HC, T and S for researching the many questions I sent your way—thank you very much for your help!

Change is a Coming

For Power BI, a format change is on the horizon. Currently, “enhanced dataset metadata” is in preview. This new format does not use MS-QDEFF DataMashup files, at least not when it turned on for pbix files. No word yet on whether a published specification for the new format will be released any time soon. (I hope there is but I’m not holding my breath).

Leave a Reply

Your email address will not be published. Required fields are marked *