ETL Without the T: A Loop-Driven, Data about the Dataflow Approach Using SSIS

Microsoft SQL Server Integration Services (SSIS) is designed for ETL (extract, transform, load). Sometimes, however, the T in ETL isn’t necessary: data simply needs to be moved without manipulation. While defining a simple source-to-destination dataflow is easy, duplicating that dataflow for each of many source-destination pairs results in a verbose package that violates the Don’t Repeat Yourself (DRY) principle.

Using a loop to apply a single copy dataflow to a list of source-destination pairs eliminates this repetition. With this approach, the main dataflow in the SSIS package is not the data being moved; rather, it is data about that data: data describing the sources and destinations to which the extract-and-load operation should be applied.
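Stripped of SSIS specifics, the pattern reduces to something like the following C# sketch (the table names and the CopyTable helper are purely illustrative):

// Illustrative pseudocode of the pattern: one reusable extract-and-load
// operation driven by a list of source-destination pairs. CopyTable stands
// in for whatever mechanism actually moves the data.
var mappings = new[]
{
    new { Source = "dbo.Orders",    Destination = "staging.Orders"    },
    new { Source = "dbo.Customers", Destination = "staging.Customers" },
};

foreach (var mapping in mappings)
{
    CopyTable(mapping.Source, mapping.Destination);
}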

SSIS’s reliance on static metadata often complicates implementing this approach. While variables set by a Foreach Loop Container can be used in child components to change the source and destination, the schema defining what the data paths can transmit is statically configured at design time. Using variables to dynamically adjust source and destination doesn’t change what’s allowed to pass on the path between those connections. Unless all data to be moved shares the same schema, flowing the data between reused source and destination connections isn’t viable.

The static path metadata problem can be bypassed if a SQL statement batch (e.g., INSERT INTO…SELECT or MERGE) can perform the entire extract-and-load operation. This is often the case when source and destination reside on the same database server instance or when one server links to the other. Inside the Foreach Loop, a Script Task or Expression Task can build the SQL statement appropriate for the current source-destination pair; an Execute SQL Task then runs it. In certain situations this solution is ideal, as the performance cost of flowing data from the source server through SSIS to the destination server is eliminated. However, the prerequisite that the entire extract-and-load process be encapsulated in a single SQL statement batch isn’t always practical (for example, linking the servers may not be an option for technical or policy reasons), leaving a desire for a more robust solution.
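For illustration, the body of such a Script Task might look something like this (the variable names and generated statement are hypothetical, and, as with the appendix code, no SQL escaping is performed):

// Hypothetical Script Task body: builds an INSERT INTO...SELECT batch for the
// current source-destination pair and exposes it through a package variable
// that a downstream Execute SQL Task then runs. Variable names are
// illustrative; User::CopySql must be listed among the task's
// ReadWriteVariables for the assignment to succeed.
public void Main()
{
    var source = (string)Dts.Variables["User::SourceTable"].Value;
    var destination = (string)Dts.Variables["User::DestinationTable"].Value;

    // *WARNING* - illustrative only - no SQL escaping performed
    Dts.Variables["User::CopySql"].Value =
        String.Format("INSERT INTO {0} SELECT * FROM {1};", destination, source);

    Dts.TaskResult = (int)ScriptResults.Success;
}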

The Transfer SQL Server Objects Task also bypasses the static metadata issue by not passing data over paths. However, this task doesn’t allow the list of objects it copies to be configured dynamically, making it unsuitable for a loop- or list-driven solution.

Avoiding the static metadata issue requires that the entire extract-and-load process take place inside a single SSIS task. Eliminating the requirement that a single SQL statement be used implies that source and destination can be on different servers. Reusing a single dataflow to process multiple source-destination pairs requires that relevant task properties be dynamically configurable. No built-in SSIS task fulfills all of these requirements. A third-party component or custom code must be used. The custom code involved isn’t necessarily complex: the prototype extract-and-load Script Component provided in the appendix is simple enough to fit on a single page if comments are removed.

A loop-driven, data about the dataflow approach separates logic from configuration, simplifying package maintenance and speeding up changes to source-destination mappings. The package contains only one copy of the extract-and-load operation’s dataflow instead of the many duplicates the standard SSIS ETL-centric approach requires, so logic changes need to be made in only one place. And because which data to move and where to move it are controlled outside the package, editing source-destination mappings bypasses the development workflow involved in modifying an SSIS package and can be delegated to non-developers.

When the T in ETL is unnecessary, the simpler package maintenance and ease of modifying source-destination mappings that come from a loop-driven, data about the dataflow approach make it an attractive option.

Appendix: A Custom Code Script Component Prototype

While error handling, SQL escaping, and other necessities of production code have yet to be added, this proof of concept illustrates the core functionality involved in implementing a custom extract-and-load Script Component. The component is fed a list of source and destination table mappings to copy between. Updating the source and destination database connections, as necessary, is done outside the component.

SSIS’s normal design paradigm isolates sources and destinations via connection-type-agnostic paths. Moving both data fetching and loading into a single component eliminates this isolation. Additionally, bulk loading functionality tends to be database-platform-specific. The custom code component needs to support all connection types and bulk loading methods of interest. In the example below, a single source connection type and bulk loading method are implemented.

// ** Proof-of-Concept, Prototype Code -- not intended for production use **
// All warranties disclaimed, to the fullest extent allowed by law.
using System;
using System.Data;
using System.Linq;
using System.Data.SqlClient;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private object rawSourceConnection;
    private SqlConnection sourceConnection;
    private object rawDestinationConnection;
    private SqlBulkCopy bulkCopy;

    public override void PreExecute()
    {
        base.PreExecute();

        rawSourceConnection = Connections.Source.AcquireConnection(null);
        sourceConnection = (SqlConnection)rawSourceConnection;

        rawDestinationConnection = Connections.Destination.AcquireConnection(null);
        var destinationConnection = (SqlConnection)rawDestinationConnection;

        /* Transactions, null value & identity value preservation, trigger firing, table vs. row 
         * locking and constraint checking are all controlled via constructor arguments. 
         *
         * Re-using a single instance across copy operations per 
         * http://msdn.microsoft.com/en-us/library/s4s223c6%28v=vs.100%29.aspx 
         */
        bulkCopy = new SqlBulkCopy(destinationConnection, SqlBulkCopyOptions.KeepNulls | SqlBulkCopyOptions.KeepIdentity, null);
    }

    public override void PostExecute()
    {
        base.PostExecute();

        ((IDisposable)bulkCopy).Dispose();

        Connections.Source.ReleaseConnection(rawSourceConnection);
        Connections.Destination.ReleaseConnection(rawDestinationConnection);
    }

    public override void Configuration_ProcessInputRow(ConfigurationBuffer Row)
    {
        // *WARNING* - Prototype Code - no SQL escaping performed
        var fetchSql = String.Format("SELECT * FROM {0}", Row.SourceTable);
        var command = new SqlCommand(fetchSql, sourceConnection);

        using (var reader = command.ExecuteReader())
        {
            ConfigureMappings(reader);
            bulkCopy.DestinationTableName = Row.DestinationTable;
            bulkCopy.WriteToServer(reader);
        }
    }

    /* Without mappings, SqlBulkCopy maps from source to destination by ordinal column position 
     * (e.g. source's column 1 is copied to destination's column 1). Setting up name mappings 
     * overrides this behavior.
     */
    private void ConfigureMappings(SqlDataReader reader)
    {
        bulkCopy.ColumnMappings.Clear();

        var schemaData = reader.GetSchemaTable().Rows.Cast<DataRow>();
        var columns = (from row in schemaData
                       select row["ColumnName"]).Cast<string>();

        foreach (var column in columns)
        {
            bulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(column, column));
        }
    }
}

The Transfer class in SQL Server Management Objects (SMO) is an alternative to SqlBulkCopy. Oracle users may find OracleBulkCopy of interest.
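A minimal SMO Transfer sketch, assuming the destination tables already exist and the SMO assemblies are referenced (all server, database, and table names below are placeholders):

// Minimal SMO Transfer sketch -- illustrative only; names are placeholders.
// Requires the SMO assemblies (e.g. Microsoft.SqlServer.Smo and
// Microsoft.SqlServer.SmoExtended).
using Microsoft.SqlServer.Management.Smo;

public static class SmoTransferSketch
{
    public static void Run()
    {
        var sourceDatabase = new Server("SourceServer").Databases["SourceDb"];

        var transfer = new Transfer(sourceDatabase)
        {
            DestinationServer = "DestinationServer",
            DestinationDatabase = "DestinationDb",
            DestinationLoginSecure = true,  // Windows authentication
            CopyAllObjects = false,         // copy only what's in ObjectList
            CopySchema = false,             // destination tables assumed to exist
            CopyData = true
        };

        transfer.ObjectList.Add(sourceDatabase.Tables["Orders", "dbo"]);
        transfer.TransferData();
    }
}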
