Migration Thoughts: Raw Data vs. Real Data


A challenge when migrating data to a new system is that the legacy database may only contain a representation of the old system's data. Arguably, the data that actually needs to be migrated is what a user of the old system sees through its user interface, reports, or APIs, and that isn't always exactly what is stored in the database. Whether purposefully or inadvertently (think: bugs), the legacy system may apply transforms, filtering, and the like to the database's raw data before displaying or outputting it. To create an accurate export for migration, these data manipulations need to be factored in.

From one migration, I remember an email address table along the lines of the following. This table’s design allows a customer to have multiple email addresses, including multiple primary email addresses (as the example below shows).

EmailId  CustomerId  EmailAddress       IsPrimary
-------  ----------  -----------------  ---------
1        15835       joe@example.com    1
2        15835       jsmith@work.com    1
3        15835       jsmith@home.com    0
4        321163      bob@elsewhere.com  1

If this system’s database were the exhaustive, authoritative source for its data, it would seem reasonable to conclude that all of the table’s email addresses should be included in the dataset to be migrated.

However, in the old system's user interface, a customer could only have two email addresses: one primary and one secondary. Due to a bug, some email address edits resulted in new address rows being created instead of existing rows being updated. The system, perhaps inadvertently, was smart enough to ignore these old rows, so their presence didn't cause it problems. Even though these rows were physically present in the database, they weren't part of the system's authoritative data: from the user's perspective they did not exist. They were data corruption that needed to be excluded during the export extraction process.
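As a sketch of what that exclusion can look like in practice: the snippet below assumes the legacy UI always showed the most recently created row for each (customer, primary/secondary) slot, so the export keeps only those rows and drops the orphans the bug left behind. The column names and data mirror the example above; the table name and the "newest row wins" rule are assumptions standing in for whatever selection logic the real system used.

import sqlite3

# In-memory copy of the example email address table
# ("EmailAddress" is an assumed table name).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmailAddress (
        EmailId      INTEGER PRIMARY KEY,
        CustomerId   INTEGER NOT NULL,
        EmailAddress TEXT    NOT NULL,
        IsPrimary    INTEGER NOT NULL  -- 1 = primary, 0 = secondary
    )
""")
conn.executemany(
    "INSERT INTO EmailAddress VALUES (?, ?, ?, ?)",
    [
        (1, 15835,  "joe@example.com",   1),
        (2, 15835,  "jsmith@work.com",   1),
        (3, 15835,  "jsmith@home.com",   0),
        (4, 321163, "bob@elsewhere.com", 1),
    ],
)

# Export only the newest row per (CustomerId, IsPrimary) slot --
# an assumed stand-in for the rule the legacy UI actually applied.
rows = conn.execute("""
    SELECT e.CustomerId, e.EmailAddress, e.IsPrimary
    FROM EmailAddress e
    JOIN (
        SELECT CustomerId, IsPrimary, MAX(EmailId) AS LatestId
        FROM EmailAddress
        GROUP BY CustomerId, IsPrimary
    ) latest ON latest.LatestId = e.EmailId
    ORDER BY e.CustomerId, e.IsPrimary DESC
""").fetchall()

for row in rows:
    print(row)
# Row 1 (joe@example.com) is dropped: row 2 superseded it as the
# primary address, so the legacy UI never showed it.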

Another example: summing up the database's financial rows for a given order sometimes produced a total different from the amount displayed in the application's UI. It turns out the legacy application ignored financial rows whose "transaction type" column was set to a certain value. (Apparently, these rows were used back in the day as a redundant, alternate way to represent financial details.) To mirror the legacy system's current, expected behavior, and so produce totals matching those displayed in its UI, these relic rows had to be filtered out before generating the migration export.
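The same idea, sketched against a hypothetical OrderFinancial table with an invented relic transaction type of "ALT" (both names are made up for illustration; the real column value was system-specific): summing the raw rows overstates the total, while excluding the ignored type reproduces what the UI showed.

import sqlite3

# Hypothetical OrderFinancial table; 'ALT' marks the relic rows the
# legacy application silently ignores (both names are invented here).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE OrderFinancial (
        OrderId         INTEGER NOT NULL,
        TransactionType TEXT    NOT NULL,
        Amount          REAL    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO OrderFinancial VALUES (?, ?, ?)",
    [
        (1001, "CHARGE", 50.00),
        (1001, "CHARGE", 25.00),
        (1001, "ALT",    75.00),  # redundant legacy representation
    ],
)

# Naive sum over all rows -- does not match the UI.
raw_total = conn.execute(
    "SELECT SUM(Amount) FROM OrderFinancial WHERE OrderId = 1001"
).fetchone()[0]

# Mirror the legacy application's behavior: skip the ignored type.
ui_total = conn.execute(
    "SELECT SUM(Amount) FROM OrderFinancial "
    "WHERE OrderId = 1001 AND TransactionType <> 'ALT'"
).fetchone()[0]

print(f"raw sum of rows: {raw_total:.2f}")  # 150.00 -- wrong
print(f"UI-matching sum: {ui_total:.2f}")   # 75.00  -- what users see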

In both situations, if the database's raw contents were migrated as-is, data elements would be brought into the new system that business users would not consider real, valid data. In the case of the email addresses, previously deleted addresses would reappear in the new system, likely leading to confusion and potentially interfering with communications reaching customers (e.g., the new system might try to email a customer at an address they no longer use). In the case of the financial data, the presence of extraneous, out-of-date financial details would corrupt transaction histories and totals.

The moral of the story: what's in the database is only a representation of the system's data. The actual data to migrate is what a user of the system would see when they interact with the system. Extracting this data requires factoring in how the legacy system translates and interprets the database's contents.

Unfortunately, actually doing this can be painstaking, potentially involving reading documentation, manually comparing the UI against the database, referencing source code, and writing data validation scripts. But it is well worth the effort to avoid inadvertently populating the new system with corrupt data.
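On the validation-script front, even a crude spot check helps: compute per-record totals from the export and compare them against totals observed in the legacy UI. The dictionaries below are hand-entered stand-ins for real extracts; in practice the UI-side numbers might come from reports or screen captures.

# Hypothetical spot check: compare per-order totals in the migration
# export against totals observed in the legacy UI. Both inputs are
# illustrative stand-ins for real extracts.
export_totals = {1001: 75.00, 1002: 120.50, 1003: 10.00}
ui_totals     = {1001: 75.00, 1002: 118.00, 1003: 10.00}

mismatches = {
    order_id: (export_totals.get(order_id, 0.0), ui_totals[order_id])
    for order_id in ui_totals
    if abs(export_totals.get(order_id, 0.0) - ui_totals[order_id]) > 0.005
}

for order_id, (exported, shown) in sorted(mismatches.items()):
    print(f"order {order_id}: export says {exported:.2f}, UI shows {shown:.2f}")
# Any output here signals a transform or filter the export missed.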
