A challenge when migrating data to a new system is that the legacy database may only contain a representation of the old system’s data. Arguably, the actual to-be-migrated data is what a user of the old system sees when using its user interface, reports or APIs. This data isn’t always exactly what is present in the database. Whether purposefully or inadvertently (think: bugs), the legacy system may apply transforms, filtering and such to the database’s raw data before displaying or outputting it. In order to create an accurate export for migration, these data manipulations need to be factored in.
From one migration, I remember an email address table along the lines of the following. This table’s design allows a customer to have multiple email addresses, including multiple primary email addresses (as the example below shows).
EmailId CustomerId EmailAddress IsPrimary
------- ---------- ----------------- ---------
1 15835 joe@example.com 1
2 15835 jsmith@work.com 1
3 15835 jsmith@home.com 0
4 321163 bob@elsewhere.com 1
If this system’s database were the exhaustive, authoritative source for its data, it would seem reasonable to conclude that all of the table’s email addresses should be included in the dataset to be migrated.
However, in the old system’s user interface, a customer could only have two email addresses: one primary and one secondary. Due to a bug, some email address edits resulted in new address rows being created instead of existing rows being updated. The system, perhaps inadvertently, was smart enough to ignore these old rows, so their presence didn’t cause it problems. Even though these rows were physically present in the database, they weren’t a part of the system’s authoritative data because from the user’s perspective they did not exist; instead, they were data corruptions which needed to be ignored during the export extraction process.
Continue reading →