Skip to content

Updated migrations to handle unparseable data#534

Open
edkeeble wants to merge 3 commits into
devfrom
bugfix/casei-538-jsonfields
Open

Updated migrations to handle unparseable data#534
edkeeble wants to merge 3 commits into
devfrom
bugfix/casei-538-jsonfields

Conversation

@edkeeble

@edkeeble edkeeble commented Jul 5, 2023

Copy link
Copy Markdown
Contributor

Also reconvert data after switching columns back to JSONField.

This PR addresses a couple of issues with the previous attempt to store CMR data as JSON:

  1. Data in certain fields weren't converting correctly
  2. We converted the data and first stored it as a string in the existing TextField columns, then converted those columns to JSONField, but didn't reprocess the data again to store it as objects, so all data remained in string format.

I've run this on a dump of the production DB and it works correctly, with the exception of one record, which has an unparseable "false" value. My recommendation is to restore the staging DB from a dump of the production DB prior to deploying this and manually update that record, since it appears to be an anomaly:

d = DOI.objects.get(uuid='6a8e809b-c522-4029-a6b0-e28348a6bcf1')
d.cmr_dates
'[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": false}]'
d.cmr_dates='[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": False}]'
d.save()

Also of note: the cmr_data_formats column contains inconsistently formatted data. In some cases, it stores an array of strings and in others it stores a single string. The updated migration will force a single string to be stored as an array, which I think makes sense given the name of the column.

@edkeeble edkeeble force-pushed the bugfix/casei-538-jsonfields branch from 8ba67a6 to a7fda0e Compare July 5, 2023 17:46
@edkeeble edkeeble requested review from CarsonDavis and alukach July 5, 2023 17:48
@alukach

alukach commented Jul 5, 2023

Copy link
Copy Markdown
Member

the cmr_data_formats column contains inconsistently formatted data. In some cases, it stores an array of strings and in others it stores a single string.

Do you feel like you have a solid understanding of how we are getting ourselves into these types of situations? Shouldn't we also add some logic into the application to ensure that we don't find ourselves back in this same situation in the future?

@edkeeble

edkeeble commented Jul 5, 2023

Copy link
Copy Markdown
Contributor Author

Yeah, that's a good point. I haven't updated the DOI Matching process to alter the data we insert into the DB. At a glance, the DOI Matcher still serializes all the data. I'll check if that matters in this context or if that's to be expected, since the DOI Matcher creates Change objects and stores everything in the update field.

The only references I see to cmr_data_formats seems to suggest that they are always stored as an array/list, so I'm not sure how we ended up with plain strings in the DB. We could check the CMR API and see if the issue is at the source.

@alukach

alukach commented Jul 5, 2023

Copy link
Copy Markdown
Member

My recommendation is to restore the staging DB from a dump of the production DB prior to deploying this and manually update that record, since it appears to be an anomaly:

d = DOI.objects.get(uuid='6a8e809b-c522-4029-a6b0-e28348a6bcf1')
d.cmr_dates
'[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": false}]'
d.cmr_dates='[{\\"RangeDateTimes\\": [{\\"EndingDateTime\\": \\"1998-09-20T17:46:43.000Z\\", \\"BeginningDateTime\\": \\"1998-08-04T16:43:12.000Z\\"}], \\"EndsAtPresentFlag\\": False}]'
d.save()

This confuses me a bit. What is this encoding? false seems like a fine JSON boolean, but then all the keys are double-escaped. Any idea of how we got here?

@edkeeble

edkeeble commented Jul 6, 2023

Copy link
Copy Markdown
Contributor Author

Any idea of how we got here?

The double escaping is the result of the issue around the model form passing a string to json.dumps (#389). I made a mistake in thinking I had to correct that false value, though. What I actually need to do is remove the escape characters prior to attempting to load the json string in the migration. That allows this value to convert properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants