Some recently updated images are missing license_url in the meta_data field #4318

Open
krysal opened this issue May 13, 2024 · 0 comments
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

Member

krysal commented May 13, 2024

Description

On 2024-05-08 UTC, the batched_update DAG was triggered (see footnote 1) to fill license_url in the meta_data field with its corresponding value for rows WHERE license = 'by' AND license_version = '2.0'. It reported a successful finish on 2024-05-09 at 17:00:18 UTC, having updated 746,571 records. However, a run of the add_license_url DAG triggered on 2024-05-10 reported the same number of rows still missing license_url, which suggests that some workflows are either not filling this field or are overwriting it.

Flickr is confirmed to be among the rows missing this value.

SELECT source, provider, created_on, updated_on FROM image 
 WHERE license = 'by' AND license_version = '2.0' AND meta_data->>'license_url' IS NULL LIMIT 2;

+--------+----------+-------------------------------+-------------------------------+
| source | provider | created_on                    | updated_on                    |
|--------+----------+-------------------------------+-------------------------------|
| flickr | flickr   | 2020-04-28 07:20:32.183578+00 | 2024-05-12 03:13:46.696867+00 |
| flickr | flickr   | 2020-04-28 07:08:30.821693+00 | 2024-05-12 03:13:46.696867+00 |
+--------+----------+-------------------------------+-------------------------------+

Whether other providers are affected is yet to be confirmed. The Flickr DAG is known to have been running on those days, as were Europeana, the Finnish Museum, Wikimedia Commons, and the Metropolitan Museum.
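
One way to check which providers are affected might be to group the rows that are still missing the value by source and provider. The query below is only a sketch against the same image table and columns used in the query above. The cutoff timestamp is the batched_update completion time quoted in the description, and the column aliases (missing_rows, updated_after_batch, last_updated) are made up for illustration. It also assumes that a later updated_on hints at a workflow rewriting the row, which has not been verified.

```sql
-- Count rows still missing license_url per source/provider.
-- updated_after_batch counts rows touched after the batched_update run
-- reported success (2024-05-09 17:00:18 UTC); treating that as a sign of
-- overwriting is an assumption, not a confirmed diagnosis.
SELECT source,
       provider,
       COUNT(*) AS missing_rows,
       COUNT(*) FILTER (
           WHERE updated_on > TIMESTAMPTZ '2024-05-09 17:00:18+00'
       ) AS updated_after_batch,
       MAX(updated_on) AS last_updated
  FROM image
 WHERE license = 'by'
   AND license_version = '2.0'
   AND meta_data->>'license_url' IS NULL
 GROUP BY source, provider
 ORDER BY missing_rows DESC;
```

On a table of this size the full aggregate may be slow, so it is probably best run against a replica or with a narrower filter first.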


Screenshot of DAG reports on Thursday, May 9th. Time is in VET.

Additional context

Discovered while working on #3885.

Footnotes

  1. Link only available to maintainers.

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels May 13, 2024
@krysal krysal closed this as completed May 29, 2024
@krysal krysal reopened this May 29, 2024
@krysal krysal self-assigned this May 29, 2024
Projects
Status: 📅 To Do