Duplicate Persons in cdp-seattle instance #222

Open
BrianL3 opened this issue Nov 16, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@BrianL3

BrianL3 commented Nov 16, 2022

Describe the Bug

There are duplicate documents in the Person collection in the cdp-seattle instance. They have the same name and the same legistar person id. I think this may result in unexpected behavior in the front end.

Expected Behavior

There should be one and only one record of legistar persons per... person.

Reproduction

Check out records for Tammy Morales:
ID 87638bc6-fd68-4f1f-8449-6137ac242a8
ID 4bf88f27-9933-4c57-8819-342111a6a68c

Dan Strauss:
0996b93d-fdbb-41d4-b488-1dc85ac37366
2ff3c312-04cd-4e84-b46e-5989f239259

Mosqueda and Lewis also have dupes.

BrianL3 added the bug label on Nov 16, 2022
@dphoria
Contributor

dphoria commented Nov 16, 2022

I think there is a possibility that this ends up being an issue for cdp-scrapers. There is a mechanism in place there to handle situations like this, i.e. erratic, duplicate, or otherwise inconsistent information entered by the municipalities/clerks.

So, looks like different IDs were entered for the same person. I haven't investigated this at all, but this is my guess for the time being.

@dphoria
Contributor

dphoria commented Nov 17, 2022

Wait, I'm confused by this issue. In the CDP Seattle instance DB, I see only one Tammy Morales. For example, if I follow the quickstart example and query the Person collection, there is just one Tammy Morales.

from cdp_backend.database import models as db_models
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the cdp-seattle-21723dcf project with anonymous credentials.
fireo.connection(client=Client(
    project="cdp-seattle-21723dcf",
    credentials=AnonymousCredentials()
))

# Fetch every document in the Person collection.
ppl = list(db_models.Person.collection.fetch())

# Print every person whose name contains "tammy" (case-insensitive).
for p in ppl:
    if 'tammy' in p.name.lower():
        print(p.name, p.external_source_id, p.id, p.key)

# Output:
# Tammy J. Morales 662 d1dbed7401e6 person/d1dbed7401e6

If this issue is saying that the Legistar endpoint for Seattle is returning multiple records for Tammy Morales (and others), that is known, unfortunately. We have a system in place on the scrapers side to at least help us deal with those situations. It's definitely possible that it's not working 100%, but if so, shouldn't I be able to see multiple Tammy Morales records when I execute the code above?

I think I'm probably not looking at the same "database" that Brian used to get those IDs...
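
To check for duplicates across the whole collection rather than one name at a time, the records could be grouped by name and external source id. The following is only a minimal sketch: `PersonRecord` and `find_duplicates` are hypothetical names introduced here, and the sample rows are made up for illustration, not taken from the instance.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class PersonRecord(NamedTuple):
    # Hypothetical stand-in for the fields printed by the query above.
    id: str
    name: str
    external_source_id: str


def find_duplicates(
    people: Iterable[PersonRecord],
) -> Dict[Tuple[str, str], List[str]]:
    """Group document ids by (normalized name, external_source_id).

    Only groups that contain more than one document id are returned.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for p in people:
        key = (p.name.strip().lower(), p.external_source_id)
        groups[key].append(p.id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample data for illustration only.
sample = [
    PersonRecord("aaa111", "Example Person", "1"),
    PersonRecord("bbb222", "example person ", "1"),
    PersonRecord("ccc333", "Another Person", "2"),
]
duplicates = find_duplicates(sample)
print(duplicates)
```

Since the function only reads `id`, `name`, and `external_source_id`, the objects returned by `db_models.Person.collection.fetch()` in the query above could presumably be passed in directly. An empty result would mean no two documents share both a name and an external source id.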

@evamaxfield
Member

Can also confirm from the database directly that there are not two people of the same name.

@evamaxfield
Member

Where did you get those IDs, btw? The IDs in the Firestore database are much, much shorter.
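
The difference in ID shape can be shown with a quick format check. This is a rough sketch, not part of cdp-backend: `classify_id` and both patterns are assumptions made here, and the 12-character hex pattern is inferred only from the single Firestore ID printed earlier in this thread (`d1dbed7401e6`).

```python
import re

# Assumed patterns: a canonical 8-4-4-4-12 hex UUID layout, and a 12-character
# hex string like the one Firestore ID printed earlier in this thread.
UUID_LIKE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)
SHORT_HEX = re.compile(r"^[0-9a-f]{12}$")


def classify_id(value: str) -> str:
    """Return 'uuid-like', 'short-hex', or 'other' for an ID string."""
    v = value.strip().lower()
    if UUID_LIKE.match(v):
        return "uuid-like"
    if SHORT_HEX.match(v):
        return "short-hex"
    return "other"


# IDs copied from this thread: two from the report, one from the query output.
ids = [
    "4bf88f27-9933-4c57-8819-342111a6a68c",
    "0996b93d-fdbb-41d4-b488-1dc85ac37366",
    "d1dbed7401e6",
]
labels = {i: classify_id(i) for i in ids}
for i, label in labels.items():
    print(i, "->", label)
```

A check like this only shows that the reported IDs and the Firestore document ID have different shapes; it says nothing about where the longer IDs came from.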

@BrianL3
Author

BrianL3 commented Nov 18, 2022 via email

@dphoria
Contributor

dphoria commented Nov 18, 2022

I think I'm gonna pull some events on the scraper and check out the ingestion model Persons. Will report back.

@evamaxfield
Member

I don't think it is the scraper. And I think you were checking staging (we should probably refresh the data on staging, since it's a bit behind, I think).

I think it is just a minutes item and/or event minutes item ref that is broken somewhere. I will look into it this weekend, no worries.
