
feature/get-updated-legistar-matter #94

Closed
wants to merge 11 commits into from

Conversation


@dphoria dphoria commented Apr 23, 2022

Link to Relevant Issue

This pull request is a part of #59 .

Description of Changes

Adds LegistarScraper.get_updated_model(matter), where the Legistar API is queried for the given matter and an updated ingestion model is returned.

from cdp_backend.pipeline.ingestion_models import Matter
from cdp_scrapers.instances.seattle import SeattleScraper

old_matter = Matter(name="CB 120305", matter_type=None, title=None)
s = SeattleScraper()
s.get_updated_model(old_matter)
# Matter(name='CB 120305', matter_type='Council Bill (CB)', title='CB 120305', result_status='In Progress', sponsors=[Person(name='Dan Strauss', is_active=True, router_string=None, email='Dan.Strauss@seattle.gov', phone='206-684-8806', website='http://www.seattle.gov/council/strauss', picture_uri=None, seat=None, external_source_id='664')], external_source_id='12846')

s.get_updated_model(Matter(name=None, matter_type=None, title=None, external_source_id=1))
# None

Eventually, this will provide an API similar to get_{municipality}_events(), e.g.

from cdp_scrapers.instances import get_seattle_updated_model
get_seattle_updated_model(old_matter)
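For illustration, that wrapper would presumably be a thin module-level binding, as in the sketch below (hypothetical stand-ins; get_seattle_updated_model does not exist yet in this PR):

```python
from typing import Any, Optional


class SeattleScraper:
    """Minimal stand-in for cdp_scrapers.instances.seattle.SeattleScraper."""

    def get_updated_model(self, model: Any) -> Optional[Any]:
        # The real method queries the Legistar API; this stub just echoes.
        return model


def get_seattle_updated_model(model: Any) -> Optional[Any]:
    """Module-level convenience wrapper, mirroring get_seattle_events()."""
    return SeattleScraper().get_updated_model(model)
```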

@dphoria dphoria added the enhancement New feature or request label Apr 23, 2022

dphoria commented Apr 23, 2022

/test-scraper get_king_county_events --from_dt=2022-04-04T00:00:00 --to_dt=2022-04-09T00:00:00

@github-actions

Begin /test-scraper. This may take several minutes.

@github-actions

/test-scraper is finished.
Please go to the Actions page then click on the latest workflow titled "feature/get-updated-legistar-matter".
The results are saved in scraper-test-results under Artifacts on the workflow summary page.


dphoria commented Apr 23, 2022

/test-scraper get_king_county_events --from_dt=2022-04-04T00:00:00 --to_dt=2022-04-09T00:00:00

Just to prove nothing has broken: this PR yields no changes to scraped data. I have run and passed cdp_scrapers/test/scraper_tests.py.


dphoria commented Apr 23, 2022

Before more work is committed, I wanted to see if something like get_{municipality}_updated_model() would be a good API from the cdp-backend perspective, @JacksonMaxfield @isaacna .

So this PR contains just get_updated_model(). In fact, I will break up future PRs for this issue to keep them manageable.

@dphoria dphoria marked this pull request as ready for review April 23, 2022 22:58

dphoria commented Apr 23, 2022

/test-scraper is finished. Please go to the Actions page then click on the latest workflow titled "feature/get-updated-legistar-matter". The results are saved in scraper-test-results under Artifacts on the workflow summary page.

https://github.com/CouncilDataProject/cdp-scrapers/actions/runs/2213727492


@evamaxfield evamaxfield left a comment


Okay so I think I get what is going on and in general I like it.

A couple of comments / questions:

  1. I saw a bit of the code that said like get_matter vs find_matter -- just to fully understand, the get matter takes an id where find matter takes the name?
  2. There is a lot of fancy caching that is happening that could probably be abstracted away by using lru_cache; see my use of it in cdp-data here.
  3. we should hook up seattle-staging to this repo so that every time a PR runs on this repo, it runs the pipeline for seattle. I am thinking of a ci job that wipes seattle staging DB and FS, then runs the pipeline with this scraper, then comments with link(s) to the stored event. Would make it easier to "trust" this. (not that I don't trust you hahahah)
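As a sketch of point 2, the suggested lru_cache abstraction might look like this (get_matter and its return shape here are hypothetical stand-ins, not the scraper's actual code):

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def get_matter(client: str, matter_id: int) -> dict:
    # The real scraper would hit the Legistar web API here; repeated
    # (client, matter_id) calls are then served from the cache instead.
    return {"client": client, "MatterId": matter_id}


get_matter("seattle", 12846)  # computed (cache miss)
get_matter("seattle", 12846)  # served from cache (hit)
assert get_matter.cache_info().hits == 1
```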

Comment on lines +1596 to +1597
if isinstance(model, Matter):
return self.get_updated_matter(model)
Member

Because I assume we are going to have a lot of these, it may be better to have a LUT.

MODEL_UPDATER_LUT = {
    Matter: self.get_updated_matter,
    Person: self.get_updated_person,
}

return MODEL_UPDATER_LUT[type(model)](model)
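A self-contained sketch of this lookup-table dispatch (Matter and Person below are minimal stand-ins for cdp_backend's ingestion models, and the Scraper class is illustrative):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Type


@dataclass
class Matter:
    name: str


@dataclass
class Person:
    name: str


class Scraper:
    def get_updated_matter(self, matter: Matter) -> Matter:
        return matter  # real code would re-query the Legistar API

    def get_updated_person(self, person: Person) -> Person:
        return person

    def get_updated_model(self, model: Any) -> Optional[Any]:
        # Map each model type to its updater, then dispatch on type(model).
        updater_lut: Dict[Type, Callable[[Any], Any]] = {
            Matter: self.get_updated_matter,
            Person: self.get_updated_person,
        }
        updater = updater_lut.get(type(model))
        return updater(model) if updater else None


assert isinstance(Scraper().get_updated_model(Matter(name="CB 120305")), Matter)
```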

LEGISTAR_MATTER_SPONSORS = "MatterSponsorInfo"
LEGISTAR_SPONSOR_PERSON = "SponsorPersonInfo"
Member

Dropped this?

Collaborator Author

Yes. Before, we never queried the full Legistar Matter. We just used bits of the Legistar EventItem, like EventItemMatterName, to create the ingestion model Matter.
Now that there is a need to query for a Legistar Matter, I implemented get_legistar_matter() so I can reuse it from both the main get_legistar_events_for_timespan() and this new get_updated_model()...

So, before in get_legistar_events_for_timespan()

legistar_event_item = {
    ...
    "SponsorPersonInfo" : [
        {
            "PersonFullName" : ...
        },
        ...
    ],
    ...
}

Now,

legistar_event_item = {
    ...
    "EventItemMatterInfo" : {
        "MatterFile" : ...,
        "MatterSponsorInfo" : [
            {
                "PersonFullName" : ...
            },
            ...
        ],
        ...
    },
    ...
}
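In other words, sponsor info now hangs off the matter dict inside the event item. A minimal sketch of reading it under the new shape (field names as in the snippets above; the PR's actual parsing code may differ):

```python
# Toy event item in the new nested shape described above.
legistar_event_item = {
    "EventItemMatterInfo": {
        "MatterFile": "CB 120305",
        "MatterSponsorInfo": [
            {"PersonFullName": "Dan Strauss"},
        ],
    },
}

# Sponsors are now reached through the nested matter dict.
matter = legistar_event_item["EventItemMatterInfo"]
sponsor_names = [s["PersonFullName"] for s in matter["MatterSponsorInfo"]]
assert sponsor_names == ["Dan Strauss"]
```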

@evamaxfield
Member

Other general comment is that I see you are using the ingestion model as the "update" model but since this is the "update API" it may be better to update from the database models and not the ingestion models.

@@ -99,6 +99,8 @@

LEGISTAR_EV_ITEMS = "EventItems"
LEGISTAR_EV_ATTACHMENTS = "EventItemMatterAttachments"
LEGISTAR_EV_MATTER_ID = "EventItemMatterId"
@isaacna isaacna May 5, 2022

Just for clarification, what's the difference between LEGISTAR_EV_MATTER_ID and LEGISTAR_MATTER_EXT_ID? I'm not too familiar with Legistar's schema

Collaborator Author

Given the same Legistar Matter, the value will be the same.

LEGISTAR_EV_MATTER_ID = "EventItemMatterId" is a field in the Legistar Event Item. So, after querying for a Legistar Event Item, we can then query for the associated Matter using that field.

LEGISTAR_MATTER_EXT_ID = "MatterId" is a field in the Legistar Matter. After obtaining a Legistar Matter, we use that field to set Matter.external_source_id.
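A toy sketch of that two-step flow, using in-memory dicts in place of real Legistar API responses:

```python
# Fake stand-ins for Legistar API responses (illustrative only).
FAKE_EVENT_ITEM = {"EventItemMatterId": 12846}
FAKE_MATTERS = {12846: {"MatterId": 12846, "MatterName": "CB 120305"}}

# 1. The event item points at its matter via EventItemMatterId.
matter_id = FAKE_EVENT_ITEM["EventItemMatterId"]

# 2. Query the matter; its own MatterId field then becomes
#    Matter.external_source_id on the ingestion model.
legistar_matter = FAKE_MATTERS[matter_id]
external_source_id = str(legistar_matter["MatterId"])
assert external_source_id == "12846"
```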

matter: Dict[str, Any] = resp.json()

# Person JSON for this matter's sponsors
matter[LEGISTAR_MATTER_SPONSORS] = reduced_list(

I thought this was some built-in python method at first and was kinda confused lol. I think this list comprehension is a little awkward to read but I'm fine with it

Collaborator Author

That's a good point. I will change this. Readability is important for us.
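For context, reduced_list is a cdp_scrapers helper; roughly, it drops empty items and collapses an empty result to None. A sketch of that behavior (the actual implementation in cdp_scrapers may differ):

```python
from typing import Any, Iterable, List, Optional


def reduced_list(items: Iterable[Any], collapse: bool = True) -> Optional[List[Any]]:
    """Drop None items; optionally return None instead of an empty list."""
    filtered = [i for i in items if i is not None]
    if collapse and not filtered:
        return None
    return filtered


assert reduced_list([None, "Dan Strauss", None]) == ["Dan Strauss"]
assert reduced_list([None, None]) is None
```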


isaacna commented May 5, 2022

Eventually, will provide API similar to get_{municipality}_events(), e.g.

I'm personally a little more partial to something like get_events(municipality) instead of keeping separate methods for each instance. @JacksonMaxfield what do you think?


dphoria commented May 5, 2022

1. I saw a bit of the code that said like `get_matter` vs `find_matter` -- just to fully understand, the get matter takes an id where find matter takes the name?

Correct. And I don't expect the find-by-name code path to be used that much, but I wrote it just in case we don't have the Legistar MatterId for the requested Matter.

2. There is a lot of fancy caching that is happening that could probably be abstracted away by using `lru_cache` see my use of it in `cdp-data` [here](https://github.com/CouncilDataProject/cdp-data/blob/main/cdp_data/utils/db_utils.py#L48).

Thank you!

3. we should hook up seattle-staging to this repo so that every time a PR runs on this repo, it runs the pipeline for seattle. I am thinking of a ci job that wipes seattle staging DB and FS, then runs the pipeline with this scraper, then comments with link(s) to the stored event. Would make it easier to "trust" this. (not that I don't trust you hahahah)

No no, definitely do not trust me LOL 🤣


dphoria commented May 5, 2022

Other general comment is that I see you are using the ingestion model as the "update" model but since this is the "update API" it may be better to update from the database models and not the ingestion models.

Ah I like this.

@evamaxfield
Member

  1. we should hook up seattle-staging to this repo so that every time a PR runs on this repo, it runs the pipeline for seattle. I am thinking of a ci job that wipes seattle staging DB and FS, then runs the pipeline with this scraper, then comments with link(s) to the stored event. Would make it easier to "trust" this. (not that I don't trust you hahahah)

I am pretty busy but that may be a good project for someone who wants to learn devops stuff. It also ties in well with: CouncilDataProject/cdp-backend#166

Further, I just think we could use it for all scrapers. So we have the test-scraper command, but we should have a command that is like test-scraper --full or something?? Similarly, if we can hook it up to the cookiecutter bot so that whatever scraper we generate, we can preview the deployment on staging before we fully deploy it 👀

@evamaxfield
Member

I'm personally a little more partial to something like get_events(municipality) instead of keeping separate methods for each instance. @JacksonMaxfield what do you think?

Those functions are auto generated. If all CDP instances were backed by Legistar instances then that proposal would work, but since not all are, we shouldn't even try to guarantee that the API will be consistent across all of them. Hence why we generate those functions. If that doesn't make sense, happy to expand more. I am sort of thinking that makes sense but I could have talked myself into a circle.
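A toy illustration of the auto-generation idea (purely hypothetical code, not the repo's actual codegen): each get_{municipality}_events() is a thin generated binding to one scraper class, so signatures stay free to differ per instance.

```python
class SeattleScraper:
    """Stand-in scraper; the real one has its own get_events signature."""

    def get_events(self, **kwargs):
        return ("seattle", kwargs)


def make_get_events(scraper_cls):
    """Generate a module-level get_events bound to one scraper class."""
    def get_events(**kwargs):
        return scraper_cls().get_events(**kwargs)
    return get_events


# One generated function per municipality, rather than one shared
# get_events(municipality) whose signature all instances must share.
get_seattle_events = make_get_events(SeattleScraper)
assert get_seattle_events(from_dt="2022-04-04")[0] == "seattle"
```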

@evamaxfield
Member

I am finally getting around to checking on other PRs. I don't think I need to do anything on this one right? No update yet? (no pressure I am just behind and lost track of stuff)

@dphoria dphoria closed this Sep 17, 2022
@dphoria dphoria deleted the feature/get-model branch September 17, 2022 16:33