Scraping San Francisco's Legistar

Tags: data

San Francisco’s Board of Supervisors uses Legistar to publish information about legislation and board meetings. I scraped this into a local SQLite database and loaded it into Datasette. The repo is here, but it’s pretty messy. It also includes the resulting SQLite database.

It was pretty easy to scrape this. I initially wrote my own scraper, but that turned out to be silly, because Python Legistar Scraper already exists and works wonderfully. Unfortunately it’s largely undocumented, but it was easy to reverse engineer how it’s supposed to work from its tests (see scrape.py in the repo):

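from json import dumps

from legistar.bills import LegistarBillScraper


def scrape_bills():
    s = LegistarBillScraper()
    s.BASE_URL = 'https://sfgov.legistar.com/'
    s.LEGISLATION_URL = 'https://sfgov.legistar.com/Legislation.aspx'
    # Print one JSON array of parsed results per search-results page.
    for page in s.searchLegislation():
        print(dumps(list(s.parseSearchResults(page))))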

I cleaned the data a little bit (clean.py) and used sqlite-utils to shove it into a SQLite database (make_sqlite.sh).

sqlite-utils insert --pk FileNumber --nl \
    --alter --analyze legistar.db legistar cleaned.json
sqlite-utils create-index legistar.db legistar Type
sqlite-utils create-index legistar.db legistar Status
sqlite-utils create-index legistar.db legistar Introduced
sqlite-utils create-index legistar.db legistar FinalAction
sqlite-utils enable-fts legistar.db legistar Title
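
For anyone curious what those commands roughly amount to, here is a minimal sketch using only Python's standard library. It is not what make_sqlite.sh does internally: the sample records, the fixed column list, and the `load_records` helper are all made up for illustration, and it skips `--alter`, `--analyze`, and the full-text index. It also uses `INSERT OR REPLACE` so reruns don't fail on duplicate keys, which is a choice of mine rather than the behavior of a plain `sqlite-utils insert`.

```python
import json
import sqlite3

# Hypothetical sample records; the real cleaned.json may have other fields.
SAMPLE_NDJSON = """\
{"FileNumber": "220001", "Type": "Resolution", "Status": "Passed", "Introduced": "2022-01-04", "FinalAction": "2022-01-21", "Title": "Homeless shelter funding"}
{"FileNumber": "220002", "Type": "Hearing", "Status": "Filed", "Introduced": "2022-01-11", "FinalAction": null, "Title": "Transit budget update"}
"""

# Column names taken from the sqlite-utils commands above.
COLUMNS = ["FileNumber", "Type", "Status", "Introduced", "FinalAction", "Title"]
INDEXED = ["Type", "Status", "Introduced", "FinalAction"]


def load_records(conn, ndjson_text):
    """Create the legistar table, insert one row per JSON line, add indexes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS legistar ("
        "FileNumber TEXT PRIMARY KEY, Type TEXT, Status TEXT, "
        "Introduced TEXT, FinalAction TEXT, Title TEXT)"
    )
    # Skip blank lines, then parse each remaining line as one JSON object.
    lines = filter(str.strip, ndjson_text.splitlines())
    rows = [
        tuple(record.get(col) for col in COLUMNS)
        for record in map(json.loads, lines)
    ]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT OR REPLACE INTO legistar VALUES ({placeholders})", rows
    )
    for col in INDEXED:
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_legistar_{col} ON legistar ({col})"
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_records(conn, SAMPLE_NDJSON)
```

In practice sqlite-utils is the nicer tool here, since it infers the columns from the JSON instead of needing them spelled out.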

Then I was able to run it in Datasette:

datasette --template-dir templates/ legistar.db

With the help of my datasette-vega-dashboards, I also added a quick visualization of how topics flow over time:

Caption: Homelessness is a common topic in Board of Supervisors meetings. This shows how many “matters” (Resolutions, Hearings, etc.) mentioned the word “homeless” (unnormalized).
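
The kind of count behind a chart like this can be sketched with a plain SQL query. This is an illustration, not necessarily the query the dashboard runs: it assumes `Introduced` is stored as an ISO `YYYY-MM-DD` string, groups by year, and uses a simple `LIKE` match on `Title` rather than the full-text index. The `matters_per_year` helper and the sample rows are hypothetical.

```python
import sqlite3


def matters_per_year(conn, term):
    """Count rows whose Title contains `term`, grouped by introduction year."""
    query = """
        SELECT substr(Introduced, 1, 4) AS year, COUNT(*) AS matters
        FROM legistar
        WHERE Title LIKE '%' || ? || '%'
        GROUP BY year
        ORDER BY year
    """
    return conn.execute(query, (term,)).fetchall()


# Hypothetical rows for demonstration only, not real Legistar data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE legistar "
    "(FileNumber TEXT PRIMARY KEY, Introduced TEXT, Title TEXT)"
)
conn.executemany(
    "INSERT INTO legistar VALUES (?, ?, ?)",
    [
        ("1", "2020-03-01", "Hearing on homeless services"),
        ("2", "2020-06-15", "Street resurfacing contract"),
        ("3", "2021-02-09", "Homelessness response funding"),
        ("4", "2021-09-30", "Homeless shelter expansion"),
    ],
)
counts = matters_per_year(conn, "homeless")
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so this also picks up "Homelessness". Like the chart, the counts are unnormalized: they don't account for how many matters were introduced overall in each year.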

Anyway, this was all fun but I’m not quite sure where to go from here. Some additional ideas:

  • scrape more things: full “matters” instead of just the summary.
  • scrape all the “attachments”, convert them from PDF to text, and make them searchable. This would be useful, since there’s often a lot of good content in the attachments (like public comment) which isn’t represented in the summary.
  • scrape votes and calendar data.
  • do something interesting with the data? Not really sure what. It was already pretty fun poking around the data, but some more structured investigation could be fun.
Posted on 2022-03-29