Scraping San Francisco's Legistar
San Francisco’s Board of Supervisors uses Legistar to publish information about legislation and board meetings. I scraped this into a local SQLite database and threw it in Datasette. The repo is here, but it’s pretty messy. It includes the resulting SQLite database.
It was pretty easy to scrape this. I initially wrote my own scraper, but that turned out to be unnecessary, because Python Legistar Scraper already exists and works wonderfully. Unfortunately it’s largely undocumented, but it was easy to work out how it is supposed to be used from its tests (see scrape.py in the repo):
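```python
from json import dumps

# Import path assumed from the python-legistar-scraper package layout.
from legistar.bills import LegistarBillScraper


def scrape_bills():
    s = LegistarBillScraper()
    s.BASE_URL = 'https://sfgov.legistar.com/'
    s.LEGISLATION_URL = 'https://sfgov.legistar.com/Legislation.aspx'
    # Each search page is printed as one JSON array of result rows.
    for page in s.searchLegislation():
        print(dumps(list(s.parseSearchResults(page))))
```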
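The function above prints one JSON array per page of results, while a line-oriented loader wants one record per line. Here is a minimal sketch of that flattening step. The raw header names in `RENAMES` and the helpers `normalize_key` and `pages_to_ndjson` are assumptions made for illustration, and the real cleanup script may do something quite different.

```python
import json
import re

# Hypothetical renames: the raw header names are an assumption, chosen so the
# output matches the column names used later (FileNumber, FinalAction).
RENAMES = {"File #": "FileNumber", "Final Action": "FinalAction"}


def normalize_key(key):
    """Map a scraped column header to a SQLite-friendly column name."""
    key = RENAMES.get(key, key)
    # Drop anything that is not a letter, digit, or underscore.
    return re.sub(r"\W", "", key)


def pages_to_ndjson(lines):
    """Yield one JSON object string per record from lines holding JSON arrays."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between pages
        for record in json.loads(line):
            cleaned = {normalize_key(k): v for k, v in record.items()}
            yield json.dumps(cleaned)
```

Feeding it the scraper's output and writing each yielded string on its own line gives a newline-delimited JSON file.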
I cleaned the data a little bit (clean.py) and used sqlite-utils to shove it into a SQLite database:

```sh
sqlite-utils insert --pk FileNumber --nl \
  --alter --analyze legistar.db legistar cleaned.json
sqlite-utils create-index legistar.db legistar Type
sqlite-utils create-index legistar.db legistar Status
sqlite-utils create-index legistar.db legistar Introduced
sqlite-utils create-index legistar.db legistar FinalAction
sqlite-utils enable-fts legistar.db legistar Title
```
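For anyone curious what those commands roughly amount to, here is a sketch using only Python's standard `sqlite3` module. It is not how sqlite-utils is implemented: the helper `load_records` is hypothetical, it leaves out the full-text index and the `ANALYZE` step, and it uses `INSERT OR REPLACE` (an assumption made here so the sketch can be re-run without primary-key errors).

```python
import json
import sqlite3

# Columns that get an index, mirroring the create-index commands.
INDEXED = ["Type", "Status", "Introduced", "FinalAction"]


def load_records(conn, ndjson_lines, table="legistar", pk="FileNumber"):
    """Insert newline-delimited JSON records, adding columns as they appear."""
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ("{pk}" TEXT PRIMARY KEY)')
    known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for line in ndjson_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Rough equivalent of --alter: add any column not seen before.
        for col in record:
            if col not in known:
                conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}" TEXT')
                known.add(col)
        cols = ", ".join(f'"{c}"' for c in record)
        marks = ", ".join("?" for _ in record)
        # Nested values are stored as JSON text; scalars are stored as-is.
        values = [
            json.dumps(v) if isinstance(v, (dict, list)) else v
            for v in record.values()
        ]
        conn.execute(
            f'INSERT OR REPLACE INTO "{table}" ({cols}) VALUES ({marks})', values
        )
    for col in INDEXED:
        if col in known:
            conn.execute(
                f'CREATE INDEX IF NOT EXISTS "idx_{table}_{col}" '
                f'ON "{table}" ("{col}")'
            )
    conn.commit()
```

Calling `load_records(sqlite3.connect("legistar.db"), open("cleaned.json"))` would load the file, though in practice the sqlite-utils commands are shorter and handle more edge cases.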
Then I was able to run it in Datasette:

```sh
datasette --template-dir templates/ legistar.db
```
With the help of my datasette-vega-dashboards, I also added a quick chart to visualize the flow of topics over time.
Anyway, this was all fun but I’m not quite sure where to go from here. Some additional ideas:
- scrape more things: full “matters” instead of just the summary.
- scrape all the “attachments”, convert them from PDF to text, and make them searchable. This would be useful, since there’s often a lot of good content in the attachments (like public comment) which isn’t represented in the summary.
- scrape votes and calendar data.
- do something interesting with the data? Not really sure what. It was already pretty fun poking around the data, but some more structured investigation could be fun.
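As one possible starting point for a more structured investigation, here is a small sketch that counts items per year and type. It assumes the `Introduced` column holds ISO-style `YYYY-MM-DD` text, which may not match how the dates are actually stored. The function name `bills_per_year` is made up for this example.

```python
import sqlite3

# Assumes Introduced is stored as 'YYYY-MM-DD' text, so the first four
# characters are the year. Adjust the expression if the format differs.
QUERY = """
SELECT substr(Introduced, 1, 4) AS year, Type, COUNT(*) AS n
FROM legistar
WHERE Introduced IS NOT NULL
GROUP BY year, Type
ORDER BY year, n DESC, Type
"""


def bills_per_year(conn):
    """Return (year, type, count) rows, busiest types first within each year."""
    return conn.execute(QUERY).fetchall()
```

Running `bills_per_year(sqlite3.connect("legistar.db"))` returns a list of tuples that could be pasted into a spreadsheet or charted.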