...
- A source is declared in a config file that’s committed to the repo. This means anybody can propose a source by submitting a PR and debating its validity in a github issue.
- Sources could include W3C Respec docs, IETF RFCs, Aries RFCs, DIDComm specs hosted at DIF, etc. Corporate websites wouldn’t work because A) they’re too partisan; B) they’d require random, browser-style web crawling, which is too hard to automate well.
- Crawler pulls docs and scans them, looking for regexes that allow it to isolate term declarations, their associated definitions, and examples that demonstrate their usage.
- Output from crawler is a set of candidate terms that must be either admitted to a pipeline, or rejected, by human judgment. Candidates that are already in the corpus are ignored, so this just helps us keep up to date with evolving term usage in our industry.
Content Templates: Archived