The projects on my site are automatically scraped and formatted at publish time using the scripts in this directory. Read more about my reasoning below, or skip to the directory structure.
Gatsby's source and transformer plugins are powerful, and I used them in the initial development of this site. I eventually decided that separating my collection process would be good for flexibility, control, and offline work.
GraphQL's filters and transforms are powerful, and Gatsby's APIs add more options for how data is fetched, cached, and transformed. However, complicated or non-standard data transforms and sanitization are much easier outside of Gatsby's ecosystem. For instance, the API starts to feel clunky for one-off treatment of specific content nodes.
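To illustrate, a one-off fix to a single content node ends up wedged into `gatsby-node.js` along these lines. This is a hypothetical sketch: the node type, ID, and replacement value are invented for illustration.

```js
// gatsby-node.js: a hypothetical one-off fix squeezed into Gatsby's
// onCreateNode API. The node type, ID, and field value are made up;
// real IDs are assigned by whichever source plugin created the node.
exports.onCreateNode = ({ node, actions }) => {
  const { createNodeField } = actions;

  // Patch a single known-bad node among hundreds of others.
  if (node.internal.type === "FeedItem" && node.id === "feed-item-123") {
    createNodeField({ node, name: "title", value: "Corrected Title" });
  }
};
```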
I've had a good experience with Gatsby, but I may decide to migrate my site to another platform or format someday. Keeping my data entirely separate from the site's framework makes migrating my data as easy as copy/pasting this directory. It's just a few JS files!
Gatsby stores requests made through its source plugins in the `.cache` directory by default. The `.cache` directory is deleted after:

- `gatsby clean` is called.
- `package.json` changes, for example when a dependency is updated or added.
- `gatsby-config.js` changes, for example when a plugin is added or modified.
- `gatsby-node.js` changes, for example when a new Node API is invoked.
I found I was frequently triggering `.cache` wipes during development. At best, this meant I was pinging APIs and Atom feeds more than necessary. At worst, it made working offline with project data impossible.
Here's how the scraper is organized for now:
- `scrape-projects.js`: The megafile that replaces Gatsby's source plugins. It pulls project data from all online sources and saves it into `_generated/` (sketched below).
- `_generated/`: Files generated by `scrape-projects.js`. DO NOT EDIT THESE FILES MANUALLY! They will be overwritten.
  - `scraped-projects-raw.json`: Not quite the raw response, but pretty close. Contains all the data that I may decide to use someday, but haven't yet. Organized by `type` in a nested object.
  - `scraped-projects-formatted.json`: Standardized into a smaller format that can be smashed together with `curation/` data. Flattened into an array with `type` annotations on each node, as well as unique, unchanging project IDs (`UID`).
- `curation/`: Where all custom curation and processing goes, e.g. tagging content. Projects are modified based on their generated UID.
  - `tweaks.js`: Mainly for one-off changes, e.g. fixing formatting errors from immutable online sources. Can also be used to apply changes to groups of files (sketched below).
  - `tags.js`: TODO: figure out where `tags`, `lastTagged`, and `coolness` data are going to live.
- `sources/`: Offline data files and collections to complement the online data cached in `_generated/`.
  - `standalone-projects.json`: TODO: move these over from the `src/data` directory.
- `tools/`: Custom tools to help classify, organize, or edit project nodes without opening a text editor. These are only built for data that is too difficult to keep updated or standardized manually. TODO: hook them up to a Node server so they edit the JSON files directly.
  - `tagger.html`: Finds untagged or incorrectly tagged projects, as well as projects that were last tagged before a new tag type was added. Provides an interface to preview and re-tag each project.
  - `cool-sort.html`: TODO: sort or insert nodes based on their "coolness".
- `test/`: Quick test files to ensure data is downloaded without any dropped nodes, UIDs are unique, etc. (sketched below).
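To make the pipeline concrete, here is a minimal sketch of the overall shape of `scrape-projects.js`. The source list, the stub fetcher, and the UID scheme are all assumptions for illustration, not the actual implementation:

```js
// scrape-projects.js: simplified sketch of the overall shape. The
// sources and UID scheme here are hypothetical stand-ins.
const fs = require("fs");
const path = require("path");

const OUT_DIR = path.join(__dirname, "_generated");

// Stand-in fetcher; the real file talks to APIs and Atom feeds.
async function fetchSource(type) {
  return []; // an array of project objects for this source
}

async function scrape() {
  // Keep everything we got, keyed by source type (the "raw" file).
  const raw = {
    github: await fetchSource("github"),
    feeds: await fetchSource("feeds"),
  };

  // Flatten into one array, annotating each node with its type and a
  // stable UID so curation edits can target it across re-scrapes.
  const formatted = Object.entries(raw).flatMap(([type, nodes]) =>
    nodes.map((node, i) => ({ ...node, type, UID: `${type}-${node.id ?? i}` }))
  );

  fs.mkdirSync(OUT_DIR, { recursive: true });
  fs.writeFileSync(
    path.join(OUT_DIR, "scraped-projects-raw.json"),
    JSON.stringify(raw, null, 2)
  );
  fs.writeFileSync(
    path.join(OUT_DIR, "scraped-projects-formatted.json"),
    JSON.stringify(formatted, null, 2)
  );
}

scrape();
```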
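A hedged sketch of what `curation/tweaks.js` could look like, with both the UID and the fix invented for illustration:

```js
// curation/tweaks.js: hypothetical sketch. Each entry is keyed by a
// generated UID and returns a patched copy of that node.
module.exports = {
  // Fix a typo in a feed title that can't be edited at the source.
  "feeds-2017-old-post": (node) => ({
    ...node,
    title: node.title.replace("teh", "the"),
  }),
};
```

The formatting step can then look up each node's UID in this map and apply the matching tweak before the data reaches the site.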
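And a UID-uniqueness check of the kind `test/` holds could be as small as this sketch, assuming the formatted file has already been generated:

```js
// test/uids-unique.js: sketch of a duplicate-UID check over the
// generated file. Exits non-zero if any UID appears twice.
const projects = require("../_generated/scraped-projects-formatted.json");

const seen = new Set();
const dupes = projects.filter(({ UID }) => {
  if (seen.has(UID)) return true;
  seen.add(UID);
  return false;
});

if (dupes.length > 0) {
  console.error("Duplicate UIDs:", dupes.map((p) => p.UID));
  process.exit(1);
}
console.log(`All ${projects.length} UIDs are unique.`);
```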