Collecting package.json files from GitHub

I recently got it into my head that I wanted to collect as many package.json files from the wild as possible. Well, maybe not “as many as possible”, but “as many as I can without losing interest”. The motivation was to explore dependency graphs on a larger-than-my-immediate-circles scale. The largest collection of package.json files around is most likely NPM's public registry (npmjs.com), and their database is easily replicated locally (after jumping through a few hoops). I will write about that process another time. This post is about finding package.json files on GitHub.

Firstly, I can't keep saying “package.json files”. It's just not sustainable. I don't think “package.jsons” or “package.json's” or “package manifests” are great alternatives, so I'll be calling them PJs (mostly) for the remainder of this post. If you're here at the second paragraph and you don't know what a PJ is, it's a simple file describing a Node.js/JavaScript project and its dependencies (more info is here). Because of The Way Things Are, the file is always named package.json.
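
For anyone in that camp, a bare-bones PJ might look something like this (the field names are standard; the values here are made up):

  {
    "name": "some-project",
    "version": "1.0.0",
    "description": "An example project",
    "main": "index.js",
    "scripts": {
      "test": "node test.js"
    },
    "dependencies": {
      "lodash": "^4.17.0"
    }
  }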

Next, let's define the goal: to download as many unique PJs as possible from GitHub so that maybe I can do something with them later. And given the goal, we should list what tools we have to work with (and their limitations).

GitHub search

GitHub allows for “advanced” searches, in which you can say “I'm looking specifically for files named package.json in any repository.” That's a great start. This is the result. If you refresh those search results, you'll probably see the number of results change dramatically (could be 6MM, 66MM, 666MM, anything really). That's because the search “took too long to finish” (GitHub's words), so they stop looking at some point and just show you what they've found.

Further, since we're searching code, GitHub has some constraints we can't get around. To name a few:

1. Only the default branch of the repo is searched (usually master)
2. Only files smaller than 384 KB are searched (most PJs will be smaller than this anyway)
3. Only repos that have <500k files are searched
4. You have to be authenticated with GitHub to do these searches across all public repositories

GitHub API

Since we don't want to do this manually, we will use GitHub's API to make the queries and process the results programmatically. GitHub's API has its own set of interesting quirks, and searching with the API is extra-quirky.

1. The API only allows a certain number of requests per time period — up to 5000 requests per hour when authenticated (60 per hour if unauthenticated; not even a consideration at this point). BUT the Search functionality of the API has different rate limiting: 30 requests per minute when authenticated. Each response from the API has x-ratelimit-remaining and x-ratelimit-reset headers that let you know how many requests remain in this time period and at what time the next period will begin (a sketch of handling these limits follows this list).
2. The API (for all operations) has abuse detection mechanisms in place that, when triggered, tell the requester to “wait a little while before trying again” in the form of a Retry-After header (usually 60 seconds). To avoid triggering these mechanisms, GitHub recommends waiting at least 1 second between API calls and respecting the Retry-After header in all cases.
3. The search operation (specifically when searching for code) has 3 parameters: q, sort, and order. The q param is just verbatim what one would type into GitHub.com's search bar. sort by default is “best match”, but can be set to indexed to sort by how recently GitHub indexed the file being returned. order is ascending/descending. Notably, order is only honored when sort is indexed.
4. The search operation paginates its results, and you can choose to get at most 100 results in a single page. But more importantly, the API will only return up to 1000 results total (by design). This is true of the normal GitHub search as well, of course (the normal search is hitting the same API).
5. In that vein, all the limitations of the normal GitHub search apply.
6. I'm not sure why, but it seems that any extra text included in the search term (when searching by filename) has to be an exact match. So “d filename:package.json” has 0 results, but “dependencies filename:package.json” has millions.
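
To make those quirks concrete, here's a minimal sketch of a single search request that respects both the rate-limit headers and the Retry-After header. It assumes Node 18+ (for the built-in fetch) and a token in a GITHUB_TOKEN environment variable; the searchCode helper is a name made up for this post, not part of any library:

  // Hypothetical helper: fetch one page of code-search results,
  // backing off when GitHub tells us to.
  async function searchCode (q, { sort, order, page = 1 } = {}) {
    const params = new URLSearchParams({ q, page, per_page: 100 })
    if (sort) params.set('sort', sort)
    if (order) params.set('order', order)

    const res = await fetch(`https://api.github.com/search/code?${params}`, {
      headers: {
        authorization: `token ${process.env.GITHUB_TOKEN}`,
        accept: 'application/vnd.github.v3+json'
      }
    })

    // Abuse detection kicked in: wait the requested number of seconds, then retry.
    if (res.status === 403 && res.headers.get('retry-after')) {
      const waitMs = Number(res.headers.get('retry-after')) * 1000
      await new Promise(resolve => setTimeout(resolve, waitMs))
      return searchCode(q, { sort, order, page })
    }

    // Out of search requests for this window: sleep until x-ratelimit-reset.
    if (Number(res.headers.get('x-ratelimit-remaining')) === 0) {
      const resetMs = Number(res.headers.get('x-ratelimit-reset')) * 1000 - Date.now()
      await new Promise(resolve => setTimeout(resolve, Math.max(resetMs, 0)))
    }

    return res.json()
  }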

So this maximum of 1000 results is a real bummer. I definitely wanted more than 1000 PJs, and we know millions are public on GitHub. To get more results, we can mix and match the available parameters and hopefully move that 1000-size result “window” over different portions of the many millions of actual results. In searches where all the results are available, sort and order would mostly be meaningless to us, but since we are only exposed to a subset of the results, changing the sort/order of the full set may increase overall exposure.

Only permuting sort and order leads to 3 separate searches, each of which returns 1000 results:

[
  { sort: best match },
  { sort: indexed, order: asc },
  { sort: indexed, order: desc }
]

We can get a broader range of results by adding some noise to the query q — PJs have a small list of standard fields, so we can search for those fields and they should be “exact matches” with some of those millions of results. For example, “dependencies” (searched above) matches some millions of PJs, 1000 of which will be returned. “devDependencies”, a different standard field, matches some millions of PJs (partially overlapping the “dependencies” results), and 1000 of those will be returned. We probably won't be exposed to 2000 unique PJs after those two searches, but we probably won't be exposed to only 1000 either (the overlap should be < 100%).

In all, there are 30 standard fields, and that's plenty for a proof of concept. Combining those fields with the other search options gives us 90 queries and up to 90,000 search results (30 queries with the default sort, plus 60 queries covering the 30 fields with the indexed sort in both orders, at up to 1000 results per query). Kind of like this:

  const searchTerms = [
    'name', 'version', 'description', 'keywords',
    'homepage', 'bugs', 'license', 'author',
    'contributors', 'files', 'main', 'browser',
    'bin', 'man', 'directories', 'repository',
    'scripts', 'config', 'dependencies', 'devDependencies',
    'peerDependencies', 'bundledDependencies',
    'optionalDependencies', 'engines', 'engineStrict', 'os',
    'cpu', 'preferGlobal', 'private', 'publishConfig'
  ]
  const orderParams = ['asc', 'desc']

  for (const searchTerm of searchTerms) { // 30 queries * up to 1000 results = 30,000
    searchWith({ sortParam: undefined, searchTerm })
  }

  for (const orderParam of orderParams) {
    for (const searchTerm of searchTerms) { // 60 queries * up to 1000 results = 60,000
      searchWith({ sortParam: 'indexed', orderParam, searchTerm })
    }
  }

That will get us a bunch of results. An actual “result” is a blob of information the API returns about the repository and file that matched; the bits of info we care about are:

1. html_url — the URL to view the file. Importantly, this URL takes you to an HTML page (as the name suggests), and not the raw file. There are two other URLs in the result object: api_url and url. api_url sends back an object with the contents of the file included in base64; url sends back an object that includes a download_url field. The download_url is almost the same as the html_url, except the hostname is raw.githubusercontent.com. I went with html_url and translated manually to download_url, but it would be just as well to get the content from api_url and decode it, or to follow the url -> download_url path.
2. sha — the Git hash of the file (SHA-1 of the file contents with a small header). For efficiency and de-duping we can persist this hash along with the downloaded files. The hash will be the same for any PJs that have the same content, so before downloading we can check if we've already seen the hash. (A sketch of both steps follows this list.)
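
A rough sketch of those two steps might look like this (downloadUrlFor and maybeDownload are made-up names, and in practice the seen hashes would be persisted somewhere rather than held in memory):

  const seenShas = new Set()

  // Turn an html_url like
  //   https://github.com/owner/repo/blob/master/package.json
  // into its raw.githubusercontent.com equivalent.
  function downloadUrlFor (htmlUrl) {
    return htmlUrl
      .replace('https://github.com/', 'https://raw.githubusercontent.com/')
      .replace('/blob/', '/')
  }

  async function maybeDownload (item) {
    if (seenShas.has(item.sha)) return null // same content, already downloaded
    seenShas.add(item.sha)
    const res = await fetch(downloadUrlFor(item.html_url))
    return res.text()
  }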

So putting it all together:
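
Roughly, the whole flow ends up looking like the condensed sketch below, reusing searchTerms and orderParams from earlier and the hypothetical searchCode, downloadUrlFor, and maybeDownload helpers (writing the files to disk is left out):

  async function sweep () {
    // 90 queries total: 30 fields with the default sort, plus 30 fields with 2 orders.
    const queries = []
    for (const searchTerm of searchTerms) {
      queries.push({ q: `${searchTerm} filename:package.json` })
      for (const order of orderParams) {
        queries.push({ q: `${searchTerm} filename:package.json`, sort: 'indexed', order })
      }
    }

    for (const query of queries) {
      // At most 10 pages of 100 results each; the API stops at 1000 total.
      for (let page = 1; page <= 10; page++) {
        const { items = [] } = await searchCode(query.q, { ...query, page })
        if (items.length === 0) break
        for (const item of items) {
          const contents = await maybeDownload(item)
          if (contents) { /* write contents to disk, keyed by item.sha */ }
        }
        await new Promise(resolve => setTimeout(resolve, 1000)) // stay polite between calls
      }
    }
  }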

That's pretty much the gist of it. Here is my implementation: stripedpajamas/sweep. It would be nice if there were a better way to collect a lot of PJs (not just the ones on registries)... if you know of a way to find more I'd love to hear it :)