
Articles / blog posts / academic papers / ebooks that have affected my perspectives. Most of them cover multiple topics, so they are not sorted or grouped; instead there are some simple tags following each link.

I will be updating the list as I find/remember more.

The Ecosystem is Moving — Moxie Marlinspike [open source, federation, centralization]

Announcing the Zig Software Foundation — Andrew Kelley [open source, incentives, capitalism, startups, non-profit]

A Call To Arms — theblacksquid [anarchism, communism, capitalism, politics]

Code Only Says What it Does — Marc Brooker [software, formal spec, documentation]

Reflections on Trusting Trust — Ken Thompson [software, trust, quine]

The Authoritarians — Bob Altemeyer [politics]

I recently got it into my head that I wanted to collect as many package.json files from the wild as possible. Well, maybe not “as many as possible”, but “as many as I can without losing interest”. The motivation was to explore dependency graphs on a larger-than-my-immediate-circles scale. The largest collection of package.json files around is most likely NPM's public registry (npmjs.com), and their database is easily replicated locally (after jumping through a few hoops). I will write about that process another time. This post is about finding package.json files on GitHub.

Firstly, I can't keep saying “package.json files”. It's just not sustainable. I don't think “package.jsons” or “package.json's” or “package manifests” are great alternatives, so I'll be calling them PJs (mostly) for the remainder of this post. If you're here at the second paragraph and you don't know what a PJ is, it's a simple file describing a Node.js/JavaScript project and its dependencies (more info is here). Because of The Way Things Are, the file is always named package.json.

Next, let's define the goal: to download as many unique PJs as possible from GitHub so that maybe I can do something with them later. And given the goal we should list what tools we have to work with (and their limitations).

GitHub search

GitHub allows for “advanced” searches, in which you can say “I'm looking specifically for files named package.json in any repository.” That's a great start. This is the result. If you refresh those search results, you'll probably see the number of results change dramatically (could be 6MM, 66MM, 666MM, anything really). That's because the search “took too long to finish” (GitHub's words), so they stop looking at some point and just show you what they've found.

Further, since we're searching code, GitHub has some constraints we can't get around. To name a few:

1. Only the default branch of the repo is searched (usually master).
2. Only files smaller than 384kb are searched (most PJs will be smaller than this anyway).
3. Only repos that have <500k files are searched.
4. You have to be authenticated with GitHub to do these searches across all public repositories.

GitHub API

Since we don't want to do this manually, we will use GitHub's API to make the queries and process the results programmatically. GitHub's API has its own set of interesting quirks, and searching with the API is extra-quirky.

1. The API only allows a certain number of requests per time period — up to 5000 requests per hour when authenticated (60 per hour if unauthenticated; not even a consideration at this point). BUT the Search functionality of the API has different rate limiting: 30 requests per minute when authenticated. Each response from the API has x-ratelimit-remaining and x-ratelimit-reset headers that let you know how many requests remain in the current period and at what time the next period will begin.
2. The API (for all operations) has abuse detection mechanisms in place that, when triggered, tell the requester to “wait a little while before trying again” in the form of a Retry-After header (usually 60 seconds). To avoid triggering these mechanisms, GitHub recommends waiting at least 1 second between API calls and respecting the Retry-After header in all cases.
3. The search operation (specifically when searching for code) has 3 parameters: q, sort, and order. The q param is just verbatim what one would type into GitHub.com's search bar. sort by default is “best match”, but can be set to indexed to sort by how recently GitHub indexed the file being returned. order is ascending/descending. Notably, order is only honored when sort is indexed.
4. The search operation paginates its results, and you can choose to get at most 100 results in a single page. But more importantly, the API will only return up to 1000 results total (by design). This is true of using the normal GitHub search as well, of course (the normal search is hitting the same API).
5. In that vein, all the limitations of using the normal GitHub search apply.
6. I'm not sure why, but it seems that any extra text included in the search term (when searching by filename) has to be an exact match. So “d filename:package.json” has 0 results, but “dependencies filename:package.json” has millions.
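
To make those quirks concrete, here's roughly what a single search request looks like when it plays by the rules above. This is just a sketch, not the code from my implementation: searchCode is a made-up name, it assumes Node 18+ for the built-in fetch, and it expects a token in the GITHUB_TOKEN environment variable.

  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

  async function searchCode ({ q, sort, order, page = 1 }) {
    const params = new URLSearchParams({ q, page: String(page), per_page: '100' })
    if (sort) params.set('sort', sort)
    if (order) params.set('order', order)

    const res = await fetch(`https://api.github.com/search/code?${params}`, {
      headers: {
        accept: 'application/vnd.github+json',
        authorization: `token ${process.env.GITHUB_TOKEN}`,
        'user-agent': 'pj-collector-sketch' // GitHub's API requires a User-Agent
      }
    })

    // abuse detection: back off for however long GitHub asks, then retry
    const retryAfter = res.headers.get('retry-after')
    if (res.status === 403 && retryAfter) {
      await sleep(Number(retryAfter) * 1000)
      return searchCode({ q, sort, order, page })
    }

    // search rate limit: if this period is used up, wait for the reset time
    if (res.headers.get('x-ratelimit-remaining') === '0') {
      const reset = Number(res.headers.get('x-ratelimit-reset')) * 1000
      await sleep(Math.max(reset - Date.now(), 0))
    }

    return res.json()
  }

  // e.g. searchCode({ q: 'dependencies filename:package.json', sort: 'indexed', order: 'asc' })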

So this maximum of 1000 results is a real bummer. I definitely wanted more than 1000 PJs, and we know millions are public on GitHub. To get more results, we can mix and match the available parameters and hopefully move that 1000-size result “window” over different portions of the many millions of actual results. In searches where all the results are available, sort and order would mostly be meaningless to us, but since we are only exposed to a subset of the results, changing the sort/order of the full set may increase overall exposure.

Only permuting sort and order leads to 3 separate searches, each of which returns 1000 results:

[
  { sort: best match },
  { sort: indexed, order: asc },
  { sort: indexed, order: desc }
]

We can get a broader range of results by adding some noise to the query q — PJs have a small list of standard fields, so we can search for those fields and they should be “exact matches” with some of those millions of results. For example, “dependencies” (searched above) matches some millions of PJs, 1000 of which will be returned. “devDependencies”, a different standard field, matches some millions of PJs (partially overlapping the “dependencies” results), and 1000 of those will be returned. We probably won't be exposed to 2000 unique PJs after those two searches, but we probably won't be exposed to only 1000 either (the overlap should be < 100%).

In all, there are 30 standard fields, and that's plenty for a proof of concept. Combining those fields with the other search options gives up to 90,000 search results: 30 fields with the default sort, plus 30 fields with 2 orders under the indexed sort, at 1000 results per query. Kind of like this:

  // the 30 standard package.json fields, used as search-term noise
  const searchTerms = [
    'name', 'version', 'description', 'keywords',
    'homepage', 'bugs', 'license', 'author',
    'contributors', 'files', 'main', 'browser',
    'bin', 'man', 'directories', 'repository',
    'scripts', 'config', 'dependencies', 'devDependencies',
    'peerDependencies', 'bundledDependencies',
    'optionalDependencies', 'engines', 'engineStrict', 'os',
    'cpu', 'preferGlobal', 'private', 'publishConfig'
  ]
  const orderParams = ['asc', 'desc']

  // default "best match" sort: 30 searches * 1000 results = 30,000
  for (const searchTerm of searchTerms) {
    searchWith({ sortParam: undefined, searchTerm })
  }

  // sort by indexed, both orders: 60 searches * 1000 results = 60,000
  for (const orderParam of orderParams) {
    for (const searchTerm of searchTerms) {
      searchWith({ sortParam: 'indexed', orderParam, searchTerm })
    }
  }

That will get us a bunch of results. An actual “result” is a blob of information the API returns about the repository and file that matched; the bits of info we care about are:

1. html_url — the URL to view the file. Importantly, this URL takes you to an HTML page (as the name suggests), and not the raw file. There are two other URLs in the result object: api_url and url. api_url sends back an object with the contents of the file included in base64; url sends back an object that includes a download_url field. The download_url is almost the same as the html_url, except the hostname is raw.githubusercontent.com (and the /blob/ path segment is dropped). I went with html_url and translated manually to download_url, but it would be just as well to get the content from api_url and decode it, or to follow the url -> download_url path.
2. sha — the Git hash of the file (SHA-1 of the file contents with a small header). For efficiency and de-duping we can persist this hash along with the downloaded files. The hash will be the same for any PJs that have the same content, so before downloading we can check whether we've already seen the hash.

So putting it all together—

  • For each result of each page of each search parameter combination:
    1. Check if we have seen the Git hash (result.sha); if so, skip it.
    2. If not, download the package.json from result.url -> download_url or api_url -> .content, and save result.sha with it.
  • If the x-ratelimit-remaining header hits 0 during the searches, wait until the timestamp specified in the x-ratelimit-reset header.
  • If the API returns a 403 status code, wait for as many seconds as specified in the retry-after header before searching again.
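
In code, that outer loop looks roughly like this. Again, a rough sketch rather than the real thing (which lives in the repo linked below): it leans on the hypothetical searchCode helper sketched earlier and uses an in-memory Set where you'd really want some persistence for the hashes.

  const seenShas = new Set()

  async function collect ({ q, sort, order }) {
    for (let page = 1; page <= 10; page++) { // 10 pages * 100 results = the 1000-result cap
      const { items = [] } = await searchCode({ q, sort, order, page })
      if (items.length === 0) break

      for (const item of items) {
        if (seenShas.has(item.sha)) continue // identical content already downloaded
        seenShas.add(item.sha)

        // translate html_url into the raw-content URL described above
        const rawUrl = item.html_url
          .replace('https://github.com/', 'https://raw.githubusercontent.com/')
          .replace('/blob/', '/')
        const pj = await (await fetch(rawUrl)).text()
        // ...persist pj alongside item.sha here...
      }
    }
  }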

That's pretty much the gist of it. Here is my implementation: stripedpajamas/sweep. It would be nice if there were a better way to collect a lot of PJs (not just the ones on registries)... if you know of a way to find more I'd love to hear it :)

The knapsack problem—sometimes called the subset-sum problem—is to determine whether or not (or how) you can sum a subset of a set of integers (called “weights”) to reach some target s exactly. The idea is that s represents the size or capacity of the knapsack, and you're trying to figure out if it's possible to exactly fill the knapsack with various objects that have the given weights.

For example, suppose you have these parameters:

weights: [1, 2, 3, 4, 5]
s: 10

This particular problem has solutions, and they are:

[1, 2, 3, 4]
[1, 4, 5]
[2, 3, 5] 

All of these subsets sum to 10; that's what makes them solutions. This is called the 0-1 knapsack problem (really a special case of the 0-1 knapsack problem), because we allow either 0 or 1 of each weight. Other versions of the problem allow reusing weights. Forget about those versions for now.

Another way of thinking about this problem is that instead of looking for some “subset” of the weights that sum to the target, we're looking for a list of 0s and 1s to multiply by the weights—that is, we can “choose” a weight by multiplying by 1 and “leave out” a weight by multiplying by 0. In the above example, one of our solutions was [1, 4, 5]. Here's the other way of viewing that solution:

weights:  [1, 2, 3, 4, 5]
solution: [1, 0, 0, 1, 1]

result: (1*1) + (2*0) + (3*0) + (4*1) + (5*1) = 10 yippee

It's just another way of expressing the solution, but it will come in handy later. Also, I should mention: the weights aren't constrained to be in any particular order; I've sorted the example weights for clarity. The similar problem of weights: [1, 3, 5, 2, 4], s: 10 has solution [1, 0, 1, 0, 1], for example.

The knapsack problem was considered Very Difficult™ in the 70s, and was proven to be NP-complete (i.e. “easy to confirm whether a proposed solution is valid, but may be prohibitively difficult to determine in the first place whether any solution exists”). For a long time the general method of solving it was to try every combination of the weights and see if one combination sums to s (that's on the order of 2^n combinations, where n is the number of weights). A ton of really clever algorithms have been thought up since then, but generally it's safe to say that the problem is still Pretty Darn Difficult™ when the number of weights and s are both big.
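
That 2^n brute force is short enough to sketch out; it's fine for a handful of weights and hopeless once n gets big:

  // brute force: try all 2^n include/exclude patterns, one per bitmask
  function bruteForceKnapsack (weights, s) {
    const n = weights.length
    for (let mask = 0; mask < (1 << n); mask++) {
      let sum = 0
      for (let i = 0; i < n; i++) {
        if (mask & (1 << i)) sum += weights[i]
      }
      if (sum === s) {
        // express the answer as a list of 0s and 1s, like above
        return weights.map((_, i) => (mask >> i) & 1)
      }
    }
    return null // no subset sums to s
  }

  bruteForceKnapsack([1, 2, 3, 4, 5], 10) // => [1, 1, 1, 1, 0], i.e. 1 + 2 + 3 + 4 = 10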

Solutions as Messages

There's a really interesting application (or attempted application) of the knapsack problem: cryptography! Remember above when we represented the solution as a list of 0s and 1s? 0s and 1s make up any binary data, so consider that the “solution” could actually be a secret message!

Let's unpack that a little. We start with a message: “x” will work (keep it short so the math doesn't get overwhelming). “x” in binary is one byte (8 bits): 01111000. This is just like a solution to some knapsack problem. Now imagine we chose some set of “weights”, like [59, 24, 96, 67, 79, 93, 21, 51]. We can do the knapsack problem backwards:

weights:  [59, 24, 96, 67, 79, 93, 21, 51] <-- our made up weights
solution: [0,  1,  1,  1,  1,  0,  0,  0] <-- "x" in binary

result: (59*0) + (24*1) + (96*1) + (67*1) + (79*1) + (93*0) + (21*0) + (51*0) = 266

The idea is that if Alice generates these weights and publicizes them, Bob could encode a message (“x”) using the above method, and send 266 to Alice. An eavesdropper would see 266 and have to solve the knapsack problem to find the message/solution: 01111000 (which is “x”). Assuming the knapsack problem is hard, the eavesdropper should be stuck and Bob's message to Alice should remain secret.

But, obviously, this means it would also be hard for Alice to decrypt the message :( The knapsack problem doesn't magically become easier for Alice to solve just because she is the intended recipient. So this scheme by itself is a non-starter. Cryptosystems have to be easy to use (for Alice), but hard to break (by eavesdroppers).

As it turns out, some knapsack problems are Pretty Easy™ to solve. If the weights form a superincreasing sequence—that is, each number of the sequence is larger than the sum of all the preceding elements—then it's actually really easy to solve it.

Here is the classic example of a “superincreasing sequence”: powers of two.

[1, 2, 4, 8, 16, 32, 64]

(Sanity check: it's a superincreasing sequence because 2 > 1, 4 > (1 + 2), 8 > (1 + 2 + 4), etc.)

If our weights look like that, you can work out whether or not the last weight is included or dropped by asking “is the target greater than the sum of all the preceding weights?” If yes, the last weight must be included. If no, the last weight must not be included. Since you know the last weight's inclusion/exclusion for sure, you can ask the same question about the second-to-last weight, and so on, with a smaller and smaller target. Confusing in English... here's an example:

weights: [1, 2, 4, 8]
s: 13

1. Is 13 > 1+2+4 ? Yes, so 8 must be included. New target is 13-8.
  Solution so far: [?, ?, ?, 1]
2. Is 13-8 > 1+2 ? Yes, so 4 must be included. New target is 13-8-4.
  Solution so far: [?, ?, 1, 1]
3. Is 13-8-4 > 1 ? No, so 2 must not be included. Target doesn't change.
  Solution so far: [?, 0, 1, 1]
4. Is 13-8-4 > 0 ? Yes, so 1 must be included. New target is 13-8-4-1.
  Solution so far: [1, 0, 1, 1]

The last target (13-8-4-1) is 0, which means we're done. 
solution: [1, 0, 1, 1]
result: (1*1) + (2*0) + (4*1) + (8*1) = 13

(Eagle eyes will have noticed that when the weights are powers of two, the solution is simply the binary representation of the target, written with the least significant bit first: 13 in base 2 is 1101, which read back-to-front is our solution [1, 0, 1, 1].)
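
The walk from the largest weight down takes only a few lines of code. A small sketch (the function name is made up):

  // greedy solver for a superincreasing set of weights, largest weight first
  function solveSuperincreasing (weights, s) {
    const solution = new Array(weights.length).fill(0)
    let target = s
    for (let i = weights.length - 1; i >= 0; i--) {
      // for superincreasing weights (when a solution exists), "target >= weights[i]"
      // makes the same call as "target > sum of all the preceding weights"
      if (target >= weights[i]) {
        solution[i] = 1
        target -= weights[i]
      }
    }
    return target === 0 ? solution : null // null when s isn't reachable
  }

  solveSuperincreasing([1, 2, 4, 8], 13) // => [1, 0, 1, 1]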

If the knapsack used to encrypt information had superincreasing weights, Alice would be able to decrypt the message easily. Of course that means the eavesdropper would be similarly empowered. So the idea of knapsack cryptosystems is to generate a private, easily-solved set of weights (superincreasing), and then apply some transformations to the weights, resulting in a hard-to-solve (not superincreasing) public set. People that want to send you a message encode it using the public set of weights; you then “reverse” the transformations and solve the easy knapsack problem to get the message.

An easy-ish-to-learn, famous (and broken) one is the Merkle-Hellman knapsack, described in their paper Hiding Information and Signatures in Trapdoor Knapsacks. Here is how the transformation of the weights works:

1. Alice chooses big numbers M and W (where the GCD of M and W is 1).
2. She generates a superincreasing sequence of pretty huge numbers (b1, b2, ..., bn), with M chosen larger than the sum of all the b's.
3. She computes a1, a2, ..., an, where ai = bi * W (mod M).

And that's it. Bob can now encrypt the bits of some message using weights a1, a2, ..., an, and it will result in a hard-to-solve knapsack problem— the sum s, and the public set of weights [a1, a2, ..., an]. When Alice receives this, she can apply a reverse-transformation on the sum s (s' = s*W^-1 (mod M)), and solve an easy knapsack problem: s' and [b1, b2, ..., bn] (weights that we already know to be superincreasing).
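
Here's a toy round trip with tiny, made-up numbers so you can watch the moving parts. This is only a sketch (and not the code in the tool linked below); real parameters would be far larger:

  // toy Merkle-Hellman: private superincreasing weights b, modulus M bigger than
  // their sum, multiplier W with gcd(W, M) = 1, public weights a_i = b_i * W mod M
  const b = [2, 3, 7, 14, 30, 57, 120, 251]
  const M = 491
  const W = 41
  const a = b.map((bi) => (bi * W) % M)

  // extended Euclid: find the x with (W * x) mod M === 1
  function modInverse (w, m) {
    let [oldR, r] = [w, m]
    let [oldS, s] = [1, 0]
    while (r !== 0) {
      const q = Math.floor(oldR / r)
      ;[oldR, r] = [r, oldR - q * r]
      ;[oldS, s] = [s, oldS - q * s]
    }
    return ((oldS % m) + m) % m
  }

  // Bob: encrypt a bit vector (e.g. "x" = [0, 1, 1, 1, 1, 0, 0, 0]) with the public weights
  const encrypt = (bits) => bits.reduce((sum, bit, i) => sum + bit * a[i], 0)

  // Alice: undo the transformation, then greedy-solve the easy superincreasing knapsack
  function decrypt (s) {
    let target = (s * modInverse(W, M)) % M
    const bits = new Array(b.length).fill(0)
    for (let i = b.length - 1; i >= 0; i--) {
      if (target >= b[i]) {
        bits[i] = 1
        target -= b[i]
      }
    }
    return bits
  }

  decrypt(encrypt([0, 1, 1, 1, 1, 0, 0, 0])) // => [0, 1, 1, 1, 1, 0, 0, 0]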

This just blows my mind. It sucks that Merkle and Hellman's transformations have been demonstrably compromised; they recommended a couple of further things, like mixing up the order of the weights and adding some multiple of M to each weight, but that doesn't help. Others have followed up with more secure transformations, and I think there are a couple schemes out there that have held up well enough to scrutiny. Regardless, it's just still so cool to hide a message inside what's basically a software developer whiteboard interview problem. (Also, that's too many nouns in a row, and 8 9 10 9 7 is interesting).

I made a small tool implementing Merkle and Hellman's transformations here: stripedpajamas/knapsack. As a follow up, I might dig into the method that trounced Merkle and Hellman's cryptosystem from A Polynomial Time Algorithm for Breaking the Basic Merkle-Hellman Cryptosystem.

✌️

credit to Odlyzko for The Rise and Fall of Knapsack Cryptosystems; this post is basically the first couple pages of that paper

I've been working on a small and simple tool called caesar that implements some of the old, basic ciphers like Caesar, Playfair, and Vigenère. These are lightweight and trivially broken with modern technology (and sometimes trivially broken without modern technology). But nonetheless, I find them really fun to mess around with. caesar, the tool, is supposed to provide a work-free way to encrypt and decrypt using these classic, out-of-style ciphers.

A cipher that I was trying to implement the past week was an old WWI-era cipher called ADFGX. There's a better version of it called ADFGVX which supports numbers and the letter J, but so far caesar is letters-only, so I figured ADFGX is a good place to start. ADFGX is really simple to work with by hand and takes advantage of a cool idea I will explain presently.

ADFGX takes two keys and consists of two steps. The first key is used to build what's called a Polybius square with a mixed alphabet. Essentially we write the alphabet out into a square, but we start it off with a keyword (our first key). This results in a mixed up alphabet. To get an alphabet into a square, we need to drop a letter (26 is not a perfect square), so typically I and J are merged and everyone named Joel has to go by Ioel for the foreseeable future. Here's a Polybius square without a key:

    1 2 3 4 5
    ---------
1 | A B C D E
2 | F G H I K
3 | L M N O P
4 | Q R S T U
5 | V W X Y Z

To make a square with a keyword, you start the square with letters of the keyword and then write the rest of the alphabet in order (skipping keyword letters, since they're already in the square). If our keyword was “apple”, it would look like this:

    1 2 3 4 5
    ---------
1 | A P L E B
2 | C D F G H
3 | I K M N O
4 | Q R S T U
5 | V W X Y Z

With a square like this, you can convert each letter of your message to a pair (row, column). “Hello” becomes 25 14 13 13 35. In ADFGX, the row and column headers are A D F G X instead of 1 2 3 4 5:

    A D F G X
    ---------
A | A P L E B
D | C D F G H
F | I K M N O
G | Q R S T U
X | V W X Y Z

So “Hello” passed through the first step of ADFGX (with key one being “apple”) would produce DX AG AF AF FX. This substitution by itself isn't great security-wise, since the pairs are as vulnerable to frequency analysis as any other one-to-one substitution cipher. So we come to step two of ADFGX.
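
Here's a quick sketch of step one in code. It's not the implementation in caesar, and the function names are made up:

  // step one: build the keyed Polybius square, then substitute each letter
  // with its (row, column) pair of ADFGX labels
  const LABELS = 'ADFGX'

  function buildSquare (keyword) {
    const seen = new Set()
    const square = []
    // keyword first, then the rest of the alphabet; J is folded into I
    for (const ch of (keyword + 'ABCDEFGHIKLMNOPQRSTUVWXYZ').toUpperCase()) {
      const letter = ch === 'J' ? 'I' : ch
      if (!seen.has(letter)) {
        seen.add(letter)
        square.push(letter)
      }
    }
    return square // 25 letters, row by row
  }

  function substitute (plaintext, keyword) {
    const square = buildSquare(keyword)
    let out = ''
    for (const ch of plaintext.toUpperCase()) {
      const idx = square.indexOf(ch === 'J' ? 'I' : ch)
      if (idx === -1) continue // skip anything that isn't a letter
      out += LABELS[Math.floor(idx / 5)] + LABELS[idx % 5]
    }
    return out
  }

  substitute('Hello', 'apple') // => 'DXAGAFAFFX'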

Step two requires a new key, the transposition key. The output from step one is written underneath the transposition key (in a normal, left-to-right, row-by-row way) and then letters are read off by column in alphabetical order. To continue our example, let's say our second key is “cat”:

C A T
-----
D X A
G A F
A F F
X

Reading off columns in alphabetical order (ACT), we would get XAF DGAX AFF. This step decouples the pairs that each letter was substituted with and that means frequency analysis won't reveal too much anymore. This is a super cool upgrade to just using a Polybius square. The Polybius square fractionates the original message, such that each letter becomes a pair. The transposition step muddles up those pairs. Fractionation + transposition is a killer duo.

This is pretty much where Wikipedia and the other handful of sites I checked out leave off. Reversing the process to decrypt is pretty straightforward, for the most part. But I don't like this finished ciphertext (XAF DGAX AFF). I always like to group ciphertext letters into consistent sizes, like XAF DGA XAF F or XAFD GAXA FF. Left as-is, the ciphertext is leaking the length of the second key — 3 groups, 3 letters (this is called out on some websites as a “feature” of the cipher). I'm not sure how things were done a hundred years ago, but I'm hoping the letter groupings of the final output weren't heavily legislated.

It's not very obvious how to get the letters back in their proper columns without knowing how long each column should be. I found one way of reversing the transposition step without knowing the column lengths, but it involved a bunch of moving columns around and writing in all kinds of directions at once, and that was just too complicated for me. I found another website that said to fill the transposition block with random a/d/f/g/x characters so you wouldn't have to worry about how many letters to put into each column when you're decrypting. Nothing grinds my gears like some indistinguishable, non-standardized padding.

Anyway, look no further. Here is my Simple Method of Properly Populating the Transposition Columns When The Ciphertext Isn't Grouped By Column™ (the full impetus for this blog post; everything else was introduction):

  1. Write out the key word and put blanks for each letter of the message underneath:

    C A T
    -----
    _ _ _
    _ _ _
    _ _ _
    _
    
  2. Starting from the beginning of the message, populate the columns in alphabetical order (read the first 3 letters, XAF, into column A, etc.)

    C A T
    -----
    _ X _
    _ A _
    _ F _
    _
    
  3. Repeat, done.
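
And here's roughly how that looks in code. A sketch rather than the implementation in caesar, and it assumes the transposition key has no repeated letters:

  // figure out each column's height from the grid shape, then deal the
  // ciphertext into the columns in alphabetical order of their key letters
  function populateColumns (ciphertext, transpositionKey) {
    const key = transpositionKey.toUpperCase().split('')
    const cols = key.length
    const rows = Math.ceil(ciphertext.length / cols)
    const short = cols * rows - ciphertext.length // rightmost columns missing a last-row letter
    const height = (col) => rows - (col >= cols - short ? 1 : 0)

    // column indices sorted by their key letters (the read-off order)
    const order = key.map((letter, i) => [letter, i])
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([, i]) => i)

    const columns = new Array(cols)
    let cursor = 0
    for (const col of order) {
      columns[col] = ciphertext.slice(cursor, cursor + height(col))
      cursor += height(col)
    }
    return columns
  }

  populateColumns('XAFDGAXAFF', 'cat') // => ['DGAX', 'XAF', 'AFF'], the columns under C, A, T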

🙌🙌🙌

I saw a tweet today that said “How much of intelligence is compression”.

I don't know what the original poster meant by it, but it really resonated with me. There are probably a lot of different ways of interpreting it. When I saw “compression”, I thought of data compression — “encoding information using fewer bits than the original representation.” I'm fascinated by data compression.

Here is how I would describe it in really simplistic terms (and probably the extent of my actionable knowledge on the subject):

Compression is about finding and exploiting redundancy. Put another way, I think we could say compression is about finding patterns and expressing those patterns in more efficient ways. An example text:

Twinkle, twinkle, little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky.

When this blazing sun is gone, When he nothing shines upon, Then you show your little light, Twinkle, twinkle, through the night.

A simple compression algorithm might move through this text and create tokens for the letter combinations it finds. On the first line there's already a repeat — “winkle,” is there twice. The algorithm could mark the first “winkle,” as token 0, and then perhaps replace the second “winkle,” with a pointer to token 0. This process can continue throughout the text, tokenizing “When”, “ight”, “you”, etc. If pointing to an already-seen token can be done with less space than the token itself takes up, we've just compressed the text. (This is all hand waving; I know only a little about this, but I love it so much.)
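
Just to make the hand waving slightly more concrete, here's a toy sketch that only works at the level of whole words. A real compressor matches partial strings (like “winkle,”) and only emits a pointer when the pointer is actually smaller than the text it replaces:

  // toy "compressor": the first time a word appears it becomes a token; any
  // repeat is replaced with a pointer to that token's index
  function toyCompress (text) {
    const tokens = []
    const output = []
    for (const word of text.split(' ')) {
      const index = tokens.indexOf(word)
      if (index === -1) {
        tokens.push(word)
        output.push(word) // first sighting: emit the word itself
      } else {
        output.push(`@${index}`) // repeat: emit a pointer instead
      }
    }
    return output.join(' ')
  }

  toyCompress('twinkle twinkle little star how i wonder what you are')
  // => 'twinkle @0 little star how i wonder what you are'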

So then, how does this relate to intelligence? After reading the tweet I sat for a while and thought about how intelligence could fit this model. I think perhaps that learning something new is mostly connecting something already learned with a new context. I think that's why we are always making analogies to other things when talking about complex ideas. For instance, in object-oriented programming, I see all the time that lessons and tutorials will use the relationships between physical entities (like cars, trucks, cats, and dogs) to teach patterns of relating different objects together.

I read once in an old spiritual text that when a person has learned a subject, they ought to be able to condense the subject down a level — that is, they ought to be able to simplify the idea. The text went on to say that the wisest person to ever live would learn the most complex and esoteric subjects and be able to explain them on a child's level because the person's understanding was so complete.

So intelligence could be seen as tokenizing the bits of information that flow into the mind, and then, when new information enters, detecting those already-seen bits and linking back to the tokens. Maybe an example is that a person could learn to ride a bicycle and then learn to ride a motorcycle. The person doesn't need to relearn how to stay balanced on two wheels when learning to ride a motorcycle. They don't need to think about it; they can refer back to the earlier learning and apply it in the new context of going a lot faster and not having to pedal.

Maybe intelligence can be seen as the structure that results from all this linking and condensing.