Have you ever wondered what it would be like to assign a label to literally anything? If yes, this article might attract your interest as this is exactly what hash functions are about. Those are the functions that accept input of an arbitrary length and provide output of fixed length which uniquely identifies the provided input. Really uniquely? Well… It depends how good the hash function is!

This article is going to be a lot less math-heavy than usual. We will briefly discuss what hash functions are, why are they important and how to distinguish good hash functions from bad hash functions. We will implement a simple hash function in Python and calculate hashes of some Pokémon and other things.

Given different input, hash function is expected to provide distinct output. If that is not the case — if provided two different inputs same output is given — a collision occurs. Such collisions are rare and extremely hard to find (as we will realize soon). For the widely used modern hash functions such as SHA256 no collision has ever been found. A good hash function would satisfy the following properties.

1. Deterministic, same input should always lead to same hash value.
2. Efficient, fast to compute.
3. Difficult to analyze, chaotic, small change of input should change output a lot. It should be infeasible to find pattern and predict the output.
4. Irreversible. Given a hash value it should be computationally hard to find matching input.
5. Collision free. In practice it should be extremely unlikely to encounter a collision.

Cryptographic hash functions have wide range of applications. They can be used to store user passwords, which makes it possible to verify if password is correct without knowing the password. They are used to generate pseudorandom numbers, by applying hash function many times to some seed input. The proof of work algorithms in cryptocurrencies also take advantage of hash function properties. It is also possible to use hash function to address resources by their content, as done in IPFS file system. We will not discuss examples of those usages, as we are going to discuss those kind of applications in the future tutorials and now just need to briefly introduce the concept.

An example of a insecure hash function would be bitwise XOR operation which we can implement in Python. First lets implement $\text{XOR}$ operation of two byte-type values which is well done here

def byte_xor(ba1, ba2):
return bytes([_a ^ _b for _a, _b in zip(ba1, ba2)])


then based on that we can split an arbitrary length input into chunks of $32$-byte length and pad last one with zeros (if required). Finally compute the $\text{XOR}$ operation of all the chunks leading to a simple hash function fixed length output

def bitwiseXOR(data):
chunks = [data[i:i+32] for i in range(0, len(data), 32)]
for i in range(len(chunks)):
chunk = chunks[i]
hexchunk = chunk.hex()
len_diff = 32 - len(chunk)
if len_diff:
hexchunk += '00' * len_diff
chunk = bytearray.fromhex(hexchunk)
chunks[i] = chunk
res = bytes.fromhex('00' * 32)
for chunk in chunks:
res = byte_xor(res, chunk)
return res


I need to apply it to something. I decided it would be good to continue our little Pokémon tradition which began back in with Vertex Cover and continued in Quantum Adiabatic Optimization and Constraint Programming articles. We will calculate hash values of Pokémons and search for collisions. I borrowed some sprites from bulbagarden and used them as input for bitwise $\text{XOR}$ and SHA256 hash function. Below you can find first and last five Pokémons and their hashes.

#   SHA256                                                           Bitwise XOR
001 43aa5ea04dc0ee3fb3aa5269639703b9756cb4fb88cbda2885288a983c9f6357 799e4fabfe1ff7600e7fe347cbe36a5a81c4a8d4f5eb8b439da39a0fcf84d819
002 184bff41d64d393528346e4f36e2b517e65f6385b180d66c5562c59314b2d7c0 1f65203a3e0f997e8a4d69a8d19de13e05b3cd8e094792330f31dc82cd2cbbfe
003 e13c27acbbdc7d89457c44af6e7bb5212610350d1bbced0f0ac0dc171c3b48d1 e18b6909143385e10e5ee35f58980fbe897417222328b8f39e581ceffd7b36ec
004 960c166bc063123aaa38e8acde3b8d3667e9568d4eeaa5eb29df316b05b33e94 b6b2b70ccb51d459aa17cf59e962f77b0c1f3fdb40b874dc0a4ef3e9d0c5f560

...

149 fe3390a3bcc6cd9fc352a9b071e907d8187daf1544f5c9869ee2341b93eaee5f 31b3afd221fbb01f729ded1be4af69e4d0486c3b0f3ec9151a50c852c71e624c
150 9ca51be1e824179bee4d5e821fbbd5ab45ea73b6cd7b4d6ffd4e7da89925d965 2330bb0c369daef8ed6a68c8c25eae6a5ce8e11afd7ee9b5146896d4acdb110d
151 8e4a8eb7817ee0987eeb58b294344867504e6e6feffaa17d2b4e3883219ef4f1 c4ece21913a43e9943c988a4e9ca4e863c20fc3c9b3e3413c0eb504673b8c14b

No collisions found...


No collisions found!!! The bitwise $\text{XOR}$ operation is not a secure hash function, yet it does seem secure enough to characterize the first $151$ Pokémons based on their sprites from Pokémon RGB version. It is important to mention here that the sprite used as input were encoded as PNG images, thus compressed. Another limitation was that they were not $2$-byte pixels (as should be in case of GameBoy graphics). Those sprites are SGB colored. They contain a lot more information than regular GameBoy sprites used. Below I provide a little visualization of first nine Pokémons with their sprites and hash value QR codes. The regular GameBoy sprites would look similar to the Mario figure from Simulating Superdense Coding article.

That was my internal dillema regarding aesthetics. There is a chance regular $2$-bit GameBoy pixels would provide a bitwise $\text{XOR}$ collision among first $151$ Pokémons, but I just cannot look at those sprites without SGB coloring, I spent too many hours playing this game as a child to bear looking at regular sprites…

We failed to find bitwise $\text{XOR}$ collision in the Pokémons sprites. I decided to try one more thing. Back in Poland we have this book-length poem called Pan Tadeusz. I did not read it… (even given that it was mandatory to read it at school, it was just too not-interesting) however I finally found a use-case for it! It is a pretty long poem, available online and not protected by any copyright licence so I searched for collisions in its verses. The first and last ten verses provided below.

#    Verse                                                            Bitwise XOR
0001 Litwo! Ojczyzno moja! ty jesteś jak zdrowie:                     6c03151c4f5b443d0514131c406e6f206d6f6a6121207479206a65737465c59b
0002 Ile cię trzeba cenić, ten tylko się dowie,                       264c1649a7f0e4fd4f031b1f4962612063656e69c4872c2074656e2074796c6b
0003 Kto cię stracił. Dziś piękność twą w całej ozdobie               8eefaba7431db35da55303520202ac47e744002b000daaf9491569c4996b6e6f
0004 Widzę i opisuję, bo tęsknię po tobie.                            2706440eabfb490c0e6f706973756ac4992c20626f2074c499736b6e69c49920
0005 Panno święta, co Jasnej bronisz Częstochowy                      2a412d14abb9b6ef180aacf603182c20636f204a61736e656a2062726f6e6973
0006 I w Ostrej świecisz Bramie! Ty, co gród zamkowy                  694318002801b7c1014a5aa4f61c06121a69737a204272616d6965212054792c
0007 Nowogródzki ochraniasz z jego wiernym ludem!                     270a05011e1fe3df111e0e04016f636872616e6961737a207a206a65676f2077
0008 Jak mnie dziecko do zdrowia powróciłaś cudem                     89d20849a8ec08a0bb44191c0106066f20646f207a64726f77696120706f7772
0009 (Gdy od płaczącej matki, pod Twoją opiekę                        5f280ebda54f0b5019a0e9a5fa7ac48563656a206d61746b692c20706f642054
0010 Ofiarowany, martwą podniosłem powiekę;                           3f091e081704b3f855792c206d61727477c48520706f646e696f73c582656d20

...

9933 A przy stoliku drewnianym pan włodarz                            c34f141308032073746f6c696b7520647265776e69616e796d2070616e2077c5
9934 Albo ekonom, lub nawet gospodarz,                                6d6c626f20656b6f6e6f6d2c206c7562206e6177657420676f73706f6461727a
9935 Nie bronił czytać i sam słuchać raczył,                          2fade20010130c141000000c637a797461c48720692073616d2073c582756368
9936 I młodszym rzeczy trudniejsze tłumaczył,                         8ca218a8e30c1e0abffb4120727a65637a7920747275646e69656a737a652074
9937 Chwalił piękności, a błędom wybaczył.                            631f0e030d0abffbe5f247c4996b6e6fc59b63692c20612062c582c499646f6d
9938 I zazdrościła młodzież wieszczów sławie,                         2a5ab9d20d4401aa47fa1400a0ae61206dc5826f647a6965c5bc20776965737a
9939 Która tam dotąd brzmi w lasach i w polu,                         6b1de3c452114f18144120646f74c485642062727a6d692077206c6173616368
9940 I którym droższy niż laur Kapitolu,                              20540418b69f72796d2064726fc5bc737a79206e69c5bc206c617572204b6170
9941 Wianek rękami wieśniaczki usnuty                                 2310616e656b2072c4996b616d6920776965c59b6e6961637a6b692075736e75
9942 Z modrych bławatków i zielonej ruty…                             7a52181b1d90f9c5682062c582617761746bc3b3772069207a69656c6f6e656a

No collisions found...


Again, no collisions! I conclude that bitwise $\text{XOR}$ is more secure than I anticipated (but still, please do not use it, you will find collision in large-enough dataset). I also learned that it is not easy to find collisions even for insecure hash function. How about you? Would you take the challenge and find a bitwise $\text{XOR}$ collision? If you do, please tweet me! The complete source code for this article is provided in this public github repository. If you want to read more about hash functions I recommend the Practical Cryptography for Developers book. That would be all for this article… But expect the hash function to appear on this blog very often from now on and point back to this text.