Category: math

May 29, 2013

Math for Distributed Redundant Storage

What we want – a mechanism to split a file into pieces so we can recover the original file even if some pieces are lost.

One simple mechanism to achieve this is to use XOR blocks. For example if we have blocks A and B that make up a file and we calculate block C = A ^ B (where ^ represents XOR operator). Then given any two blocks (A,B), (A,C), or (B,C), we can reconstruct the third missing block and recover the file. In this example we have 50% redundancy and can tolerate a 33% loss of data and still recover the file. More complicated mechanisms can be based on variations of this simple XOR approach.

I have played around with this approach and found it to get complicated quite quickly when trying to come up with an optimal system involving many more blocks. Instead I discovered a much simpler approach.

If we imagine an output vector

with n components given by:

Y = AX

where

A is a (m by n) matrix, and X is an input vector with m components, and n>m then, given Y to calculate X, we have an overdetermined system (i.e. we can lose some elements of Y and still recover X).

So, yes this is another way, but it suffers from a couple a serious problem. The Y component integers will tend to be larger (require more bits to represent) then the X integer values. For a large matrix, the number of extra bits required
rises dramatically with no benefit for this extra redundancy. So this approach seems inferior.

Luckily there is a way to solve this problem. We can do the all math operations over an integer field p, where p is a prime number. In this case multiplication ‘wraps’ the integer around so that a large integer n
would reduced to be n mod p. The ‘hard-part’ is how do we calculate the inverse of a matrix over a field ‘p’?

We know that

x ^p-1 mod p = 1

therefore

x*x ^p-2 mod p = 1

and therefore

x ^p-2 mod p

is the inverse of x over the finite field p.

Luckily the same property applies to matrices, so

A ^-1 = A ^p-2 mod p

This is pretty cool (in my opinion).

So if we construct a matrix A of size (m by n) where (n>m) that is properly invertible over all square (m by m) matrix subsets over some prime integer field then we will have n output values. We can now lose n-m values from this output and still be able to reconstruct our original output vector.

This is probably all a bit difficult to understand without an example
and I hope to address this topic later with more ‘code’ and less math.

The key take-away is this:
1. We can use matrix multiplication to transform a set of values into another set that contains more values.
2. This set has some redundancy in it. If we do math over an integer field we make this redundancy optimum.
3. We can calculate the inverse of a matrix over a field by using the formula A ^-1 = A ^p-2 mod p
4. This allows us to lose some values (n-m) and still recover the original data.
5. The Matrix A acts like a key (in an encryption sense) that is ‘probably’ (although unknown) fairly difficult to crack.