Current web security best-practices call for the use of two-factor authentication. This authentication mechanism forces the owner of an account to provide two pieces of information to prove identity. This information is usually a password and a random code sent to the user’s device. Two-factor authentication is one of the easiest ways to improve account security. However, its security relies on the random code being being – well – random.
I question the randomness of these two factor authentication codes for a very simple reason. It feels like a very high proportion of the codes that I receive contain repeated digits. For example, codes like 537881
or 155389
or 223485
. Repeated digits are a pattern. And random authentication codes that appear to have patterns is a devastating sign for account security. Nation-states and hackers alike have very smart people doing very smart math to find patterns in randomness. The greatest ciphers in the world were unraveled when codebreakers found patterns in supposedly random data.
I like math. I like combinatorics (counting things and calculating probabilities). These are the tools needed to dive into this problem. We need to calculate some probabilities. We need to find out whether the two-factor authentication codes we get are safely random. Or, whether they contain dangerous patterns.
Problem statement.
Given a sequence of six random digits (a standard two-factor authentication code), what is the probability that at least two consecutive digits are repeating?
Starting from the base
Before we start, here’s some notation that I’ll be using throughout the post. $P(X)$
represents the probability of $X$
occuring. In this case, $X$
is that at least two consecutive digits are repeating.
With combinatorics, it’s almost always best to start from the simplest base case possible and work your way up. Solving the problem with the numbers 0-9 in each position is difficult. So let’s first solve something with only two options. 0s and 1s.
For a sequence of two binary digits, what is the probability that at least two digits are repeating? This is simple enough that we can just count these cases.
00, 01, 10, 11
Two cases out of the four possibilities have repeated digits. $P(X)=\frac{2}{4}$
Let’s expand to three binary digits. We can still count these cases with relative ease.
000, 001, 010, 011, 100, 101, 110, 111
Six out of eight possibilities have repeated digits. $P(X)=\frac{6}{8}$
Let’s keep expanding to four binary digits. We can still enumerate them.
0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111
14 out of 16 possibilities have repeated digits. $P(X)=\frac{14}{16}$
We can start to find some patterns and formulas now that we have examples to look at. The general formula for the probability that a digit is repeating is the number of repeating possibilities divided by the number of total possibilities. $P(X)=\frac{Repeating}{Total}$
The denominator of that formula is straightforward to compose. Each position in the sequence has two options, a 0 or a 1. For a sequence of 3 binary digits, the total number of options is described by the formula $Total=2*2*2=2^3$
. This formula generalizes to $2^n$
when we have sequences of length $n$
.
The formula for the numerator is a little more difficult to compose. One key insight is noticing that the sequence of 2 binary digits forms the base of the sequence of 3 binary digits. All of the 3-sequence possibilities are just the 2-sequence possibilities with an extra digit added to them. Similarly the 3-sequence forms the base for the 4-sequence.
Using this technique, we can start to compose a formula for the numerator. For the sequence of length 2, there are 2 out of the 4 options that have repeats. Any extra digit added to one of these sequences will still form a sequence with repeats. No matter whether that extra digit was a 0 or a 1.
We can write this insight into a formula like so:
$$Repeating=f(n)=2*f(n-1) + ???$$
The number of repeating sequences for a series of length $n$
is 2 times the number of repeating sequences for a series of length $n-1$ plus something else. The 2 in the formula comes from the fact that adding an extra digit creates 2 more possibilities . Both possibilities are still characteristically repeating. e.g. $00 \rightarrow 000 \space or \space 001$
The formula to describe the remaining sequences is a little trickier. The $???$
part of the formula. So I’m just going to write out the results we counted and defer uncovering the formula until the next section.
$$P(X) = f(n) / 2^n$$
$$f(0) = 0$$
$$f(1) = 0$$
$$f(2) = 2$$
$$f(3) = 2 * f(2) + 2 = 6$$
$$f(4) = 2 * f(3) + 2 = 14$$
Writing it out, it looks like there might be a pattern… $4-2=2$
, $8-2=6$
, $16-2=14$
It appears we could use the formula $f(n)=2^n-2$
, to describe the scenario. But we should be careful about generalizing too early. Remember, we are starting with the simplest base case. So it’s possible that simple formulas are sufficient to describe this scenario. But, those formulas won’t generalize when we start getting more complex. Ultimately, we need a formula for using 10 digits.
Moving to more digits
Let’s expand to 3 options for each digit. 0,1, 2.
For a sequence of length 2, there are 3 options for the first digit, and 3 options for the second digit. $Total=3*3=3^2$
. This looks very similar to our formula from the base case for the total number of possibilities. In fact, we can slightly modify our equation to describe both scenarios. To do that, we need to introduce another variable $k$
, to describe the number of options for each digit. With that extra variable, we can write that the total number of possibilities can be described with the formula $Total=k^n$
, for a sequence of length $n$
with $k$
options for each digit. This formula may look familiar to anybody with stats experience. It is the formula for the number of permutations.
Counting all the possibilities starts to get unwieldy for the trinary sequences. But we can still comfortably count a sequence of length 2.
00, 01, 02, 10, 11, 12, 20, 21, 22
3 options have repeated digits out of the 9 possibilities. $P(X)=\frac{3}{9}$
We won’t enumerate the possibilities for a sequence of length 3. There are 27. But we can use the technique we used in the base case to find out the number of repeating sequences.
For the 3 options that already had a repeat, it does not matter which of the 3 options we add next. We will still have a repeating sequence. $f(n) =3*3 = 9$
For all the other sequences, we notice that if the previous sequence ended in a 0, we need a 0 to form a repeating sequence. If the next digit is a 1 or a 2, it will not form a repeating sequence. The same logic applies if the previous sequence ended in a 1 or a 2. We have a $\frac{1}{3}$
chance of forming a repeating sequence from the previous set of non-repeating sequences.
So if we can count how many non-repeating sequences there are, we can multiply that by $\frac{1}{3} * 3$
to get the new count of repeating sequences.
The number of non-repeating possibilities is really easy to find. It is the total number of possibilities minus the number of repeating possibilities. We already have formulas for both of those! $Nonrepeating = Total - Repeating =k^n - f(n-1)$
.
Putting these into a formulation.
$$f(0) = 0$$
$$f(1) = 0$$
$$f(2) = 3$$
$$f(3) = 3 * f(2) + [3*\frac{1}{3}*(3^2 - f(2))]$$
$$= 2*f(2) + 3^2 = 6 + 9 = 15$$
$$f(4) = 3 * f(3) + [3*\frac{1}{3}*(3^3 - f(3))]$$
$$= 2*f(3) + 3^3 = 30 + 27 = 57$$
Note: I didn’t say this before, but this style of formulation is known as a recurrence relation. When the current case can be described using the previous cases. That’s why I wrote out the formulas for $f(0)$
, $f(1)$
, and $f(2)$
. Those formulas are necessary to form the base cases for the recurrence relation.
Bringing it all together
So far, we’ve used fairly simple techniques to find formulas to solve our problem with binary and trinary sequences. Now, I think we’re prepared to tackle the original problem statement. With 10 digits available to us, how often can we expect a 6 digit sequence to contain repeating digits?
For a sequence of length 2, there are $10^2 = 100$
possibilities. 10 of those have repeating digits. The paired digits.
00, 11, 22, ... , 88, 99
For a sequence of length 3, there are $10^3$
possibilities. All the sequences that were already repeating, are still repeating no matter what digit is added at the end. Of the remaining non-repeating sequences, these can be turned into a repeating sequence if the digit that is added is the same as the previous digit. This next digit will match the previous digit 1/10th of the time.
Jumping straight into the formulation.
$$f(0) = 0$$
$$f(1) = 0$$
$$f(2) = 10$$
$$f(3) = 10 * f(2) + [10*\frac{1}{10}*(10^2 - f(2))]$$
$$= 9 * f(2) + 10^2 = 190$$
$$f(4) = 10 * f(3) + [10*\frac{1}{10}*(10^3 - f(3))]$$
$$= 9 * f(3) + 10^3 = 2710$$
$$f(5) = 10 * f(4) + [10*\frac{1}{10}*(10^4 - f(4))]$$
$$= 9 * f(4) + 10^4 = 34390$$
Almost there! I wish there was more of a climax at this point in the post. However, the epiphanies all happened in the previous 2 sections. Here we’re just calculating the results of a formula. I guess the lesson is don’t be scared of big, imposing problems. Instead, break them down into chunks you can solve (by counting if necessary!). And then build your way up. There’s your takeaway. And here’s your answer.
$$f(6) = 10 * f(5) + [10*\frac{1}{10}*(10^4 - f(5))]$$
$$= 9 * f(5) + 10^5$$
$$= 409510$$
There are $10^6 = 1,000,000$
total possibilities for a sequence of length 6 using 10 possible digits. 409,510 of those contain repeating digits. ~41%.
The wrap up
If my math from above is correct (which it still might not be), then I should definitely be concerned if every two-factor code I get contains repeating characters.
What (I hope) is a more likely occurrence, is that some cognitive bias is at play. My brain will selectively notice and remember two-factor codes that appear to contain information. Those codes which stand-out ever so slightly. Codes with repeating digits. Codes that start and end with the same digit. Codes that have a pair of matching digits sandwiched around another digit. My brain will try to pattern-match because it thinks there is information hidden in the numbers.
The availability heuristic, salience bias, and the Von Restorff effect are probably the biases at play. I’m not totally clear on the distinction between those three at this time. But they’ll be good further reading options.
As an aside, these same biases are also why students taking standardized tests will second-guess themselves when they get 5 Cs in a row. Something that is supposed to be encoded as random attracts attention when our brains try to find patterns. And the data which our brains thinks is sufficiently random are dropped from memory and avoid attention.
EDIT: Looks like my math was correct! There’s a discrete math PDF that asks this exact same question and arrived at the same answer. Huzzah! http://www.discrete-math-hub.com/modules/S18_Ch_4_1.pdf