Last week, I was at the Mathematics and Machine
Learning program at Harvard's
Center of Mathematical Sciences and Applications.1
1Later Update: I'll be
talking about this and related experiments on October 29th at Harvard. The talk
should be made available on YouTube.
The underlying topic was number theory, and I've been studying various
number-theoretic problems from a machine learning perspective.
I've been running several experiments related to estimating the
Möbius function $\mu(n)$.
I don't expect $\mu(n)$ to be easily approximable; all earlier attempts to
study $\mu(n)$ using machine learning have not had much success.
This is perhaps related to Möbius Randomness.2
2See for example Peter Sarnak's
Three Lectures on the Mobius Function Randomness and Dynamics.
Previous machine learning experiments studying $\mu(n)$ have used neural
networks or classifiers. Francois Charton made an integer-sequence to integer-sequence
transformer-based translator, Int2Int,
and I thought it would be fun to see whether this works any differently.
Initially, I sought to get Int2Int to work. Then I set it on various examples.
I describe some of them here.
I'm splitting my description into two parts: a general report and a technical
report. This is the general report.3
3This is also
available as a pdf.
The
technical4
4 By "technical" here, I mean pertaining to technology (i.e. to
programming). Both notes are nonelementary. But I acknowledge that there are
very few people who are experts in both number theory and machine
learning.
report includes technical details for running or re-running
Int2Int experiments and other programming-related aspects.
Möbius Function
Recall that the Möbius function $\mu(n)$ is $0$ if the square of any prime
divides $n$, and otherwise is $(-1)^{\omega(n)}$, where $\omega(n)$ is the number
of prime divisors of $n$. For example, $\mu(1) = 1$, $\mu(2) = -1$, $\mu(4) = 0$, $\mu(6) = 1$, and so on.
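For reference, here is a minimal, unoptimized way to compute $\mu(n)$ directly from this definition (a sketch for illustration, not the code used in the experiments):

```python
def mobius(n: int) -> int:
    """Compute the Möbius function by trial division.

    Returns 0 if n is divisible by the square of a prime, and
    otherwise (-1)**omega, where omega counts the distinct prime
    divisors of n.
    """
    omega, d = 0, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:   # d^2 divided the original n
                return 0
            omega += 1
        d += 1
    if n > 1:                # leftover prime factor
        omega += 1
    return (-1) ** omega

print([mobius(n) for n in range(1, 13)])
# [1, -1, -1, 0, -1, 1, -1, 0, 0, 1, -1, 0]
```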
Int2Int takes as input a sequence of integers, and the output is a sequence of
integers. I struggled to make good use of outputs consisting of many integers,
but this is really my own problem.
Inputs and Outputs for Möbius
I tried several variations to estimate $\mu(n)$, including
- Input just $n$ and output $\mu(n)$. (Or rather, make sure I can get Int2Int
to process anything at all with the simplest possible example.)
- Input $n$ and $n \bmod p$ for the first several primes $p$.
- Input $n$ and $n \bmod p$ for the next several primes $p$ (skipping the
smallest primes).
- Input the Legendre symbols $\left(\frac{n}{p}\right)$ for the first several primes $p$.
- Input $n$, $n \bmod p$, and $\left(\frac{n}{p}\right)$ for the first several primes $p$.
For each of these, I estimated $\mu(n)$ itself as well as $\mu(n)^2$ and related variants.
The inputs $n$ were sampled uniformly at random from a large range
(with a few larger experiments here and there), using smaller training sets
for initial runs and larger training sets to investigate further.
I also trained over $n$ restricted to be squarefree.
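For a rough idea of the shape of the data in these variations (the exact encoding that Int2Int expects is described in the technical report; the sampling range and number of primes below are arbitrary placeholders), a sketch of a data generator:

```python
import random

# mobius(n) as in the snippet above, repeated so this sketch is self-contained.
def mobius(n):
    omega, d = 0, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0
            omega += 1
        d += 1
    return (-1) ** (omega + (n > 1))

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]   # "the first several primes"

def make_row(lo=2, hi=10**7):
    """One (input, output) pair for the 'n and n mod p' variation."""
    n = random.randint(lo, hi)
    return [n] + [n % p for p in PRIMES], mobius(n)

random.seed(0)
for _ in range(3):
    print(make_row())
```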
Better than Random: squarefree detection
I quickly saw that Int2Int could guess $\mu(n)$ better than random guesses. But
the reason was that it was determining whether $n$ was squarefree or not
with reasonable accuracy.5
5This is similar to a pattern observed by Jordan
Ellenberg when attempting to train neural networks to estimate $\mu(n)$. The
network seemed to figure out eventually that $4 \mid n$ implies $\mu(n) = 0$,
and then sometime later that $9 \mid n$ also implies $\mu(n) = 0$. Presumably
it would figure out other squares later, eventually.
The Int2Int models were determining whether $n$ was squarefree or not with very
high accuracy, and then guessing randomly between $1$ and $-1$ when it thought $n$
was squarefree. Some of these models were guessing $\mu(n)$ correctly far more
often than chance.
Looking closer, the best-performing model (which also had the most data: $n$ and $n \bmod p$ for the first several primes $p$) correctly recognized
the vast majority of squareful numbers,6
6Be careful with what the
condition is here. In particular, this doesn't say that the model's predictions
of $\mu(n) = 0$ are correct that often.
but recognized squarefree numbers
as squarefree noticeably less reliably.
Using that the density of squareful numbers is about $1 - 6/\pi^2 \approx 0.39$, this
decomposition recovers the overall correctness observed above.
The model tended to overestimate the number of squareful numbers and guessed
that several squarefree numbers were squareful.
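Concretely, the bookkeeping here looks like the following sketch. The two model rates are made-up placeholders (not the numbers from these experiments); the density $6/\pi^2$ of squarefree integers is derived below.

```python
from math import pi

DENSITY_SQUAREFREE = 6 / pi**2              # about 0.608, derived below
DENSITY_SQUAREFUL = 1 - DENSITY_SQUAREFREE  # about 0.392

# Placeholder rates, NOT the experimental numbers:
p_detect_squareful = 0.95    # P(model predicts mu(n) = 0 | n squareful)
p_detect_squarefree = 0.90   # P(model predicts "squarefree" | n squarefree)

# When the model thinks n is squarefree, it guesses +1 or -1 at random,
# so only half of those guesses are right.
overall = (DENSITY_SQUAREFUL * p_detect_squareful
           + DENSITY_SQUAREFREE * p_detect_squarefree * 0.5)
print(f"overall accuracy for mu(n): {overall:.3f}")
```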
This squarefree detection occurred quickly when trained using quadratic residue symbols.
I wasn't initially surprised by this, because of course Legendre symbols include
information about squares.
Thus it should be possible to quickly train a network to recognize most squares
given $\left(\frac{n}{p}\right)$ for the first several primes $p$ (most numbers are divisible mostly by
small primes, and hence checking small prime behavior usually suffices).
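As a quick illustration of that claim (range and number of primes chosen arbitrarily), checking Euler's criterion at a handful of odd primes already certifies most non-squares:

```python
import math

ODD_PRIMES = [3, 5, 7, 11, 13]

def legendre(n, p):
    """Legendre symbol (n/p) for an odd prime p, via Euler's criterion."""
    r = pow(n, (p - 1) // 2, p)
    return -1 if r == p - 1 else r   # r is 0, 1, or p - 1

certified = total = 0
for n in range(2, 10**5):
    if math.isqrt(n) ** 2 == n:      # skip actual squares
        continue
    total += 1
    # a single symbol equal to -1 proves n is not a perfect square
    if any(legendre(n, p) == -1 for p in ODD_PRIMES):
        certified += 1

print(f"non-squares certified by 5 small primes: {certified / total:.3f}")
# prints a fraction above 0.9: most non-squares are caught immediately
```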
But here we're looking at numbers that are or are not squarefree: multiplying a
square by a squarefree number mixes up all the quadratic residues and
nonresidues.
With a bit more training, having only $n \bmod p$ for the first several primes $p$
produced very similar behavior.
How was it doing this?
This is an interesting purely mathematical question: how would you guess
whether $n$ is squarefree or not, given $n \bmod p$ for lots of primes $p$?
One way would be to perform the Chinese remainder theorem, reconstruct $n$, and
then actually check.
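A sketch of that reconstruct-and-check approach (with enough primes that their product exceeds the sampling range, so CRT pins $n$ down exactly):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]  # product is about 7.4 * 10^12

def crt(residues, moduli):
    """Chinese remainder theorem for pairwise coprime moduli."""
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        t = ((r - x) * pow(M, -1, m)) % m   # solve x + M*t = r (mod m)
        x += M * t
        M *= m
    return x

def is_squarefree(n):
    d = 2
    while d * d <= n:
        if n % (d * d) == 0:
            return False
        if n % d == 0:
            n //= d
        d += 1
    return True

n = 123456789
recovered = crt([n % p for p in PRIMES], PRIMES)
print(recovered, is_squarefree(recovered))   # 123456789 False (since 9 divides it)
```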
Is the model recognizing something like this?
To test, I ran several experiments along the following lines:
- Given $n \bmod p$ for the first several primes, output whether $n$ lies in a
fixed interval.
- Given $n \bmod p$ for the first several primes but excluding one prime $q$, output
$n \bmod q$.
These probe CRT-type knowledge.
I sampled the inputs $n$ uniformly at random from large intervals.
The frequencies of each residue class should therefore be approximately
uniformly distributed.
But the model never did better than random guessing on either type of
experiment.
My guess is that the model isn't recovering CRT-like information.
How to guess if $\mu(n) = 0$ given $n \bmod p$
After talking with Noam Elkies and Andrew Sutherland, I think I know how the
model is guessing when $\mu(n) = 0$ with such high accuracy.
The point is that numbers that are not squarefree are probably divisible by a
small square, and thus likely to be $0$ mod a small prime.
Numbers that are squarefree might be $0$ mod a small prime, but not as often.
Let's look at this in greater detail.
The zeta function associated to squarefree numbers is
$$\sum_{\substack{n \geq 1 \\ n \text{ squarefree}}} \frac{1}{n^s} = \prod_p \Big(1 + \frac{1}{p^s}\Big) = \frac{\zeta(s)}{\zeta(2s)}.$$
Thus the ratio of numbers up to $X$ that are squarefree is about7
7This is
tangentially related to my note from yesterday.
$$\frac{1}{\zeta(2)} = \frac{6}{\pi^2} \approx 0.6079.$$
The default algorithm to use would be to guess that every integer is
squarefree: this is right just over $60$ percent of the time.
We need to do better than that.
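A quick empirical check of that density (the cutoff here is an arbitrary choice):

```python
from math import pi, isqrt

X = 10**6
squarefree = [True] * (X + 1)
for d in range(2, isqrt(X) + 1):
    for m in range(d * d, X + 1, d * d):   # strike out multiples of d^2
        squarefree[m] = False

print(sum(squarefree[1:]) / X, 6 / pi**2)  # both are about 0.6079
```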
The zeta function associated to even squarefree numbers is
$$\frac{1}{2^s} \prod_{p > 2} \Big(1 + \frac{1}{p^s}\Big) = \frac{2^{-s}}{1 + 2^{-s}} \cdot \frac{\zeta(s)}{\zeta(2s)}.$$
It follows that the ratio of numbers up to $X$ that are even and squarefree is
about
$$\frac{1/2}{3/2} \cdot \frac{6}{\pi^2} = \frac{2}{\pi^2} \approx 0.2026.$$
This implies that the remaining squarefree integers up to $X$, about
$4/\pi^2 \approx 0.405$ of all integers, are odd.
We could see this directly by noting that the corresponding zeta function is
$$\frac{1}{1 + 2^{-s}} \cdot \frac{\zeta(s)}{\zeta(2s)}$$
and computing the residue at $s = 1$, which gives $4/\pi^2$.
A squarefree integer is twice as likely to be odd as even.
For this classification problem, we're interested in the converse conditional:
what is the probability that $n$ is squarefree given that it is even (or odd)?
Basic probability shows that
$$\mathrm{Prob}(n \text{ squarefree} \mid n \text{ even}) = \frac{2/\pi^2}{1/2} = \frac{4}{\pi^2} \approx 0.405$$
and
$$\mathrm{Prob}(n \text{ squarefree} \mid n \text{ odd}) = \frac{4/\pi^2}{1/2} = \frac{8}{\pi^2} \approx 0.811.$$
This already gives a better-than-naive strategy: if $n$ is even, guess that
it's not squarefree (correct about $1 - 4/\pi^2 \approx 0.595$ of the time); if $n$
is odd, then guess squarefree (correct about $8/\pi^2 \approx 0.811$ of the time).
This should be correct about $70$ percent of the time (or more precisely,
$\tfrac{1}{2} + \tfrac{2}{\pi^2} \approx 0.7026$).
As $0.70 > 0.61$, this type of thinking is an improvement.
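Here is a short numerical sanity check of the parity strategy (same squarefree sieve as in the previous snippet):

```python
from math import pi, isqrt

X = 10**6
squarefree = [True] * (X + 1)
for d in range(2, isqrt(X) + 1):
    for m in range(d * d, X + 1, d * d):
        squarefree[m] = False

# Guess "squarefree" exactly when n is odd.
correct = sum((n % 2 == 1) == squarefree[n] for n in range(1, X + 1))
print(correct / X, 0.5 + 2 / pi**2)   # both are about 0.70
```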
This readily generalizes to other primes.
The Dirichlet series for squarefree numbers that are divisible by a fixed prime
$p$ is
$$\frac{p^{-s}}{1 + p^{-s}} \cdot \frac{\zeta(s)}{\zeta(2s)},$$
and the series for squarefree numbers that aren't divisible by a fixed prime $p$
is the same, but without the factor $p^{-s}$ in the numerator.
Thus the proportions of integers that are squarefree and divisible by $p$ or not
divisible by $p$ are, respectively,
$$\frac{1}{p+1} \cdot \frac{6}{\pi^2}
\quad\text{and}\quad
\frac{p}{p+1} \cdot \frac{6}{\pi^2}.$$
Playing conditional probabilities as above shows that
$$\mathrm{Prob}(n \text{ squarefree} \mid p\text{-even}) = \frac{p}{p+1} \cdot \frac{6}{\pi^2}
\quad\text{and}\quad
\mathrm{Prob}(n \text{ squarefree} \mid p\text{-odd}) = \frac{p^2}{p^2 - 1} \cdot \frac{6}{\pi^2}.$$
I use the ad hoc shorthand $p$-even to mean divisible by $p$, and $p$-odd to
mean not divisible by $p$.
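Tabulating these conditional probabilities for the first few primes:

```python
from math import pi

# P(squarefree | p-even) and P(squarefree | p-odd) from the densities above
for p in [2, 3, 5, 7, 11, 13]:
    given_p_even = (p / (p + 1)) * 6 / pi**2
    given_p_odd = (p**2 / (p**2 - 1)) * 6 / pi**2
    print(f"p = {p:2d}:  {given_p_even:.3f}  vs  {given_p_odd:.3f}")
```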
The differences are largest when the prime $p$ is small.
A good strategy would then be to look at a couple of small primes and
predict whether $n$ is squarefree based on the divisibility of $n$ by those
primes.
I've ignored all the joint probabilities. These are explicitly computable by
computing the local densities at the appropriate primes, as above.
But the point is that divisibility by small primes correlates nontrivially with
$n$ being squarefree, and this sort of correlation table is something that we
should expect machine learning to figure out.
Explicit computation shows that using the first several primes, and guessing
"squarefree" or "not squarefree" according to whichever is more likely for the
observed divisibility pattern, yields an overall correct rate only marginally
higher than using $2$ alone.
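That explicit computation can also be mimicked empirically: tally squarefree versus not for every divisibility pattern, guess the majority class for each pattern, and measure the accuracy. (The cutoff and the list of primes below are arbitrary choices, and the evaluation is in-sample.)

```python
from math import isqrt
from collections import Counter, defaultdict

X = 10**6
PRIMES = [2, 3, 5, 7, 11]

squarefree = [True] * (X + 1)
for d in range(2, isqrt(X) + 1):
    for m in range(d * d, X + 1, d * d):
        squarefree[m] = False

tally = defaultdict(Counter)
for n in range(1, X + 1):
    pattern = tuple(n % p == 0 for p in PRIMES)
    tally[pattern][squarefree[n]] += 1

# Majority-class guessing, pattern by pattern.
correct = sum(max(counts.values()) for counts in tally.values())
print(f"accuracy from divisibility patterns: {correct / X:.4f}")
# around 0.70: only marginally better than using parity alone
```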
We can hope that machine learning algorithms could do better. Computing the
table of cross correlations given sufficient data isn't hard. But ML models
should also determine weights to use to better predict outcomes. Predicting
what ML can or can't do is much harder.
Pure squarefree detection
With the same inputs, I looked at guessing $\mu(n)^2$. That is, I tried to look
just at the squarefree detection power.
Overall, the models' accuracy was consistent with the behavior above and with
the heuristic that they could only use mod $p$ information.
Restricting to squarefree $n$
In the other direction, I also restricted all inputs to squarefree $n$. This
balances the expected outputs: about $50$ percent should correspond to
$\mu(n) = 1$ and about $50$ percent to $\mu(n) = -1$. Any prediction with
accuracy greater than $50$ percent would be a major achievement.
But ultimately none of the models I checked did any better than $50$ percent
consistently.
Removing $2$
Still input $n \bmod p$ for several primes $p$, but now use only the primes
after $2$.
As we saw above, $2$ has the most explanatory power using pure Bayesian
probability.
This asks: is the machine learning doing anything other than
the mod $2$ cross correlations described above?
In short, the performance for guessing $\mu(n)$ plummeted.
The performance was consistent with determining whether $n$ was squarefree
correctly about $64$ percent of the time, and then guessing randomly
between $1$ and $-1$ when $n$ was determined to be squarefree.
And this is consistent with using the pure Bayesian probabilistic approach
on exactly the prime $3$.
Indeed, the probability that $n$ is squarefree given that $3$ divides $n$ is
$\frac{9}{2\pi^2} \approx 0.456$, and the probability that $n$ is
squarefree given that $3$ doesn't divide $n$ is $\frac{27}{4\pi^2} \approx 0.684$.
Thus $1/3$ of the time we would guess "not squarefree" with accuracy about
$1 - \frac{9}{2\pi^2} \approx 0.544$, and the rest of the time we would guess "squarefree" with
accuracy about $\frac{27}{4\pi^2} \approx 0.684$, giving a total accuracy around
$$\frac{1}{3}\Big(1 - \frac{9}{2\pi^2}\Big) + \frac{2}{3} \cdot \frac{27}{4\pi^2} = \frac{1}{3} + \frac{3}{\pi^2} \approx 0.637.$$
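And a final numerical check of this prime-$3$-only strategy (same arbitrary cutoff and sieve as before):

```python
from math import pi, isqrt

X = 10**6
squarefree = [True] * (X + 1)
for d in range(2, isqrt(X) + 1):
    for m in range(d * d, X + 1, d * d):
        squarefree[m] = False

# Guess "not squarefree" when 3 divides n, and "squarefree" otherwise.
correct = sum((n % 3 != 0) == squarefree[n] for n in range(1, X + 1))
print(correct / X, 1/3 + 3 / pi**2)   # both are about 0.637
```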