Bayesian Spam Filtering

Lecture



As an introduction


At present, Bayesian methods are fairly widespread and are actively used in various fields of knowledge. Unfortunately, however, not many people are familiar with what they are and why they are needed. One of the reasons is the lack of literature in Russian. Therefore, here I will try to explain their principles as simply as I can, starting from the very beginning (I apologize if it seems too simple to some of you).

Later on, I would like to move to Bayesian analysis proper and talk about processing real data with what is, in my opinion, an excellent alternative to the R language (a little was written about it here): Python with the pymc module. Personally, Python seems much clearer and more logical to me than R with the JAGS and BUGS packages; besides, Python gives much more freedom and flexibility (although Python has its own difficulties, they are surmountable and rarely arise in simple analyses).

A bit of history


As a brief historical background, I will say that the Bayes formula was published in 1763, 2 years after the death of its author, Thomas Bayes. However, the methods that use it became truly widespread only toward the end of the twentieth century. This is explained by the fact that the calculations require considerable computational resources, which became available only with the development of information technology.

On probability and the Bayes theorem


The Bayes formula and all subsequent statements require an understanding of probability. More about probability can be read on Wikipedia.
In practice, the probability of an event is the frequency of its occurrence, that is, the ratio of the number of observations of the event to the total number of observations, for a large (theoretically infinite) total number of observations.
Consider the following experiment: we call an arbitrary number from the interval [0, 1] and check whether it falls between, say, 0.1 and 0.4. As you might guess, the probability of this event equals the ratio of the length of the segment [0.1, 0.4] to the total length of the segment [0, 1] (in other words, the ratio of the "number" of possible equiprobable values to the total "number" of values), that is, (0.4 - 0.1) / (1 - 0) = 0.3, so the probability of falling into the interval [0.1, 0.4] is 30%.
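As a quick sanity check, this probability can be estimated by simulation; here is a minimal sketch (the number of trials is an arbitrary choice):

import numpy

trials = 10**6
values = numpy.random.uniform(0., 1., size=trials)     # "calling" random numbers from [0, 1]
hits = numpy.sum((values >= 0.1) & (values <= 0.4))    # how often we land in [0.1, 0.4]
print(float(hits) / trials)                            # close to 0.3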

Now look at the square [0, 1] x [0, 1].
[Figure: the unit square [0, 1] x [0, 1]; the strip 0.1 <= x <= 0.4 is shaded blue, the strip 0.5 <= y <= 0.7 is shaded green, and their intersection is the dark region]

Suppose we have to call pairs of numbers (x, y), each of which is greater than zero and less than one. The probability that x (the first number) falls within the interval [0.1, 0.4] (shown in the figure as the blue region; for the moment the second number y does not matter) equals the ratio of the area of the blue region to the area of the entire square, that is, (0.4 - 0.1) * (1 - 0) / (1 * 1) = 0.3, i.e. 30%. Thus we can write that the probability that x belongs to the segment [0.1, 0.4] is p(0.1 <= x <= 0.4) = 0.3 or, for short, p(X) = 0.3.
If we now look at y, then, similarly, the probability that y is inside the segment [0.5, 0.7] equals the ratio of the area of the green region to the area of the entire square: p(0.5 <= y <= 0.7) = 0.2 or, for short, p(Y) = 0.2.
Now let's see what we can learn about the values of x and y simultaneously.
If we want to know the probability that both x and y are in their respective segments, we need to compute the ratio of the dark area (the intersection of the green and blue regions) to the area of the entire square: p(X, Y) = (0.4 - 0.1) * (0.7 - 0.5) / (1 * 1) = 0.06.

And now suppose we want to know the probability that y is in the interval [0.5, 0.7], given that x is already in the interval [0.1, 0.4]. In effect, we have a filter: as we call pairs (x, y), we immediately discard the pairs that do not satisfy the condition on x, and then, among the remaining pairs, we count those for which y satisfies our condition; the probability is the ratio of the number of pairs with y in the given segment to the total number of filtered pairs (those with x in [0.1, 0.4]). We write this probability as p(Y | X). Clearly, it equals the ratio of the area of the dark region (the intersection of the green and blue regions) to the area of the blue region. The area of the dark region is (0.4 - 0.1) * (0.7 - 0.5) = 0.06, and the area of the blue region is (0.4 - 0.1) * (1 - 0) = 0.3, so their ratio is 0.06 / 0.3 = 0.2. In other words, the probability of finding y in [0.5, 0.7], given that x belongs to [0.1, 0.4], is p(Y | X) = 0.2.
Given all of the above and the notation introduced, we can write the following expression:
p(Y | X) = p(X, Y) / p(X)

Let us briefly repeat the previous reasoning for p(X | Y): we call pairs (x, y) and keep those for which y lies between 0.5 and 0.7; then the probability that x lies in the interval [0.1, 0.4], given that y belongs to the segment [0.5, 0.7], equals the ratio of the area of the dark region to the area of the green one:
p(X | Y) = p(X, Y) / p(Y)
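These conditional probabilities can also be checked numerically; here is a small sketch (the number of pairs, 10**6, is an arbitrary choice):

import numpy

n = 10**6
x = numpy.random.uniform(0., 1., size=n)
y = numpy.random.uniform(0., 1., size=n)

in_x = (x >= 0.1) & (x <= 0.4)    # event X
in_y = (y >= 0.5) & (y <= 0.7)    # event Y

p_x = in_x.mean()
p_y = in_y.mean()
p_xy = (in_x & in_y).mean()

print(p_xy / p_x)   # p(Y | X), close to 0.2
print(p_xy / p_y)   # p(X | Y), close to 0.3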

In the two formulas above we see that the term p(X, Y) is the same, so we can eliminate it:
p(Y | X) * p(X) = p(X | Y) * p(Y)
We can rewrite the last equality as
p(X | Y) = p(Y | X) * p(X) / p(Y)

This is the Bayes theorem.
It is also interesting to note that p(Y) is in fact p(X, Y) taken over all values of X. That is, if we take the dark region and stretch it so that it covers all values of X, it will exactly repeat the green region, which means it equals p(Y). In the language of mathematics this means the following: p(Y) = ∫ p(X, Y) dX = ∫ p(Y | X) * p(X) dX
Then we can rewrite the Bayes formula as follows:
p(X | Y) = p(Y | X) * p(X) / ∫ p(Y | X) * p(X) dX

Application of the Bayes theorem


Let's consider the following example. Take a coin and toss it 3 times. With equal probability we can get the following results (H for heads, T for tails): HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.

For each outcome we can count how many heads came up and how many changes there were (a change is a switch from heads to tails or from tails to heads between consecutive tosses):

Outcome:   HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
Heads:      3    2    2    1    2    1    1    0
Changes:    0    1    2    1    1    2    1    0

We can treat the number of heads and the number of changes as two random variables. Then the probability table has the following form (rows: number of changes, columns: number of heads; each cell is the probability of that combination):

              0 heads   1 head   2 heads   3 heads
0 changes       1/8       0         0        1/8
1 change         0       2/8       2/8        0
2 changes        0       1/8       1/8        0

Now we can see the Bayes formula in action.
But before that, let us draw an analogy with the square we considered earlier.
You can see that p(1H) is the sum of the column for one head (the "blue region" of the square), that is, the sum of all cell values in that column: p(1H) = 2/8 + 1/8 = 3/8.
p(1C) is the sum of the row for one change (the "green region" of the square) and, similarly, equals the sum of all cell values in that row: p(1C) = 2/8 + 2/8 = 4/8.
The probability that we got one head and one change equals the intersection of these regions (that is, the value in the cell where that column and that row cross): p(1C, 1H) = 2/8.
Then, following the formulas described above, we can compute the probability of getting one change given that we got one head in three tosses:
p(1C | 1H) = p(1C, 1H) / p(1H) = (2/8) / (3/8) = 2/3
or the probability of getting one head given that we got one change:
p(1H | 1C) = p(1C, 1H) / p(1C) = (2/8) / (4/8) = 1/2
If we now compute the probability of getting one head given one change, p(1H | 1C), via Bayes' formula, we get:
p(1H | 1C) = p(1C | 1H) * p(1H) / p(1C) = (2/3) * (3/8) / (4/8) = 1/2
which is exactly what we obtained above.
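The same numbers can be obtained by simply enumerating the eight outcomes in Python (a small illustrative sketch):

from itertools import product

outcomes = list(product("HT", repeat=3))   # all 8 equiprobable outcomes
heads = [o.count("H") for o in outcomes]
changes = [sum(a != b for a, b in zip(o, o[1:])) for o in outcomes]

p_1h = sum(h == 1 for h in heads) / 8.0                                   # p(1H)
p_1c = sum(c == 1 for c in changes) / 8.0                                 # p(1C)
p_joint = sum(h == 1 and c == 1 for h, c in zip(heads, changes)) / 8.0    # p(1C, 1H)

print(p_joint / p_1h)   # p(1C | 1H) = 2/3
print(p_joint / p_1c)   # p(1H | 1C) = 1/2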

But what practical value does the above example have?
The point is that when we analyze real data, we are usually interested in some parameter of these data (for example, the mean, the variance, etc.). We can then draw the following analogy with the probability table above: let the rows be our experimental data (denote them Data), and let the columns be the possible values of the parameter of interest (denote it θ). Then what interests us is the probability of obtaining a certain value of the parameter given the available data, p(θ | Data).
We can apply Bayes' formula and write the following:
p(θ | Data) = p(Data | θ) * p(θ) / p(Data)
And remembering the formula with the integral, we can write the following:
p(θ | Data) = p(Data | θ) * p(θ) / ∫ p(Data | θ) * p(θ) dθ

That is, as a result of our analysis we have the probability as a function of the parameter. Now we can, for example, maximize this function and find the most probable parameter value, compute the variance and mean of the parameter, compute the boundaries of the interval within which the parameter lies with 95% probability, and so on.

The probability p(θ | Data) is called the posterior probability. To compute it we need p(Data | θ), the likelihood function, and p(θ), the prior probability.
The likelihood function is determined by our model. That is, we create a model of how the data are generated, which depends on the parameters of interest. For example, we want to fit the data with the straight line y = a * x + b (thus we assume that the data have a linear dependence with Gaussian noise of known variance superimposed on it). Then a and b are our parameters, we want to know their most probable values, and the likelihood function is a Gaussian with the mean given by the straight-line equation and the given variance.
The prior probability includes the information we know before the analysis. For example, we may know for sure that the slope of the line must be positive, or that the value at the point where the line crosses the x axis must be positive; all of this (and more) we can incorporate into our analysis.
As you can see, the denominator of the fraction is an integral (or, when the parameter can take only certain discrete values, a sum) of the numerator over all possible values of the parameter. In practice this means that the denominator is a constant and serves to normalize the posterior probability (that is, the integral of the posterior probability equals one).
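To make the role of the denominator concrete, here is a small numerical sketch for a one-parameter case: a Gaussian likelihood with known sigma and a uniform prior on the mean mu. All the numbers are made up for illustration:

import numpy

data = numpy.array([4.8, 5.1, 5.3])        # toy observations
sigma = 0.5                                # noise level assumed known
mu_grid = numpy.linspace(0., 10., 1001)    # grid of possible parameter values

prior = numpy.ones_like(mu_grid) / 10.     # uniform prior on [0, 10]
likelihood = numpy.array([numpy.exp(-numpy.sum((data - m)**2) / (2. * sigma**2)) for m in mu_grid])

numerator = likelihood * prior
evidence = numpy.trapz(numerator, mu_grid)   # the constant denominator p(Data)
posterior = numerator / evidence

print(mu_grid[numpy.argmax(posterior)])      # the most probable value of mu
print(numpy.trapz(posterior, mu_grid))       # ~1.0: the posterior is normalized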

This post is a logical continuation of my first post about Bayesian methods, which can be found here.
I would like to talk in detail about how to carry out the analysis in practice.

As I already mentioned, the most popular tool for Bayesian analysis is the R language with the JAGS and/or BUGS packages. For an ordinary user, the difference between the packages is that JAGS is cross-platform (I have seen first-hand a conflict between Ubuntu and BUGS), and that in JAGS you can create your own functions and distributions (a need that arises infrequently, if ever). By the way, an excellent and convenient IDE for R is, for example, RStudio.
But in this post I will write about an alternative to R - Python with the pymc module.
As a convenient IDE for Python, I can offer spyder.
I prefer Python because, firstly, I see no reason to learn a language as uncommon as R just to solve a particular statistics-related problem. Secondly, Python with its modules is, from my point of view, not inferior to R in simplicity, clarity, and functionality.

I propose to solve a simple problem: finding the coefficients of a linear dependence in data. A similar parameter-optimization problem is quite common in various fields of knowledge, so I would say it is very illustrative. In the end, we will be able to compare the result of the Bayesian analysis with the least squares method.

Software installation


First of all, we need to install Python (for those who have not yet done so). I have not tried Python 3.3, but with 2.7 everything works fine.
Then you need to install the necessary modules for Python: numpy and Matplotlib.
If desired, you can also install additional modules: scipy, pyTables, pydot, IPython, and nose.
All of these installations (except Python itself) are easiest to do through setuptools.
And now you can install pymc itself (it can also be installed via setuptools).

A detailed guide on pymc can be found here.

Task statement


We have data obtained in a hypothetical experiment; they depend linearly on some quantity x. The data arrive with noise whose variance is unknown. We need to find the coefficients of the linear dependence.

Solution


First we import the modules that we have installed and which we will need:

import numpy
import pymc


Then we need to generate our hypothetical linear data.
To do this, we decide how many points we want (in this case 20), distribute them evenly on the interval [0, 10], and set the true coefficients of the linear dependence. Then we add Gaussian noise to the data:

#Generate data with noise
number_points = 20
true_coefficients = [10.4, 5.5]
x = numpy.linspace(0, 10, number_points)
noise = numpy.random.normal(size = number_points)
data = true_coefficients[0]*x + true_coefficients[1] + noise


So, we have the data, and now we need to think about how we want to conduct an analysis.
First, we know (or assume) that our noise is Gaussian, which means the likelihood function will be Gaussian. It has two parameters: the mean and the variance. Since the mean of the noise is zero, the mean of the likelihood will be given by the model value (and the model is linear, so it has two parameters). The variance, however, is unknown to us, so it becomes another parameter.
As a result, we have three parameters and a Gaussian likelihood function.
We know nothing about the values of the parameters, so a priori we assume them to be uniformly distributed with arbitrary bounds (these bounds can be pushed out as far as desired).
When specifying a prior distribution, we must give a label by which we will refer to the posterior values of the parameter (the first argument), and also give the bounds of the distribution (the second and third arguments). All of these arguments are required (there are additional optional arguments described in the documentation).

sigma = pymc.Uniform('sigma', 0., 100.)
a = pymc.Uniform('a', 0., 20.)
b = pymc.Uniform('b', 0., 20.)


Now we need to define our model. pymc has two most commonly used classes of nodes: deterministic and stochastic. If, given the input data, the returned value is uniquely determined, the node is deterministic. In our case, given the coefficients of the linear dependence, the result is uniquely determined for any point, so this is a deterministic node:

@pymc.deterministic(plot=False)
def linear_fit(a=a, b=b, x=x):
    return a*x + b


Finally, we define a likelihood function, in which the mean is the value of the model, sigma is a parameter with a given prior distribution, and data is our experimental data:

 y = pymc.Normal('y', mu=linear_fit, tau=1.0/sigma**2, value=data, observed=True) 


So, the whole model.py file looks like this:

import numpy
import pymc

#Generate data with noise
number_points = 20
true_coefficients = [10.4, 5.5]
x = numpy.linspace(0, 10, number_points)
noise = numpy.random.normal(size = number_points)
data = true_coefficients[0]*x + true_coefficients[1] + noise

#PRIORS:
#as sigma is unknown we define it as a parameter:
sigma = pymc.Uniform('sigma', 0., 100.)
#fitting the line y = a*x + b, hence the coefficients are parameters:
a = pymc.Uniform('a', 0., 20.)
b = pymc.Uniform('b', 0., 20.)

#define the model: if a, b and x are given, the return value is determined, hence the model is deterministic:
@pymc.deterministic(plot=False)
def linear_fit(a=a, b=b, x=x):
    return a*x + b

#LIKELIHOOD
#normal likelihood with observed data (with noise), model value and sigma
y = pymc.Normal('y', mu=linear_fit, tau=1.0/sigma**2, value=data, observed=True)



Now I propose to make a small theoretical digression and discuss what pymc does after all.

From the point of view of mathematicians, we need to solve the following problem:
p (a, b, sigma | Data) = p (Data | a, b, sigma) * p (a, b, sigma) / p (Data)

Since a, b, and sigma are independent, we can rewrite the equation as follows:
p (a, b, sigma | Data) = p (Data | a, b, sigma) * p (a) * p (b) * p (sigma) / p (Data)

On paper the problem looks very simple, but when we solve it numerically (and keep in mind that we want to solve any problem of this class numerically, not only those with particular kinds of probability distributions), difficulties arise.

p (Data) is, as discussed in my previous post, a constant.
p(Data | a, b, sigma) is given explicitly (that is, with known a, b, and sigma we can uniquely compute the probability of our available data).
And here, instead of p(a), p(b), and p(sigma), we in fact only have generators of pseudo-random values distributed according to the laws we specified.
How do we get the posterior distribution out of all this? That's right: we generate (sample) a, b, and sigma, and then compute p(Data | a, b, sigma). As a result we get a chain of values that is a sample from the posterior distribution. But this raises the question of how to do this sampling correctly. If our posterior distribution has several modes ("hills"), how can we generate a sample that covers all the modes? That is, the task is to make a sample that "covers" the entire distribution efficiently, in the smallest number of iterations. There are several algorithms for this, the most used of which are MCMC (Markov chain Monte Carlo) methods. A Markov chain is a sequence of random events in which each element depends on the previous one, but not on the elements before that. I will not describe the algorithm itself (it could be the topic of a separate post); let me just note that pymc implements this algorithm and as a result returns a Markov chain, which is a sample from the posterior distribution. Generally speaking, if we do not want our sample to be a Markov chain (that is, if we want its elements to be independent), we simply need to "thin" it, i.e. take, for example, every second element.
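To make the MCMC idea less abstract, here is a minimal sketch of the Metropolis algorithm (the simplest MCMC variant) for this very problem. It is only an illustration, not what pymc actually runs internally, and the function names, step size, and starting point are arbitrary assumptions:

import numpy

def log_posterior(params, x, data):
    # log of the unnormalized posterior p(Data | a, b, sigma) * p(a) * p(b) * p(sigma)
    a, b, sigma = params
    if not (0. < a < 20. and 0. < b < 20. and 0. < sigma < 100.):
        return -numpy.inf   # outside the uniform priors
    residuals = data - (a*x + b)
    return -0.5 * numpy.sum(residuals**2) / sigma**2 - len(data) * numpy.log(sigma)

def metropolis(x, data, n_iter=10000, step=0.1):
    current = numpy.array([1., 1., 1.])   # arbitrary starting point (a, b, sigma)
    chain = []
    for _ in range(n_iter):
        proposal = current + step * numpy.random.normal(size=3)
        # accept with probability min(1, posterior(proposal) / posterior(current))
        if numpy.log(numpy.random.rand()) < log_posterior(proposal, x, data) - log_posterior(current, x, data):
            current = proposal
        chain.append(current.copy())
    return numpy.array(chain)

Called as metropolis(x, data) with the x and data arrays generated above, it returns a chain whose columns (after discarding the initial iterations) are samples of a, b, and sigma.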
So, we create a second file, let's call it run_model.py, in which we will generate the Markov chain. The files model.py and run_model.py must be in the same folder; otherwise the following code should be added to run_model.py:

from sys import path
path.append("path/to/folder/with/model.py/")


First we import some modules that will be useful to us:

from numpy import polyfit
from matplotlib.pyplot import figure, plot, show, legend
import pymc
import model


polyfit implements the least squares method; we will compare the Bayesian analysis against it.
figure, plot, show, legend are needed to build the final plot.
model is, in fact, our model.

Then we create the MCMC object and run the sample:

D = pymc.MCMC(model, db = 'pickle')
D.sample(iter = 10000, burn = 1000)


D.sample takes two arguments (in fact, you can specify more): the number of iterations and burn, the burn-in (let's call it the "warm-up period"). The warm-up period is the number of initial iterations that are discarded. The point is that MCMC initially depends on the starting point (this is a property of the method), so we need to cut off this period of dependence.

This completes our analysis.
We now have an object D that contains the sample and has various methods (functions) for computing the parameters of this sample (mean, most probable value, variance, etc.).
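For example, here is one way to pull posterior summaries out of D (a small sketch assuming the pymc 2.x API used in this post):

import numpy

a_trace = D.trace('a')[:]                        # the sampled chain for parameter a
print(a_trace.mean())                            # posterior mean of a
print(numpy.percentile(a_trace, 2.5), numpy.percentile(a_trace, 97.5))   # a rough 95% interval
print(D.stats()['a']['mean'])                    # the same mean via the built-in summary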

In order to compare the results, we also perform a least-squares fit:

 chisq_result = polyfit(model.x, model.data, 1) 


Now we print all the results:

 print "\n\nResult of chi-square result: a= %f, b= %f" % (chisq_result[0], chisq_result[1]) print "\nResult of Bayesian analysis: a= %f, b= %f" % (Davalue, Dbvalue) print "\nThe real coefficients are: a= %f, b= %f\n" %(model.true_coefficients[0], model.true_coefficients[1]) 


We build standard pymc graphics:

 pymc.Matplot.plot(D) 


And finally, we build our final plot:

figure()
plot(model.x, model.data, marker='+', linestyle='')
plot(model.x, D.a.value * model.x + D.b.value, color='g', label='Bayes')
plot(model.x, chisq_result[0] * model.x + chisq_result[1], color='r', label='Chi-squared')
plot(model.x, model.true_coefficients[0] * model.x + model.true_coefficients[1], color='k', label='Data')
legend()
show()


Here is the full content of the run_model.py file:

from numpy import polyfit
from matplotlib.pyplot import figure, plot, show, legend
import pymc
import model

#Define MCMC:
D = pymc.MCMC(model, db = 'pickle')
#Sample MCMC: 10000 iterations, burn-in period is 1000
D.sample(iter = 10000, burn = 1000)

#compute chi-squared fitting for comparison:
chisq_result = polyfit(model.x, model.data, 1)

#print the results:
print "\n\nResult of chi-square result: a= %f, b= %f" % (chisq_result[0], chisq_result[1])
print "\nResult of Bayesian analysis: a= %f, b= %f" % (D.a.value, D.b.value)
print "\nThe real coefficients are: a= %f, b= %f\n" % (model.true_coefficients[0], model.true_coefficients[1])

#plot graphs from MCMC:
pymc.Matplot.plot(D)

#plot noised data, true line and two fitted lines (bayes and chi-squared):
figure()
plot(model.x, model.data, marker='+', linestyle='')
plot(model.x, D.a.value * model.x + D.b.value, color='g', label='Bayes')
plot(model.x, chisq_result[0] * model.x + chisq_result[1], color='r', label='Chi-squared')
plot(model.x, model.true_coefficients[0] * model.x + model.true_coefficients[1], color='k', label='Data')
legend()
show()


In the terminal we see the following answer:

Result of chi-square result: a = 10.321533, b = 6.307100

Result of Bayesian analysis: a = 10.366272, b = 6.068982

The real coefficients are: a = 10.400000, b = 5.500000

Note that since we are dealing with a random process, the values you see may differ from the above (except for the last line).

And in the folder containing run_model.py we will see the following plots.
For parameter a:
[pymc plot for a: trace, autocorrelation, and posterior histogram]
For parameter b:
[pymc plot for b: trace, autocorrelation, and posterior histogram]
For the sigma parameter:
[pymc plot for sigma: trace, autocorrelation, and posterior histogram]
On the right, we see a histogram of the posterior distribution, and the two pictures on the left belong to the Markov chain.
I will not dwell on them now. Let me just say that the bottom plot is the autocorrelation plot (you can read more here). It gives an idea of the convergence of the MCMC.
The top plot shows the trace of the sample, that is, how the sampled values changed over the iterations. The mean of this trace is the mean of the sample (compare the vertical axis in this plot with the horizontal axis of the histogram on the right).

In conclusion, I will talk about one more interesting option.
If you also install the pydot module and add the following line to run_model.py:

 pymc.graph.dag(D).write_png('dag.png') 


it will create the following image in the folder with run_model.py:

[Figure: the directed acyclic graph of the model, as produced by pymc.graph.dag]

This is a directed acyclic graph representing our model. White ellipses show the stochastic nodes (a, b, and sigma), triangles show deterministic nodes, and the darkened ellipse is our pseudo-experimental data.
That is, we see that the values a and b enter our model (linear_fit), which is itself a deterministic node, and then go into the likelihood function y. Sigma is first set by a stochastic node, but since the parameter in the likelihood function is not sigma but tau = 1/sigma^2, the stochastic value sigma is first squared (the upper right triangle) and then tau is computed from it. And tau enters the likelihood function, together with our generated data.
I think that this graph is very useful both for explaining the model and for self-checking the model's logic.

Bayesian spam filtering is a spam-filtering method based on a naive Bayes classifier, which in turn relies on Bayes' theorem.


History

The first known mail-filtering program to use a Bayesian classifier was Jason Rennie's iFile program, released in 1996. The program used it to sort mail into folders [1]. The first academic publication on naive Bayes spam filtering appeared in 1998 [2]. Shortly after that publication, work began on commercial spam filters [citation needed]. However, in 2002 Paul Graham was able to significantly reduce the number of false positives, to the point where the Bayesian filter could be used as the sole spam filter [3] [4] [5].

Modifications of the basic approach have been developed in many research papers and implemented in software products [6]. Many modern mail clients perform Bayesian spam filtering. Users can also install separate mail-filtering programs. Mail-server filters such as DSPAM, SpamAssassin, SpamBayes, SpamProbe, Bogofilter, and CRM114 use Bayesian spam filtering methods [5]. E-mail server software either ships with such filters or provides an API for connecting external modules.

Description

When the filter is trained, a "weight" is computed and stored for each word encountered in the letters: an estimate of the probability that a letter containing this word is spam. In the simplest case, the frequency is used as the estimate: "occurrences in spam / occurrences overall". In more complex cases, pre-processing of the text is possible: reducing words to their base form, removing stop words, computing "weights" for whole phrases, transliteration, and so on.
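A minimal sketch of such training in Python (the toy corpus, the tokenization, and the variable names are illustrative assumptions, not part of any particular filter):

from collections import defaultdict

# toy training corpus: (text, is_spam) pairs
training = [("buy replica watches now", True),
            ("meeting agenda for tomorrow", False),
            ("replica watches best price", True),
            ("lunch tomorrow?", False)]

spam_count = defaultdict(int)    # in how many spam letters each word occurred
total_count = defaultdict(int)   # in how many letters each word occurred overall

for text, is_spam in training:
    for word in set(text.lower().split()):
        total_count[word] += 1
        if is_spam:
            spam_count[word] += 1

# the simplest "weight": occurrences in spam / occurrences overall
weight = {w: float(spam_count[w]) / total_count[w] for w in total_count}
print(weight["replica"])   # 1.0 in this toy corpus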

When a newly arrived letter is checked, the probability that it is "spam" is computed using the formula above over a set of hypotheses. Here the "hypotheses" are words; for each word, the "probability of the hypothesis" is the share of that word in the letter, and the "probability of the event given the hypothesis" is the previously computed "weight" of the word. That is, the "weight" of the letter in this case is the averaged "weight" of all its words.

A letter is assigned to "spam" or "not spam" according to whether its "weight" exceeds a certain threshold set by the user (usually 60-80% is taken). After the decision about the letter is made, the "weights" of the words it contains are updated in the database.

Mathematical foundations

Bayesian mail filters rely on Bayes' theorem. Bayes' theorem is used several times in the spam context: first, to compute the probability that a message is spam given that a certain word appears in it; second, to compute the probability that the message is spam taking into account all of its words (or a relevant subset of them); and sometimes a third time, when messages contain rare words.

Calculating the probability that a message containing a given word is spam

Let's assume the suspect message contains the word "replica". Most people who are used to receiving e-mail know that such a message is likely to be spam, more precisely an offer to sell counterfeit watches of famous brands. A spam-detection program, however, does not "know" such facts; all it can do is compute probabilities.

The formula used by the software to determine this is derived from Bayes' theorem and the total probability formula:

Pr(S|W) = Pr(W|S) * Pr(S) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H))

Where:

  • Pr(S|W) is the conditional probability that the message is spam, given that the word "replica" is in it;
  • Pr(S) is the overall probability that an arbitrary message is spam;
  • Pr(W|S) is the conditional probability that the word "replica" appears in messages, given that they are spam;
  • Pr(H) is the overall probability that an arbitrary message is not spam (that is, "ham");
  • Pr(W|H) is the conditional probability that the word "replica" appears in messages, given that they are "ham".
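As a sketch, this is how the formula might be evaluated from training counts in Python (the function name and all the counts are illustrative assumptions):

def prob_spam_given_word(spam_with_word, total_spam, ham_with_word, total_ham, p_spam=0.5):
    # Pr(W|S) and Pr(W|H) estimated as relative frequencies from the training data
    p_w_given_spam = float(spam_with_word) / total_spam
    p_w_given_ham = float(ham_with_word) / total_ham
    p_ham = 1.0 - p_spam
    # Bayes' theorem with the total probability formula in the denominator
    return (p_w_given_spam * p_spam) / (p_w_given_spam * p_spam + p_w_given_ham * p_ham)

print(prob_spam_given_word(spam_with_word=805, total_spam=1000, ham_with_word=5, total_ham=1000))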

Spamicity of a word

Recent statistical studies [7] have shown that today the probability of any given message being spam is at least 80%: Pr(S) = 0.8, Pr(H) = 0.2.

However, most Bayesian spam-detection programs assume there is no prior preference for a message to be "spam" rather than "ham" and take both cases to be equally probable, 50%: Pr(S) = 0.5, Pr(H) = 0.5.

Filters that use this hypothesis are referred to as "unprejudiced" filters: they have no prejudice regarding incoming mail. This assumption allows the general formula to be simplified to:

Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))

The value Pr(S|W) is called the "spamicity" of the word W. The number Pr(W|S) used in the formula above is approximately equal to the relative frequency of messages containing the word W among the messages identified as spam during the training phase, that is:

Pr(W|S) ≈ (number of spam messages containing the word W) / (total number of spam messages)

Similarly, Pr(W|H) is approximately equal to the relative frequency of messages containing the word W among the messages identified as "ham" during the training phase:

Pr(W|H) ≈ (number of ham messages containing the word W) / (total number of ham messages)

For these approximations to make sense, the set of training messages must be large and representative enough. It is also desirable that the training set match the 50% hypothesis about the split between spam and ham, that is, that the spam and ham message sets be of the same size.

Of course, deciding whether a message is "spam" or "ham" based solely on the presence of a single word is error-prone; this is why Bayesian spam filters try to consider several words and combine their spamicities to estimate the overall probability that the message is spam.

Combining individual probabilities

Software spam filters built on the principles of the naive Bayes classifier make the "naive" assumption that the events corresponding to the presence of each word in an e-mail or message are independent of each other. This simplification is generally incorrect for natural languages such as English, where the probability of encountering an adjective is increased by the presence of, for example, a noun. Based on this "naive" assumption, for the problem of classifying messages into just 2 classes, S (spam) and H ("ham", that is, not spam), Bayes' theorem yields the following formula for estimating the probability that an entire message containing the words W1, ..., WN is spam:

p = Pr(S | W1, ..., WN)
= Pr(W1, ..., WN | S) * Pr(S) / Pr(W1, ..., WN)   [by Bayes' theorem]
= Pr(S) * Pr(W1 | S) * ... * Pr(WN | S) / Pr(W1, ..., WN)   [because the Wi are assumed independent]

where, by the total probability formula,

Pr(W1, ..., WN) = Pr(S) * Pr(W1 | S) * ... * Pr(WN | S) + Pr(H) * Pr(W1 | H) * ... * Pr(WN | H)

Expressing each Pr(Wi | S) and Pr(Wi | H) through Pr(S | Wi) and Pr(H | Wi) by Bayes' theorem, and assuming Pr(S) = Pr(H) = 0.5, we obtain:

p = (p1 * p2 * ... * pN) / (p1 * p2 * ... * pN + (1 - p1) * (1 - p2) * ... * (1 - pN))

Where:

  • p is the probability that a message containing the words W1, ..., WN is spam;
  • p1 is the conditional probability Pr(S | W1) that a message is spam, given that it contains the first word (for example, "replica");
  • p2 is the conditional probability Pr(S | W2) that a message is spam, given that it contains the second word (for example, "watches");
  • etc.
  • pN is the conditional probability Pr(S | WN) that a message is spam, given that it contains the N-th word (for example, "home").

(Demonstration: [8] )

The result p is usually compared with some threshold, for example 0.5, to decide whether the message is spam or not. If p is below the threshold, the message is treated as likely "ham"; otherwise it is treated as likely spam.
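A sketch of this combination and thresholding step in Python (the word spamicities below are made-up numbers; a real filter would take them from its trained database):

def combine_spamicities(probs, threshold=0.5):
    # naive Bayes combination: p = prod(pi) / (prod(pi) + prod(1 - pi))
    prod_spam = 1.0
    prod_ham = 1.0
    for pi in probs:
        prod_spam *= pi
        prod_ham *= (1.0 - pi)
    p = prod_spam / (prod_spam + prod_ham)
    return p, p > threshold

word_spamicities = [0.97, 0.85, 0.40, 0.02]       # spamicities of the words found in a hypothetical message
print(combine_spamicities(word_spamicities))      # (combined probability, is it spam?)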

The problem of rare words

It arises when a word has never been encountered during the training phase: both the numerator and the denominator are then equal to zero, both in the general formula and in the spamicity formula.

In general, words that the program has encountered only a few times during the training phase are not representative (the data sample is too small to draw a reliable conclusion about such a word's properties). A simple solution is to ignore such unreliable words.

Other heuristic improvements

"Neutral" words such as "the", "a", "some", or "is" (in English), or their equivalents in other languages, can be ignored. Generally speaking, some Bayesian filters simply ignore all words whose spamicity is close to 0.5, since this gives a qualitatively better decision. Only words whose spamicity is near 0.0 (a distinctive sign of legitimate messages, "ham") or near 1.0 (a distinctive sign of spam) are taken into account. The selection can be tuned, for example, to keep only the ten words of the examined message that have the largest absolute value |0.5 - Pr|.

Some software products take into account the fact that a given word may appear several times in the message being checked [9]; others do not.

Some software products use word patterns (sequences of words) instead of isolated natural-language words [10]. For example, with a "context window" of four words, they compute the spamicity of the phrase "Viagra, good for" instead of computing the spamicities of the individual words "Viagra", "good", and "for". This method is more sensitive to context and suppresses Bayesian noise better, at the cost of a larger database.

Mixed methods

Besides the "naive" Bayesian approach, there are other ways to combine the individual probabilities for different words. These methods differ from the "naive" one in the assumptions they make about the statistical properties of the input data. Two different hypotheses lead to radically different formulas for combining the individual probabilities.

For example, under the assumption that the logarithm of the product of the individual probabilities, up to a constant, follows a chi-square distribution with 2N degrees of freedom, one could use the formula:

p = C^-1(-2 ln(p1 * p2 * ... * pN), 2N)

where C^-1 is the inverse of the chi-square function.
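A sketch of this chi-square combination in Python, interpreting C^-1 through the chi-square survival function from scipy (this reading of the formula, and the example probabilities, are assumptions for illustration):

import math
from scipy.stats import chi2

def chi2_combine(probs):
    # -2 * ln(p1 * p2 * ... * pN) is compared against a chi-square
    # distribution with 2N degrees of freedom
    statistic = -2.0 * sum(math.log(p) for p in probs)
    return chi2.sf(statistic, 2 * len(probs))   # chi-square tail probability

print(chi2_combine([0.97, 0.85, 0.40, 0.02]))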

The individual probabilities can also be combined by the methods of Markovian discrimination.

Characteristics

This method is simple (the algorithms are elementary), convenient (it lets you do without "blacklists" and similar artificial techniques), and effective (after training on a sufficiently large sample it cuts off up to 95-97% of spam, and in case of any errors it can be retrained further). In short, there is every reason for its universal use, which is exactly what happens in practice: practically all modern spam filters are built on its basis.

The method does have a fundamental drawback, however: it relies on the assumption that some words occur more often in spam and others in ordinary mail, and it is ineffective when this assumption is wrong. As practice shows, though, even a human cannot identify such spam "by eye", only by reading the letter and understanding its meaning. There is a technique called Bayesian poisoning that lets a spammer add a lot of extraneous, sometimes carefully chosen, text to "fool" the filter.

Another drawback, not fundamental but related to implementation, is that the method works only with text. Aware of this limitation, spammers began to put the advertising content into an image, while the text of the letter is either missing or meaningless. To counter this, one has to use either text-recognition tools (an "expensive" procedure, applied only when absolutely necessary) or older filtering methods: "blacklists" and regular expressions (since such letters often have a stereotyped form).

