book excerptise:   a book unexamined is wasting trees

Foundations of Statistical Natural Language Processing

Christopher D. Manning and Hinrich Schütze

Manning, Christopher D.; Hinrich Schütze;

Foundations of Statistical Natural Language Processing

Cambridge, MA: The MIT Press, 1999, xxxvii + 680 pages

ISBN 0262133601

topics: |  language-nlp | statistics

Stanford University / Xerox Palo Alto Research Center

set of outcomes - samples; stochastic variable
[stochastic - indeterminacy in the outcome; as opposed to deterministic]
formally : random or stochastic variable : function from the sample space
	to a real number.  [it is the probabilities, not the values of the
	variable, that lie between 0 and 1]
sample space Omega = set of all basic outcomes
[formally: an EVENT is a subset of Omega (an event may include multiple
	outcomes, e.g. several coin-tosses).  The set of events
	constitutes a sigma-field F - closed under countable union, and under
	complement, with maximal element Omega.  The empty set = impossible
	event.  (Omega, F, P) together form a probability space.
	(this is the Kolmogorovian axiomatization; alternatives exist)]

A discrete probability function P maps F -> [0,1] s.t.
a) There must be some outcome: P(Omega) = 1,
b) Countable Additivity : given pairwise disjoint events A1, A2, ... in F,
   	P(Union_j Aj) = SUM_j P(Aj)

[Note: taking the disjoint pair A and its complement ~A, we have
	P(A) + P(~A) = P(Omega) = 1, so P(~A) = 1 - P(A)]

e.g. for 3 coin tosses, the sample space is Omega = {HHH, ... TTT}

UNIFORM DISTRIBUTION : each outcome is equally likely
prob of exactly 2 heads = P({HHT, HTH, THH}) = 3/8
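
[a quick numerical check in Python - my sketch, not from the book -
enumerating the 8-outcome space for 3 tosses of a fair coin:]

    from itertools import product

    omega = list(product("HT", repeat=3))      # {HHH, HHT, ..., TTT}, |Omega| = 8
    p = 1 / len(omega)                         # uniform: each outcome equally likely

    A = [w for w in omega if w.count("H") == 2]
    print(len(A) * p)                          # P(exactly 2 heads) = 3/8 = 0.375

    # complement rule from above: P(~A) = 1 - P(A)
    print((len(omega) - len(A)) * p)           # 5/8
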
An UNBIASED SAMPLE is a set of objects chosen from a complete sample using a
	selection process that does not depend on the properties of the
	objects.
	e.g. if choosing people for an experiment, all humans should have an
	equal chance of being selected.  email polls etc. are biased towards
	internet users

Conditional Probability

PRIOR PROBABILITY : prob of an event prior to any other knowledge,
i.e. P(event) based on P : F -> [0,1] alone

an additional event can alter the assessment.
e.g. prob of exactly 2 heads, given that we know the first toss is a head,
is P({HHT, HTH}) / P(first is H) = (2/8) / (4/8) = 1/2
CONDITIONAL PROBABILITY : P(A|B) = P(A^B) / P(B)
	[venn diagram: A^B lies inside circle B]
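
[the same coin-toss sketch as above, extended to the conditional case:]

    from itertools import product

    omega = list(product("HT", repeat=3))
    A = {w for w in omega if w.count("H") == 2}    # exactly 2 heads
    B = {w for w in omega if w[0] == "H"}          # first toss is a head

    p = lambda ev: len(ev) / len(omega)            # uniform measure
    print(p(A & B) / p(B))                         # P(A|B) = (2/8)/(4/8) = 0.5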

PRODUCT RULE (also: MULTIPLICATION RULE):
    P(A^B) = P(A|B) * P(B) = P(B|A) * P(A)
    [also holds when P(B)=0, since then P(A^B) = 0 as well]

SUM RULE:
    P(X) = SUM_Y P(X^Y) = SUM_Y P(X|Y) P(Y)

CHAIN RULE: generalization of the multiplication rule
	   P(A^B^C^D) = P(A) * P(B|A) * P(C|A^B) * P(D|A^B^C) ...

INDEPENDENCE : A, B indep if P(A^B) = P(A) * P(B)
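
[a toy numerical check of the sum/product rules and the independence
test - the 2x2 joint table here is my own made-up example:]

    P = {("a0", "b0"): 0.1, ("a0", "b1"): 0.3,     # made-up joint P(A^B)
         ("a1", "b0"): 0.2, ("a1", "b1"): 0.4}

    # SUM rule: marginals P(A) = SUM_B P(A^B), P(B) = SUM_A P(A^B)
    P_A = {a: sum(p for (x, _), p in P.items() if x == a) for a in ("a0", "a1")}
    P_B = {b: sum(p for (_, y), p in P.items() if y == b) for b in ("b0", "b1")}

    # PRODUCT rule: P(A^B) = P(A|B) * P(B)
    a, b = "a0", "b1"
    P_a_given_b = P[(a, b)] / P_B[b]
    print(P_a_given_b * P_B[b], P[(a, b)])         # both 0.3

    # INDEPENDENCE: is P(A^B) = P(A) * P(B)?
    print(P[(a, b)], P_A[a] * P_B[b])              # 0.3 vs 0.4*0.7 = 0.28 -> dependent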

BAYES theorem : P(B|A) = P(A|B) * P(B) / P(A)
      P(B) = prior; P(B|A) = posterior;
      P(A|B) = likelihood
      P(A) = normalizing constant [brings the result into the [0,1] range]
e.g. B = some parameter value Theta, A = evidence x;
   then the likelihood of Theta given x = P(x | Theta) = P_Theta(x)

now, given several candidate hypotheses B, we can compare them by looking
only at the numerator P(A^B) = P(A|B) P(B), ignoring the normalization
factor P(A) [i.e. choose argmax_B P(A|B) P(B)]

If the Bi are pairwise disjoint and A is contained in Union_i Bi
(e.g. the Bi partition Omega), then the denom P(A) = SUM_i P(A|Bi) P(Bi)

Ex. 2:  a PARASITIC GAP, a rare syntactic construction, occurs on average
once in 100,000 sentences.  Consider a pattern matcher that
attempts to identify sentences with parasitic gaps.
if a sentence has a parasitic gap (G), the matcher will say so (T) with prob
0.95.  if it has no gap (~G), it will wrongly say it does (T) with prob
0.005.  if the test says that a sentence contains a parasitic gap, what is
the probability that this is true?  i.e. find P(G|T)

P(G) = 10^-5.  P(T|G) = 0.95.  P(T|~G) = 0.005 = 5 * 10^-3

P(G|T) = P(T|G) * P(G) / [P(T|G) * P(G) + P(T|~G) * P(~G)]
       = 0.95 * 10^-5 / [0.95 * 10^-5 + 5 * 10^-3 * (1 - 10^-5)]
       = 9.5e-3 / (9.5e-3 + 5 * 0.99999)     [dividing through by 10^-3]
       = 0.0095 / (0.0095 + 4.99995) = 0.0095 / 5.00945
       ~ 0.0019
i.e. only about 1 in 500 sentences flagged by the test actually has a
parasitic gap [owing to the low prior]
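
[the same computation as a Python sketch, using the book's numbers:]

    P_G    = 1e-5      # prior: a parasitic gap once per 100,000 sentences
    P_T_G  = 0.95      # P(T|G): matcher fires given a gap
    P_T_nG = 0.005     # P(T|~G): false positive rate

    P_T = P_T_G * P_G + P_T_nG * (1 - P_G)     # denom via the partition {G, ~G}
    print(P_T_G * P_G / P_T)                   # P(G|T) ~ 0.0019, i.e. ~1 in 500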

Random variable


r.v. : a function X : Omega -> R^n (commonly n=1).
a discrete random variable is a fn X : Omega -> S, for a countable set
S = {a,b,c,d,...} of values that X can take.
If S = {0,1} then X is called an indicator r.v. or a BERNOULLI TRIAL

roll a die once : UNIFORM DISTRIBUTION
sum of two dice : TRIANGULAR distribution
sum of n dice : tends toward a GAUSSIAN (central limit theorem)
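
[a simulation sketch (mine) showing the three shapes; the trial count and
histogram scaling are arbitrary choices:]

    import random
    from collections import Counter

    def dist_of_sum(n, trials=100_000):
        """Empirical distribution of the sum of n fair dice."""
        return Counter(sum(random.randint(1, 6) for _ in range(n))
                       for _ in range(trials))

    for n in (1, 2, 10):         # flat / triangular / near-Gaussian (CLT)
        c = dist_of_sum(n)
        for s in sorted(c):
            print(n, s, "#" * (c[s] // 1500))
        print()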

discrete r.v. : mapping to [0,1] = probability MASS FUNCTION (pmf)
	SUM_i p(xi) = P(Omega) = 1
any non-negative fn satisfying this is also a pmf

continuous r.v. : probability DENSITY function (pdf)
	INTEGR p(x) dx = 1   [integrating over the whole domain]
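
[a numeric check of both normalization conditions - the fair die and the
standard normal density are my example choices:]

    import math

    pmf = [1/6] * 6                           # fair die: SUM p(xi) = 1
    print(sum(pmf))                           # 1.0

    # standard normal density, integrated numerically over [-10, 10]
    dx = 0.001
    xs = [i * dx for i in range(-10_000, 10_001)]
    pdf = [math.exp(-x*x / 2) / math.sqrt(2 * math.pi) for x in xs]
    print(sum(p * dx for p in pdf))           # ~1.0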

SUM and PRODUCT rules [carry over to distributions of r.v.s:
	p(x) = SUM_y p(x,y) ;  p(x,y) = p(y|x) p(x)]

Expectation / Covariance


expectation of a function f : E(f) = SUM_x p(x) f(x)  [or INTEGR p(x) f(x) dx]

if x1..xi..xN are samples drawn from some distribution (cont or discrete), then
   E(f) ~= 1/N SUM_i f(xi)

Q. what is the expected value when
   rolling one die : E(X) = 1/6 * SUM(1..6) = 21/6 = 3.5
   rolling two dice : E = SUM_{x=2..12} x p(x)
   	       	    = 14 * [1/36 + 2/36 + 3/36 + 4/36 + 5/36] + 7 * 6/36
		    [pairing x with 14-x : (2,12),(3,11),(4,10),(5,9),(6,8)]
		    = 210/36 + 42/36 = 252/36 = 7
i.e. E(X + Y) = E(X) + E(Y)  [linearity of expectation]
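
[checking the dice expectations in Python, both exactly and via the
1/N SUM f(xi) estimate above; sample size is an arbitrary choice:]

    import random
    from itertools import product

    print(sum(x / 6 for x in range(1, 7)))                               # E(X) = 3.5

    print(sum((a + b) / 36 for a, b in product(range(1, 7), repeat=2)))  # E(X+Y) = 7.0

    N = 100_000                                # Monte Carlo: E(f) ~= 1/N SUM f(xi)
    print(sum(random.randint(1, 6) + random.randint(1, 6)
              for _ in range(N)) / N)          # ~7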


amitabha mukerjee (mukerjee [at-symbol] gmail) 2012 Apr 20