Manning, Christopher D.; Hinrich Schütze;
Foundations of Statistical Natural Language Processing
Cambridge, MA: The MIT Press, 1999, xxxvii + 680 pages
ISBN 0262133601
topics: | language-nlp | statistics
Stanford University / Xerox Palo Alto Research Center

set of outcomes = samples. STOCHASTIC VARIABLE [stochastic = indeterminacy in the outcome, as opposed to deterministic]; formally, a RANDOM or STOCHASTIC VARIABLE is a function from a probability space to the real numbers.

sample space = probability space [formally: the prob space is built on Omega = the set of all basic outcomes; an EVENT is a subset of Omega and may include multiple outcomes, e.g. several coin tosses. The set of all events constitutes a SIGMA-FIELD F - closed under countable union and under complement, with maximal element Omega; the empty set = the impossible event. (this is the Kolmogorov axiomatization; alternatives exist)]

A discrete probability function P maps F -> [0,1] s.t.
a) there must be some outcome: P(Omega) = 1
b) countable additivity: given disjoint events Aj, Ak in F, P(Aj U Ak) = P(Aj) + P(Ak)
[note: if we choose Ak to be the complement of A (~A), then P(~A) = 1 - P(A)]

e.g. for 3 coin tosses, the sample space is Omega = {HHH, HHT, ..., TTT}, 8 outcomes.
UNIFORM DISTRIBUTION: each outcome is equally likely.
prob of exactly 2 heads = P({HHT, HTH, THH}) = 3/8

An UNBIASED SAMPLE is a set of objects chosen from a complete sample using a selection process that does not depend on the properties of the objects. e.g. if choosing people for an experiment, all humans should have an equal chance of being selected; email polls etc. are biased towards internet users.
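a minimal sketch (mine, not from the book) enumerating the 3-toss sample space to check the 3/8 figure under a uniform distribution:

    from itertools import product

    # sample space Omega for three coin tosses: all sequences over {H, T}
    omega = list(product("HT", repeat=3))        # 8 equally likely outcomes

    # the event "exactly two heads" is a subset of Omega
    event = [w for w in omega if w.count("H") == 2]

    # uniform distribution: P(A) = |A| / |Omega|
    print(len(event) / len(omega))               # 0.375 = 3/8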
PRIOR PROBABILITY: the probability of an event before any additional knowledge, i.e. P(event) based on P: Omega -> [0,1]. An additional event can alter the assessment: e.g. the prob of exactly 2 heads, when we know the first toss is a head, is the prob of exactly one head in the remaining two tosses = P({HT, TH}) = 2/4 = 1/2.

CONDITIONAL PROBABILITY: P(A|B) = P(A^B) / P(B) [venn diagram: A^B lies inside circle B]

PRODUCT (MULTIPLICATION) RULE: P(A^B) = P(A|B) * P(B) = P(B|A) * P(A) [with the convention P(A^B) = 0 when P(B) = 0, this holds even when P(B) = 0]

SUM RULE: P(X) = SUM_Y P(X^Y) = SUM_Y P(X|Y) P(Y)

CHAIN RULE [generalization of the multiplication rule]: P(A^B^C^D) = P(A) * P(B|A) * P(C|A^B) * P(D|A^B^C) ...

INDEPENDENCE: A, B independent if P(A^B) = P(A) * P(B)

BAYES' THEOREM: P(B|A) = P(A|B) * P(B) / P(A)
P(B) = prior; P(B|A) = posterior; P(A|B) = likelihood; P(A) = normalizing constant [brings the result into the [0,1] range]. e.g. B = some parameter value Theta, A = evidence x; then the likelihood of Theta given x = P(x | Theta) = P_Theta(x).
Given several choices of hypotheses B, we can compare them by looking only at the numerator P(A|B) P(B) = P(A^B) and ignore the normalization factor P(A) [choose argmax_B P(A|B) P(B)].
If the B_i constitute a partition of A (the B_i are disjoint and Union B_i = A), then the denominator P(A) = SUM_i P(A|B_i) P(B_i).

Ex. 2: a PARASITIC GAP, a rare syntactic construction, occurs on average once in 100,000 sentences. A pattern matcher attempts to identify sentences with parasitic gaps: if a sentence has a parasitic gap (G), it will say so (T) with prob 0.95; if it has no gap (~G), it will wrongly say it does (T) with prob 0.005. If the test says that a sentence contains a parasitic gap, what is the probability that this is true, i.e. P(G|T)?

P(G) = 10^-5; P(T|G) = 0.95; P(T|~G) = 0.005 = 5 * 10^-3

P(G|T) = P(T|G) * P(G) / [P(T|G) * P(G) + P(T|~G) * P(~G)]
       = 0.95 * 10^-5 / [0.95 * 10^-5 + 5 * 10^-3 * (1 - 10^-5)]
       = 0.0095 / (0.0095 + 5 * 0.99999)         [dividing through by 10^-3]
       = 0.0095 / (0.0095 + 4.99995)
       = 0.0095 / 5.00945
       = 0.0019

i.e. only about 1 in 500 sentences that the matcher flags actually has a parasitic gap [owing to the very low prior].
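a quick numeric check of the parasitic-gap posterior (my sketch, not the book's), using the probabilities quoted above:

    # Bayes: P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|~G) P(~G)]
    p_g    = 1e-5    # prior: one sentence in 100,000 has a parasitic gap
    p_t_g  = 0.95    # P(T|G): matcher fires given a true gap
    p_t_ng = 0.005   # P(T|~G): false-positive rate

    posterior = p_t_g * p_g / (p_t_g * p_g + p_t_ng * (1 - p_g))
    print(posterior)  # ~0.0019: about 1 flagged sentence in 500 is a real gap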
r.v.: a function X: Omega -> R^n (commonly n=1). A DISCRETE random variable is a function X: Omega -> a discrete set S = {a, b, c, d, ...}, where a, b, c, d are the possible values. If S = {0,1} then X is called an INDICATOR r.v., or a BERNOULLI TRIAL.

roll a die once: UNIFORM DISTRIBUTION
sum of two dice: triangular distribution
sum of n dice: tends toward a GAUSSIAN (central limit theorem)

discrete r.v.: the mapping p(x) = P(X = x) to [0,1] is the probability MASS FUNCTION (pmf); SUM_i p(x_i) = P(Omega) = 1, and any function satisfying this is also a pmf.
continuous r.v.: probability DENSITY FUNCTION (pdf); the INTEGRAL of p(x) dx over the whole space = 1.
the SUM and PRODUCT rules carry over to pmfs/pdfs (sums become integrals in the continuous case).
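a small simulation sketch (mine; dice_sum_histogram is a made-up helper name) showing the uniform -> triangular -> bell-shaped progression:

    import random
    from collections import Counter

    def dice_sum_histogram(n_dice, trials=100_000):
        """empirical distribution of the sum of n_dice fair dice"""
        sums = (sum(random.randint(1, 6) for _ in range(n_dice))
                for _ in range(trials))
        counts = Counter(sums)
        return {s: counts[s] / trials for s in sorted(counts)}

    print(dice_sum_histogram(1))   # ~uniform: each face near 1/6
    print(dice_sum_histogram(2))   # triangular: peak at 7, prob ~1/6
    print(dice_sum_histogram(10))  # already visibly bell-shaped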
EXPECTATION of a function f(x): E(f) = SUM_x p(x) f(x) [or INTEGRAL p(x) f(x) dx in the continuous case]. If x_1 .. x_i .. x_N are samples drawn from some distribution (continuous or discrete), then E(f) ~= (1/N) SUM_i f(x_i).

Q. what is the expected value when rolling one die?
E(X) = (1/6)(1 + 2 + ... + 6) = 21/6 = 3.5

rolling two dice: pair each sum x with 14 - x (2 with 12, 3 with 11, ...); each pair shares the same probability and contributes 14 times it:
E = SUM_{x=2..12} x p(x)
  = 14 * [1/36 + 2/36 + 3/36 + 4/36 + 5/36] + 7 * 1/6
  = 210/36 + 42/36 = 252/36 = 7
i.e. E(X + Y) = E(X) + E(Y) [linearity of expectation]
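a Monte Carlo sketch (mine, not the book's) of E(f) ~= (1/N) SUM f(x_i); the helper name expected_value is made up:

    import random

    def expected_value(sample, f=lambda x: x, n=200_000):
        """estimate E(f) by averaging f over n draws from sample()"""
        return sum(f(sample()) for _ in range(n)) / n

    die = lambda: random.randint(1, 6)
    print(expected_value(die))                    # ~3.5
    print(expected_value(lambda: die() + die()))  # ~7.0: E(X+Y) = E(X)+E(Y)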