timerring

Parameter Estimation

February 2, 2025 · 5 min read
Tutorial
Random Process | Math
If you have any questions, feel free to comment below.

These are the notes for the lecture on Parameter Estimation.

Prediction vs. Estimation #

e.g. point vs. interval prediction

  • point: find $C$ that minimizes $E[|X-C|^2]$
  • interval: find $a,b$ such that $P(a<X<b)=\gamma$ (e.g. $\gamma = 0.9, 0.95, 0.99$)

e.g. Toss a fair coin 100 times. We need to estimate the number $n_{A}$ of heads with $\gamma=0.997$.

Here $n=100$ and $p=0.5$. Hence $k_{1}=np-3\sqrt{npq}=35$ and $k_{2}=np+3\sqrt{npq}=65$, so the number of heads lies between 35 and 65 with probability 0.997.
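A quick numerical check of this $3\sigma$ interval (a minimal Python sketch using only the standard library):

```python
import math

# 3-sigma interval for n_A, the number of heads in n = 100 tosses of a fair coin.
n, p = 100, 0.5
q = 1 - p

mean = n * p                 # E[n_A] = np = 50
std = math.sqrt(n * p * q)   # sqrt(npq) = 5

k1 = mean - 3 * std          # 35
k2 = mean + 3 * std          # 65
print(f"n_A lies in [{k1:.0f}, {k2:.0f}] with probability ~0.997")
```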

Observation with noise #

The $i$-th observation can be represented as $X_{i}=\theta+n_{i}$, $i=1,2,\ldots$, i.e. Observation = Signal (desired) + Noise.

The estimation problem is to obtain the best estimator for the unknown parameter $\theta$ based on the observations.

Denote the estimator by $\hat{\theta}(X)$. The error $e=\hat{\theta}(X)-\theta$ is a random variable, and it is typically penalized through a cost such as $e^2$ or $|e|$.

Select the best estimator so as to minimize some function of this error.
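A tiny simulation sketch of this model (the value of $\theta$ and the Gaussian noise are assumptions made only for illustration), using the sample mean as the estimator and reporting both error costs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observations X_i = theta + n_i with zero-mean noise (values assumed for illustration).
theta = 2.0
noise = rng.normal(0.0, 1.0, size=1000)
x = theta + noise                # observation = signal + noise

theta_hat = x.mean()             # a candidate estimator: the sample mean
e = theta_hat - theta            # the estimation error, a random variable
print(f"theta_hat = {theta_hat:.3f}, e^2 = {e**2:.5f}, |e| = {abs(e):.5f}")
```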

Max likelihood #

The method of maximum likelihood assumes that the given sample data set is representative of the population $f_{X}(x_{1}, x_{2}, \cdots, x_{n} ; \theta)$ and chooses the value of $\theta$ that most likely caused the observed data to occur. That is, once the observations $x_{1}, x_{2}, \cdots, x_{n}$ are given, $f_{X}(x_{1}, x_{2}, \cdots, x_{n} ; \theta)$ is a function of $\theta$ alone, and the value of $\theta$ that maximizes this p.d.f. is the most likely value for $\theta$; it is chosen as the ML estimate $\hat{\theta}_{ML}(X)$ for $\theta$.

e.g. Bernoulli Sampling

Let $X_{i} \sim \operatorname{Bernoulli}(\theta)$. That is,

$$ P\left(X_{i}=1\right)=\theta, \quad P\left(X_{i}=0\right)=1-\theta $$

The pdf for $x_{i}$ is

$$ f\left(x_{i} ; \theta\right)=\theta^{x_{i}}(1-\theta)^{1-x_{i}}, x_{i}=0,1 $$

Let $X_{1}, \cdots, X_{n}$ be an i.i.d. sample with $X_{i} \sim$ Bernoulli($\theta$). The joint density / likelihood function is given by

$$ f(x;\theta)=L(\theta | x)=\prod_{i=1}^{n} \theta^{x_{i}}(1-\theta)^{1-x_{i}}=\theta^{\sum_{i=1}^{n} x_{i}}(1-\theta)^{n-\sum_{i=1}^{n} x_{i}} $$
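A minimal sketch (the ten sample values below are made up for illustration) that evaluates this likelihood in log form on a grid of $\theta$ values; the grid maximizer agrees with the sample mean, which is derived as the closed-form ML estimate below:

```python
import numpy as np

# Hypothetical Bernoulli sample, assumed only for illustration.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n = len(x)

def log_likelihood(theta):
    # log L(theta | x) = sum(x) * log(theta) + (n - sum(x)) * log(1 - theta)
    s = x.sum()
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

grid = np.linspace(0.01, 0.99, 99)
theta_ml_grid = grid[np.argmax(log_likelihood(grid))]

print(f"grid-search MLE ~ {theta_ml_grid:.2f}")   # ~0.70
print(f"sample mean     = {x.mean():.2f}")        # 0.70, the closed-form MLE
```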

Max-likelihood Estimation #

The likelihood of the observations is the joint p.d.f.

$$ f_X(x_1,x_2,\cdots,x_n;\theta), $$

viewed as a function of $\theta$. So the ML estimate is obtained from

$$ \underset{\theta}{\text{sup}}\, f_X(x_1,x_2,\cdots,x_n;\theta), $$

or using the log-likelihood function

$$ L(x_{1}, x_{2}, \cdots, x_{n} ; \theta)=\log f_{X}(x_{1}, x_{2}, \cdots, x_{n} ; \theta). $$

If $L(x_{1}, x_{2}, \cdots, x_{n} ; \theta)$ is differentiable and a supremum $\hat{\theta}_{ML}$ exists, then it must satisfy the equation

$$ \frac{\partial L}{\partial \theta}=0 $$

For the Bernoulli likelihood above, $\frac{\partial L}{\partial \theta}=\frac{\sum_{i} x_{i}}{\theta}-\frac{n-\sum_{i} x_{i}}{1-\theta}=0$, which gives

$$ \hat{\theta}(X)=\frac{1}{n}\sum_{i = 1}^{n}x_i $$

Taking its expected value, we get

$$ E[\hat{\theta}(x)]=\frac{1}{n} \sum_{i=1}^{n} E(X_{i})=\theta, $$

i.e., the expected value of the estimator does not differ from the desired parameter, so there is no bias between the two, and $\hat{\theta}(X)$ is called an unbiased estimator. In short, $E[\hat{\theta}_{ML}(X)]=\theta$.

Then

$$ Var\left(\hat{\theta_{ML}}\right)=\frac{1}{n^{2}} \sum_{i=1}^{n} Var\left(X_{i}\right)=\frac{n \sigma^{2}}{n^{2}}=\frac{\sigma^{2}}{n} . $$

Thus

$$ Var\left(\hat{\theta}_{M L}\right) \to 0 \quad \text{as } n \to \infty, $$

which is another desired property. Such estimators are called consistent estimators: the estimator converges to the true parameter as the sample size increases.
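A small simulation sketch of this consistency property (the true $\theta$, the noise distribution, and the number of repetitions are assumed for illustration); the empirical variance of the sample-mean estimator shrinks like $\sigma^{2}/n$:

```python
import numpy as np

rng = np.random.default_rng(1)

theta, sigma = 2.0, 1.0          # assumed values for the simulation
for n in (10, 100, 1000):
    # 5000 repeated experiments, each with n noisy observations X_i = theta + w_i
    samples = theta + rng.normal(0.0, sigma, size=(5000, n))
    est = samples.mean(axis=1)   # theta_hat for each experiment
    print(f"n={n:5d}  Var(theta_hat) ~ {est.var():.5f}  (sigma^2/n = {sigma**2/n:.5f})")
```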

The ML estimator can be highly nonlinear.

Best Estimator vs. Best Unbiased Estimators #

For $X_{i}=\theta+w_{i}$, $i=1, \ldots, n$,

$\hat{\theta_{ML}}(X)=\frac{1}{n} \sum_{i=1}^{n} X_{i}$ with variance $Var(\hat{\theta_{ML}})=\frac{1}{n^{2}} \sum_{i=1}^{n} Var(X_{i})=\frac{n \sigma^{2}}{n^{2}}=\frac{\sigma^{2}}{n}$

And the sample mean is also the best unbiased estimator here: its variance $\frac{\sigma^{2}}{n}$ attains the Cramér-Rao lower bound.

MAP #

How do we obtain a good estimator based on the observations when prior information about the parameter, $f_{\theta}(\theta)$, is available?

Of course, we can use Bayes' theorem to obtain this a posteriori p.d.f. This gives

$$ f_{\theta | X}\left(\theta | x_{1}, x_{2}, \cdots, x_{n}\right)=\frac{f_{X | \theta}\left(x_{1}, x_{2}, \cdots, x_{n} | \theta\right) f_{\theta}(\theta)}{f_{X}\left(x_{1}, x_{2}, \cdots, x_{n}\right)} . $$
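The MAP estimate $\hat{\theta}_{MAP}(X)$ is the value of $\theta$ that maximizes this a posteriori p.d.f. As a minimal sketch (not from the lecture), assume a Beta$(a,b)$ prior for the Bernoulli sample used earlier; the posterior is then Beta$(a+\sum_i x_i,\ b+n-\sum_i x_i)$, and its mode gives the MAP estimate:

```python
import numpy as np

# Illustration with an assumed Beta(a, b) prior on theta for a Bernoulli sample.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # hypothetical data
n, s = len(x), x.sum()

a, b = 2.0, 2.0                        # assumed prior hyperparameters
# Posterior is Beta(a + s, b + n - s); its mode is the MAP estimate.
theta_map = (a + s - 1) / (a + b + n - 2)
theta_ml = s / n

print(f"theta_ML  = {theta_ml:.3f}")   # 0.700
print(f"theta_MAP = {theta_map:.3f}")  # 0.667, pulled toward the prior mean 0.5
```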

Hypothesis Testing #

Hypothesis testing is not a part of statistics. It is part of decision theory based on statistics.

  • Null Hypothesis: $H_0$, statements regarding the values of unknown parameters (always stated as an equality).

  • Alternative Hypothesis: $H_1$, statements contradictory to $H_0$.

  • Test statistic Z: Quantity based on the sample data.

  • Rejection region: The set of values of the test statistic that lead to the rejection of the null hypothesis, defined by the significance level $\alpha$.

The goal is to establish whether the experimental evidence (sample data) supports rejecting $H_0$, not to declare whether $H_0$ or $H_1$ is true or false. The decision is based on the sample data.

Cases #

Accept or reject hypotheses for means:

  • Normally distributed population with known variance.
  • Normally distributed population with unknown variance.
  • Not normally distributed population, but with a large enough sample.

Accept or reject hypotheses for normally distributed population variances.

Steps #

  1. Null and alternative hypotheses
  2. Test statistic
  3. P-value and interpretation
  4. Significance level (optional)

Suppose $H_0$: $\theta=\theta_0$. It is reasonable to reject $H_0$ if $X$ falls in $R_c$ and to accept $H_0$ if $X$ does not.

The set $R_c$ is the critical region of the test, and $\bar{R_c}$ is the acceptance region.

Test statistic #

The test statistic measures how closely the sample data fit the distribution predicted under the null hypothesis.

P-value #

The smaller the p-value, the more likely we are to reject the null hypothesis.

The calculation of the p-value depends on the statistical test you are using to test your hypothesis.

  • To compare only two different diets, a two-sample t-test is a good way to compare them.
  • To compare three different diets, use an ANOVA instead; doing multiple pairwise comparisons inflates the chance of a spuriously low p-value and overestimates the significance of the difference between groups (see the sketch below).
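A short sketch of both tests with scipy (the diet data are simulated under assumed distributions, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical weight-loss data (kg) for three assumed diets.
diet_a = rng.normal(3.0, 1.0, size=30)
diet_b = rng.normal(3.5, 1.0, size=30)
diet_c = rng.normal(4.5, 1.0, size=30)

# Two diets: two-sample t-test.
t_stat, p_two = stats.ttest_ind(diet_a, diet_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_two:.4f}")

# Three diets: one-way ANOVA instead of multiple pairwise t-tests.
f_stat, p_anova = stats.f_oneway(diet_a, diet_b, diet_c)
print(f"ANOVA : F = {f_stat:.2f}, p = {p_anova:.4f}")
```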

Statistical significance #

It is another way of saying the p-value of a statistical test is small enough to reject the null hypothesis.

The common threshold is 0.05. This threshold is also called the significance level ($\alpha$).

Comparison of two means #

Traditional method:

  • Reject $H_0$ if the test statistic falls within the critical region.
  • Accept $H_0$ if the test statistic does not fall within the critical region.

Modern method:

  • Reject $H_0$ if the p-value is less than or equal to the significance level.
  • Accept $H_0$ if the p-value is greater than the significance level.

Besides, you can use confidence intervals.
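A sketch of both decision rules on the same data (the two samples and their distributions are assumed purely for illustration, and a pooled two-sample t-test stands in for the generic comparison of two means); the two methods always agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated samples under assumed distributions, for illustration only.
x1 = rng.normal(10.0, 2.0, size=50)
x2 = rng.normal(11.0, 2.0, size=50)
alpha = 0.05

t_obs, p_value = stats.ttest_ind(x1, x2)        # pooled two-sample t-test
df = len(x1) + len(x2) - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)          # two-sided critical value

# Traditional method: compare the test statistic with the critical region.
reject_traditional = abs(t_obs) >= t_crit
# Modern method: compare the p-value with the significance level.
reject_modern = p_value <= alpha

print(f"|t_obs| = {abs(t_obs):.2f}, t_crit = {t_crit:.2f}, p = {p_value:.4f}")
print(f"traditional: reject H0? {reject_traditional}; modern: reject H0? {reject_modern}")
```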

Result #

EXAMPLE - EFFICACY TEST FOR NEW DRUG #

Null hypothesis - New drug is no better than standard treatment.

$$ H_{0}: \mu_{New}-\mu_{Std} \leq 0 \quad\left(\mu_{New}-\mu_{Std}=0\right) $$

Alternative hypothesis - New drug is better than standard treatment.

$$ H_{1}: \mu_{New}-\mu_{Std}>0 $$

Experimental (Sample) data:

In large samples, the difference between the two sample means is approximately normally distributed:

$$ \bar{X}_{New}-\bar{X}_{Std} \sim N\left(\mu_{New}-\mu_{Std},\ \frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}\right) $$

Under the null hypothesis, $\mu_{New}-\mu_{Std}=0$ and:

$$ Z=\frac{\bar{X}_{New}-\bar{X}_{Std}}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} \sim N(0,1) $$

$\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ are unknown and are estimated by the sample variances $s_1^2$ and $s_2^2$.

Type I error - Concluding that the new drug is better than the standard ($H_1$) when in fact it is no better ($H_0$ true). An ineffective drug is deemed better.

Traditionally, $\alpha=P(\text{Type I error})=0.05$.

Type II error - Failing to conclude that the new drug is better than the standard ($H_1$) when in fact it is. An effective drug is deemed to be no better.

Traditionally, a clinically important difference $(\Delta)$ is assigned and sample sizes are chosen so that:

$$ \beta= P(\text{Type II error} \mid \mu_{1}-\mu_{2}=\Delta) \leq 0.20 $$

Test Statistic - Difference between the sample means, scaled to the number of standard deviations (standard errors) from the null difference of 0 for the population means:

$$ z_{obs}=\frac{\bar{x}_{New}-\bar{x}_{Std}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} $$

Rejection Region - The set of values of the test statistic that are consistent with $H_1$ such that the probability the statistic falls in this region when $H_0$ is true is $\alpha$ (we will always set $\alpha=0.05$).

$$ \text{R.R.}: z_{obs} \geq z_{\alpha}, \quad \alpha=0.05 \Rightarrow z_{\alpha}=1.645 $$

P-value: $P\left(Z \geq z_{obs}\right)$

Conclusion - Reject $H_0$ if the test statistic falls in the rejection region, or equivalently if the P-value is $\leq \alpha$.

Result: Botox A produces a lower mean Tsui score than placebo, since $2.82 > 1.645$ and the P-value $< 0.05$.
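A quick numerical check of this decision, using only the reported statistic $z_{obs}=2.82$ (the underlying sample data are not reproduced here):

```python
from scipy import stats

alpha = 0.05
z_obs = 2.82                          # reported test statistic from the example
z_crit = stats.norm.ppf(1 - alpha)    # 1.645 for a one-sided test

p_value = 1 - stats.norm.cdf(z_obs)   # P(Z >= z_obs) ~ 0.0024
print(f"z_obs = {z_obs} >= z_crit = {z_crit:.3f}: {z_obs >= z_crit}")
print(f"p-value = {p_value:.4f} <= alpha = {alpha}: {p_value <= alpha}")
```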
