Let $Y_1,\ldots,Y_N$ be $(d+1)$-dimensional observations (each $Y_n$ collects, for shorthand, the covariate $X_n\in\mathbb{R}^d$ and its scalar response in $\mathbb{R}$) generated from some model with unknown parameters $\theta\in\Theta$.

Goal: Find the “true” parameters $\theta^*\in\Theta$.

Intuition: The idea is to find a set of $k$ constraints, or “moments”, involving the parameters $\theta$. What makes GMM nice is that you need no information per se about how the model depends on $\theta$. Likelihoods can certainly be used to construct moments (special case: maximum likelihood estimation (MLE)), but one can also use, for example, statistical moments (special case: method of moments (MoM)) as the constraints. Analogously, tensor decompositions are used in the case of spectral methods.

More formally, the $k$ moment conditions for a vector-valued function $g(Y,\cdot):\Theta\to\mathbb{R}^k$ are
$$m(\theta^*)\equiv\mathbb{E}[g(Y,\theta^*)]=0_{k\times 1},$$
where $0_{k\times 1}$ is the $k\times 1$ zero vector.
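For concreteness, here is a standard method-of-moments example (an illustration, not from the original text): for a univariate normal model with $\theta=(\mu,\sigma^2)$, one can take
$$g(Y,\theta)=\begin{pmatrix}Y-\mu\\(Y-\mu)^2-\sigma^2\end{pmatrix},$$
so that $\mathbb{E}[g(Y,\theta^*)]=0_{2\times 1}$ exactly when $\mu$ and $\sigma^2$ are the true mean and variance.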

As we cannot analytically derive the expectation for arbitrary $g$, we use the sample moments instead:
$$\hat m(\theta)\equiv\frac{1}{N}\sum_{n=1}^N g(Y_n,\theta).$$
By the Law of Large Numbers, $\hat{m}(\theta)\to m(\theta)$ as $N\to\infty$, so the problem is to find the $\theta$ which sets $\hat m(\theta)$ to zero.
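As a quick numeric sanity check, here is a minimal sketch of the sample moments, assuming a hypothetical univariate normal model $Y\sim N(\mu,\sigma^2)$ with $\theta=(\mu,\sigma^2)$ and two method-of-moments conditions (none of these specifics are in the original):

```python
import numpy as np

# Hypothetical model: Y ~ N(mu, sigma^2), theta = (mu, sigma^2), with the
# k = 2 moment conditions E[Y - mu] = 0 and E[(Y - mu)^2 - sigma^2] = 0.
def g(ys, theta):
    """Moment function g(Y, theta) in R^2, vectorized over observations."""
    mu, sigma2 = theta
    return np.stack([ys - mu, (ys - mu) ** 2 - sigma2], axis=-1)

def m_hat(ys, theta):
    """Sample moments: the average of g over the N observations."""
    return g(ys, theta).mean(axis=0)

rng = np.random.default_rng(0)
ys = rng.normal(loc=2.0, scale=1.5, size=50_000)

# At the true parameters, the sample moments are near the zero vector.
print(m_hat(ys, (2.0, 1.5 ** 2)))
```

Evaluating $\hat m$ at the true parameters gives values close to zero, as the Law of Large Numbers predicts.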

Cases:

• $\Theta\supset\mathbb{R}^k$, i.e., there are more parameters than moment conditions: The model is not identifiable. This is the standard scenario in ordinary least squares (OLS) when there are more covariates than observations, so no unique set of parameters $\theta$ exists. Solve this by simply constructing more moments!
• $\Theta=\mathbb{R}^k$: There exists a unique solution.
• $\Theta\subset\mathbb{R}^k$, i.e., there are fewer parameters than moment conditions: The parameters are overspecified, and the best we can do is to minimize $m(\theta)$ in some norm instead of solving $m(\theta)=0$.

Consider the last scenario: we aim to minimize $\hat m(\theta)$ in some way, say $\|\hat m(\theta)\|$ for some choice of norm $\|\cdot\|$. We define the weighted norm as
$$\|\hat m(\theta)\|_W^2\equiv\hat m(\theta)^\top W\,\hat m(\theta),$$
where $W$ is a positive definite matrix.

The generalized method of moments (GMM) procedure is to find
$$\hat\theta=\operatorname*{arg\,min}_{\theta\in\Theta}\;\hat m(\theta)^\top W\,\hat m(\theta).$$
Note that while the motivation is the case $\Theta\subset\mathbb{R}^k$, this is guaranteed to work for $\Theta=\mathbb{R}^k$ too, since the minimizer is then the unique root of $\hat m(\theta)$. Hence it is a generalized method of moments.
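The minimization above can be sketched numerically. The normal model and moment function below are illustrative assumptions (not from the original), and scipy's Nelder–Mead optimizer stands in for the argmin:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical model: Y ~ N(mu, sigma^2), theta = (mu, sigma^2), with the
# two method-of-moments conditions as g. W = I here (identity weighting).
def g(ys, theta):
    mu, sigma2 = theta
    return np.stack([ys - mu, (ys - mu) ** 2 - sigma2], axis=-1)

def gmm_objective(theta, ys, W):
    m = g(ys, theta).mean(axis=0)  # sample moments m_hat(theta)
    return m @ W @ m               # weighted squared norm of m_hat

rng = np.random.default_rng(0)
ys = rng.normal(loc=2.0, scale=1.5, size=10_000)
W = np.eye(2)

# theta_hat = argmin over theta of m_hat(theta)^T W m_hat(theta)
result = minimize(gmm_objective, x0=np.array([0.0, 1.0]),
                  args=(ys, W), method="Nelder-Mead")
mu_hat, sigma2_hat = result.x
print(mu_hat, sigma2_hat)  # close to the true (2.0, 2.25)
```

With $k=2$ moments and two parameters this is the exactly identified case, so the minimizer coincides with the method-of-moments root.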

Theorem. Under standard assumptions¹, the estimator $\hat\theta$ is consistent and asymptotically normal. Furthermore, if
$$W=\Omega^{-1}\equiv\mathbb{E}\!\left[g(Y_n,\theta^*)\,g(Y_n,\theta^*)^\top\right]^{-1},$$
then $\hat \theta$ is asymptotically optimal, i.e., achieves the Cramér–Rao lower bound.

Note that $\Omega$ is the covariance matrix of $g(Y_n,\theta^*)$ and $\Omega^{-1}$ its precision. Thus the GMM weights the moment conditions of the estimator $\hat\theta$ depending on how much “error” remains in each component of $g(Y,\cdot)$ at $\theta^*$ (that is, how far away $g(Y,\cdot)$ is from $0$).
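In practice $\Omega$ is unknown, which suggests the standard two-step (“feasible”) GMM: estimate once with $W=I$, then re-estimate with $W=\hat\Omega^{-1}$. A sketch, where the normal model and the third (overidentifying) moment condition are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical model Y ~ N(mu, sigma^2) with k = 3 moment conditions for
# 2 parameters (overidentified): E[Y - mu], E[(Y - mu)^2 - sigma^2], and
# E[(Y - mu)^3] (zero for a normal) all vanish at the true theta.
def g(ys, theta):
    mu, sigma2 = theta
    return np.stack([ys - mu, (ys - mu) ** 2 - sigma2, (ys - mu) ** 3], axis=-1)

def gmm_estimate(ys, W, x0):
    def objective(theta):
        m = g(ys, theta).mean(axis=0)
        return m @ W @ m
    return minimize(objective, x0=x0, method="Nelder-Mead").x

rng = np.random.default_rng(1)
ys = rng.normal(loc=2.0, scale=1.5, size=10_000)

theta1 = gmm_estimate(ys, np.eye(3), np.array([0.0, 1.0]))  # step 1: W = I
G = g(ys, theta1)
Omega = G.T @ G / len(ys)                # sample estimate of E[g g^T]
theta2 = gmm_estimate(ys, np.linalg.inv(Omega), theta1)     # step 2: W = Omega^{-1}
print(theta2)
```

The second step downweights the noisier moment conditions, which is exactly the role of the precision $\Omega^{-1}$ described above.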

I haven’t seen anyone make this remark before, but the GMM estimator can also be viewed as minimizing the quadratic form of a multivariate normal log-density. Recall that the multivariate normal density is proportional to
$$\exp\!\left(-\frac{1}{2}(Y-\mu)^\top\Sigma^{-1}(Y-\mu)\right).$$
Setting $g(Y_n,\theta)\equiv Y_n-\mu$, $W\equiv\Sigma^{-1}$, and taking the negative log, this is exactly the expression for the GMM objective! By the asymptotic normality, this explains why one would want to set $W\equiv\Omega^{-1}$ in order to achieve statistical efficiency.
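The correspondence can be checked numerically (a sketch with made-up numbers): with $g(Y,\theta)=Y-\mu$ and $W=\Sigma^{-1}$, the GMM quadratic form equals $-2$ times the normal log-density once its normalizing constant is stripped off.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up mean, covariance, and observation for the check.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([0.3, -1.1])

# GMM-style objective for a single observation: g^T W g with W = Sigma^{-1}.
g = y - mu
W = np.linalg.inv(Sigma)
quad_form = g @ W @ g

# Recover the same quantity from the log-density by removing the constant
# -0.5 * (d log(2 pi) + log det Sigma).
d = len(mu)
log_density = multivariate_normal(mean=mu, cov=Sigma).logpdf(y)
recovered = -2.0 * log_density - d * np.log(2.0 * np.pi) - np.log(np.linalg.det(Sigma))

print(np.isclose(recovered, quad_form))  # True
```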

¹ The standard assumptions can be found in Hall (2005). In practice they will almost always be satisfied, e.g., compact parameter space, $g$ is continuously differentiable in a neighborhood of $\theta^*$, output of $g$ is never infinite, etc.

 Alastair Hall. Generalized Method of Moments (Advanced Texts in Econometrics). Oxford University Press, 2005.