Let $Y_1,\ldots,Y_N$ be $(d+1)$-dimensional observations (each collecting the covariate $X_n\in\mathbb{R}^d$ and a scalar response for shorthand) generated from some model with unknown parameters $\theta\in\Theta$.

Goal: Find the “true” parameters $\theta^*\in\Theta$.

Intuition: The idea is to find a set of $k$ constraints, or “moments”, involving the parameters $\theta$. What makes GMM nice is that you need no information per se about how the model depends on $\theta$. The likelihood can certainly be used to construct the moments (special case: maximum likelihood estimation (MLE), where the score equations serve as the moments), but one can also use, for example, statistical moments (special case: method of moments (MoM)) as the constraints. Analogously, tensor decompositions are used as the constraints in the case of spectral methods.

More formally, the $k$ moment conditions for a vector-valued function $g(Y,\cdot):\Theta\to\mathbb{R}^k$ are

$$m(\theta^*) \equiv \mathbb{E}[g(Y,\theta^*)] = 0_{k\times 1},$$

where $0_{k\times 1}$ is the $k\times 1$ zero vector.
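For example (an illustration of mine, not from the text): if the model is $Y\sim\mathcal{N}(\mu,\sigma^2)$ with $\theta=(\mu,\sigma^2)$, the first two statistical moments give the MoM-style conditions

$$g(Y,\theta) = \begin{pmatrix} Y-\mu \\ (Y-\mu)^2-\sigma^2 \end{pmatrix}, \qquad \mathbb{E}[g(Y,\theta^*)] = 0_{2\times 1},$$

since $\mathbb{E}[Y]=\mu^*$ and $\mathbb{E}[(Y-\mu^*)^2]=(\sigma^*)^2$ under the true parameters.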

As we cannot analytically derive the expectation for arbitrary $g$, we use the sample moments instead:

$$\hat m(\theta) \equiv \frac{1}{N}\sum_{n=1}^N g(Y_n,\theta)$$

By the law of large numbers, $\hat m(\theta)\to m(\theta)$ as $N\to\infty$, so the problem is to find the $\theta$ which sets $\hat m(\theta)$ to zero.
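As a concrete illustration, here is a minimal sketch of the sample moments in Python for the hypothetical normal example above; `g`, `m_hat`, and the simulated data are my own names, not anything from the text:

```python
import numpy as np

# Hypothetical normal example: Y ~ N(mu, sigma^2), theta = (mu, sigma^2),
# with moment conditions built from the first two statistical moments.
def g(y, theta):
    """Moment function g(Y, theta), a vector in R^2."""
    mu, sigma2 = theta
    return np.array([y - mu, (y - mu) ** 2 - sigma2])

def m_hat(theta, Y):
    """Sample moments: the average of g(Y_n, theta) over the data."""
    return np.mean([g(y, theta) for y in Y], axis=0)

# Simulated data; the sample moments are near the zero vector at the true theta.
rng = np.random.default_rng(0)
Y = rng.normal(loc=1.0, scale=2.0, size=1000)
print(m_hat((1.0, 4.0), Y))
```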

Cases:

  • $\Theta\supset\mathbb{R}^k$, i.e., there are more parameters than moment conditions: The model is not identifiable. This is the standard scenario in ordinary least squares (OLS) when there are more covariates than observations, so that no unique set of parameters $\theta$ exists. Solve this by simply constructing more moments!
  • $\Theta=\mathbb{R}^k$, i.e., there are as many parameters as moment conditions: There exists a unique solution.
  • $\Theta\subset\mathbb{R}^k$, i.e., there are fewer parameters than moment conditions: The parameters are over-identified and the best we can do is to minimize $m(\theta)$ in some sense instead of solving $m(\theta)=0$.

Consider the last scenario: we aim to minimize $\hat m(\theta)$ in some way, say $\|\hat m(\theta)\|$ for some choice of $\|\cdot\|$. We define the weighted norm as

$$\|\hat m(\theta)\|_W^2 \equiv \hat m(\theta)^T W \hat m(\theta),$$

where $W$ is a positive definite matrix.

The generalized method of moments (GMM) procedure is to find

$$\hat\theta = \operatorname*{arg\,min}_{\theta\in\Theta} \left(\frac{1}{N}\sum_{n=1}^N g(Y_n,\theta)\right)^T W \left(\frac{1}{N}\sum_{n=1}^N g(Y_n,\theta)\right).$$

Note that while the motivation is the over-identified case $\Theta\subset\mathbb{R}^k$, this is guaranteed to work for $\Theta=\mathbb{R}^k$ too, since the minimizer there is exactly the unique solution of $\hat m(\theta)=0$. Hence it is a generalized method of moments.
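Here is a minimal sketch of this minimization in Python, continuing the hypothetical normal example (it reuses `g`, `m_hat`, and `Y` from the earlier snippet). `scipy.optimize.minimize` stands in for whatever optimizer one prefers, and the identity matrix is used as a simple valid choice of $W$:

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, Y, W):
    """The GMM objective: the weighted quadratic form m_hat(theta)^T W m_hat(theta)."""
    m = m_hat(theta, Y)
    return m @ W @ m

# Any positive definite W gives a consistent estimator; identity is the simplest.
W = np.eye(2)
result = minimize(gmm_objective, x0=np.array([0.0, 1.0]), args=(Y, W),
                  method="Nelder-Mead")
theta_hat = result.x  # close to (1.0, 4.0) on the simulated data
print(theta_hat)
```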

Theorem. Under standard assumptions¹, the estimator $\hat\theta$ is consistent and asymptotically normal. Furthermore, if

$$W \propto \Omega^{-1}\equiv\mathbb{E}[g(Y_n,\theta^*)g(Y_n,\theta^*)^T]^{-1},$$

then $\hat\theta$ is asymptotically optimal, i.e., it achieves the smallest asymptotic variance attainable from these moment conditions (the analogue of the Cramér–Rao lower bound for this class of estimators).

Note that $\Omega$ is the covariance matrix of $g(Y_n,\theta^*)$ and $\Omega^{-1}$ its precision. Thus GMM weights the moment conditions in the estimator $\hat\theta$ according to how much “error” remains in $g(Y,\cdot)$ at $\theta^*$ (that is, how far away $g(Y,\cdot)$ is from $0$): noisier moments receive less weight.
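In practice $\Omega$ depends on the unknown $\theta^*$, so a standard recipe (see [1]) is two-step GMM: get a preliminary estimate with $W = I$, use it to form an empirical estimate $\hat\Omega$, and re-minimize with $W = \hat\Omega^{-1}$. A sketch, continuing the snippets above:

```python
# Step 1: preliminary estimate with the identity weight matrix.
theta_1 = minimize(gmm_objective, x0=np.array([0.0, 1.0]),
                   args=(Y, np.eye(2)), method="Nelder-Mead").x

# Step 2: plug theta_1 into g to estimate Omega = E[g g^T], then reweight.
G = np.stack([g(y, theta_1) for y in Y])  # N x k matrix of moment evaluations
Omega_hat = G.T @ G / len(Y)              # empirical second-moment matrix of g
W_opt = np.linalg.inv(Omega_hat)          # estimated optimal (precision) weight

theta_2 = minimize(gmm_objective, x0=theta_1,
                   args=(Y, W_opt), method="Nelder-Mead").x
print(theta_2)
```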

I haven’t seen anyone make this remark before, but the GMM estimator can also be viewed as minimizing the negative log density of a multivariate normal. Recall that the multivariate normal density is proportional to

$$\exp\Big(-\tfrac{1}{2}(Y_n-\mu)^T\Sigma^{-1}(Y_n-\mu)\Big).$$

Setting $g(Y_n,\theta)\equiv Y_n-\mu$, $W\equiv\Sigma^{-1}$, and taking the negative log, this is (up to constants) exactly the expression appearing in the GMM objective! By the asymptotic normality of the sample moments, this explains why one would want to set $W\propto\Omega^{-1}$ (here $\Sigma^{-1}$) in order to achieve statistical efficiency.
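One way to make the correspondence precise (a short derivation of my own, using only the setup above plus the fact that $\bar Y\sim\mathcal{N}(\mu,\Sigma/N)$ under the normal model): with $g(Y_n,\theta)\equiv Y_n-\mu$ we have $\hat m(\theta)=\bar Y-\mu$, so

$$-\log p(\bar Y\mid\mu,\Sigma) = \frac{N}{2}(\bar Y-\mu)^T\Sigma^{-1}(\bar Y-\mu) + \text{const} = \frac{N}{2}\,\|\hat m(\theta)\|_{\Sigma^{-1}}^2 + \text{const}.$$

That is, minimizing the GMM objective with $W=\Sigma^{-1}$ is the same as maximizing the normal likelihood of the sample mean. More generally, $\sqrt{N}\,\hat m(\theta^*)$ is asymptotically $\mathcal{N}(0,\Omega)$, which is what makes $W\propto\Omega^{-1}$ the natural weight.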

¹ The standard assumptions can be found in [1]. In practice they will almost always be satisfied, e.g., compact parameter space, $g$ is continuously differentiable in a neighborhood of $\theta^*$, the output of $g$ is never infinite, etc.

References

[1] Alastair Hall. Generalized Method of Moments (Advanced Texts in Econometrics). Oxford University Press, 2005.