Let $x_1, \ldots, x_N$ be $d$-dimensional observations (collecting the covariate within each response for shorthand) generated from some model with unknown parameters $\theta \in \mathbb{R}^p$.

Goal: Find the “true” parameters $\theta^*$.

Intuition: The idea is to find a set of constraints, or “moments”, involving the parameters $\theta$. What makes GMM nice is that you need no information per se about how the model depends on $\theta$. Certainly likelihoods can be used to construct moments (special case: maximum likelihood estimation (MLE)), but one can use, for example, statistical moments (special case: method of moments (MoM)) as the constraints. Analogously, tensor decompositions are used in the case of spectral methods.

More formally, the moment conditions for a vector-valued function $g(x, \theta) \in \mathbb{R}^m$ are

$$\mathbb{E}[g(x, \theta^*)] = 0,$$

where $0$ is the $m$-dimensional zero vector.
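As a concrete instance (a standard example, not taken from the text above): in linear regression with response $y$ and covariates $x$, the orthogonality conditions

$$\mathbb{E}\left[x \, (y - x^\top \beta^*)\right] = 0$$

give exactly one moment condition per coordinate of $\beta$.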

As we cannot analytically derive the expectation for arbitrary $g$, we use the sample moments instead:

$$\hat{g}(\theta) \equiv \frac{1}{N} \sum_{n=1}^N g(x_n, \theta).$$
By the Law of Large Numbers, $\hat{g}(\theta) \to \mathbb{E}[g(x, \theta)]$, so the problem is thus to find the $\theta$ which sets $\hat{g}(\theta)$ to zero.
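To make the sample moments concrete, here is a minimal sketch (my own example, not from the text): an exponential model with mean $\theta$, for which $\mathbb{E}[x] = \theta$ and $\mathbb{E}[x^2] = 2\theta^2$ give $m = 2$ moment conditions.

```python
import numpy as np

def g(x, theta):
    # Moment function for an Exponential model with mean theta:
    # E[x] - theta = 0 and E[x^2] - 2*theta^2 = 0 at the true theta.
    return np.stack([x - theta, x**2 - 2 * theta**2])  # shape (m=2, N)

def g_hat(x, theta):
    # Sample moments: average each row of g over the N observations.
    return g(x, theta).mean(axis=1)  # shape (m,)

rng = np.random.default_rng(0)
x = rng.exponential(scale=3.0, size=100_000)  # true theta = 3
print(g_hat(x, 3.0))  # both entries near zero, per the law of large numbers
```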


Three scenarios arise, depending on how the number of moments $m$ compares to the number of parameters $p$:

  • $m < p$, i.e., there are more parameters than moment conditions: The model is not identifiable. This is the standard scenario in ordinary least squares (OLS) when there are more covariates than observations, so no unique set of parameters exists. Solve this by simply constructing more moments!
  • $m = p$: There exists a unique solution.
  • $m > p$, i.e., there are fewer parameters than moment conditions: The system is overdetermined and the best we can do is to minimize $\|\hat{g}(\theta)\|$ instead of solving $\hat{g}(\theta) = 0$.

Consider the last scenario: we aim to minimize $\hat{g}(\theta)$ in some way, say $\|\hat{g}(\theta)\|_W^2$ for some choice of $W$. We define the weighted norm as

$$\|\hat{g}(\theta)\|_W^2 \equiv \hat{g}(\theta)^\top W \hat{g}(\theta),$$

where $W$ is a positive definite matrix.

The generalized method of moments (GMM) procedure is to find

$$\hat{\theta} = \arg\min_{\theta} \left(\frac{1}{N} \sum_{n=1}^N g(x_n, \theta)\right)^\top W \left(\frac{1}{N} \sum_{n=1}^N g(x_n, \theta)\right).$$

Note that while the motivation is for $m > p$, by uniqueness of the solution this is guaranteed to work for $m = p$ too. Hence it is a generalized method of moments.
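Continuing the hypothetical exponential example ($m = 2$ moments, $p = 1$ parameter, i.e., the overdetermined case), here is a brute-force sketch of the procedure with $W = I$; a grid search stands in for a proper numerical optimizer.

```python
import numpy as np

def g_hat(x, theta):
    # Sample moments for an Exponential model with mean theta.
    return np.array([np.mean(x) - theta, np.mean(x**2) - 2 * theta**2])

def gmm_objective(x, theta, W):
    # Weighted norm ||g_hat(theta)||_W^2 = g_hat' W g_hat.
    gh = g_hat(x, theta)
    return gh @ W @ gh

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=50_000)  # true theta = 3

W = np.eye(2)
grid = np.linspace(0.1, 10.0, 2000)
theta_hat = grid[np.argmin([gmm_objective(x, t, W) for t in grid])]
print(theta_hat)  # close to the true value 3
```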

Theorem. Under standard assumptions¹, the estimator $\hat{\theta}$ is consistent and asymptotically normal. Furthermore, if

$$W \propto \Omega^{-1} \equiv \mathbb{E}\left[g(x, \theta^*) \, g(x, \theta^*)^\top\right]^{-1},$$

then $\hat{\theta}$ is asymptotically optimal, i.e., it achieves the Cramér–Rao lower bound.

Note that $\Omega$ is the covariance matrix of $g(x, \theta^*)$ and $\Omega^{-1}$ the precision. Thus the GMM weights each moment condition in the estimator $\hat{\theta}$ depending on how much “error” remains in that component of $g(x, \theta)$ (that is, how far away $g(x, \theta)$ is from $0$).
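In practice $\Omega$ depends on the unknown $\theta^*$, so a common recipe is two-step GMM: first estimate $\theta$ with $W = I$, then estimate $\Omega$ from the resulting moments and re-minimize with $W = \hat{\Omega}^{-1}$. A sketch, reusing the hypothetical exponential example:

```python
import numpy as np

def g(x, theta):
    # Moment function for an Exponential model with mean theta, shape (2, N).
    return np.stack([x - theta, x**2 - 2 * theta**2])

def gmm_estimate(x, W, grid):
    # Minimize the weighted norm ||g_hat(theta)||_W^2 over a grid of candidates.
    def objective(t):
        gh = g(x, t).mean(axis=1)
        return gh @ W @ gh
    return grid[np.argmin([objective(t) for t in grid])]

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=50_000)  # true theta = 3
grid = np.linspace(0.1, 10.0, 2000)

theta1 = gmm_estimate(x, np.eye(2), grid)   # step 1: identity weights
G = g(x, theta1)
Omega_hat = (G @ G.T) / x.size              # estimate Omega = E[g g'] at theta1
theta2 = gmm_estimate(x, np.linalg.inv(Omega_hat), grid)  # step 2: W = Omega^{-1}
print(theta1, theta2)
```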

I haven’t seen anyone make this remark before, but the GMM estimator can also be viewed as minimizing the negative log-density of a multivariate normal. Recall that the multivariate normal density is proportional to

$$\exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).$$

Setting $x = \hat{g}(\theta)$, $\mu = 0$, $\Sigma^{-1} = W$, and taking the negative log, this is exactly the expression for the GMM! By the asymptotic normality, this explains why one would want to set $W = \Omega^{-1}$ in order to achieve statistical efficiency.
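Spelling this out (my algebra, under the substitutions above): the negative log-density evaluated at the sample moments is

$$\frac{1}{2} \, \hat{g}(\theta)^\top W \, \hat{g}(\theta) + \text{const} = \frac{1}{2} \, \|\hat{g}(\theta)\|_W^2 + \text{const},$$

so minimizing the GMM objective is equivalent to maximizing a normal log-density of the sample moments centered at $0$ with covariance $W^{-1}$. Since $\hat{g}(\theta^*)$ is asymptotically normal with covariance proportional to $\Omega$, the choice $W = \Omega^{-1}$ matches this density to the true sampling distribution of the moments.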

¹ The standard assumptions can be found in [1]. In practice they will almost always be satisfied, e.g., compact parameter space, $g$ is continuously differentiable in a neighborhood of $\theta^*$, the output of $g$ is never infinite, etc.


[1] Alastair Hall. Generalized Method of Moments (Advanced Texts in Econometrics). Oxford University Press, 2005.