moremata
/
mf_mm_density.sthlp

{smcl}
{* 17aug2020}{...}
{cmd:help mata mm_density()}
{hline}

{title:Title}

{pstd}
    {bf:mm_density() -- Density estimation (Stata 11 required)}


{title:Syntax}

{dlgtab:Initialize density estimation object}

        {cmd:class mm_density scalar} {it:D}
        or
        {it:D} {cmd:= mm_density()}

{dlgtab:Setup}

{pstd}Provide data

        {it:D}{cmd:.data(}{it:X} [{cmd:,} {it:w}{cmd:,} {it:pw}{cmd:,} {it:sorted}]{cmd:)}

{p 12 14 2}{it:X} is a {it:real colvector} containing data (missing values not allowed)

{p 12 14 2}{it:w} is a {it:real colvector} containing weights (missing or negative values not allowed); default is {it:w} = 1 (no weights)

{p 12 14 2}{it:pw}!=0 indicates that weights are sampling weights; default is {it:pw} = 0

{p 12 14 2}{it:sorted}!=0 indicates that {it:X} is sorted and non-missing, that {it:w} is non-missing and
non-negative, and that {it:X} does not contain values that are outside of the support defined by
{it:D}{cmd:.support()}; default is {it:sorted} = 0; specify {it:sorted}!=0 to save some computer time
if you know that these conditions are true

{pstd}Choose kernel

        {it:D}{cmd:.kernel(}{it:kernel} [{cmd:,} {it:adapt}]{cmd:)}

{p 12 14 2}{it:kernel} is a {it:string scalar} specifying the kernel name; available
kernels are {cmd:"epanechnikov"},
{cmd:"epan2"}, {cmd:"biweight"}, {cmd:"triweight"},
{cmd:"cosine"}, {cmd:"gaussian"}, {cmd:"parzen"},
{cmd:"rectangle"}, and {cmd:"triangle"}
(abbreviations allowed); default is {cmd:"gaussian"}; specifying {cmd:""}
(empty string) selects the default

{p 12 14 2}{it:adapt} is a {it:real scalar} specifying the number of stages of the adaptive
kernel density estimator; {it:adapt} may not be negative; default is {it:adapt} = 0
(non-adaptive kernel); specifying {cmd:.} (missing) selects the default

{pstd}Set bandwidth or choose bandwidth selection method

        {it:D}{cmd:.bw(}{it:bw} [{cmd:,} {it:adjust}{cmd:,} {it:dpi}{cmd:,} {it:qui}]{cmd:)}

{p 12 14 2}{it:bw} is a (strictly positive) {it:real scalar} specifying the value
of the bandwidth or a {it:string scalar} specifying the bandwidth selection
rule; available rules are {cmd:"silverman"} (optimal of Silverman),
{cmd:"normalscale"} (normal scale rule), {cmd:"oversmoothed"} (oversmoothed rule),
{cmd:"sjpi"} (Sheather-Jones solve-the-equation plug-in), {cmd:"dpi"}
(Sheather-Jones direct plug-in), and {bf:"isj"} (bandwidth based on
diffusion estimator); default is {cmd:"sjpi"}; specifying {cmd:""} (empty string)
or {cmd:.} (missing) selects the default

{p 12 14 2}{it:adjust} is a {it:real scalar} specifying a multiplication factor
for the bandwidth; {it:adjust} must be strictly positive; default is
{it:adjust} = 1; specifying {cmd:.} (missing) selects the default

{p 12 14 2}{it:dpi} is a {it:real scalar} specifying the number of stages of functional
estimation for the {cmd:"dpi"} rule; {it:dpi} may not be negative; default is
{it:dpi} = 2; specifying {cmd:.} (missing) selects the default

{p 12 14 2}{it:qui}!=0 suppresses the warning message that is displayed if
SJPI or ISJ fails; default is {it:qui} = 0

{pstd}Define support of data and choose boundary correction method

        {it:D}{cmd:.support(}{it:minmax} [{cmd:,} {it:method}{cmd:,} {it:rd}]{cmd:)}

{p 12 14 2}{it:minmax} is a {it:real vector} specifying the lower and upper bounds
of the support of {it:X}; specify one value (lower boundary only) or
two values (lower and upper boundary); value {cmd:.} (missing) is interpreted as
(minus) infinity (i.e. unbounded); default is {it:minmax} = {cmd:(.,.)}
(unbounded support)

{p 12 14 2}{it:method} is a {it:string scalar} specifying the boundary-correction
method to be used; available methods are {cmd:"renormalization"},
{cmd:"reflection"}, and {bind:{cmd:"linear correction"}} (abbreviations allowed); default
is {cmd:"renormalization"}; specifying {cmd:""}
(empty string) selects the default

{p 12 14 2}{it:rd}!=0 indicates that the data is to be interpreted as relative data (relative ranks); {it:minmax} defaults to
{cmd:(0,1)} in this case and automatic bandwidth selection is modified; values outside [0,1] are not allowed
in {it:minmax} if {it:rd}!=0; default is {it:rd} = 0

{pstd}Set grid size of approximation estimator

        {it:D}{cmd:.n(}{it:n} [{cmd:,} {it:pad}]{cmd:)}

{p 12 14 2}{it:n} is a {it:real scalar} specifying the grid size used by the fast approximation
estimator (and the bandwidth selectors); default
is {it:n} = 2^10 = 1024; specifying {cmd:.} (missing) selects the default

{p 12 14 2}{it:pad} is a {it:real scalar} specifying the padding proportion of the approximation
grid; {it:pad} may not be negative; the default is {it:pad} = 0.1, which means that the grid will be padded
by 10 percent of the observed data range on each side of the data (unless the support of the
data has been restricted by {it:D}{cmd:.support()}, in which case the approximation grid
will span the defined support); specifying {cmd:.} (missing) selects the default

{pstd}Retrieve settings

{p2colset 9 30 32 2}{...}
{p2col:{it:X}{space 5} = {it:D}{cmd:.X()}}{it:real colvector} containing data
    {p_end}
{p2col:{it:w}{space 5} =  {it:D}{cmd:.w()}}{it:real colvector} containing weights
    {p_end}
{p2col:{it:nobs}{space 2} =  {it:D}{cmd:.nobs()}}{it:real scalar} containing number of observations (sum of weights)
    {p_end}
{p2col:{it:pw}{space 4} =  {it:D}{cmd:.pw()}}{it:real scalar} containing {it:pw} flag
    {p_end}
{p2col:{it:sorted} =  {it:D}{cmd:.sorted()}}{it:real scalar} containing {it:sorted} flag
    {p_end}
{p2col:{it:kernel} =  {it:D}{cmd:.kernel()}}{it:string scalar} containing kernel name
    {p_end}
{p2col:{it:adapt}{space 1} =  {it:D}{cmd:.adapt()}}{it:real scalar} containing number of stages of adaptive estimator
    {p_end}
{p2col:{it:kh}{space 4} =  {it:D}{cmd:.kh()}}{it:real scalar} containing canonical bandwidth of kernel
    {p_end}
{p2col:{it:bw}{space 4} =  {it:D}{cmd:.bw()}}{it:string scalar} containing name of bandwidth selection rule or {it:real scalar} containing user-provided bandwidth
    {p_end}
{p2col:{it:adjust} =  {it:D}{cmd:.adjust()}}{it:real scalar} containing bandwidth adjustment factor
    {p_end}
{p2col:{it:dpi}{space 3} =  {it:D}{cmd:.dpi()}}{it:real scalar} containing number of DPI levels
    {p_end}
{p2col:{it:minmax} =  {it:D}{cmd:.support()}}{it:real rowvector} containing lower and upper bound of support
    {p_end}
{p2col:{it:lb}{space 4} =  {it:D}{cmd:.lb()}}{it:real scalar} containing lower bound of support
    {p_end}
{p2col:{it:ub}{space 4} =  {it:D}{cmd:.ub()}}{it:real scalar} containing upper bound of support
    {p_end}
{p2col:{it:bc}{space 4} =  {it:D}{cmd:.bc()}}{it:string scalar} containing name of boundary-correction method
    {p_end}
{p2col:{it:rd}{space 4} =  {it:D}{cmd:.rd()}}{it:real scalar} containing {it:rd} flag
    {p_end}
{p2col:{it:n}{space 5} =  {it:D}{cmd:.n()}}{it:real scalar} containing grid size of approximation estimator
    {p_end}
{p2col:{it:pad}{space 3} =  {it:D}{cmd:.pad()}}{it:real scalar} containing padding proportion of approximation grid
    {p_end}

{pstd}
    Technical remarks

{phang}
{bind:  o }Changing any of the above settings will clear all results that may already exist in {it:D}.

{phang}
{bind:  o }To preserve memory, {it:D}{cmd:.data()} does not store a copy of the
    data in {it:D}; it only stores pointers to {it:X} and {it:w}. Modifying {it:X}
    or {it:w} after {it:D}{cmd:.data()} has been applied will change the data used
    by {cmd:mm_density()}, but it will not clear existing results in
    {it:D}. This may lead to inconsistent results. It is best not to modify
    {it:X} and {it:w} as long as {it:D} is active.

{dlgtab:Estimation}

{pstd}Compute density estimate

        {it:d} = {it:D}{cmd:.d(}{it:at} [{cmd:,} {it:exact}]{cmd:)}
        or
        {it:d} = {it:D}{cmd:.d(}k{cmd:,} {it:from}{cmd:,} {it:to} [{cmd:,} {it:exact}]{cmd:)}

{p 12 14 2}{it:at} is a {it:real vector} specifying custom values at which the density is to be estimated

{p 12 14 2}{it:k} is a {it:real scalar} specifying the size of the grid over which the density is to be estimated
(regular grid with {it:k} points)

{p 12 14 2}{it:from} is a {it:real scalar} specifying the lowest value of the
grid; specify {it:from} = {cmd:.} (missing) to determine the value automatically

{p 12 14 2}{it:to} is a {it:real scalar} specifying the largest value of the
grid; specify {it:to} = {cmd:.} (missing) to determine the value automatically

{p 12 14 2}{it:exact}!=0 requests the exact density estimator to be used; default is
{it:exact} = 0 (approximation estimator)

{pstd}
    Retrieve results after estimation

{p2colset 9 22 24 2}{...}
{p2col:{it:d}{space 1} = {it:D}{cmd:.d()}}{it:real colvector} containing density estimate fitted last
    {p_end}
{p2col:{it:at} = {it:D}{cmd:.at()}}{it:real colvector} containing evaluation points of density estimate
    {p_end}

{pstd}
    Self-threading functions

{p2colset 9 22 24 2}{...}
{p2col:{it:h}{space 1} = {it:D}{cmd:.h()}}{it:real scalar} containing bandwidth
    {p_end}
{p2col:{it:D}{space 1} = {it:D}{cmd:.D()}}{it:real colvector} containing full-grid approximation estimator
    {p_end}
{p2col:{it:AT} = {it:D}{cmd:.AT()}}{it:real colvector} containing evaluation grid of approximation estimator
    {p_end}
{p2col:{it:W}{space 1} = {it:D}{cmd:.W()}}{it:real colvector} containing grid counts of approximation estimator
    {p_end}
{p2col:{it:L}{space 1} = {it:D}{cmd:.L()}}{it:real colvector} containing local bandwidth factors of approximation estimator
    {p_end}
{p2col:{it:l}{space 1} = {it:D}{cmd:.l()}}{it:real colvector} containing local bandwidth factors of exact estimator
    {p_end}

{pmore}
    The self-threading functions can be used irrespective of whether {it:D}{cmd:.d()} has been applied or
    not; results will be the same.

{dlgtab:Other functions}

{pstd}Obtain levels or first derivatives of kernel function using current settings
(including boundary correction)

        {it:levels}      {cmd:=} {it:D}{cmd:.K(}{it:X}{cmd:,} {it:at}{cmd:,} {it:h}{cmd:)}
        {it:derivatives} {cmd:=} {it:D}{cmd:.Kd(}{it:X}{cmd:,} {it:at}{cmd:,} {it:h}{cmd:)}

{p 12 14 2}{it:X} is a {it:real colvector} containing the points at which the levels or derivatives will be evaluated

{p 12 14 2}{it:at} is a {it:real colvector} containing kernel locations (midpoints)

{p 12 14 2}{it:h} is a {it:real colvector} containing kernel bandwidths

{p 12 14 2}{it:X}, {it:at}, and {it:h} are required to be r-conformable
(see {helpb m6_glossary##r-conformability:{bind:[M-6] glossary}}).


{title:Description}

{pstd}
    {cmd:mm_density()} is a class-based system for univariate kernel density
    estimation. Stata 11 or newer is required.


{title:Examples}

{pstd}
    Density estimate with default settings:

        . {stata "mata:"}
        : {stata x = rnormal(1000, 1, 0, 1)}
        : {stata S = mm_density()}
        : {stata S.data(x)}
        : {stata S.d(13, -3, 3), S.at()}
        : {stata end}

{pstd}
    Display full grid approximation estimator:

        . {stata "mata:"}
        : {stata x = rnormal(50000,1, 0, 1)  \ rnormal(50000,1, 4, 1.5)}
        : {stata S = mm_density()}
        : {stata S.data(x)}
        : {stata mm_plot((S.D(), S.AT()), "line")}
        : {stata end}

{pstd}
    Compute bandwidth (without estimating the density):

        . {stata "mata:"}
        : {stata x = rnormal(1000,1, 0, 1)  \ rnormal(1000,1, 4, 1.5)}
        : {stata S = mm_density()}
        : {stata S.data(x)}
        : {stata S.bw("sj");     S.h()}  // Sheather-Jones solve-the-equation
        : {stata S.bw("dpi");    S.h()}  // Sheather-Jones direct plug-in
        : {stata S.bw("isj");    S.h()}  // diffusion estimator bandwidth
        : {stata S.bw("silver"); S.h()}  // optimal of Silverman
        : {stata S.bw("over");   S.h()}  // oversmooothed rule
        : {stata S.bw("normal"); S.h()}  // normal scale rule
        : {stata end}


{title:Source code}

{pstd}
    {help moremata11_source##mm_density:mm_density.mata}


{title:Methods and Formulas}

{pstd}
For a brief overview of the methods used by {cmd:mm_density()} see
{browse "http://boris.unibe.ch/69421/2/kdens.pdf":Jann (2007)}. A book-length
treatment can be found in Wand and Jones (1995). Not covered in these
references is the improved Sheather-Jones bandwidth selector based on diffusion
methods; see {browse "http://doi.org/10.1214/10-AOS799":Botev et al. (2010)}.


{title:References}

{phang}
    Botev, Z.I., J.F. Grotowski, and D.P. Kroese (2010). Kernel density
    estimation via diffusion. Annals of Statistics
    38(5): 2916-2957. DOI: {browse "http://doi.org/10.1214/10-AOS799":10.1214/10-AOS799}.
    {p_end}
{phang}
    Jann, B. (2007). Univariate kernel density
    estimation. DOI: {browse "http://boris.unibe.ch/69421/2/kdens.pdf":10.7892/boris.69421}.
    {p_end}
{phang}
    Wand, M.P., M.C. Jones (1995). Kernel Smoothing. London: Chapman and Hall.
    {p_end}


{title:Author}

{pstd} Ben Jann, University of Bern, ben.jann@soz.unibe.ch


{title:Also see}

{psee}
Online:  help for
{helpb mf_mm_ddens:mm_ddens()},
{helpb mf_mm_ecdf:mm_ecdf()},
{helpb mf_mm_histogram:mm_histogram()},
{helpb kdensity},
{helpb histogram},
{helpb moremata}
{p_end}