fpsden() R function from [evmix]

MLE Fitting of P-splines Density Estimator

Maximum likelihood estimation for P-splines density estimation. Histogram binning produces frequency counts, which are modelled by constrained B-splines in a Poisson regression. A penalty based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts. Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression, conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients. Leave-one-out cross-validation deviances are available for estimation of the penalty coefficient.


fpsden(x, lambdaseq = NULL, breaks = NULL, xrange = NULL,
  nseg = 10, degree = 3, design.knots = NULL, ord = 2)

lpsden(x, beta = NULL, bsplines = NULL, nbinwidth = 1, log = TRUE)

nlpsden(pvector, x, bsplines = NULL, nbinwidth = 1,
  finitelik = FALSE)

cvpsden(lambda = 1, counts, bsplines, ord = 2)

iwlspsden(counts, bsplines, ord = 2, lambda = 10)

Arguments

x: quantiles
lambdaseq: vector of $\lambda$ 's (or scalar) to be considered in profile likelihood. Required.
breaks: histogram breaks (as in hist function)
xrange: vector of minimum and maximum of B-spline (support of density)
nseg: number of segments between knots
degree: degree of B-splines (0 is constant, 1 is linear, etc.)
design.knots: spline knots for splineDesign function
ord: order of difference used in the penalty term
beta: vector of B-spline coefficients (required)
bsplines: matrix of B-splines
nbinwidth: scaling to convert count frequency into proper density
log: logical, if TRUE then log density
pvector: vector of initial values of GPD parameters (sigmau, xi) or NULL
finitelik: logical, should log-likelihood return finite value for invalid parameters
lambda: penalty coefficient
counts: counts from histogram binning

Returns

Log-likelihood for original data is given by lpsden and it's wrappers for negative log-likelihood from nlpsden. Cross-validation sum of square of errors is provided by cvpsden. Poisson regression fitting by IWLS is carried out in iwlspsden. Fitting function fpsden returns a simple list with the following elements


`call` :	`optim` call
`x` :	data vector `x`
`xrange` :	range of support of B-splines
`degree` :	degree of B-splines
`nseg` :	number of internal segments
`design.knots` :	knots used in `splineDesign`
`ord` :	order of penalty term
`binned` :	histogram results
`breaks` :	histogram breaks
`mids` :	histogram mid-bins
`counts` :	histogram counts
`nbinwidth` :	scaling factor to convert counts to density
`bsplines` :	B-splines matrix used for binned counts
`databsplines` :	B-splines matrix used for data
`counts` :	histogram counts
`lambdaseq` :	$\lambda$ vector for profile likelihood or scalar for fixed $\lambda$
`cvlambda` :	CV MSE for each $\lambda$
`mle` and `beta` :	vector of MLE of coefficients
`nllh` :	negative log-likelihood for original data
`n` :	total original sample size
`lambda` :	Estimated or fixed $\lambda$

Details

The P-splines density estimator is fitted using maximum likelihood estimation, following the approach of Eilers and Marx (1996). Histogram binning produces frequency counts, which are modelled by constrained B-splines in a Poisson regression. A penalty based on differences in the sequences B-spline coefficients is used to smooth/interpolate the counts.

The B-splines are defined as in Eiler and Marx (1996), so that those are meet the boundary are simply shifted and truncated version of the internal B-splines. No renormalisation is carried out. They are not "natural" B-spline which are also commonly in use. Note that atural B-splines can be obtained by suitable linear combinations of these B-splines. Hence, in practice there is little difference in the fit obtained from either B-spline definition, even with the penalty constraining the coefficients. If the user desires they can force the use of natural B-splines, by prior specification of the design.knots

with appropriate replication of the boundaries, see dpsden.

Iterated weighted least squares (IWLS) for a mixed model representation of the P-splines regression, conditional on a particular penalty coefficient, is used for estimating the B-spline coefficients which is equivalent to maximum likelihood estimation. Leave-one-out cross-validation deviances are available for estimation of the penalty coefficient.

The parameter vector is the B-spline coefficients beta, no matter whether the penalty coefficient is fixed or estimated. The penalty coefficient lambda is treated separately.

The log-likelihood functions lpsden and nlpsden

evaluate the likelihood for the original dataset, using the fitted P-splines density estimator. The log-likelihood is output as nllh from the fitting function fpsden. They do not provide the likelihood for the Poisson regression of the histogram counts, which is usually evaluated using the deviance. The deviance (via CVMSE for Poisson counts) is also output as cvlambda

from the fitting function fpsden.

The iwlspsden function performs the IWLS. The cvpsden function calculates the leave-one-out cross-validation sum of the squared errors. They are not designed to be used directly by users. No checks of the inputs are carried out.

Note

The data are both vectors. Infinite and missing sample values are dropped.

No initial values for the coefficients are needed.

It is advised to specify the range of support xrange, using finite end-points. This is especially important when the support is bounded. By default xrange is simply the range of the input data range(x).

Further, it is advised to always set the histogram bin breaks, expecially if the support is bounded. By default 10*ln(n) equi-spaced bins are defined between xrange.

Acknowledgments

The Poisson regression and leave-one-out cross-validation functions are based on the code of Eilers and Marx (1996) available from Brian Marx's website http://statweb.lsu.edu/faculty/marx/, which is gratefully acknowledged.

Examples


## Not run:

set.seed(1)
par(mfrow = c(1, 1))

x = rnorm(1000)
xx = seq(-4, 4, 0.01)
y = dnorm(xx)

# Plenty of histogram bins (100)
breaks = seq(-4, 4, length.out=101)

# P-spline fitting with cubic B-splines, 2nd order penalty and 10 internal segments
# CV search for penalty coefficient. 
fit = fpsden(x, lambdaseq = 10^seq(-5, 5, 0.25), breaks = breaks,
             xrange = c(-4, 4), nseg = 10, degree = 3, ord = 2)
psdensity = exp(fit$bsplines %*% fit$mle)

hist(x, freq = FALSE, breaks = seq(-4, 4, length.out=101), xlim = c(-6, 6))
lines(xx, y, col = "black") # true density

lines(fit$mids, psdensity/fit$nbinwidth, lwd = 2, col = "blue") # P-splines density

# check density against dpsden function
with(fit, lines(xx, dpsden(xx, beta, nbinwidth, design = design.knots),
                lwd = 2, col = "red", lty = 2))

# vertical lines for all knots
with(fit, abline(v = design.knots, col = "red"))

# internal knots
with(fit, abline(v = design.knots[(degree + 2):(length(design.knots) - degree - 1)], col = "blue"))
  
# boundary knots (support of B-splines)
with(fit, abline(v = design.knots[c(degree + 1, length(design.knots) - degree)], col = "green"))

legend("topright", c("True Density","P-spline density","Using dpsdens function"),
  col=c("black", "blue", "red"), lty = c(1, 1, 2))
legend("topleft", c("Internal Knots", "Boundaries", "Extra Knots"),
  col=c("blue", "green", "red"), lty = 1)
## End(Not run)