Patterns in static

Apophenia

Models

Macros

#define apop_model_set_parameters(in,...)
 

Functions

apop_modelapop_model_clear (apop_data *data, apop_model *model)
 
void apop_model_free (apop_model *free_me)
 
apop_modelapop_model_copy (apop_model *in)
 
apop_modelapop_estimate (apop_data *d, apop_model *m)
 
double apop_p (apop_data *d, apop_model *m)
 
double apop_log_likelihood (apop_data *d, apop_model *m)
 
void apop_score (apop_data *d, gsl_vector *out, apop_model *m)
 
apop_modelapop_parameter_model (apop_data *d, apop_model *m)
 
int apop_draw (double *out, gsl_rng *r, apop_model *m)
 
void apop_prep (apop_data *d, apop_model *m)
 
apop_dataapop_predict (apop_data *d, apop_model *m)
 
double apop_cdf (apop_data *d, apop_model *m)
 

Detailed Description

Macro Definition Documentation

#define apop_model_set_parameters (   in,
  ... 
)

Take in an unparameterized apop_model and return a new apop_model with the given parameters. For example, if you need a N(0,1) quickly:

1 apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);

This doesn't take in data, so it won't work with models that take the number of parameters from the data, and it will only set the vector of the model's parameter apop_data set. This is most standard models, so that's not a real problem either. If you have a situation where these options are out, you'll have to do something like apop_model *new = apop_model_copy(in); apop_model_clear(your_data, in); and then set in->parameters using your data.

Parameters
inAn unparameterized model, like apop_normal or apop_poisson.
...The list of parameters.
Returns
A copy of the input model, with parameters set.
Exceptions
out->error=='d'dimension error: you gave me a model with an indeterminate number of parameters. Set .vsize or .msize1 and .msize2 first, then call this fn, or use apop_model *new = apop_model_copy(in); apop_model_clear(your_data, in); and then call this (because apop_model_clear sets the dimension based on your data size).
  • This would have been called apop_model_parametrize, but the OED lists four acceptable spellings for parameterise, so it's not a great candidate for a function name.

Model Documentation

apop_model apop_bernoulli

The Bernoulli model: A single random draw with probability $p$.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Bernoulli distribution

Input format 

The matrix or vector can have any size, and I just count up zeros and non-zeros. The Bernoulli parameter $p$ is the percentage of non-zero values in the matrix. Its variance is $p(1-p)$.

Parameter format 

A vector of length one

Post-estimate 
dataUnchanged.
parameters$p$ is the only element in the vector. A <Covariance> page has the variance of $p$ in the (0,0)th element of the matrix.
infoReports log likelihood.
RNG 

Returns a single zero or one.

Settings 

None.

apop_model apop_beta

The Beta distribution.

The beta distribution has two parameters and is restricted between zero and one. You may also find apop_beta_from_mean_var to be useful.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Beta distribution

Input format 

Any arrangement of scalar values.

Parameter format 

a vector, v[0]= $\alpha$; v[1]= $\beta$

Post-estimate 
dataUnchanged.
parametersSee parameter format.
infoReports log likelihood.
RNG 

Produces a scalar $\in[0,1]$.

apop_model apop_binomial

The multi-draw generalization of the Bernoulli; the two-bin special case of the Multinomial distribution. This differs from the apop_multinomial only in the input data format.

It is implemented as an alias of the apop_multinomial model, except that it has a CDF, .vsize==2 and .dsize==1 (i.e., we know it has two parameters and a draw returns a scalar).

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Input format 

Each row of the matrix is one observation, consisting of two elements. The number of draws of type zero (sometimes read as `misses' or `failures') are in column zero, the number of draws of type one (`hits', `successes') in column one.

Parameter format 

a vector, v[0]= $n$; v[1]= $p_1$. Thus, $p_0$ isn't written down; see apop_multinomial for further discussion. If you input $v[1]>1$ and apop_opts.verbose >=1, the log likelihood function will throw a warning.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

The RNG returns a single number representing the success count, not a vector of length two giving both the failure bin and success bin. This is notable because it differs from the input data format, but it tends to be what people expect from a Binomial RNG. For draws with both dimensions, use a apop_multinomial model with .vsize =2.

Apply a coordinate transformation of the data to produce a distribution over the transformed data space. This is sometimes called a Jacobian transformation.

  • This is still in beta. Expect the interface to change.

Here is an example that replicates the Lognormal distribution.

/* A Lognormal distribution is a transform of the Normal distribution, where
the data space of the Normal is exponentiated. Thus, to get back to the original data space, take the log of the current data.
*/
#include <apop.h>
#define Diff(a, b) assert(fabs((a)-(b)) < 1e-2);
apop_data *draw_exponentiated_normal(double mu, double sigma, double draws){
gsl_rng *r = apop_rng_alloc(13);
for (int i=0; i< draws; i++) apop_draw(gsl_vector_ptr(d->vector,i), r, n01);
apop_vector_exp(d->vector);
return d;
}
// The transformed->base function and its derivative for the Jacobian:
apop_data *rev(apop_data *in){ return apop_map(in, .fn_d=log, .part='a'); }
/*The derivative of the transformed_to_base function. */
double inv(double in){return 1./in;}
double rev_j(apop_data *in){ return fabs(apop_map_sum(in, .fn_d=inv, .part='a')); }
int main(){
.transformed_to_base= rev, .jacobian_to_base=rev_j,
.base_model=apop_normal);
Apop_model_add_group(ct, apop_parts_wanted);//Speed up the MLE.
//make fake data
double mu=2, sigma=1;
apop_data *d = draw_exponentiated_normal(mu, sigma, 2e5);
//If we correctly replicated a Lognormal, mu and sigma will be right:
apop_model *est = apop_estimate(d, ct);
Diff(apop_data_get(est->parameters, 0, -1), mu);
Diff(apop_data_get(est->parameters, 1, -1), sigma);
/*The K-L divergence between our Lognormal and the stock Lognormal
should be small. I try it with both the original params and the estimated ones. */
ln2->parameters = est->parameters;
Diff(apop_kl_divergence(ln, ln2,.draw_ct=1000), 0);
Diff(apop_kl_divergence(ln, est,.draw_ct=1000), 0);
}
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

apop_mle

Input format 

The input data is sent to the first model, so use the input format for that model.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
Settings 

apop_ct_settings

apop_model apop_dcomposition

Use random draws or parameter estimates output from a first model as input data for a second model.

  • This is still in beta. Expect the interface to change.
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Data-composed model

Input format 

The input data is sent to the first model, so use the input format for that model.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
Settings 

apop_composition_settings

apop_model apop_dconstrain

A model that constrains the base model to within some data constraint. E.g., truncate $P(d)$ to zero for all $d$ outside of a given constraint. Generate using apop_model_dconstrain .

  • This is still in beta. Expect the interface to change.

The log likelihood works by using the base_model log likelihood, and then scaling it based on the part of the base model's density that is within the constraint. If you have an easy means of specifying what that density is, please do, as in the example. If you do not, the log likelihood will calculate it by making draw_ct random draws from the base model and checking whether they are in or out of the constraint. Because this default method is stochastic, there is some loss of precision, and conjugate gradient methods may get confused.

Here is an example that makes a few draws and estimations from data-constrained models.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Data-constrained model

Input format 

That of the base model.

Parameter format 

That of the base model. In fact, the parameters element is a pointer to the base model parameters, so both are modified simultaneously.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Draw from the base model; if the draw is outside the constraint, throw it out and try again.

Settings 

apop_dconstrain_settings

Examples 
#include <apop.h>
#ifdef Testing
#define Show_results(m)
#else
#define Show_results(m) apop_model_print(m, NULL);
#endif
//The constraint function.
double over_zero(apop_data *in, apop_model *m){
return apop_data_get(in) > 0;
}
//The optional scaling function.
double in_bounds(apop_model *m){
double z = 0;
gsl_vector_view vv = gsl_vector_view_array(&z, 1);
return 1- apop_cdf(&((apop_data){.vector=&vv.vector}), m);
}
int main(){
/*Set up a Normal distribution, with data truncated to be nonnegative.
This version doesn't use the in_bounds function above, and so the
default scaling function is used.*/
gsl_rng *r = apop_rng_alloc(213);
apop_model *trunc = apop_model_dconstrain(.base_model=apop_model_copy(norm),
.constraint=over_zero, .draw_ct=5e4, .rng=r);
//make draws. Currently, you need to prep the model first.
apop_prep(NULL, trunc);
apop_data *d = apop_model_draws(trunc, 1e5);
//Estimate the parameters given the just-produced data:
apop_model *est = apop_estimate(d, trunc);
Show_results(est);
assert(apop_vector_distance(est->parameters->vector, norm->parameters->vector)<1e-1);
//Generate a data set that is truncated at zero using alternate means
for (int i=0; i< normald->matrix->size1; i++){
double *d = apop_data_ptr(normald, i);
if (*d < 0) *d *= -1;
}
//this time, use an unparameterized model, and the in_bounds fn
apop_model *re_trunc = apop_model_dconstrain(.base_model=apop_normal,
.constraint=over_zero, .scaling=in_bounds);
apop_model *re_est = apop_estimate(normald, re_trunc);
Show_results(re_est)
assert(apop_vector_distance(re_est->parameters->vector, apop_vector_fill(gsl_vector_alloc(2), 0, 1))<1e-1);
}
apop_model apop_dirichlet

A multivariate generalization of the Beta distribution.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Dirichlet distribution

Input format 

Each row of your data matrix is a single observation.

Parameter format 

The estimated parameters are in the output model's parameters->vector. The size of the model is determined by the width of your input data set, so later RNG draws, &c will match in size.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

A call to gsl_ran_dirichlet.

apop_model apop_exponential

The Exponential distribution.

$Z(\mu,k) = \sum_k 1/\mu e^{-k/\mu} $
$ln Z(\mu,k) = \sum_k -\ln(\mu) - k/\mu $
$dln Z(\mu,k)/d\mu = \sum_k -1/\mu + k/(\mu^2) $

Some write the function as: $Z(C,k) dx = \ln C C^{-k}. $ If you prefer this form, just convert your parameter via $\mu = {1\over \ln C}$ (and convert back from the parameters this function gives you via $C=\exp(1/\mu)$).

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Exponential distribution

Input format 

Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.

Parameter format 

$\mu$ is in the zeroth element of the vector.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
infoReports log likelihood.
RNG 

Just a wrapper for gsl_ran_exponential.

CDF 

Produces a single number.

apop_model apop_gamma

The Gamma distribution

$G(x, a, b) = {1\over (\Gamma(a) b^a)} x^{a-1} e^{-x/b}$

$ln G(x, a, b)= -ln \Gamma(a) - a ln b + (a-1)ln(x) + -x/b$

$d ln G/ da = -\psi(a) - ln b + ln(x) $ (also, $d ln \gamma = \psi$)

$d ln G/ db = -a/b + x/(b^2) $

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Gamma distribution

Input format 

Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.

Parameter format 

First two elements of the vector.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Just a wrapper for gsl_ran_gamma.

See the notes for apop_exponential on a popular alternate form.

The improper uniform returns $P(x) = 1$ for every value of x, all the time (and thus, log likelihood(x)=0). It has zero parameters. It is useful, for example, as an input to Bayesian updating, to represent a fully neutral prior.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Improper uniform distribution

Input format 

Ignored.

Parameter format 

NULL

Post-estimate 
dataUnchanged.
parametersNULL
RNG 

The draw function makes no sense, and therefore sets the value in *out to NAN, returns 1, and prints a warning if apop_opts.verbose >=1.

CDF 

Half of the distribution is less than every given point, so the CDF always returns 0.5. One could perhaps make an argument that this should really be infinity, but a half is more in the spirit of the distribution's use to represent a lack of information.

apop_model apop_iv

Instrumental variable regression

Operates much like the apop_ols model, but the input parameters also need to have a table of substitutions (like the addition of the .instruments setting in the example below). The vector element of the table lists the column numbers to be substituted (the dependent var is zero; first independent col is one), and then one column for each item to substitute.

  • If the vector of your apop_data set is NULL, then I will use the row names to find the columns to substitute. This is generally more robust and/or convenient.
  • If the instruments data set is somehow NULL or empty, I'll just run OLS.
  • Don't forget that the apop_lm_settings group has a destroy_data setting. If you set that to 'y', I will overwrite the column in place, saving the trouble of copying the entire data set.
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

instrumental variables

Input format 

See apop_ols; see Data prep rules.

Prep_routine 

Focuses on the data shunting.

Parameter format 

As per apop_ols

Post-estimate 
dataUnchanged.
parametersSee parameter format.
Examples 
/* Instrumental variables are often used to deal with variables measured with noise, so
this example produces a data set with a column of noisy data, and a separate instrument
measured with greater precision, then sets up and runs an instrumental variable regression.
To guarantee that the base data set has noise and the instrument is cleaner, the
procedure first generates the clean data set, then copies the first column to the
instrument set, then the \c add_noise function inserts Gaussian noise into the base
data set. Once the base set and the instrument set have been generated, the setup for
the IV consists of adding the relevant names and using \ref Apop_model_add_group to add a
\c lm (linear model) settings group with a <tt>.instrument= instrument_data</tt> element.
In fact, the example sets up a sequence of IV regressions, with more noise each
time. This sample is part of Apophenia's test suite, and so checks that the coefficients
are correct along the way.
*/
#include <apop.h>
#define Diff(L, R, eps) Apop_stopif(fabs((L)-(R)>=(eps)), return, 0, "%g is too different from %g (abitrary limit=%g).", (double)(L), (double)(R), eps);
int datalen =1e4;
//generate a vector that is the original vector + noise
void add_noise(gsl_vector *in, gsl_rng *r, double size){
for (int i=0; i< in->size; i++){
double noise;
apop_draw(&noise, r, nnoise);
*gsl_vector_ptr(in, i) += noise;
}
apop_model_free(nnoise);
}
void test_for_unbiased_parameter_estimates(apop_model *m, double tolerance){
Diff(apop_data_get(m->parameters, .row=0,.col=-1), -1.4, tolerance);
Diff(apop_data_get(m->parameters, .row=1,.col=-1), 2.3, tolerance);
}
int main(){
gsl_rng *r = apop_rng_alloc(234);
apop_data *data = apop_data_alloc(datalen, 2);
for(int i=0; i< datalen; i++){
apop_data_set(data, i, 1, 100*(gsl_rng_uniform(r)-0.5));
apop_data_set(data, i, 0, -1.4 + apop_data_get(data,i,1)*2.3);
}
apop_name_add(data->names, "dependent", 'c');
apop_name_add(data->names, "independent", 'c');
#ifndef Testing
apop_model_show(oest);
#endif
//the data with no noise will be the instrument.
gsl_vector *col1 = Apop_cv(data, 1);
apop_data *instrument_data = apop_data_alloc(data->matrix->size1, 1);
gsl_vector_memcpy(Apop_cv(instrument_data, 0), col1);
apop_name_add(instrument_data->names, "independent", 'c');
Apop_model_add_group(apop_iv, apop_lm, .instruments = instrument_data);
//Now add noise to the base data four times, and estimate four IVs.
int tries = 4;
apop_model *ests[tries];
for (int nscale=0; nscale<tries; nscale++){
add_noise(col1, r, nscale==0 ? 0 : pow(10, nscale-tries));
ests[nscale] = apop_estimate(data, apop_iv);
#ifndef Testing
if (nscale==tries-1){ //print the one with the largest error.
printf("\nnow IV:\n");
apop_model_show(ests[nscale]);
}
#endif
}
/* Now test. The parameter estimates are unbiased.
As we add more noise, the covariances expand.
Test that the ratio of one covariance matrix to the next
is less than one, though these are typically very much
smaller than one (as the noise is an order of magnitude
larger in each case), and the ratios will be identical
for each j, k below. */
test_for_unbiased_parameter_estimates(ests[0], 1e-6);
for (int i=1; i<tries; i++){
test_for_unbiased_parameter_estimates(ests[i], 1e-3);
gsl_matrix *cov = apop_data_get_page(ests[i-1]->parameters, "<Covariance>")->matrix;
gsl_matrix *cov2 = apop_data_get_page(ests[i]->parameters, "<Covariance>")->matrix;
gsl_matrix_div_elements(cov, cov2);
for (int j =0; j< 2; j++)
for (int k =0; k< 2; k++)
assert(gsl_matrix_get(cov, j, k) < 1);
}
}
apop_model apop_kernel_density

The kernel density smoothing of a PMF or histogram.

At each point along the histogram, put a distribution (default: Normal(0,1)) on top of the point. Sum all of these distributions to form the output distribution.

Elements of apop_kernel_density_settings that you may want to set:

  • data a data set, which, if not NULL and !base_pmf , will be converted to an apop_pmf model.
  • base_pmf This is the preferred format for input data. It is the histogram to be smoothed.
  • kernelbase The kernel to use for smoothing, with all parameters set and a p method. Popular favorites are apop_normal and apop_uniform.
  • set_params A function that takes in a single number and the model, and sets the parameters accordingly. The function will call this for every point in the data set. Here is the default, which is used if this is NULL. It simply sets the first element of the model's parameter vector to the input number; this is appropriate for a Normal distribution, where we want to center the distribution on each data point in turn.
1 static void apop_set_first_param(apop_data *in, apop_model *m){
2  m->parameters->vector->data[0] = apop_data_get(in);
3 }

For a Uniform[0,1] recentered around the first element of the PMF matrix, you could put this function in your code:

1 void set_midpoint(apop_data * in, apop_model *m){
2  apop_data_set(m->parameters, 0, -1, apop_data_get(in)+0.5);
3  apop_data_set(m->parameters, 1, -1, apop_data_get(in)-0.5);
4 }
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

kernel density estimate

Input format 

I'll estimate a apop_pmf internally, so I follow that format, which is one observation (of any format) per line.

Parameter format 

None

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Randomly selects a data point, then randomly draws from that sub-distribution. Returns 0 on success, 1 if unable to pick a sub-distribution (meaning the weights over the distributions are somehow broken), and 2 if unable to draw from the sub-distribution.

CDF 

Sums the CDF to the given point of all the sub-distributions.

Settings 

apop_kernel_density_settings

Examples 

This example sets up and uses KDEs based on a Normal and a Uniform distribution.

/* This program draws ten random data points, and then produces two kernel density
estimates: one based on the Normal distribution and one based on the Uniform.
It produces three outputs:
--stderr shows the random draws
--kerneldata is a file written with plot data for both KDEs
--stdout shows instructions to gnuplot, so you can pipe:
./kernel | gnuplot -persist
Most of the code is taken up by the plot() and draw_some_data() functions, which are
straightforward. Notice how plot() pulls the values of the probability distributions
at each point along the scale.
The set_uniform_edges function sets the max and min of a Uniform distribution so that the
given point is at the center of the distribution.
The first KDE uses the defaults, which are based on a Normal distribution with std dev 1;
the second explicitly sets the .kernel and .set_fn for a Uniform.
*/
#include <apop.h>
void set_uniform_edges(apop_data * r, apop_model *unif){
apop_data_set(unif->parameters, 0, -1, r->matrix->data[0]-0.5);
apop_data_set(unif->parameters, 1, -1, r->matrix->data[0]+0.5);
}
void plot(apop_model *k, apop_model *k2){
apop_data *onept = apop_data_alloc(0,1,1);
FILE *outtab = fopen("kerneldata", "w");
for (double i=0; i<20; i+=0.01){
apop_data_set(onept,0,0, i);
fprintf(outtab, "%g %g %g\n", i, apop_p(onept, k), apop_p(onept, k2));
}
fclose(outtab);
printf("plot 'kerneldata' using 1:2\n"
"replot 'kerneldata' using 1:3\n");
}
apop_data *draw_some_data(){
apop_data *d = apop_model_draws(uniform_0_20, 10);
apop_data_print(apop_data_sort(d), .output_pipe=stderr);
return d;
}
int main(){
apop_data *d = draw_some_data();
apop_kernel_density, .base_data=d,
.set_fn = set_uniform_edges,
.kernel = apop_uniform);
plot(k, k2);
}
apop_model apop_loess

Regression via loess smoothing

This uses a somewhat black-box routine, first written by Chamberlain, Devlin, Grosse, and Shyu in 1988, to fit a smoothed series of quadratic curves to the input data, thus producing a curve more closely fitting than a simple regression would.

The curve is basically impossible to describe using a short list of parameters, so the representation is in the form of the predicted vector of the expected data set; see below.

From the 1992 manual for the package: ``The method we will use to fit local regression models is called {loess}, which is short for local regression, and was chosen as the name since a loess is a deposit of fine clay or silt along a river valley, and thus is a surface of sorts. The word comes from the German löss, and is pronounced löíss.''

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Loess smoothing

Input format 

The data is basically OLS-like: the first column of the data is the dependent variable to be explained; subsequent variables are the independent explanatory variables. Thus, your input data can either have a dependent vector plus explanatory matrix, or a matrix where the first column is the dependent variable.

Unlike with OLS, I won't move your original data, and I won't add a 1, because that's not really the loess custom. You can of course set up your data that way if you like.

If your data set has a weights vector, I'll use it.

In any case, all data is copied into the model's apop_loess_settings. The code is primarily FORTRAN code from 1988 converted to C; the data thus has to be converted into a relatively obsolete internal format.

Parameter format 

The parameter vector is unused.

Post-estimate 
dataUnchanged.
parametersNone.
settings

The apop_loess_settings is filled with results (and internal processing cruft). The out_model->info data set has a table giving the actual, predicted, and residual columns, which is probably what you were looking for. Try:

1 apop_data_show(apop_data_get_page(output_model->info, "<Predicted>"));
Predict 

Fills in the zeroth column (ignoring and overwriting any data there), and at the data's ->more pointer, adds an apop_data set named "Confidence" (i.e.,

1 !strcmp(outdata->more->names->title, "Confidence") == 1.

This routine is in beta testing.

apop_model apop_logit

The Logit model.

Apophenia makes no distinction between the bivariate logit and the multinomial logit. This does both.

The likelihood of choosing item $j$ is: $e^{x\beta_j}/ (\sum_i{e^{x\beta_i}})$

so the log likelihood is $x\beta_j - ln(\sum_i{e^{x\beta_i}})$

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Logit

Input format 

The first column of the data matrix this model expects is zeros, ones, ..., enumerating the factors; to get there, try apop_data_to_factors; if you forget to run it, I'll run it on the first data column for you. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] $\times$[(option count)-1] parameters. Column names are options; row names are input variables.

Prep_routine 

You will probably want to convert some column of your data into factors, via apop_data_to_factors. If you do, then that adds a page of factors to your data set (and of course adjusts the data itself). If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the first column (the vector if there is one, else the first column of the matrix.)

Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS.

Parameter format 

As above.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Much like the apop_ols RNG, qv. Returns the category drawn.

  • PS: Here is a nice trick used in the implementation. let $y_i = x\beta_i$. Then

    \[ln(\sum_i{e^{x\beta_i}}) = max(y_i) + ln(\sum_i{e^{y_i - max(y_i)}}).\]

The elements of the sum are all now exp(something negative), so overflow won't happen, and if there's underflow, then that term must not have been very important. [This trick is attributed to Tom Minka, who implemented it in his Lightspeed Matlab toolkit.]

Here is an artifical example:

#include <apop.h>
#include <unistd.h>
char *testfile = "logit_test_data";
//generate a fake data set.
//Notice how the first column is the outcome, just as with standard regression.
void write_data(){
FILE *f = fopen(testfile, "w");
fprintf(f, "\
outcome,A, B \n\
0, 0, 0 \n\
1, 1, 1 \n\
1, .7, .5 \n\
1, .7, .3 \n\
1, .3, .7 \n\
\n\
1, .5, .5 \n\
0, .4, .4 \n\
0, .3, .4 \n\
1, .1, .3 \n\
1, .3, .1 ");
fclose(f);
}
int main(){
write_data();
apop_data *d = apop_text_to_data(testfile);
Apop_model_add_group(apop_logit, apop_mle, .tolerance=1e-5);
unlink(testfile);
/* Apophenia's test suite checks that this code produces
values close to canned values. As a human, you probably
just want to print the results to the screen. */
#ifndef Testing
apop_model_show(est);
#else
assert(fabs(apop_data_get(est->parameters, .rowname="1")- -1.155026) < 1e-6);
assert(fabs(apop_data_get(est->parameters, .rowname="A")- 4.039903) < 1e-6);
assert(fabs(apop_data_get(est->parameters, .rowname="B")- 1.494694) < 1e-6);
#endif
}
apop_model apop_lognormal

The Lognormal distribution

The log likelihood function for the lognormal distribution:

$f = exp(-(ln(x)-\mu)^2/(2\sigma^2))/ (x\sigma\sqrt{2\pi})$ $ln f = -(ln(x)-\mu)^2/(2\sigma^2) - ln(x) - ln(\sigma\sqrt{2\pi})$

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Lognormal distribution

Input format 

I use the all elements of the matrix and vector, without regard to their order.

Parameter format 

Zeroth vector element is the mean (after logging); first is the std dev (after logging)

Post-estimate 
dataUnchanged.
parametersSee parameter format.
infoReports log likelihood.
RNG 

An Apophenia wrapper for the GSL's Normal RNG, exp'ed.

apop_model apop_mixture

The mixture model transformation: a linear combination of multiple models.

Generated via apop_model_mixture.

Note that a kernel density is a mixture of a large number of homogeneous models, where each is typically centered around a point in your data. For such situations, apop_kernel_density will be easier to use.

Use apop_model_mixture to produce one of these models. In the examples below, some are generated from unparameterized input models with a form like

1 apop_model *mf = apop_model_mixture(apop_model_copy(apop_normal), apop_model_copy(apop_normal));
2 Apop_model_add_group(mf, apop_mle, .starting_pt=(double[]){50, 5, 80, 5},
3  .step_size=3, .tolerance=1e-6);
4 apop_model_show(apop_estimate(dd, mf));

One can also skip the estimation and use already-parameterized models as input to apop_model_mixture, e.g.:

1 apop_model *r_ed = apop_model_mixture(apop_model_set_parameters(apop_normal, 54.6, 5.87),
2  apop_model_set_parameters(apop_normal, 80.1, 5.87));
3 apop_data *wts = apop_data_falloc((2), 0.36, 0.64);
4 Apop_settings_add(r_ed, apop_mixture, weights, wts->vector);
5 printf("LL=%g\n", apop_log_likelihood(dd, r_ed));

Notice that the weights vector has to be added after the model is set up. If none is given, then equal weights are assigned to all components of the mixture.

One can think of the estimation as a missing-data problem: each data point originated in one distribution or the other, and if we knew with certainty which data point came from which distribution, then the estimation problem would be trivial: just generate the subsets and call apop_estimate(dataset1, model1), ..., apop_estimate(datasetn, modeln) separately. But the assignment of which element goes where is unknown information, which we guess at using an E-M algorithm. The standard algorithm starts with an initial set of parameters for the models, and assigns each data point to its most likely model. It then re-estimates the model parameters using their subsets. The standard algorithm, see e.g. this PDF, repeats until it arrives at an optimum.

Thus, the log likelihood method for this model includes a step that allocates each data point to its most likely model, and calculates the log likelihood of each observation using its most likely model. [It would be a valuable extension to extend this to not-conditionally IID models. Commit 1ac0dd44 in the repository had some notes on this, now removed.] As a side-effect, it calculates the odds of drawing from each model (the vector λ). Following the above-linked paper, the probability for a given observation under the mixture model is its probability under the most likely model weighted by the previously calculated λ for the given model.

Apohenia modifies this routine slightly because it uses the same maximum likelihood back-end that most other apop_models use for estimation. The ML search algorithm provides model parameters, then the LL method allocates observations and reports a LL to the search algorithm, then the search algorithm uses its usual rules to step to the next candidate set of parameters. This provides slightly more flexibility in the search.

  • Estimations of mixture distributions can be sensitive to initial conditions. You are encouraged to try a sequence random starting points for your model parameters. Some authors recommend plotting the data and eyeballing a guess as to the model parameters.

Determining to which parts of a mixture to assign a data point is a well-known hard problem, which is often not solvable–that information is basically lost.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Mixture of models

Input format 

The same data gets sent to each of the component models of the mixture. Each row is an observation, and the estimation routine assumes that models are conditionally IID (i.e., having chosen what component of the mixture the observation comes from, its likelihood can be calculated independently of all other observations).

Parameter format 

The parameters are broken out in a readable form in the settings group, so your best bet is to use those. See the sample code for usage.
The parameter element is a single vector piling up all elements, beginning with the first $n-1$ weights, followed by an apop_data_pack of each model's parameters in sequence. Because all elements are in a single vector, one could run a maximum likelihood search for all components (including the weights) at once. Fortunately for parsing, the log_likehood, estimate, and other methods unpack this vector into its component parts for you.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Uses the weights to select a component model, then makes a draw from that component. The model's dsize (draw size) element is set when you set up the model in the model's prep method (automatically called by apop_estimate, or call it directly) iff all component models have the same dsize.

Settings 

apop_mixture_settings

Examples 

The first example uses a text file faith.data, in the tests directory of the distribution.

#include <apop.h>
/* This replacement for apop_model_show(in) demonstrates retrieval of the useful
settings: the weights (λ) and list of estimated models. */
void show_mix(apop_model *in){
printf("The weights:\n");
printf("\nThe models:\n");
for (apop_model **m = ms->model_list; *m; m++) //model_list is a NULL-terminated list.
apop_model_print(*m, stdout);
}
int main(){
apop_text_to_db("faith.data", "ff");
apop_data *dd = apop_query_to_data("select waiting from ff");
/* The process is famously sensitive to starting points. Try many random points, or
eyeball the distribution's plot and guess at the starting values. */
Apop_model_add_group(mf, apop_mle, .starting_pt=(double[]){50, 5, 80, 5},
.step_size=3, .tolerance=1e-6);
apop_model *mfe = apop_estimate(dd, mf);
apop_model_print(mfe, stdout);
printf("LL=%g\n", apop_log_likelihood(dd, mfe));
printf("\n\nValues calculated in the source paper, for comparison.\n");
apop_model_set_parameters(apop_normal, 54.61364, 5.869089),
apop_model_set_parameters(apop_normal, 80.09031, 5.869089));
apop_data *wts = apop_data_falloc((2), 0.3608498, 0.6391502);
Apop_settings_add(r_ed, apop_mixture, weights, wts->vector);
show_mix(r_ed);
printf("LL=%g\n", apop_log_likelihood(dd, r_ed));
}
#include <apop.h>
/*
Use \ref apop_model_mixture to generate a hump-filled distribution, then find
the most likely data points and check that they are near the humps.
Part of Apophenia's test suite.
*/
//Produce a 2-D multivariate normal model with unit covariance and given mean
apop_model *produce_fixed_mvn(double x, double y){
out->parameters = apop_data_falloc((2, 2, 2),
x, 1, 0,
y, 0, 1);
out->dsize = 2;
return out;
}
int main(){
//here's a mean/covariance matrix for a standard multivariate normal.
produce_fixed_mvn(5, 6),
produce_fixed_mvn(-5, -4),
produce_fixed_mvn(0, 1));
apop_prep(NULL, many_humps);
int len = 100000;
apop_data *d = apop_model_draws(many_humps, len);
gsl_vector *first = Apop_cv(d, 0);
#ifndef Testing
printf("mu=%g\n", apop_mean(first));
#endif
assert(fabs(apop_mean(first)- 0) < 5e-2);
gsl_vector *second = Apop_cv(d, 1);
#ifndef Testing
printf("mu=%g\n", apop_mean(second));
#endif
assert(fabs(apop_mean(second)- 1) < 5e-2);
/* Abuse the ML imputation routine to search for the input value with the highest
log likelihood. Do the search via simulated annealing. */
gsl_matrix_set_all(x->matrix, NAN);
Apop_settings_add_group(many_humps, apop_mle, .n_tries=20, .iters_fixed_T=10, .k=3, .method="annealing");
apop_ml_impute(x, many_humps);
#ifndef Testing
printf("Optimum found at:\n");
#endif
assert(fabs(apop_data_get(x, .col=0)- 0) + fabs(apop_data_get(x, .col=1) - 1) < 1e-2);
}
apop_model apop_multinomial

The $n$–option generalization of the Binomial distribution.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Binomial distribution

Input format 

Each row of the matrix is one observation: a set of draws from a single bin. The number of draws of type zero are in column zero, the number of draws of type one in column one, et cetera.

  • You may have a set of several Bernoulli-type draws, which could be summed together to form a single Binomial draw. See apop_data_to_dummies to do the aggregation (using the .keep_first='y' option).
Parameter format 

The parameters are kept in the vector element of the apop_model parameters element. parameters->vector->data[0]==n; parameters->vector->data[1...]==p_1....

The numeraire is bin zero, meaning that $p_0$ is not explicitly listed, but is $p_0=1-\sum_{i=1}^{k-1} p_i$, where $k$ is the number of bins. Conveniently enough, the zeroth element of the parameters vector holds $n$, and so a full probability vector can easily be produced by overwriting that first element. Continuing the above example:

1 int n = apop_data_get(estimated->parameters, 0, -1);
2 apop_data_set(estimated->parameters, 0, 1 - (apop_sum(estimated->parameters)-n));

And now the parameter vector is a proper list of probabilities.

  • Because an observation is a single row, the number of bins, $k$ is set to equal the length of the first row (counting both vector and matrix elements, as appropriate). The covariance matrix will be $k \times k$.
  • Each row should sum to $N$, the number of draws. The estimation routine doesn't check this, but instead uses the average sum across all rows.
Post-estimate 
dataUnchanged.
parametersAs per the parameter format. Has a <Covariance> page with the covariance matrix for the $p$s ( $n$ effectively has no variance).
infoReports log likelihood.
RNG 

Returns a single vector of length $k$, the result of an imaginary tossing of $N$ balls into $k$ urns, with the given probabilities.

The Multinomial Probit model.

Deprecated:
Use apop_probit, which handles multiple options.
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Post-estimate 
dataUnchanged.
parametersSee parameter format.

This is the multivariate generalization of the Normal distribution.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Multivariate normal distribution

Input format 

Each row of the matrix is an observation.

Parameter format 

An apop_data set whose vector element is the vector of means, and whose matrix is the covariances.

If you had only one dimension, the mean would be a vector of size one, and the covariance matrix a $1\times 1$ matrix. This differs from the setup for apop_normal, which outputs a single vector with $\mu$ in element zero and $\sigma$ in element one.

Post-estimate 
dataUnchanged.
parameters

Format as above. The <Covariance> page gives the covariance matrix of the means.

infoReports log likelihood.
RNG 

The RNG fills an input array whose length is based on the input parameters.

The nice, easy method from Devroye, p 565

Settings 

None.

apop_model apop_normal

The Normal (Gaussian) distribution

You know it, it's your attractor in the limit, it's the Gaussian distribution.

$N(\mu,\sigma^2) = {1 \over \sqrt{2 \pi \sigma^2}} \exp (-x^2 / 2\sigma^2)$

$\ln N(\mu,\sigma^2) = (-(x-\mu)^2 / 2\sigma^2) - \ln (2 \pi \sigma^2)/2 $

$d\ln N(\mu,\sigma^2)/d\mu = (x-\mu) / \sigma^2 $

$d\ln N(\mu,\sigma^2)/d\sigma^2 = ((x-\mu)^2 / 2(\sigma^2)^2) - 1/2\sigma^2 $

See also the apop_multivariate_normal.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Normal distribution

Input format 

I use the elements of the matrix and vector, without regard to their order or position.

Parameter format 

As is custom, parameter zero (in the vector) is the mean, parmeter one is the standard deviation (i.e., the square root of the variance).

Post-estimate 
dataUnchanged.
parametersZeroth vector element is $\mu$, element 1 is $\sigma$. A page is added named <Covariance> with the 2 $\times$ 2 covariance matrix for these two parameters
infoReports the log likelihood.
Predict 

Returns the expected value. The ->more element holds a apop_data set with the title <Covariance>, whose matrix holds the covariance of the mean.

RNG 

An apophenia wrapper for the GSL's Normal RNG.

This one asks explicitly for a mean, and the GSL assumes zero and lets you add the mean yourself.

Settings 

None.

apop_model apop_ols

Ordinary least squares. Weighted least squares is also handled by this model. You can also use it for a lot of not-entirely linear models based on the form $Y = f(x_1) + f(x_2) + ... + \epsilon$.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Ordinary Least Squares

Input format 

See Data prep rules.

Prep_routine 

Focuses on the data shunting.

Parameter format 

A vector of OLS coefficients. coeff. zero refers to the constant column, if any.

Post-estimate 
data

You can specify whether the data is modified with an apop_lm_settings group. If so, see Data prep rules for details. Else, left unchanged.

parameters

The parameters set will hold the coefficients; the first coefficient will be the coefficient on the constant term, and the remaining will correspond to the independent variables. It will therefore be of size (data->size2).

I add a page named <Covariance>, which gives the covariance matrix for the estimated parameters (not the data itself).

parameter modelsFor the mean, a noncentral $t$ distribution (apop_t_distribution).
info

Reports log likelihood, and runs apop_estimate_coefficient_of_determination to add $R^2$-type information (SSE, SSR, &c) to the info page.

Residuals: I add a page named <Predicted>, with three columns. If this is a model with a single dependent and lots of independent vars, then the first column is the actual data. Let our model be $ Y = \beta X + \epsilon$. Then the second column is the predicted values: $\beta X$, and the third column is the residuals: $\epsilon$. The third column is therefore always the first minus the second, and this is probably how that column was calculated internally.

Given your estimate est, the zeroth element is one of
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="observed"),
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="predicted") or
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="residual").

RNG 

Linear models are typically only partially defined probability models. For OLS, we know that $P(Y|X\beta) \sim {\cal N}(X\beta, \sigma)$, because this is an assumption about the error process, but we don't know much of anything about the distribution of $X$.

The apop_lm_settings group includes an apop_model* element named input_distribution. This is the distribution of the independent/predictor/X columns of the data set.

The default is that input_distribution = apop_improper_uniform , meaning that $P(X)=1$ for all $X$. So $P(Y, X) = P(Y|X)P(X) = P(Y|X)$. This seems to be how many people use linear models: the $X$ values are taken as certain (as with actually observed data) and the only question is the odds of the dependent variable. If that's what you're looking for, just leave the default. This is sufficient for getting log likelihoods under the typical assumption that the observed data has probability one.

But you can't draw from an improper uniform. So if you draw from a linear model with a default input_distribution, then you'll get an error.

Alternatively, you may know something about the distribution of the input data. At the least, you could generate a PMF from the actual data:

1 apop_settings_set(your_model, apop_lm, input_distribution, apop_estimate(inset, apop_pmf));

Now, random draws are taken from the input data, and the dependent variable value calculated via $X\beta+\epsilon$, where $X$ is the drawn value, $\beta$ the previously-estimated parameters and $\epsilon$ is a Normally-distributed random draw. Or change the PMF to any other appropriate distribution, such as a apop_multivariate_normal, or an apop_pmf filled in with more data, or perhaps something from http://en.wikipedia.org/wiki/Errors-in-variables_models , as desired.

Examples 

First, you will need a file named data in comma-separated form. The first column is the dependent variable; the remaining columns are the independent. For example:

1 Y, X_1, X_2, X_3
2 2,3,4,5
3 1,2,9,3
4 4,7,9,0
5 2,4,8,16
6 1,4,2,9
7 9,8,7,6

The program:

#include <apop.h>
int main(){
apop_text_to_db(.text_file="data", .tabname="d");
apop_data *data = apop_query_to_data("select * from d");
apop_model_show(est);
}

If you saved this code to sample.c, then you can compile it with

1 gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me

and then run it with ./run_me. Alternatively, you may prefer to compile the program using a Makefile .

Feeling lazy? The program above was good form and demonstrated useful features, but the code below will do the same thing in two lines:

1 #include <apop.h>
2 int main(){ apop_model_show(apop_estimate(apop_text_to_data("data"), apop_ols)); }
apop_model apop_pmf

A probability mass function is commonly known as a histogram, or still more commonly, a bar chart. It indicates that at a given coordinate, there is a given mass.

Each row of the PMF's data set holds the coordinates, and the weights vector holds the mass at the given point. This is in contrast to the crosstab format, where the location is simply given by the position of the data point in the grid.

For example, here is a typical crosstab:

col 0col 1col 2
row 0 08.13.2
row 1 002.2
row 2 07.31.2

Here it is as a sparse listing:

value dimension 1dimension 2
8.1 0 1
3.2 0 2
2.2 1 2
7.3 2 1
1.2 2 2

The apop_pmf internally represents data in this manner, with the dimensions in the matrix, vector, and text element of the data set, and the cell values are held in the weights element (not the vector).

If your data is in a crosstab (with entries in the matrix element for 2-D data or the vector for 1-D data), then use apop_crosstab_to_db to make the conversion. See also this page for another crosstab-to-PMF function as well.

If your data is already in the sparse listing format (which is probably the case for 3- or more dimensional data), then just point the model to your parameter set:

1 apop_model *my_pmf = apop_model_copy(apop_pmf);
2 my_pmf->data = in_data;
3 //or equivalently:
4 apop_model *my_pmf = apop_estimate(in_data, apop_pmf);
  • If the weights element is NULL, then I assume that all rows of the data set are equally probable.
  • If the weights are present but sum to a not-finite value, the model's error element is set to 'w' when the estimation is run, and a warning printed.
Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

PDF or sparse matrix

Input format 

As above, you can input to the estimate routine a 2-D matrix that will be converted into this form.

Parameter format 

None. The list of observations and their weights are in the data set, not the parameters.

Post-estimate 
dataThe data you sent in is linked to (not copied).
parametersStill NULL.
RNG 

Return the data in a random row of the PMF's data set. If there is a weights vector, i will use that to make draws; else all rows are equiprobable.

  • If you set draw_index to 'y', e.g.,
1 Apop_settings_add(your_model, apop_pmf, draw_index, 'y');

then I will return the row number of the draw, not the data in that row.

  • The first time you draw from a PMF with uneven weights, I will generate a vector tallying the cumulative mass. Subsequent draws will have no computational overhead. Because the vector is built using the data on the first call to this or the cdf method, do not rearrange or modify the data after the first call. I.e., if you choose to use apop_data_sort or apop_data_sort on your data, do it before the first draw or CDF calculation.
Exceptions
m->error='f'There is zero or NaN density in the CMF. I set the model's error element to 'f' and set out=NAN.
m->error='a'Allocation error. I set the model's error element to 'a' and set out=NAN. Maybe try apop_data_pmf_compress first?
CDF 

Assuming the data is sorted in a meaningful manner, find the total mass up to a given data point.

That is, a CDF only makes sense if the data space is totally ordered. The sorting you define using apop_data_sort defines that ordering.

  • The input data should have the same number of columns as the data set used to construct the PMF. I use only the first row.
  • If the observation is not found in the data, return zero.
  • The first time you get a CDF from from a data set with uneven weights, I will generate a vector tallying the cumulative mass. Subsequent draws will have no computational overhead. Because the vector is built using the data on the first call to this or the cdf method, do not rearrange or modify the data after the first call. I.e., if you choose to use apop_data_sort or apop_data_sort on your data, do it before the first draw or CDF calculation.
Settings 

apop_pmf_settings

apop_model apop_poisson

The Poisson distribution.

$p(k) = {\mu^k \over k!} \exp(-\mu), $

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Poisson distribution

Input format 

Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.

Parameter format 

One parameter, the zeroth element of the vector.

Post-estimate 
dataUnchanged.
parameters

Unless you decline it by adding the apop_parts_wanted_settings group, I will also give you the variance of the parameter, via bootstrap, stored in a page named <Covariance>.

infoReports log likelihood.
RNG 

Just a wrapper for gsl_ran_poisson.

apop_model apop_probit

The Probit model.

Apophenia makes no distinction between the bivariate probit and the multinomial probit. This one does both.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Probit

Input format 

The first column of the data matrix this model expects is zeros, ones, ..., enumerating the factors; to get there, try apop_data_to_factors; if you forget to run it, I'll run it on the first data column for you. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] $\times$[(option count)-1] parameters. Column names are options; row names are input variables.

Prep_routine 

You will probably want to convert some column of your data into factors, via apop_data_to_factors. If you do, then that adds a page of factors to your data set (and of course adjusts the data itself). If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the first column (the vector if there is one, else the first column of the matrix.)

Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS.

Parameter format 

As above

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

See apop_ols; this one is similar but produces a category number instead of OLS's continuous draw.

apop_model apop_stack

A stack of models.

For the case when you need to bundle two uncorrelated models into one larger model. For example, the prior for a multivariate normal (whose parameters are a vector of means and a covariance matrix) is a Multivariate Normal-Wishart pair.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Only one model input; returning a copy of that model.

Input format 

There are two means of handling the input format. If the settings group attached to the data set has a non-NULL splitpage element, then append the second data set as an additional page to the first data set, and name the second set with the name you listed in splitpage; see the example.

If splitpage is NULL, then I will send the same data set to both models.

Parameter format 

currently NULL; check the sub-models for their parameters.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
Settings 

apop_stack_settings

apop_model apop_t_distribution

The t distribution, primarily for descriptive purposes.

If you want to test a hypothesis, you probably don't need this, and should instead use apop_test. See notes in t-, chi-squared, F-, Wishart distributions.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

t distribution

Input format 

Unordered list of scalars in the matrix and/or vector.

Parameter format 

vector->data[0] = mu
vector->data[1] = sigma
vector->data[2] = df

Post-estimate 
dataUnchanged.
parametersSee parameter format.
apop_model apop_uniform

This is the two-parameter version of the Uniform, expressing a uniform distribution over [a, b].

The MLE of this distribution is simply a = min(your data); b = max(your data). Primarily useful for the RNG, such as when you have a Uniform prior model.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Uniform distribution

Input format 

An unordered set of numbers in the data set's vector, matrix, or both.

Parameter format 

Zeroth vector element is $a$, the min; element one is $b$, the max.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
infoReports log likelihood.
apop_model apop_wishart

The Wishart distribution, which is currently somewhat untested.

Here's the likelihood function. $p$ is the dimension of the data and covariance matrix, $n$ is the degrees of freedom, $\mathbf{V}$ is the $p\times p$ matrix of Wishart parameters, and ${\mathbf{W}}$ is the $p\times p$ matrix whose likelihood is being evaluated. $\Gamma_p(\cdot)$ is the multivariate gamma function.

\[ P(\mathbf{W}, \mathbf{V}) = \frac{\left|\mathbf{W}\right|^\frac{n-p-1}{2}} {2^\frac{np}{2}\left|{\mathbf V}\right|^\frac{n}{2}\Gamma_p(\frac{n}{2})} \exp\left(-\frac{1}{2}{\rm Tr}({\mathbf V}^{-1}\mathbf{W})\right)\]

See also notes in t-, chi-squared, F-, Wishart distributions.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Wishart distribution

Input format 

Each row of the input matrix is a single square matrix, flattened; use apop_data_pack to convert your sequence of matrices into rows.

Prep_routine 

Just allocates the parameters based on the size of the input data.

Parameter format 

$N$ (the degrees of freedom) is the zeroth element of the vector. The matrix holds the matrix of parameters.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

You can use this to generate random covariance matrices, should you need them. See example below.

Examples 

Making some random draws:

1 apop_model *m = apop_estimate(yr_data, apop_wishart);
2 gsl_matrix *rmatrix = gsl_matrix_alloc(10, 10);
3 gsl_rng *r = apop_rng_alloc(8765);
4 for (int i=0; i< 1e8; i++){
5  apop_draw(rmatrix->data, r, m);
6  do_math_with_matrix(rmatrix);
7 }
apop_model apop_wls

The Weighed Least Squares model This is a (deprecated) synonym for apop_ols, qv. If you use the apop_ols model and provide weights in your_input_data->weights, then I will use them appropriately. That is, the apop_ols model really implements Weighted Least Squares, but in most cases weights==NULL and the math reduces to the special case of Ordinary Least Squares.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Post-estimate 
dataUnchanged.
parametersSee parameter format.
apop_model apop_yule

$ Y(x, b) = (b-1) \gamma(b) \gamma(k) / \gamma(k+b) $

$ \ln Y(x, b) = \ln(b-1) + ln\gamma(b) + \ln\gamma(k) - \ln\gamma(k+b) $

$ d\ln Y/db = 1/(b-1) + \psi(b) - \psi(k+b) $

apop_yule.estimate() is an MLE, so feed it appropriate apop_mle_settings.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Yule distribution

Input format 

Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.

See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

One element at the top of the parameter set's vector.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Cribbed from <a href="http://cgm.cs.mcgill.ca/~luc/mbookindex.html>Devroye (1986), p 553.

apop_model apop_zipf

Wikipedia has notes on the Zipf distribution. $Z(a) = {1\over \zeta(a) * i^a} $

$lnZ(a) = -(\log(\zeta(a)) + a \log(i)) $

apop_zipf.estimate() is an MLE, so feed it appropriate apop_mle_settings.

Enumerator
Methods are Default or Model-specific 
Estimation D
Prob. D
Log likelihood D
RNG D
Predict D
CDF D
Score D
Prep routine D
Name 

Zipf distribution

Input format 

Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.

See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

One item in the parameter set's vector.

Post-estimate 
dataUnchanged.
parametersSee parameter format.
RNG 

Returns a ranking: If the population were Zipf distributed, you're most likely to get the 1st most common item, so this produces a lot of ones, a great deal of twos, and so on.

Cribbed from <a href="http://cgm.cs.mcgill.ca/~luc/mbookindex.html>Devroye (1986), Chapter 10, p 551.

Function Documentation

double apop_cdf ( apop_data d,
apop_model m 
)

Input a data point in canonical form and a model; returns the area of the model's PDF beneath the given point.

By default, I just make random draws from the PDF and return the percentage of those draws beneath or equal to the given point. Many models have closed-form solutions that make no use of random draws.

See also apop_cdf_settings, which is the structure I use to store draws already made (which means the second, third, ... calls to this function will take much less time than the first), the gsl_rng, and the number of draws to be made. These are handled without your involvement, but if you would like to change the number of draws from the default, add this group before calling apop_cdf :

1 Apop_model_add_group(your_model, apop_cdf, .draws=1e5, .rng=my_rng);
2 double cdf_value = apop_cdf(your_data_point, your_model);

Here are many examples using common, mostly symmetric distributions.

#include <apop.h>
int main(){
//Set up an apop_data set with only one number.
//Most of these functions will only look at the first data point encountered.
apop_data *onept = apop_data_alloc(1, 1);
apop_data_set(onept, .val=23);
double val = apop_cdf(onept, norm);
assert(fabs(val - 0.5) < 1e-4);
double tolerance = 1e-8;
//Macroizing the sample routine above:
#define model_val_cdf(model, value, cdf_result) { \
apop_data_set(onept, .val=(value)); \
assert(fabs((apop_cdf(onept, model))-(cdf_result))< tolerance); \
}
model_val_cdf(uni, 0, 0);
model_val_cdf(uni, 20, 0);
model_val_cdf(uni, 21, 1./6);
model_val_cdf(uni, 23, 0.5);
model_val_cdf(uni, 25, 5./6);
model_val_cdf(uni, 26, 1);
model_val_cdf(uni, 260, 1);
//Improper uniform always returns 1/2.
model_val_cdf(apop_improper_uniform, 0, 0.5);
model_val_cdf(apop_improper_uniform, 228, 0.5);
model_val_cdf(apop_improper_uniform, INFINITY, 0.5);
model_val_cdf(binom, 0, 0);
model_val_cdf(binom, 1000, .5);
model_val_cdf(binom, 2000, 1);
//p(0)=.25; p(1)=.75; that determines the CDF.
//Notice that the CDF's integral is over a closed interval.
model_val_cdf(bernie, -1, 0);
model_val_cdf(bernie, 0, 0.25);
model_val_cdf(bernie, 0.1, 0.25);
model_val_cdf(bernie, .99, 0.25);
model_val_cdf(bernie, 1, 1);
model_val_cdf(bernie, INFINITY, 1);
//alpha=beta -> symmetry
model_val_cdf(beta, -INFINITY, 0);
model_val_cdf(beta, 0.5, 0.5);
model_val_cdf(beta, INFINITY, 1);
//This beta distribution -> uniform
model_val_cdf(beta_uni, 0, 0);
model_val_cdf(beta_uni, 1./6, 1./6);
model_val_cdf(beta_uni, 0.5, 0.5);
model_val_cdf(beta_uni, 1, 1);
beta_uni->cdf = NULL; //With no closed-form CDF; make random draws to estimate the CDF.
Apop_model_add_group(beta_uni, apop_cdf, .draws=1e6); //extra draws to improve accuracy, but we have to lower our tolerance anyway.
tolerance=1e-3;
model_val_cdf(beta_uni, 0, 0);
model_val_cdf(beta_uni, 1./6, 1./6);
model_val_cdf(beta_uni, 0.5, 0.5);
model_val_cdf(beta_uni, 1, 1);
//sum of three symmetric distributions: still symmetric.
apop_model *sum_of_three = apop_model_mixture(beta, apop_improper_uniform, beta_uni);
model_val_cdf(sum_of_three, 0.5, 0.5);
apop_data *threepts = apop_data_falloc((3,1), -1, 0, 1);
model_val_cdf(kernels, -5, 0);
model_val_cdf(kernels, 0, 0.5);
model_val_cdf(kernels, 10, 1);
}
int apop_draw ( double *  out,
gsl_rng *  r,
apop_model m 
)

Draw from a model.

Parameters
outAn already-allocated array of doubles to be filled by the draw method. This probably has size your_model->dsize.
rA gsl_rng, probably allocated via apop_rng_alloc. Optional; if NULL, then I will call apop_rng_get_thread for an RNG.
mThe model from which to make draws.
  • If the model has its own draw method, then use that.
  • If the model is univariate, use apop_arms_draw to generate random draws.
  • If the model is multivariate, use apop_model_metropolis to generate random draws.
  • This makes a single draw of the given size. See apop_model_draws to fill a matrix with draws.
Returns
Zero on success; nozero on failure. out[0] is probably NAN on failure.
apop_model* apop_estimate ( apop_data d,
apop_model m 
)

estimate the parameters of a model given data.

This function copies the input model, preps it, and calls m.estimate(d, m). If your model has no estimate method, then I assume apop_maximum_likelihood(d, m), with the default MLE params.

I assume that you are using this function rather than directly calling the model's the estimate method. For example, the estimate method may assume that apop_prep has already been called.

Parameters
dThe data
mThe model
Returns
A pointer to an output model, which typically matches the input model but has its parameters element filled in.
double apop_log_likelihood ( apop_data d,
apop_model m 
)

Find the log likelihood of a data/parametrized model pair.

Parameters
dThe data
mThe parametrized model, which must have either a log_likelihood or a p method.
apop_model* apop_model_clear ( apop_data data,
apop_model model 
)

Allocate an apop_model.

This sets up the output elements of the apop_model: the parameters and info.

At close, the input model has parameters of the correct size.

  • This is the default action for apop_prep. If your model has its own prep method, then that gets used instead, but most don't (or call apop_model_clear at the end of their prep routine).

The above two points mean that you probably don't need to call this function directly.

Parameters
dataIf your params vary with the size of the data set, then the function needs a data set to calibrate against. Otherwise, it's OK to set this to NULL.
modelThe model whose output elements will be modified.
Returns
A pointer to the same model, should you need it.
Exceptions
outmodel->error=='d'dimension error.
apop_model* apop_model_copy ( apop_model in)

Outputs a copy of the apop_model input.

Parameters
inThe model to be copied
Returns
A pointer to a copy of the original, which you can mangle as you see fit. Includes copies of all settings groups, and the parameters (if not NULL, copied via apop_data_copy).
  • If in.more_size > 0 I memcpy the more pointer from the original data set.
Exceptions
out->error=='a'Allocation error. In extreme cases, where there aren't even a few hundred bytes available, I will return NULL.
out->error=='s'Error copying settings groups.
out->error=='p'Error copying parameters or info page; the given apop_data struct may be NULL or may have its own ->error element.
void apop_model_free ( apop_model free_me)

Free an apop_model structure.

The parameters element is freed. These are all the things that are completely copied, by apop_model_copy, so the parent model is still safe after this is called. data is not freed, because the odds are you still need it.

If free_me->more_size is positive, the function runs free(free_me->more). But it has no idea what the more element contains; if it points to other structures (like an apop_data set), you need to free them before calling this function.

  • If free_me is NULL, this does nothing.
Parameters
free_meA pointer to the model to be freed.
double apop_p ( apop_data d,
apop_model m 
)

Find the probability of a data/parametrized model pair.

Parameters
dThe data
mThe parametrized model, which must have either a log_likelihood or a p method.
apop_model* apop_parameter_model ( apop_data d,
apop_model m 
)

Get a model describing the distribution of the given parameter estimates.

For many models, the parameter estimates are well-known, such as the $t$-distribution of the parameters for OLS.

For models where the distribution of $\hat{}p$ is not known, if you give me data, I will return a apop_normal or apop_multivariate_normal model, using the parameter estimates as mean and apop_bootstrap_cov for the variances.

If you don't give me data, then I will assume that this is a stochastic model where re-running the model will produce different parameter estimates each time. In this case, I will run the model 1e4 times and return a apop_pmf model with the resulting parameter distributions.

Before calling this, I expect that you have already run apop_estimate to produce $\hat{}p$.

The apop_pm_settings structure dictates details of how the model is generated. For example, if you want only the distribution of the third parameter, and you know the distribution will be a PMF generated via random draws, then set settings and call the model via:

1 apop_model_group_add(your_model, apop_pm, .index =3, .draws=3e5);
2 apop_model *dist = apop_parameter_model(your_data, your_model);
  • index gives the position of the parameter (in apop_data_pack order) in which you are interested. Thus, if this is zero or more, then you will get a univariate output distribution describing a single parameter. If index == -1, then I will give you the multivariate distribution across all parameters. The default is zero (i.e. the univariate distribution of the zeroth parameter).
  • rng If the method requires random draws (as the default bootstrap will), then use this. If you provide NULL and one is needed, I provide one for you via apop_rng_alloc(apop_opts.rng_seed++).
  • draws If there is no closed-form solution and bootstrap is inappropriate, then the last resort is a large numbr of random draws of the model, summarized into a PMF. Default: 1,000 draws.
  • The default is via resampling as above, but special-case calculations for certain models are held in a vtable; see Registering new methods in vtables for details. The typedef new functions must conform to and the hash used for lookups are:
1 typedef apop_model* (*apop_parameter_model_type)(apop_data *, apop_model *);
2 #define apop_parameter_model_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p)*33 + (m1).estimate ? (size_t)(m1).estimate: 27)
apop_data* apop_predict ( apop_data d,
apop_model m 
)

A prediction supplies E(a missing value | original data, already-estimated parameters, and other supplied data elements ).

For a regression, one would first estimate the parameters of the model, then supply a row of predictors X. The value of the dependent variable $y$ is unknown, so the system would predict that value. [In some models, this may not be the expected value, but is a best value for the missing item using some other meaning of `best'.]

For a univariate model (i.e. a model in one-dimensional data space), there is only one variable to omit and fill in, so the prediction problem reduces to the expected value: E(a missing value | original data, already-estimated parameters).

In other cases, prediction is the missing data problem: for three-dimensional data, you may supply the input (34, NaN, 12), and the parameterized model provides the most likely value of the middle parameter.

  • If you give me a NULL data set, I will assume you want all values filled in—the expected value.
  • If you give me data with NaNs, I will take those as the points to be predicted given the provided data.

If the model has no predict method, the default is to use the apop_ml_impute function to do the work.

Returns
If you gave me a non-NULL data set, I will return that, with the zeroth column or the NaNs filled in. If NULL input, I will allocate an apop_data set and fill it with the expected values.

There may be a second page (i.e., a apop_data set attached to the ->more pointer of the main) listing confidence and standard error information. See your specific model documentation for details.

This segment of the framework is in beta—subject to revision of the details.

1 typedef apop_data * (*apop_predict_type)(apop_data *d, apop_model *params);
2 #define apop_predict_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p)*33 + (m1).estimate ? (size_t)(m1).estimate: 27)
void apop_prep ( apop_data d,
apop_model m 
)

The default prep is to simply call apop_model_clear. If the function has a prep method, then that gets called instead.

void apop_score ( apop_data d,
gsl_vector *  out,
apop_model m 
)

Find the vector of first derivatives (aka the gradient) of the log likelihood of a data/parametrized model pair.

Parameters
dThe data
outThe score to be returned. I expect you to have allocated this already.
mThe parametrized model, which must have either a log_likelihood or a p method.
1 typedef void (*apop_score_type)(apop_data *d, gsl_vector *gradient, apop_model *m);
2 #define apop_score_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p))

As input to your function, you can expect that the model m is sufficiently prepped that the log likelihood can be evaluated; see p, log_likelihood for details. On output, the a gsl_vector input to the function must be filled with the gradients (or NaNs on errors). If the model parameters have a more complex shape than a simple vector, then the vector must be in apop_data_pack order; use apop_data_unpack to reformat to the preferred shape.

Autogenerated by doxygen on Sun Oct 26 2014 (Debian 0.999b+ds3-2).