![]() |
|
Macros | |
#define | apop_model_set_parameters(in,...) |
Functions | |
apop_model * | apop_model_clear (apop_data *data, apop_model *model) |
void | apop_model_free (apop_model *free_me) |
apop_model * | apop_model_copy (apop_model *in) |
apop_model * | apop_estimate (apop_data *d, apop_model *m) |
double | apop_p (apop_data *d, apop_model *m) |
double | apop_log_likelihood (apop_data *d, apop_model *m) |
void | apop_score (apop_data *d, gsl_vector *out, apop_model *m) |
apop_model * | apop_parameter_model (apop_data *d, apop_model *m) |
int | apop_draw (double *out, gsl_rng *r, apop_model *m) |
void | apop_prep (apop_data *d, apop_model *m) |
apop_data * | apop_predict (apop_data *d, apop_model *m) |
double | apop_cdf (apop_data *d, apop_model *m) |
#define apop_model_set_parameters | ( | in, | |
... | |||
) |
Take in an unparameterized apop_model
and return a new apop_model
with the given parameters. For example, if you need a N(0,1) quickly:
This doesn't take in data, so it won't work with models that take the number of parameters from the data, and it will only set the vector of the model's parameter apop_data set. This is most standard models, so that's not a real problem either. If you have a situation where these options are out, you'll have to do something like apop_model *new = apop_model_copy(in); apop_model_clear(your_data, in);
and then set in->parameters
using your data.
in | An unparameterized model, like apop_normal or apop_poisson. |
... | The list of parameters. |
out->error=='d' | dimension error: you gave me a model with an indeterminate number of parameters. Set .vsize or .msize1 and .msize2 first, then call this fn, or use apop_model *new = apop_model_copy(in); apop_model_clear(your_data, in); and then call this (because apop_model_clear sets the dimension based on your data size). |
apop_model apop_bernoulli |
The Bernoulli model: A single random draw with probability .
apop_model apop_beta |
The Beta distribution.
The beta distribution has two parameters and is restricted between zero and one. You may also find apop_beta_from_mean_var to be useful.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Any arrangement of scalar values. | ||||||||||||||||||
Parameter format |
a vector, v[0]= | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Produces a scalar |
apop_model apop_binomial |
The multi-draw generalization of the Bernoulli; the two-bin special case of the Multinomial distribution. This differs from the apop_multinomial only in the input data format.
It is implemented as an alias of the apop_multinomial model, except that it has a CDF, .vsize==2
and .dsize==1
(i.e., we know it has two parameters and a draw returns a scalar).
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Input format |
Each row of the matrix is one observation, consisting of two elements. The number of draws of type zero (sometimes read as `misses' or `failures') are in column zero, the number of draws of type one (`hits', `successes') in column one. | ||||||||||||||||||
Parameter format |
a vector, v[0]= | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
The RNG returns a single number representing the success count, not a vector of length two giving both the failure bin and success bin. This is notable because it differs from the input data format, but it tends to be what people expect from a Binomial RNG. For draws with both dimensions, use a apop_multinomial model with |
apop_model apop_coordinate_transform |
Apply a coordinate transformation of the data to produce a distribution over the transformed data space. This is sometimes called a Jacobian transformation.
Here is an example that replicates the Lognormal distribution.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The input data is sent to the first model, so use the input format for that model. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Settings |
apop_model apop_dcomposition |
Use random draws or parameter estimates output from a first model as input data for a second model.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The input data is sent to the first model, so use the input format for that model. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Settings |
apop_model apop_dconstrain |
A model that constrains the base model to within some data constraint. E.g., truncate to zero for all
outside of a given constraint. Generate using apop_model_dconstrain .
The log likelihood works by using the base_model
log likelihood, and then scaling it based on the part of the base model's density that is within the constraint. If you have an easy means of specifying what that density is, please do, as in the example. If you do not, the log likelihood will calculate it by making draw_ct
random draws from the base model and checking whether they are in or out of the constraint. Because this default method is stochastic, there is some loss of precision, and conjugate gradient methods may get confused.
Here is an example that makes a few draws and estimations from data-constrained models.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
That of the base model. | ||||||||||||||||||
Parameter format |
That of the base model. In fact, the | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Draw from the base model; if the draw is outside the constraint, throw it out and try again. | ||||||||||||||||||
Settings | |||||||||||||||||||
Examples |
#include <apop.h>
#ifdef Testing
#define Show_results(m)
#else
#define Show_results(m) apop_model_print(m, NULL);
#endif
//The constraint function.
return apop_data_get(in) > 0;
}
//The optional scaling function.
double in_bounds(apop_model *m){
double z = 0;
gsl_vector_view vv = gsl_vector_view_array(&z, 1);
}
int main(){
/*Set up a Normal distribution, with data truncated to be nonnegative.
This version doesn't use the in_bounds function above, and so the
default scaling function is used.*/
gsl_rng *r = apop_rng_alloc(213);
.constraint=over_zero, .draw_ct=5e4, .rng=r);
//make draws. Currently, you need to prep the model first.
apop_prep(NULL, trunc);
apop_data *d = apop_model_draws(trunc, 1e5);
//Estimate the parameters given the just-produced data:
apop_model *est = apop_estimate(d, trunc);
Show_results(est);
//Generate a data set that is truncated at zero using alternate means
for (int i=0; i< normald->matrix->size1; i++){
double *d = apop_data_ptr(normald, i);
if (*d < 0) *d *= -1;
}
//this time, use an unparameterized model, and the in_bounds fn
.constraint=over_zero, .scaling=in_bounds);
apop_model *re_est = apop_estimate(normald, re_trunc);
Show_results(re_est)
assert(apop_vector_distance(re_est->parameters->vector, apop_vector_fill(gsl_vector_alloc(2), 0, 1))<1e-1);
apop_model_free(trunc);
}
|
apop_model apop_dirichlet |
A multivariate generalization of the Beta distribution.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Each row of your data matrix is a single observation. | ||||||||||||||||||
Parameter format |
The estimated parameters are in the output model's | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
A call to |
apop_model apop_exponential |
The Exponential distribution.
Some write the function as: If you prefer this form, just convert your parameter via
(and convert back from the parameters this function gives you via
).
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.
| ||||||||||||||||||
Parameter format |
| ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Just a wrapper for | ||||||||||||||||||
CDF |
Produces a single number. |
apop_model apop_gamma |
The Gamma distribution
(also,
)
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.
| ||||||||||||||||||
Parameter format |
First two elements of the vector. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Just a wrapper for See the notes for apop_exponential on a popular alternate form. |
apop_model apop_improper_uniform |
The improper uniform returns for every value of x, all the time (and thus, log likelihood(x)=0). It has zero parameters. It is useful, for example, as an input to Bayesian updating, to represent a fully neutral prior.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Ignored. | ||||||||||||||||||
Parameter format |
| ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
The | ||||||||||||||||||
CDF |
Half of the distribution is less than every given point, so the CDF always returns 0.5. One could perhaps make an argument that this should really be infinity, but a half is more in the spirit of the distribution's use to represent a lack of information. |
apop_model apop_iv |
Instrumental variable regression
Operates much like the apop_ols model, but the input parameters also need to have a table of substitutions (like the addition of the .instruments
setting in the example below). The vector element of the table lists the column numbers to be substituted (the dependent var is zero; first independent col is one), and then one column for each item to substitute.
NULL
, then I will use the row names to find the columns to substitute. This is generally more robust and/or convenient.instruments
data set is somehow NULL
or empty, I'll just run OLS.destroy_data
setting. If you set that to 'y'
, I will overwrite the column in place, saving the trouble of copying the entire data set. Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
See apop_ols; see Data prep rules. | ||||||||||||||||||
Prep_routine |
Focuses on the data shunting. | ||||||||||||||||||
Parameter format |
As per apop_ols | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Examples |
/* Instrumental variables are often used to deal with variables measured with noise, so
this example produces a data set with a column of noisy data, and a separate instrument
measured with greater precision, then sets up and runs an instrumental variable regression.
To guarantee that the base data set has noise and the instrument is cleaner, the
procedure first generates the clean data set, then copies the first column to the
instrument set, then the \c add_noise function inserts Gaussian noise into the base
data set. Once the base set and the instrument set have been generated, the setup for
the IV consists of adding the relevant names and using \ref Apop_model_add_group to add a
\c lm (linear model) settings group with a <tt>.instrument= instrument_data</tt> element.
In fact, the example sets up a sequence of IV regressions, with more noise each
time. This sample is part of Apophenia's test suite, and so checks that the coefficients
are correct along the way.
*/
#include <apop.h>
#define Diff(L, R, eps) Apop_stopif(fabs((L)-(R)>=(eps)), return, 0, "%g is too different from %g (abitrary limit=%g).", (double)(L), (double)(R), eps);
int datalen =1e4;
//generate a vector that is the original vector + noise
void add_noise(gsl_vector *in, gsl_rng *r, double size){
for (int i=0; i< in->size; i++){
double noise;
apop_draw(&noise, r, nnoise);
*gsl_vector_ptr(in, i) += noise;
}
apop_model_free(nnoise);
}
Diff(apop_data_get(m->parameters, .row=0,.col=-1), -1.4, tolerance);
Diff(apop_data_get(m->parameters, .row=1,.col=-1), 2.3, tolerance);
}
int main(){
gsl_rng *r = apop_rng_alloc(234);
apop_data *data = apop_data_alloc(datalen, 2);
for(int i=0; i< datalen; i++){
apop_data_set(data, i, 1, 100*(gsl_rng_uniform(r)-0.5));
apop_data_set(data, i, 0, -1.4 + apop_data_get(data,i,1)*2.3);
}
#ifndef Testing
apop_model_show(oest);
#endif
//the data with no noise will be the instrument.
gsl_vector *col1 = Apop_cv(data, 1);
apop_data *instrument_data = apop_data_alloc(data->matrix->size1, 1);
gsl_vector_memcpy(Apop_cv(instrument_data, 0), col1);
Apop_model_add_group(apop_iv, apop_lm, .instruments = instrument_data);
//Now add noise to the base data four times, and estimate four IVs.
int tries = 4;
apop_model *ests[tries];
for (int nscale=0; nscale<tries; nscale++){
add_noise(col1, r, nscale==0 ? 0 : pow(10, nscale-tries));
ests[nscale] = apop_estimate(data, apop_iv);
#ifndef Testing
if (nscale==tries-1){ //print the one with the largest error.
printf("\nnow IV:\n");
apop_model_show(ests[nscale]);
}
#endif
}
/* Now test. The parameter estimates are unbiased.
As we add more noise, the covariances expand.
Test that the ratio of one covariance matrix to the next
is less than one, though these are typically very much
smaller than one (as the noise is an order of magnitude
larger in each case), and the ratios will be identical
for each j, k below. */
test_for_unbiased_parameter_estimates(ests[0], 1e-6);
for (int i=1; i<tries; i++){
test_for_unbiased_parameter_estimates(ests[i], 1e-3);
gsl_matrix *cov = apop_data_get_page(ests[i-1]->parameters, "<Covariance>")->matrix;
gsl_matrix *cov2 = apop_data_get_page(ests[i]->parameters, "<Covariance>")->matrix;
gsl_matrix_div_elements(cov, cov2);
for (int j =0; j< 2; j++)
for (int k =0; k< 2; k++)
assert(gsl_matrix_get(cov, j, k) < 1);
}
}
|
apop_model apop_kernel_density |
The kernel density smoothing of a PMF or histogram.
At each point along the histogram, put a distribution (default: Normal(0,1)) on top of the point. Sum all of these distributions to form the output distribution.
Elements of apop_kernel_density_settings that you may want to set:
data
a data set, which, if not NULL
and !base_pmf
, will be converted to an apop_pmf model. base_pmf
This is the preferred format for input data. It is the histogram to be smoothed. kernelbase
The kernel to use for smoothing, with all parameters set and a p
method. Popular favorites are apop_normal and apop_uniform. set_params
A function that takes in a single number and the model, and sets the parameters accordingly. The function will call this for every point in the data set. Here is the default, which is used if this is NULL
. It simply sets the first element of the model's parameter vector to the input number; this is appropriate for a Normal distribution, where we want to center the distribution on each data point in turn.For a Uniform[0,1] recentered around the first element of the PMF matrix, you could put this function in your code:
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
I'll estimate a apop_pmf internally, so I follow that format, which is one observation (of any format) per line. | ||||||||||||||||||
Parameter format |
None | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Randomly selects a data point, then randomly draws from that sub-distribution. Returns 0 on success, 1 if unable to pick a sub-distribution (meaning the weights over the distributions are somehow broken), and 2 if unable to draw from the sub-distribution. | ||||||||||||||||||
CDF |
Sums the CDF to the given point of all the sub-distributions. | ||||||||||||||||||
Settings | |||||||||||||||||||
Examples |
This example sets up and uses KDEs based on a Normal and a Uniform distribution. /* This program draws ten random data points, and then produces two kernel density
estimates: one based on the Normal distribution and one based on the Uniform.
It produces three outputs:
--stderr shows the random draws
--kerneldata is a file written with plot data for both KDEs
--stdout shows instructions to gnuplot, so you can pipe:
./kernel | gnuplot -persist
Most of the code is taken up by the plot() and draw_some_data() functions, which are
straightforward. Notice how plot() pulls the values of the probability distributions
at each point along the scale.
The set_uniform_edges function sets the max and min of a Uniform distribution so that the
given point is at the center of the distribution.
The first KDE uses the defaults, which are based on a Normal distribution with std dev 1;
the second explicitly sets the .kernel and .set_fn for a Uniform.
*/
#include <apop.h>
apop_data_set(unif->parameters, 0, -1, r->matrix->data[0]-0.5);
apop_data_set(unif->parameters, 1, -1, r->matrix->data[0]+0.5);
}
apop_data *onept = apop_data_alloc(0,1,1);
FILE *outtab = fopen("kerneldata", "w");
for (double i=0; i<20; i+=0.01){
apop_data_set(onept,0,0, i);
}
fclose(outtab);
printf("plot 'kerneldata' using 1:2\n"
"replot 'kerneldata' using 1:3\n");
}
apop_data *draw_some_data(){
apop_data *d = apop_model_draws(uniform_0_20, 10);
apop_data_print(apop_data_sort(d), .output_pipe=stderr);
return d;
}
int main(){
apop_data *d = draw_some_data();
apop_kernel_density, .base_data=d,
.set_fn = set_uniform_edges,
.kernel = apop_uniform);
plot(k, k2);
}
|
apop_model apop_loess |
Regression via loess smoothing
This uses a somewhat black-box routine, first written by Chamberlain, Devlin, Grosse, and Shyu in 1988, to fit a smoothed series of quadratic curves to the input data, thus producing a curve more closely fitting than a simple regression would.
The curve is basically impossible to describe using a short list of parameters, so the representation is in the form of the predicted
vector of the expected
data set; see below.
From the 1992 manual for the package: ``The method we will use to fit local regression models is called {loess}, which is short for local regression, and was chosen as the name since a loess is a deposit of fine clay or silt along a river valley, and thus is a surface of sorts. The word comes from the German löss, and is pronounced löÃss.''
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The data is basically OLS-like: the first column of the data is the dependent variable to be explained; subsequent variables are the independent explanatory variables. Thus, your input data can either have a dependent vector plus explanatory matrix, or a matrix where the first column is the dependent variable. Unlike with OLS, I won't move your original data, and I won't add a 1, because that's not really the loess custom. You can of course set up your data that way if you like. If your data set has a weights vector, I'll use it. In any case, all data is copied into the model's apop_loess_settings. The code is primarily FORTRAN code from 1988 converted to C; the data thus has to be converted into a relatively obsolete internal format. | ||||||||||||||||||
Parameter format |
The parameter vector is unused. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Predict |
Fills in the zeroth column (ignoring and overwriting any data there), and at the data's This routine is in beta testing. |
apop_model apop_logit |
The Logit model.
Apophenia makes no distinction between the bivariate logit and the multinomial logit. This does both.
The likelihood of choosing item is:
so the log likelihood is
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The first column of the data matrix this model expects is zeros, ones, ..., enumerating the factors; to get there, try apop_data_to_factors; if you forget to run it, I'll run it on the first data column for you. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] | ||||||||||||||||||
Prep_routine |
You will probably want to convert some column of your data into factors, via apop_data_to_factors. If you do, then that adds a page of factors to your data set (and of course adjusts the data itself). If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the first column (the vector if there is one, else the first column of the matrix.) Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS. | ||||||||||||||||||
Parameter format |
As above. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Much like the apop_ols RNG, qv. Returns the category drawn.
The elements of the sum are all now exp(something negative), so overflow won't happen, and if there's underflow, then that term must not have been very important. [This trick is attributed to Tom Minka, who implemented it in his Lightspeed Matlab toolkit.] Here is an artifical example: #include <apop.h>
#include <unistd.h>
char *testfile = "logit_test_data";
//generate a fake data set.
//Notice how the first column is the outcome, just as with standard regression.
void write_data(){
FILE *f = fopen(testfile, "w");
fprintf(f, "\
outcome,A, B \n\
0, 0, 0 \n\
1, 1, 1 \n\
1, .7, .5 \n\
1, .7, .3 \n\
1, .3, .7 \n\
\n\
1, .5, .5 \n\
0, .4, .4 \n\
0, .3, .4 \n\
1, .1, .3 \n\
1, .3, .1 ");
fclose(f);
}
int main(){
write_data();
apop_data *d = apop_text_to_data(testfile);
Apop_model_add_group(apop_logit, apop_mle, .tolerance=1e-5);
unlink(testfile);
/* Apophenia's test suite checks that this code produces
values close to canned values. As a human, you probably
just want to print the results to the screen. */
#ifndef Testing
apop_model_show(est);
#else
#endif
}
|
apop_model apop_lognormal |
The Lognormal distribution
The log likelihood function for the lognormal distribution:
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
I use the all elements of the matrix and vector, without regard to their order. | ||||||||||||||||||
Parameter format |
Zeroth vector element is the mean (after logging); first is the std dev (after logging) | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
An Apophenia wrapper for the GSL's Normal RNG, exp'ed. |
apop_model apop_mixture |
The mixture model transformation: a linear combination of multiple models.
Generated via apop_model_mixture.
Note that a kernel density is a mixture of a large number of homogeneous models, where each is typically centered around a point in your data. For such situations, apop_kernel_density will be easier to use.
Use apop_model_mixture to produce one of these models. In the examples below, some are generated from unparameterized input models with a form like
One can also skip the estimation and use already-parameterized models as input to apop_model_mixture, e.g.:
Notice that the weights vector has to be added after the model is set up. If none is given, then equal weights are assigned to all components of the mixture.
One can think of the estimation as a missing-data problem: each data point originated in one distribution or the other, and if we knew with certainty which data point came from which distribution, then the estimation problem would be trivial: just generate the subsets and call apop_estimate(dataset1, model1)
, ..., apop_estimate(datasetn, modeln)
separately. But the assignment of which element goes where is unknown information, which we guess at using an E-M algorithm. The standard algorithm starts with an initial set of parameters for the models, and assigns each data point to its most likely model. It then re-estimates the model parameters using their subsets. The standard algorithm, see e.g. this PDF, repeats until it arrives at an optimum.
Thus, the log likelihood method for this model includes a step that allocates each data point to its most likely model, and calculates the log likelihood of each observation using its most likely model. [It would be a valuable extension to extend this to not-conditionally IID models. Commit 1ac0dd44
in the repository had some notes on this, now removed.] As a side-effect, it calculates the odds of drawing from each model (the vector λ). Following the above-linked paper, the probability for a given observation under the mixture model is its probability under the most likely model weighted by the previously calculated λ for the given model.
Apohenia modifies this routine slightly because it uses the same maximum likelihood back-end that most other apop_model
s use for estimation. The ML search algorithm provides model parameters, then the LL method allocates observations and reports a LL to the search algorithm, then the search algorithm uses its usual rules to step to the next candidate set of parameters. This provides slightly more flexibility in the search.
Determining to which parts of a mixture to assign a data point is a well-known hard problem, which is often not solvable–that information is basically lost.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The same data gets sent to each of the component models of the mixture. Each row is an observation, and the estimation routine assumes that models are conditionally IID (i.e., having chosen what component of the mixture the observation comes from, its likelihood can be calculated independently of all other observations). | ||||||||||||||||||
Parameter format |
The parameters are broken out in a readable form in the settings group, so your best bet is to use those. See the sample code for usage. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Uses the weights to select a component model, then makes a draw from that component. The model's | ||||||||||||||||||
Settings | |||||||||||||||||||
Examples |
The first example uses a text file #include <apop.h>
/* This replacement for apop_model_show(in) demonstrates retrieval of the useful
settings: the weights (λ) and list of estimated models. */
void show_mix(apop_model *in){
printf("The weights:\n");
apop_vector_print(ms->weights);
printf("\nThe models:\n");
apop_model_print(*m, stdout);
}
int main(){
/* The process is famously sensitive to starting points. Try many random points, or
eyeball the distribution's plot and guess at the starting values. */
Apop_model_add_group(mf, apop_mle, .starting_pt=(double[]){50, 5, 80, 5},
.step_size=3, .tolerance=1e-6);
apop_model *mfe = apop_estimate(dd, mf);
apop_model_print(mfe, stdout);
printf("LL=%g\n", apop_log_likelihood(dd, mfe));
printf("\n\nValues calculated in the source paper, for comparison.\n");
apop_model *r_ed = apop_model_mixture(
apop_model_set_parameters(apop_normal, 54.61364, 5.869089),
apop_model_set_parameters(apop_normal, 80.09031, 5.869089));
apop_data *wts = apop_data_falloc((2), 0.3608498, 0.6391502);
Apop_settings_add(r_ed, apop_mixture, weights, wts->vector);
show_mix(r_ed);
printf("LL=%g\n", apop_log_likelihood(dd, r_ed));
}
#include <apop.h>
/*
Use \ref apop_model_mixture to generate a hump-filled distribution, then find
the most likely data points and check that they are near the humps.
Part of Apophenia's test suite.
*/
//Produce a 2-D multivariate normal model with unit covariance and given mean
out->parameters = apop_data_falloc((2, 2, 2),
x, 1, 0,
y, 0, 1);
out->dsize = 2;
return out;
}
int main(){
//here's a mean/covariance matrix for a standard multivariate normal.
apop_model *many_humps = apop_model_mixture(
produce_fixed_mvn(5, 6),
produce_fixed_mvn(-5, -4),
produce_fixed_mvn(0, 1));
apop_prep(NULL, many_humps);
int len = 100000;
apop_data *d = apop_model_draws(many_humps, len);
gsl_vector *first = Apop_cv(d, 0);
#ifndef Testing
printf("mu=%g\n", apop_mean(first));
#endif
assert(fabs(apop_mean(first)- 0) < 5e-2);
gsl_vector *second = Apop_cv(d, 1);
#ifndef Testing
printf("mu=%g\n", apop_mean(second));
#endif
assert(fabs(apop_mean(second)- 1) < 5e-2);
/* Abuse the ML imputation routine to search for the input value with the highest
log likelihood. Do the search via simulated annealing. */
apop_data *x = apop_data_alloc(1,2);
gsl_matrix_set_all(x->matrix, NAN);
Apop_settings_add_group(many_humps, apop_mle, .n_tries=20, .iters_fixed_T=10, .k=3, .method="annealing");
apop_ml_impute(x, many_humps);
#ifndef Testing
printf("Optimum found at:\n");
apop_data_show(x);
#endif
assert(fabs(apop_data_get(x, .col=0)- 0) + fabs(apop_data_get(x, .col=1) - 1) < 1e-2);
}
|
apop_model apop_multinomial |
The –option generalization of the Binomial distribution.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Each row of the matrix is one observation: a set of draws from a single bin. The number of draws of type zero are in column zero, the number of draws of type one in column one, et cetera.
| ||||||||||||||||||
Parameter format |
The parameters are kept in the vector element of the The numeraire is bin zero, meaning that And now the parameter vector is a proper list of probabilities.
| ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Returns a single vector of length |
apop_model apop_multinomial_probit |
The Multinomial Probit model.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Post-estimate |
|
apop_model apop_multivariate_normal |
This is the multivariate generalization of the Normal distribution.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Each row of the matrix is an observation. | ||||||||||||||||||
Parameter format |
An If you had only one dimension, the mean would be a vector of size one, and the covariance matrix a | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
The RNG fills an input array whose length is based on the input parameters. The nice, easy method from Devroye, p 565 | ||||||||||||||||||
Settings |
None. |
apop_model apop_normal |
The Normal (Gaussian) distribution
You know it, it's your attractor in the limit, it's the Gaussian distribution.
See also the apop_multivariate_normal.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
I use the elements of the matrix and vector, without regard to their order or position. | ||||||||||||||||||
Parameter format |
As is custom, parameter zero (in the vector) is the mean, parmeter one is the standard deviation (i.e., the square root of the variance). | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Predict |
Returns the expected value. The | ||||||||||||||||||
RNG |
An apophenia wrapper for the GSL's Normal RNG. This one asks explicitly for a mean, and the GSL assumes zero and lets you add the mean yourself. | ||||||||||||||||||
Settings |
None. |
apop_model apop_ols |
Ordinary least squares. Weighted least squares is also handled by this model. You can also use it for a lot of not-entirely linear models based on the form .
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
See Data prep rules. | ||||||||||||||||||
Prep_routine |
Focuses on the data shunting. | ||||||||||||||||||
Parameter format |
A vector of OLS coefficients. coeff. zero refers to the constant column, if any. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Linear models are typically only partially defined probability models. For OLS, we know that The apop_lm_settings group includes an apop_model* element named The default is that But you can't draw from an improper uniform. So if you draw from a linear model with a default Alternatively, you may know something about the distribution of the input data. At the least, you could generate a PMF from the actual data: Now, random draws are taken from the input data, and the dependent variable value calculated via | ||||||||||||||||||
Examples |
First, you will need a file named The program: If you saved this code to and then run it with Feeling lazy? The program above was good form and demonstrated useful features, but the code below will do the same thing in two lines: |
apop_model apop_pmf |
A probability mass function is commonly known as a histogram, or still more commonly, a bar chart. It indicates that at a given coordinate, there is a given mass.
Each row of the PMF's data set holds the coordinates, and the weights vector holds the mass at the given point. This is in contrast to the crosstab format, where the location is simply given by the position of the data point in the grid.
For example, here is a typical crosstab:
col 0 | col 1 | col 2 | |
row 0 | 0 | 8.1 | 3.2 |
row 1 | 0 | 0 | 2.2 |
row 2 | 0 | 7.3 | 1.2 |
Here it is as a sparse listing:
value dimension 1 | dimension 2 | |
8.1 | 0 | 1 |
3.2 | 0 | 2 |
2.2 | 1 | 2 |
7.3 | 2 | 1 |
1.2 | 2 | 2 |
The apop_pmf
internally represents data in this manner, with the dimensions in the matrix
, vector
, and text
element of the data set, and the cell values are held in the weights
element (not the vector).
If your data is in a crosstab (with entries in the matrix element for 2-D data or the vector for 1-D data), then use apop_crosstab_to_db to make the conversion. See also this page for another crosstab-to-PMF function as well.
If your data is already in the sparse listing format (which is probably the case for 3- or more dimensional data), then just point the model to your parameter set:
weights
element is NULL
, then I assume that all rows of the data set are equally probable. weights
are present but sum to a not-finite value, the model's error
element is set to 'w'
when the estimation is run, and a warning printed. Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
As above, you can input to the | ||||||||||||||||||
Parameter format |
None. The list of observations and their weights are in the | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Return the data in a random row of the PMF's data set. If there is a weights vector, i will use that to make draws; else all rows are equiprobable.
then I will return the row number of the draw, not the data in that row.
| ||||||||||||||||||
CDF |
Assuming the data is sorted in a meaningful manner, find the total mass up to a given data point. That is, a CDF only makes sense if the data space is totally ordered. The sorting you define using apop_data_sort defines that ordering.
| ||||||||||||||||||
Settings |
apop_model apop_poisson |
The Poisson distribution.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same. | ||||||||||||||||||
Parameter format |
One parameter, the zeroth element of the vector. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Just a wrapper for |
apop_model apop_probit |
The Probit model.
Apophenia makes no distinction between the bivariate probit and the multinomial probit. This one does both.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
The first column of the data matrix this model expects is zeros, ones, ..., enumerating the factors; to get there, try apop_data_to_factors; if you forget to run it, I'll run it on the first data column for you. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] | ||||||||||||||||||
Prep_routine |
You will probably want to convert some column of your data into factors, via apop_data_to_factors. If you do, then that adds a page of factors to your data set (and of course adjusts the data itself). If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the first column (the vector if there is one, else the first column of the matrix.) Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS. | ||||||||||||||||||
Parameter format |
As above | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
See apop_ols; this one is similar but produces a category number instead of OLS's continuous draw. |
apop_model apop_stack |
A stack of models.
For the case when you need to bundle two uncorrelated models into one larger model. For example, the prior for a multivariate normal (whose parameters are a vector of means and a covariance matrix) is a Multivariate Normal-Wishart pair.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
There are two means of handling the input format. If the settings group attached to the data set has a non- If | ||||||||||||||||||
Parameter format |
currently | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
Settings |
apop_model apop_t_distribution |
The t distribution, primarily for descriptive purposes.
If you want to test a hypothesis, you probably don't need this, and should instead use apop_test. See notes in t-, chi-squared, F-, Wishart distributions.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Unordered list of scalars in the matrix and/or vector. | ||||||||||||||||||
Parameter format |
vector->data[0] = mu | ||||||||||||||||||
Post-estimate |
|
apop_model apop_uniform |
This is the two-parameter version of the Uniform, expressing a uniform distribution over [a, b].
The MLE of this distribution is simply a = min(your data); b = max(your data). Primarily useful for the RNG, such as when you have a Uniform prior model.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
An unordered set of numbers in the data set's vector, matrix, or both. | ||||||||||||||||||
Parameter format |
Zeroth vector element is | ||||||||||||||||||
Post-estimate |
|
apop_model apop_wishart |
The Wishart distribution, which is currently somewhat untested.
Here's the likelihood function. is the dimension of the data and covariance matrix,
is the degrees of freedom,
is the
matrix of Wishart parameters, and
is the
matrix whose likelihood is being evaluated.
is the multivariate gamma function.
See also notes in t-, chi-squared, F-, Wishart distributions.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Each row of the input matrix is a single square matrix, flattened; use apop_data_pack to convert your sequence of matrices into rows. | ||||||||||||||||||
Prep_routine |
Just allocates the parameters based on the size of the input data. | ||||||||||||||||||
Parameter format |
| ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
You can use this to generate random covariance matrices, should you need them. See example below. | ||||||||||||||||||
Examples |
Making some random draws: |
apop_model apop_wls |
The Weighed Least Squares model This is a (deprecated) synonym for apop_ols, qv. If you use the apop_ols model and provide weights in your_input_data->weights
, then I will use them appropriately. That is, the apop_ols model really implements Weighted Least Squares, but in most cases weights==NULL
and the math reduces to the special case of Ordinary Least Squares.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Post-estimate |
|
apop_model apop_yule |
apop_yule.estimate() is an MLE, so feed it appropriate apop_mle_settings.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M. See also apop_data_rank_compress for means of dealing with one more input data format. | ||||||||||||||||||
Parameter format |
One element at the top of the parameter set's vector. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Cribbed from <a href="http://cgm.cs.mcgill.ca/~luc/mbookindex.html>Devroye (1986), p 553. |
apop_model apop_zipf |
Wikipedia has notes on the Zipf distribution.
apop_zipf.estimate() is an MLE, so feed it appropriate apop_mle_settings.
Enumerator | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods are Default or Model-specific |
| ||||||||||||||||||
Name |
| ||||||||||||||||||
Input format |
Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M. See also apop_data_rank_compress for means of dealing with one more input data format. | ||||||||||||||||||
Parameter format |
One item in the parameter set's vector. | ||||||||||||||||||
Post-estimate |
| ||||||||||||||||||
RNG |
Returns a ranking: If the population were Zipf distributed, you're most likely to get the 1st most common item, so this produces a lot of ones, a great deal of twos, and so on. Cribbed from <a href="http://cgm.cs.mcgill.ca/~luc/mbookindex.html>Devroye (1986), Chapter 10, p 551. |
double apop_cdf | ( | apop_data * | d, |
apop_model * | m | ||
) |
Input a data point in canonical form and a model; returns the area of the model's PDF beneath the given point.
By default, I just make random draws from the PDF and return the percentage of those draws beneath or equal to the given point. Many models have closed-form solutions that make no use of random draws.
See also apop_cdf_settings, which is the structure I use to store draws already made (which means the second, third, ... calls to this function will take much less time than the first), the gsl_rng
, and the number of draws to be made. These are handled without your involvement, but if you would like to change the number of draws from the default, add this group before calling apop_cdf :
Here are many examples using common, mostly symmetric distributions.
int apop_draw | ( | double * | out, |
gsl_rng * | r, | ||
apop_model * | m | ||
) |
Draw from a model.
out | An already-allocated array of double s to be filled by the draw method. This probably has size your_model->dsize . |
r | A gsl_rng , probably allocated via apop_rng_alloc. Optional; if NULL , then I will call apop_rng_get_thread for an RNG. |
m | The model from which to make draws. |
draw
method, then use that. out[0]
is probably NAN
on failure. apop_model* apop_estimate | ( | apop_data * | d, |
apop_model * | m | ||
) |
estimate the parameters of a model given data.
This function copies the input model, preps it, and calls m.estimate(d, m)
. If your model has no estimate
method, then I assume apop_maximum_likelihood(d, m)
, with the default MLE params.
I assume that you are using this function rather than directly calling the model's the estimate
method. For example, the estimate
method may assume that apop_prep
has already been called.
d | The data |
m | The model |
parameters
element filled in. double apop_log_likelihood | ( | apop_data * | d, |
apop_model * | m | ||
) |
Find the log likelihood of a data/parametrized model pair.
d | The data |
m | The parametrized model, which must have either a log_likelihood or a p method. |
apop_model* apop_model_clear | ( | apop_data * | data, |
apop_model * | model | ||
) |
Allocate an apop_model.
This sets up the output elements of the apop_model:
the parameters and info.
At close, the input model has parameters of the correct size.
prep
method, then that gets used instead, but most don't (or call apop_model_clear at the end of their prep routine).The above two points mean that you probably don't need to call this function directly.
data | If your params vary with the size of the data set, then the function needs a data set to calibrate against. Otherwise, it's OK to set this to NULL . |
model | The model whose output elements will be modified. |
outmodel->error=='d' | dimension error. |
apop_model* apop_model_copy | ( | apop_model * | in | ) |
Outputs a copy of the apop_model input.
in | The model to be copied |
parameters
(if not NULL
, copied via apop_data_copy).in.more_size > 0
I memcpy
the more
pointer from the original data set.out->error=='a' | Allocation error. In extreme cases, where there aren't even a few hundred bytes available, I will return NULL . |
out->error=='s' | Error copying settings groups. |
out->error=='p' | Error copying parameters or info page; the given apop_data struct may be NULL or may have its own ->error element. |
void apop_model_free | ( | apop_model * | free_me | ) |
Free an apop_model structure.
The parameters
element is freed. These are all the things that are completely copied, by apop_model_copy
, so the parent model is still safe after this is called. data
is not freed, because the odds are you still need it.
If free_me->more_size
is positive, the function runs free(free_me->more)
. But it has no idea what the more
element contains; if it points to other structures (like an apop_data set), you need to free them before calling this function.
free_me
is NULL
, this does nothing.free_me | A pointer to the model to be freed. |
double apop_p | ( | apop_data * | d, |
apop_model * | m | ||
) |
Find the probability of a data/parametrized model pair.
d | The data |
m | The parametrized model, which must have either a log_likelihood or a p method. |
apop_model* apop_parameter_model | ( | apop_data * | d, |
apop_model * | m | ||
) |
Get a model describing the distribution of the given parameter estimates.
For many models, the parameter estimates are well-known, such as the -distribution of the parameters for OLS.
For models where the distribution of is not known, if you give me data, I will return a apop_normal or apop_multivariate_normal model, using the parameter estimates as mean and apop_bootstrap_cov for the variances.
If you don't give me data, then I will assume that this is a stochastic model where re-running the model will produce different parameter estimates each time. In this case, I will run the model 1e4 times and return a apop_pmf model with the resulting parameter distributions.
Before calling this, I expect that you have already run apop_estimate to produce .
The apop_pm_settings structure dictates details of how the model is generated. For example, if you want only the distribution of the third parameter, and you know the distribution will be a PMF generated via random draws, then set settings and call the model via:
index
gives the position of the parameter (in apop_data_pack order) in which you are interested. Thus, if this is zero or more, then you will get a univariate output distribution describing a single parameter. If index == -1
, then I will give you the multivariate distribution across all parameters. The default is zero (i.e. the univariate distribution of the zeroth parameter). rng
If the method requires random draws (as the default bootstrap will), then use this. If you provide NULL
and one is needed, I provide one for you via apop_rng_alloc(apop_opts.rng_seed++)
. draws
If there is no closed-form solution and bootstrap is inappropriate, then the last resort is a large numbr of random draws of the model, summarized into a PMF. Default: 1,000 draws. apop_data* apop_predict | ( | apop_data * | d, |
apop_model * | m | ||
) |
A prediction supplies E(a missing value | original data, already-estimated parameters, and other supplied data elements ).
For a regression, one would first estimate the parameters of the model, then supply a row of predictors X. The value of the dependent variable is unknown, so the system would predict that value. [In some models, this may not be the expected value, but is a best value for the missing item using some other meaning of `best'.]
For a univariate model (i.e. a model in one-dimensional data space), there is only one variable to omit and fill in, so the prediction problem reduces to the expected value: E(a missing value | original data, already-estimated parameters).
In other cases, prediction is the missing data problem: for three-dimensional data, you may supply the input (34, NaN
, 12), and the parameterized model provides the most likely value of the middle parameter.
NULL
data set, I will assume you want all values filled in—the expected value.NaNs
, I will take those as the points to be predicted given the provided data.If the model has no predict
method, the default is to use the apop_ml_impute function to do the work.
NULL
data set, I will return that, with the zeroth column or the NaNs
filled in. If NULL
input, I will allocate an apop_data set and fill it with the expected values.There may be a second page (i.e., a apop_data set attached to the ->more
pointer of the main) listing confidence and standard error information. See your specific model documentation for details.
This segment of the framework is in beta—subject to revision of the details.
void apop_prep | ( | apop_data * | d, |
apop_model * | m | ||
) |
The default prep is to simply call apop_model_clear. If the function has a prep method, then that gets called instead.
void apop_score | ( | apop_data * | d, |
gsl_vector * | out, | ||
apop_model * | m | ||
) |
Find the vector of first derivatives (aka the gradient) of the log likelihood of a data/parametrized model pair.
d | The data |
out | The score to be returned. I expect you to have allocated this already. |
m | The parametrized model, which must have either a log_likelihood or a p method. |
As input to your function, you can expect that the model m
is sufficiently prepped that the log likelihood can be evaluated; see p, log_likelihood for details. On output, the a gsl_vector
input to the function must be filled with the gradients (or NaN
s on errors). If the model parameters have a more complex shape than a simple vector, then the vector must be in apop_data_pack
order; use apop_data_unpack
to reformat to the preferred shape.