Patterns in static

Apophenia

A quick overview

This is a "gentle introduction" to the Apophenia library. It is intended to give you some initial bearings on the typical workflow and the concepts and tricks that the manual pages assume you have met.

This introduction assumes you already have some familiarity with C, how to compile a program, and how to use a debugger. If you want to install Apophenia now so you can try the samples on this page, see the Setting up page.

An outline of this overview:

The opening example

Setting aside the more advanced applications and model-building tasks, let us begin with the workflow of a typical fitting-a-model project using Apophenia's tools:

Here is a concrete example of most of the above steps, which you can compile and run, to be discussed in detail below.

The program:

#include <apop.h>
int main(){
apop_text_to_db(.text_file="data", .tabname="d");
apop_data *data = apop_query_to_data("select * from d");
apop_model_show(est);
}

To run this, you will need a file named data in comma-separated form, so here is a set sufficient for the demonstration. The first column is the dependent variable; the remaining columns are the independent:

Y, X_1, X_2, X_3
2,3,4,5
1,2,9,3
4,7,9,0
2,4,8,16
1,4,2,9
9,8,7,6

If you saved the code to sample.c and don't have a Makefile or other build system, then you can compile it with

gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me

or

clang sample.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me

and then run it with ./run_me. This compile line will work on any system with all the requisite tools, but for full-time work with this or any other C library, you will probably want to write a Makefile .

The results are unremarkable—this is just an ordinary regression on too few data points—but it does give us some lines of code to dissect.

The first two lines in main() make use of a database. I'll discuss the value of the database step more at the end of this page, but for now, note that there are several functions, apop_query, and apop_query_to_data being the ones you will most frequently be using, that will allow you to talk to and pull data from either an SQLite or mySQL/mariaDB database.

Designated initializers

Like this line,

apop_text_to_db(.text_file="data", .tabname="d");

many Apophenia functions accept named, optional arguments. To give another example, the apop_data set has the usual row and column numbers, but also row and column names. So you should be able to refer to a cell by any combination of name or number; for the data set you read in above, which has column names, all of the following work:

x = apop_data_get(data, 2, 3); //x == 0
x = apop_data_get(data, .row=2, .colname="X_3"); // x == 0
apop_data_set(data, 2, 3, 18);
apop_data_set(data, .colname="X_3", .row=2, .val= 18);

Default values mean that the apop_data_get, apop_data_set, and apop_data_ptr functions handle matrices, vectors, and scalars sensibly:

//Let v be a hundred-element vector:
double x1 = apop_data_get(v, 1);
apop_data_set(v, 2, .val=x1);
//A 100x1 matrix behaves like a vector
double m1 = apop_data_get(v, 1);
//let s be a scalar stored in a 1x1 apop_data set:
double *scalar = apop_data_ptr(s);

This form may be new to users of less user-friendly C libraries, but it it fully conforms to the C standard (ISO/IEC 9899:2011). See the Designated initializers page for details.

apop_data

A lot of real-world data processing is about quotidian annoyances about text versus numeric data or dealing with missing values, and the the apop_data set and its many support functions are intended to make data processing in C easy. Some users of Apophenia use the library only for its apop_data set and associated functions. See the "data sets" section of the outline page (linked from the header of this page) for extensive notes on using the structure.

The structure includes seven parts:

This is not a generic and abstract ideal, but is really the sort of mess that data sets look like. For example, here is some data for a weighted OLS regression. It includes an outcome variable in the vector, dependent variables in the matrix and text grid, replicate weights, and column names in bold labeling the variables:

RownameVector Matrix TextWeights
"Steven"
"Sandra"
"Joe"
Outcome
1
0
1
Age Weight (kg) Height (cm)
32 65 175
41 61 165
40 73 181
Sex State
Male Alaska
Female Alabama
Male Alabama
1
3.2
2.4

Apophenia will generally assume that one row across all of these elements describes a single observation or data point.

See above for some examples of getting and setting individual elements.

Also, apop_data_get, apop_data_set, and apop_data_ptr consider the vector to be the -1st column, so using the data set in the figure, apop_data_get(sample_set, .row=0, .col=-1) == 1.

Reading in data

As per the example above, use apop_text_to_data or apop_text_to_db and then apop_query_to_data.

Subsets

There are many macros to get subsets of the data. Each generates what is considered to be a disposable view: once the variable goes out of scope (by the usual C rules of scoping), it is no longer valid. However, these structures are all wrappers for pointers to the base data, so all operations on the data view affect the base data.

#include <apop.h>
#ifdef Datadir
#define DATAFILE Datadir "/" "data"
#else
#define DATAFILE "data"
#endif
int main(){
apop_table_exists( DATAFILE , 'd');
apop_data *d = apop_text_to_data( DATAFILE );
//tally row zero of the data set's matrix by viewing it as a vector:
gsl_vector *one_row = Apop_rv(d, 0);
double sigma = apop_vector_sum(one_row);
printf("Sum of the first row: %g\n", sigma);
assert(sigma==14);
//view the first column as a vector; take its mean
double mu = apop_vector_mean(Apop_cv(d, 0));
printf("Mean of the first col: %g\n", mu);
assert(fabs(mu - 19./6)<1e-5);
//get a sub-data set (with names) of two rows beginning at row 3; print to screen
apop_data *six_elmts = Apop_rs(d, 3, 2);
apop_data_print(six_elmts);
}

All of these slicing routines are macros, because they generate several background variables in the current scope (something a function can't do). Traditional custom is to put macro names in all caps, like APOP_DATA_ROWS, which to modern sensibilities looks like yelling. The custom has a logic: there are ways to hang yourself with macros, so it is worth distinguishing them typographically. The documentation always uses a single capital.

Notice that all of the slicing macros return nothing, so there is nothing to do with one but put it on a line by itself. This limits the number of misuses.

Basic manipulations

The outline page (which you can get to via a link next to the snowflake header at the top of every page on this site) lists a number of other manipulations of data sets, such as apop_data_listwise_delete for quick-and-dirty removal of observations with NaNs, apop_data_split / apop_data_stack, or apop_data_sort to sort all elements by a single column.

Apply and map

If you have an operation of the form for each element of my data set, call this function, then you can use apop_map to do it. You could basically do everything you can do with an apply/map function via a for loop, but the apply/map approach is clearer and more fun. Also, if you set OpenMP's omp_set_num_threads(N) for any N greater than 1 (the default on most systems is the number of CPU cores), then the work of mapping will be split across multiple CPU threads. See the outline > data sets > map/apply section for a number of examples.

Text

Text in C is annoying. C already treats strings as pointer-to-characters, so a grid of text data is a pointer-to-pointer-to-pointer-to-character. The text grid in the apop_data structure actually takes this form, but functions are provided to do most or all the pointer work for you. The apop_text_alloc function is really a realloc function: you can use it to resize the text grid as necessary. The apop_text_add function will do the pointer work in copying a single string to the grid. Functions that act on entire data sets, like apop_data_rm_rows, handle the text part as well.

You have your_data->textsize[0] rows and your_data->textsize[1] columns. If you are using only the functions to this point, then empty elements are a blank string (""), not NULL. For reading individual elements, refer to the $(i,j)$th text element via your_data->text[i][j].

Errors

Many functions will set the error element of the apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action:

apop_data *the_data = apop_query_to_data("select * from d");
if (the_data->error) exit(1);
The whole structure

Here is a diagram of all of Apophenia's structures and how they relate. It is taken from this cheat sheet (2 page PDF), which will be useful to you if only because it lists some of the functions that act on GSL vectors and matrices that are useful (in fact, essential) but out of the scope of the Apophenia documentation.

structs.png

All of the elements of the apop_data structure are laid out at middle-left. You have already met the vector, matrix, and weights, which are all a gsl_vector or gsl_matrix.

The diagram shows the apop_name structure, which has received little mention so far because names basically take care of themselves. Just use apop_data_add_names to add names to your data set and apop_name_stack to copy from one data set to another.

The apop_data structure has a more element, for when your data is best expressed in more than one page of data. Use apop_data_add_page, apop_data_rm_page, and apop_data_get_page. Output routines will sometimes append an extra page of auxiliary information to a data set, such as pages named <Covariance> or <Factors>. The angle-brackets indicate a page that describes the data set but is not a part of it (so an MLE search would ignore that page, for example).

Now let us move up the structure diagram to the apop_model structure.

apop_model

Even restricting ourselves to the most basic operations, there are a lot of things that we want to do with our models: estimating the parameters of a model (like the mean and variance of a Normal distribution) given data, or drawing random numbers, or showing the expected value, or showing the expected value of one part of the data given fixed values for the rest of it. The apop_model is intended to encapsulate most of these desires into one object, so that models can easily be swapped around, modified to create new models, compared, and so on.

From the figure above, you can see that the apop_model structure is pretty big, including a number of informational items, key being the parameters, data, and info elements, a list of settings to be discussed below, and a set of procedures for many operations. Its contents are not (entirely) arbitrary: the theoretical basis for what is and is not included in an apop_model, as well as its overall intent, are described in this research report.

There are helper functions that will allow you to avoid deailing with the model internals. For example, the apop_estimate helper function means you never have to look at the model's estimate method (if it even has one), and you will simply pass the model to a function, as with the above form:

apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
Settings

Every model, and every method one would apply to a model, is prone to have a list of settings: how many bins in the histogram, at what tolerance does the maximum likelihood search end, what are the models being combined in the mixture?

Apophenia organizes settings in settings groups, which are then attached to models. In the following snippet, we specify a Beta distribution prior. If the likelihood function were a Binomial distribution, apop_update knows the closed-form posterior for a Beta-Binomial pair, but in this case it will have to run Markov chain Monte Carlo. Attach a apop_mcmc_settings group to the prior to specify details of how the run should work.

For a likelihood, we generate an empirical distribution—a PMF—from an input data set; you will often use the apop_pmf to turn a data set into an distribution. When we call apop_update on the last line, it already has all of the above info on hand.

Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods =1e5);
apop_model *my_pmf = apop_estimate(your_data, apop_pmf);
apop_model *posterior = apop_update(.prior= beta, .likelihood = my_pmf);

You will encounter model settings often when doing nontrivial work with models. All can be set using a form like above. See the settings documentation page for the full list of options. There is a full discussion of using and writing settings groups in the outline page under the Models heading.

Databases and models

Returning to the introductory example, you saw that (1) the library expects you to keep your data in a database, pulling out the data as needed, and (2) that the workflow is built around apop_model structures.

Starting with (2), if a stats package has something called a model, then it is probably of the form Y = [an additive function of X], such as $y = x_1 + \log(x_2) + x_3^2$. Trying new models means trying different functional forms for the right-hand side, such as including $x_1$ in some cases and excluding it in others. Conversely, Apophenia is designed to facilitate trying new models in the broader sense of switching out a linear model for a hierarchical, or a Bayesian model for a simulation. A formula syntax makes little sense over such a broad range of models.

As a result, the right-hand side is not part of the apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged before being passed to the model. This is where part (1), the database, comes in, because it provides a proxy for the sort of formula specification language above:

apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data");

Generating factors and dummies is also considered data prep, not model internals. See apop_data_to_dummies and apop_data_to_factors.

Now that you have est, an estimated model, you can interrogate it. This is really where Apophenia and its encapsulated model objects shine, because you can do more than just admire the parameter estimates on the screen: you can take your estimated data set and fill in or generate new data, use it as an input to the parent distribution of a hierarchical model, et cetera. Some simple examples:

//If you have a new data set with missing elements (represented by NaN), you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_show(new_data_set);
//Fill a matrix with random draws.
apop_data *d = apop_model_draws(est, .count=1000);
//How does the AIC_c for this model compare to that of est2?
printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c")
- apop_data_get(est2->info, .rowname="AIC_c"));
Testing

Here is the model for all testing within Apophenia:

There are a handful of named tests that produce a known statistic and then compare to a known distribution, like apop_test_kolmogorov or apop_test_fisher_exact. For traditional distributions (Normal, $t$, $\chi^2$), use the apop_test convenience function.

But if your model is not from the textbook, then you have the tools to apply the above three-step process directly. First I'll give an overview of the three steps, then another working example.

Defaults for the parameter models are filled in via bootstrapping or resampling, meaning that if your model's parameters are decidedly off the Normal path, you can still test claims about the parameters.

The introductory example ran a standard OLS regression, whose output includes some standard hypothesis tests; to conclude, let us go the long way and replicate those results via the general apop_parameter_model mechanism. The results here will of course be identical, but the more general mechanism can be used in situations where the standard models don't apply.

The first part of this program is identical to the program above. The second half executes the three steps uses many of the above tricks: one of the inputs to apop_parameter_model (which row of the parameter set to use) is sent by adding a settings group, we pull that row into a separate data set using Apop_r, and we set its vector value by referring to it as the -1st element.

#include <apop.h>
int main(void){
apop_text_to_db(.text_file="data", .tabname="d");
apop_data *data = apop_query_to_data("select * from d");
apop_model_show(est);
Apop_settings_add_group(est, apop_pm, .index =1);
apop_model *first_param_distribution = apop_parameter_model(data, est);
Apop_row(est->parameters, 1, param);
double area_under_p = apop_cdf(param, first_param_distribution);
apop_data_set(param, 0, -1, .val=0);
double area_under_zero = apop_cdf(param, first_param_distribution);
printf("reject the null for x_1 with %g percent confidence.\n",
2*fabs(area_under_p-area_under_zero));
}

Note that the procedure did not assume the model parameters had a certain form. It queried the model for the distribution of parameter x_1, and if the model didn't have a closed-form answer then a distribution via bootstrap would be provided. Then that model was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this is left as an exercise for the reader.] For a model like OLS, this is entirely overkill, which is why OLS provides the basic hypothesis tests automatically. But for models where the distribution of parameters is unknown or has no closed-form solution, this may be the only recourse.

This introduction has shown you the apop_data set and some of the functions associated, which might be useful even if you aren't formally doing statistical work but do have to deal with data with real-world elements like column names and mixed numeric/text values. You've seen how Apophenia encapsulates as many of a model's characteristics as possible into a single apop_model object, which you can send with data to functions like apop_estimate, apop_predict, or apop_draw. Once you've got your data in the right form, you can use this to simply estimate model parameters, or as an input to later analysis.

Autogenerated by doxygen on Sun Oct 26 2014 (Debian 0.999b+ds3-2).