EMBOSS is richly documented. Depending on your experience and requirements you will want to approach it in different ways:
Application documentation
Library documentation
The source code
Navigate the source code using SRS
Demonstration applications (for each library file)
Programming guides on key topics
AJAX Command Definition (ACD) documentation
C coding standards and guidelines
Quality assurance guidelines
Code and application documentation standards
EMBOSS Software Development Course
You should familiarise yourself with the applications and get to know what has or hasn't been done already. Every EMBOSS application is well documented:
AJAX and NUCLEUS contain hundreds of library calls and this can be daunting at first. Documentation for AJAX and NUCLEUS is available on the EMBOSS website, for the CVS (Developers) Release and major versions of the Stable Release. The documentation is derived from structured comments in the source code itself (see Appendix D, Code Documentation Standards). It is easy to navigate, especially when you have some familiarity with the libraries, enough to guess the library file a function lives in.
AJAX is the core library used by all EMBOSS applications. It covers standard data structures and algorithms:
NUCLEUS provides higher-level functions specific to molecular sequence analysis:
It is easy to navigate the library documentation available from the EMBOSS homepage (http://emboss.open-bio.org/).
From the EMBOSS homepage, click on "AJAX" or "NUCLEUS".
This will bring up a table for the AJAX or NUCLEUS library.
Each row in the AJAX or NUCLEUS library table corresponds to an individual library file, e.g. for Alignments, Array handling, Assert Functions and so on. There are columns in the table for:
Links here bring up the library file documentation (see below) which references all the available objects (C data structures) and functions for that library file.
A short description of the library file.
Links here bring up a detailed programming guide and usage notes for the library file, if available (see Section 6.2, “Programming Guides”).
Links to the C source code for an example application, that illustrates the use of the library, if available (see Section 6.1, “Demonstration Applications”).
Links to the ACD code for an example application (see Section 6.1, “Demonstration Applications”).
Find "String manipulation" in the table and follow the link under "Library documentation".
This will bring up the documentation available for string handling (the ajstr.c/h library files).
The library file documentation includes the following sections:
A short description of the library file.
A longer description of the library file.
Table of names, short description and links to further information for each object (C data structure).
Formal description of each function category in the library file, organised by object type.
Table of names, short description and links to formal description for each function in the library, organised by object type and function category.
Table of names, short description and URL to a formal description for each function in the library, organised alphabetically.
Following a link in the tables of objects or functions brings up information on the objects and functions themselves (see below).
The function documentation includes all the critical information.
The sections in the file are as follows:
This includes the function name, short description and the EMBOSS version number when it was first made available.
The function prototype is given in standard C form.
The function parameters are summarised in a table, grouped by their relationship to the function as follows:
INPUT parameters are read by the function.
OUTPUT parameters are written by it.
UPDATE parameters may be read and written.
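The three parameter categories can be illustrated with a short, self-contained C sketch. The function names below (demo_count_char, demo_split_pair, demo_to_upper) are hypothetical and are not part of AJAX or NUCLEUS; they only show what it means for a parameter to be read, written, or both.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* INPUT: 'text' and 'c' are only read by the function. */
size_t demo_count_char(const char *text, char c)
{
    size_t n = 0;

    assert(text != NULL);
    for (; *text != '\0'; text++)
        if (*text == c)
            n++;
    return n;
}

/* OUTPUT: 'first' and 'second' are written by the function; their
** previous values are ignored. Returns 1 on success, 0 on failure. */
int demo_split_pair(const char *text, int *first, int *second)
{
    assert(text != NULL && first != NULL && second != NULL);
    return sscanf(text, "%d,%d", first, second) == 2;
}

/* UPDATE: 'text' is read and then overwritten in place. */
void demo_to_upper(char *text)
{
    assert(text != NULL);
    for (; *text != '\0'; text++)
        *text = (char) toupper((unsigned char) *text);
}
```

In a C prototype, INPUT parameters are typically passed by value or as pointers to const data, whereas OUTPUT and UPDATE parameters are passed as pointers to data the caller owns.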
Description of return value(s).
Full description of function.
C source code of function.
A typical use of the function, generated automatically.
Peripheral documentation such as usage notes.
Cautionary usage advice, known bugs etc.
Exception and other messages the function might generate.
External entities the function is dependent upon, for example, environment variables and files.
Links to functions in the same category.
Several fields may well be blank. These will be completed as documentation of the software libraries progresses.
The objects are comprehensively described.
The sections are as follows:
This includes the C data structure name, short description and EMBOSS version number when it was first made available.
Object synopsis (datatypes and variable names).
Definitions of datatypes for the object.
Full description of object.
Description of elements in the data structure.
Functions that operate on the object.
C source code of the data structure.
Typical usage example, generated automatically.
Peripheral documentation such as usage notes.
Cautionary usage advice, known bugs etc.
Links to structures in the same library file.
Again, several fields might be blank and will be completed as documentation of the software libraries progresses.
The source code is a vital reference. A simple method for searching the library or application code is to use the UNIX command grep to search the C source files for keywords. This is a convenient and direct way to find objects or functions quickly.
If you are unsure how to do a particular task, for example reading in a data file, then you should quickly be able to find a program that does something similar to what you need. Bear in mind there are many ways to solve a problem and the example you find might not necessarily be the best way.
There are two files (the C source code and the ACD file) to look at for each application. They're kept in the directories:
/home/auser/emboss/emboss/emboss/c
/home/auser/emboss/emboss/emboss/acd/
The source code (for the CVS (Developers) Release and the latest Stable Release) may be inspected directly and navigated using SRS. The library source code is indexed in SRS at the EBI SRS Server:
http://srs.ebi.ac.uk/
There are separate SRS databases for objects (C data structures) and functions:
http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-newId+-page+LibInfo+-lib+EFUNC
http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-newId+-page+LibInfo+-lib+EDATA
http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-newId+-page+LibInfo+-lib+EFUNCREL
http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-newId+-page+LibInfo+-lib+EDATAREL
From http://www.ebi.ac.uk/srs/:
Click on the Library Page tab at the top of the screen.
Expand the Other databases section by clicking on the + to the left of Other databases.
You will see EDATA, EDATAREL, EFUNC and EFUNCREL listed.
Highlight the check-box next to EMBOSS Data Structures (CVS) and then click on the Query Form tab.
Change one of the AllText options to ID and type a * character in its associated box, then click on Search.
You will see a list of every available object. Here is a more specific search:
Return to the query form and replace the * by ajpstr (the AJAX string object).
Click on Search.
You'll see that two entries are returned, AjPStr and AjPStrTok. Click on the link for AjPStr.
The documentation here is in several sections. The first three give the name, description and "aliases" of the object:
AjSStr is the name of the string object.
AjPStr is the datatype for the object pointer.
AjPPStr is the datatype for a pointer to the object pointer.
AjOStr is the datatype for the object proper.
AjSStr, AjOStr, AjPStr
AjSStr is the formal name of the string object, AjOStr is the datatype name for the object, whereas AjPStr is the datatype name for the object pointer. In practice AjOStr (like all other AjO* datatypes) is never used in EMBOSS. Instead, memory for an instance of the object is dynamically allocated to the pointer AjPStr (see Section 5.5, “Programming with Objects”). For this reason, AjPStr is given after "Name" in SRS and, for the sake of brevity, "object" is often used to refer to an AjPStr (for example) when what is really meant is "object pointer". The use of objects and pointers is covered in depth elsewhere (Section 5.5, “Programming with Objects”).
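The naming convention described above can be sketched in plain C. The Demo* names are hypothetical stand-ins and the structure members are invented for illustration; this is not the AJAX definition of the string object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* DemoSStr is the structure name, DemoOStr the datatype for the object
** proper, DemoPStr the object pointer and DemoPPStr a pointer to the
** object pointer. The members are assumptions made for this sketch. */
typedef struct DemoSStr
{
    size_t Len;   /* number of characters held */
    char *Ptr;    /* NUL-terminated character data */
} DemoOStr;

typedef DemoOStr *DemoPStr;
typedef DemoPStr *DemoPPStr;

/* Allocate an object dynamically and return the object pointer.
** Returns NULL if memory cannot be allocated. */
DemoPStr demoStrNewC(const char *txt)
{
    DemoPStr thys;

    assert(txt != NULL);
    thys = malloc(sizeof(*thys));
    if (thys == NULL)
        return NULL;

    thys->Len = strlen(txt);
    thys->Ptr = malloc(thys->Len + 1);
    if (thys->Ptr == NULL)
    {
        free(thys);
        return NULL;
    }
    memcpy(thys->Ptr, txt, thys->Len + 1);
    return thys;
}

/* Free the object and reset the caller's pointer to NULL, which is why
** this function takes the address of the object pointer. */
void demoStrDel(DemoPPStr Pstr)
{
    if (Pstr == NULL || *Pstr == NULL)
        return;
    free((*Pstr)->Ptr);
    free(*Pstr);
    *Pstr = NULL;
}
```

Calling code in this sketch only ever declares DemoPStr variables (for example, DemoPStr seq = demoStrNewC("ACGT");) and never declares a DemoOStr directly, which mirrors the way "object" is used as shorthand for "object pointer".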
EDATA and EDATAREL include links to functions that use each object, which is handy if you want to know what you can do with an object. The functions in EFUNC and EFUNCREL are organised into categories of related functionality that correspond to sections in the C source file (see Appendix D, Code Documentation Standards and below).
After the Alias(es) section you'll see several more blocks which correspond to the function categories. Each block contains a list of available functions within that category. The categories you see will depend upon the library file, but might include:
Iterators - iteration, e.g. over individual characters in a string.
Constructors - create new instances of an object (allocate memory).
Destructors - destroy instances of an object (free memory).
Assignments - initialise an object, replace contents if necessary.
Modifiers - change or replace the contents of an object.
Operators - use, but do not change, the contents of an object.
Outputs - write the contents of an object to an external file.
Casts - convert an object into an object or data of another type.
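To make the categories concrete, here is a minimal sketch with one function per category for a hypothetical string object (iterators are omitted for brevity). All Demo*/demoStr* names and the structure members are assumptions made for illustration; they are not AJAX functions, and real library functions may differ in their signatures and behaviour.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string object; the members are assumptions for this sketch. */
typedef struct DemoSStr
{
    size_t Len;
    char *Ptr;
} DemoOStr;
typedef DemoOStr *DemoPStr;

/* Constructor: allocate a new object holding a copy of 'txt'. */
DemoPStr demoStrNewC(const char *txt)
{
    DemoPStr thys;

    assert(txt != NULL);
    thys = malloc(sizeof(*thys));
    if (thys == NULL)
        return NULL;
    thys->Len = strlen(txt);
    thys->Ptr = malloc(thys->Len + 1);
    if (thys->Ptr == NULL)
    {
        free(thys);
        return NULL;
    }
    memcpy(thys->Ptr, txt, thys->Len + 1);
    return thys;
}

/* Destructor: free the object and set the caller's pointer to NULL. */
void demoStrDel(DemoPStr *Pstr)
{
    if (Pstr == NULL || *Pstr == NULL)
        return;
    free((*Pstr)->Ptr);
    free(*Pstr);
    *Pstr = NULL;
}

/* Assignment: replace the contents. Returns 1 on success, 0 on failure. */
int demoStrAssignC(DemoPStr *Pstr, const char *txt)
{
    DemoPStr fresh;

    assert(Pstr != NULL);
    fresh = demoStrNewC(txt);
    if (fresh == NULL)
        return 0;
    demoStrDel(Pstr);
    *Pstr = fresh;
    return 1;
}

/* Modifier: change the contents in place (convert to upper case). */
void demoStrFmtUpper(DemoPStr str)
{
    size_t i;

    assert(str != NULL);
    for (i = 0; i < str->Len; i++)
        str->Ptr[i] = (char) toupper((unsigned char) str->Ptr[i]);
}

/* Operator: use the contents without changing them. */
size_t demoStrCountK(DemoPStr str, char c)
{
    size_t i;
    size_t n = 0;

    assert(str != NULL);
    for (i = 0; i < str->Len; i++)
        if (str->Ptr[i] == c)
            n++;
    return n;
}

/* Cast: expose the contents as data of another type (a C string). */
const char *demoStrGetPtr(DemoPStr str)
{
    assert(str != NULL);
    return str->Ptr;
}

/* Output: write the contents to a file. Returns 1 on success. */
int demoStrWrite(FILE *outf, DemoPStr str)
{
    assert(outf != NULL && str != NULL);
    return fputs(str->Ptr, outf) != EOF;
}
```

Grouping functions this way means that, once you know which category a task falls into, you can go straight to the matching block of the documentation.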
At the bottom of the page you'll see the following sections:
Attributes lists the elements of the C data structure.
Body gives the C code for the object definition.
The EFUNC database can be searched directly. This is useful if you know the kind of function you want but don't know the name. Function names, and the names and order of function parameters, have been standardised (see Appendix D, Code Documentation Standards) to be intuitive and consistent.
Let's assume you want to search for a function that appends one string to another:
Return to the SRS databases page, uncheck the EDATA database and check the check-box for the EFUNC database.
Select the query form.
It's often best to limit the search to the description field so as to retrieve more specific matches. So:
Change AllText to Description.
Type append & string into the associated box, then click on Search.
A list of functions will appear. You can only use those functions that begin with aj or emb, which are public functions in the AJAX and NUCLEUS libraries respectively. The others are hidden functions, accessed by the internals of EMBOSS and not for general use.
From looking at the names, the functions you need are those in the ajStrAppend* family. You'll see that some of the functions accept other string objects, character strings or just single characters.
This search method is of course limited by the vocabulary used in the function descriptions. For instance, the term "append" is used rather than "catenate". You can see this for yourself by repeating the above search using catenate & string.
To show the advantage of limiting the search:
Change the Description field back to AllText and repeat the string & append query.
You'll see that there is a significant amount of noise in the results list.
Of course you can use SRS if you know the name of a function and need to examine the source code.
Return to the EFUNC page and change AllText to ID.
Now use ajstrappend as the search term. Perform the search and then click on EFUNC:ajStrAppendS.
You should see the source code for ajStrAppendS on screen. Again, the output is in several sections. The name of the function indicates the source library file in which it is to be found; the str of ajStrAppendS indicates the ajstr library. The description field holds the text that is searched by a Description search.
The most useful fields for a user of the library are the Input, Returns and Prototype fields.
The Input field shows that this function takes the address of a string object pointer as its first parameter and a string object pointer per se as its second parameter. The Returns field shows, as expected, the return value of the function (AjBool, a boolean value). All this information is given at a glance in the Prototype field for the function (the prototypes are included in the library code so you don't need to declare them in your applications). A prototype tells the compiler what a function is expecting and what it will return.
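The shape of such a prototype, and the reason the first parameter is the address of the object pointer, can be sketched as follows. This is a hypothetical stand-in written for illustration, not the AJAX source: demoStrAppendS, the Demo* types and the structure members are assumptions, and the standard C bool is used in place of AjBool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string object; the members are assumptions for this sketch. */
typedef struct DemoSStr
{
    size_t Len;
    char *Ptr;
} DemoOStr;
typedef DemoOStr *DemoPStr;

/* Prototype: address of the target object pointer first, then the
** source object pointer; a boolean result is returned. */
bool demoStrAppendS(DemoPStr *Pstr, DemoPStr str);

bool demoStrAppendS(DemoPStr *Pstr, DemoPStr str)
{
    DemoPStr thys;
    char *grown;
    size_t oldlen;
    size_t addlen;

    assert(Pstr != NULL && str != NULL);

    /* Create the target on first use. This is only possible because the
    ** caller passed the address of its pointer. */
    if (*Pstr == NULL)
    {
        thys = calloc(1, sizeof(*thys));
        if (thys == NULL)
            return false;
        thys->Ptr = calloc(1, 1);          /* empty, NUL-terminated */
        if (thys->Ptr == NULL)
        {
            free(thys);
            return false;
        }
        *Pstr = thys;
    }

    thys = *Pstr;
    oldlen = thys->Len;
    addlen = str->Len;

    grown = realloc(thys->Ptr, oldlen + addlen + 1);
    if (grown == NULL)
        return false;
    thys->Ptr = grown;

    /* Copy after updating thys->Ptr so that appending a string to
    ** itself reads from the reallocated buffer. */
    memcpy(grown + oldlen, str->Ptr, addlen);
    grown[oldlen + addlen] = '\0';
    thys->Len = oldlen + addlen;
    return true;
}
```

In this sketch a call looks like demoStrAppendS(&target, source): the target is passed by address because the function may need to create or update it, whereas the source is only read.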
Below the prototype is the body of the function, which contains its source code. C language reserved words are highlighted in red. The source code is marked-up with any calls to other EMBOSS functions. Unhighlighted function calls are standard C library calls. You could click on, for example, ajFatal and see the code for that function.
Clicking on the red arrow on the prototype line will show all the EMBOSS functions that use this particular function. Clicking on the blue arrow will show all the EMBOSS functions that are called by this particular function.
As an EMBOSS application programmer you really don't need to know most of the detailed information above, just the inputs and returns. As a library developer, all the information is useful.
EMBOSS includes, for certain AJAX and NUCLEUS library files, an application which illustrates the correct usage of the common functions. Currently, these "demonstration applications" are kept in the myembossdemo package and have the prefix "demo". Of course, there is an ACD file for each application. For example, the following files illustrate the use of the string library:
/home/auser/emboss/emboss/embassy/myembossdemo/emboss_src/demostring.c
/home/auser/emboss/emboss/embassy/myembossdemo/emboss_acd/demostring.acd
For information on compiling and using these applications see Section 3.1, “EMBOSS Programming”.
Programming guides (Section 6.2, “Programming Guides”) are available for most AJAX sub-libraries. These summarise the available C data structures and functions and give examples of their use. They are very useful if you want to learn all about a particular area of EMBOSS programming.
Every EMBOSS application has an AJAX Command Definition (ACD) file which contains a complete definition of the command line interface and defines all the information the application needs to run. A single library function call from the application source code parses the ACD file and command line and prompts the user for any values still needed.
ACD files are written in the ACD syntax (Appendix A, ACD Syntax Reference) which defines a set of datatypes available to the applications, attributes for qualifying the datatypes, and much more besides. To develop new applications you will need to master ACD programming (see Chapter 5, C Programming).
To ensure consistency, all code should conform to a basic style and standards. You should familiarise yourself with these C coding standards (Appendix C, C Coding Standards), most of which concern the layout of code.
Various quality assurance (QA) tests are performed on the code and documentation to maintain the quality and integrity of the package. This includes application test runs, compilation and memory leak tests and validation of the structured documentation used for objects and functions.
All code should be thoroughly tested and new library code should be documented to the EMBOSS standard (see below) so that checks can be performed. QA testing is handled by the EMBOSS developers but there are ways to help; if you develop a new application you should also provide test data for it (see Chapter 7, Quality Assurance).
Software without documentation often has little value whereas good documentation can enhance the usefulness of software immensely. All contributed code should be adequately documented. End-user documentation is also required for any new applications. To ensure consistency, the documentation should conform to a basic style and standards that are defined for the code (Appendix D, Code Documentation Standards) and the applications (Section 8.1, “Application Documentation Standards”).
Hands-on courses in "Bioinformatics Software Development using EMBOSS" provide a good introduction to programming in EMBOSS, including all the steps to writing a basic bioinformatics application using the EMBOSS programming libraries. If you would like to attend or host a course then get in touch with the EMBOSS developers (emboss-bug@emboss.open-bio.org).