9.4. Data Analysis

9.4.1. Programme Selection

The left hand side of the main Jemboss window (Section 9.2.9.1, “Main Jemboss Window”) gives access to all programs available through the Jemboss interface.

9.4.2. Program Categories

At the top of the pane, the category menus group together programs with similar analysis characteristics.

  • Click on Alignment and then highlight global from the submenu to see all programs that offer a global alignment of sequences. Highlight and click on stretcher to see the program form appear in the central Jemboss pane.

9.4.3. Favourites

Located on the Jemboss toolbar, the favourites menu offers a selection of commonly used programs. These can be edited (Section 9.7.2, “Programme Selection”) to customise the list and optimise program access.

  • Click on the Favourites menu and select Global Alignments. This will alter the program in the central pane to Needle.

9.4.4. Alphabetical Program List

Further down the left hand pane all the programs are listed alphabetically. The scroll bar to the right allows access to any one of these programs. However, if the name of the required program is known, access may be quicker using the Go To box (Section 9.4.5, “Go To Box”).

9.4.5. Go To Box

Directly above the alphabetical program list is an entry field. Any entry accesses the program list and highlights a program name according to the letters in the entry field. This method can be faster than any other selection method as only a few letters of the program name need be typed in.

Type m in the Go To box to highlight the first program beginning with m.

Add at to the Go To box so that the entry now reads mat. This will highlight the first entry beginning with mat, which is the local alignment program matcher. Press the Return key to bring up the matcher program form in the central pane.

The same text entry can be used to reselect the same program in the event of mis-entry (see Section 9.7.3, “Input/Output Options”).
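The Go To behaviour described above (highlight the first program whose name begins with the typed letters) can be sketched as a prefix search over an alphabetical list. This is a minimal illustration, not the Jemboss implementation; the PROGRAMS list is a small illustrative subset and first_match is a hypothetical helper name.

```python
from bisect import bisect_left

# Illustrative subset of program names in alphabetical order
# (an assumption; a real installation lists many more).
PROGRAMS = ["dotmatcher", "emma", "marscan", "matcher", "merger",
            "needle", "seqret", "stretcher"]


def first_match(prefix, names=PROGRAMS):
    """Return the first name starting with prefix, or None if there is none."""
    i = bisect_left(names, prefix)  # first position where name >= prefix
    if i < len(names) and names[i].startswith(prefix):
        return names[i]
    return None


print(first_match("m"))    # marscan (first "m" name in this subset)
print(first_match("mat"))  # matcher
```

Because the list is sorted, each extra letter only narrows the search, which is why a few letters are usually enough.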

9.4.6. Input Section

9.4.6.1. Features

Should the results of the selected program require sequence features in any format, then the Use Feature Information box at the top of the input section should be selected. This option is only available for those programs that retrieve sequences: seqret, seqretsplit, skipseq, splitter and union.

9.4.7. File Input

9.4.7.1. File/Database Entry

This is the default selection. It allows entry of stored files (including listfiles (Section 6.6, “The Uniform Sequence Address (USA)”)) via drag and drop from the Local (Section 9.3.1, “Local File Management”) and Remote (Section 9.3.14, “Remote File management”) File Managers, as well as from the Sequence List (Section 9.7.3.1, “Sequence Input”). If the file to be dragged is a listfile (Section 9.3.5.2, “Re-writing a File with New Data”) then the entire entry must be prefixed with an @ sign to indicate to Jemboss the nature of the data.

USAs (Section 6.6, “The Uniform Sequence Address (USA)”) can be entered directly into the field.
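The kinds of entry accepted by this field can be sketched as simple string forms: a plain file path, a listfile prefixed with an @ sign, and a database reference written as database:entry. The helper names and the listfile name seqs.list are hypothetical; only the @ prefix and the colon-separated database form come from the text.

```python
def file_entry(path, is_listfile=False):
    """Format a file path for the entry field; listfiles get an '@' prefix."""
    return "@" + path if is_listfile else path


def database_entry(database, entry):
    """Format a database reference as 'database:entry'."""
    return f"{database}:{entry}"


print(file_entry("bgal_ecoli.fasta"))             # bgal_ecoli.fasta
print(file_entry("seqs.list", is_listfile=True))  # @seqs.list
print(database_entry("uniprot", "bgal_ecoli"))    # uniprot:bgal_ecoli
```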

9.4.7.2. Browse Files

9.4.7.3. Local Files

  • Hit the Reset button to clear the entry field. Open the local file manager (Section 9.3.1, “Local File Management”) and drag the bgal_ecoli.fasta file into the entry field. Once there is visual indication that the mouse is over the input field, drop the file by releasing the mouse button. The entire file path will be displayed in the field.

9.4.7.4. Remote Files

  • Hit the Reset button to clear the field once more.

9.4.7.5. Uniform Sequence Address

  • Hit the Reset button to remove the remote entry. Click on the Input Sequence Options button, select the uniprot option from the Databases available drop down menu and hit the OK button. This database is now written in the entry field. Type bgal_ecoli in the entry field after the colon. The bgal_ecoli sequence will now be retrieved from the uniprot database.

The database retrieval option using such a USA might only be possible if the desktop computer is connected to the Internet as the sequence may need to be retrieved from a remote database.

9.4.7.6. Paste

Selection of this option allows a sequence or a list of sequences to be pasted into a larger field. Sequences should be pasted in using the desktop shortcut for paste (<CONTROL> + V for Windows, <Apple> + V for Macintosh, middle mouse button for Unix).

9.4.7.7. List of Files

This option is useful only for those programs requiring a number of input files such as emma, the multiple sequence alignment tool. It consists of 20 File/Database Entry fields (Section 9.4.7.1, “File/Database Entry”) and accepts files specified in the usual manner.

9.4.8. Input Sequence Options

Very few of these sequence attributes are necessary for a successful analysis run as they can be detected automatically.

  • Hit the Input Sequence Options button to see potential sequence attributes.

9.4.9. Databases Available

Lists all databases available for a particular installation of Jemboss. Full names plus any name derivatives are shown, e.g. both uniprot and uni are often used to specify the uniprot protein database.

9.4.10. Sequence Format

Lists all of the EMBOSS-acceptable formats (Section A.1, “Supported Sequence Formats”). It is not normally necessary to specify the format as the program can generally detect it. However, if the sequence format is somewhat obscure (e.g. ig or jackknifer), it may be required.

9.4.11. Begin/End

Used if only a portion of a larger sequence need be analysed. Thus an entire database entry can be retrieved but only the relevant portion will undergo analysis.

  • Enter 300 in the begin field and 600 in the end field.
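The effect of the begin and end fields can be sketched as taking a slice of the full sequence. The sketch assumes that positions count from 1 and that both ends are included; the 1024-residue placeholder sequence and the function name subsequence are illustrative only.

```python
def subsequence(sequence, begin, end):
    """Return residues begin..end, counting from 1 and including both ends."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("begin/end outside the sequence")
    return sequence[begin - 1:end]


placeholder = "A" * 1024          # stand-in for a retrieved database entry
region = subsequence(placeholder, 300, 600)
print(len(region))                # 301
```

The whole entry is still retrieved; only the selected region is passed on for analysis.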

9.4.12. Reverse Complement

A selection here will ensure that the analysis run also includes a check of the reverse complement sequence. It can be used, for example, for nucleotide sequence translations and finding open reading frames or stem loops.
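For reference, the reverse complement of a nucleotide sequence is obtained by complementing each base and reversing the order. A minimal sketch for unambiguous DNA bases (ambiguity codes are left unchanged here, which is a simplification):

```python
# Translation table for the four unambiguous DNA bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))  # CGGCAT
```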

9.4.13. Nucleotide/Protein

This specifies the type of sequence file used as input. This is generally obvious to the program, but may be necessary for specific types of sequences, for example a peptide sequence composed of a disproportionate number of alanines, threonines, glycines and cysteines (one-letter codes A, T, G and C), or a nucleotide sequence containing several ambiguity codes. Only one of these options may be selected.
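To see why automatic detection can fail, consider a naive detector that calls a sequence nucleotide when most of its letters are A, C, G or T. This is an illustrative heuristic only, not the test EMBOSS actually applies; the 0.9 threshold and the example sequences are assumptions.

```python
def guess_type(sequence, threshold=0.9):
    """Guess 'nucleotide' if at least `threshold` of the letters are A/C/G/T."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    acgt = sum(c in "ACGT" for c in letters)
    return "nucleotide" if acgt / len(letters) >= threshold else "protein"


print(guess_type("ATGGCGTACGTTAGC"))  # nucleotide (correct)
print(guess_type("MKVLAAGTCW"))       # protein (correct)
print(guess_type("GATTACAGCT"))       # nucleotide, even if meant as a peptide
print(guess_type("ARYGCWSKMT"))       # protein, although these are IUPAC bases
```

In the last two cases the letters alone cannot settle the question, which is when the explicit nucleotide or protein selection is needed.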

9.4.14. Upper/Lower Case

Forces the program to return the sequence text in either upper or lower case. The default is upper case. Only one of these options may be selected.

9.4.15. UFO Features

The UFO (Uniform Feature Object) is the standard way of specifying file formats containing feature information (Section 5.3, “Introduction to Feature Formats”). In order to use this option the Use Feature Information box (Section 9.4.6.1, “Features”) should be selected.

The UFO features box can be used to load a features file in association with any sequence specified on the main application form. The UFO command line syntax that needs to be used is explained elsewhere (Section 6.7, “The Uniform Feature Object (UFO)”).

9.4.16. Load Sequence Attributes

This is a large bar running halfway across the central pane with text in red capitals. Its action is to load the sequence in advance of the analysis run. This is only relevant in cases where there are parameter dependencies on the form which are based on the sequence. The most obvious of these cases are alignment programs, which select default matrices and penalties based on whether the sequence is nucleotide or protein.

Hit the LOAD SEQUENCE ATTRIBUTES bar to update the default gap penalties. Select No to the confirmation message so the inputted start and end sites are not overwritten.

This bar will load sequence attributes for the entire sequence, and so will offer to override any attributes selected in the Input Sequence Options (Section 9.4.8, “Input Sequence Options”).

  • Enter uni:bgal1_entcl in the second sequence filename entry box and load sequence attributes for that sequence also. Look at the begin and end sequence attribute options to ensure that all 1028 residues of the sequence have been loaded.
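The dependency described in this section, where form defaults change once the sequence type is known, can be sketched as a lookup keyed on that type. EBLOSUM62 and EDNAFULL are the names of EMBOSS scoring matrix data files; treating them as the defaults for every alignment program is an assumption of this sketch, and gap penalties are omitted because they differ between programs.

```python
# Matrix names are those of EMBOSS data files; applying them to every
# alignment program is an assumption of this sketch.
DEFAULT_MATRIX = {"protein": "EBLOSUM62", "nucleotide": "EDNAFULL"}


def form_defaults(sequence_type):
    """Return the form defaults that depend on the loaded sequence type."""
    if sequence_type not in DEFAULT_MATRIX:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")
    return {"matrix": DEFAULT_MATRIX[sequence_type]}


print(form_defaults("protein"))     # {'matrix': 'EBLOSUM62'}
print(form_defaults("nucleotide"))  # {'matrix': 'EDNAFULL'}
```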

9.4.17. Parameter Selection

Any options (Section 6.1, “Introduction to the EMBOSS Command Line”) needed for analysis of the input file are listed after the input section. These parameters are required for the analysis to complete. All mandatory parameters are subject to a default setting, which may or may not be visible to the user. Consult the documentation (Section 9.9, “Documentation”) for each program to ascertain these settings.

9.4.18. Output Section

Depending on the program, the output section may contain a single option to alter the output sequence format (such as matcher), or it may contain a more comprehensive list of parameters that may be included in the final output (e.g. remap). All output section parameters are subject to a default setting, which may or may not be visible to the user. Consult the documentation (Section 9.9, “Documentation”) for each program to ascertain these settings.

For all programs returning a sequence an Output Sequence Name entry field is available, and will name the appropriate results tab (Section 9.6.8.1, “Saved Results Window”) with whatever name is entered. Only the filename is returned, and any filename extensions will be lost.

  • Select seqret by selecting the Database Sequence Retrieval option from the Favourites menu at the top of the Jemboss window. Type uni:bgal_ecoli into the Sequence Filename field and bgal_ecoli_1 into the Output Sequence Name field in the output section. Hit GO and note the name of the results tab (Section 9.6.8.1, “Saved Results Window”) containing the returned sequence.

Currently the name is not transferred when the results are saved; it is for display purposes only.

  • Close the Results window.

9.4.19. Output Sequence Options

Available for any program which outputs a sequence, the output sequence options allow the user to customise a returned sequence should there be such a requirement.

The Separate file for each entry option can be toggled on and off and allows the data to be returned as separate results tabs and not as a single, multiple sequence file. This may be easier to view, but each tab must be saved separately whereas a single multiple data tab can be saved in one go.
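The difference between a single multiple sequence result and one result per entry can be sketched by splitting FASTA text at its header lines. The entry names and residues below are placeholders, and split_fasta is a hypothetical helper, not part of Jemboss.

```python
def split_fasta(text):
    """Map each entry name to its own FASTA record (header plus sequence)."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a numbered name if the header line is empty.
            name = words[0] if words else f"entry{len(records) + 1}"
            records[name] = [line]
        elif name is not None and line.strip():
            records[name].append(line)
    return {n: "\n".join(lines) + "\n" for n, lines in records.items()}


combined = ">seq_a placeholder\nMKVLA\nGTCW\n>seq_b placeholder\nMTTQA\n"
for entry_name, record in split_fasta(combined).items():
    print(entry_name, len(record.splitlines()) - 1, "sequence line(s)")
```

Each value in the returned dictionary corresponds to what would appear in its own results tab.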

9.4.20. Sequence Format

The default output for any sequence in EMBOSS is fasta, but any one of the formats currently supported can be selected from the drop down menu.

9.4.21. Filename Extension

Adds the specified extension to the filename. Anything entered here, however, is overridden by an entry in the Output Sequence Name box (Section 9.4.18, “Output Section”).

This option is not available if the Separate file for each entry option is selected.
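The naming rules described in this part of the output section can be summarised in a short sketch: an Output Sequence Name takes precedence and loses any extension, otherwise the entry name is combined with the chosen extension or, failing that, with the output format. The function name and the exact joining with a dot are assumptions based on the examples in this chapter.

```python
from pathlib import PurePath


def results_tab_name(entry, fmt="fasta", output_name=None, extension=None):
    """Work out the results tab name from the output section settings."""
    if output_name:
        # The Output Sequence Name wins, and its extension is dropped.
        return PurePath(output_name).stem
    return f"{entry}.{extension or fmt}"


print(results_tab_name("bgal_ecoli"))                               # bgal_ecoli.fasta
print(results_tab_name("bgal_ecoli", extension="seq"))              # bgal_ecoli.seq
print(results_tab_name("bgal_ecoli", output_name="bgal_ecoli_1"))   # bgal_ecoli_1
print(results_tab_name("bgal_ecoli", output_name="bgal_ecoli_1.seq",
                       extension="txt"))                            # bgal_ecoli_1
```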

9.4.22. Base File Name

This option is for programs which return more than one data file. The base filename chosen will be applied to all data and ascending numbers appended to the name.

This option is not available if the Separate file for each entry option is selected.

9.4.23. Features Format

Only the features format needs to be specified here; no colon (':') is required. In order to use this option, the Use Feature Information box (Section 9.4.6.1, “Features”) should be selected.

  • Select the Use Feature Information box at the top of the seqret program form. Enter uni:bgal_ecoli in the Sequence Filename field (resetting any other entries if necessary). Open the Output Sequence Options and delete any entries currently visible. In the Features format entry field type swiss. Hit the GO button.

Two results tabs will be returned. The first will be bgal_ecoli.swiss and contain the features of this protein in swissprot format and the second, bgal_ecoli.fasta, will be the sequence. If the swiss format is not entered then Jemboss will return the features in the default GFF format.

  • Close the Results window.

9.4.24. Features Filename

Only the features output filename needs to be specified here. In order to use this option, the Use Feature Information box (Section 9.4.6.1, “Features”) should be selected.

The Use Feature Information option should be selected on the seqret program form. Enter uni:bgal_ecoli in the Sequence Filename field (deleting anything else if necessary). Open the Output Sequence Options and in the Features Format entry field, type swiss. In the Features Filename entry field type features. Hit OK to close the options menu and hit the GO button. The results will be the same as for the previous example except that the output tab for the features is now called features.

If the Separate file for each entry option is selected then the individual sequences appear in separate tabs, but the feature information will appear consecutively in the same tab.

  • Enter uni:bgal*_e* in the Sequence Filename field. Select the Separate file for each entry option in the output sequence options. Leave everything else as in the practical above and hit GO. Scroll to the end of the features tab and compare to the end of the features tab for the last practical. The bgal1_entcl features, and possibly others, should have been added.

  • Close the Results window.
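The features practicals above correspond roughly to a seqret command line. The sketch below only assembles and prints such a command; it does not run EMBOSS. The qualifier names (-sequence, -feature, -offormat, -ofname, -ossingle) are quoted from memory of the EMBOSS command line and should be treated as assumptions to be checked against the output of seqret -help -verbose on the installation in use.

```python
import shlex


def seqret_command(usa, features_format=None, features_name=None,
                   separate_files=False):
    """Assemble (but do not run) a seqret command line as a list of words."""
    command = ["seqret", "-sequence", usa]
    if features_format or features_name:
        command.append("-feature")                  # Use Feature Information
    if features_format:
        command += ["-offormat", features_format]   # Features Format field
    if features_name:
        command += ["-ofname", features_name]       # Features Filename field
    if separate_files:
        command.append("-ossingle")                 # Separate file for each entry
    return command


print(shlex.join(seqret_command("uni:bgal_ecoli", "swiss", "features")))
print(shlex.join(seqret_command("uni:bgal*_e*", "swiss", "features", True)))
```

shlex.join quotes the wildcard entry so that a shell would pass it to seqret unchanged rather than expanding it as a file pattern.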

9.4.25. Sequence Format

The default output format for any single sequence returned by EMBOSS is FASTA. The default for alignment programs may differ between programs and the default is displayed in parentheses. These defaults may be altered using the drop down menus.

9.4.26. Graphical Format

There are two options for those programs which offer a graphical output. The default PNG output is a static line drawing of the output image. The alternative is Jemboss Graphics which can be selected from the drop down menu. This offers an interactive graphical display.

9.4.27. PNG Graphics

  • Select dotmatcher by typing do into the Go To field and hitting return. Enter uni:bgal_ecoli into the first Sequence Filename field and uni:bgal1_entcl into the second. Hit the GO button to return results as a static image.

PNG graphics files must be saved with a .png extension to the filename to allow them to be recognised by the software.

  • Close the graphics window.

9.4.28. Jemboss Graphics

Leave the entries in dotmatcher and alter the drop down menu in the Output Section to read Jemboss Graphics. Hit the GO button. The results should appear in an interactive graphics window.

The font size may be altered using the drop down menu on the graphics toolbar. The view may be altered using the percentage zoom menu, also on the toolbar. Hover the mouse over anywhere on the graphic to see the coordinates of that location.

  • Open the File menu on the graph display and select Display data.

An EMBOSS data file window opens to reveal a text version of the dotmatcher graphic. This information cannot be saved.

9.4.29. Graph Options

  • Hit the Options menu on the graphic toolbar to alter the axes and label information. The OK button applies any alterations and closes the options window. The APPLY button effects the changes on the graphic without closing the window; these changes remain even if the CANCEL button is then used.

9.4.30. Main Title

  • Delete the text in the Main Title field and enter bgal_ecoli vs bgal1_entcl. Hit the APPLY button.

This field will accept unlimited characters, but the title appears on only one line of the graphic, centred above the graph. Thus if the title is too long, it will disappear off the end of the graphic.

9.4.31. Axis number format

The number format for both the X and Y axes can be altered using the drop down menu.

9.4.32. Ticks

The number of ticks displayed on each axis can be entered in the appropriate fields. There is no limit to the number of ticks entered; however, too many will result in a thick, black, indistinguishable line under the axis.

9.4.33. Axis Labels

The axes Start and End sites are labelled by default. The Start site is always zero and the End site represents the length of the sequence. This may lead to irregular axis numbering. There is no limit placed on new entries; thus, if the End value entered is greater than the length of the sequence, the graph will move to the left (on the X axis) and down (on the Y axis).

The title of each axis may be altered by entering the required text. There is no limit to the text that may be entered, but longer text may disappear off the end of the axis.

9.4.34. Graph Formatting

The height and width of the graph may be altered. The plot is created as a disproportional plot, but this can be altered by adjusting the height and/or width of the plot.

Should it be required, the colour of the graph can be altered by clicking once with the left hand mouse button on the Graph Colour square. A new colour may be selected from the resulting palette. The colour affects the data only, and not the axes. The width of the graph line may also be made thicker by adjusting the Graph Line Width. There is currently no limit to the line width which can be selected, but a larger line width may obscure data.

9.4.35. Saving Jemboss Graphics

Graphics are saved as they are viewed on the screen, and can be saved in a variety of different formats.

  • Select the File menu at the top of the dotmatcher graphic. Select the Print option to reveal the Save field and the default PNG format. Alter the format using the right hand Select Format drop down menu to jpeg. Enter dotmatcher.jpeg into the File Name field and save to the same folder as the other files in this section.

9.4.36. Advanced Parameter Selection

Advanced program parameters are hidden on the initial program form as they are not required for the analysis run. They are revealed by clicking on the Advanced Options button and scrolling down the program form.

  • Hit the Advanced Options button to reveal the additional program parameters. Alter the window size to 5 and hit the GO button to display small but almost identical matches. Compare the results with those of the analysis run using the larger window size.
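The influence of the window size can be illustrated with a simplified dot plot. The sketch scores each pair of windows by counting identical residues and records a dot when the count reaches a threshold. dotmatcher itself scores windows with a comparison matrix, so this identity-based version is a stand-in for illustration, and the two short sequences are placeholders.

```python
def dotplot(seq_a, seq_b, window, threshold):
    """Return (i, j) window start pairs whose identity count reaches threshold."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            pairs = zip(seq_a[i:i + window], seq_b[j:j + window])
            score = sum(a == b for a, b in pairs)
            if score >= threshold:
                hits.append((i, j))
    return hits


seq_a = "MKVLAAGTCWMKVLA"   # placeholder sequences
seq_b = "MKVLSAGTCWMRVLA"
large = dotplot(seq_a, seq_b, window=10, threshold=10)
small = dotplot(seq_a, seq_b, window=5, threshold=5)
print(len(large), "dots with window 10;", len(small), "dots with window 5")
```

When every position in the window must match, any dot found with the larger window is also found with the smaller one, so reducing the window size can only add short matches to the plot.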

9.4.37. Program run options

The majority of programs do not require a great amount of compute power and the results are ready immediately. These programs are run interactively (Section 9.4.38, “Interactive”). Some analyses, however, take extra memory and time and it makes sense to run them in batch mode (Section 9.4.39, “Batch”). The mode in which any program is run can be altered using the drop down Execution mode menu at the left of the GO button (in older versions of Jemboss it appears at the bottom right).

9.4.38. Interactive

This is the default for the majority of programs. Results appear in the Saved Results window (Section 9.6.8.1, “Saved Results Window”) on screen as soon as the analysis run finishes. During the run the Jemboss screen is locked and it is impossible to conduct any further analyses whilst the current one is running.

This is the mode in which the dotmatcher example above is run.

Any program can be altered to run in batch mode. This may be advantageous if, for example, the desktop computer is slow or there are a number of analyses which need to be carried out before comparing results.

9.4.39. Batch

Those analyses that require a greater amount of compute power are by default run in batch mode. The entire analysis is done in the background and Jemboss can continue to be used whilst the analysis is running.

  • Alter the drop down menu at the bottom left of the central Jemboss pane to read batch and hit the GO button once again for the dotmatcher analysis. The process is sent to the Job Manager (Section 9.6.3, “Job Manager”) and is noted on screen by the message sending batch process now. Results can be retrieved once the run is completed.

Any program set to run in batch by default may be altered to run in interactive mode; however, this would freeze the Jemboss window for the duration of the run.
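The difference between the two execution modes can be sketched in general terms as a blocking call versus a job handed to a background worker. This is an analogy for how the modes behave from the user's point of view, not a description of how Jemboss or its Job Manager is implemented; the analysis function is a hypothetical stand-in for a program run.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analysis(name, seconds=0.1):
    """Stand-in for an analysis run (hypothetical)."""
    time.sleep(seconds)
    return f"{name} finished"


# Interactive: the caller waits, and nothing else happens until the result is ready.
print(analysis("dotmatcher"))

# Batch: the job runs in the background and the result is collected later.
with ThreadPoolExecutor(max_workers=1) as pool:
    job = pool.submit(analysis, "dotmatcher")
    print("sending batch process now")   # other work could continue here
    print(job.result())                  # retrieve the result once it is complete
```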