textsearch
textsearch searches for words (specified as a regular expression) in the description text of one or more input sequences. It writes an output file with optional contents such as the name, description and accession number of any sequence whose description line matches the search term. Optionally, the search can be made case-sensitive and the results written as an HTML table. textsearch is convenient for small input files but will be slow for larger files and databases; for those, use SRS or Entrez instead.
textsearch searches only the description line, not the full sequence annotation.
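
The matching behaviour described above can be approximated in a few lines of Python. This is an illustrative sketch only, not the EMBOSS implementation: it uses Python's `re` module, whose regular-expression dialect may differ in detail from the one EMBOSS uses, and the function name `description_matches` and the sample descriptions are invented for the example.

```python
import re

def description_matches(pattern, description, casesensitive=False):
    """Return True if the regular expression matches anywhere in the description.

    The search is case-insensitive unless casesensitive=True, mirroring the
    documented default of the -casesensitive qualifier.
    """
    flags = 0 if casesensitive else re.IGNORECASE
    return re.search(pattern, description, flags) is not None

# Hypothetical description lines, for illustration only.
descriptions = [
    "Lactose operon repressor",
    "Lactose permease (Lactose-proton symport)",
    "Beta-galactosidase",
]

# '|' means OR, so either word is enough for a description to be reported.
hits = [d for d in descriptions if description_matches("lactose|permease", d)]
```

With these sample descriptions, `hits` contains the first two entries; a case-sensitive search for `lactose` would match neither, because both spell the word with a capital L.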
Search for 'lactose':

    % textsearch "tsw:*" "lactose"
    Search the textual description of sequence(s)
    Output file [12s1_arath.textsearch]:

Example 2
Search for 'lactose' or 'permease' in E.coli proteins:

    % textsearch "tsw:*_ecoli" "lactose | permease"
    Search the textual description of sequence(s)
    Output file [bgal_ecoli.textsearch]:

Example 3
Output a search for 'lacz' formatted with HTML to a file:

    % textsearch "tembl:*" "lacz" -html -outfile embl.lacz.html
    Search the textual description of sequence(s)

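
When textsearch is driven from a script, the command lines in the examples above can be assembled as an argument list. The sketch below is a hedged illustration: `build_textsearch_argv` is a hypothetical helper, it uses only qualifiers documented on this page, and it only builds the list without running anything.

```python
def build_textsearch_argv(usa, pattern, outfile=None, html=False, casesensitive=False):
    """Build an argument list for textsearch from documented qualifiers only."""
    # The sequence USA and the pattern are the first two positional parameters.
    argv = ["textsearch", usa, pattern]
    if casesensitive:
        argv.append("-casesensitive")
    if html:
        argv.append("-html")
    if outfile is not None:
        argv += ["-outfile", outfile]
    return argv

# Equivalent of Example 3 on this page.
argv = build_textsearch_argv("tembl:*", "lacz", outfile="embl.lacz.html", html=True)
```

Passing such a list to `subprocess.run(argv, check=True)` on a machine with EMBOSS installed avoids shell quoting problems with characters such as `*` and `|`, because no shell interprets the arguments.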

    Search the textual description of sequence(s)
    Version: EMBOSS:6.4.0.0

       Standard (Mandatory) qualifiers:
      [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                      format, or reference (input USA)
      [-pattern]           string     The search pattern is a regular expression.
                                      Use a | to indicate OR.
                                      For example:
                                      human|mouse
                                      will find text with either 'human' OR
                                      'mouse' in the text (Any string)
      [-outfile]           outfile    [*.textsearch] Output file name

       Additional (Optional) qualifiers:
       -casesensitive      boolean    [N] Do a case-sensitive search
       -html               boolean    [N] Format output as an HTML table

       Advanced (Unprompted) qualifiers:
       -only               boolean    [N] This is a way of shortening the command
                                      line if you only want a few things to be
                                      displayed. Instead of specifying:
                                      '-nohead -noname -nousa -noacc -nodesc'
                                      to get only the name output, you can specify
                                      '-only -name'
       -heading            boolean    [@(!$(only))] Display column headings
       -usa                boolean    [@(!$(only))] Display the USA of the
                                      sequence
       -accession          boolean    [@(!$(only))] Display 'accession' column
       -name               boolean    [@(!$(only))] Display 'name' column
       -description        boolean    [@(!$(only))] Display 'description' column

       Associated qualifiers:

       "-sequence" associated qualifiers
       -sbegin1            integer    Start of each sequence to be used
       -send1              integer    End of each sequence to be used
       -sreverse1          boolean    Reverse (if DNA)
       -sask1              boolean    Ask for begin/end/reverse
       -snucleotide1       boolean    Sequence is nucleotide
       -sprotein1          boolean    Sequence is protein
       -slower1            boolean    Make lower case
       -supper1            boolean    Make upper case
       -sformat1           string     Input sequence format
       -sdbname1           string     Database name
       -sid1               string     Entryname
       -ufo1               string     UFO features
       -fformat1           string     Features format
       -fopenfile1         string     Features file name

       "-outfile" associated qualifiers
       -odirectory3        string     Output directory

       General qualifiers:
       -auto               boolean    Turn off prompts
       -stdout             boolean    Write first file to standard output
       -filter             boolean    Read first file from standard input, write
                                      first file to standard output
       -options            boolean    Prompt for standard and additional values
       -debug              boolean    Write debug output to program.dbg
       -verbose            boolean    Report some/full command line options
       -help               boolean    Report command line options and exit. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose
       -warning            boolean    Report warnings
       -error              boolean    Report errors
       -fatal              boolean    Report fatal errors
       -die                boolean    Report dying program messages
       -version            boolean    Report version number and exit

Qualifier | Type | Description | Allowed values | Default |
---|---|---|---|---|
Standard (Mandatory) qualifiers | ||||
[-sequence] (Parameter 1) | seqall | (Gapped) sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
[-pattern] (Parameter 2) | string | The search pattern is a regular expression. Use a \| to indicate OR. For example: human\|mouse will find text with either 'human' OR 'mouse' in the text | Any string | |
[-outfile] (Parameter 3) | outfile | Output file name | Output file | <*>.textsearch |
Additional (Optional) qualifiers | ||||
-casesensitive | boolean | Do a case-sensitive search | Boolean value Yes/No | No |
-html | boolean | Format output as an HTML table | Boolean value Yes/No | No |
Advanced (Unprompted) qualifiers | ||||
-only | boolean | This is a way of shortening the command line if you only want a few things to be displayed. Instead of specifying: '-nohead -noname -nousa -noacc -nodesc' to get only the name output, you can specify '-only -name' | Boolean value Yes/No | No |
-heading | boolean | Display column headings | Boolean value Yes/No | @(!$(only)) |
-usa | boolean | Display the USA of the sequence | Boolean value Yes/No | @(!$(only)) |
-accession | boolean | Display 'accession' column | Boolean value Yes/No | @(!$(only)) |
-name | boolean | Display 'name' column | Boolean value Yes/No | @(!$(only)) |
-description | boolean | Display 'description' column | Boolean value Yes/No | @(!$(only)) |
Associated qualifiers | ||||
"-sequence" associated seqall qualifiers | ||||
-sbegin1 -sbegin_sequence | integer | Start of each sequence to be used | Any integer value | 0 |
-send1 -send_sequence | integer | End of each sequence to be used | Any integer value | 0 |
-sreverse1 -sreverse_sequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask1 -sask_sequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide1 -snucleotide_sequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein1 -sprotein_sequence | boolean | Sequence is protein | Boolean value Yes/No | N |
-slower1 -slower_sequence | boolean | Make lower case | Boolean value Yes/No | N |
-supper1 -supper_sequence | boolean | Make upper case | Boolean value Yes/No | N |
-sformat1 -sformat_sequence | string | Input sequence format | Any string | |
-sdbname1 -sdbname_sequence | string | Database name | Any string | |
-sid1 -sid_sequence | string | Entryname | Any string | |
-ufo1 -ufo_sequence | string | UFO features | Any string | |
-fformat1 -fformat_sequence | string | Features format | Any string | |
-fopenfile1 -fopenfile_sequence | string | Features file name | Any string | |
"-outfile" associated outfile qualifiers | ||||
-odirectory3 -odirectory_outfile | string | Output directory | Any string | |
General qualifiers | ||||
-auto | boolean | Turn off prompts | Boolean value Yes/No | N |
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
-warning | boolean | Report warnings | Boolean value Yes/No | Y |
-error | boolean | Report errors | Boolean value Yes/No | Y |
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
-die | boolean | Report dying program messages | Boolean value Yes/No | Y |
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
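
The default `@(!$(only))` shown for the column qualifiers in the table above can be read as "on unless -only is given". The following sketch illustrates that reading; it is an interpretation of the table, not EMBOSS code, and `resolve_columns` is a hypothetical name.

```python
COLUMNS = ("heading", "usa", "accession", "name", "description")

def resolve_columns(only=False, **explicit):
    """Work out which columns are displayed for a given set of qualifiers.

    Each column defaults to the negation of 'only'; a column passed
    explicitly (e.g. name=True) overrides that default.
    """
    unknown = set(explicit) - set(COLUMNS)
    if unknown:
        raise ValueError("unknown column qualifier(s): " + ", ".join(sorted(unknown)))
    default = not only
    return {column: explicit.get(column, default) for column in COLUMNS}

# '-only -name' : every column is switched off except the name.
only_name = resolve_columns(only=True, name=True)

# The longer equivalent: '-nohead -nousa -noacc -nodesc'.
long_form = resolve_columns(heading=False, usa=False, accession=False, description=False)
```

Both calls select the same set of columns, which is why '-only -name' is offered as a shorter way of writing the command line.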
The input is a standard EMBOSS sequence query (also known as a 'USA').
Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl.
Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.
The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.
See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

    # Search for: lactose
    tsw-id:LACI_ECOLI LACI_ECOLI P03023 Lactose operon repressor
    tsw-id:LACY_ECOLI LACY_ECOLI P02920 Lactose permease (Lactose-proton symport)

    # Search for: lactose | permease
    tsw-id:LACI_ECOLI LACI_ECOLI P03023 Lactose operon repressor
    tsw-id:LACY_ECOLI LACY_ECOLI P02920 Lactose permease (Lactose-proton symport)

The first column is the USA of each sequence, followed by its name (ID) and accession number. The remaining text is the description line of the sequence.
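
Output in the default layout can be read back into records with a short parser. This is a sketch under stated assumptions: it expects all four default columns (USA, name, accession, description) to be present and separated by whitespace, it skips the `# Search for:` comment line, and `parse_textsearch_output` is a hypothetical helper name.

```python
def parse_textsearch_output(text):
    """Parse default-layout textsearch output into a list of dictionaries."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# Search for: ...' header comment.
        if not line or line.startswith("#"):
            continue
        # Split off the first three columns; the rest is the description,
        # which may itself contain spaces.
        parts = line.split(None, 3)
        parts += [""] * (4 - len(parts))  # tolerate a missing description
        usa, name, accession, description = parts
        records.append(
            {"usa": usa, "name": name, "accession": accession, "description": description}
        )
    return records

sample = """# Search for: lactose
tsw-id:LACI_ECOLI LACI_ECOLI P03023 Lactose operon repressor
tsw-id:LACY_ECOLI LACY_ECOLI P02920 Lactose permease (Lactose-proton symport)
"""
records = parse_textsearch_output(sample)
```

If columns have been switched off with qualifiers such as '-nousa' or '-only', the positions shift and this parser would need adjusting accordingly.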
When the -html qualifier is specified, the output is wrapped in HTML tags, ready for inclusion in a Web page. Note that tags such as <HTML>, <BODY>, </BODY> and </HTML> are not output by this program, as the table of results is expected to form only part of the contents of a web page; the rest of the web page must be supplied by the user.
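
Because the -html output is only a fragment, it has to be embedded in a page before a browser can display it on its own. A minimal sketch of that step follows; `wrap_html_fragment` is a hypothetical helper, and the fragment used here is a placeholder rather than actual textsearch output.

```python
import html

def wrap_html_fragment(fragment, title="textsearch results"):
    """Embed an HTML fragment in a minimal complete page."""
    return (
        "<html>\n"
        f"<head><title>{html.escape(title)}</title></head>\n"
        "<body>\n"
        f"{fragment.strip()}\n"
        "</body>\n"
        "</html>\n"
    )

# Placeholder fragment; the real markup written by -html may differ.
fragment = "<table><tr><td>LACI_ECOLI</td></tr></table>"
page = wrap_html_fragment(fragment)
```

In practice the fragment would be read from the file named with -outfile, for example embl.lacz.html in Example 3.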
The lines of output are guaranteed not to have trailing whitespace. For example, if '-nodesc' is used, there will be no whitespace after the ID name.
Program name | Description |
---|---|
drtext | Get data resource entries complete text |
entret | Retrieves sequence entries from flatfile databases and files |
ontotext | Get ontology term(s) original full text |
textget | Get text data entries |
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.