hetparse |
Some research applications require knowledge of the types of heterogens (non-protein groups) that are represented in PDB files. A dictionary of heterogen groups containing various data for all of the heterogens found in PDB is available, but it is not in a format consistent with the flat-file formats used for protein structural data in EMBOSS. hetparse parses the dictionary in its raw format and converts it to an EMBL-like format.
hetparse parses the dictionary of heterogen groups available at http://pdb.rutgers.edu/het_dictionary.txt and writes a file containing the group names, synonyms and 3-letter codes in EMBL-like format. Optionally, hetparse will search a directory of PDB files and count the number of files in which each heterogen appears. The path and extension of the PDB files and the names of the input and output files are user-specified.
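The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the hetparse source: the function names `parse_het_dictionary` and `to_embl_like` are hypothetical, the raw file is assumed to be whitespace-delimited with a `HET` line preceding the `HETSYN` and `HETNAM` lines of each entry, and continuation lines are joined without a separator because that is how wrapped names appear in the example output further down this page.

```python
def parse_het_dictionary(lines):
    """Collect ID, name and synonym for each heterogen in the raw dictionary.

    Assumption: every entry has a 'HET <id> <n>' line that comes before its
    HETSYN/HETNAM lines, and fields are separated by whitespace.
    """
    entries = []
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        key = fields[0]
        if key == "HET" and len(fields) >= 2:
            current = {"id": fields[1], "name": "", "synonym": ""}
            entries.append(current)
        elif key in ("HETNAM", "HETSYN") and current is not None:
            if current["id"] not in fields[1:]:
                continue  # line does not belong to the current entry
            # An optional continuation number may sit between keyword and ID.
            idx = fields.index(current["id"], 1)
            text = " ".join(fields[idx + 1:])
            slot = "name" if key == "HETNAM" else "synonym"
            current[slot] += text  # continuation lines are joined directly
    return entries


def to_embl_like(entries):
    """Render entries as ID/DE/SY/NN records terminated by '//'.

    Assumption: three spaces after each two-letter code; a missing name or
    synonym is written as '.', and NN is 0 when no directory was searched.
    """
    out = []
    for entry in entries:
        out.append(f"ID   {entry['id']}")
        out.append(f"DE   {entry['name'] or '.'}")
        out.append(f"SY   {entry['synonym'] or '.'}")
        out.append("NN   0")
        out.append("//")
    return "\n".join(out) + "\n"
```

Calling `to_embl_like(parse_het_dictionary(open("het.txt")))` would then give a string in the same record layout as the output file described below, with every `NN` set to 0.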
% hetparse
Converts raw dictionary of heterogen groups to a file in EMBL-like format.
Name of input file (raw dictionary of heterogen groups): het.txt
Search a directory of PDB files with keywords? [N]: Y
Directory to search with keywords [./]:
Name of EMBL-like format dictionary of heterogen groups. [Ehet.dat]: Ehet.dat
Standard (Mandatory) qualifiers (* if not always prompted):
  [-infile]      infile    This option specifies the name of input file (raw dictionary of heterogen groups) to parse, which should be of the format specified at http://pdb.rutgers.edu/het_dictionary.txt
  -dogrep        toggle    This option specifies whether to search a directory of files (typically PDB files) with keywords. If set, HETPARSE will search the directory and will count the number of files that each heterogen appears in.
* -dirlistpath   dirlist   This option specifies the directory to search with keywords.
  [-outfile]     outfile   This option specifies the name of EMBL-like format dictionary of heterogen groups.

Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers: (none)

Associated qualifiers:
  "-outfile" associated qualifiers
  -odirectory2   string    Output directory

General qualifiers:
  -auto          boolean   Turn off prompts
  -stdout        boolean   Write standard output
  -filter        boolean   Read standard input, write standard output
  -options       boolean   Prompt for standard and additional values
  -debug         boolean   Write debug output to program.dbg
  -verbose       boolean   Report some/full command line options
  -help          boolean   Report command line options. More information on associated and general qualifiers can be found with -help -verbose
  -warning       boolean   Report warnings
  -error         boolean   Report errors
  -fatal         boolean   Report fatal errors
  -die           boolean   Report deaths
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-infile] (Parameter 1) | This option specifies the name of input file (raw dictionary of heterogen groups) to parse, which should be of the format specified at http://pdb.rutgers.edu/het_dictionary.txt | Input file | Required |
-dogrep | This option specifies whether to search a directory of files (typically PDB files) with keywords. If set, HETPARSE will search the directory and will count the number of files that each heterogen appears in. | Toggle value Yes/No | No |
-dirlistpath | This option specifies the directory to search with keywords. | Directory with files | ./ |
[-outfile] (Parameter 2) | This option specifies the name of EMBL-like format dictionary of heterogen groups. | Output file | Ehet.dat |
Additional (Optional) qualifiers | | | |
(none) | | | |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
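For scripted use, the qualifiers listed above can be assembled into a non-interactive command line. The helper below is a hypothetical sketch (`build_hetparse_command` is not part of EMBOSS); it assumes the toggle can be switched on by passing `-dogrep` on its own, and it adds the general qualifier `-auto` to suppress prompts.

```python
def build_hetparse_command(infile, outfile="Ehet.dat", pdb_dir=None):
    """Return an argument list for running hetparse without prompts.

    Only qualifiers documented on this page are used. Passing a directory
    in pdb_dir switches on the optional search (-dogrep, -dirlistpath).
    """
    cmd = ["hetparse", "-infile", infile]
    if pdb_dir is not None:
        cmd += ["-dogrep", "-dirlistpath", pdb_dir]
    cmd += ["-outfile", outfile, "-auto"]
    return cmd


# With EMBOSS installed, the list could be handed to subprocess, e.g.:
#   import subprocess
#   subprocess.run(build_hetparse_command("het.txt", pdb_dir="./"), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.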
RESIDUE 061 58 CONECT N1 2 N2 C5 CONECT N2 2 N1 N3 CONECT N3 2 N2 N4 CONECT N4 3 N3 C5 HN4 CONECT C5 3 N1 N4 C6 CONECT C6 3 C5 C7 C11 CONECT C7 3 C6 C8 C12 CONECT C8 3 C7 C9 H8 CONECT C9 3 C8 C10 H9 CONECT C10 3 C9 C11 H10 CONECT C11 3 C6 C10 H11 CONECT C12 3 C7 C13 C17 CONECT C13 3 C12 C14 H13 CONECT C14 3 C13 C15 H14 CONECT C15 3 C14 C16 C18 CONECT C16 3 C15 C17 H16 CONECT C17 3 C12 C16 H17 CONECT C18 4 C15 N19 1H18 2H18 CONECT N19 3 C18 C20 C33 CONECT C20 3 N19 C21 N25 CONECT C21 4 C20 C22 1H21 2H21 CONECT C22 4 C21 C23 1H22 2H22 CONECT C23 4 C22 C24 1H23 2H23 CONECT C24 4 C23 1H24 2H24 3H24 CONECT N25 2 C20 C26 CONECT C26 3 N25 C27 C32 CONECT C27 3 C26 C28 H27 CONECT C28 3 C27 C29 H28 CONECT C29 3 C28 O30 C31 CONECT O30 2 C29 HOU CONECT C31 3 C29 C32 H31 CONECT C32 3 C26 C31 C33 CONECT C33 3 N19 C32 O34 CONECT O34 1 C33 CONECT HN4 1 N4 CONECT H8 1 C8 CONECT H9 1 C9 CONECT H10 1 C10 CONECT H11 1 C11 CONECT H13 1 C13 CONECT H14 1 C14 CONECT H16 1 C16 CONECT H17 1 C17 CONECT 1H18 1 C18 CONECT 2H18 1 C18 CONECT 1H21 1 C21 CONECT 2H21 1 C21 CONECT 1H22 1 C22 CONECT 2H22 1 C22 [Part of this file has been deleted for brevity] CONECT 2H6 1 C6 CONECT 1H8 1 C8 CONECT 2H8 1 C8 CONECT 1H9 1 C9 CONECT 2H9 1 C9 END HET 104 28 HETSYN 104 TRIENTINE HETNAM 104 N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE FORMUL 104 C6 H18 N4 RESIDUE 105 32 CONECT B 3 O1 O2 C3 CONECT O1 2 B H1 CONECT O2 2 B H2 CONECT C3 4 B N4 1H3 2H3 CONECT N4 3 C3 C5 H4 CONECT C5 3 N4 O6 C7 CONECT O6 1 C5 CONECT C7 3 C5 C8 C12 CONECT N11 2 O10 C12 CONECT O10 2 N11 C8 CONECT C8 3 C7 O10 C9 CONECT C12 3 C7 N11 C13 CONECT C9 4 C8 1H9 2H9 3H9 CONECT C13 3 C12 C14 C18 CONECT C14 3 C13 C15 CL1 CONECT CL1 1 C14 CONECT C15 3 C14 C16 H15 CONECT C16 3 C15 C17 H16 CONECT C17 3 C16 C18 H17 CONECT C18 3 C13 C17 H18 CONECT H1 1 O1 CONECT H2 1 O2 CONECT 1H3 1 C3 CONECT 2H3 1 C3 CONECT H4 1 N4 CONECT 1H9 1 C9 CONECT 2H9 1 C9 CONECT 3H9 1 C9 CONECT H15 1 C15 CONECT H16 1 C16 CONECT H17 1 C17 CONECT H18 1 C18 END HET 105 32 HETSYN 
105 CLOXACILLIN DERIVATIVE HETNAM 105 N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACID HETNAM 2 105 AMIDE] BORONIC ACID FORMUL 105 C12 H12 N2 O4 B1 CL1 |
Excerpt from heterogen dictionary file (input)
RESIDUE 061 58
CONECT N1 2 N2 C5
CONECT N2 2 N1 N3
CONECT N3 2 N2 N4
CONECT N4 3 N3 C5 HN4
CONECT C5 3 N1 N4 C6
CONECT C6 3 C5 C7 C11
< data omitted for clarity >
END
HET 061 58
HETSYN 061 L-159,061
HETNAM 061 2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-
HETNAM 2 061 YLMETHYL]-3H-QUINAZOLIN-4-ONE
FORMUL 061 C26 H24 N6 O2
**
RESIDUE 072 90
CONECT S1B 2 C1B C2A
CONECT C1A 3 C1B O1A N3A
CONECT C1B 4 S1B C1A C1C H1B
CONECT O1A 1 C1A
CONECT C1C 4 C1B C1D 1H1C 2H1C
< data omitted for clarity >
END
HET 072 90
HETSYN 072 THIAZOLIDINONE; GW0072
HETNAM 072 (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-
HETNAM 2 072 OXO-5-THIAZOLIDINE
FORMUL 072 C37 H46 N2 O4 S1
**
RESIDUE 074 58
CONECT C1 4 C2 1H1 2H1 3H1
CONECT C2 4 C1 C3 1H2 2H2
CONECT C3 4 C2 N1 1H3 2H3
CONECT N1 3 C3 C4 1HN1
CONECT C4 3 N1 O1 C5
CONECT O1 1 C4
< data omitted for clarity >
END
HET 074 58
HETSYN 074 CA-074; [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-
HETSYN 2 074 CARBONYL)-L-ISOLEUCYL-L-PROLINE]
HETNAM 074 [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-
HETNAM 2 074 PROLINE
FORMUL 074 C18 H31 N3 O6
< data omitted for clarity >
ID 105 DE N-[5-METHYL-3-O-TOLYL-ISOXAZOLE-4-CARBOXYLIC ACIDAMIDE] BORONIC ACID SY CLOXACILLIN DERIVATIVE NN 0 // ID 104 DE N,N'-BIS(2-AMINOETHYL)-1,2-ETHANEDIAMINE SY TRIENTINE NN 0 // ID 103 DE 2',5'-DIDEOXY-ADENOSINE 3'-MONOPHOSPHATE SY . NN 0 // ID 102 DE GAMMA-DEOXY-GAMMA-SULFO-GUANOSINE-5'-TRIPHOSPHATE SY . NN 0 // ID 101 DE 2'-DEOXY-ADENOSINE 3'-MONOPHOSPHATE SY . NN 0 // ID 100 DE 1-(5-CHLOROINDOL-3-YL)-3-HYDROXY-3-(2H-TETRAZOL-5-YL)-PROPENONE SY . NN 0 // ID 074 DE [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE SY CA-074; SY [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-PROLINE] NN 0 // ID 072 DE DE (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE SY THIAZOLIDINONE; GW0072 NN 0 // ID 061 DE DE 2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZOLIN-4-ONE SY L-159,061 NN 0 // |
The records used in the output file (below) are as follows:
(1) ID - 3-character abbreviation of heterogen
(2) DE - full description
(3) SY - synonym
(4) NN - number of files in which this heterogen appears
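A downstream script could read these four record types back with a small parser like the one below. It is a sketch under stated assumptions, not an EMBOSS routine: `read_het_records` is a hypothetical name, each line is assumed to start with a two-letter code followed by whitespace, and repeated DE or SY lines are concatenated without a separator because the example output wraps long values mid-word.

```python
def read_het_records(lines):
    """Parse ID/DE/SY/NN records (terminated by '//') into dictionaries."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("//"):
            # End of entry: store it and wait for the next ID line.
            if current is not None:
                records.append(current)
            current = None
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "ID":
            current = {"ID": value, "DE": "", "SY": "", "NN": 0}
        elif current is None:
            continue  # ignore anything that appears before the first ID
        elif code in ("DE", "SY"):
            current[code] += value  # wrapped lines are joined directly
        elif code == "NN":
            current["NN"] = int(value)
    return records
```

A synonym of `.` is kept as-is here; a caller that wants "no synonym" as an empty string would need to map it explicitly.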
Example of hetparse output file
ID 061
DE 2-BUTYL-6-HYDROXY-3-[2'-(1H-TETRAZOL-5-YL)-BIPHENYL-4-YLMETHYL]-3H-QUINAZ
DE OLIN-4-ONE
SY L-159,061
NN 2
//
ID 072
DE (+/-)(2S,5S)-3-(4-(4-CARBOXYPHENYL)BUTYL)-2-HEPTYL-4-OXO-5-THIAZOLIDINE
SY THIAZOLIDINONE; GW0072
NN 10
//
ID 074
DE [PROPYLAMINO-3-HYDROXY-BUTAN-1,4-DIONYL]-ISOLEUCYL-PROLINE
SY CA-074; [N-(L-3-TRANS-PROPYLCARBAMOYL-OXIRANE-2-CARBONYL)-L-ISOLEUCYL-L-P
SY ROLINE]
NN 1
//
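The NN values in the example above are per-heterogen file counts. How hetparse performs its keyword search is not described here, so the following is only one plausible way to obtain such counts: it assumes the files follow the standard PDB layout, in which a `HET` record carries the heterogen ID in columns 8-10, and the function name and the `.ent` default extension are hypothetical.

```python
from pathlib import Path


def count_files_per_heterogen(pdb_dir, het_ids, extension=".ent"):
    """Count, for each heterogen ID, the files in pdb_dir that contain it.

    A file is counted once per heterogen, however many HET records for that
    heterogen it holds.
    """
    counts = {het_id: 0 for het_id in het_ids}
    for path in sorted(Path(pdb_dir).glob(f"*{extension}")):
        seen = set()
        with open(path, errors="replace") as handle:
            for line in handle:
                # 'HET' plus three blanks excludes HETNAM, HETSYN and HETATM.
                if line.startswith("HET   "):
                    seen.add(line[7:10].strip())
        for het_id in seen:
            if het_id in counts:
                counts[het_id] += 1
    return counts
```

Collecting the IDs of each file in a set first is what keeps the result a count of files rather than a count of records.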
Program name | Description |
---|---|
aaindexextract | Extract data from AAINDEX |
allversusall | Does an all-versus-all global alignment for each set of sequences in an input directory and writes files of sequence similarity values |
cathparse | Reads raw CATH classification files and writes DCF file (domain classification file) |
cutgextract | Extract data from CUTG |
domainer | Reads CCF files (clean coordinate files) for proteins and writes CCF files for domains, taken from a DCF file (domain classification file) |
domainnr | Removes redundant domains from a DCF file (domain classification file). The file must contain domain sequence information, which can be added by using DOMAINSEQS |
domainseqs | Adds sequence records to a DCF file (domain classification file) |
domainsse | Adds secondary structure records to a DCF file (domain classification file) |
pdbparse | Parses PDB files and writes CCF files (clean coordinate files) for proteins |
pdbplus | Add residue solvent accessibility and secondary structure data to a CCF file (clean coordinate file) for a protein or domain |
pdbtosp | Convert raw swissprot:PDB equivalence file to EMBL-like format |
printsextract | Extract data from PRINTS |
prosextract | Builds the PROSITE motif database for patmatmotifs to search |
rebaseextract | Extract data from REBASE |
scopparse | Reads raw SCOP classification files and writes a DCF file (domain classification file) |
seqnr | Removes redundancy from DHF files (domain hits files) or other files of sequences |
sites | Reads CCF files (clean coordinate files) and writes CON files (contact files) of residue-ligand contact data for domains in a DCF file (domain classification file) |
ssematch | Searches a DCF file (domain classification file) for secondary structure matches |
tfextract | Extract data from TRANSFAC |
funky uses the hetparse output file as input.
Waqas Awan (wawan © hgmp.mrc.ac.uk)
HGMP-RC, Genome Campus, Hinxton, Cambridge CB10 1SB, UK