Fgenesh_smo2 and genescan_smo2 gene predictions

Aligned the two gene predictions using Multalign

The two sequences are very similar for the most part. The genescan gene model has three stretches of amino acids (AA) that are not present in the fgenesh gene model (AA 502-523, 546-555, and 577-600). Two small sections (AA 524-545 and 556-567) have low sequence similarity between the two gene prediction models.
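The stretches present in one model but absent from the other can also be read off a gapped pairwise alignment programmatically. The following is a minimal sketch, assuming the two aligned sequences are available as equal-length strings with "-" as the gap character; the function name and the toy sequences are hypothetical and are not the smo2 predictions.

```python
def unmatched_stretches(aligned_a, aligned_b, gap="-"):
    """Return 1-based (start, end) residue ranges of sequence A
    that sit opposite gaps in sequence B."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    stretches = []
    pos = 0        # residue counter for A in ungapped coordinates
    start = None   # start of the stretch currently being extended
    for a, b in zip(aligned_a, aligned_b):
        if a != gap:
            pos += 1
        in_stretch = (a != gap and b == gap)
        if in_stretch and start is None:
            start = pos
        elif not in_stretch and start is not None:
            # If this column holds a residue of A, the stretch ended one residue earlier.
            end = pos - 1 if a != gap else pos
            stretches.append((start, end))
            start = None
    if start is not None:
        stretches.append((start, pos))
    return stretches


# Toy example (made-up sequences): residues 4-6 of A have no partner in B.
toy_a = "MKTAYIAKQR"
toy_b = "MKT---AKQR"
toy_result = unmatched_stretches(toy_a, toy_b)
print(toy_result)
```

Applied to the genescan/fgenesh alignment, a function like this would be expected to report ranges comparable to the three stretches listed above, provided the alignment export uses the same gap character.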

NCBI Blastp genescan_smo2

Putative conserved domains were detected. [Image: NCBI conserved-domain graphic]

Best Blast hits

emb|CAO21886.1| unnamed protein product [Vitis vinifera]

/note="SPRY domain. SPRY Domain is named from SPla and the RYanodine Receptor. Domain of unknown function. Distant homologues are domains in butyrophilin/marenostrin/pyrin homologues; cl02614"

Length=389

Score = 322 bits (825), Expect = 7e-86, Method: Compositional matrix adjust.

Identities = 161/271 (59%), Positives = 197/271 (72%), Gaps = 4/271 (1%)

Query 131 NTNVWSKSTTRKSSKKGGSKTQAKALENENSVYLNPVVP-KIEDGPDLPVLLSKFQKAEK 189

+ NVW+KST+RK KK + E+ + P P K +D PD+ + LSK KAEK

Sbjct 104 SNNVWTKSTSRKGKKKAKATANNAPTEDPVLITPVPRFPDKNDDAPDMKICLSKIYKAEK 163

Query 190 VELSADQLSAGSIKGYRMVRATRGVVEGAWYFEITVEHLGKTGHTRLGWCTQKGDVQAPV 249

VELS D++ A S KGYRMVRATRGVVEGAWYFEI V LG+TGHTRLGW T+KGD+QAPV

Sbjct 164 VELSDDRMRAASGKGYRMVRATRGVVEGAWYFEIRVLKLGETGHTRLGWSTEKGDLQAPV 223

Query 250 GYDSHGYGYRDLEGSKVHAALREPY-GEAYVEGDTIGFYINLPNGAALAPKPPEIVSFKG 308

GYD++ +GYRD++G+KVH ALRE Y GE YVEGD IGFYINLP+GA APKPP +V +KG

Sbjct 224 GYDANSFGYRDIDGTKVHKALRETYGGEGYVEGDVIGFYINLPDGAMYAPKPPHLVWYKG 283

Query 309 LPYTA--ETKEEPLKPLPGGEIVFFRNGVYQGCAYKDIYAGRYFPAASMYTLPNEPNCTV 366

Y +TKE+P K +PG EI FF+NGV QG A+KD+ GRY+PAASMYTLPN+P+C V

Sbjct 284 QRYVCATDTKEDPPKVVPGSEISFFKNGVCQGVAFKDLCGGRYYPAASMYTLPNQPHCVV 343

Query 367 RFNFGPDFAFPITDWKDHTPPQPMSAAPFAG 397

+FNFGPDF F + P+PM P+ G

Sbjct 344 KFNFGPDFEFFPEELNGRPVPRPMIEVPYHG 374

ref|NP_175556.1|SPla/RYanodine receptor (SPRY) domain-containing protein [Arabidopsis

thaliana]

gb|AAG52633.1|AC024261_20 unknown protein; 66348-64527 [Arabidopsis thaliana]

gb|AAY56421.1| At1g51450 [Arabidopsis thaliana]

dbj|BAF01046.1| hypothetical protein [Arabidopsis thaliana]

Length=509

GENE ID: 841570 AT1G51450 | SPla/RYanodine receptor (SPRY) domain-containing

protein [Arabidopsis thaliana]

Score = 314 bits (805), Expect = 1e-83, Method: Compositional matrix adjust.

Identities = 157/271 (57%), Positives = 189/271 (69%), Gaps = 9/271 (3%)

Query 134 VWSKSTTRKSSKKGGSKTQAKALENENSVYLNPVVPKI----EDGPDLPVLLSKFQKAEK 189

VW +TRK KK + T A E+ V + PV P+ +D PDL + LSK KAEK

Sbjct 226 VWVTKSTRKGKKKSKANTPNPAA-VEDKVLITPV-PRFPDKGDDTPDLEICLSKVYKAEK 283

Query 190 VELSADQLSAGSIKGYRMVRATRGVVEGAWYFEITVEHLGKTGHTRLGWCTQKGDVQAPV 249

VE+S D+L+AGS KGYRMVRATRGVVEGAWYFEI V LG+TGHTRLGW T KGD+QAPV

Sbjct 284 VEISEDRLTAGSSKGYRMVRATRGVVEGAWYFEIKVLSLGETGHTRLGWSTDKGDLQAPV 343

Query 250 GYDSHGYGYRDLEGSKVHAALREPYG-EAYVEGDTIGFYINLPNGAALAPKPPEIVSFKG 308

GYD + +G+RD++G K+H ALRE Y E Y EGD IGFYINLP+G + APKPP V +KG

Sbjct 344 GYDGNSFGFRDIDGCKIHKALRETYAEEGYKEGDVIGFYINLPDGESFAPKPPHYVFYKG 403

Query 309 LPYTA--ETKEEPLKPLPGGEIVFFRNGVYQGCAYKDIYAGRYFPAASMYTLPNEPNCTV 366

Y + KEEP K +PG EI FF+NGV QG A+ DI GRY+PAASMYTLP++ NC V

Sbjct 404 QRYICAPDAKEEPPKVVPGSEISFFKNGVCQGAAFTDIVGGRYYPAASMYTLPDQSNCLV 463

Query 367 RFNFGPDFAFPITDWKDHTPPQPMSAAPFAG 397

+FNFGP F F D+ P+PM P+ G

Sbjct 464 KFNFGPSFEFFPEDFGGRATPRPMWEVPYHG 494

ref|NP_569028.1| unknown protein [Arabidopsis thaliana]

dbj|BAB10414.1| unnamed protein product [Arabidopsis thaliana]

gb|AAM61578.1| unknown [Arabidopsis thaliana]

Length=210

GENE ID: 836741 AT5G66090 | hypothetical protein [Arabidopsis thaliana]

Score = 197 bits (502), Expect = 2e-48, Method: Compositional matrix adjust.

Identities = 88/129 (68%), Positives = 112/129 (86%), Gaps = 2/129 (1%)

Query 538 QHVLLPIIDKNPYLSDSTRQAAATATSLAKKYGAKITVVVIDEEKKEK--DYEQRLQTIR 595

+H+LLP+ID+NPYLS+ TRQAAAT TSLAKKYGA ITVVVIDEEK+E ++E ++ IR

Sbjct 82 KHLLLPVIDRNPYLSEGTRQAAATTTSLAKKYGADITVVVIDEEKRESSSEHETQVSNIR 141

Query 596 WHLEEGGIQDYGMLEKIGEGKKAAVVIGEVADDMGLDLVVLSMECIHSKHIDGNLLAEFV 655

WHL EGG +++ +LE++GEGKKA +IGEVAD++ ++LVV+SME IHSK+ID NLLAEF+

Sbjct 142 WHLSEGGFEEFKLLERLGEGKKATAIIGEVADELKMELVVMSMEAIHSKYIDANLLAEFI 201

Query 656 PCPVLLLPL 664

PCPVLLLPL

Sbjct 202 PCPVLLLPL 210

Blastp using TAIR database

Sequences producing significant alignments:                        Score (bits)   E value

ref|NP_175556.1| SPla/RYanodine receptor (SPRY) domain-cont...     285            7e-77
ref|NP_569028.1| unknown protein [Arabidopsis thaliana]            187            2e-47
ref|NP_850020.1| zinc finger (C3HC4-type RING finger) famil...     54             4e-07
ref|NP_192672.2| SPla/RYanodine receptor (SPRY) domain-cont...     38             0.021
ref|NP_973963.1| SPla/RYanodine receptor (SPRY) domain-cont...     37             0.055
ref|NP_174777.2| SPla/RYanodine receptor (SPRY) domain-cont...     37             0.058
ref|NP_176888.2| unknown protein [Arabidopsis thaliana]            31             2.3
ref|NP_190612.3| unknown protein [Arabidopsis thaliana]            30             4.7
ref|NP_174172.1| unknown protein [Arabidopsis thaliana]            29             8.4
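To separate the strong hits in the TAIR list from the marginal ones, the table can be filtered by E-value. This is only a sketch: the accessions, bit scores, and E-values are copied from the list above, but the cutoff of 1e-5 is an assumption chosen for illustration and is not a threshold used in these notes.

```python
# (accession, bit score, E-value) as reported in the TAIR Blastp hit list.
tair_hits = [
    ("NP_175556.1", 285, 7e-77),
    ("NP_569028.1", 187, 2e-47),
    ("NP_850020.1", 54, 4e-07),
    ("NP_192672.2", 38, 0.021),
    ("NP_973963.1", 37, 0.055),
    ("NP_174777.2", 37, 0.058),
    ("NP_176888.2", 31, 2.3),
    ("NP_190612.3", 30, 4.7),
    ("NP_174172.1", 29, 8.4),
]


def significant_hits(hits, max_evalue=1e-5):
    """Return the accessions whose E-value is at or below max_evalue.

    max_evalue is an assumed, illustrative cutoff.
    """
    return [acc for acc, _bits, evalue in hits if evalue <= max_evalue]


strong = significant_hits(tair_hits)
print(strong)
```

With this assumed cutoff, only the first three entries pass. The first two are the proteins examined in detail in the alignments shown in these notes.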

ref|NP_175556.1|SPla/RYanodine receptor (SPRY) domain-containing protein

[Arabidopsis thaliana]

Length = 509

Score = 285 bits (729), Expect = 7e-77, Method: Composition-based stats.

Identities = 148/245 (60%), Positives = 177/245 (72%), Gaps = 6/245 (2%)

Query: 159 ENSVYLNPVV---PKIEDGPDLPVLLSKFQKAEKVELSADQLSAGSIKGYRMVRATRGVV 215

E+ V + PV K +D PDL + LSK KAEKVE+S D+L+AGS KGYRMVRATRGVV

Sbjct: 250 EDKVLITPVPRFPDKGDDTPDLEICLSKVYKAEKVEISEDRLTAGSSKGYRMVRATRGVV 309

Query: 216 EGAWYFEITVEHLGKTGHTRLGWCTQKGDVQAPVGYDSHGYGYRDLEGSKVHAALREPYG 275

EGAWYFEI V LG+TGHTRLGW T KGD+QAPVGYD + +G+RD++G K+H ALRE Y

Sbjct: 310 EGAWYFEIKVLSLGETGHTRLGWSTDKGDLQAPVGYDGNSFGFRDIDGCKIHKALRETYA 369

Query: 276 -EAYVEGDTIGFYINLPNGAALAPKPPEIVSFKGLPY--TAETKEEPLKPLPGGEIVFFR 332

E Y EGD IGFYINLP+G + APKPP V +KG Y + KEEP K +PG EI FF+

Sbjct: 370 EEGYKEGDVIGFYINLPDGESFAPKPPHYVFYKGQRYICAPDAKEEPPKVVPGSEISFFK 429

Query: 333 NGVYQGCAYKDIYAGRYFPAASMYTLPNEPNCTVRFNFGPDFAFPITDWKDHTPPQPMSA 392

NGV QG A+ DI GRY+PAASMYTLP++ NC V+FNFGP F F D+ P+PM

Sbjct: 430 NGVCQGAAFTDIVGGRYYPAASMYTLPDQSNCLVKFNFGPSFEFFPEDFGGRATPRPMWE 489

Query: 393 APFAG 397

P+ G

Sbjct: 490 VPYHG 494

ref|NP_569028.1| unknown protein [Arabidopsis thaliana]

Length = 210

Score = 187 bits (475), Expect = 2e-47, Method: Composition-based stats.

Identities = 88/129 (68%), Positives = 112/129 (86%), Gaps = 2/129 (1%)

Query: 538 QHVLLPIIDKNPYLSDSTRQAAATATSLAKKYGAKITVVVIDEEKKEK--DYEQRLQTIR 595

+H+LLP+ID+NPYLS+ TRQAAAT TSLAKKYGA ITVVVIDEEK+E ++E ++ IR

Sbjct: 82 KHLLLPVIDRNPYLSEGTRQAAATTTSLAKKYGADITVVVIDEEKRESSSEHETQVSNIR 141

Query: 596 WHLEEGGIQDYGMLEKIGEGKKAAVVIGEVADDMGLDLVVLSMECIHSKHIDGNLLAEFV 655

WHL EGG +++ +LE++GEGKKA +IGEVAD++ ++LVV+SME IHSK+ID NLLAEF+

Sbjct: 142 WHLSEGGFEEFKLLERLGEGKKATAIIGEVADELKMELVVMSMEAIHSKYIDANLLAEFI 201

Query: 656 PCPVLLLPL 664

PCPVLLLPL

Sbjct: 202 PCPVLLLPL 210

Multalign to Arabidopsis SPRY protein

The Blast results show that the known proteins match the gene prediction model over only about 200 AA. Most of the hits are SPla/RYanodine receptor (SPRY) domain-containing proteins. When the model is aligned to the Arabidopsis SPRY protein, sequence similarity is good in certain regions and low across the rest of the protein. The first 85 AA and the last 250 AA do not align to the known Arabidopsis protein.

Multalign to Vitis vinifera SPRY protein

This is another SPRY protein hit that turned up in Blastp. This alignment looks better than the Arabidopsis alignment. The first 20 AA and the last 250 AA do not align to the known protein. One section of the genescan gene prediction (AA 61-68) does not appear in the Vitis model. This could mean that the genescan model mispredicts the exon-intron structure at this position: either it includes an extra exon, or one of its intron boundaries is misplaced so that an intron starts or ends at the wrong position.

Multalign to Arabidopsis unknown protein

This unknown protein also showed up in the Blast results. It has fairly good sequence similarity to the genescan gene model, particularly at the end of the sequence. The notes section of the RefSeq record for the unknown protein reads:

/note="similar to Os09g0541700 [Oryza sativa (japonica cultivar-group)] (GB:NP_001063816.1); similar to unnamed protein product [Vitis vinifera] (GB:CAO23263.1); contains domain Adenine nucleotide alpha hydrolases-like (SSF52402)"

My conclusion is that the genescan_smo2 gene model actually comprises two genes: a SPRY-type gene and a gene encoding a protein of unknown function.
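One way to check this reading is to merge the query coordinates of the Blastp alignments and see whether they fall into separate blocks. The sketch below uses the query ranges reported in the NCBI alignments above (131-397, 134-397, and 538-664); the function and variable names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into sorted, disjoint blocks."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Query (genescan_smo2) coordinates of each NCBI Blastp alignment.
query_ranges = {
    "CAO21886.1 (Vitis vinifera, SPRY)": (131, 397),
    "NP_175556.1 (Arabidopsis, SPRY)": (134, 397),
    "NP_569028.1 (Arabidopsis, unknown protein)": (538, 664),
}

blocks = merge_intervals(query_ranges.values())
print(blocks)  # [(131, 397), (538, 664)]
```

The two SPRY hits collapse into a single block, and the unknown-protein hit forms a second block that does not overlap the first. That pattern is consistent with the model being a fusion of two genes, although it does not prove it.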