BIO 102 Renn_Lab#1 (in-lab handout: answer-key) Name ______

Answers to Scoring in Scrabble (English Word Play)

(1a) All words that contain ‘ghi’

regex = ‘ghi’

(1) coughing

(2) laughing

(3) laughingly

(4) laughingstock

(5) outweighing

(6) sighing

(7) weighing

(8) weighings

(1b) All words with ‘yz’ that are not immediately followed by an ‘e’ or an ‘i’

regex = ‘yz[^ei]’

(1) analyzable

(2) Byzantine

(3) Byzantinize

(4) Byzantinizes

(5) Byzantium

(6) unanalyzable
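
A quick way to see what ‘yz[^ei]’ does is to try it on a few words in Python. This is only a sketch: the word list below is made up for illustration and is not the lab's word file. Note that [^ei] still has to match one letter, so a word that ends in ‘yz’ would not be found by this regex.

```python
import re

# Hypothetical sample; the lab's actual word list is not reproduced here.
words = ["analyzable", "analyze", "analyzing", "Byzantine", "Byzantium"]

# 'yz' followed by any single character other than 'e' or 'i'
pattern = re.compile(r"yz[^ei]")

matches = [w for w in words if pattern.search(w)]
print(matches)
```

‘analyze’ and ‘analyzing’ drop out because the letter after ‘yz’ is an ‘e’ or an ‘i’.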

(1c) All words that contain “fin” or “phin”

regex = ‘fin|phin’ or if you want to be fancy ‘(f|ph)in’

(1) affinities

(2) affinity

(3) autographing

(4) Baffin

(5) beefing

(6) bluffing

(7) briefing

(8) briefings

(9) burglarproofing

(155) sphinx … 165 total
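
The two ways of writing the 1c regex should find exactly the same words. A small Python sketch of that check, assuming a made-up sample list (not the lab's word file):

```python
import re

# Hypothetical sample words, not the lab's word list.
words = ["affinity", "sphinx", "briefing", "dolphin", "fine", "graph"]

plain = re.compile(r"fin|phin")      # either alternative, written out in full
grouped = re.compile(r"(f|ph)in")    # shared 'in' factored out of the group

plain_hits = [w for w in words if plain.search(w)]
grouped_hits = [w for w in words if grouped.search(w)]

print(plain_hits)
print(plain_hits == grouped_hits)
```

The grouped form is shorter to extend if more spellings of the first sound are added later.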

Answers to Scrabble in DNA Land

(2a) All 7-mers that contain GTGAC

regex = ‘GTGAC’

TGTGACG

GTGACGT

GTGACGG

AGTGACC

GTGACCA

GTGACAG … 48 total

(2b) All 7-mers that contain the dimer GC immediately followed by a letter that is not a G or a C.

regex = ‘GC[^GC]’

GCTCAAT

TTGCTCA

TGGCACA

AAGGGCT

TGAAGCT … 2512 total

Answers to Direct Repeats (English Word Play)

(3a) Are there any words in which a sequence of two letters is repeated at least three times in the word?

regex = ‘(.)(.).*\1\2.*\1\2’

(1) anticompetitive

(2) antidisestablishmentarianism

(3) confrontation

(4) confrontations

(5) contentment

(6) enlightenment

(7) inclining

(8) indoctrinating

(9) infringing

(10) insinuating …23 total
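
To see how the backreferences in the 3a regex work, the sketch below reports which two-letter pair was captured. It is written in Python as an illustration; the four test words are chosen here and only the first two come from the answer list.

```python
import re

# \1 and \2 refer back to whatever the first and second (.) captured,
# so the same two-letter pair must appear three times in total.
pattern = re.compile(r"(.)(.).*\1\2.*\1\2")

results = {}
for word in ["contentment", "inclining", "banana", "content"]:
    m = pattern.search(word)
    # Record the repeated pair, or None when the word does not match.
    results[word] = m.group(1) + m.group(2) if m else None

print(results)
```

‘banana’ and ‘content’ fail because each of their pairs occurs only twice without overlapping.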

Answers to Direct Repeats in DNA Land

(4a) Are there any motifs in which a sequence of two nucleotides is repeated at least three times in the motif?

regex = ‘(.)(.).*\1\2.*\1\2’

AAAAAAA

AAAAAAC

AAAAAAG

AAAAAAT

AAAACAA

AAAAGAA

AAAATAA

AACAAAA

AACACAC

AAGAAAA

AAGAGAG

AATAAAA

AATATAT

ACAACAC

ACACAAC … 232 total

Answers for Mirror Repeats, also called palindromes in English (English Word Play)

(5a) Are there any words in which three consecutive letters anywhere in a word are followed by the reverse of those letters anywhere in the word?

(Hint: you do not need ^ (start of word) and $ (end of word) for this).

regex = ‘(.)(.)(.).*\3\2\1’

(1) addresser

(2) addressers

(3) amalgamate

(4) amalgamated

(5) amalgamates

(6) amalgamating

(7) amalgamation

(8) analyticity

(9) assertiveness

(10) assesses… 117 total
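
The 5a regex can be checked the same way. The Python sketch below prints the three letters that were captured and later found reversed; ‘banana’ is added here as a non-matching example and is not from the answer list.

```python
import re

# Three captured letters, then (anywhere later) the same letters reversed.
pattern = re.compile(r"(.)(.)(.).*\3\2\1")

triples = {}
for word in ["addresser", "assesses", "banana"]:
    m = pattern.search(word)
    # Record the captured three letters, or None when there is no match.
    triples[word] = "".join(m.groups()) if m else None

print(triples)
```

In ‘addresser’ the captured letters are ‘res’, which reappear reversed as ‘ser’ at the end of the word. The reversed copy cannot overlap the captured letters, which is why ‘banana’ does not match.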

(5b) Are there any words in which a pair of letters is followed by a direct repeat of that pair, followed by two occurrences of the reverse of the pair?

(Each of the pairs can be zero or more letters apart). e.g., “AB...AB...BA...BA”

regex = ‘(.)(.).*\1\2.*\2\1.*\2\1’

senselessness

(5c) Are there any words in which this pattern occurs twice: a pair of letters followed by the reverse of that pair? (Each of the pairs can be zero or more letters apart.)

regex = ‘(.)(.).*\2\1.*\1\2.*\2\1’

(1) kinnikinnick (Look for this native plant at Eagle Fern Park during David Dalton’s field trip.)

(2) possessionlessness

(3) Wallawalla

Answers for Mirror Repeats in DNA Land

(6a) Are there any motifs in which three consecutive nucleotides anywhere in a motif are followed by the reverse of those nucleotides anywhere in the motif?

(Hint: you do not need ^ (start of motif) and $ (end of motif) for this).

regex = ‘(.)(.)(.).*\3\2\1’

CTTTTTT

AAAAAAA

ATGTGTA

GTGAAGT

CAAAAAA

ATAATAA

AAAAAAA

TTTTTTT

ATTTTTT

AAGTGAA … 702 total

(6b) Are there any motifs in which a pair of adjacent nucleotides is followed by a direct repeat of that pair, followed by two occurrences of the reverse of the pair? (Each of the pairs can be zero or more nucleotides apart.) e.g., “GT...GT...TG...TG”

regex = ‘(.)(.).*\1\2.*\2\1.*\2\1’

There are none, because this regex needs four separate pairs of letters and so can only match a motif at least eight (8) letters long, but all our motifs are 7-mers! If there had been 8-mers on the list, a match could have been: ACACCACA.

(6c) Are there any motifs in which this pattern occurs twice: a pair of adjacent nucleotides followed by the reverse of that pair? (Each of the pairs can be zero or more nucleotides apart.)

regex = ‘(.)(.).*\2\1.*\1\2.*\2\1’

There are none, because this regex also needs a motif at least eight (8) letters long, but all our motifs are 7-mers!
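
The "at least eight letters" argument for 6b and 6c can be checked by brute force. The Python sketch below tries both regexes on every possible 7-mer (not just the ones on the lab list) and then on one 8-mer each. ACACCACA is the example given for 6b; ACCAACCA is an 8-mer made up here for 6c.

```python
import re
from itertools import product

patterns = {
    "6b": re.compile(r"(.)(.).*\1\2.*\2\1.*\2\1"),
    "6c": re.compile(r"(.)(.).*\2\1.*\1\2.*\2\1"),
}

# Every possible 7-mer over the DNA alphabet (4**7 = 16384 strings).
all_7mers = ["".join(p) for p in product("ACGT", repeat=7)]

# Count how many 7-mers each regex matches.
counts = {
    name: sum(1 for kmer in all_7mers if pat.search(kmer))
    for name, pat in patterns.items()
}
print(counts)

# Each regex does match a suitable 8-mer.
match_6b = bool(patterns["6b"].search("ACACCACA"))
match_6c = bool(patterns["6c"].search("ACCAACCA"))
print(match_6b, match_6c)
```

Each regex has four two-letter pieces that cannot overlap, so 2 + 2 + 2 + 2 = 8 letters is the shortest possible match.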

Answers for Pattern Matching in Large Texts.

(9a) How many times do you think Darwin used the word “evolution” in his text?

I thought a lot, but I was wrong. Keep working.

(9b) Write a regex to find the correct answer.

regex = ‘evolution’ 12 times {(15) but 2, 12, 13 are missing because the word occurs 2 times in that paragraph.}

(See 9e: it really should be ‘\bevolution’, 10 times.)
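
The difference between counting matching lines (or paragraphs) and counting every occurrence can be sketched in Python. The three sample lines are invented for illustration and are not Darwin's text.

```python
import re

# Hypothetical sample lines, not Darwin's text.
lines = [
    "the evolution of species and the evolution of ideas",
    "no match on this line",
    "a theory of evolution",
]

pattern = re.compile(r"evolution")

# A line-by-line search reports each matching line once...
matching_lines = sum(1 for line in lines if pattern.search(line))

# ...while findall counts every occurrence, including repeats on one line.
total_occurrences = sum(len(pattern.findall(line)) for line in lines)

print(matching_lines, total_occurrences)
```

Here two lines match but the word occurs three times, because the first line contains it twice.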

(9c) Write a regex that will find any term related to evolution.

regex = ‘evol.*’

(9d) How many are there?

26 times (includes 1 in the references) {(35) but several are missing because the word occurs 2 times in that paragraph}

(9e) Look carefully. Did you get the words you weren’t expecting (“revolved” “revolution” )? YES

(9f) Write a regex that will avoid these words? (Hint \s means “white space”)

regex = ‘\bevol.*’ or ‘\sevol.*’
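
The two answers to 9f do not behave identically: \b matches at any word boundary, while \s needs an actual white-space character in front of the word. A Python sketch, using invented sample lines rather than Darwin's text:

```python
import re

# Hypothetical sample lines, not Darwin's text.
lines = [
    "evolved forms appear first on this line",
    "the planets revolved around the sun",
    "a slow (evolutionary) change",
    "species have evolved",
]

patterns = {"plain": r"evol.*", "boundary": r"\bevol.*", "space": r"\sevol.*"}

# For each regex, list the indexes of the lines it matches.
hits = {
    name: [i for i, line in enumerate(lines) if re.search(p, line)]
    for name, p in patterns.items()
}
print(hits)
```

The plain regex also catches ‘revolved’. The \s version avoids it but misses the word at the start of a line and after the opening parenthesis, so \b is the safer choice.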

Pattern Matching for Protein Sequences

(10a) Write a regex to find how many proteins in E. coli include the protein code GENE.

regex = ‘GENE’ 9

(10b) Write a regex to find out how many proteins in E. coli include a string of at least 10 amino acids made up of the letters of DARWIN in any order.

regex = ‘[DARWIN]{10,}’ 9
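
A short Python sketch of what ‘[DARWIN]{10,}’ accepts. The three protein sequences are invented for illustration and are not from the E. coli file.

```python
import re

# Hypothetical sequences, not real E. coli proteins.
proteins = {
    "p1": "MKDARWINDARWINK",        # run of 12 letters from the set
    "p2": "MDARWINKDARWIN",         # two runs of 6, broken by a K
    "p3": "M" + "A" * 10 + "K",     # ten A's in a row
}

# Ten or more consecutive letters, each one of D, A, R, W, I or N.
pattern = re.compile(r"[DARWIN]{10,}")

found = [name for name, seq in proteins.items() if pattern.search(seq)]
print(found)
```

The character class does not require all six letters to appear, so a run of ten A's counts as a match.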

(10c) Write a regex to find proteins in E. coli that include … (make up your own)

(10d) Is your name part of the E. coli proteome? (If you don’t know what the word

None of the 20 standard amino acids has the one-letter code Z or U, so SUZY doesn’t exist, but SVSY is there 10 times.

(10e) Write a regex for an amino acid pattern that is likely to form a long (10 amino acid) alpha helix secondary structure.

regex = ‘P[MALEK]{10,}’ 4

Answers for Pattern Matching for Protein Sequences in DNA

(11a) Write a regular expression that will search the genomic sequence for the possible amino acid sequence GENE.

regex = GG[AGCT]GA[AG]AA[CT]GA[AG] 20

(11b) Do you find this sequence the same number of times in the DNA as you do when you search the Protein file for GENE? No, it was found one extra time.
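
One way to build a DNA regex for a short peptide is to write a small pattern per amino acid and join them. The Python sketch below assumes the standard genetic code, in which glycine (G) is GGN, glutamate (E) is GAA or GAG, and asparagine (N) is AAC or AAT. The function name and the test sequences are made up for illustration.

```python
import re

# Codon patterns for the three residues in GENE (standard genetic code).
codon_regex = {
    "G": "GG[ACGT]",  # Gly: GGA, GGC, GGG, GGT
    "E": "GA[AG]",    # Glu: GAA, GAG
    "N": "AA[CT]",    # Asn: AAC, AAT
}

def peptide_to_dna_regex(peptide):
    """Join the per-residue codon patterns into one DNA regex."""
    return "".join(codon_regex[aa] for aa in peptide)

pattern_text = peptide_to_dna_regex("GENE")
print(pattern_text)

pattern = re.compile(pattern_text)
# GGT GAG AAC GAA codes for G E N E; GGT GAT AAC GAA codes for G D N E.
hit = bool(pattern.search("TTGGTGAGAACGAATT"))
miss = bool(pattern.search("GGTGATAACGAA"))
print(hit, miss)
```

The second test sequence is rejected because GAT codes for aspartate (D), not glutamate.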

If not, what is the reason?

Searching DNA does not require the sequence to be in the correct “reading frame”.

(Note: because this is a simple Perl program we are really only searching by line. Any pattern that spans a “return”, and is therefore broken up onto two lines, will not be found by this script. There are easy ways to fix this.)
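
One easy fix for the line-break problem is to join the lines before searching. The sketch below shows the idea in Python rather than in the lab's Perl script, using a made-up sequence wrapped over two lines and the GTGAC motif from 2a.

```python
import re

# Hypothetical sequence wrapped over two lines, as in a sequence file.
wrapped = "AAGT\nGACC\n"

pattern = re.compile(r"GTGAC")

# Searching line by line misses a motif that straddles the line break.
per_line = [bool(pattern.search(line)) for line in wrapped.splitlines()]

# Removing the line breaks first lets the same regex find it.
joined = wrapped.replace("\n", "")
m = pattern.search(joined)
joined_start = m.start() if m else None

print(per_line, joined, joined_start)
```

Joining works well for a single sequence. A file with many sequences would need each record joined separately so that matches do not run from one sequence into the next.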

(11c) Write a regular expression to find a Lysine rich region.

regex = (AA[AG]){3,}

A Lysine rich region could also be a Lysine at every other amino acid (KNKNKN):

regex = (AA[AG].{3}){4,} where the .{3} matches the intervening codon.

To allow some but not all amino acids to occupy this “lysine rich region” we would have to write the regex for each one allowed and separate them with the OR symbol | called “pipe”.
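
Both lysine regexes can be tried out on short test sequences. The Python sketch below uses two DNA strings invented for illustration: one coding for KKKK and one coding for KNKNKNKN.

```python
import re

run = re.compile(r"(AA[AG]){3,}")             # three or more Lys codons in a row
alternating = re.compile(r"(AA[AG].{3}){4,}")  # Lys, any codon, repeated 4+ times

# Hypothetical test sequences.
dna_run = "GCT" + "AAAAAGAAAAAG" + "GCT"   # Ala, Lys Lys Lys Lys, Ala
dna_alt = "AAAAACAAGAATAAAAACAAGAAT"        # Lys Asn repeated four times

run_hit = run.search(dna_run)
alt_hit = alternating.search(dna_alt)

print(run_hit.group(0))
print(alt_hit.group(0))
```

As with the GENE search, neither regex knows about reading frame, so a hit still has to be checked against the actual coding frame.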

(11d) Why might a researcher be interested in looking for secondary structure in a given DNA sequence rather than looking directly at an amino acid sequence?

This is one way to find VERY ancient homology of genes. The DNA code may be very different, and even a predicted protein code may be highly diverged but the secondary structure of two gene products may be similar.

(11e) Why might a researcher want to write their own program rather than using a web-based tool?

See the bottom of the Protein matching page
