Dear Dr. Majovski,

Please find our revision notes and amended manuscript entitled "The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group".

We modified the manuscript according to the reviewers’ comments and accepted all their points. For example, we changed the name of the first Korean genome from KOREF to SJK (for Seong-Jin Kim).

Regards,

Jong Bhak,

Reviewer 1.

Reviewer 2.

1. The last sentence of paragraph 1 of the revised manuscript is unclear and potentially misleading. It reads "Comparing the two genome scale variations in relation to Caucasian and Yoruba[n] genomes has given us insight about how distinct they are from each other".

It is important to make it clear (at this same point in the text) that the differences you find are among individuals, and may or may not reflect differences among ethnicities.

 We modified the sentence to make it clear that our comparisons reflect only individual genome differences, not ethnic differences.

… “Comparing the two genome scale variations in relation to other already known individual genomes has given us insight about how distinct they are from each other.” …

2. Since KOREF is not a true reference yet, and won’t be used for mapping new Korean individuals’ genomes, the name KOREF should be removed not just from the title but from the entire manuscript.

 We changed KOREF to SJK, the initials of the genome donor (Seong-Jin Kim).

Reviewer 3.

1. Page 4- grammatical error after NCBI human reference genome.an

 Done

2. Page 5- the mention of paired-end libraries of 100, 200 and 300 bases is not consistent with the methods section, which states 200, 300 and 400 bases. Please clarify.

 Thanks. We corrected the notation so that it consistently refers to the span size (100, 200, and 300) rather than the gel band size (200, 300, and 400).

We changed the relevant sentence to: … “The PE adaptor-ligated products were separated on a 2% agarose gel and excised from the gel at approximate span size range positions (100 bp, 200 bp, and 300 bp).” …

3. Page 5- Mapping depth vs sequencing depth. I understand sequencing depth, but mapping depth is a new term. Perhaps define it so that it is not confused with "clone coverage", which is a more common term. Clone coverage reflects the coverage of all DNA in the library: not just the pieces of the molecules that are sequenced, but also the span between the two reads.

 Thanks. We changed it to “average depth”, the term used for the YH and NA18507 genome sequencing, to avoid introducing a new term.

4. The 5.7% of reads that are unmapped need to be described in terms of quality. Are these just poor-quality reads or contamination?

 Good point. It is quite possible that many unmapped reads are due to either poor quality or contamination. We therefore modified our manuscript as described below.

The discussion continues to hang on this 5.7% of reads not mapping as being ethnically meaningful, but you show on page 6 that half of these reads are filtered out with a quality screen, of which 2.2% (539,873 reads out of 1.2B reads) actually de novo assemble in 28,696 contigs. I think the emphasis on unmapped reads to drive the ethnic argument is still a bit weak, given how few of these reads assemble. A good portion could just be contamination. Perhaps if there is anything interesting in those 57 RefSeq proteins it would help to drive the story, i.e., are there certain gene classes represented?

 We agree. We modified the manuscript, dropping the suggestion that unmapped reads can provide some room for ethnic specificity.

 We changed the relevant sentence to: … “This does not mean that 5.77% (reduced from 5.97%) of the SJK sequencing reads account for ethnic or individual uniqueness, as there could be other factors affecting the mapping such as sequencing error or contamination (Supplemental Table 1).” …

5. On page 6.

This isn’t the analysis Wheeler performed. Knowing that 99% of triallelic SNPs are false positives, he counted how many of these he saw in the genome as a proxy for his false positives. I still think the false positives in this study are very poorly understood outside of the areas targeted by chips. Are you suggesting 100% of triallelic SNPs are real?

This would be a bold claim, and it’s best to qualify how many triallelics you saw, how many you validated, and whether you are really seeing this many triallelics or have collapsed CNVs.

How did you select 21 triallelic SNPs? 21 out of how many that exist in your data?

 We are sorry for the confusion. Our explanation used the term “triallelic” mistakenly. Our “triallelic” SNPs are in fact novel SJK-specific homozygous alleles (alleles that differ from dbSNP’s alleles). They are not triallelic within the SJK genome, so they are not triallelic in the usual sense.

 We tested these 29 SJK-specific novel homozygous SNPs to see whether they resulted from errors. 21 of them were validated by experiments. Experiments for the remaining eight did not work (so were inconclusive). All 21 successful experiments confirmed novel SJK-specific alleles.

We changed the relevant sentence to: … “We found 29 SJK-specific novel homozygous SNP candidates, out of 3,439,107 SNPs, when we compared them with dbSNP entries and experimentally validated them using PCR (Polymerase Chain Reaction) amplification and Sanger dideoxy sequencing to see if they were from errors or not. 21 out of 29 were experimentally validated. Eight of them were inconclusive as the experiments were not successful (Supplemental Table 2).” …
