IGG: A Tool to Integrate Gene chips for Genetic Studies

User Manual

Version 2.0

Miao-Xin Li, Lin Jiang

Song You-Qiang and Sham Pak

Biochemistry Department

Genome Research Center

The University of Hong Kong

Pokfulam, Hong Kong SAR, China

AND

Hunan Business College

Changsha 410205, Hunan, China

CONTENTS

1. Introduction

2. Installation

2.1 Installation of Java Runtime Environment (JRE)

2.2 Installation of IGG

3. Preparation

3.1 Input Files

3.2 HapMap Genotype Data

3.3 Annotation Files of Genechips

4. Functions

4.1 Integrate Genotypes across various GeneChips

4.1.1 Load Pedigrees or Subjects

4.1.2 Load Genotype Files

4.1.3 Integrate Loaded Genotypes

4.2 Integrate HapMap Genotypes

4.3 Export Integrated Datasets

5. Issues involving large datasets

5.1 Conservative Estimate the Required Maximum Java Heap Size

5.2 Merge Datasets

6. Problems and solutions

1. Introduction

IGG (Integration of Genotypes from Genechips) is a Java-based tool with a graphical interface that integrates genotypes across the high-throughput genotyping platforms of Affymetrix and Illumina and the HapMap Project. It also provides a series of functions to control the quality of genotype integration and to flexibly export genotypes for genetic studies.

2. Installation

2.1 Installation of Java Runtime Environment (JRE)

The JRE is required to run IGG on any operating system (OS) and can be downloaded for free. IGG requires JRE version 1.6 or later. Installing the JRE is straightforward on Windows and Mac. On Linux, users may have to configure the environment variables manually.

2.2 Installation of IGG

IGG does not yet have an installation wizard. After downloading the package from our website and decompressing it, you can start IGG with the command java -jar -Xms256m -Xmx512m "./IGG.jar" in the command prompt window provided by the OS. In the command, -Xms<size> and -Xmx<size> set the initial and maximum Java heap sizes for IGG, respectively. A larger initial heap size can speed up the process of integration. A higher maximum heap size, such as -Xmx512m, is suggested when dealing with large amounts of data. The maximum, however, should be less than the size of physical memory. For Microsoft Windows workstations, a batch file, run.bat, is provided; simply double-clicking the file launches IGG.
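As an illustration of the heap-size rule above, the following Python sketch assembles the launch command and rejects a maximum heap size that is not below the physical memory. The function name build_launch_command and its parameters are hypothetical helpers for this example and are not part of IGG; the physical memory size is supplied by the caller rather than detected.

```python
def build_launch_command(initial_mb, max_mb, physical_mb, jar_path="./IGG.jar"):
    """Return the argument list that starts IGG with the given heap sizes (in MB).

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if initial_mb <= 0 or max_mb <= 0:
        raise ValueError("heap sizes must be positive")
    if initial_mb > max_mb:
        raise ValueError("initial heap size must not exceed the maximum heap size")
    if max_mb >= physical_mb:
        raise ValueError("maximum heap size should be less than physical memory")
    # Same option order as the command shown in this manual.
    return ["java", "-jar", f"-Xms{initial_mb}m", f"-Xmx{max_mb}m", jar_path]


if __name__ == "__main__":
    # Example: a workstation assumed to have 2048 MB of physical memory.
    print(" ".join(build_launch_command(256, 512, physical_mb=2048)))
```

The returned list can be passed to subprocess.run if you want to launch IGG from a script instead of typing the command.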

3. Preparation

3.1 Input Files

IGG requires two kinds of input files. One is the pedigree/subjects file, which contains the pedigree structure and the attributes of subjects. The other is the genotype files exported directly by GTYPE of Affymetrix or BeadStudio of Illumina. All input files are text-based. Their specific formats are as follows.

Format of Pedigree/subjects files:

Pedigree ID / Individual ID / Father ID / Mother ID / Gender / Disease / Individual Label
1 / 100 / 0 / 0 / 1 / 0 / 0
1 / 101 / 0 / 0 / 2 / 2 / 0
1 / 307 / 100 / 101 / 2 / 2 / Tom
1 / 502 / 100 / 101 / 1 / 1 / John
1 / 501 / 100 / 101 / 1 / 1 / Kite
1 / 306 / 0 / 0 / 1 / 1 / Kevin

The first five columns are required and follow the conventional definition used by almost all popular genetic analysis tools. The column names in the first line indicate the meaning of each column. For subjects without pedigrees, as in case-control studies, the Father and Mother IDs are 0. The sixth column is optional, and more columns can be added; these columns describe phenotypes of subjects. The last column is the unique identification label used in the genotype files. A blank or 0 denotes that an individual has no corresponding genechips. All columns are delimited by tab characters.
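To make the column rules above concrete, here is a minimal Python sketch that reads a pedigree/subjects file of this layout. It is not IGG's own parser. The function name read_pedigree and the dictionary keys are hypothetical, and the sketch assumes that the label column is always the last column of the header line.

```python
import csv
import io

REQUIRED_COLUMNS = 5  # pedigree, individual, father, mother, gender


def read_pedigree(handle):
    """Parse a tab-delimited pedigree/subjects file into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    if len(header) < REQUIRED_COLUMNS + 1:
        raise ValueError("expected five required columns plus the label column")
    # Everything between the required columns and the label is a phenotype.
    phenotype_names = header[REQUIRED_COLUMNS:-1]
    subjects = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines
        # A blank label at the end of a line may be lost; pad the row.
        cells += [""] * (len(header) - len(cells))
        label = cells[len(header) - 1]
        phenotype_values = cells[REQUIRED_COLUMNS:len(header) - 1]
        subjects.append({
            "pedigree": cells[0],
            "individual": cells[1],
            "father": cells[2],
            "mother": cells[3],
            "gender": cells[4],
            "phenotypes": dict(zip(phenotype_names, phenotype_values)),
            # A blank or 0 label means the subject has no genechip.
            "label": None if label in ("", "0") else label,
        })
    return subjects


if __name__ == "__main__":
    example = (
        "Pedigree ID\tIndividual ID\tFather ID\tMother ID\tGender\tDisease\tIndividual Label\n"
        "1\t100\t0\t0\t1\t0\t0\n"
        "1\t307\t100\t101\t2\t2\tTom\n"
    )
    for subject in read_pedigree(io.StringIO(example)):
        print(subject["individual"], subject["label"])
```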

Format of Affymetrix genotype files:

Sample Name: Tom
##Call Rate Filter Threshold=90.000000
SNP ID / Call
SNP_A-2188145 / AA
SNP_A-1813205 / BB
SNP_A-1880143 / AB
SNP_A-4215517 / AA
SNP_A-1828242 / AA
SNP_A-2029913 / AA
SNP_A-1929900 / BB
SNP_A-1818663 / AB
SNP_A-2192352 / AA
SNP_A-4218271 / AA
SNP_A-2253696 / AB
SNP_A-2033171 / AA
SNP_A-2300162 / AB

This file is exported by GTYPE. The first line is the identification label/name of a subject. This label corresponds to one of the Individual Labels in the pedigree/subjects file. IGG automatically matches the two labels during integration; inconsistent labels result in the loss of genotypes in the integration output. The second line is the call rate filter threshold, and the third line holds the column names. Genotype calls start from the fourth line: the first column is the Probe_Set_ID (SNP ID) of the Affymetrix genechip and the second is the genotype.
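The layout described above can be read with a few lines of Python. The sketch below is only an illustration and is not how IGG itself parses the file. The names read_affymetrix and has_matching_label are hypothetical, and the two columns are assumed to be separated by a tab character.

```python
import io


def read_affymetrix(handle):
    """Return (label, calls) from a GTYPE export; calls maps SNP ID to genotype."""
    lines = [line.rstrip("\r\n") for line in handle]
    if len(lines) < 3 or not lines[0].startswith("Sample Name:"):
        raise ValueError("first line must be 'Sample Name: <label>'")
    label = lines[0].split(":", 1)[1].strip()
    calls = {}
    # Line 2 is the call rate threshold, line 3 the column names,
    # so genotype calls start on the fourth line.
    for line in lines[3:]:
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0].strip():
            calls[fields[0].strip()] = fields[1].strip()
    return label, calls


def has_matching_label(label, pedigree_labels):
    """True if the file label appears among the pedigree Individual Labels."""
    return label in pedigree_labels


if __name__ == "__main__":
    example = (
        "Sample Name: Tom\n"
        "##Call Rate Filter Threshold=90.000000\n"
        "SNP ID\tCall\n"
        "SNP_A-2188145\tAA\n"
        "SNP_A-1813205\tBB\n"
    )
    label, calls = read_affymetrix(io.StringIO(example))
    print(label, calls, has_matching_label(label, {"Tom", "John"}))
```

Checking the label before integration is a cheap way to spot files whose genotypes would otherwise be dropped silently.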

Format of Illumina genotype files:

Locus_Name / Individual Label / Allele1 / Allele2 / GC_Score
rs1867749 / Tom / A / B / 0.8935
rs1397354 / Tom / A / B / 0.9440
rs649593 / Tom / B / B / 0.8923
rs1517342 / Tom / A / B / 0.8211
rs1517343 / Tom / A / B / 0.8572
rs1868071 / John / B / B / 0.9222
rs761162 / John / B / B / 0.8210
rs911903 / John / B / B / 0.8966
rs753646 / John / A / A / 0.8689
rs558912 / John / A / B / 0.8825
rs357116 / John / A / B / 0.9335
rs715494 / John / A / B / 0.9085
rs223201 / Kevin / B / B / 0.8804
rs213006 / Kevin / A / A / 0.9480
rs520354 / Kevin / B / B / 0.9258
rs874515 / Kevin / B / B / 0.8661

This file is exported by BeadStudio. The first column is the RS ID in the dbSNP database. The second column, Individual Label, corresponds to that in the pedigree/subjects file. The third and fourth columns are the two alleles of the genotype. The last column is ignored by IGG.
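Because one BeadStudio file can hold several subjects, a reader has to group the rows by Individual Label. The following Python sketch shows one way to do this. The function name read_illumina is hypothetical, tab-delimited columns are assumed, and the GC_Score column is skipped as described above.

```python
import io
from collections import defaultdict


def read_illumina(handle):
    """Return {individual label: {locus name: genotype}} from a BeadStudio export."""
    rows = iter(handle)
    next(rows, None)  # skip the header line with the column names
    genotypes = defaultdict(dict)
    for line in rows:
        fields = [field.strip() for field in line.rstrip("\r\n").split("\t")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or incomplete lines
        locus, label, allele1, allele2 = fields[:4]
        # The fifth column (GC_Score) is deliberately not read.
        genotypes[label][locus] = allele1 + allele2
    return dict(genotypes)


if __name__ == "__main__":
    example = (
        "Locus_Name\tIndividual Label\tAllele1\tAllele2\tGC_Score\n"
        "rs1867749\tTom\tA\tB\t0.8935\n"
        "rs1868071\tJohn\tB\tB\t0.9222\n"
    )
    print(read_illumina(io.StringIO(example)))
```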

3.2 HapMap Genotype Data

To integrate HapMap genotypes, you first have to download the HapMap genotype files from the website of the HapMap Project. Downloaded genotype files can be imported into IGG through the menu item Import Genotype in the Hapmap menu. Do not rename these files before importing them, because IGG selectively retrieves SNPs from them according to the file names defined by HapMap. You need not download the pedigree and subject information of the HapMap samples, since they are already built into the system.

Besides the genotype files, annotation data for the HapMap SNPs are also required before you can integrate the HapMap genotypes into your own dataset. In the current version, you can download the annotation data directly through IGG via Tools->Download Hapmap Annotation (Figure 1).

Figure 1: Dialog box of Download HapMap SNP Annotation

3.3 Annotation Files of Genechips

To make IGG easier to use, IGG 2.0 adds a function that automatically detects and downloads the annotation data from the IGG website. Since you may not be interested in all kinds of Illumina and Affymetrix genechips, you first need to specify your genechips in the dialog “Download GeneChip Annotations” (Figure 2), opened through Tools->Download Chip Annotation.

Figure 2: Dialog box of Download GeneChip Annotation

Hint: You do not need third-party tools to download the HapMap and genechip annotation data from the IGG website, because IGG uses multi-task and resumable download technologies to speed up the download.

4. Functions

IGG 2.0 focuses on strengthening its integration function and simplifying its usage. It has just three integration modules: integrate genechips according to genomes and chromosomes; integrate genechips according to specific genes and SNPs; and integrate HapMap genotypes. After integration, you can export the integrated data in various formats for statistical genetic analyses. Some basic but important functions of IGG 1.0 have been moved into these modules, such as checking Mendelian inheritance, comparing allele frequencies between the annotation and the observed dataset, and checking the consistency of a subject's genotypes across different genechips, if available.

4.1 Integrate Genotypes across various GeneChips

4.1.1 Load Pedigrees or Subjects

Loading pedigrees or subjects is the first step of any integration and analysis in IGG.

There are several ways to load a pedigree/subjects file. The first is through the main menu File->Load Pedigree/Subjects. The second, which is more convenient, is a popup menu: select the leaf Pedigree/Subjects in the Files tree and right-click to display it (see Figure 3). Alternatively, you can click the accelerator on the toolbar. IGG can load and recognize only one pedigree/subjects file.

To remove a loaded file from the tree, right-click it and choose the Remove File(s) item in the popup menu, or select the file and press the “Delete” key on the keyboard. Clicking the File->Close All Files menu removes all loaded files, including the pedigree/subjects file and the genotype files.

Figure 3: A Snapshot of Graphic Interface to Load Pedigree or Subject File

4.1.2 Load Genotype Files

After loading the pedigree or subject information, you need to load the genotype files exported by the Affymetrix GTYPE and Illumina BeadStudio software. As with the pedigree/subjects file, you can load and remove genotype files through the main menu or the popup menu. However, there are two differences from loading the pedigree/subjects file.

You have to specify the correct type of genechip in a dialog (Figure 4) before loading your genotype files. A mismatch between the genotype files and the genechip type can cause genotypes to be ignored during integration.

You can add multiple files to each node.

Figure 4: The dialog to specify types of genechips before loading the genotypes

4.1.3 Integrate Loaded Genotypes

After loading the pedigree/subject and genotype data, you can integrate them. IGG 2.0 provides two ways to integrate: according to Genome/Chromosome and according to Gene/SNPs. Both are introduced below. Figure 5 shows the dialog for integration according to Genome/Chromosome. The dialog is opened by clicking the Integrate -> Genome & Chromosome menu or the accelerator on the toolbar panel of the main frame.

Figure 5: Dialog Box of Integrating Loaded Genotypes according to Genome/Chromosome.

In the dialog box, eight parameters are offered to customize the integration and exporting.

Chromosomes of Genome: This is the only required parameter for the integration. Select the chromosomes to integrate.
Genetic Map Region: Set the region for the integration in terms of the genetic map.
Physical Region: Set the region for the integration in terms of the physical map on the reference genome.
Frequency Populations: Choose a reference population whose allele frequencies are output in the final integration result.
Interval of Trimming: Set the interval length used to trim the dataset. The default length is 0, indicating no trimming. The trimming algorithm is simple but very useful for genome-wide linkage scans. IGG first splits the selected chromosomes or regions into even segments of the specified interval length. It then selects, in each segment, the one SNP with the maximum heterozygosity for exporting (illustrated in Figure 6).

Figure 6: Algorithms to trim dataset
Remove Bad Mendel: If chosen, IGG automatically removes the genotype of a child when a Mendelian violation is detected within a parents-child trio.
Remove All Missing SNPs: If the genotypes of a SNP are all missing, the SNP is removed from the final integrated dataset.
Correct Annotation Frequency (required for Linkage): If selected, IGG handles the following situation. If a SNP has an allele frequency of zero in the annotation file but a non-zero one in the loaded datasets, IGG retains its genotypes and corrects the frequency from 0 to 0.000001 in the final integrated dataset. This correction is necessary for subsequent linkage analysis; otherwise, tools like Merlin may abort.
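Two of the options above, Interval of Trimming and Remove Bad Mendel, can be sketched in Python as follows. This is an illustration under stated assumptions and is not IGG's actual implementation. The names trim_snps and is_mendel_consistent are hypothetical, segments are assumed to start at region_start, ties in heterozygosity are resolved in favour of the SNP seen first, heterozygosity values are taken as given, and the trio check assumes autosomal SNPs with two-letter genotypes such as AA, AB and BB.

```python
def trim_snps(snps, interval, region_start=0.0):
    """Keep the SNP with the highest heterozygosity in each even segment.

    snps is a list of (snp_id, position, heterozygosity) tuples.
    An interval of 0 means no trimming, as in the dialog.
    """
    if interval <= 0:
        return [snp_id for snp_id, _, _ in sorted(snps, key=lambda s: s[1])]
    best = {}  # segment index -> (snp_id, heterozygosity)
    for snp_id, position, het in snps:
        segment = int((position - region_start) // interval)
        if segment not in best or het > best[segment][1]:
            best[segment] = (snp_id, het)
    return [best[segment][0] for segment in sorted(best)]


def is_mendel_consistent(father, mother, child):
    """Check one parents-child trio; a missing genotype cannot be checked."""
    if len(father) != 2 or len(mother) != 2 or len(child) != 2:
        return True
    first, second = child
    # One child allele must come from the father and the other from the mother.
    return (first in father and second in mother) or (
        second in father and first in mother
    )


if __name__ == "__main__":
    snps = [("rs1", 0.2, 0.30), ("rs2", 0.7, 0.45), ("rs3", 1.1, 0.10)]
    print(trim_snps(snps, interval=1.0))
    print(is_mendel_consistent("AA", "AB", "BB"))
```

In this sketch a trio that fails the check would have the child's genotype at that SNP set to missing, which matches the behaviour described for Remove Bad Mendel.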

If you are interested not in genotypes of the whole genome but in some specific genes or SNPs, you can integrate them selectively in another way. Figure 7 shows the dialog, which is opened through the Integrate -> Gene & SNP menu or the accelerator on the toolbar panel of IGG’s main frame. The settings on the dialog are described below.

Figure 7: Dialog Box of Integrating Loaded Genotypes according to Gene/SNPs.

Genes->Type: The way genes are identified. You can choose either the HGNC official gene symbol or the Entrez Gene ID of NCBI. Both types are widely used in the biological community.
Genes->From File: Load the gene identifiers from a text file. The file lists one gene per line, just as the example genes appear in the text box above.
Genes->Gene Extension: Set an extension range for each gene. In this way, SNPs outside a gene but close to it are also included in the integration.
Genes->Gene Extension->upstream: Location upstream of the transcription start site.
Genes->Gene Extension->downstream: Location downstream of the transcription end site.
SNPs: List the SNPs of interest in the text box.
SNPs->From File: Load the SNPs of interest from a text file. The file lists one SNP per line, just as the example SNPs appear in the text box above.
Frequency Populations: The same as Figure 5.
Interval of Trimming: The same as Figure 5.
Remove Bad Mendel: The same as Figure 5.
Correct Annotation Frequency (required for Linkage): The same as Figure 5.
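The effect of the Gene Extension settings can be pictured with the short Python sketch below. It is an assumption-laden illustration, not IGG's code. The names extended_region and select_snps are hypothetical, positions are base-pair coordinates on one chromosome, and the handling of the strand (upstream lies at lower coordinates for "+" and at higher coordinates for "-") is an assumption that this manual does not confirm.

```python
def extended_region(start, end, upstream=0, downstream=0, strand="+"):
    """Return the (low, high) coordinates of a gene plus its extension.

    start and end are the genomic boundaries of the gene with start <= end.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand == "+":
        low, high = start - upstream, end + downstream
    elif strand == "-":
        # On the minus strand, transcription starts at the high coordinate.
        low, high = start - downstream, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(low, 1), high


def select_snps(snp_positions, region):
    """Return the IDs of SNPs whose position falls inside the region."""
    low, high = region
    return [snp for snp, pos in snp_positions.items() if low <= pos <= high]


if __name__ == "__main__":
    region = extended_region(10000, 20000, upstream=5000, downstream=2000)
    positions = {"rs1": 4000, "rs2": 6000, "rs3": 21000, "rs4": 23000}
    print(region, select_snps(positions, region))
```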

Clicking the “Integrate” button on the dialog launches the integration. A progress bar at the bottom right blinks to indicate that IGG is integrating the data. Brief and detailed real-time running information is displayed in the bottom-right panels.

The integrated results are shown in the middle tree of the left panel of the main frame as an integrated dataset. The dataset name is the time at which the integration started, with the prefix “C”, which stands for integrated geneChip dataset. Within the dataset, the integrated results are organized by chromosome.

4.2 Integrate HapMap Genotypes

Compared with the previous version, IGG 2.0 greatly simplifies the procedure for integrating HapMap genotypes. The integration of HapMap genotypes is carried out after the integration of GeneChip data.

Figure 8 shows the dialog for integrating HapMap genotypes. You can open the dialog by clicking the main menu Integrate->HapMap Genotype or the accelerator. The settings on the dialog are explained below.

Figure 8: Dialog of Integrating HapMap Genotypes

Integrated GeneChip Datasets: Choose an integrated GeneChip dataset for the integration. The purpose of this integration is to merge the HapMap genotypes of the SNPs in the integrated GeneChip dataset into a new dataset.
Sample Population: Choose the HapMap sample population(s) to integrate.
Genotype Files: The HapMap genotype files loaded for the integration. These genotype files are downloaded from the HapMap website, and you have to unzip them before you can load them into IGG. The file name defined by HapMap contains the chromosome and sample information, and IGG can only recognize the original file name. So please do not change the names of these genotype files before you load them. Once you load the genotype files, the successfully parsed file name is shown in the Filter of File Names panel on the dialog. In addition, IGG can only recognize one pattern of file names at a time, so please make sure the names of all imported files follow a consistent pattern.

The integrated results are again shown in the middle tree of the left panel of the main frame as an integrated dataset. The dataset name is the time at which the integration started, with the prefix “H” rather than “C”. Here “H” stands for integrated HapMap dataset. Within the dataset, the integrated results are also organized by chromosome.

4.3 Export Integrated Datasets

After the integration, you can export the integrated dataset in the format required by a specific statistical genetic analysis tool. The dialog “Export Integrated Data” is opened by clicking the menu Tools->Export For Genetic Tools or the accelerator. Figures 9 and 10 show the dialogs for exporting integrated genechip genotypes and integrated HapMap genotypes, respectively.

The settings on the dialog are defined below.

Figure 9: Dialog of Exporting integrated genechip data

Integrated GeneChip Datasets: Choose the integrated GeneChip dataset to export.
Phenotypes->Type: Choose the type of the phenotypes in the exported output. There are three types of phenotypes: Quantitative Trait, Affection Status and Covariate. Many tools treat the three types differently.
Phenotypes->Available: Phenotypes in the loaded pedigree file.
Phenotypes->Selected: Phenotypes selected for the output.
Format->Name of Tool: Output format. Each format corresponds to an analysis tool.
Format->Missing Genotypes: The label that denotes missing genotypes in the output. Each genetic analysis tool defines its own label, so refer to the documentation of the tool to set the missing genotype label.
Files->Path: Set the output path of the exported files.
Files->Prefix of FileName: Set the prefix of the exported file names.

Figure 10: Dialog of exporting integrated HapMap Genotypes

The export of integrated HapMap genotypes has only one additional setting compared with the export of integrated genechip genotypes: the assumed phenotype value of the HapMap subjects. For example, in a linkage analysis of a given disease, we can assume the HapMap subjects to be normal and therefore put 1 into the text field.
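How the missing genotype label and the assumed phenotype value end up in an exported file can be sketched as follows. The row layout used here (pedigree, individual, father, mother, gender, phenotype, then one allele pair per SNP) is a LINKAGE-style layout chosen for illustration only. The exact formats written by IGG are not described in this section, and the function name format_export_row and the default label "0 0" are hypothetical.

```python
def format_export_row(subject, phenotype, genotypes, snp_order, missing="0 0"):
    """Build one space-delimited pedigree row for export.

    subject is (pedigree, individual, father, mother, gender);
    genotypes maps SNP ID to a two-letter genotype such as "AB".
    """
    cells = list(subject) + [phenotype]
    for snp in snp_order:
        call = genotypes.get(snp, "")
        # Anything that is not a two-letter call is written as missing.
        cells.append(f"{call[0]} {call[1]}" if len(call) == 2 else missing)
    return " ".join(str(cell) for cell in cells)


if __name__ == "__main__":
    # A HapMap subject exported with the assumed phenotype value 1 (normal).
    row = format_export_row(
        ("1", "307", "100", "101", "2"),
        phenotype="1",
        genotypes={"rs1": "AB"},
        snp_order=["rs1", "rs2"],
    )
    print(row)
```

The missing label is a parameter because, as noted above, each analysis tool defines its own.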