All data files have to be in tab delimited ascii format and missing values should be labeled "NA". Example files are available in the Download and Installation section. In the following we will describe the required data formats for the different data types. The names of the required columns can deviate from that in the description. In this case the user has to change the naming of the required columns in the "Loading Settings" menu.
SEURAT provides no preprocessing tools, gene expression data should be normalized and preprocessed beforehand. It is mandatory that the first column contains the gene IDs. The following columns should be named according to the sample IDs and contains the gene expression values of the respective samples.
geneID sample.1 sample.2 ... sample.n
gene.1 y.11 y.12 ... y.1n
gene.2 y.21 y.22 ... y.2n
...
gene.p y.p1 y.p2 ... y.pn
The first column should contain the sample IDs and these IDs should have the same format as in the gene expression file. Additional columns can contain any type of continous or categorical variable with the name of the variable in the first row.
sampleID variable.1 variable.2 ... variable.z
sample.1 x.11 x.12 ... x.1z
sample.2 x.21 x.22 ... x.2z
...
sample.n x.n1 x.n2 ... x.nz
The first column has to contain the geneIDs and the gene names must be
the same as the gene names in the gene expression file. In case of
pathway informations where each gene can belong to different pathways,
the entries of pathway variables have to be lists of comma delimited
strings, e.g.: "cell_signaling, immunology, metastasis" or for
GO-terms: "molecular function|ATP binding|IEA|GO:0005524|GOA/IPI,
biological process|defense response to pathogen|IEA|GO:0042829|GOA/IPI,
cellular component|nucleus|IEA|GO:0005634" The name of a pathway
variable has to begin with "#" as the first character.
If the data set contains URLs (e.g. linking to cross-references like
PubMed),
the name of the respective column has to begin with a "@" symbol.
In order to associate the gene expression data with the array CGH data
the gene annotation file is needed. In addition the annotation file has
to have a column called "ChromosomeNumber" containing the number of the
chromosome and a column called "NucleotidePosition" containing the
starting position of the ORF of the gene. Additional
columns can contain any type of continuous or categorical variable with
the variable names in the first row.
geneID ChromosomeNumber NucleotidePosition #pathway annotation.1 ... annotation.v
gene.1 1 11223444 path1,path2 a.11 ... a.1v
gene.2 22 NA path1,path3 a.21 ... a.2v
...
gene.p X 1055 path5 a.p1 ... a.pv
The array CGH data should have been normalized and preprocessed. A
segmentation algorithm should be applied to estimate cytogenetic gains
and losses. We used the R package
GLAD to preprocess our example data file.
The first column contains the cloneIDs of the CGH clones. To link the
states of the clones with the samples, a column with the name
"samplename.States" is needed for each sample. These columns contain
the result of the segmentation algorithm and the entries are "-1","0"
or "1", which correspond to a cytogenetic loss, no change or gain.
To link the CGH data with the gene expression data, the following six
columns are needed: "ChromosomeNumber", "ChrStart", "ChrEnd",
"ChromosomeCen", "Mapping". The variables "ChrStart" and
"ChrEnd" contain the start and end nucleotide positions of each CGH
clone. "Mapping" gives the cytoband
location of the CGH clone and "ChromosomeCen" gives the nucleotide
position of the centromere of the chromosome (so it is the same for all
clones on the same chromosome). See the example for details:
CloneID ChromosomeNumber ChrStart ChrEnd Mapping ChromosomeCen sample.1.States ... sample.n.States
clone.1 1 145 605 1q41 124300000 1 0
clone.2 22 11233 12034 22q13 11800000 -1 1
...
clone.w X NA NA Xq22 59500000 0 -1
Like the array CGH data also the SNP array data need to be preprocessed.
To normalize Affymetrix SNP chips (Genome-Wide Human SNP Arrays 6.0), we used the R package aroma.affymetrix.
A detailed R script descriping the preprocessing of Affymerix exon arrays (GeneChip Human Exon 1.0 ST Arrays) and SNP chips is available in the download section.
The first column of the snp file contains the SNPIDs. To link the states of the SNPs with the samples, a column with the name
"samplename.States" is needed for each sample. These columns contain
the result of the segmentation algorithm and the entries are "-1","0"
or "1", which correspond to a cytogenetic loss, no change or gain.
To link the SNP data with the gene expression data, the following
columns are needed: "ChromosomeNumber", "Physical.position",
"ChromosomeCen" and "Mapping".
SNPID ChromosomeNumber Physical.position Mapping ChromosomeCen sample.1.States ... sample.n.States
SNP.1 1 555 1q41 124300000 1 0
SNP.2 22 33030 22q13 11800000 -1 1
...
SNP.w X NA Xq22 59500000 0 -1