Importing Data

Once you have selected a genome you can begin to import your data. You can either import data into an empty project or add new data to an existing project. If you add data to an existing project you will need to rerun any quantitation you had previously done, as it won't carry over to the newly imported data.

SeqMonk understands a few defined file formats and if you use one of these then you can load your data directly. The currently supported file formats are:

If your data is not in one of these formats then you can use the Generic Text File Importer to tell SeqMonk about your file format and import your data that way.

Common Options

Some of the import filters may add their own specific options when importing, but there are a few options which are common to the import filters. Two of them apply only to single end or only to paired end data.

Remove duplicates - you can choose to not import duplicate reads. This may help to cut down on PCR artefacts, but could potentially misrepresent your data if you have very high coverage over enriched areas. Rather than removing duplicates when loading data you can later choose to ignore duplicates when performing quantitation.
(For Single End Reads) Extend Reads - It is common practice to use single end read data to measure enrichment in ChIP-Seq type experiments. Analysing this data can be tricky since the region which is sequenced is not over the site of interaction, but is one end of a fragment which covers that site. Paired end data is easier to interpret since it can show the whole of the enriched fragment, so you get more complete peaks over interaction sites. To make single end data easier to interpret you can choose to extend each of your reads by a specified length so that they more closely represent the fragment from which the read was sequenced. In this way you get many of the advantages of paired end sequencing with the lower cost of single end.
(For Paired End Reads) Distance cutoff - Correctly matching paired end reads should map within a few hundred bases of each other on the genome. Reads positioned further apart usually result from a mismapping of one end (although they could arise from indel or recombination events). To avoid cluttering the display with really long reads we set a limit on the allowed distance between the ends of read pairs.
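
The three options above can be sketched in a few lines of Python. This is only an illustration of the ideas and not SeqMonk's own code: the Read type, the function names, the definition of a duplicate (same chromosome, start, end and strand) and the choice to extend reads in their direction of travel are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical read record; positions are 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def remove_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen at each (chrom, start, end, strand)."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def extend_read(read: Read, extend_by: int, chrom_length: int) -> Read:
    """Lengthen a single end read towards the rest of its fragment.

    Assumption: forward reads grow at their end, reverse reads at their
    start, and the result is clipped to the chromosome boundaries.
    """
    if read.strand == "+":
        return read._replace(end=min(read.end + extend_by, chrom_length))
    return read._replace(start=max(read.start - extend_by, 1))


def pair_within_cutoff(first: Read, second: Read, max_distance: int) -> bool:
    """Accept a read pair only if its outer ends span at most max_distance."""
    if first.chrom != second.chrom:
        return False
    span = max(first.end, second.end) - min(first.start, second.start) + 1
    return span <= max_distance
```

For example, extending a 50 base forward read at positions 100-149 by 150 bases gives a read covering 100-299, while a reverse read at 300-349 becomes 150-349.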

When you import your data there may be (but hopefully won't be) a major problem which causes the import to stop and an error to be reported. More commonly, the import will produce some warnings. Warnings won't stop your data from being imported but they will be shown to you after the import.

Import Warnings Dialog

The warnings dialog will show you the total number of warnings, but will only show you the details for the first 500 if you generated more than that. Warnings are not necessarily a problem but you should review them carefully to ensure that the data which is imported is OK.

The most common warnings you will see are:

Reading position 100000 was beyond the end of ChrX - this is normally an indication that you have selected the wrong genome assembly as the basis for your project. You should check which assembly was used for your mapping and ensure that you are using the same one otherwise your reads will appear in the wrong place.
Couldn't find a chromosome called [some name] - this can be caused by sequences being mapped to fragments which aren't part of the main genome assembly. This isn't a problem but means you can't analyse those sequences. It can also be caused by different conventions being used to name chromosomes (MT vs M for mitochondrion or 1 vs I for the first chromosome). You need to ensure that your data uses the same conventions as the reference genome.
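
If you want to check a mapped file for these two problems before importing it, the logic can be sketched as below. This is a minimal illustration and not how SeqMonk itself performs the checks: the check_read function, the chromosome length table and the alias table are hypothetical, and the messages are only modelled on the warnings described above.

```python
from typing import Dict, Optional


def check_read(
    chrom: str,
    position: int,
    chrom_lengths: Dict[str, int],
    aliases: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return a warning message for a read position, or None if it is fine.

    chrom_lengths maps reference chromosome names to their lengths.
    aliases maps names used in the data to names used by the reference.
    """
    # Translate the name first so that naming conventions alone
    # (M vs MT, I vs 1) do not trigger a warning.
    name = (aliases or {}).get(chrom, chrom)
    if name not in chrom_lengths:
        return f"Couldn't find a chromosome called {chrom}"
    if position > chrom_lengths[name]:
        return f"Reading position {position} was beyond the end of {name}"
    return None


# Example values, invented for the illustration
lengths = {"1": 1000, "MT": 500}
aliases = {"I": "1", "M": "MT"}
for chrom, position in [("I", 10), ("M", 600), ("scaffold_7", 10)]:
    print(chrom, position, check_read(chrom, position, lengths, aliases))
```

A large number of "beyond the end" messages from a check like this would suggest that the lengths come from a different assembly than the one used for mapping.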

When your data is initially imported it will appear in the Data View as a Data Set. Each Data Set will be named after the file from which it came. If you want to change the name of a Data Set you can either right-click on it in the Data View and select "Rename", or select Data > Edit Data Sets and use the controls there to select and rename each Data Set. The second route gives you more control, including the ability to do bulk find and replace renaming of a large group of Data Sets.

Edit Data Sets Dialog

Once your data is imported you should move on to creating Data Groups.