10x Genomics
Chromium De Novo Assembly

Pipestance Structure

The pipeline output directory, described in Assembly Output, contains all of the data produced by one invocation of a pipeline (a pipestance) as well as rich metadata describing the characteristics of each stage. This directory has a specific structure that the Martian pipeline framework uses to track the state of the pipeline as execution proceeds.

Pipeline Structure

Supernova's notion of a pipeline is flexible: a pipeline can be composed of stages, which run stage code, and sub-pipelines, which may themselves contain stages or further sub-pipelines.

Each stage runs in its own directory bearing its name, and each stage's directory is contained within its parent pipeline's directory.

For example, the supernova demux pipeline has the following process graph:

[Figure: process graph of the supernova demux pipeline]

Directory Structure

Every pipestance operates wholly inside its pipeline output directory. When the pipestance completes, the pipeline output directory contains three kinds of output: metadata files, the pipestance output file directory, and the top-level pipeline stage directory.

The top-level pipeline stage directory (BCL_PROCESSOR_CS or ASSEMBLER_CS) is a stage directory that contains any number of child stage directories as well as one stage output directory for each fork run by that stage. All Supernova pipelines contain only single-fork stages, so there will only ever be a fork0 stage output directory within each stage directory. Chunk output directories are a subset of stage output directories that additionally contain runtime information specific to the job or process being run by that chunk (e.g., a process ID or cluster job ID).

For example, the supernova demux pipeline's pipeline output directory contains the following directory structure:

_complete                                                Metadata file
_log                                                     Metadata file
outs/                                                    Pipestance output file directory
BCL_PROCESSOR_CS/                                        Top-level pipeline stage directory
BCL_PROCESSOR_CS/fork0/                                  Stage output directory
BCL_PROCESSOR_CS/fork0/files/                            Stage output files
BCL_PROCESSOR_CS/BCL_PROCESSOR/                          Stage directory
BCL_PROCESSOR_CS/BCL_PROCESSOR/fork0/                    Stage output directory
BCL_PROCESSOR_CS/BCL_PROCESSOR/fork0/files/              Stage output files
BCL_PROCESSOR_CS/BCL_PROCESSOR/ANALYZE_RUN/              Stage directory
BCL_PROCESSOR_CS/BCL_PROCESSOR/ANALYZE_RUN/fork0/        Stage output directory
BCL_PROCESSOR_CS/BCL_PROCESSOR/ANALYZE_RUN/fork0/chnk0/  Chunk output directory
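
Because every stage directory holds a fork output directory, the stage hierarchy in a listing like the one above can be recovered from the directory tree alone. The following is a minimal sketch of that idea; the function name list_stages is hypothetical and is not part of Supernova or Martian.

```shell
# list_stages is a hypothetical helper, not a Supernova or Martian command.
# Assumption: any directory that has a fork<N> child is a stage directory.
list_stages() {
    # Find every fork<N> directory, strip that last path component,
    # and print each parent (the stage directory) once.
    find "$1" -type d -name 'fork[0-9]*' |
        sed 's|/fork[0-9]*$||' |
        sort -u
}

# Usage: list_stages sample345/
```

Each line of output is one stage or pipeline directory, so nested paths show which sub-pipeline a stage belongs to.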

Commonly Generated Metadata

The metadata contained in the pipeline output directory includes:

File Name    Description
_finalstate  Metadata cache that is populated when a pipestance completes to minimize re-aggregation of metadata
_invocation  The MRO call used to invoke this pipestance
_log         The log messages that are reported to your terminal window when running supernova run or supernova demux
_mrosource   The entire MRO describing the pipeline with all @includes dereferenced
_perf        Detailed runtime performance data for every stage in the pipestance
_timestamp   The start and finish time for this pipestance
_vdrkill     A list of all of the volatile data (temporary files) removed during pipeline execution as well as total number of files and bytes deleted
_versions    Versions of the components used by the pipeline

Stage directories contain stage output directories, stage output files, and the stage directories of any child stages or pipelines.

Stage output directories typically contain:

File Name    Contents
files/       Directory containing any files created by this stage that were not considered volatile (temporary)
split/       A special stage output directory for the step that divided this stage's input into parallel chunks
chnkN/       A chunk output directory for the Nth parallel chunk executed
join/        A special stage output directory for the step that recombined this stage's parallel output chunks into a single output dataset again
_complete    A file that, when present, signifies that this stage has successfully completed
_errors      A file that, when present, signifies that this stage failed. Contains the errors that resulted in stage failure.
_invocation  The MRO call used to execute this stage by the Martian framework
_outs        The output files generated by this stage
_vdrkill     A list of all of the volatile data (temporary files) removed during pipeline execution as well as total number of files and bytes deleted
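
Since _complete and _errors act as marker files, the state of a single stage output directory can be read off by testing for their presence. The sketch below assumes nothing beyond the two file names in the table above; the function name stage_status and the label "incomplete" (neither marker present) are illustrative choices, not Martian terminology.

```shell
# stage_status is a hypothetical helper, not a Supernova or Martian command.
# Argument: a stage output directory such as .../fork0 or .../fork0/join.
stage_status() {
    if [ -e "$1/_errors" ]; then
        # _errors present: the stage failed.
        echo failed
    elif [ -e "$1/_complete" ]; then
        # _complete present: the stage finished successfully.
        echo complete
    else
        # Assumption: neither marker means not started or still running.
        echo incomplete
    fi
}

# Usage: stage_status sample345/ASSEMBLER_CS/fork0
```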

Chunk output directories are a subset of stage output directories that, in addition to the aforementioned stage output, may contain:

File Name    Contents
_args        The arguments passed to the stage's stage code
_jobinfo     Metadata describing the stage's execution, including performance metrics, job manager jobid and jobname, and process ID
_stdout      Any stage code output that was printed to the stdout stream
_stderr      Any stage code output that was printed to the stderr stream

Treat these metadata files as read-only; altering their contents is not recommended.

Navigating Pipestances

Pipestance output directories can have very complicated structures, but the standard find command can quickly return high-level information about a pipestance.

For example, to find the stages that caused the overall failure of a pipestance whose output directory is sample345/:

$ find sample345/ -name _errors
sample345/ASSEMBLER_CS/_ASSEMBLER_PREP/_FASTQ_TO_FASTBQUALP/fork0/join/_errors

This tells us that the failed stage was _FASTQ_TO_FASTBQUALP, which is part of the _ASSEMBLER_PREP sub-pipeline. ASSEMBLER_CS is the parent pipeline.
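
Reading the stage name out of each path by eye becomes tedious when several stages fail. As a rough sketch, the path component immediately before the fork directory can be extracted with sed; the function name failed_stages is hypothetical, and the approach assumes the stage-directory/fork layout described above.

```shell
# failed_stages is a hypothetical helper, not a Supernova or Martian command.
# It prints the name of every stage that has an _errors file beneath it.
failed_stages() {
    # For each _errors path, keep only the directory name that sits
    # directly above the fork<N> component, then de-duplicate.
    find "$1" -name _errors |
        sed -n 's|.*/\([^/]*\)/fork[0-9]*/.*|\1|p' |
        sort -u
}

# Usage: failed_stages sample345/
```

For the sample345/ path shown above, this would print _FASTQ_TO_FASTBQUALP.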

It can be helpful to view all _errors files' contents at once by piping to xargs cat:

$ find sample345/ -name _errors | xargs cat
 
Traceback (most recent call last):
  File "/mnt/home/neil/src/supernova-1.0/martian-cs/2.0/adapters/python/main.py", line 20, in <module>
    martian.run("martian.module.main(args, outs)")
  File "/mnt/home/neil/src/supernova-1.0/martian-cs/2.0/adapters/python/martian.py", line 417, in run
    exec(cmd, __main__.__dict__, __main__.__dict__)
  File "<string>", line 1, in <module>
  File "/mnt/home/neil/src/supernova-1.0/supernova-cs/1.0/mro/stages/denovo/df/__init__.py", line 47, in main
    df_command = ['DF', 'LR_SELECT_FRAC={:f}'.format(select_fracs), 'LR='+args.reads,
KeyError: 0

In the above case, the error is an unhandled exception whose cause is not obvious; these sorts of failures should be reported to the 10x software support team for assistance with diagnosis.
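
When more than one _errors file exists, plain cat output does not show which file each message came from, which matters when reporting a failure. One possible variation, sketched here with a hypothetical function name show_errors, prints a header line with the path before each file's contents. It uses only the standard find -exec ... {} + form.

```shell
# show_errors is a hypothetical helper, not a Supernova or Martian command.
# It prints each _errors file preceded by a header naming the file.
show_errors() {
    find "$1" -name _errors -type f -exec sh -c '
        for f do
            printf "==> %s <==\n" "$f"
            cat "$f"
        done' sh {} +
}

# Usage: show_errors sample345/
```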

Stages such as _ASSEMBLER_DF and _ASSEMBLER_CP that run external binaries often generate output to their stdout and stderr streams. These messages are captured in the _stdout and _stderr metadata files within the chunk output directories, and combining find and xargs cat to examine their contents can also assist with troubleshooting.
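
The _stdout and _stderr files from external binaries can be long, so it is often enough to look at the end of each non-empty one. The following sketch is one way to do that; the function name tail_streams and the default of 20 lines are arbitrary illustrative choices.

```shell
# tail_streams is a hypothetical helper, not a Supernova or Martian command.
# Arguments: a pipeline output directory and, optionally, a line count.
tail_streams() {
    # Select non-empty _stdout and _stderr files anywhere in the tree,
    # then print a header and the last lines of each one.
    find "$1" \( -name _stdout -o -name _stderr \) -type f -size +0c |
    while IFS= read -r f; do
        printf '==> %s <==\n' "$f"
        tail -n "${2:-20}" "$f"
    done
}

# Usage: tail_streams sample345/ 50
```

Empty stream files are skipped so that only stages that actually printed something appear in the output.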