WDL Workflows

This module contains all the instructions for automated execution of cas services.

Module Structure

  • bq_ops
    • anndata_to_ingest_files

    • ingest_files_to_bq

    • prepare_extract

    • extract

    • test_data_cycle

  • data_embedding

Set Up

Cromwell Execution Manager

To execute these scripts in Cloud Environment you would need a cromwell task execution manager. Broad already has a cromwell server running in the cloud. You can use that server to execute these scripts. To interact with this server you would need to install cromshell (cromwell cli tool) on your local machine. You can do that by this command:

$ pip install cromshell

After installation, you would need to specify a cromwell server URL. You can do that by this command:

$ cromshell config set url ${CROMWELL_SERVER_URL}

Please refer to the Cromshell Github page for more

VPN

To interact with the cromwell server you would need to be on the Broad VPN. Make you sure that you’re using Z-Duo-Broad-NonSplit-VPN VPN mode, otherwise cromwell server won’t be accessible. Please refer to the Broad VPN page for more information.

Usage

To execute a workflow you would need to run this command:

$ cromshell submit ${WDL_FILE_PATH} ${INPUT_JSON_FILE_PATH}

To check the status of your workflow runs you can run this command:

$ cromshell list -u -c

Workflow Inputs Files

More detailed description of what these inputs are, please look at the casp.services.bq_ops directory and its subdirectories for each individual operation.

bq_ops.anndata_to_ingest_files

Parameters

  • CASAnndataToIngestFiles.anndata_to_ingest_files.docker_image - Docker image to use for this operation.

  • CASAnndataToIngestFiles.anndata_to_ingest_files.gcs_stage_dir - GCS directory to stage the output files.

  • CASAnndataToIngestFiles.anndata_to_ingest_files.gcs_input_bucket - Working GCS Bucket name.

  • CASAnndataToIngestFiles.anndata_to_ingest_files.original_feature_id_lookup - A column name in var dataframe from where to get original feature ids. In most of the cases it will be a column with ENSEMBL gene IDs. if index, then the index of the dataframe will be used.

  • CASAnndataToIngestFiles.convert_args - List of dictionaries with the following keys
    df_filename - A path to the anndata file in GCS bucket
    cas_cell_index - A starting index for the cells in the output files
    cas_feature_index - A starting index for the features in the output files

Example

{
   "CASAnndataToIngestFiles.anndata_to_ingest_files.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
   "CASAnndataToIngestFiles.anndata_to_ingest_files.gcs_stage_dir": "cromwell_50m",
   "CASAnndataToIngestFiles.anndata_to_ingest_files.gcs_input_bucket": "cellarium-file-system",
   "CASAnndataToIngestFiles.anndata_to_ingest_files.original_feature_id_lookup": "index",
   "CASAnndataToIngestFiles.convert_args": [
      {
         "df_filename": "census_data/0129dbd9-a7d3-4f6b-96b9-1da155a93748-census-dataset.h5ad",
         "cas_cell_index": 0,
         "cas_feature_index": 0
      },
      {
         "df_filename": "census_data/04a23820-ffa8-4be5-9f65-64db15631d1e-census-dataset.h5ad",
         "cas_cell_index": 1000000,
         "cas_feature_index": 1000000
      }
   ]
}

bq_ops.ingest_file_to_bq

Parameters

  • CASIngestFilesToBQ.ingest_files_to_bq.docker_image - Docker image to use for this operation.

  • CASIngestFilesToBQ.ingest_files_to_bq.gcs_stage_dir - GCS directory to stage the output files.

  • CASIngestFilesToBQ.ingest_files_to_bq.gcs_bucket_name - Working GCS Bucket name.

  • CASIngestFilesToBQ.ingest_files_to_bq.dataset - BigQuery dataset name where to ingest the data. If dataset doesn’t exist, it will be created.

Example

{
  "CASIngestFilesToBQ.ingest_files_to_bq.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
  "CASIngestFilesToBQ.ingest_files_to_bq.gcs_stage_dir": "cromwell_test_10k",
  "CASIngestFilesToBQ.ingest_files_to_bq.gcs_bucket_name": "cellarium-file-system",
  "CASIngestFilesToBQ.ingest_files_to_bq.dataset": "cas_test_dataset"
}

bq_ops.precalculate_fields

Parameters

  • CASPrecalculateFields.precalculate_fields.docker_image - Docker image to use for this operation

  • CASPrecalculateFields.precalculate_fields.dataset - BigQuery dataset name where to ingest the data.

  • CASPrecalculateFields.precalculate_fields.fields - A comma separated list of fields to precalculate. Currently only total_mrna_umis is supported.

Example

{
   "CASPrecalculateFields.precalculate_fields.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
   "CASPrecalculateFields.precalculate_fields.dataset": "cas_test_dataset",
   "CASPrecalculateFields.precalculate_fields.fields": "total_mrna_umis"
}

bq_ops.prepare_extract

Current input file could have a filed CASPrepareExtractBQ.prepare_extract.filters_json_path. This is a gs json path. Please find an example of this filter file attached as well

Parameters

  • CASPrepareExtractBQ.prepare_extract.docker_image - Docker image to use for this operation.

  • CASPrepareExtractBQ.prepare_extract.bq_dataset - BigQuery dataset name where to ingest the data.

  • CASPrepareExtractBQ.prepare_extract.extract_table_prefix - Prefix for the extract table name. The final extract tables name will be named like ${extract_table_prefix}_cas_cell_info

  • CASPrepareExtractBQ.prepare_extract.extract_bin_size - Size of the bin for the extract table, usually we put 10000

  • CASPrepareExtractBQ.prepare_extract.bucket_name - Working GCS Bucket name

  • CASPrepareExtractBQ.prepare_extract.obs_columns_to_include - A comma separated list of columns to include in the extract table.

  • CASPrepareExtractBQ.prepare_extract.fq_allowed_original_feature_ids - A fully qualified table name of the reference data table with the feature schema.

  • CASPrepareExtractBQ.prepare_extract.extract_bucket_path - A GCS path to the extract table. This path is used for creating metadata files for the extract script.

  • CASPrepareExtractBQ.prepare_extract.filters_json_path - A GCS path to a json file with filters. Please find an example of this filter file attached as well

Example

{
  "CASPrepareExtractBQ.prepare_extract.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
  "CASPrepareExtractBQ.prepare_extract.bq_dataset": "cas_test_dataset",
  "CASPrepareExtractBQ.prepare_extract.extract_table_prefix": "fg_extract",
  "CASPrepareExtractBQ.prepare_extract.extract_bin_size": 10000,
  "CASPrepareExtractBQ.prepare_extract.bucket_name": "cellarium-file-system",
  "CASPrepareExtractBQ.prepare_extract.extract_bucket_path": "curriculum/fg_extract",
  "CASPrepareExtractBQ.prepare_extract.filters_json_path": "gs://cellarium-file-system/curriculum/extract_filters/filters_mus_mus_brain.json",
  "CASPrepareExtractBQ.prepare_extract.fq_allowed_original_feature_ids": "dsp-cell-annotation-service.cas_reference_data.refdata-gex-GRCh38-2020-A",
  "CASPrepareExtractBQ.prepare_extract.obs_columns_to_include": "cell_type,total_mrna_umis,donor_id,assay,development_stage,disease,organism,sex,tissue"
}

An example of JSON object for CASPrepareExtractBQ.prepare_extract.filters_json_path (you’d need to put this file in a GCS bucket and provide a path to the workflow input file):

{
  "organism__eq": "Mus musculus",
  "cell_type__in": ["L6b glutamatergic cortical neuron", "interneuron", "inhibitory interneuron", "cerebellar Golgi cell"],
  "is_primary_data__eq": true
}

Note

Constructing filters

Filters is a dictionary containing filter criteria, structured as {"column_name__filter_type": "value"}.

Supported filter_types
"eq" - Used for an ‘equals’ comparison.
Example: {"organism__eq": "Homo sapiens"} results in organism='Homo sapiens'.

"in" - Used for an ‘in’ comparison with a set of values.
Example: {"cell_type__in": ["T cell", "neuron"]} results in cell_type in ('T cell', 'neuron').

bq_ops.extract

Parameters

  • CASExtractBQ.extract.docker_image - Docker image to use for this operation.

  • CASExtractBQ.extract.bq_dataset - BigQuery dataset name where to ingest the data.

  • CASExtractBQ.extract.extract_table_prefix - Prefix for the extract table name. This value should be the same as the one used in CASPrepareExtractBQ.prepare_extract.extract_table_prefix

  • CASExtractBQ.bin_borders - A list of lists with bin borders. Each list should contain two numbers, the first one is the start of the bin, the second one is the end of the bin.

For example, [[0, 9], [10, 19], [20, 29]] will create 30 bins: 0-9, 10-19, 20-29. Each bin will be a separate .h5ad extract file. Each bin group will be executed in parallel on a separate VM. Number of bins per group should correspond to a number of CPU cores in each of the machine. It is not recommended to have more than 50 groups at the same time because cromwell wouldn’t be able to manage all the machines and will lose some of the groups. - CASExtractBQ.extract.output_bucket_name - Working GCS Bucket name - CASExtractBQ.extract.extract_bucket_path - A GCS path to the extract table. Should be the same as the one used in CASPrepareExtractBQ.prepare_extract.extract_bucket_path to use metadata file produced by extract script

Example

{
  "CASExtractBQ.extract.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
  "CASExtractBQ.extract.bq_dataset": "cas_test_dataset",
  "CASExtractBQ.extract.extract_table_prefix": "fg_extract",
  "CASExtractBQ.bin_borders": [[0, 9], [10, 19], [20, 29], [30, 39], [40, 49], [50, 59]],
  "CASExtractBQ.extract.output_bucket_name": "cellarium-file-system",
  "CASExtractBQ.extract.extract_bucket_path": "curriculum/fg_extract",
  "CASExtractBQ.extract.obs_columns_to_include": "cell_type,total_mrna_umis,donor_id,assay,development_stage,disease,organism,sex,tissue"
}

bq_ops.test_data_cycle

Only requires a docker image

{
   "CASTestDataCycle.test_data_cycle.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1"
}

model_training.train_incremental_pca (Deprecated)

{
  "CASTrainIncrementalPCA.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
  "CASTrainIncrementalPCA.bucket_name": "dsp-cell-annotation-service",
  "CASTrainIncrementalPCA.data_storage_path": "cas_50m_homo_sapiens_extract_4m",
  "CASTrainIncrementalPCA.checkpoint_save_path": "pca_incremental_4m_june",
  "CASTrainIncrementalPCA.n_components": 512,
  "CASTrainIncrementalPCA.batch_size": 10000,
  "CASTrainIncrementalPCA.use_gpu": true
}

data_embedding (Deprecated)

{
  "CASPEmbedData.docker_image": "us-central1-docker.pkg.dev/dsp-cell-annotation-service/cas-services-cicd/cas-pytorch:1.0a1",
  "CASPEmbedData.bucket_name": "dsp-cell-annotation-service",
  "CASPEmbedData.data_storage_path": "cas_50m_homo_sapiens_extract_4m",
  "CASPEmbedData.dm_storage_path": "models/incremental_pca_003.pickle",
  "CASPEmbedData.output_storage_path": "embeddings_incremental_pca_003",
  "CASPEmbedData.running_script": "casp/services/data_embedding/main.py"
}