BigQuery Data Operations

Running Locally

casp.bq_scripts.run.bq_ops_create_ingest_files(gcs_bucket_name: str, gcs_file_path: str, uns_meta_keys: str, cas_cell_index_start: int, cas_feature_index_start: int, load_uns_data: bool, original_feature_id_lookup: str, dataset_id: str, dataset_version_id: str, gcs_stage_dir: str)[source]

Create ingest files and upload them in a stage GCS bucket directory. High level entry point, reads the input AnnData file and generates Avro files for ingest, cells, features, and raw / core data.

Parameters:
  • gcs_bucket_name – GCS Bucket name

  • gcs_file_path – GCS Bucket input file path to process

  • load_uns_data – Whether to load uns (unstructured) metadata

  • cas_cell_index_start – Starting number for cell index. If None, increment of maximum index would be used. None requires dataset to be set as it will use dataset to get the maximum index in the cell_info table

  • cas_feature_index_start – Starting number for feature index. If None, increment of maximum index would be used. None requires dataset to be set as it will use dataset to get the maximum index in the cell_info table

  • uns_meta_keys – Comma separated list with a set of keys that need to be dumped in ingest. If None, dump all.

  • original_feature_id_lookup – A column name in var dataframe from where to get original feature ids. In most of the cases it will be a column with ENSEMBL gene IDs. Default is index which means that an index column of var dataframe would be used.

  • dataset_id – CZI Dataset ID

  • dataset_version_id – CZI Dataset version ID

  • gcs_stage_dir – Stage directory in GCS Bucket where to upload the files

Example usage

To create ingest files and upload them to a GCS bucket, use the following command:

python casp/bq_scripts/run.py bq-ops-create-ingest-files \
    --gcs-bucket-name my-bucket \
    --gcs-file-path path/to/input/file \
    --uns-meta-keys key1,key2,key3 \
    --cas-cell-index-start 1000 \
    --cas-feature-index-start 2000 \
    --load-uns-data True \
    --original-feature-id-lookup gene_id \
    --dataset-id 12345 \
    --dataset-version-id 67890 \
    --gcs-stage-dir path/to/stage/dir
casp.bq_scripts.run.bq_ops_ingest_data_to_bq(project_id: str, gcs_bucket_name: str, dataset: str, gcs_stage_dir: str, gcs_error_file_path: str, max_retry_attempts: int = 5)[source]

Ingest files prepared by bq_scripts.anndata_to_avro script. If error happens during ingest, the script would retry max_retry_attempts times.

Parameters:
  • project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.

  • gcs_bucket_name – GCS Bucket name

  • dataset – BigQuery dataset name

  • gcs_stage_dir – GCS directory where ingest files are stored

  • gcs_error_file_path – GCS file path to save a csv table with error per input file occurred. If None, csv file doesn’t get saved.
    Default: None

  • max_retry_attempts – Maximum number for retries per ingest
    Default: 5

Example usage

To ingest data into BigQuery with retries, use the following command:

python casp/bq_scripts/run.py bq-ops-ingest-data-to-bq \
    --project-id my-gcp-project \
    --gcs-bucket-name my_bucket \
    --dataset my_dataset \
    --gcs-stage-dir path/to/ingest/files \
    --gcs-error-file-path path/to/error/file.csv \
    --max-retry-attempts 5
casp.bq_scripts.run.bq_ops_precalculate_fields(project_id: str, dataset: str, fields: str) None[source]

Precalculate fields in BigQuery. You can find more details on which particular fields can be precalculated in casp/bq_scripts/precalculate_fields.py.

Parameters:
  • project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.

  • dataset – The ID of the dataset in BigQuery where the data is hosted.

  • fields – Comma separated list of fields in a single string.

Example usage

To precalculate a field named “total_mrna_umis” in a BigQuery dataset, use the following command:

python casp/bq_scripts/run.py bq-ops-precalculate-fields \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --fields total_mrna_umis
casp.bq_scripts.run.bq_ops_prepare_extract(project_id: str, dataset: str, extract_table_prefix: str, fq_allowed_original_feature_ids: str, filters_json_path: str, obs_columns_to_include: str, bucket_name: str, extract_bucket_path: str, extract_bin_size: int = 10000, ci_random_seed_offset: int = 0, ci_partition_bin_count: int = 40000, ci_partition_size: int = 10) None[source]

Prepare CAS BigQuery tables used for data extraction.

Parameters:
  • project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.

  • dataset – The ID of the dataset in BigQuery where the data is hosted.

  • extract_table_prefix – A prefix string for naming tables or for similar purposes.

  • fq_allowed_original_feature_ids – BigQuery table with feature schema needed for the extract

  • bucket_name – GCS Bucket name where to store the metadata files.

  • extract_bucket_path – GCS Bucket path where the extract will be executed. Used to save metadata files there. It is required to use the same bucket path during extract

  • extract_bin_size – The size for the bins where cells are allocated.
    Default: 10000

  • ci_random_seed_offset – Optional offset for the farm_fingerprint for deterministic randomization Used in cas_cell_info table.
    Default: 0

  • ci_partition_bin_count – The count of bins for partitioning cas_cell_info table.
    Default: 40000

  • ci_partition_size – The size for the partitions in cas_cell_info table.
    Default: 10

  • filters_json_path

    A path to a JSON file containing filters that should be included in a SQL query. The JSON should represent a dictionary containing filter criteria, structured as {column_name__filter_type: value}. Supported filter types:

    • "eq": Used for an ‘equals’ comparison.

      Example: {"organism__eq": "Homo sapiens"} results in organism='Homo sapiens'.

    • "in": Used for an ‘in’ comparison with a set of values.

      Example: {"cell_type__in": ["T cell", "neuron"]} results in cell_type IN ('T cell', 'neuron').

    • "not_eq": Used for a ‘not equals’ comparison, meaning that the query would exclude rows with the specified value.

      Example: {"assay__not_eq": "Drop-seq"} results in assay!='Drop-seq'.

    • "not_in": Used for a ‘not in’ comparison with a set of values to exclude.

      Example: {"assay__not_in": ["Drop-seq", "microwell-seq", "BD Rhapsody Targeted mRNA"]} results in assay NOT IN ('Drop-seq', 'microwell-seq', 'BD Rhapsody Targeted mRNA').

    • "gt": Used for a ‘greater than’ comparison.

      Example: {"total_mrna_umis__gt": 13000} results in total_mrna_umis > 13000.

    • "gte": Used for a ‘greater than or equal’ comparison.

      Example: {"total_mrna_umis__gte": 13000} results in total_mrna_umis >= 13000.

    • "lt": Used for a ‘less than’ comparison.

      Example: {"total_mrna_umis__lt": 13000} results in total_mrna_umis < 13000.

    • "lte": Used for a ‘less than or equal’ comparison.

      Example: {"total_mrna_umis__lte": 13000} results in total_mrna_umis <= 13000.

  • obs_columns_to_include – Optional list of columns from cas_cell_info table to include in adata.obs. If not provided, no specific columns would be added to adata.obs apart from cas_cell_index. Note: It is required to provide the column names along with the aliases for the tables to which they belong. However, the output extract table would contain only the column names, without any aliases. Example: ["c.cell_type", "c.donor_id", "c.sex", "i.dataset_id"]

Example filters JSON

The following is an example JSON dictionary for the filters_json_path parameter, demonstrating a few filter types:

{
    "organism__eq": "Homo sapiens",
    "cell_type__in": ["T cell", "neuron"],
    "assay__not_eq": "Drop-seq",
    "total_mrna_umis__gt": 13000
}
Example usage

To prepare CAS BigQuery tables for data extraction, use the following command:

python casp/bq_scripts/run.py bq-ops-prepare-extract \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --extract-table-prefix my_prefix \
    --fq-allowed-original-feature-ids my_feature_ids_table \
    --extract-bin-size 10000 \
    --filters-json-path path/to/filters.json \
    --obs-columns-to-include "c.cell_type,c.donor_id,c.sex,i.dataset_id" \
    --bucket-name my_bucket \
    --extract-bucket-path path/to/extract \
    --ci-random-seed-offset 0 \
    --ci-partition-bin-count 40000 \
    --ci-partition-size 10
casp.bq_scripts.run.bq_ops_extract_bins_in_parallel_workers(project_id: str, dataset: str, extract_table_prefix: str, start_bin: int, end_bin: int, output_bucket_name: str, extract_bucket_path: str, obs_columns_to_include: str) None[source]

Extract bins in parallel workers with using python concurrent.futures.ThreadPoolExecutor.

Parameters:
  • project_id – Google Cloud Project id

  • dataset – BigQuery Dataset

  • extract_table_prefix – Prefix of extract tables

  • start_bin – Start bin to extract

  • end_bin – End bin to extract

  • output_bucket_name – Name of GCS bucket

  • extract_bucket_path – Path where the extract files and subdirectories should be located. Should correspond to the directory provided to prepare_extract script as current script uses shared_meta files.

  • obs_columns_to_include – Comma separated columns from cas_cell_info table to include in adata.obs. Note: It is required to provide the column names along with the aliases for the tables to which they belong. E.g. "c" for cas_cell_info, "i" for cas_ingest_info. However, the output extract table would contain only the column names, without any aliases. Example: ``[“c.cell_type”, “c.donor_id”, “c.sex”, “i.dataset_id”]`

Example usage

To extract bins in parallel workers, use the following command:

python casp/bq_scripts/run.py bq-ops-extract-bins-in-parallel-workers \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --extract-table-prefix my_prefix \
    --start-bin 0 \
    --end-bin 100 \
    --output-bucket-name my_output_bucket \
    --extract-bucket-path path/to/extract \
    --obs-columns-to-include "c.cell_type,c.donor_id,c.sex,i.dataset_id"