BigQuery Data Operations

Running Locally

casp.bq_scripts.run.bq_ops_create_ingest_files(gcs_bucket_name: str, gcs_file_path: str, uns_meta_keys: str, cas_cell_index_start: int, cas_feature_index_start: int, load_uns_data: bool, original_feature_id_lookup: str, dataset_id: str, dataset_version_id: str, gcs_stage_dir: str)[source]

Create ingest files and upload them in a stage GCS bucket directory. High level entry point, reads the input AnnData file and generates Avro files for ingest, cells, features, and raw / core data.

Parameters:

gcs_bucket_name – GCS Bucket name
gcs_file_path – GCS Bucket input file path to process
load_uns_data – Whether to load uns (unstructured) metadata
cas_cell_index_start – Starting number for cell index. If None, increment of maximum index would be used. None requires dataset to be set as it will use dataset to get the maximum index in the cell_info table
cas_feature_index_start – Starting number for feature index. If None, increment of maximum index would be used. None requires dataset to be set as it will use dataset to get the maximum index in the cell_info table
uns_meta_keys – Comma separated list with a set of keys that need to be dumped in ingest. If None, dump all.
original_feature_id_lookup – A column name in var dataframe from where to get original feature ids. In most of the cases it will be a column with ENSEMBL gene IDs. Default is index which means that an index column of var dataframe would be used.
dataset_id – CZI Dataset ID
dataset_version_id – CZI Dataset version ID
gcs_stage_dir – Stage directory in GCS Bucket where to upload the files

Example usage

To create ingest files and upload them to a GCS bucket, use the following command:

python casp/bq_scripts/run.py bq-ops-create-ingest-files \
    --gcs-bucket-name my-bucket \
    --gcs-file-path path/to/input/file \
    --uns-meta-keys key1,key2,key3 \
    --cas-cell-index-start 1000 \
    --cas-feature-index-start 2000 \
    --load-uns-data True \
    --original-feature-id-lookup gene_id \
    --dataset-id 12345 \
    --dataset-version-id 67890 \
    --gcs-stage-dir path/to/stage/dir

casp.bq_scripts.run.bq_ops_ingest_data_to_bq(project_id: str, gcs_bucket_name: str, dataset: str, gcs_stage_dir: str, gcs_error_file_path: str, max_retry_attempts: int = 5)[source]

Ingest files prepared by bq_scripts.anndata_to_avro script. If error happens during ingest, the script would retry max_retry_attempts times.

Parameters:

project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.
gcs_bucket_name – GCS Bucket name
dataset – BigQuery dataset name
gcs_stage_dir – GCS directory where ingest files are stored
gcs_error_file_path – GCS file path to save a csv table with error per input file occurred. If None, csv file doesn’t get saved.
Default: None
max_retry_attempts – Maximum number for retries per ingest
Default: 5

Example usage

To ingest data into BigQuery with retries, use the following command:

python casp/bq_scripts/run.py bq-ops-ingest-data-to-bq \
    --project-id my-gcp-project \
    --gcs-bucket-name my_bucket \
    --dataset my_dataset \
    --gcs-stage-dir path/to/ingest/files \
    --gcs-error-file-path path/to/error/file.csv \
    --max-retry-attempts 5

casp.bq_scripts.run.bq_ops_precalculate_fields(project_id: str, dataset: str, fields: str) → None[source]

Precalculate fields in BigQuery. You can find more details on which particular fields can be precalculated in casp/bq_scripts/precalculate_fields.py.

Parameters:

project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.
dataset – The ID of the dataset in BigQuery where the data is hosted.
fields – Comma separated list of fields in a single string.

Example usage

To precalculate a field named “total_mrna_umis” in a BigQuery dataset, use the following command:

python casp/bq_scripts/run.py bq-ops-precalculate-fields \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --fields total_mrna_umis

casp.bq_scripts.run.bq_ops_prepare_extract(project_id: str, dataset: str, extract_table_prefix: str, fq_allowed_original_feature_ids: str, filters_json_path: str, obs_columns_to_include: str, bucket_name: str, extract_bucket_path: str, extract_bin_size: int = 10000, ci_random_seed_offset: int = 0, ci_partition_bin_count: int = 40000, ci_partition_size: int = 10) → None[source]

Prepare CAS BigQuery tables used for data extraction.

Parameters:

project_id – The ID of the Google Cloud project where the BigQuery dataset is hosted.
dataset – The ID of the dataset in BigQuery where the data is hosted.
extract_table_prefix – A prefix string for naming tables or for similar purposes.
fq_allowed_original_feature_ids – BigQuery table with feature schema needed for the extract
bucket_name – GCS Bucket name where to store the metadata files.
extract_bucket_path – GCS Bucket path where the extract will be executed. Used to save metadata files there. It is required to use the same bucket path during extract
extract_bin_size – The size for the bins where cells are allocated.
Default: 10000
ci_random_seed_offset – Optional offset for the farm_fingerprint for deterministic randomization Used in cas_cell_info table.
Default: 0
ci_partition_bin_count – The count of bins for partitioning cas_cell_info table.
Default: 40000
ci_partition_size – The size for the partitions in cas_cell_info table.
Default: 10
filters_json_path –
A path to a JSON file containing filters that should be included in a SQL query. The JSON should represent a dictionary containing filter criteria, structured as {column_name__filter_type: value}. Supported filter types:
- "eq": Used for an ‘equals’ comparison.
  
  Example: {"organism__eq": "Homo sapiens"} results in organism='Homo sapiens'.
- "in": Used for an ‘in’ comparison with a set of values.
  
  Example: {"cell_type__in": ["T cell", "neuron"]} results in cell_type IN ('T cell', 'neuron').
- "not_eq": Used for a ‘not equals’ comparison, meaning that the query would exclude rows with the specified value.
  
  Example: {"assay__not_eq": "Drop-seq"} results in assay!='Drop-seq'.
- "not_in": Used for a ‘not in’ comparison with a set of values to exclude.
  
  Example: {"assay__not_in": ["Drop-seq", "microwell-seq", "BD Rhapsody Targeted mRNA"]} results in assay NOT IN ('Drop-seq', 'microwell-seq', 'BD Rhapsody Targeted mRNA').
- "gt": Used for a ‘greater than’ comparison.
  
  Example: {"total_mrna_umis__gt": 13000} results in total_mrna_umis > 13000.
- "gte": Used for a ‘greater than or equal’ comparison.
  
  Example: {"total_mrna_umis__gte": 13000} results in total_mrna_umis >= 13000.
- "lt": Used for a ‘less than’ comparison.
  
  Example: {"total_mrna_umis__lt": 13000} results in total_mrna_umis < 13000.
- "lte": Used for a ‘less than or equal’ comparison.
  
  Example: {"total_mrna_umis__lte": 13000} results in total_mrna_umis <= 13000.
obs_columns_to_include – Optional list of columns from cas_cell_info table to include in adata.obs. If not provided, no specific columns would be added to adata.obs apart from cas_cell_index. Note: It is required to provide the column names along with the aliases for the tables to which they belong. However, the output extract table would contain only the column names, without any aliases. Example: ["c.cell_type", "c.donor_id", "c.sex", "i.dataset_id"]

Example filters JSON

The following is an example JSON dictionary for the filters_json_path parameter, demonstrating a few filter types:

{
    "organism__eq": "Homo sapiens",
    "cell_type__in": ["T cell", "neuron"],
    "assay__not_eq": "Drop-seq",
    "total_mrna_umis__gt": 13000
}

Example usage

To prepare CAS BigQuery tables for data extraction, use the following command:

python casp/bq_scripts/run.py bq-ops-prepare-extract \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --extract-table-prefix my_prefix \
    --fq-allowed-original-feature-ids my_feature_ids_table \
    --extract-bin-size 10000 \
    --filters-json-path path/to/filters.json \
    --obs-columns-to-include "c.cell_type,c.donor_id,c.sex,i.dataset_id" \
    --bucket-name my_bucket \
    --extract-bucket-path path/to/extract \
    --ci-random-seed-offset 0 \
    --ci-partition-bin-count 40000 \
    --ci-partition-size 10

casp.bq_scripts.run.bq_ops_extract_bins_in_parallel_workers(project_id: str, dataset: str, extract_table_prefix: str, start_bin: int, end_bin: int, output_bucket_name: str, extract_bucket_path: str, obs_columns_to_include: str) → None[source]

Extract bins in parallel workers with using python concurrent.futures.ThreadPoolExecutor.

Parameters:

project_id – Google Cloud Project id
dataset – BigQuery Dataset
extract_table_prefix – Prefix of extract tables
start_bin – Start bin to extract
end_bin – End bin to extract
output_bucket_name – Name of GCS bucket
extract_bucket_path – Path where the extract files and subdirectories should be located. Should correspond to the directory provided to prepare_extract script as current script uses shared_meta files.
obs_columns_to_include – Comma separated columns from cas_cell_info table to include in adata.obs. Note: It is required to provide the column names along with the aliases for the tables to which they belong. E.g. "c" for cas_cell_info, "i" for cas_ingest_info. However, the output extract table would contain only the column names, without any aliases. Example: ``[“c.cell_type”, “c.donor_id”, “c.sex”, “i.dataset_id”]`

Example usage

To extract bins in parallel workers, use the following command:

python casp/bq_scripts/run.py bq-ops-extract-bins-in-parallel-workers \
    --project-id my-gcp-project \
    --dataset my_dataset \
    --extract-table-prefix my_prefix \
    --start-bin 0 \
    --end-bin 100 \
    --output-bucket-name my_output_bucket \
    --extract-bucket-path path/to/extract \
    --obs-columns-to-include "c.cell_type,c.donor_id,c.sex,i.dataset_id"