Data Manager
This module is designed for managing single-cell data in Cellarium DataStore. Currently, single-cell data are stored in BigQuery, which uses its own SQL dialect to handle data in the warehouse. This module should not be limited to only BigQuery, as there is potential to switch to other data warehouse products.
For more information about BigQuery SQL dialect, please visit BigQuery SQL dialect.
SQL Query Management
The template generator is used for managing SQL queries due to the following reasons:
cellarium-cloudrelies on a large number of queries with dynamic content (e.g., optional joins, different filters inwhereclauses).To maintain the DRY (Don’t Repeat Yourself) principle in SQL queries.
To separate SQL code from Python files, which makes our codebase cleaner, more modular, and thus more easily maintainable.
Template Generator
Mako template generator is chosen for this purpose. Both Mako and Jinja2, originally designed for HTML file templating, were considered suitable for managing SQL queries. More information can be found in these references:
Wrapper libraries for both Jinja and Mako specifically used for SQL management are also available:
However, it appears that these libraries are not well-maintained and/or have few stars on GitHub.
Why Mako?
The Mako template generator is preferred for its better syntax. It is found to be easier to use and has a more Pythonic approach. It also allows for the seamless integration of Python helper functions within templates.
Using the data_access_manager module
Create a SQL template. Consider the following points: * Keep each query in a separate template file. * Use
.sql.makoextension as it explicitly indicates that the templates are not executable queries yet. * Utilizedata_access_manager.sql.mako_helpers, feel free to extend helpers if needed.Initialize
data_access_manager.sql.TemplateDatainstance with all the variables needed for the template.Call
data_access_manager.sql.renderwith a SQL template path anddata_access_manager.sql.TemplateDatainstance.Use a query to execute.
Example
SQL Template example
your_path_to_templates/templated_query.sql.mako<%! from casp.data_access_manager.sql import mako_helpers %> select ${mako_helpers.parse_column_names(select_columns)} from `${project}.${dataset}.cell_info` ci % if "ii.dataset_id" in select_columns: join `${project}.${dataset}.ingest_info` ii on (ci.ingest_id = ii.id) $ endif limit ${limit_by}Create
data_access_manager.sql.TemplateDatafrom casp.data_manager import sql select_columns = ["id", "cell_type"] # This could be used by a `mako_helpers.parse_column_names` additional_kwargs = {"limit_by": 100} # Variables to be parsed in a Template template_data = sql.TemplateData( project="test-project", dataset="my-test-dataset", select=select_columns, **additional_kwargs )
Render SQL Query
rendered_sql_query = sql.render( template_path="your_path_to_templates/templated_query.sql.mako", template_data=template_data, ) # Execute your query...
Module Structure
The Data Manager data_manager module consists of the following structure:
├── base_data_manager.py # Base class for all data managers in the project
└── sql
├── __init__.py
├── constants.py
├── mako_helpers.py # Helper functions used in SQL templates
├── query.py # Query management / validation directory
├── template_data.py # Template Data Parsing class
└── templates # All SQL templates go here
└── ...
└── validation
└── ...
Security Considerations
SQL Injection Awareness
When working with SQL Template building, it’s important to be aware of SQL injection vulnerabilities. SQL Template building relies on string interpolation, which can be susceptible to SQL injection attacks. Current implementation of SQL query construction doesn’t leverage SQL injection checks.
Scope: All SQL queries executed within bq_scripts are designed exclusively for internal use within the cellarium-cloud infrastructure. bq_scripts is a secure module and has never been exposed to user clients or APIs. Therefore, it is safe to use template building techniques for managing our internal SQL queries.
Best Practices
To ensure the security of your application, please follow these best practices:
Avoid External Exposure: Do not use SQL template building tools for constructing queries that will be exposed to end-users or external entities.
Parameterized Queries: If there is a need to expose SQL queries to end-users, always use Parameterized Queries. Parameterized queries help prevent SQL injection by separating data from SQL code.
Consider extending `mako_helpers`: To simplify the implementation of parameterized queries, consider expanding mako_helpers to work seamlessly with Parameterized Queries.
By following these practices, you can ensure the safe and secure use of this module within the project.
To Consider
Currently, there is no SQL formatting in the casp.data_access_manager.sql module. There’s no SQL formatter tool that supports the Mako template manager. However, there is one that supports Jinja: sqlfmt. Jinja was not chosen solely for the reason that a formatting tool supports it. Hopefully, a SQL formatting tool suitable for SQL Mako templates will be found. At the moment, it is advised to keep all sql.mako templates in a similar format manually. Please use lower-case style for SQL query consistency.