
GoogleCloudPlatform/professional-services-data-validator

Data Validation Tool

The Data Validation Tool is an open-source Python CLI tool, based on the Ibis framework, that compares heterogeneous data source tables with multi-leveled validation functions.

Data validation is a critical step in a data warehouse, database, or data lake migration project where data from both the source and the target tables are compared to ensure they are matched and correct after each migration step (e.g. data and schema migration, SQL script translation, ETL migration, etc.). The Data Validation Tool (DVT) provides an automated and repeatable solution to perform this task.

DVT supports the following validations:

  • Column validation (count, sum, avg, min, max, stddev, group by)
  • Row validation (Not supported for FileSystem connections)
  • Schema validation
  • Custom Query validation
  • Ad hoc SQL exploration

For the supported connection types, see the Connections page, which provides details about how to create and list connections for the validation tool.

Disclaimer

This is not an officially supported Google product. Please be aware that bugs may lurk, and that we reserve the right to make small backwards-incompatible changes. Feel free to open bugs or feature requests, or contribute directly (see CONTRIBUTING.md for details).

Installation

The Installation page describes the prerequisites and setup steps needed to install and use the Data Validation Tool.

Usage

Before using this tool, you will need to create connections to the source and target tables. Once the connections are created, you can run validations on those tables. Validation results can be printed to stdout (the default) or written to BigQuery (recommended). DVT also allows you to save and edit validation configurations in a YAML or JSON file, which is useful for running common validations or updating a configuration.

Managing Connections

Before running validations, DVT requires setting up a source and target connection. These connections can be stored locally or in a GCS directory. To create connections, please review the Connections page.

Running Validations

The CLI is the main interface for this tool, and it has several commands that can be used to create and run validations. DVT is designed to run in an environment connected to GCP services, specifically BigQuery, GCS, and Secret Manager. If DVT is being run on-premises or in an environment with restricted access to GCP services, see the Running DVT On-Premises section. Below are the command syntax and options for running validations.

Alternatives to running DVT in the CLI include deploying DVT to Cloud Run, Cloud Functions, or Airflow (Examples Here). See the Validation Logic section to learn more about how DVT uses the CLI to generate SQL queries.

Note that we do not support nested or complex columns for column or row validations.

Column Validations

Below is the command syntax for column validations. To run a grouped column validation, simply specify the --grouped-columns flag.

You can specify a list of string columns for aggregations in order to calculate an aggregation over the length(string_col). Similarly, you can specify timestamp/date columns for aggregation over the unix_seconds(timestamp_col). Running an aggregation over all columns ('*') will only run over numeric columns, unless the --wildcard-include-string-len or --wildcard-include-timestamp flags are present.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate column
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table. Or shorthand schema.* for all tables.
                        Target schema name and table name are optional.
                        i.e 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--grouped-columns or -gc GROUPED_COLUMNS]
                        Comma separated list of columns for Group By i.e col_a,col_b
  [--count COLUMNS]     Comma separated list of columns for count or * for all columns
  [--sum COLUMNS]       Comma separated list of columns for sum or * for all numeric
  [--min COLUMNS]       Comma separated list of columns for min or * for all numeric
  [--max COLUMNS]       Comma separated list of columns for max or * for all numeric
  [--avg COLUMNS]       Comma separated list of columns for avg or * for all numeric
  [--std COLUMNS]       Comma separated list of columns for stddev_samp or * for all numeric.
                        Please note that not all supported SQL engines give results from STDDEV_SAMP (or the engine-specific equivalent) that
                        are comparable across all other supported SQL engines. This option may produce unreliable results.
  [--exclude-columns or -ec]
                        Flag to indicate the list of columns provided should be excluded and not included.
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--wildcard-include-string-len or -wis]
                        If flag is present, include string columns in aggregation as len(string_col)
  [--wildcard-include-timestamp or -wit]
                        If flag is present, include timestamp/date columns in aggregation as unix_seconds(ts_col)
  [--cast-to-bigint or -ctb]
                        If flag is present, cast all int32 columns to int64 before aggregation
  [--filters SOURCE_FILTER:TARGET_FILTER]
                        Colon separated string values of source and target filters.
                        If target filter is not provided, the source filter will run on source and target tables.
                        See: *Filters* section
  [--config-file or -c CONFIG_FILE]
                        YAML Config File Path to be used for storing validations and other features. Supports GCS and local paths.
                        See: *Running DVT with YAML Configuration Files* section
  [--config-file-json or -cj CONFIG_FILE_JSON]
                        JSON Config File Path to be used for storing validations only for application purposes.
  [--threshold or -th THRESHOLD]
                        Float value. Maximum pct_difference allowed for validation to be considered a success. Defaults to 0.0
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key value pair labels for the run.
  [--format or -fmt FORMAT]
                        Format for stdout output. Supported formats are (text, csv, json, table). Defaults to table.
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail). If no list is provided, all statuses are returned.

The default aggregation type is a 'COUNT *', which will run in addition to the validations you specify. To remove this default, use YAML configs.

The Examples page provides many examples of how the tool can be used to run powerful validations without writing any queries.
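
For example, a minimal column validation might look like the following sketch. The connection name my_bq_conn is an assumed, pre-created connection (see Managing Connections), and the aggregation columns are illustrative:

# Compare row count and basic aggregates of a numeric column between source and target
data-validation validate column \
  -sc my_bq_conn \
  -tc my_bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_trips \
  --count '*' \
  --sum tripduration \
  --min tripduration \
  --max tripduration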

Row Validations

(Note: Row hash validation not supported for FileSystem connections. In addition, please note that SHA256 is not a supported function on Teradata systems. If you wish to perform this comparison on Teradata you will need to deploy a UDF to perform the conversion.)

Below is the command syntax for row validations. Row-level validations require unique columns to join row sets; these are either inferred from the source/target table or provided via the --primary-keys flag. You must also supply one of the --hash, --concat, or --comparison-fields flags. See the Primary Keys section.

The --comparison-fields flag specifies the values (e.g. columns) whose raw values will be compared based on the primary key join. The --hash flag will run a checksum across specified columns in the table. This will include casting to string, sanitizing the data (ifnull, rtrim, upper), concatenating, and finally hashing the row.

Under the hood, row validation uses Calculated Fields to apply functions such as IFNULL() or RTRIM(). These can be edited in the YAML or JSON config file to customize your row validation.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate row
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        i.e 'bigquery-public-data.new_york_citibike.citibike_trips'
  --comparison-fields or -comp-fields FIELDS
                        Comma separated list of columns to compare. Can either be a physical column or an alias
                        See: *Calculated Fields* section for details
  --hash COLUMNS        Comma separated list of columns to hash or * for all columns
  --concat COLUMNS      Comma separated list of columns to concatenate or * for all columns (use if a common hash function is not available between databases)
  --max-concat-columns INT, -mcc INT
                        Maximum number of columns used in one --hash or --concat validation. When there are more columns in the validation, the validation will be split into multiple validations. There are engine specific defaults, so most users do not need to use this option unless they encounter errors.
  [--primary-keys PRIMARY_KEYS, -pk PRIMARY_KEYS]
                        Comma separated list of primary key columns, when not specified the value will be inferred
                        from the source or target table if available.  See *Primary Keys* section
  [--exclude-columns or -ec]
                        Flag to indicate the list of columns provided should be excluded from hash or concat instead of included.
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--filters SOURCE_FILTER:TARGET_FILTER]
                        Colon separated string values of source and target filters.
                        If target filter is not provided, the source filter will run on source and target tables.
                        See: *Filters* section
  [--config-file or -c CONFIG_FILE]
                        YAML Config File Path to be used for storing validations and other features. Supports GCS and local paths.
                        See: *Running DVT with YAML Configuration Files* section
  [--config-file-json or -cj CONFIG_FILE_JSON]
                        JSON Config File Path to be used for storing validations only for application purposes.
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key value pair labels for the run.
  [--format or -fmt FORMAT]
                        Format for stdout output. Supported formats are (text, csv, json, table). Defaults to table.
  [--use-random-row or -rr]
                        Finds a set of random rows of the first primary key supplied.
  [--random-row-batch-size or -rbs]
                        Row batch size used for random row filters (default 10,000).
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail). If no list is provided, all statuses are returned.
  [--case-insensitive-match, -cim]
                        Performs a case insensitive match by adding an UPPER() before comparison.
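
For example, a minimal row validation on the public citibike_stations table (mirroring the dry-run example later in this document) might look like this; my_bq_conn is an assumed, pre-created connection:

# Hash every column and compare rows joined on station_id
data-validation validate row \
  -sc my_bq_conn \
  -tc my_bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_stations \
  --primary-keys station_id \
  --hash '*'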

Generate Partitions for Large Row Validations

When performing row validations, the Data Validation Tool brings each row into memory and can run into a MemoryError. Below is the command syntax for generating partitions so that row validations on a large dataset (table or custom query) can be split into smaller pieces and avoid MemoryError. Each partition contains a range of primary key(s), the key ranges across partitions are distinct, and the partitions have a nearly equal number of rows. See the Primary Keys section.

The command generates and stores multiple YAML validations, each representing a chunk of the large dataset bounded by filters (WHERE primary_key(s) >= X AND primary_key(s) < Y). The parts-per-file parameter specifies the number of validations in one YAML file. Each YAML file contains parts-per-file validations, except the last one, which contains the remaining partitions (i.e. parts-per-file may not divide partition-num evenly). You can then run the validations in the directory serially, or in parallel across multiple containers or VMs, with the data-validation configs run --config-dir PATH command as described here.

The command takes the same parameters as row validation, plus a few parameters to support partitioning. Single and multiple primary keys are supported, and keys can be of any indexable type except date and timestamp types. You can specify either the tables being validated or the source and target custom queries. The partition-key parameter used in earlier versions is no longer supported.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  generate-table-partitions
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        i.e 'bigquery-public-data.new_york_citibike.citibike_trips'
                        Either --tables-list or --source-query (or file) and --target-query (or file) must be provided
  --source-query SOURCE_QUERY, -sq SOURCE_QUERY
                        Source sql query
                        Either --tables-list or --source-query (or file) and --target-query (or file) must be provided
  --source-query-file  SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
                        File containing the source sql command. Supports GCS and local paths.
  --target-query TARGET_QUERY, -tq TARGET_QUERY
                        Target sql query
                        Either --tables-list or --source-query (or file) and --target-query (or file) must be provided
  --target-query-file TARGET_QUERY_FILE, -tqf TARGET_QUERY_FILE
                        File containing the target sql command. Supports GCS and local paths.
  --comparison-fields or -comp-fields FIELDS
                        Comma separated list of columns to compare. Can either be a physical column or an alias
                        See: *Calculated Fields* section for details
  --hash COLUMNS        Comma separated list of columns to hash or * for all columns
  --concat COLUMNS      Comma separated list of columns to concatenate or * for all columns (use if a common hash function is not available between databases)
  --config-dir CONFIG_DIR, -cdir CONFIG_DIR
                        Directory Path to store YAML Config Files
                        GCS: Provide a full gs:// path of the target directory. Eg: `gs://<BUCKET>/partitions_dir`
                        Local: Provide a relative path of the target directory. Eg: `partitions_dir`
                        If invoked with -tbls parameter, the validations are stored in a directory named <schema>.<table>, otherwise the directory is named `custom.<random_string>`
  --partition-num INT, -pn INT
                        Number of partitions into which the table should be split, e.g. 1000 or 10000
                        In case this value exceeds the row count of the source/target table, it will be decreased to max(source_row_count, target_row_count)
  [--primary-keys PRIMARY_KEYS, -pk PRIMARY_KEYS]
                        Comma separated list of primary key columns, when not specified the value will be inferred
                        from the source or target table if available.  See *Primary Keys* section
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--parts-per-file INT], [-ppf INT]
                        Number of partitions in a yaml file, default value 1.
  [--filters SOURCE_FILTER:TARGET_FILTER]
                        Colon separated string values of source and target filters.
                        If target filter is not provided, the source filter will run on source and target tables.
                        See: *Filters* section
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key value pair labels for the run.
  [--format or -fmt FORMAT]
                        Format for stdout output. Supported formats are (text, csv, json, table). Defaults to table.
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail). If no list is provided, all statuses are returned.
  [--case-insensitive-match, -cim]
                        Performs a case insensitive match by adding an UPPER() before comparison.
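
For example, the following sketch splits a hypothetical my_schema.my_large_table into 200 partitions and writes 20 YAML files of 10 validations each to a GCS directory; the connection, bucket, table, and key names are placeholders:

# Generate partitioned row-validation YAMLs for a large table
data-validation generate-table-partitions \
  -sc my_source_conn \
  -tc my_bq_conn \
  -tbls my_schema.my_large_table \
  --primary-keys id \
  --hash '*' \
  --config-dir gs://my-bucket/partitions_dir \
  --partition-num 200 \
  --parts-per-file 10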

Schema Validations

Below is the syntax for schema validations. These can be used to compare column names (case insensitively) and types between source and target.

Note: An exclamation point before a data type (!string) signifies the column is non-nullable or required.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate schema
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table. Or shorthand schema.* for all tables.
                        Target schema name and table name are optional.
                        e.g.: 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--config-file or -c CONFIG_FILE]
                        YAML Config File Path to be used for storing validations and other features. Supports GCS and local paths.
                        See: *Running DVT with YAML Configuration Files* section
  [--config-file-json or -cj CONFIG_FILE_JSON]
                        JSON Config File Path to be used for storing validations only for application purposes.
  [--format or -fmt]    Format for stdout output. Supported formats are (text, csv, json, table).
                        Defaults  to table.
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail).
                        If no list is provided, all statuses are returned.
  [--exclusion-columns or -ec EXCLUSION_COLUMNS]
                        Comma separated list of columns to be excluded from the schema validation, e.g.: col_a,col_b.
  [--allow-list or -al ALLOW_LIST]
                        Comma separated list of data-type mappings of source and destination data sources which will be validated in case of missing data types in destination data source. e.g: "decimal(4,2):decimal(5,4),!string:string"
  [--allow-list-file ALLOW_LIST_FILE, -alf ALLOW_LIST_FILE]
                        YAML file containing default --allow-list mappings. Can be used in conjunction with --allow-list.
                        e.g.: samples/allow_list/oracle_to_bigquery.yaml or gs://dvt-allow-list-files/oracle_to_bigquery.yaml
                        See example files in samples/allow_list/.
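
For example, a schema validation from a hypothetical Oracle connection to BigQuery, using the Oracle-to-BigQuery allow-list file shipped in the repository, might look like this; the connection, table, and column names are placeholders:

# Compare column names and types, excluding an audit column
data-validation validate schema \
  -sc ora_conn \
  -tc bq_conn \
  -tbls my_schema.my_table=my_dataset.my_table \
  --exclusion-columns etl_audit_col \
  -alf samples/allow_list/oracle_to_bigquery.yaml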

Custom Query Column Validations

Below is the command syntax for custom query column validations.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate custom-query column
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --source-query SOURCE_QUERY, -sq SOURCE_QUERY
                        Source sql query
                        Either --source-query or --source-query-file must be provided
  --source-query-file  SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
                        File containing the source sql command. Supports GCS and local paths.
  --target-query TARGET_QUERY, -tq TARGET_QUERY
                        Target sql query
                        Either --target-query or --target-query-file must be provided
  --target-query-file TARGET_QUERY_FILE, -tqf TARGET_QUERY_FILE
                        File containing the target sql command. Supports GCS and local paths.
  [--count COLUMNS]     Comma separated list of columns for count or * for all columns
  [--sum COLUMNS]       Comma separated list of columns for sum or * for all numeric
  [--min COLUMNS]       Comma separated list of columns for min or * for all numeric
  [--max COLUMNS]       Comma separated list of columns for max or * for all numeric
  [--avg COLUMNS]       Comma separated list of columns for avg or * for all numeric
  [--std COLUMNS]       Comma separated list of columns for stddev_samp or * for all numeric.
                        Please note that not all supported SQL engines give results from STDDEV_SAMP (or the engine-specific equivalent) that
                        are comparable across all other supported SQL engines. This option may produce unreliable results.
  [--exclude-columns or -ec]
                        Flag to indicate the list of columns provided should be excluded and not included.
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--config-file or -c CONFIG_FILE]
                        YAML Config File Path to be used for storing validations and other features. Supports GCS and local paths.
                        See: *Running DVT with YAML Configuration Files* section
  [--config-file-json or -cj CONFIG_FILE_JSON]
                        JSON Config File Path to be used for storing validations only for application purposes.
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key value pair labels for the run.
  [--format or -fmt FORMAT]
                        Format for stdout output. Supported formats are (text, csv, json, table). Defaults to table.
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail). If no list is provided, all statuses are returned.

The default aggregation type is a 'COUNT *'. If no aggregation flag (e.g. count, sum, min) is provided, the default aggregation will run.

The Examples page provides a few examples of how this tool can be used to run custom query column validations.
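
For example, a custom query column validation comparing an aggregate over source and target queries might look like the following sketch; the connection names, queries, and column names are illustrative:

# Compare counts and a sum computed over the result of two custom queries
data-validation validate custom-query column \
  -sc my_source_conn \
  -tc my_bq_conn \
  -sq 'SELECT region_id, amount FROM sales.orders' \
  -tq 'SELECT region_id, amount FROM my_dataset.orders' \
  --count '*' \
  --sum amount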

Custom Query Row Validations

(Note: Custom query row validation is not supported for FileSystem connections. Struct and array data types are not currently supported.)

In order to run custom query row validations, you need to pass the --hash flag, which specifies the fields of the custom query result that will be concatenated and hashed. The primary key should be included in the SELECT statement of both source_query.sql and target_query.sql. See the Primary Keys section.

Below is the command syntax for custom query row validations.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate custom-query row
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --source-query SOURCE_QUERY, -sq SOURCE_QUERY
                        Source sql query
                        Either --source-query or --source-query-file must be provided
  --source-query-file SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
                        File containing the source sql command. Supports GCS and local paths.
  --target-query TARGET_QUERY, -tq TARGET_QUERY
                        Target sql query
                        Either --target-query or --target-query-file must be provided
  --target-query-file TARGET_QUERY_FILE, -tqf TARGET_QUERY_FILE
                        File containing the target sql command. Supports GCS and local paths.
  --comparison-fields or -comp-fields FIELDS
                        Comma separated list of columns to compare. Can either be a physical column or an alias
                        See: *Calculated Fields* section for details
  --hash '*'            '*' to hash all columns.
  --concat COLUMNS      Comma separated list of columns to concatenate or * for all columns
                        (use if a common hash function is not available between databases)
  --max-concat-columns INT, -mcc INT
                        Maximum number of columns used in one --hash or --concat validation. When there are more columns in the validation, the validation will be split into multiple validations. There are engine specific defaults, so most users do not need to use this option unless they encounter errors.
  [--primary-keys PRIMARY_KEYS, -pk PRIMARY_KEYS]
                       Common column between source and target queries for join
  [--exclude-columns or -ec]
                        Flag to indicate the list of columns provided should be excluded from hash or concat instead of included.
  [--result-handler or -rh CONNECTION_NAME.SCHEMA.TABLE or BQ_PROJECT_ID.DATASET.TABLE]
                        Specify a BigQuery or PostgreSQL connection name as destination for validation results.
                        Also supports legacy BigQuery format BQ_PROJECT_ID.DATASET.TABLE.
                        See: *Validation Reports* section
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE or CONNECTION_NAME.DATASET.TABLE]
                        This option has been deprecated and will be removed in a future release.
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--config-file or -c CONFIG_FILE]
                        YAML Config File Path to be used for storing validations and other features. Supports GCS and local paths.
                        See: *Running DVT with YAML Configuration Files* section
  [--config-file-json or -cj CONFIG_FILE_JSON]
                        JSON Config File Path to be used for storing validations only for application purposes.
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key value pair labels for the run.
  [--format or -fmt FORMAT]
                        Format for stdout output. Supported formats are (text, csv, json, table). Defaults to table.
  [--filter-status or -fs STATUSES_LIST]
                        Comma separated list of statuses to filter the validation results. Supported statuses are (success, fail). If no list is provided, all statuses are returned.
  [--case-insensitive-match, -cim]
                        Performs a case insensitive match by adding an UPPER() before comparison.

The Examples page provides a few examples of how this tool can be used to run custom query row validations.
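
For example, a custom query row validation driven by query files might look like this sketch; the connection names and the id primary key are placeholders, and both source_query.sql and target_query.sql must select the primary key:

# Hash the custom query results and compare rows joined on id
data-validation validate custom-query row \
  -sc my_source_conn \
  -tc my_bq_conn \
  -sqf source_query.sql \
  -tqf target_query.sql \
  --primary-keys id \
  --hash '*'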

Dry Run Validation

The validate command takes a --dry-run command line flag that prints source and target SQL to stdout as JSON in lieu of performing a validation:

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  validate
  [--dry-run or -dr]    Prints source and target SQL to stdout in lieu of performing a validation.

For example, this flag can be used as follows:

> data-validation validate --dry-run row \
  -sc my_bq_conn \
  -tc my_bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_stations \
  --primary-keys station_id \
  --hash '*'
{
    "source_query": "SELECT `hash__all`, `station_id`\nFROM ...",
    "target_query": "SELECT `hash__all`, `station_id`\nFROM ..."
}

Running DVT On-Premises

On-premises environments can have limited access to GCP services. DVT supports using BigQuery for storing validation results, GCS for storage, and Secret Manager for secrets. You may also use BigQuery and Spanner as a source or target for validation. Service APIs (i.e. bigquery.googleapis.com) may not always be accessible due to firewall restrictions. Work with your network administrator to identify a way to access these services; they may set up a Private Service Connect endpoint. DVT supports accessing source and target tables in Spanner and BigQuery via endpoints set up in your network. The connection parameters for Spanner and BigQuery describe how to specify endpoints.

Running DVT with YAML Configuration Files

Running DVT with YAML configuration files is the recommended approach if:

  • you want to customize the configuration for any given validation OR
  • you want to run DVT at scale (i.e. run multiple validations sequentially or in parallel)

We recommend generating YAML configs with the --config-file <file-name> flag when running a validation command, which supports GCS and local paths.

You can use the data-validation configs command to run and view YAMLs.

data-validation
  [--verbose or -v ]
                        Verbose logging
  [--log-level or -ll]
                        Log Level to be assigned. Supported levels are (DEBUG,INFO,WARNING,ERROR,CRITICAL). Defaults to INFO.
  configs run
  [--config-file or -c CONFIG_FILE]
                        Path to YAML config file to run. Supports local and GCS paths.
  [--config-dir or -cdir CONFIG_DIR]
                        Directory path containing YAML configs to be run sequentially. Supports local and GCS paths.
  [--dry-run or -dr]    If this flag is present, prints the source and target SQL generated in lieu of running the validation.
  [--kube-completions or -kc]
                        Flag to indicate usage in Kubernetes index completion mode.
                        See *Scaling DVT* section
data-validation configs list
  [--config-dir or -cdir CONFIG_DIR]
                        GCS or local directory from which to list validation YAML configs. Defaults to current local directory.
data-validation configs get
  [--config-file or -c CONFIG_FILE] GCS or local path of validation YAML to print.

View the complete YAML file for a Grouped Column validation on the Examples page.
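
For example, one possible workflow is to save a validation to a YAML file and then run it with configs run; the connection name and file path below are illustrative:

# Save the validation configuration to a local YAML file
data-validation validate column \
  -sc my_bq_conn \
  -tc my_bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_trips \
  --count '*' \
  -c citibike_count.yaml

# Run (or re-run) the saved validation
data-validation configs run -c citibike_count.yaml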

Scaling DVT

You can scale DVT for large validations by running the tool in a distributed manner. To optimize the validation speed for large tables, you can use GKE Jobs (Google Kubernetes Engine Jobs) or Cloud Run Jobs. If you are not familiar with Kubernetes or Cloud Run Jobs, see Scaling DVT with Distributed Jobs for a detailed overview.

We recommend first generating partitions with the generate-table-partitions command for your large datasets (tables or queries). Then, use Cloud Run or GKE to distribute the validation of each chunk in parallel. See the Cloud Run Jobs Quickstart sample to get started.

When running DVT in a distributed fashion, both the --kube-completions and --config-dir flags are required. The --kube-completions flag specifies that the validation is being run in indexed completion mode in Kubernetes or as multiple independent tasks in Cloud Run. If the -kc option is used and you are not running in indexed mode, you will receive a warning and the container will process all the validations sequentially. If the -kc option is used and a config directory is not provided (i.e. a --config-file is provided instead), a warning is issued.

The --config-dir flag specifies the directory with the YAML files to be executed in parallel. If you used generate-table-partitions to generate the YAMLs, this is the directory where the partition files numbered 0000.yaml to <partition_num - 1>.yaml are stored (e.g. gs://my_config_dir/source_schema.source_table/). When creating your Cloud Run Job, set the number of tasks equal to the number of table partitions so the task index matches the YAML file to be validated. When executed, each Cloud Run task will validate a partition in parallel.
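
For example, each Cloud Run task or indexed Kubernetes Job completion could run the sketch below, where the GCS path is the illustrative partition directory from the preceding paragraph:

# Each task validates the partition matching its completion index
data-validation configs run \
  --kube-completions \
  --config-dir gs://my_config_dir/source_schema.source_table/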

Validation Reports

The result handlers tell DVT where to store the results of each validation. The tool can write the results of a validation run to Google BigQuery or PostgreSQL, or print them to stdout (the default). View the schema of the results table here.

To output to BigQuery or PostgreSQL, simply include the -rh flag during a validation run including the schema and table name for the results.

BigQuery example by connection name:

data-validation validate column \
  -sc bq_conn \
  -tc bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_trips \
  -rh bq_conn.dataset.results_table \
  -sa 'service-acct@project.iam.gserviceaccount.com'

BigQuery example by project name:

data-validation validate column \
  -sc bq_conn \
  -tc bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_trips \
  -rh bq-project-id.dataset.results_table \
  -sa 'service-acct@project.iam.gserviceaccount.com'

PostgreSQL example:

data-validation validate column \
  -sc ora_conn \
  -tc pg_conn1 \
  -tbls my_schema.some_table \
  -rh pg_conn2.dvt_schema.results_table

Ad Hoc SQL Exploration

There are many occasions where you need to explore a data source while running validations. To avoid the need to open and install a new client, the CLI allows you to run ad hoc queries.

data-validation query
  --conn or -c CONN
          The connection name to be queried
  --query or -q QUERY
          The raw query to run against the supplied connection
  [--format or -f {minimal,python}]
          Format for query output (default: python)
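
For example, assuming a connection named my_bq_conn already exists, an ad hoc count query might look like this:

# Run a one-off query against an existing connection
data-validation query \
  -c my_bq_conn \
  -q 'SELECT COUNT(*) FROM `bigquery-public-data.new_york_citibike.citibike_trips`'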

Building Matched Table Lists

Creating the list of matched tables can be a hassle. The find-tables command helps you match all of the tables between source and target. The command:

  • Pulls all tables in the source (applying a supplied allowed-schemas filter)
  • Pulls all tables from the target
  • Uses Jaro Similarity algorithm to match tables
  • Finally, prints a JSON list of matched tables that can be used as a reference for the validation run config.

Note that the default value for the score-cutoff parameter is 1, which requires identical matches. If no matches occur, reduce this value as needed; smaller values such as 0.7 or 0.65 will yield more matches. For reference, we make use of the jaro_similarity method for the string comparison.

data-validation find-tables --source-conn source --target-conn target \
    --allowed-schemas pso_data_validator \
    --score-cutoff 1

Using Beta CLI Features

There may be occasions when we want to release a new CLI feature under a Beta flag. Any feature under Beta may or may not make its way to production. However, if there is a Beta feature you wish to use, it can be accessed using the following:

data-validation beta --help

[Beta] Deploy Data Validation as a Local Service

If you wish to use Data Validation as a Flask service, the following command will help. This same logic is also expected to be used for Cloud Run, Cloud Functions, and other deployment services.

data-validation beta deploy

Validation Logic

Aggregated Fields

Aggregate fields contain the SQL fields that you want to produce an aggregate for. Currently the functions COUNT(), AVG(), SUM(), MIN(), MAX(), and STDDEV_SAMP() are supported.

Here is a sample aggregate config:

validations:
- aggregates:
  - field_alias: count
    source_column: null
    target_column: null
    type: count
  - field_alias: count__tripduration
    source_column: tripduration
    target_column: tripduration
    type: count
  - field_alias: sum__tripduration
    source_column: tripduration
    target_column: tripduration
    type: sum

If you are aggregating columns with large values, you can CAST() before aggregation with calculated fields as shown in this example.

Filters

Filters let you apply a WHERE statement to your validation query (e.g. SELECT * FROM table WHERE created_at > 30 days ago AND region_id = 71;). The filter is written in the syntax of the given source and must reference columns in the underlying table, not projected DVT expressions.

Note that you are writing the query to execute, which does not have to match between source and target as long as the results can be expected to align. If the target filter is omitted, the source filter will run on both the source and target tables.
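
For example, the sketch below applies a date filter written in Oracle syntax on the source and BigQuery syntax on the target; the connection, table, and column names are placeholders:

# Source filter (Oracle syntax) and target filter (BigQuery syntax) separated by a colon
data-validation validate column \
  -sc ora_conn \
  -tc bq_conn \
  -tbls my_schema.orders=my_dataset.orders \
  --count '*' \
  --filters "created_at > DATE '2024-01-01':created_at > '2024-01-01'"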

Primary Keys

In many cases, validations (e.g. count, min, max) produce one row per table, and the comparison is simply between the value of each column in the source and the value of the corresponding column in the target. Grouped-columns validations and validate row produce multiple rows per table, so the Data Validation Tool needs one or more columns to uniquely identify each row before the source and target can be compared. The Data Validation Tool refers to these columns as primary keys. They do not need to be primary keys in the table; the only requirement is that the keys uniquely identify the row in the results.

These columns are inferred, where possible, from the source/target table or can be provided via the --primary-keys flag.

Grouped Columns

Grouped Columns contain the fields you want your aggregations to be broken out by, e.g. SELECT last_updated::DATE, COUNT(*) FROM my.table will produce a resultset that breaks down the count of rows per calendar date.
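
For example, a grouped column validation that breaks a row count out by a grouping column might look like this; the connection name and grouping column are illustrative:

# Compare the row count per start_station_id between source and target
data-validation validate column \
  -sc my_bq_conn \
  -tc my_bq_conn \
  -tbls bigquery-public-data.new_york_citibike.citibike_trips \
  --count '*' \
  --grouped-columns start_station_id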

Hash, Concat, and Comparison Fields

Row level validations can involve either a hash/checksum, concat, or comparison fields. A hash validation (--hash '*') will first sanitize the data with the following operations on all or selected columns: CAST to string, IFNULL replace with a default replacement string and RSTRIP. Then, it will CONCAT() the results and run a SHA256() hash and compare the source and target results.

When there are data type mismatches for columns, for example dates compared to timestamps and booleans compared with numeric columns, you may see other expressions in SQL statements which ensure that consistent values are used to build comparison values.

Since each row will be returned in the result set, it is recommended to validate a subset of the table. The --filters and --use-random-row options can be used for this purpose.

Please note that SHA256 is not a supported function on Teradata systems. If you wish to perform this comparison on Teradata you will need to deploy a UDF to perform the conversion.

The concat validation (--concat '*') will do everything up until the hash. It will sanitize and concatenate the specified columns, and then value compare the results.

Comparison field validations (--comp-fields column) involve a value comparison of the column values. These values will be compared via a JOIN on their corresponding primary key and will be evaluated for an exact match.

See hash and comparison field validations in the Examples page.
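
For example, a comparison-field row validation that joins on a primary key and compares raw column values might look like the following sketch; the connection, table, key, and column names are placeholders:

# Compare raw values of selected columns, joined on employee_id
data-validation validate row \
  -sc my_source_conn \
  -tc my_bq_conn \
  -tbls hr.employees=hr_dataset.employees \
  --primary-keys employee_id \
  -comp-fields last_name,salary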

Calculated Fields

Sometimes direct comparisons are not feasible between databases due to differences in how particular data types may be handled. These differences can be resolved by applying functions to columns in the query itself. Examples might include trimming whitespace from a string, converting strings to a single case to compare case insensitivity, or rounding numeric types to a significant figure.

Once a calculated field is defined, it can be referenced by other calculated fields at any "depth" or higher. Depth controls how many subqueries are executed in the resulting query. For example, with the following YAML config:

- calculated_fields:
    - field_alias: rtrim_col_a
      source_calculated_columns: ['col_a']
      target_calculated_columns: ['col_a']
      type: rtrim
      depth: 0 # generated off of a native column
    - field_alias: ltrim_col_b
      source_calculated_columns: ['col_b']
      target_calculated_columns: ['col_b']
      type: ltrim
      depth: 0 # generated off of a native column
    - field_alias: concat_col_a_col_b
      source_calculated_columns: ['rtrim_col_a', 'ltrim_col_b']
      target_calculated_columns: ['rtrim_col_a', 'ltrim_col_b']
      type: concat
      depth: 1 # calculated one query above

is equivalent to the following SQL query:

SELECT
  CONCAT(rtrim_col_a, ltrim_col_b) AS concat_col_a_col_b
FROM (
  SELECT
      RTRIM(col_a) AS rtrim_col_a
    , LTRIM(col_b) AS ltrim_col_b
  FROM my.table
  ) as table_0

If you generate the config file for a row validation, you can see that it uses calculated fields to generate the query. You can also use calculated fields in column-level validations to generate the length of a string, or cast an INT field to BIGINT for aggregations.

See the Examples page for a sample cast to NUMERIC.

Custom Calculated Fields

DVT supports certain functions required for row hash validation natively (i.e. CAST() and CONCAT()), which are listed in the CalculatedField() class methods in the QueryBuilder.

You can also specify custom functions (i.e. replace() or truncate()) from the Ibis expression API reference. Keep in mind these will run on both source and target systems. You will need to specify the Ibis API expression and the parameters required, if any, with the 'params' block as shown below:

- calculated_fields:
  - depth: 0
    field_alias: format_start_time
    source_calculated_columns:
    - start_time
    target_calculated_columns:
    - start_time
    type: custom
    ibis_expr: ibis.expr.types.TemporalValue.strftime
    params:
    - format_str: '%m%d%Y'

The above block references the TemporalValue.strftime Ibis API expression. See the Examples page for a sample YAML with a custom calculated field.

Contributing

Contributions are welcome. See the Contributing guide for details.
