Column Specifications
The columns are specified for each sync, and determine how the data will be interpreted and transformed at the comparison stage. A full example of all general options per column is shown below:
comparisonColumns:
- name: id
type: Uint8
unique: True
- name: name
type: Utf8
unique: False
- name: another_unique_id
type: Utf8
unique: False
Alternate column declaration
If you do not need to set any special settings for the columns, you can use the alternate syntax for column declaration to speed things up. This will cause the first field in the list to be set as unique:
comparisionColumns:
uniqueField: Int32
fieldA: Utf8
fieldB: Int32
Comparison columns
The comparison columns are a representation of how the data should look when doing the comparison, meaning AFTER any source or destination transformers have been applied. You can retrieve any number of columns not listed in the columns-section, but they will not be used for comparison. You can use them for insertion and transformations, though.
Field types
There are a number of field types that can be used for the columns:
- Integer types: pl.Int8, pl.Int16, pl.Int32, pl.Int64
- Unsigned Integer types: pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64
- Float types: pl.Float32, pl.Float64
- Boolean: pl.Boolean
- Utf8: pl.Utf8
- Object: pl.Object
There are a number of more field types in the PolaRS library, but not all of them have been validated to work with Beetl yet. In particular, the Binary column is not going to work.
Unique Fields
The unique fields determine how the comparison is made between source and destination, and don't have to correspond exactly with all unique columns in both datasets.
In the above ID column, the unique field is set to True
. This will mean that the comparison is done like this:
- Match all Source IDs to Destination IDs to find those that are missing (Create)
- Match all Destination IDs to Source IDs to find those that are to be deleted (Delete)
- Match all rows that exist in both datasets and compare the data to find those that are different (Update)
You can add multiple ID columns for a dataset, but there should never be no unique columns.
TIP
If you transfer to/from SQL Server, make sure to transform the UUID (uniqueidentifier)s on the other end to uppercase since this is done by default in SQL Server. If you don't do this, you will likely have all records updated every time.
Multi-dimensional Data
If you use sources with multi-dimensional data (e.g. json, MongoDB, etc.), you can use the pl.Object
type to store the entire data structure in a single column. You also have the option to use transformers to flatten the data structure into multiple columns
Objects
allRows:
- name: hello
attrs:
height: 100
length: 200
width: 300
- name: world
attrs:
height: 240
length: 120
width: 360
This data structure can be flattened into the following columns:
name | attrs.height | attrs.length | attrs.width |
---|---|---|---|
hello | 100 | 200 | 300 |
world | 240 | 120 | 360 |
Arrays
Arrays will generally be flattened to a pl.List type, and can then be modified and/or extracted with for example the strings.join or structs.jsonpath transformers
Lists cannot yet be compared against each other.
- name: hello
attrs:
- 100
- 200
- 300
deep:
- nested: list
deep_attr:
- 1
...
# This will turn the above field into a comma-separated string
- transformer: strings.join
config:
inField: attrs
separator: ", "
# This will extract the first element of the array under attrs
- transformer: structs.jsonpath
config:
inField: attrs
path: "0"
# Jsonpath can also be used for deeper data structures
- transformer: structs.jsonpath
config:
inField: deep
path: "0.deep_attr.0"