JSON format

The JSON format is simple. There are currently 2 sections: "columns", which is an array of column objects, and "dataflow", an object that describes any dataflows within the dataset. Only the "columns" array, with at least one column, is required.

A very simple configuration is a single column with no dataflow object, and looks like this:


{
    "columns": [
        {
            "column_name": "simple_example_column"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "core_random.r_natural"
        }
    ]
}
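
The top-level rules above can be checked before a definition is submitted. The following is a minimal sketch, not part of Binubuo: the function name check_top_level is hypothetical, and it only tests what the text states (a non-empty "columns" array, and an optional "dataflow" object).

```python
import json

def check_top_level(config_text):
    """Return a list of problems found in the top-level structure."""
    config = json.loads(config_text)
    if not isinstance(config, dict):
        return ["the top level must be a JSON object"]
    problems = []
    columns = config.get("columns")
    if not isinstance(columns, list) or len(columns) == 0:
        problems.append('"columns" must be an array with at least one column')
    # "dataflow" is optional; the text describes it as an object.
    if "dataflow" in config and not isinstance(config["dataflow"], dict):
        problems.append('"dataflow" must be an object when present')
    return problems

simple_config = """
{
    "columns": [
        {
            "column_name": "simple_example_column"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "core_random.r_natural"
        }
    ]
}
"""

print(check_top_level(simple_config))      # empty list: no problems found
print(check_top_level('{"columns": []}'))  # one problem: empty columns array
```

The leading-comma layout used on this page is still valid JSON, so a standard parser reads it without changes.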

Column object

The column object is at the core of custom datasets. Columns describe your data and how it will look when Binubuo generates synthetic data from those column definitions.

There are 6 column types (data generator categories) that we can define for a column. Depending on which one we choose, a different set of fields is required to define the data outcome. The 6 column types are:

  • Generated - This is the standard data generator. It takes advantage of the more than 100 domain-specific generators available in Binubuo.
  • Fixed - This is for when you want the same value in every row of your dataset.
  • Builtin - These are special function columns, such as incrementals, additions or more.
  • Reference List - This is to define a fixed list of possible values.
  • Nested Set - This is for when you want to create nested datasets, where one dataset is created within your main dataset.
  • Reference Field - This is for when you want to build parent/child relationships between your synthetic data outputs.

Generated Column Format

Generated columns are the standard way of generating data for your datasets. For a list of available generators, see the Generator Documentation list. The format for a generated column is:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[string|text|number|int|date|time]"
            , "column_type": "generated"
            , "generator": "[generator_call_name]"
            , "arguments": "[any arguments as a comma separated list if generator takes args.]"
        }
    ]
}

The arguments field is not required. The Generator Documentation list shows whether a generator supports any arguments. If an argument is a string, it should be enclosed in single quotation marks.
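
The quoting rule for arguments can be sketched as a small helper. This is an illustration under assumptions: format_arguments is a hypothetical name, the column name is made up, and how embedded quotation marks or spaces after commas are handled is not documented here, so the sketch does not attempt either.

```python
def format_arguments(values):
    """Join generator arguments into the comma separated string form."""
    parts = []
    for value in values:
        if isinstance(value, str):
            # String arguments are wrapped in single quotation marks.
            parts.append("'" + value + "'")
        else:
            parts.append(str(value))
    return ",".join(parts)

column = {
    "column_name": "example_column",        # hypothetical column name
    "column_datatype": "string",
    "column_type": "generated",
    "generator": "[generator_call_name]",   # placeholder, as in the format above
    "arguments": format_arguments(["abc", 10]),
}
print(column["arguments"])  # 'abc',10
```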

Let us imagine that we want to create a column with a synthetically generated name, and we want the column name to be "customer_name". First we find the generator name: person_random.r_name. The column object would look like this:


{
    "columns": [
        {
            "column_name": "customer_name"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "person_random.r_name"
        }
    ]
}

Fixed Column Format

Fixed columns are the simplest column type for your dataset. A fixed column holds a single fixed value that is the same for all rows. The format for a fixed column is:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[string|text|number|int|date|time]"
            , "column_type": "fixed"
            , "fixed_value": "[your fixed data value]"
        }
    ]
}

So if we wanted to create a column called "hello" which always had the value "world" our column object would look like this:


{
    "columns": [
        {
            "column_name": "hello"
            , "column_datatype": "string"
            , "column_type": "fixed"
            , "fixed_value": "world"
        }
    ]
}

Builtin Column Format

Builtin columns are special columns that create incremental data across rows, either incremental numbers or incremental dates. Incremental numbers can be used to create primary keys, and incremental dates and times can be used to create coherence and flow within data. The format for a builtin column is:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[number|int|date|time]"
            , "column_type": "builtin"
            , "builtin_type": "[numiterate|datiterate|epochiterate]"
            , "builtin_startfrom": "[sysdate|DD-MON-YYYY HH24:MI:SS|0-n]"
            , "builtin_increment_min": "[0-n]"
            , "builtin_increment_max": "[0-n]"
            , "builtin_increment_component": "[seconds|minutes|hours|days|months|years]" 
        }
    ]
}

The builtin_increment_component field only applies when you are using datiterate or epochiterate. Only column_name, column_datatype, column_type and builtin_type are required fields. The rest of the fields have default values: builtin_startfrom defaults to 0 for numiterate and to today for datiterate and epochiterate; builtin_increment_min and builtin_increment_max default to 1 and 5 respectively; and for datiterate the increment component is minutes by default.
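
To make the defaults concrete, here is a small sketch that fills them in for a minimal builtin column. It is not Binubuo code: apply_builtin_defaults is a hypothetical helper, and using "sysdate" as the spelling of the "today" default is an assumption taken from the builtin_startfrom options in the format above. No default component is set for epochiterate, because the text does not state one.

```python
def apply_builtin_defaults(column):
    """Return a copy of a builtin column with the described defaults filled in."""
    filled = dict(column)
    builtin_type = filled["builtin_type"]
    if builtin_type == "numiterate":
        filled.setdefault("builtin_startfrom", "0")
    else:
        # datiterate and epochiterate start from today; "sysdate" is an
        # assumed spelling of that default.
        filled.setdefault("builtin_startfrom", "sysdate")
    filled.setdefault("builtin_increment_min", "1")
    filled.setdefault("builtin_increment_max", "5")
    if builtin_type == "datiterate":
        filled.setdefault("builtin_increment_component", "minutes")
    return filled

minimal = {
    "column_name": "row_id",          # hypothetical column name
    "column_datatype": "number",
    "column_type": "builtin",
    "builtin_type": "numiterate",
}
print(apply_builtin_defaults(minimal))
```

Values that are already present are left untouched, so the helper only shows what an omitted field would resolve to.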

So let us say that we want to create a field log_time that starts from now and increments by 2 to 9 seconds for every row of data we generate. It would look like this:


{
    "columns": [
        {
            "column_name": "log_time"
            , "column_datatype": "date"
            , "column_type": "builtin"
            , "builtin_type": "datiterate"
            , "builtin_increment_min": "2"
            , "builtin_increment_max": "9"
            , "builtin_increment_component": "seconds"
        }
    ]
}
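
The effect of this definition can be approximated locally. The sketch below is only a simulation of the described behaviour, not Binubuo's implementation: it assumes the first row holds the start value itself and that each increment is a uniformly chosen whole number of seconds, neither of which is stated on this page. The function name simulate_log_times is hypothetical.

```python
import random
from datetime import datetime, timedelta

def simulate_log_times(row_count, start, increment_min=2, increment_max=9):
    """Return one timestamp per row, each 2 to 9 seconds after the previous one."""
    current = start
    times = []
    for _ in range(row_count):
        times.append(current)
        # Assumed: a uniform whole-second step between min and max, inclusive.
        current += timedelta(seconds=random.randint(increment_min, increment_max))
    return times

for value in simulate_log_times(5, datetime.now()):
    print(value.isoformat(sep=" ", timespec="seconds"))
```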

Reference List Column Format

Reference list columns are for a short, known list of possible values, for instance a list of known status values or a list of static reference values. The format for a reference list column is:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[string|text|number|int|date|time]"
            , "column_type": "referencelist"
            , "reference_static_list": "[comma,separated,list,of,values]"
            
        }
    ]
}

So let us say that we wanted to create a field called rag_status (red/amber/green) and we want a random value in the field, it would look like this:


{
    "columns": [
        {
            "column_name": "rag_status"
            , "column_datatype": "string"
            , "column_type": "referencelist"
            , "reference_static_list": "red,amber,green"
        }
    ]
}
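
As a rough picture of what "a random value in the field" means, the sketch below splits the static list and picks one entry per row. It assumes every value is equally likely, which this page does not confirm, and pick_from_static_list is a hypothetical name.

```python
import random

def pick_from_static_list(reference_static_list):
    """Pick one value from a comma separated list, assuming equal weights."""
    values = reference_static_list.split(",")
    return random.choice(values)

for _ in range(3):
    print(pick_from_static_list("red,amber,green"))
```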

Nested Set Column Format

Whenever you need nested (recursive) data in your datasets, use the nested set column type. The nested datasets must already exist before you can use them. Once you have created the recursive definitions as independent datasets, you can create the nested field. The format for a nested set column looks like this:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[text]"
            , "column_type": "nestedset"
            , "nested_source": "[name_of_your_dataset]"
            , "nested_min_count": "[0-n]"
            , "nested_max_count": "[0-n]"
            
        }
    ]
}

The min and max counts are the lower and upper bounds on the number of rows you want from the nested dataset. If we wanted to create a field named order_details, with the values from a different dataset called order_line_items and between 2 and 20 items, it would look like this:


{
    "columns": [
        {
            "column_name": "order_details"
            , "column_datatype": "text"
            , "column_type": "nestedset"
            , "nested_source": "order_line_items"
            , "nested_min_count": "2"
            , "nested_max_count": "20"
        }
    ]
}

Reference Field Column Format

Reference fields are the Binubuo way of building foreign keys, so that you can reference data keys between 2 synthetic datasets. With proper tagging, you can always get the same random rows, with the same references, for your data extracts. The format for a reference field column looks like this:


{
    "columns": [
        {
            "column_name": "[name_of_your_column_spaces_replaced_with_underscore]"
            , "column_datatype": "[string|text|number|int|date|time]"
            , "column_type": "reference field"
            , "reference_table": "[name_of_your_reference_dataset]"
            , "reference_column": "[name_of_the_parent_key_column_in_reference_dataset]"
            , "reference_distribution_type": "[simple|range|weighted]"
            , "distribution_simple_val": "[simple_value]"
            , "distribution_range_start": "[start_value_of_range]"
            , "distribution_range_end": "[end_value_of_range]"
        }
    ]
}

So if we wanted to create a column called customer_ref in an orders dataset, where each order references the customer_id column of a different dataset called customers, and we wanted between 2 and 10 orders for each customer, our column would look like this:


{
    "columns": [
        {
            "column_name": "customer_ref"
            , "column_datatype": "number"
            , "column_type": "reference field"
            , "reference_table": "customers"
            , "reference_column": "customer_id"
            , "reference_distribution_type": "range"
            , "distribution_range_start": "2"
            , "distribution_range_end": "10"
        }
    ]
}
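
One way to read the range distribution is that each parent key appears in the child dataset between distribution_range_start and distribution_range_end times. The sketch below simulates that reading only. It is an interpretation, not a description of how Binubuo assigns references, and the function name and sample customer ids are made up for illustration.

```python
import random
from collections import Counter

def simulate_customer_refs(customer_ids, range_start=2, range_end=10):
    """Return customer_ref values for an orders dataset under the assumed reading."""
    refs = []
    for customer_id in customer_ids:
        # Assumed: each customer gets a uniformly chosen number of orders.
        order_count = random.randint(range_start, range_end)
        refs.extend([customer_id] * order_count)
    return refs

refs = simulate_customer_refs([101, 102, 103])  # hypothetical customer_id values
print(Counter(refs))  # orders per customer, each between 2 and 10
```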

Example 1: Simple dataset with 3 columns

This is a simple example that creates a synthetic dataset with user names, emails and addresses.


{
    "columns": [
        {
            "column_name": "user_name"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "people_random.r_name"
        }, {
            "column_name": "email"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "computer_random.r_email"
        }, {
            "column_name": "address"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "location_random.r_address"
        }
    ]
}

Example 2: Adding arguments to generator calls

In this example, we add an argument to the address generator to set the country to CN (China). That way the address will be a Chinese address.


{
    "columns": [
        {
            "column_name": "user_name"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "people_random.r_name"
        }, {
            "column_name": "email"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "computer_random.r_email"
        }, {
            "column_name": "address"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "location_random.r_address"
            , "arguments": "CN"
        }
    ]
}

Example 3: Using built-ins

Now we add an id column that we can use as a primary key. For that we use the builtin column format. We want the first ID to be 42, and we want each following ID to increase by a number between 3 and 7.


{
    "columns": [
        {
            "column_name": "user_id"
            , "column_datatype": "number"
            , "column_type": "builtin"
            , "builtin_type": "numiterate"
            , "builtin_startfrom": "42"
            , "builtin_increment_min": "3"
            , "builtin_increment_max": "7"
        },{
            "column_name": "user_name"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "people_random.r_name"
        }, {
            "column_name": "email"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "computer_random.r_email"
        }, {
            "column_name": "address"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "location_random.r_address"
            , "arguments": "CN"
        }
    ]
}
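
To see what the user_id definition implies, the arithmetic can be simulated locally. This is a sketch under assumptions, not Binubuo's generator: it treats the first row as exactly the start value (as the example describes) and draws each step as a uniform whole number between the minimum and maximum. The name simulate_numiterate is hypothetical.

```python
import random

def simulate_numiterate(row_count, start_from=42, increment_min=3, increment_max=7):
    """Return ids that start at start_from and grow by 3 to 7 per row."""
    current = start_from
    ids = []
    for _ in range(row_count):
        ids.append(current)
        # Assumed: a uniform whole-number step between min and max, inclusive.
        current += random.randint(increment_min, increment_max)
    return ids

print(simulate_numiterate(5))  # starts at 42, then strictly increasing values
```

Because every step is at least 3, the values are strictly increasing and therefore unique, which is what makes the column usable as a primary key.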
