Can we build it? - Apache Common Log Format

  • Author Head Binubuo
  • Category Binubuo API
  • Created Fri, Feb 10 2023
 

Welcome to a new series of blog posts where we explore the capabilities of Binubuo by showing how we can build different real-life datasets as synthetic datasets.

Every post will include the individual use of generators and their description as well as the complete JSON format to create the dataset.

So for this blog post we will see how we can build a dataset that mimics the Apache Common Log Format (https://en.wikipedia.org/wiki/Common_Log_Format).

The Common Log Format, also known as the NCSA Common Log Format (after NCSA HTTPd), is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed by a variety of web analysis programs.

Each entry in the log file is a single line and has the following syntax:

host ident authuser date request status bytes
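
To make the syntax concrete, here is a minimal sketch of how one such line could be split back into its seven fields in Python. The pattern and the function name parse_clf_line are our own illustration and not part of Binubuo; the sample line follows the style of the well-known Common Log Format example.

```python
import re
from datetime import datetime

# Pattern for one Common Log Format line:
# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_clf_line(line):
    """Split one log line into a dict of its seven fields, or return None."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The date field looks like 10/Oct/2000:13:55:36 -0700
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no content was returned
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["host"], entry["status"], entry["bytes"])
```

A parser like this is handy later on for checking that generated lines really follow the format.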

So let us take a look at each of the fields and map them to a Binubuo generator.

Field name: host
  • Description of data: This is the IP address of the client (remote host) which made the request to the server.
  • Sample value: 127.0.0.1
  • Binubuo generator: ipv4
  • Generator details: In the computer category we have a generator that gives exactly this. If we wanted to display IPv6 addresses, there is a generator for that as well.

Field name: ident
  • Description of data: Client identity. The "hyphen" in the output indicates that the requested piece of information is not available. This information is highly unreliable and should almost never be used except on tightly controlled internal networks.
  • Sample value: -
  • Binubuo generator: repeater
  • Generator details: The simple repeater generator can be used for this one, or you can specify a fixed string column in the JSON format.

Field name: authuser
  • Description of data: This is the userid of the person requesting the document as determined by HTTP authentication. If the document is not password protected, this part will be "-" just like the previous one.
  • Sample value: - or USERA
  • Binubuo generator: repeater/username
  • Generator details: If we choose to have data from a public-facing web server, we can use repeater or a fixed string column again. If we are creating data from a protected website, we can choose username.

Field name: date
  • Description of data: The time that the request was received. The format is [day/month/year:hour:minute:second zone], where:
      day = 2*digit
      month = 3*letter
      year = 4*digit
      hour = 2*digit
      minute = 2*digit
      second = 2*digit
      zone = (`+' | `-') 4*digit
  • Sample value: [10/Oct/2000:13:55:36 -0700]
  • Binubuo generator: Builtin incremental timestamp
  • Generator details: This needs to be an increasing timestamp. On a popular website consecutive requests are only fractions of a second apart, so we increment in milliseconds.
  • Example: Incremental column format documentation: https://binubuo.com/ords/r/binubuo_ui/binubuo/binubuo-documentation-page?p23_page_name=JSON%20format&p23_section_id=230&p23_section_name=Custom%20datasets

Field name: request
  • Description of data: The request line from the client is given in double quotes. The request line contains a great deal of useful information. In the sample value, the method used by the client is GET, the client requested the resource /my/path/resource.html, and the client used the protocol HTTP/1.0.
  • Sample value: "GET /my/path/resource.html HTTP/1.0"
  • Binubuo generator: http_request_method, url_path, url_query, file_name, repeater
  • Generator details: To get this data we need to split it into a few columns, each of which we produce individually.
      For the request method, we add a weight so that the majority of requests are GET, like in real life.
      For the path, we generate a random number of levels using a random number input.
      For the resource name, we use a file name from the web category.
      For the query part, we set the null percentage so that it is random whether a request has a query parameter or not.
      For the protocol version, we use a repeater.

Field name: status
  • Description of data: This is the status code that the server sends back to the client. It indicates a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5).
  • Sample value: 200
  • Binubuo generator: http_status_code
  • Generator details: For the status code we give a weight to the value 200, which is the most common return status code.
  • Example: https://binubuo.com/api/generator/computer/http_status_code?success_weight=95

Field name: bytes
  • Description of data: The last part indicates the size of the object returned to the client, not including the response headers.
  • Sample value: 3508
  • Binubuo generator: bytes_size
  • Generator details: We can use the bytes_size generator, which returns realistic byte sizes for the document type.
  • Example: https://binubuo.com/api/generator/computer/bytes_size
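
The date field is the one with the most moving parts, so here is a small local sketch of the idea behind an incremental timestamp: each row is a random number of milliseconds later than the previous one, and the result is formatted in the bracketed log style. This only simulates the behaviour in plain Python and says nothing about how Binubuo implements it. The function names and the seed parameter are hypothetical, and the 150 to 350 millisecond range is just an illustrative choice.

```python
import random
from datetime import datetime, timedelta, timezone

def incremental_timestamps(start, count, min_ms=150, max_ms=350, seed=None):
    """Yield `count` timestamps, each a random number of milliseconds after the previous one."""
    rng = random.Random(seed)
    current = start
    for _ in range(count):
        yield current
        current += timedelta(milliseconds=rng.randint(min_ms, max_ms))

def clf_date(ts):
    """Format a timezone-aware datetime as [day/month/year:hour:minute:second zone]."""
    return ts.strftime("[%d/%b/%Y:%H:%M:%S %z]")

# Start from the sample value of the date field
start = datetime(2000, 10, 10, 13, 55, 36, tzinfo=timezone(timedelta(hours=-7)))
for ts in incremental_timestamps(start, 5, seed=1):
    print(clf_date(ts))
```

Because the log format only has second resolution, several consecutive rows will often show the same second, which is exactly what a busy web server log looks like.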

So if we follow the format in the custom dataset JSON documentation, we can create a JSON template for the dataset, create the dataset, generate the data, and display it in the correct format with just a few lines of code:

from binubuo import binubuo
from datetime import datetime
b = binubuo('YOUR_API_KEY_HERE')
dataset_schema = """{
    "columns": [
        {
            "column_name": "host"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "ipv4"
        }, {
            "column_name": "ident"
            , "column_datatype": "string"
            , "column_type": "fixed"
            , "fixed_value": "-"
        }, {
            "column_name": "authuser"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "username"
        }, {
            "column_name": "date"
            , "column_datatype": "timestamp"
            , "column_type": "builtin"
            , "builtin_type": "timiterate"
            , "builtin_increment_min": "150"
            , "builtin_increment_max": "350"
            , "builtin_increment_component": "milliseconds"
        }, {
            "column_name": "request_method"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "http_request_method"
            , "arguments": "90"
        }, {
            "column_name": "request_path"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "url_path"
            , "arguments": "core_random.r_natural(1,3)"
        }, {
            "column_name": "request_resource"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "file_name"
            , "arguments": "r_extension_cat => 'Web'"
        }, {
            "column_name": "request_url_query"
            , "column_datatype": "string"
            , "column_type": "generated"
            , "generator": "url_query"
            , "arguments": "core_random.r_natural(1,3)"
            , "nullable": 60
        }, {
            "column_name": "request_protocol"
            , "column_datatype": "string"
            , "column_type": "fixed"
            , "fixed_value": "HTTP/1.0"
        }, {
            "column_name": "status"
            , "column_datatype": "number"
            , "column_type": "generated"
            , "generator": "http_status_code"
            , "arguments": "95"
        }, {
            "column_name": "bytes"
            , "column_datatype": "number"
            , "column_type": "generated"
            , "generator": "bytes_size"
        }
    ]
}"""
# Create the dataset
b.create_dataset('common_example', dataset_schema)
# Fetch the data
data = b.dataset('common_example')
# Loop through and output in common log format
for x in data:
  # Convert the timestamp string to an actual datetime object
  l_date = datetime.fromisoformat(x[3])
  # Print the line; x[7] (request_url_query) is left out of the output
  print('{} {} {} [{}] "{} {}{} {}" {} {}'.format(x[0], x[1], x[2], l_date.strftime("%d/%b/%Y:%H:%M:%S"), x[4], x[5], x[6], x[8], x[9], x[10]))

The output from the script looks like this:

213.157.63.188 - 8798987565 [10/Feb/2023:08:18:32] "GET /hetebas/hokwe/cacha.md HTTP/1.0" 200 3363
44.59.153.53 - U872844 [10/Feb/2023:08:18:32] "GET /rifjitri/pohe/dev.json HTTP/1.0" 200 12564
3.11.235.184 - U577294 [10/Feb/2023:08:18:32] "GET /suhehe/mecok/beb.pl HTTP/1.0" 200 10765
151.132.175.175 - zynu6263 [10/Feb/2023:08:18:32] "GET /jiti/lan/vupu.aspx HTTP/1.0" 200 11232
130.43.42.77 - 7525166474 [10/Feb/2023:08:18:32] "GET /lod/go.md HTTP/1.0" 200 7864
172.153.226.80 - U923932 [10/Feb/2023:08:18:33] "GET /loflej/tapzolo/kat.asp HTTP/1.0" 200 7207
16.48.157.173 - phep3712 [10/Feb/2023:08:18:33] "GET /jif/tar/necodo.json HTTP/1.0" 200 3809
40.212.237.122 - ozcw5839 [10/Feb/2023:08:18:33] "GET /je/purim/lo.html HTTP/1.0" 200 12760
124.118.1.155 - U246926 [10/Feb/2023:08:18:33] "GET /pege/si/valefe.json HTTP/1.0" 200 11409
189.17.69.10 - tkhw2115 [10/Feb/2023:08:18:34] "GET /kemli/sov.asp HTTP/1.0" 200 2860
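
The sample output leaves out two things the format allows for: the time zone in the date field and the optional query string in the request. A formatter that adds both could look roughly like the sketch below. It rests on a few assumptions that we have not verified against the API: that a row arrives as eleven values in schema order, that the timestamp is an ISO 8601 string, that the path column ends with a slash, and that the query column holds the text after the "?" or is empty. The function name format_clf and the sample row are made up for illustration.

```python
from datetime import datetime, timezone

def format_clf(row, tz=timezone.utc):
    """Build one Common Log Format line from a row of dataset columns."""
    (host, ident, authuser, date, method, path,
     resource, query, protocol, status, size) = row
    ts = datetime.fromisoformat(date)
    # Assumption: timestamps without zone information are treated as being in `tz`
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=tz)
    # Assumption: the query column holds the part after "?" or is None/empty
    target = path + resource + ("?" + query if query else "")
    return '{} {} {} [{}] "{} {} {}" {} {}'.format(
        host, ident, authuser or "-", ts.strftime("%d/%b/%Y:%H:%M:%S %z"),
        method, target, protocol, status, size)

# Hypothetical row, laid out in the same column order as the dataset schema
row = ["213.157.63.188", "-", "8798987565", "2023-02-10T08:18:32.150",
       "GET", "/hetebas/hokwe/", "cacha.md", None, "HTTP/1.0", 200, 3363]
print(format_clf(row))
```

Falling back to "-" for an empty authuser keeps the line valid for public-facing sites where no user is logged in.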

So if we do not count the dataset JSON schema itself, in just 8 lines of code we created a script that can simulate an httpd common log file. Changing the number of rows returned is just a matter of adding one line before you fetch the data. So if you want 1000 rows instead of the default 10, add the line below to the script.

b.drows(1000)

All datasets that we create in this series will be added to the standard datasets, so you can find this one in the computer category.

You can go ahead and try the output here: https://binubuo.com/api/data/standard/computer/common_log_format?rows=3

Don't have an account yet?

If you don't have an account on Binubuo yet, you can create one real quick. Just click "Prices and Registration" in the top right corner, and you are on your way to creating all the synthetic data you could dream about.
