Handling other formats
Earlier examples of loading JSON data assume the use of JSONEachRow
(ndjson). We provide examples of loading JSON in other common formats below.
Array of JSON objects
One of the most popular forms of JSON data is having a list of JSON objects in a JSON array, like in this example:
> cat list.json
[
{
"path": "Akiba_Hebrew_Academy",
"month": "2017-08-01",
"hits": 241
},
{
"path": "Aegithina_tiphia",
"month": "2018-02-01",
"hits": 34
},
...
]
Let’s create a table for this kind of data:
CREATE TABLE sometable
(
`path` String,
`month` Date,
`hits` UInt32
)
ENGINE = MergeTree
ORDER BY tuple(month, path)
To import a list of JSON objects, we can use a JSONEachRow
format (inserting data from list.json file):
INSERT INTO sometable
FROM INFILE 'list.json'
FORMAT JSONEachRow
We have used a FROM INFILE clause to load data from the local file, and we can see the import was successful:
SELECT *
FROM sometable
┌─path──────────────────────┬──────month─┬─hits─┐
│ 1971-72_Utah_Stars_season │ 2016-10-01 │ 1 │
│ Akiba_Hebrew_Academy │ 2017-08-01 │ 241 │
│ Aegithina_tiphia │ 2018-02-01 │ 34 │
└───────────────────────────┴────────────┴──────┘
Handling NDJSON (line delimited JSON)
Many apps can log data in JSON format so that each log line is an individual JSON object, like in this file:
cat object-per-line.json
{"path":"1-krona","month":"2017-01-01","hits":4}
{"path":"Ahmadabad-e_Kalij-e_Sofla","month":"2017-01-01","hits":3}
{"path":"Bob_Dolman","month":"2016-11-01","hits":245}
The same JSONEachRow
format is capable of working with such files:
INSERT INTO sometable FROM INFILE 'object-per-line.json' FORMAT JSONEachRow;
SELECT * FROM sometable;
┌─path──────────────────────┬──────month─┬─hits─┐
│ Bob_Dolman │ 2016-11-01 │ 245 │
│ 1-krona │ 2017-01-01 │ 4 │
│ Ahmadabad-e_Kalij-e_Sofla │ 2017-01-01 │ 3 │
└───────── ──────────────────┴────────────┴──────┘
JSON object keys
In some cases, the list of JSON objects can be encoded as object properties instead of array elements (see objects.json for example):
cat objects.json
{
"a": {
"path":"April_25,_2017",
"month":"2018-01-01",
"hits":2
},
"b": {
"path":"Akahori_Station",
"month":"2016-06-01",
"hits":11
},
...
}
ClickHouse can load data from this kind of data using the JSONObjectEachRow
format:
INSERT INTO sometable FROM INFILE 'objects.json' FORMAT JSONObjectEachRow;
SELECT * FROM sometable;
┌─path────────────┬──────month─┬─hits─┐
│ Abducens_palsy │ 2016-05-01 │ 28 │
│ Akahori_Station │ 2016-06-01 │ 11 │
│ April_25,_2017 │ 2018-01-01 │ 2 │
└─────────────────┴────────────┴──────┘
Specifying parent object key values
Let’s say we also want to save values in parent object keys to the table. In this case, we can use the following option to define the name of the column we want key values to be saved to:
SET format_json_object_each_row_column_for_object_name = 'id'
Now, we can check which data is going to be loaded from the original JSON file using file()
function:
SELECT * FROM file('objects.json', JSONObjectEachRow)
┌─id─┬─path────────────┬──────month─┬─hits─┐
│ a │ April_25,_2017 │ 2018-01-01 │ 2 │
│ b │ Akahori_Station │ 2016-06-01 │ 11 │
│ c │ Abducens_palsy │ 2016-05-01 │ 28 │
└────┴─────────────────┴────────────┴──────┘
Note how the id
column has been populated by key values correctly.
JSON Arrays
Sometimes, for the sake of saving space, JSON files are encoded in arrays instead of objects. In this case, we deal with a list of JSON arrays:
cat arrays.json
["Akiba_Hebrew_Academy", "2017-08-01", 241],
["Aegithina_tiphia", "2018-02-01", 34],
["1971-72_Utah_Stars_season", "2016-10-01", 1]
In this case, ClickHouse will load this data and attribute each value to the corresponding column based on its order in the array. We use JSONCompactEachRow
format for this:
SELECT * FROM sometable
┌─c1────────────────────────┬─────────c2─┬──c3─┐
│ Akiba_Hebrew_Academy │ 2017-08-01 │ 241 │
│ Aegithina_tiphia │ 2018-02-01 │ 34 │
│ 1971-72_Utah_Stars_season │ 2016-10-01 │ 1 │
└───────────────────────────┴────────────┴─────┘
Importing individual columns from JSON arrays
In some cases, data can be encoded column-wise instead of row-wise. In this case, a parent JSON object contains columns with values. Take a look at the following file:
cat columns.json
{
"path": ["2007_Copa_America", "Car_dealerships_in_the_USA", "Dihydromyricetin_reductase"],
"month": ["2016-07-01", "2015-07-01", "2015-07-01"],
"hits": [178, 11, 1]
}
ClickHouse uses the JSONColumns
format to parse data formatted like that:
SELECT * FROM file('columns.json', JSONColumns)
┌─path───────────────────────┬──────month─┬─hits─┐
│ 2007_Copa_America │ 2016-07-01 │ 178 │
│ Car_dealerships_in_the_USA │ 2015-07-01 │ 11 │
│ Dihydromyricetin_reductase │ 2015-07-01 │ 1 │
└────────────────────────────┴────────────┴──────┘
A more compact format is also supported when dealing with an array of columns instead of an object using JSONCompactColumns
format:
SELECT * FROM file('columns-array.json', JSONCompactColumns)
┌─c1──────────────┬─────────c2─┬─c3─┐
│ Heidenrod │ 2017-01-01 │ 10 │
│ Arthur_Henrique │ 2016-11-01 │ 12 │
│ Alan_Ebnother │ 2015-11-01 │ 66 │
└─────────────────┴────────────┴────┘
Saving JSON objects instead of parsing
There are cases you might want to save JSON objects to a single String
(or JSON) column instead of parsing it. This can be useful when dealing with a list of JSON objects of different structures. Let's take this file where we have multiple different JSON objects inside a parent list:
cat custom.json
[
{"name": "Joe", "age": 99, "type": "person"},
{"url": "/my.post.MD", "hits": 1263, "type": "post"},
{"message": "Warning on disk usage", "type": "log"}
]
We want to save original JSON objects into the following table:
CREATE TABLE events
(
`data` String
)
ENGINE = MergeTree
ORDER BY ()
Now we can load data from the file into this table using JSONAsString
format to keep JSON objects instead of parsing them:
INSERT INTO events (data)
FROM INFILE 'custom.json'
FORMAT JSONAsString
And we can use JSON functions to query saved objects:
SELECT
JSONExtractString(data, 'type') AS type,
data
FROM events
┌─type───┬─data─────────────────────────────────────────────────┐
│ person │ {"name": "Joe", "age": 99, "type": "person"} │
│ post │ {"url": "/my.post.MD", "hits": 1263, "type": "post"} │
│ log │ {"message": "Warning on disk usage", "type": "log"} │
└────────┴──────────────────────────────────────────────────────┘
Note that JSONAsString
works perfectly fine in cases we have JSON object-per-line formatted files (usually used with JSONEachRow
format).
Schema for nested objects
In cases when we're dealing with nested JSON objects, we can additionally define schema and use complex types (Array
, Object Data Type
or Tuple
) to load data:
SELECT *
FROM file('list-nested.json', JSONEachRow, 'page Tuple(path String, title String, owner_id UInt16), month Date, hits UInt32')
LIMIT 1
┌─page───────────────────────────────────────────────┬──────month─┬─hits─┐
│ ('Akiba_Hebrew_Academy','Akiba Hebrew Academy',12) │ 2017-08-01 │ 241 │
└────────────────────────────────────────────────────┴────────────┴──────┘
Accessing nested JSON objects
We can refer to nested JSON keys by enabling the following settings option:
SET input_format_import_nested_json = 1
This allows us to refer to nested JSON object keys using dot notation (remember to wrap those with backtick symbols to work):
SELECT *
FROM file('list-nested.json', JSONEachRow, '`page.owner_id` UInt32, `page.title` String, month Date, hits UInt32')
LIMIT 1
┌─page.owner_id─┬─page.title───────────┬──────month─┬─hits─┐
│ 12 │ Akiba Hebrew Academy │ 2017-08-01 │ 241 │
└───────────────┴──────────────────────┴────────────┴──────┘
This way we can flatten nested JSON objects or use some nested values to save them as separate columns.
Skipping unknown columns
By default, ClickHouse will ignore unknown columns when importing JSON data. Let’s try to import the original file into the table without the month
column:
CREATE TABLE shorttable
(
`path` String,
`hits` UInt32
)
ENGINE = MergeTree
ORDER BY path
We can still insert the original JSON data with 3 columns into this table:
INSERT INTO shorttable FROM INFILE 'list.json' FORMAT JSONEachRow;
SELECT * FROM shorttable
┌─path──────────────────────┬─hits─┐
│ 1971-72_Utah_Stars_season │ 1 │
│ Aegithina_tiphia │ 34 │
│ Akiba_Hebrew_Academy │ 241 │
└───────────────────────────┴──────┘
ClickHouse will ignore unknown columns while importing. This can be disabled with the input_format_skip_unknown_fields settings option:
SET input_format_skip_unknown_fields = 0;
INSERT INTO shorttable FROM INFILE 'list.json' FORMAT JSONEachRow;
Ok.
Exception on client:
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: month: (in file/uri /data/clickhouse/user_files/list.json): (at row 1)
ClickHouse will throw exceptions in cases of inconsistent JSON and table columns structure.
BSON
ClickHouse allows exporting to and importing data from BSON encoded files. This format is used by some DBMSs, e.g. MongoDB database.
To import BSON data, we use the BSONEachRow format. Let’s import data from this BSON file:
SELECT * FROM file('data.bson', BSONEachRow)
┌─path──────────────────────┬─month─┬─hits─┐
│ Bob_Dolman │ 17106 │ 245 │
│ 1-krona │ 17167 │ 4 │
│ Ahmadabad-e_Kalij-e_Sofla │ 17167 │ 3 │
└───────────────────────────┴───────┴──────┘
We can also export to BSON files using the same format:
SELECT *
FROM sometable
INTO OUTFILE 'out.bson'
FORMAT BSONEachRow
After that, we’ll have our data exported to the out.bson
file.