Examples - Bullet Docs

Query Examples

Rather than sourcing the examples from the Quick Start, these examples are real-world ones sourced from Bullet running on raw, user events generated by instrumentation on Yahoo sites by Yahoo employees (not all Yahoo users).

Disclaimer

The actual data shown here has been edited and is not how actual Yahoo user events look.

Simplest Query

The simplest query you could write would be:

SELECT * FROM STREAM();

This query would get any records that pass through for a default duration of 20000 ms up to a max default of 500 records.

SELECT * FROM STREAM(10000, TIME) LIMIT 1;

If you wanted to write a smaller or shorter query to, for example, quickly test your connection to Bullet, you could adjust the query duration and size like shown above. This query would only last for a duration of 10000 ms and return at most 1 record.

WINDOW?

There is only one unified data stream in Bullet, so for clarity the FROM clause is given a STREAM function to denote the look-forward time window for the Bullet query.

Simple Filtering

SELECT *
FROM STREAM(30000, TIME)
WHERE id = 'btsg8l9b234ha'
LIMIT 1;

Because of the default constraints, this query would find at most 1 record with the id matching the value provided. The record would have all its fields.

A sample response could be (it has been edited to remove PII and other Yahoo data). The response contains a single matching record, and the associated meta information.

{
   "records":[
       {
           "server_name":"EDITED",
           "page_uri":"/",
           "is_page_view":true,
           "device":"tablet",
           "debug_codes":{
               "http_status_code":"200"
           },
           "referrer_domain":"www.yahoo.com",
           "is_logged_in":true,
           "timestamp":1446842189000,
           "event_family":"view",
           "id":"btsg8l9b234ha",
           "os_name":"mac os",
           "demographics":{
               "age" : "25",
               "gender" : "m",
            }
       }
    ],
    "meta":{
        "query_id":1167304238598842449,
        "query_body":"{}",
        "query_finish_time":1480723799550,
        "query_receive_time":1480723799540
    }
}

Relational Filters and Projections

SELECT timestamp AS ts, device_timestamp AS device_ts,
       event AS event, page_domain AS domain, page_id AS id
FROM STREAM(20000, TIME)
WHERE id = 'btsg8l9b234ha' AND page_id IS NOT NULL
LIMIT 10;

The above query finds all events where id is set to 'btsg8l9b234ha' and page_id is not null, projects the fields selected above with their aliases (timestamp as ts, etc.) and limits the results to at most 10 records. The query would wait at most 20 seconds for records to show up.

The resulting response could look like (only 3 events were generated that matched the criteria):

{
    "records": [
        {
            "domain": "http://some.url.com",
            "device_ts": 1481152233788,
            "id": 2273844742998,
            "event": "page",
            "ts": null
        },
        {
            "domain": "www.yahoo.com",
            "device_ts": 1481152233788,
            "id": 227384472956,
            "event": "click",
            "ts": 1481152233888
        },
        {
            "domain": "https://news.yahoo.com",
            "device_ts": null,
            "id": 2273844742556,
            "event": "page",
            "ts": null
        }
    ],
    "meta": {
        "query_id": -3239746252817510000,
        "query_body": "<EDITED OUT>",
        "query_finish_time": 1481152233799,
        "query_receive_time": 1481152233796
    }
}

Relational Filters using the extended value notation for static values

For the following examples, we will simply show and explain the queries. They also use the extended syntax for specify values in a filter using the kind field.

SIZEOF Filter

This query checks to see if the size of the data_map is equal to 4 and returns all records that satisfy this.

SELECT *
FROM STREAM(30000, TIME)
WHERE SIZEOF(data_map) = 4
LIMIT 1;

CONTAINSKEY Filter

This query checks to see if the data_map contains the key id and returns all records for which this is true.

SELECT *
FROM STREAM(30000, TIME)
WHERE CONTAINSKEY(data_map, 'id')
LIMIT 1;

CONTAINSVALUE Filter

This query checks to see if the data_map does not contain the value btsg8l9b234ha and returns all records for which this is true. If this was applied on a list field or list of maps field, the inner maps would be checked instead.

SELECT *
FROM STREAM(30000, TIME)
WHERE NOT CONTAINSVALUE(data_map, 'btsg8l9b234ha')
LIMIT 1;

Filtering with NOW Keyword

SELECT *
FROM STREAM(30000, TIME)
WHERE event_timestamp >= NOW
LIMIT 10;

BETWEEN Filter

This query checks to see if the field heart_rate is in-between 70 and 100 inclusive and returns all records for which this is true. The BETWEEN operator can be written in two ways as shown below.

SELECT *
FROM STREAM(30000, TIME)
WHERE heart_rate BETWEEN (70, 100)
LIMIT 10;

SELECT *
FROM STREAM(30000, TIME)
WHERE BETWEEN(heart_rate, 70, 100)
LIMIT 10;

IN Filter

This query checks to see if the field color is in the given list and returns all records for which is true.

SELECT *
FROM STREAM(30000, TIME)
WHERE color IN ('red', 'green', 'blue')
LIMIT 10;

Relational Filter comparing to other fields

Instead of comparing to static, constant values, you may use the extended values notation and set kind to FIELD to compare to other fields within the same record. The following query returns the first record for which the id field is set to the uid field.

SELECT *
FROM STREAM(30000, TIME)
WHERE id = uid
LIMIT 1;

A sample result could look like:

{
    "records": [
        {
            "uid":"0qcgofdbfqs9s",
            "experience":"web",
            "lid":"978500434",
            "id":"0qcgofdbfqs9s",
            "other fields": "<EDITED OUT>"
        }
    ],
    "meta": {
        "query_id": "c4a336e0-3bb5-452a-8503-40d8751b92d9",
        "query_body": "<EDITED OUT>",
        "query_finish_time": 1536192342505,
        "query_receive_time": 1536192342507
    }
}

Logical Filters and Projections

SELECT id AS id, experience AS experience, page_id AS pid,
       link_id AS lid, tags AS tags, demographics.age AS age
FROM STREAM(60000, TIME)
WHERE (id = 'c14plm1begla7' AND ((experience = 'web' AND page_id IN ['18025', '47729'])
                                  OR link_id RLIKE '2.*'))
      OR (tags.player = 'true' AND demographics.age > '65')
LIMIT 1;

Typing

If demographics.age was of type Long, then Bullet will convert 85 to be an Long, but in this example, we are pretending that it is String. So, no conversion is made. Similarly for link_id, id, experience and page_id. tags is a Map of String to Boolean so Bullet converts "true" to the Boolean true. See below on how to rewrite it with casting.

Logical Filters and Projections with Casting

```SQL
SELECT id AS id, experience AS experience, page_id AS pid,
       link_id AS lid, tags AS tags, CAST(demographics.age AS LONG) AS age
FROM STREAM(60000, TIME)
WHERE (id = 'c14plm1begla7' AND ((experience = 'web' AND page_id IN ['18025', '47729'])
                                  OR link_id RLIKE '2.*'))
      OR (tags.player = 'true' AND CAST(demographics.age AS LONG) > 65L)
LIMIT 1;
```

This query is looking for a single event with a specific id and either the page_id is in two specific pages on the "web" experience or with a link_id that starts with 2, or a player event where the age is greater than "65". In other words, it is looking for senior citizens who generate video player events or the events of a particular person (based on id) events on two specific pages or a group of pages that have link that have ids that start with 2. It then projects out only these fields with different names.

A sample result could look like (it matched because of tags.player was true and demographics.age was > 65):

{
    "records": [
        {
            "pid":"158",
            "id":"0qcgofdbfqs9s",
            "experience":"web",
            "lid":"978500434",
            "age":66,
            "tags":{"player":true}
        }
    ],
    "meta": {
        "query_id": 3239746252812284004,
        "query_body": "<EDITED OUT>",
        "query_finish_time": 1481152233805,
        "query_receive_time": 1481152233881
    }
}

GROUP ALL COUNT Aggregation

An example of a query performing a COUNT all records aggregation would look like:

SELECT COUNT(*) AS numSeniors
FROM STREAM(20000, TIME)
WHERE demographics.age > 65

This query will count the number events for which demographics.age > 65. The aggregation type GROUP indicates that it is a group aggregation. To group by a key, the fields key needs to be set in the aggregation part of the query. If fields is empty or is omitted (as it is in the query above) and the type is GROUP, it is as if all the records are collapsed into a single group - a GROUP ALL. Adding a COUNT in the operations part of the attributes indicates that the number of records in this group will be counted, and the "newName" key denotes the name the resulting column "numSeniors" in the result. Setting the duration to 20000 counts matching records for this duration.

A sample result would look like:

{
    "records": [
        {
            "numSeniors": 363201
        }
    ],
    "meta": {}
}

This result indicates that 363,201 records were counted with demographics.age > 65 during the 20s the query was running.

GROUP ALL Multiple Aggregations

COUNT is the only GROUP operation for which you can omit a "field".

SELECT COUNT(*) AS numCalifornians, AVG(demographics.age) AS avgAge,
       MIN(demographics.age) AS minAge, MAX(demographics.age) AS maxAge
FROM STREAM(20000, TIME)
WHERE demographics.state = 'california'

Note that the GROUP BY () is optional.

A sample result would look like:

{
    "records": [
        {
            "maxAge": 94.0,
            "numCalifornians": 188451,
            "minAge": 6.0,
            "avgAge": 33.71828
        }
    ],
    "meta": {
        "query_id": 8051040987827161000,
        "query_body": "<EDITED OUT>",
        "query_finish_time": 1482371927435,
        "query_receive_time": 1482371916625
    }
}

This result indicates that, among the records observed during the 20s this query ran, there were 188,451 users with demographics.state equal to "california". Among these users the average age was 33.71828, the max age observed was 94, and the minimum age observed was 6.

Exact COUNT DISTINCT Aggregation

SELECT COUNT(DISTINCT browser_name, browser_version) AS "COUNT DISTINCT"
FROM STREAM(10000, TIME);

This gets the count of the unique browser names and versions in the next 30s (default duration). Note that we do not specify values for the keys in fields. This is because they are not relevant

{
    "records": [
        {
            "COUNT DISTINCT": 158.0
        }
    ],
    "meta": {
        "query_id": 4451146261377394443,
        "sketches": {
            "standard_deviations": {
                "1": {
                    "upperBound": 158.0,
                    "lowerBound": 158.0
                },
                "2": {
                    "upperBound": 158.0,
                    "lowerBound": 158.0
                },
                "3": {
                    "upperBound": 158.0,
                    "lowerBound": 158.0
                }
            },
            "was_estimated": false,
            "family": "THETA",
            "theta": 1.0,
            "size": 1280
        },
        "query_body": "<EDITED OUT>",
        "query_finish_time": 1484084869073,
        "query_receive_time": 1484084832684
    }
}

There were 158 unique combinations on browser names and versions in our dataset for those 30 s. Note the new sketches object in the meta. It has various metadata about the result. In particular, the was_estimated key denotes where the result was estimated or not. The standard_deviations key denotes the confidence at various sigmas: 1 (1 sigma = ~68% confidence, 2 sigma = ~95% confidence, 3 sigma = ~99% confidence). Since this result was not estimated, the result is the same as the upper and lower bounds for the result.

Approximate COUNT DISTINCT

SELECT COUNT(DISTINCT ip_address) AS uniqueIPs
FROM STREAM(10000, TIME);

This query gets us the unique IP addresses in the next 10 s. It renames the result column from "COUNT DISTINCT" to "uniqueIPs".

{
   "records":[
      {
         "uniqueIPs":130551.07952805843
      }
   ],
   "meta":{
      "query_id":5377782455857451480,
      "sketches":{
         "standard_deviations":{
            "1":{
               "upperBound":131512.85413760383,
               "lowerBound":129596.30223107953
            },
            "2":{
               "upperBound":132477.15103015225,
               "lowerBound":128652.93906100772
            },
            "3":{
               "upperBound":133448.49248615955,
               "lowerBound":127716.46773622213
            }
         },
         "was_estimated":true,
         "family":"THETA",
         "theta":0.12549877074343688,
         "size":131096
      },
      "query_body":"<EDITED OUT>",
      "query_finish_time":1484090240812,
      "query_receive_time":1484090223351
   }
}

The number of unique IPs in our dataset was 130551 in those 10s (approximately) with the true value between (129596, 131512) at 68% confidence, (128652, 132477) at 95% confidence and (127716, 133448) at 99% confidence. In the worst case at 3 sigma (99% confidence), our error is 2.17%. The final result was computed with 131096 bytes or ~128 KiB as denoted by size. This happens to be maximum size the the COUNT DISTINCT sketch will take up at the default nominal entries, so even if we had billions of unique IPs, the size will be the same and the error may be higher (depends on the distribution). For example, the error when the same query was run for 30s was 2.28% at 99% confidence (actual unique IPs: 559428, upper bound: 572514). In fact, the worst the error can get at this Sketch size is 2.34% as defined here, regardless of the number of unique entries added to the Sketch!.

DISTINCT Aggregation

SELECT browser_name AS browser
FROM STREAM(30000, TIME)
GROUP BY browser_name
LIMIT 10;

SELECT DISTINCT browser_name AS browser
FROM STREAM(30000, TIME)
LIMIT 10;

This query gets the distinct values for the browser_name field and limit the results to 10. It runs for 30 s.

{
   "records":[
      {
         "browser":"opera"
      },
      {
         "browser":"flock"
      },
      {
         "browser":"links"
      },
      {
         "browser":"mozilla firefox"
      },
      {
         "browser":"dolfin"
      },
      {
         "browser":"lynx"
      },
      {
         "browser":"chrome"
      },
      {
         "browser":"microsoft internet explorer"
      },
      {
         "browser":"aol browser"
      },
      {
         "browser":"edge"
      }
   ],
   "meta":{
      "query_id":-4872093887360741287,
      "sketches":{
         "standard_deviations":{
            "1":{
               "upperBound":28.0,
               "lowerBound":28.0
            },
            "2":{
               "upperBound":28.0,
               "lowerBound":28.0
            },
            "3":{
               "upperBound":28.0,
               "lowerBound":28.0
            }
         },
         "was_estimated":false,
         "family":"TUPLE",
         "uniques_estimate":28.0,
         "theta":1.0
      },
      "query_body":"<EDITED OUT>",
      "query_finish_time":1485469087971,
      "query_receive_time":1485469054070
   }
}

There were 28 unique results but we asked for 10, so the query returned a uniform sample across the 28 distinct values.

DISTINCT is just an alias for GROUP. A GROUP by with no operations is exactly a DISTINCT.

GROUP by Aggregation

SELECT demographics.country AS country, device AS device,
       COUNT(*) AS count, AVG(demographics.age) AS averageAge,
       AVG(timespent) AS averageTimespent
FROM STREAM(20000, TIME)
WHERE demographics IS NOT NULL
GROUP BY demographics.country, device
LIMIT 50;

This query groups by the country and the device and for each unique group gets the count, average age and time spent by the users for the next 20 seconds. It renames demographics.country to country and does not rename device. It limits the groups to 50. If there were more than 50 groups, the results would be a uniform sampling of the groups (but each group in the result would have the correct result). These parameters can all be tweaked in the configuration.

{
   "records":[
      {
         "country":"uk",
         "device":"desktop",
         "count":203034,
         "averageAge":32.42523,
         "averageTimespent":1.342
      },
      {
         "country":"us",
         "device":"desktop",
         "count":1934030,
         "averageAge":29.42523,
         "averageTimespent":3.234520
      },
      "...EDITED 41 other such records out for readability...",
   ],
   "meta":{
      "query_id":1705911449584057747,
      "sketches":{
         "standard_deviations":{
            "1":{
               "upperBound":43.0,
               "lowerBound":43.0
            },
            "2":{
               "upperBound":43.0,
               "lowerBound":43.0
            },
            "3":{
               "upperBound":43.0,
               "lowerBound":43.0
            }
         },
         "was_estimated":false,
         "family":"TUPLE",
         "uniques_estimate":43.0,
         "theta":1.0
      },
      "query_body":"<EDITED OUT>",
      "query_finish_time":1485217172780,
      "query_receive_time":1485217148840
   }
}

We received 43 rows for this result. The maximum groups that was allowed for the instance of Bullet was 512. If there were more groups than the maximum specified by your configuration, a uniform sample across them would be chosen for the result. However, for each group, the values computed (average, count) would be exact. The standard deviations, whether the result was estimated and the number of approximate uniques in the metadata would reflect the change.

If you asked for 50 rows in the aggregation (as the query did above) but there were more than 50 in the result (but < 512), the metadata would reflect the fact that the result was not estimated. You would still get a uniform sample but by increasing your aggregation size higher, you could get the rest.

For readability, if you were just trying to get the unique values for a field or a set of fields, you could leave out the attributes section and specify your fields section. You could also call the type DISTINCT instead of GROUP to make that explicit. DISTINCT is just an alias for GROUP. See the DISTINCT example.

QUANTILE DISTRIBUTION

SELECT QUANTILE(duration, LINEAR, 11)
FROM STREAM(5000, TIME);

This query creates 11 points from 0 to 1 (both inclusive) and finds the percentile values of the duration field (which contains an amount of time in ms) at 0, 0.1, 0.2 ... 1.0 or the 0th, 10th, 20th and 100th percentiles. It runs for 5 seconds and returns at most 11 points. As long as the size is set to higher than the number of points you generate, DISTRIBUTION queries will return all your values.

The SQL is not really the same since it will produce one row instead of 11.

{
   "records":[
      {
         "Value":1,
         "Quantile":0
      },
      {
         "Value":1352,
         "Quantile":0.1
      },
      {
         "Value":3045,
         "Quantile":0.2
      },
      {
         "Value":6501,
         "Quantile":0.30000000000000004
      },
      {
         "Value":10700,
         "Quantile":0.4
      },
      {
         "Value":17488,
         "Quantile":0.5
      },
      {
         "Value":28659,
         "Quantile":0.6
      },
      {
         "Value":47929,
         "Quantile":0.7
      },
      {
         "Value":83447,
         "Quantile":0.7999999999999999
      },
      {
         "Value":177548,
         "Quantile":0.8999999999999999
      },
      {
         "Value":83525609,
         "Quantile":1
      }
   ],
   "meta":{
      "query_finish_time":1493748546533,
      "query_body": "<EDITED OUT>",
      "query_id":2981902209347343400,
      "sketches":{
         "normalized_rank_error":0.002389303789572841,
         "size":16416,
         "minimum_value":1,
         "items_seen":1414,
         "maximum_value":83525609,
         "family":"QUANTILES",
         "was_estimated":false
      },
      "query_receive_time":1493748538259
   }
}

The result shows the values at the 0th, 10th percentiles etc. The was_estimated key indicates that the result was not approximated. Note the the minimum_value and the maximum_value correspond to the 0th and 100th percentiles. There is also a normalized_rank_error that describes the error (see below for a detailed explanation) This is constant for all DISTRIBUTION queries and does not depend on the data or the query.

Normalized Rank Error

Unlike GROUP and COUNT DISTINCT, the order in which the data arrives to Bullet can affect the results of a DISTRIBUTION query. The error when the result is estimated is not a Gaussian error function and is not described in terms of the values of your field. In other words, if the 50th percentile was estimated to some value, you could not bound the true median by using the the estimated value +/- constant (see below for a good approximation). The error is expressed in terms of the normalized rank. If one were to sort the true data stream, you would obtain a rank for each item from 0 to the stream length (items_seen in the metadata above). If you then divided each rank by length, you would get ranks from 0 to 1. In this domain, a normalized rank error of 0.002, for example, means that a value returned for the 0.50 or 50th percentile could actually lie between 0.498 and 0.502 in the normalized ranks with 99% confidence.

Distribution Accuracy lists the normalized rank error as a percentage for the maximum size of the Sketch used. If you obtain successive quantile values at granularities lower than this rank error, the results may not be accurate. While the sketch speaks of the normalized rank error, you can still obtain reasonable bounds for values. For example, if the normalized rank error was 1% and you obtained quantile values at 0.48, 0.50 and 0.52, you could use the values at 0.52 and 0.48 as very reasonable upper and lower bounds on your true median (you might even be able to use 0.49 and 0.51 if the error was 1%).

PMF DISTRIBUTION Aggregation

SELECT FREQ(duration, REGION, 2000, 20000, 500)
FROM STREAM(5000, TIME);

This query creates 37 points from 2000 to 20000 in 500 increments to bucketize the duration field using these points as split locations and finds the count of duration values that fall into these intervals. It runs for 5s and returns at most 100 records (this means it will return the 38 records).

The SQL does not include the (-∞ to 2000) and the [20000 to +∞) intervals and does not produce a probability.

{
   "records":[
      {
         "Probability":0.1518054532056006,
         "Count":206,
         "Range":"(-∞ to 2000.0)"
      },
      {
         "Probability":0.0397936624907885,
         "Count":53.99999999999999,
         "Range":"[2000.0 to 2500.0)"
      },
      "...EDITED 34 other such records out for readability...",
      {
         "Probability":0.0058953574060427415,
         "Count":8,
         "Range":"[19500.0 to 20000.0)"
      },
      {
         "Probability":0.45689019896831246,
         "Count":620,
         "Range":"[20000.0 to +∞)"
      }
   ],
   "meta":{
      "query_finish_time":1493750074795,
      "query_body": "<EDITED OUT>",
      "query_id":-2590566941995678000,
      "sketches":{
         "normalized_rank_error":0.002389303789572841,
         "size":16416,
         "minimum_value":1,
         "items_seen":1357,
         "maximum_value":78240570,
         "family":"QUANTILES",
         "was_estimated":false
      },
      "query_receive_time":1493750066022
   }
}

The result consists of 38 records, each denoting an interval in the domain we asked for. The result was not estimated. Note that the interval is denoted by the Range key and the count by the Count key. There is also a probability that estimates how likely a value for duration is likely to fall into that range.

CDF DISTRIBUTION Aggregation

SELECT CUMFREQ(duration, MANUAL, 20000, 2000, 15000, 45000)
FROM STREAM(5000, TIME);

This query specifies a list of points manually using points property in attributes. It runs for 5s and finds the cumulative frequency distribution using the specified points as break points. It returns at most 100 records (which means we will get all of the intervals).

There is no easy SQL equivalent because the points are free-form. It does not produce a probability field like Bullet does.

{
   "records":[
      {
         "Probability":0.14382632293080055,
         "Count":212.00000000000003,
         "Range":"(-∞ to 2000.0)"
      },
      {
         "Probability":0.5210312075983717,
         "Count":767.9999999999999,
         "Range":"(-∞ to 15000.0)"
      },
      {
         "Probability":0.5603799185888738,
         "Count":826,
         "Range":"(-∞ to 20000.0)"
      },
      {
         "Probability":0.6994572591587517,
         "Count":1031,
         "Range":"(-∞ to 45000.0)"
      },
      {
         "Probability":1,
         "Count":1474,
         "Range":"(-∞ to +∞)"
      }
   ],
   "meta":{
      "query_finish_time":1493755151660,
      "query_body": "<EDITED OUT>",
      "query_id":-8460702488693518000,
      "sketches":{
         "normalized_rank_error":0.002389303789572841,
         "size":16416,
         "minimum_value":2,
         "items_seen":1474,
         "maximum_value":10851113,
         "family":"QUANTILES",
         "was_estimated":false
      },
      "query_receive_time":1493755143626
   }
}

The result contains the 5 intervals produced by the split points. It was not estimated so these counts are exact. Note that the start of each interval is -∞ because it is the cumulative frequency distribution.

Exact TOP K Aggregation

There are two methods for executing a TOP K aggregation in BQL:

SELECT TOP(500, 100, demographics.country, browser_name) AS numEvents
FROM STREAM(10000, TIME)
WHERE demographics.country IS NOT NULL AND browser_name IS NOT NULL;

OR:

SELECT demographics.country, browser_name, COUNT(*) AS numEvents
FROM STREAM(10000, TIME)
WHERE demographics.country IS NOT NULL AND browser_name IS NOT NULL
GROUP BY demographics.country, browser_name
HAVING COUNT(*) >= 100
ORDER BY COUNT(*) DESC
LIMIT 500;

This query gets the top 500 country, browser combinations where the count of records for each combination is at least 100. It runs for 10s.

{
   "records":[
      {
         "country":"us",
         "browser":"google chrome",
         "numEvents":2729
      },
      {
         "country":"us",
         "browser":"mozilla firefox",
         "numEvents":1072
      },
      {
         "country":"uk",
         "browser":"google chrome",
         "numEvents":703
      },
      {
         "country":"fr",
         "browser":"google chrome",
         "numEvents":383
      },
      {
         "country":"fr",
         "browser":"mozilla firefox",
         "numEvents":278
      },
      {
         "country":"es",
         "browser":"google chrome",
         "numEvents":234
      },
      "...EDITED 10 other such records here for readability",
      {
         "country":"es",
         "browser":"mozilla firefox",
         "numEvents":102
      },
      {
         "country":"fr",
         "browser":"apple safari",
         "numEvents":101
      }
   ],
   "meta":{
      "query_finish_time":1493760034414,
      "query_body": "<EDITED OUT>",
      "query_id":7515243052399540000,
      "sketches":{
         "maximum_count_error":0,
         "active_items":431,
         "items_seen":10784,
         "family":"FREQUENCY",
         "was_estimated":false
      },
      "query_receive_time":1493760020807
   }
}

The results gave us the top 18 country, browser combinations that had counts over a 100. Note the maximum_count_error key in the metadata. This represents how off the count is. It is 0 because these counts are exact. In our data stream, we only had 18 unique combinations of countries and browser names at the time the query was run.

Approximate TOP K Aggregation

There are two methods for executing a TOP K aggregation in BQL:

SELECT TOP(10, 100, browser_name, browser_version, os_name, os_version, demographics.country, demographics.state) AS numEvents
FROM STREAM(10000, TIME)
WHERE os_name IS NOT NULL AND browser_name IS NOT NULL;

OR:

SELECT browser_name, browser_version, os_name, os_version, demographics.country, demographics.state, COUNT(*) AS numEvents
FROM STREAM(10000, TIME)
WHERE os_name IS NOT NULL AND browser_name IS NOT NULL
GROUP BY browser_name, browser_version, os_name, os_version, demographics.country, demographics.state
HAVING COUNT(*) >= 100
ORDER BY COUNT(*) DESC
LIMIT 10;

In order to make the result approximate, this query adds more dimensions to the Exact TOP K query. It runs for 30s and looks for the top 10 combinations for these events.

{
   "records":[
      {
         "country":"null",
         "os":"mac os x",
         "browser":"google chrome",
         "numEvents":120823,
         "state":"null",
         "bversion":"56",
         "oversion":"10.12"
      },
      {
         "country":"null",
         "os":"mac os x",
         "browser":"google chrome",
         "numEvents":4539,
         "state":"null",
         "bversion":"35",
         "oversion":"10.9"
      },
      {
         "country":"us",
         "os":"mac os x",
         "browser":"google chrome",
         "numEvents":3827,
         "state":"null",
         "bversion":"57",
         "oversion":"10.12"
      },
      {
         "country":"null",
         "os":"ios",
         "browser":"apple safari",
         "numEvents":3426,
         "state":"null",
         "bversion":"9.0",
         "oversion":"9.1"
      },
      {
         "country":"null",
         "os":"windows nt",
         "browser":"microsoft internet explorer",
         "numEvents":2264,
         "state":"null",
         "bversion":"6.0",
         "oversion":"5.1"
      },
      {
         "country":"us",
         "os":"mac os x",
         "browser":"google chrome",
         "numEvents":1995,
         "state":"null",
         "bversion":"58",
         "oversion":"10.12"
      },
      {
         "country":"null",
         "os":"windows nt",
         "browser":"google chrome",
         "numEvents":1416,
         "state":"null",
         "bversion":"57",
         "oversion":"10.0"
      },
      {
         "country":"null",
         "os":"windows nt",
         "browser":"google chrome",
         "numEvents":1327,
         "state":"null",
         "bversion":"58",
         "oversion":"10.0"
      },
      {
         "country":"null",
         "os":"mac os x",
         "browser":"google chrome",
         "numEvents":1187,
         "state":"null",
         "bversion":"57",
         "oversion":"10.12"
      },
      {
         "country":"null",
         "os":"ios",
         "browser":"apple safari",
         "numEvents":1119,
         "state":"null",
         "bversion":"4.0",
         "oversion":"3.0"
      }
   ],
   "meta":{
      "query_finish_time":1493761419611,
      "query_body": "<EDITED OUT>",
      "query_id":-8797534873217479000,
      "sketches":{
         "maximum_count_error":24,
         "active_items":746,
         "items_seen":187075,
         "family":"FREQUENCY",
         "was_estimated":true
      },
      "query_receive_time":1493761386294
   }
}

Like DISTRIBUTION, the distribution of the data matters for TOP K. Depending on the distribution, your results could produce different counts and errors bounds if approximate.

Since we only filtered for nulls in a couple of fields, the top results end up being fields with null values. Note that the maximum_count_error is now 24 and the was_estimated property is set to true. 24 means that the upper bound - the lower bound for the Count field for each combination could be off by at most 24. Since Bullet gives you the upper bound, this means that if you subtract 24 from it, you get the lower bound of the true count.

Note that this also means the order of the items could be off. If two items had Count within 24 of each other, it is possible that the higher one may actually have had a true count lower than the second one and possibly be ranked higher. There is no such situation in this result set.

Lateral View Explode

SELECT student, score
FROM STREAM(30000, TIME)
LATERAL VIEW EXPLODE(test_scores) AS (student, score)
WHERE score >= 80
LIMIT 10;

This query explodes the map test_scores to the fields student and score. This effectively generates a record with a key field and value field for each entry in the exploded map. The lateral view means the generated records are appended to the original record, though in this query, only the exploded fields have been selected.

{
   "records":[
      {
         "student": "Roger",
         "score": 90
      },
      {
         "student": "Albert",
         "score": 92
      },
      {
         "student": "Emily",
         "score": 90
      },
      {
         "student": "Winston",
         "score": 81
      },
      {
         "student": "Jeff",
         "score": 95
      },
      {
         "student": "Kristen",
         "score": 97
      },
      {
         "student": "Percy",
         "score": 85
      },
      {
         "student": "Tyson",
         "score": 80
      },
      {
         "student": "Jackie",
         "score": 89
      },
      {
         "student": "Alice",
         "score": 100
      }
   ],
   "meta": "<EDITED OUT>"
}

Multiple Lateral View Explodes

Multiple lateral view explodes can also be chained in the same query. For instance, using the above example, instead of the map test_scores, there is the list of maps tests. This list could be exploded into a field test_scores which could the be exploded into the fields student and score as before.

SELECT student, score
FROM STREAM(30000, TIME)
LATERAL VIEW EXPLODE(tests) AS test_scores
LATERAL VIEW EXPLODE(test_scores) AS (student, score)
WHERE score >= 80
LIMIT 10;

Outer Query

SELECT COUNT(*)
FROM (
    SELECT browser_name, COUNT(*)
    FROM STREAM(30000, TIME)
    GROUP BY browser_name
    HAVING COUNT(*) > 10
)

This query has an inner query wrapped by an outer query. Note that the inner query selects from STREAM and is thus the main query while the outer query selects from the inner query. Note also that the inner/main query can have a window while the outer query cannot.

The query above counts the number of browser names that appear more than 10 times in 30 seconds.

{
   "records":[
      {
         "COUNT(*)": 6
      }
   ],
   "meta": "<EDITED OUT>"
}

Window - Tumbling Group-By

SELECT demographics.country AS country, COUNT(*) AS count, AVG(demographics.age) AS averageAge,
       AVG(timespent) AS averageTimespent
FROM STREAM(20000, TIME)
WHERE demographics IS NOT NULL
GROUP BY demographics.country
WINDOWING TUMBLING(5000, TIME)
LIMIT 50;

This query specifies a tumbling window that will emit every 5 seconds and contain 5 seconds of data per window. Results will come back to the user every 5 seconds, and since the duration of the query is 20 seconds, the user will receive a total of 4 results. Since the aggregation size is set to 5, each returned window will contain only 5 groups (which will be chosen randomly). The result might look like this:

"records":[
    {
        "country":"Germany",
        "count":1,
        "averageAge":25.0
    },
    {
        "country":"Canada",
        "count":106,
        "averageAge":22.58490566037736
    },
    {
        "country":"USA",
        "count":1,
        "averageAge":28.0
    },
    {
        "country":"England",
        "count":8,
        "averageAge":34.25
    },
    {
        "country":"Peru",
        "count":9,
        "averageAge":30.0
    }
],
"meta":{
    "Window":{
        "Number":1,
        "Emit Time":1529458403038,
        "Expected Emit Time":1529458403023,
        "Name":"Tumbling"
        },
    "Query":{
        "ID":"448d228a-1eed-471f-8777-c800cc866535",
        "Receive Time":1529458398023,
        "Body":"...(query body)...}",
        "Sketch":{
            "Was Estimated":false,
            "Uniques Estimate":100.0,
            "Family":"TUPLE",
            "Theta":1.0,
            "Standard Deviations":{
                "1":{
                    "upperBound":100.0,
                    "lowerBound":100.0
                },
                "2":{
                    "upperBound":100.0,
                    "lowerBound":100.0
                },
                "3":{
                    "upperBound":100.0,
                    "lowerBound":100.0
                }
            }
        }
    }
}

"records":[  
   {  
      "country":"Canada",
      "count":101,
      "averageAge":32.742574257425744
   },
   {  
      "country":"ht",
      "count":2,
      "averageAge":32.0
   },
   {  
      "country":"England",
      "count":16,
      "averageAge":27.0625
   },
   {  
      "country":"Peru",
      "count":8,
      "averageAge":23.625
   },
   {  
      "country":"Bangladesh",
      "count":3,
      "averageAge":27.66666666666667
   }
],
"meta":{  
   "Window":{  
      "Number":2,
      "Emit Time":1529458408036,
      "Expected Emit Time":1529458408023,
      "Name":"Tumbling"
   },
   "Query":{  
      "ID":"448d228a-1eed-471f-8777-c800cc866535",
      "Receive Time":1529458398023,
      "Body":"...(query body)..."
   },
   "Sketch":{  
      "Was Estimated":false,
      "Uniques Estimate":98.0,
      "Family":"TUPLE",
      "Theta":1.0,
      "Standard Deviations":{  
         "1":{  
            "upperBound":98.0,
            "lowerBound":98.0
         },
         "2":{  
            "upperBound":98.0,
            "lowerBound":98.0
         },
         "3":{  
            "upperBound":98.0,
            "lowerBound":98.0
         }
      }
   }
}

"records":[  
   {  
      "country":"Canada",
      "count":121,
      "averageAge":27.97520661157025
   },
   {  
      "country":"Haiti",
      "count":3,
      "averageAge":39.0
   },
   {  
      "country":"Cabuyao laguna",
      "count":2,
      "averageAge":28.0
   },
   {  
      "country":"USA",
      "count":1,
      "averageAge":20.0
   },
   {  
      "country":"England",
      "count":23,
      "averageAge":40.869565217391305
   }
],
"meta":{  
   "Window":{  
      "Number":3,
      "Emit Time":1529458413031,
      "Expected Emit Time":1529458413023,
      "Name":"Tumbling"
   },
   "Query":{  
      "ID":"448d228a-1eed-471f-8777-c800cc866535",
      "Receive Time":1529458398023,
      "Body":"...(query body)..."
   },
   "Sketch":{  
      "Was Estimated":false,
      "Uniques Estimate":104.0,
      "Family":"TUPLE",
      "Theta":1.0,
      "Standard Deviations":{  
         "1":{  
            "upperBound":104.0,
            "lowerBound":104.0
         },
         "2":{  
            "upperBound":104.0,
            "lowerBound":104.0
         },
         "3":{  
            "upperBound":104.0,
            "lowerBound":104.0
         }
      }
   }
}

"records":[  
   {  
      "country":"Canada",
      "count":117,
      "averageAge":21.82051282051282
   },
   {  
      "country":"Azerbaijan",
      "count":1,
      "averageAge":30.0
   },
   {  
      "country":"England",
      "count":13,
      "averageAge":30.923076923076923
   },
   {  
      "country":"Congo",
      "count":1,
      "averageAge":32.0
   },
   {  
      "country":"Bangladesh",
      "count":3,
      "averageAge":24.333333333333336
   }
],
"meta":{  
   "Window":{  
      "Number":4,
      "Emit Time":1529458418030,
      "Expected Emit Time":1529458418023,
      "Name":"Tumbling"
   },
   "Query":{  
      "Finish Time":1529458418030,
      "ID":"448d228a-1eed-471f-8777-c800cc866535",
      "Receive Time":1529458398023,
      "Body":"...(query body)..."
   },
   "Sketch":{  
      "Was Estimated":false,
      "Uniques Estimate":108.0,
      "Family":"TUPLE",
      "Theta":1.0,
      "Standard Deviations":{  
         "1":{  
            "upperBound":108.0,
            "lowerBound":108.0
         },
         "2":{  
            "upperBound":108.0,
            "lowerBound":108.0
         },
         "3":{  
            "upperBound":108.0,
            "lowerBound":108.0
         }
      }
   }
}

Window - Additive Tumbling

SELECT COUNT(*) AS count, AVG(demographics.age) AS averageAge,
       AVG(timespent) AS averageTimespent
FROM STREAM(20000, TIME)
WHERE demographics IS NOT NULL
WINDOWING EVERY(5000, TIME, ALL)
LIMIT 50;

The above query will run for 20 seconds and emit a result every 5 seconds. The result will contain the average age and the count of the records seen since the very beginning of the query. Results might look like this:

"records":[  
   {  
      "count":8493,
      "averageAge":28.8828796983622
   }
],
"meta":{  
   "Window":{  
      "Number":1,
      "Emit Time":1529522392188,
      "Expected Emit Time":1529522392089,
      "Name":"Tumbling"
   },
   "Query":{  
      "ID":"12e48fbd-a20f-4f5e-8135-0f012d9ba3ef",
      "Receive Time":1529522387089,
      "Body":"...(query body)..."
   }
}

"records":[  
   {  
      "count":17580,
      "averageAge":29.842629482071715
   }
],
"meta":{  
   "Window":{  
      "Number":2,
      "Emit Time":1529522397191,
      "Expected Emit Time":1529522397089,
      "Name":"Tumbling"
   },
   "Query":{  
      "ID":"12e48fbd-a20f-4f5e-8135-0f012d9ba3ef",
      "Receive Time":1529522387089,
      "Body":"...(query body)..."
   }
}

"records":[  
   {  
      "count":26317,
      "averageAge":29.86675792835957
   }
],
"meta":{  
   "Window":{  
      "Number":3,
      "Emit Time":1529522402185,
      "Expected Emit Time":1529522402089,
      "Name":"Tumbling"
   },
   "Query":{  
      "ID":"12e48fbd-a20f-4f5e-8135-0f012d9ba3ef",
      "Receive Time":1529522387089,
      "Body":"...(query body)..."
   }
}

"records":[  
   {  
      "count":35259,
      "averageAge":29.8303102557552
   }
],
"meta":{  
   "Window":{  
      "Number":4,
      "Emit Time":1529522407182,
      "Expected Emit Time":1529522407089,
      "Name":"Tumbling"
   },
   "Query":{  
      "Finish Time":1529522407182,
      "ID":"12e48fbd-a20f-4f5e-8135-0f012d9ba3ef",
      "Receive Time":1529522387089,
      "Body":"...(query body)..."
   }
}

Sliding Window of Size 1 with Max Duration

SELECT *
FROM STREAM(MAX, TIME)
WHERE "browser-id" = '2siknmdd6kaqm'
WINDOWING EVERY(1, RECORD, FIRST, 1, RECORD)

This is a query that will capture raw data, and has a sliding window of size 1. This query will return window results immediately whenever a single record that matches the filters flows through the system. The filters in this example will only match records from a particular browser.

This query will run for the maximum amount of time that the backend is configured to allow.

Results might look like this:

"records":[  
   {  
      "country":"USA",
      "event":"page",
      "browser-id":"2siknmdd6kaqm"
   }
],
"meta":{  
   "Window":{  
      "Number":1,
      "Size":1,
      "Emit Time":1529521479235,
      "Name":"Sliding"
   },
   "Query":{  
      "ID":"31d65a12-ed56-4cc8-81ec-6a8bfe9301ba",
      "Receive Time":1529521475015,
      "Body":"...(query body)... "
   }
}

"records":[  
   {  
      "country":"USA",
      "event":"click",
      "browser-id":"2siknmdd6kaqm"
   }
],
"meta":{  
   "Window":{  
      "Number":6,
      "Size":1,
      "Emit Time":1529521764875,
      "Name":"Sliding"
   },
   "Query":{  
      "ID":"e9595eb4-ea95-418b-8cff-d00736bf216f",
      "Receive Time":1529521757459,
      "Body":"...(query body)..."
   }
}

... (one result returned for each record found for as long as the backend is configured to allow) ...