API Best Practices
Read these guidelines before building production integrations. They are especially important for reporting queries, bulk updates, and scripts that run on a schedule.
How our analytics database works
We currently use SingleStore as our analytics database provider.
This database stores recently accessed data in a hot cache, which is very fast to query. Over time data gets moved to cheaper cold storage for archiving -- this is what allows us to offer extremely competitive data retention costs.
When a query touches data that is not in the hot cache, the system must first move it back from cold storage, a process that takes time and depends on network speeds.
As a result, the first query against old data will take some time, and subsequent queries will be much faster because the data is now in the hot cache.
Additionally, our analytics DB is columnar, which means each column (attribute/metric) of data is stored independently. Caching is also per column, so the more columns a query requests, the slower it will be. Once the data is hot cached, you can expect very little additional overhead from the extra columns, other than the network latency of sending a larger response.
Technical limitations
Reporting queries use a resource pool to prevent excessive queries. There is a maximum parallel concurrency and a queue to prevent API workloads from overloading the reporting service.
If a lot of queries are happening in parallel, queries may be moved to a queue where they have to wait before processing. This does not apply to small, lightweight queries that are expected to process quickly (a distinction made by the DB server's workload management).
Our current query timeout is 30 seconds. Our reporting service will time out the connection after 30 seconds, and the query will also be killed on the DB server. Note that this does not apply to cost and conversion updates, which are queued and processed by a separate service.
General programmatic guidelines
- Avoid excessive parallel queries. Initial queries may need to move cold data into hot cache. Subsequent queries can then become faster. Excessive parallel queries compete with each other and may repeat the same slow work.
- Add delays when looping through days, assets, accounts, or other repeated workloads.
- Avoid instant retries. If a query hits the 30-second timeout, wait before retrying. Use exponential backoff where possible.
- Avoid splitting a dataset into a large number of small queries. Larger grouped queries are usually more efficient.
- Avoid repeatedly pulling large historical ranges. Old data is unlikely to change.
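The delay and backoff guidelines above can be sketched roughly as follows. This is a minimal illustration, not an official client: `QueryTimeout`, `run_with_backoff`, and `run_sequentially` are hypothetical names, the query itself is passed in as a plain callable so that no particular HTTP library or endpoint is assumed, and the delay values are arbitrary starting points rather than recommended settings.

```python
import random
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class QueryTimeout(Exception):
    """Hypothetical error your own client raises when a query times out."""


def run_with_backoff(
    query: Callable[[], R],
    max_attempts: int = 4,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Run one query, waiting longer after each timeout before retrying."""
    for attempt in range(max_attempts):
        try:
            return query()
        except QueryTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the timeout to the caller
            # Exponential backoff: base, 2x base, 4x base, plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


def run_sequentially(
    items: Iterable[T],
    make_query: Callable[[T], Callable[[], R]],
    pause: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Run one query per item, one at a time, pausing between items."""
    results: List[R] = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(pause)  # fixed delay between consecutive workloads
        results.append(run_with_backoff(make_query(item), sleep=sleep))
    return results
```

In a script that loops over days or accounts, `make_query` would return a zero-argument function that performs the actual request for that item and raises `QueryTimeout` when the 30-second limit is hit.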
Querying reporting data
- Our database data is segmented by customer and date/hour. Additional parameters such as funnel or traffic source will not reduce the amount of data the DB needs to access. It is usually more effective to group by funnel ID and traffic source ID in a single query than to run multiple queries for individual funnel/traffic-source pairs.
- Use `whitelistFilters` where possible to scope your queries if you want to reduce data payloads.
- Data has to move from cold storage to hot cache on the first query, so initial queries will always be slower, with subsequent ones becoming faster. Use this to your advantage: avoid parallel queries, let the initial query take as long as it needs, then wait briefly (e.g. a few seconds) before running the next query that covers the same date/time range.
- Additionally, as a columnar database, data for columns is stored separately. Queries will be faster when fewer columns are requested. Use `restrictToMetrics` for this.
- Unique visitor counts require distinct counting, which is expensive. If you do not need unique visitor counts, do not request them. Queries with uniques can be 3-10x slower. To get uniques, pass the boolean flag and request `Visitors` in `restrictToMetrics`.
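To illustrate the point about grouping, the sketch below assumes you have already run a single report grouped by funnel ID and traffic source ID, and splits the returned rows locally instead of issuing one query per pair. The row shape and the field names `funnelId` and `trafficSourceId` are assumptions made for this example and may not match the actual response format, so adjust the keys to whatever your responses contain.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

Row = Mapping[str, object]
Pair = Tuple[object, object]


def split_by_pair(
    rows: Iterable[Row],
    funnel_key: str = "funnelId",          # assumed field name
    source_key: str = "trafficSourceId",   # assumed field name
) -> Dict[Pair, List[Row]]:
    """Bucket rows from one grouped query by (funnel, traffic source)."""
    buckets: Dict[Pair, List[Row]] = defaultdict(list)
    for row in rows:
        buckets[(row[funnel_key], row[source_key])].append(row)
    return dict(buckets)


# Made-up sample rows, standing in for the result of one grouped query.
sample_rows = [
    {"funnelId": "f1", "trafficSourceId": "t1", "visits": 10},
    {"funnelId": "f1", "trafficSourceId": "t2", "visits": 5},
    {"funnelId": "f2", "trafficSourceId": "t1", "visits": 7},
]
by_pair = split_by_pair(sample_rows)
```

Because the database reads the same customer and date/hour segments either way, one grouped query followed by a local split like this avoids repeating that work once per funnel/traffic-source pair.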
Raw event querying
- Our raw event API now requires `restrictToMetrics` values that mirror our DB columns
- The API is limited to 15 concurrent columns
- Maximum returned row count is 10,000
- The raw hits API is not intended to download or mirror your raw FunnelFlux data. Programmatic usage that constantly pulls raw data is not permitted and may result in suspended access.
- The API is intended for debugging, pulling conversion logs, and similar targeted workflows. It is not provided as a mechanism to automate exporting all account data.
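A client-side check along the following lines can catch requests that would exceed the limits above before they are sent. It is only a sketch: the function names are hypothetical, the column names passed in are whatever your integration uses, and the check interprets the column limit as applying to the columns listed in a single request.

```python
from typing import Iterable, List

MAX_COLUMNS = 15      # column limit stated above
MAX_ROWS = 10_000     # maximum returned row count stated above


def check_raw_event_columns(columns: Iterable[str]) -> List[str]:
    """Return the de-duplicated column list, or raise if it breaks the limits."""
    unique = list(dict.fromkeys(columns))  # drop duplicates, keep order
    if not unique:
        raise ValueError("restrictToMetrics must list at least one column")
    if len(unique) > MAX_COLUMNS:
        raise ValueError(
            f"{len(unique)} columns requested, the limit is {MAX_COLUMNS}"
        )
    return unique


def may_be_truncated(row_count: int) -> bool:
    """A response with the maximum row count may have been cut off."""
    return row_count >= MAX_ROWS
```

If `may_be_truncated` returns true, narrowing the date range or filters is more in keeping with the intended use than paging through everything.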