
FAQ: When to Use a Seeq Data Lab Project vs. External Platforms

Overview

This article addresses common questions about the boundaries of Seeq Data Lab Projects, especially for large-scale modeling, analytics, and heavy compute workloads.

Seeq Data Lab Projects are designed for advanced scripting and analytics on top of Seeq Workbench data. However, they are not intended to replace high-performance ML platforms or distributed compute frameworks. The following Q&A clarifies common concerns around compute, versioning, large-scale modeling, and best practices.


1. Compute / Performance / Distribution

Q1a. What are the practical limits (CPU, memory, parallelism) for a Data Lab server before performance degrades significantly?
A: Data Lab Projects run on a single server or VM with finite CPU and memory resources. Practical upper limits depend on the deployment size chosen by your administrator, but in general:

  • Default sizing: 0.5 vCPUs and 3.5 GB RAM per project (typical baseline configuration).

  • Performance considerations: Performance degrades when notebooks attempt to load tens of millions of samples of raw time-series data into memory. This can occur due to any combination of large time spans, small grid sizes, or many signals being pulled simultaneously.

  • Parallelism: Limited to Python’s standard concurrency mechanisms (threads or processes) within a single host — there is no distributed or multi-node compute capability.

Q1b. Do Data Lab Projects support auto-scaling, multi-node compute, or distributed execution of large notebooks?
A: No. Each Data Lab Project executes in its own single container; it does not auto-scale or distribute work across multiple nodes.

Q1c. If not, what is the recommended architectural pattern (e.g. run heavy jobs externally and only push results back to Seeq)?
A: The recommended pattern is to perform heavy compute externally (e.g., Databricks, Azure ML, SageMaker, Spark clusters) and use the Seeq Python Library (SPy) to pull data from Seeq, process it externally, and then push summarized results or model outputs back into Seeq.
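This pull → process externally → push pattern can be sketched as below. The sketch assumes SPy (`seeq.spy`) is installed on the external platform and an authenticated session exists; the search terms and date range are purely illustrative. The aggregation helper is plain pandas and works anywhere.

```python
import pandas as pd

def summarize(raw: pd.DataFrame) -> pd.DataFrame:
    """Reduce raw samples to hourly means -- the kind of compact
    result worth pushing back into Seeq, rather than raw data."""
    return raw.resample("60min").mean()

def run_round_trip():
    # Requires SPy on the external platform (and a spy.login() session).
    # Imported lazily so the pure-pandas helper above stays usable without it.
    from seeq import spy

    # Hypothetical query -- substitute your own signal names/types.
    items = spy.search({"Name": "Temperature", "Type": "Signal"})
    raw = spy.pull(items, start="2024-01-01", end="2024-02-01", grid="1min")
    results = summarize(raw)
    spy.push(data=results)  # write summarized outputs back into Seeq
```

The heavy lifting (here, just a resample; in practice, feature engineering or model scoring) happens on the external platform, and only the compact results travel back over SPy.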


2. Versioning & Model / Dataset Management

Q2a. Do Data Lab Projects provide built-in version control for datasets, models, and code (beyond standard Git integration)?
A: No. Data Lab Projects do not include dataset/model versioning beyond what you set up yourself. Git integration is available for code versioning, but datasets and model artifacts should be managed externally (e.g., object storage, MLflow, Git-LFS). SPy is typically used to fetch input signals on demand and write outputs/results back into Seeq, rather than persisting large artifacts inside a Data Lab Project.

Q2b. Are there storage or metadata limitations (e.g. number of versions, retention, size) for datasets or model artifacts in Data Lab?
A: Yes. Data Lab Projects have finite storage, and all files within a project count against that allocation. It is perfectly fine to store files in Data Lab — for example, notebooks, scripts, logs, and small datasets — but performance and stability can degrade when a project contains large files or a large number of smaller files. Both can quickly consume available disk and slow project operations.

Best practice is to keep only the files required for your current analysis in Data Lab and store large or long-term datasets externally.
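Because storage is finite, it can help to check usage from a notebook before writing large files. A minimal standard-library sketch (the default path is illustrative; point it at your project's working directory):

```python
import shutil

def storage_report(path: str = ".") -> dict:
    """Report total/used/free space (in GB) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {
        "total_gb": round(usage.total / gb, 2),
        "used_gb": round(usage.used / gb, 2),
        "free_gb": round(usage.free / gb, 2),
    }
```

Running `storage_report()` periodically makes it obvious when intermediate files should be cleaned up or moved to external storage.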

Q2c. In high-usage environments, how do you recommend handling branching / staging / rollback of analytical pipelines?
A: These practices are best managed entirely in your 3rd-party ML or data engineering platform, not inside Seeq. Use tools such as Git, MLflow, or DVC (Data Version Control, an open-source tool for dataset/model versioning) for branching, staging, and rollback of code, datasets, and models. The Seeq Python Library (SPy) should be used on the external platform to pull input signals from Seeq when needed, and to push results or model outputs back once pipelines are executed.


3. Modeling Many Features Over Long Time Horizons

Q3a. For use cases such as ~1,000 derived features (calculations) over 10–20 years of data, are Data Lab Projects still viable? Where do you expect bottlenecks (memory, I/O, compute, caching)?
A: This is beyond the intended scale of a Data Lab Project. Bottlenecks will appear when pulling very wide/long datasets into memory — especially I/O from Seeq Workbench to the Data Lab Project, and memory pressure when processing years of second-level data. Expect degraded responsiveness in the notebook environment. For this scale, run feature engineering externally and use SPy to manage data transfer.

Q3b. Are there known limits (or sweet spots) on the number of features, number of signals, or time span for which Data Lab is still responsive?
A: There’s no hard-coded limit, but performance depends on the total number of samples being processed — not just the number of signals. Typical sweet spots are:

  • Tens to a few hundred signals spanning months to a couple of years of data.

  • Performance begins to degrade when notebooks attempt to load tens of millions of samples into memory, which can result from any combination of large time spans, small grid sizes, or many signals.

Q3c. Do you recommend streaming / incremental / chunked processing to avoid memory blow-ups, and if so, are there helper utilities or patterns in SPy or Data Lab to support that?
A: Yes. Chunked data pulls and incremental processing are recommended. SPy supports windowed/interval queries (via the `start`, `end`, and `grid` arguments to `spy.pull`), which you can loop over to build a larger result set without exhausting memory. For modeling workflows, process data in slices, aggregate, and store results back to Seeq rather than holding everything in RAM.
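The chunked pattern can be sketched as follows. The window generator is plain pandas and runs anywhere; the `spy.pull` call inside the loop assumes SPy is available, and the window/grid sizes are illustrative, not recommendations.

```python
import pandas as pd

def iter_windows(start, end, freq="30D"):
    """Yield consecutive (window_start, window_end) pairs that tile [start, end]."""
    edges = pd.date_range(start, end, freq=freq)
    if len(edges) == 0 or edges[-1] < pd.Timestamp(end):
        edges = edges.append(pd.DatetimeIndex([pd.Timestamp(end)]))
    for ws, we in zip(edges[:-1], edges[1:]):
        yield ws, we

def chunked_pull(items, start, end, grid="1min", freq="30D"):
    """Pull one window at a time, keep only the aggregates, and release the raw data."""
    from seeq import spy  # requires SPy and an authenticated session
    pieces = []
    for ws, we in iter_windows(start, end, freq):
        raw = spy.pull(items, start=ws, end=we, grid=grid)
        pieces.append(raw.resample("1D").mean())  # keep daily aggregates only
    return pd.concat(pieces)
```

Because each raw window goes out of scope after its aggregate is appended, peak memory stays proportional to one window rather than the full time span.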


4. Best Practices / Workarounds

Q4a. What are additional best practices for heavy Data Lab Project usage?
A: Additional best practices include:

  • Minimize pulling raw second-by-second data; pre-aggregate in Workbench or use Capsules.

  • Use smaller queries in loops with SPy instead of one massive pull.

  • Write intermediate results back to Seeq and reload as needed instead of holding large objects in memory.

  • Offload large training jobs to external compute, using SPy as a way to pull and push the data.

Q4b. Are there architectural patterns (e.g. hybrid — Data Lab plus external compute) that Seeq recommends?
A: Yes. The hybrid pattern is the most common:

  • Use Data Lab for exploratory scripting, lightweight feature engineering, and integration with Seeq Workbench.

  • Use 3rd Party ML/analytics platforms for heavy compute, long-running training, or massive datasets.

  • Use SPy as the pipeline to pull input signals into the 3rd party platform and to push model results or engineered features back into Seeq.


5. General Guidance: When to Use a Data Lab Project vs. External Platforms

  • Use Seeq Data Lab Projects when you need:

    • Exploratory scripting and analysis

    • Lightweight modeling and data prep

    • Tight integration with Seeq Workbench (capsules, conditions, calculations)

  • Use a 3rd-party ML/data platform with SPy when you need:

    • Large-scale training (hundreds+ features, decades of data)

    • Distributed compute or auto-scaling

    • Dataset/model versioning at enterprise scale

    • Long-running jobs that exceed a Data Lab Project’s resource envelope

    • A pipeline to move signals in/out of Seeq for advanced ML workflows
