Amazon S3 is the default landing zone for data in the cloud. Log files, Parquet datasets, CSV exports, JSON events, IoT telemetry, and raw data dumps — it all ends up in S3 buckets. But S3 is storage, not an analytics engine. You can’t run SQL against S3 natively. To query it, you need Amazon Athena (per-TB pricing), AWS Glue ETL jobs (cluster management), or a data warehouse that imports the data. All add cost, complexity, and latency.
Dremio Cloud connects directly to S3 and lets you query files in place using standard SQL. Dremio reads Parquet, CSV, JSON, Delta Lake, and Apache Iceberg table formats. It pushes projection and filter operations into its vectorized query engine and caches frequently accessed data on local NVMe drives (Columnar Cloud Cache, or C3) for near-instantaneous repeat queries.
For organizations with hundreds or thousands of S3 buckets accumulated over years, data lake sprawl is a major challenge. Data lands in S3 from application logs, CDC pipelines, third-party integrations, and manual uploads — often without consistent naming conventions, schemas, or documentation. Dremio provides the organizational layer: connect S3 buckets, create views that standardize column names and types, build a semantic layer with wiki descriptions, and expose clean datasets to analysts and AI tools. This turns an unstructured “data swamp” into a governed, queryable data lake.
Why S3 Users Need Dremio
SQL on Your Data Lake Without Athena Costs
Athena charges per terabyte of data scanned. For large datasets queried frequently — dashboards refreshing every 15 minutes, analysts exploring data, scheduled reports — costs grow unpredictably. Dremio’s Reflections pre-compute results so repeated queries don’t re-scan S3. C3 caching further reduces S3 GET requests. You pay for Dremio compute time, not per-TB scanned.
Format Flexibility
Dremio reads Parquet, CSV, JSON, Avro, Delta Lake, and Apache Iceberg from S3. You don’t need to convert everything to one format before querying. Mixed-format data lakes work out of the box.
Federation with Databases and Warehouses
Your event data is in S3, but your customer data is in PostgreSQL, your financial data is in Snowflake, and your marketing data is in BigQuery. Dremio joins across all of them in a single SQL query without copying data between systems.
Apache Iceberg Table Management
Create Iceberg tables in Dremio’s Open Catalog (backed by S3 or Dremio-managed storage) with full DML support. Dremio automatically handles compaction (merging small files), manifest rewriting, clustering, and vacuuming — no manual OPTIMIZE jobs needed.
AI on S3 Data
Dremio’s AI Agent, MCP Server, and AI SQL Functions make your raw S3 files queryable by business users and external AI tools — no data engineering required.
Prerequisites
- AWS Account with S3 access
- IAM Role or Access Key/Secret Key with
s3:GetObject,s3:ListBucket, ands3:GetBucketLocationpermissions - Bucket names or specific paths you want to query
- Dremio Cloud account — sign up free for 30 days with $400 in compute credits
Step-by-Step: Connect S3 to Dremio Cloud
1. Add the Source
Click ”+” and select Amazon S3.
2. Configure Connection
- Name: Descriptive identifier (e.g.,
s3-datalakeorevent-logs). - Authentication: IAM Role ARN (recommended) or Access Key/Secret Key.
3. Configure Advanced Options
| Setting | Purpose | When to Use |
|---|---|---|
| Root Path | Starting path in the bucket | Restrict to subfolder: /data/analytics/ |
| Allowlisted Buckets | Limit which buckets appear | Multi-bucket accounts |
| Enable partition column inference | Extract partition keys from folders | Hive-style partitioned data |
| Default CTAS Format | CREATE TABLE format | Iceberg recommended |
| Encrypt Connection | Enable SSL | Always recommended |
| Requester Pays | For requester-pays buckets | Cross-account access |
| Enable compatibility mode | S3-compatible storage | MinIO, R2, etc. |
| Connection Properties | Custom settings | fs.s3a.endpoint for non-AWS |
4. Set Reflection and Metadata Refresh, then Save
Query S3 Data
-- Query Parquet files
SELECT event_type, user_id, event_timestamp, page_url
FROM "s3-datalake".events."user_events.parquet"
WHERE event_type = 'purchase' AND event_timestamp > '2024-01-01';
-- Query partitioned data (e.g., year=2024/month=01/)
SELECT region, product_category, SUM(revenue) AS total_revenue
FROM "s3-datalake".sales.transactions
GROUP BY region, product_category
ORDER BY total_revenue DESC;
Federate S3 with Other Sources
SELECT
c.customer_name,
c.segment,
COUNT(e.event_id) AS s3_events,
SUM(CASE WHEN e.event_type = 'purchase' THEN e.revenue ELSE 0 END) AS s3_revenue,
pg.lifetime_value AS crm_lifetime_value
FROM "postgres-crm".public.customers c
LEFT JOIN "s3-datalake".events.user_events e ON c.customer_id = e.user_id
LEFT JOIN "postgres-crm".public.customer_metrics pg ON c.customer_id = pg.customer_id
GROUP BY c.customer_name, c.segment, pg.lifetime_value
ORDER BY s3_revenue DESC;
Create Iceberg Tables from S3 Data
Promote raw S3 files into managed Iceberg tables:
CREATE TABLE analytics.bronze.clean_events AS
SELECT event_type, user_id, CAST(event_timestamp AS TIMESTAMP) AS event_time, page_url, revenue
FROM "s3-datalake".events."user_events.parquet"
WHERE event_type IS NOT NULL;
The Iceberg table benefits from automatic compaction, time travel, results caching, and Autonomous Reflections.
S3-Compatible Storage
Dremio’s S3 connector works with S3-compatible storage:
- MinIO: Enable compatibility mode, set
fs.s3a.endpointto your MinIO endpoint. - Cloudflare R2: Same pattern, with R2’s S3-compatible endpoint.
- DigitalOcean Spaces: Compatibility mode + custom endpoint.
- Amazon FSx for NetApp ONTAP: Set the S3 Access Point alias as the root path, ensure IAM permissions include FSx-specific actions.
Build a Semantic Layer
CREATE VIEW analytics.gold.event_metrics AS
SELECT
DATE_TRUNC('day', CAST(event_timestamp AS TIMESTAMP)) AS event_date,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users,
SUM(revenue) AS daily_revenue,
CASE
WHEN COUNT(*) > 10000 THEN 'High Activity'
WHEN COUNT(*) > 1000 THEN 'Normal Activity'
ELSE 'Low Activity'
END AS activity_level
FROM "s3-datalake".events.user_events
GROUP BY 1, 2;
In the Catalog, click Edit → Generate Wiki and Generate Tags.
AI-Powered Analytics on S3 Data
Dremio AI Agent
Ask “What’s our daily purchase revenue trend this month?” and the AI Agent generates SQL from your semantic layer. The wiki descriptions guide the Agent’s understanding of event types and metrics.
Dremio MCP Server
Connect Claude or ChatGPT to your S3 data:
- Create a Native OAuth app in Dremio Cloud
- Configure redirect URLs
- Connect via
mcp.dremio.cloud/mcp/{project_id}
A product analyst asks Claude “Analyze user engagement patterns from S3 event data this week” and gets governed results.
AI SQL Functions
-- Classify events with AI
SELECT
event_type,
event_count,
AI_CLASSIFY(
'Based on this event pattern, classify the business impact',
'Event: ' || event_type || ', Count: ' || CAST(event_count AS VARCHAR) || ', Revenue: $' || CAST(daily_revenue AS VARCHAR),
ARRAY['Revenue Driver', 'Engagement Signal', 'Support Indicator', 'Churn Signal']
) AS business_impact
FROM analytics.gold.event_metrics
WHERE event_date = CURRENT_DATE - INTERVAL '1' DAY;
-- Process unstructured data from S3
SELECT
file['path'] AS file_path,
AI_GENERATE(
'Extract key information from this document',
('Summarize the main topics in this file', file)
WITH SCHEMA ROW(summary VARCHAR, category VARCHAR)
) AS extracted_info
FROM TABLE(LIST_FILES('@"s3-datalake"/documents/'))
WHERE file['path'] LIKE '%.pdf';
AI_GENERATE with file references can process unstructured documents (PDFs, images) stored in S3 directly in SQL queries.
Reflections and C3 Caching
For frequently queried S3 data, Dremio provides two layers of acceleration:
Reflections pre-compute query results. Create them on your semantic layer views:
- In the Catalog, navigate to the view
- Click the Reflections tab
- Choose Raw or Aggregation Reflections
- Select columns and set the refresh interval
- Click Save
C3 (Columnar Cloud Cache) automatically caches frequently accessed file data on local NVMe drives. C3 works transparently — no configuration needed. When Dremio reads S3 files, it caches the columnar data locally. Subsequent reads of the same files come from NVMe instead of S3, eliminating S3 GET request costs and latency.
Together, Reflections and C3 mean that frequently executed queries against S3 data run in milliseconds, not seconds.
Governance on S3 Data
S3 has bucket-level IAM policies, but no column-level masking or row-level filtering. Dremio’s Fine-Grained Access Control (FGAC) adds these capabilities:
- Column masking: Mask PII fields (email, IP, user ID) from specific roles. Data engineers see everything; marketing analysts see aggregated metrics only.
- Row-level filtering: Restrict data by user role. Regional analysts see only their region’s events.
- Unified policies: Same governance applies across S3, PostgreSQL, Snowflake, and all other sources.
These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.
Connect BI Tools via Arrow Flight
Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:
- Tableau: Dremio connector for direct Arrow Flight access
- Power BI: Dremio ODBC driver or native connector
- Python/Pandas:
pyarrow.flightfor programmatic access - Looker: Connect via JDBC
- dbt:
dbt-dremioadapter for transformation workflows
All queries benefit from Reflections, governance, and the semantic layer.
VS Code Copilot Integration
Dremio’s VS Code extension with Copilot lets developers query S3 data from their IDE. Ask Copilot “Show me purchase event trends from S3 data this week” and get SQL generated using your semantic layer.
S3 Data Organization Best Practices
How you organize data in S3 directly impacts Dremio’s query performance:
Partition Strategy
Hive-style partitions (year=2024/month=01/day=15/) enable Dremio to skip irrelevant partitions during query planning. The right partition key depends on your query patterns:
- Time-based queries: Partition by
year/month/dayoryear/month. Dremio reads only the partitions matching yourWHEREclause. - Regional queries: Partition by
region/datefor multi-region datasets. - Mixed access: Partition by the most common filter column first (e.g.,
region/year/month).
Avoid over-partitioning (too many small files per partition) or under-partitioning (too few partitions with huge files). Aim for partition sizes between 128 MB and 1 GB.
File Format Recommendations
| Format | Best For | Dremio Support |
|---|---|---|
| Parquet | Structured analytics data | Full support, columnar optimization |
| Apache Iceberg | ACID transactions, time travel | Full read/write support |
| Delta Lake | Databricks ecosystem compatibility | Read support |
| JSON | Semi-structured event data | Full support, schema inference |
| CSV | Legacy data imports | Full support, limited performance |
| Avro | Schema-evolved event streams | Read support |
For analytical workloads, convert CSV and JSON files to Parquet or Iceberg for 10-50x better query performance. Dremio can perform this conversion:
CREATE TABLE analytics.bronze.events_optimized AS
SELECT * FROM "s3-datalake".raw."events.csv";
This creates an Iceberg table from CSV data, giving you columnar storage, automatic compaction, and time travel.
Data Lake Layers
Organize your S3 bucket with a medallion architecture:
raw/— Landing zone for incoming data (CSV, JSON, Parquet from external sources)bronze/— Cleaned, typed versions of raw data (Iceberg tables)silver/— Joined, deduplicated, enriched datasetsgold/— Business-ready views and aggregations for the semantic layer
Dremio’s SQL engine handles the transformations between layers using CREATE TABLE AS SELECT and MERGE statements — no external ETL tools needed.
When to Use S3 vs. Other Storage
Use S3 when: Your data originates in AWS, you need cost-effective long-term storage, you want to use Apache Iceberg tables, your data is in file formats (Parquet, JSON, CSV).
Use managed databases when: Your data requires real-time OLTP operations, your applications need row-level transactions, your data model is heavily relational.
Dremio federates across both — S3 for your data lake and databases for operational data, in a single query.
Get Started
Amazon S3 is the most common data lake storage layer. Dremio Cloud turns it into a queryable, federated, AI-ready analytics platform without Athena costs or data warehouse ETL. Whether your S3 data is in Parquet, CSV, JSON, or Iceberg format, Dremio reads it directly and makes it available for SQL queries, cross-source joins, and AI-powered analytics.
Start by connecting your primary S3 bucket to Dremio Cloud. Create views that standardize your data into business-friendly structures, add wiki descriptions for the AI Agent, and build Reflections on frequently accessed datasets. Within hours, your S3 data lake transforms from raw file storage into a governed, AI-ready analytical platform. No infrastructure to manage and no data to move.
Try Dremio Cloud free for 30 days and connect your S3 buckets.