1. Objective
Design an efficient, scalable, and cost-effective logging architecture on Google Cloud Storage (GCS) for storing login history and event logs, with downstream integration into analytics and processing services such as BigQuery and Dataflow.
2. Choosing the Right Format: JSON vs. Parquet
Format | Pros | Cons | Ideal Use Case |
---|---|---|---|
JSON | Human-readable; easy to generate; flexible structure | Large file size; slower for analytical queries | Real-time logging, debugging, streaming |
Parquet | Compact and compressed; columnar (fast queries); schema-defined | Not human-readable; requires batch processing and libraries | Batch analytics, BigQuery integration |
3. File Organization Strategy
Organize logs in GCS using a partitioned folder structure for efficient retrieval and lifecycle management:
gs://your-bucket-name/logs/{service}/{year}/{month}/{day}/file.parquet
Example:
gs://login-logs/auth-service/2025/03/25/logins.parquet
- Helps with lifecycle policies and cost control
- Enables selective loading into BigQuery using partition filters
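As a minimal sketch of writing into this layout with the Python `google-cloud-storage` client (the bucket name, service name, and file names below are placeholders):

```python
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage


def upload_log_file(local_path: str, bucket_name: str, service: str) -> str:
    """Upload a local log file under logs/{service}/{year}/{month}/{day}/."""
    now = datetime.now(timezone.utc)
    object_name = (
        f"logs/{service}/{now:%Y}/{now:%m}/{now:%d}/"
        f"logins-{now:%H%M%S}.parquet"
    )
    client = storage.Client()
    client.bucket(bucket_name).blob(object_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{object_name}"


# Example with placeholder names:
# upload_log_file("logins.parquet", "login-logs", "auth-service")
```

Including a timestamp in the object name keeps concurrent writers within the same day from overwriting each other.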
4. Compression
- JSON: Use GZIP compression to reduce file size (.json.gz)
- Parquet: Compression (e.g., Snappy) is natively supported and highly efficient
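To illustrate both options, the sketch below writes the same two records as gzip-compressed NDJSON and as Snappy-compressed Parquet; the field names and output paths are invented for the example.

```python
import gzip
import json

import pyarrow as pa
import pyarrow.parquet as pq  # pip install pyarrow

records = [
    {"user_id": "u-123", "event": "login", "ts": "2025-03-25T08:00:00Z"},
    {"user_id": "u-456", "event": "login", "ts": "2025-03-25T08:00:05Z"},
]

# NDJSON + GZIP: one JSON object per line, the whole file gzip-compressed.
with gzip.open("logins.json.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Parquet + Snappy: columnar layout; Snappy is pyarrow's default codec.
table = pa.Table.from_pylist(records)
pq.write_table(table, "logins.parquet", compression="snappy")
```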
5. Write Patterns
Pattern | Recommendation |
---|---|
Small, frequent logs | Accumulate logs in memory or buffer and write periodically (e.g., every 5 min) |
Batch processing | Combine multiple entries into one file to reduce the number of small writes |
Streaming use cases | Prefer newline-delimited JSON (NDJSON) for compatibility and simplicity |
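The buffered-write pattern might look like the following sketch, assuming a single-process writer and the 5-minute interval from the table above; a production version would also flush on shutdown and handle upload retries.

```python
import gzip
import json
import threading
import time
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage

FLUSH_INTERVAL_SECONDS = 300  # 5 minutes, per the table above

_buffer: list[dict] = []
_lock = threading.Lock()


def log_event(event: dict) -> None:
    """Cheap in-memory append; no GCS call on the hot path."""
    with _lock:
        _buffer.append(event)


def flush_to_gcs(bucket_name: str, service: str) -> None:
    """Write all buffered events as one gzip NDJSON object, then clear the buffer."""
    with _lock:
        events = list(_buffer)
        _buffer.clear()
    if not events:
        return
    ndjson = "\n".join(json.dumps(e) for e in events) + "\n"
    now = datetime.now(timezone.utc)
    name = f"logs/{service}/{now:%Y}/{now:%m}/{now:%d}/events-{now:%H%M%S}.json.gz"
    blob = storage.Client().bucket(bucket_name).blob(name)
    blob.upload_from_string(gzip.compress(ndjson.encode("utf-8")))


def run_flusher(bucket_name: str, service: str) -> None:
    """Background loop; call flush_to_gcs once more on shutdown."""
    while True:
        time.sleep(FLUSH_INTERVAL_SECONDS)
        flush_to_gcs(bucket_name, service)
```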
6. Integration with BigQuery
Format | Method |
---|---|
Parquet | Use native support for external tables or scheduled ingestion |
NDJSON | Define the schema in BigQuery (or use schema auto-detection) and load files directly |
Automation options:
- Cloud Functions + Pub/Sub: Trigger on file upload for streaming pipelines
- Cloud Scheduler + Dataflow / Cloud Run: Scheduled batch ingestion
- BigQuery Data Transfer Service (for periodic ingestion)
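For the scheduled batch path, a minimal sketch of loading one day's Parquet partition into BigQuery with the Python client; the project, dataset, and table identifiers are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

table_id = "my-project.logging.login_events"                    # placeholder table
uri = "gs://login-logs/logs/auth-service/2025/03/25/*.parquet"  # one day's partition

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```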
7. Security and Access Control
- Apply fine-grained IAM: Grant only required roles to each identity
- Use Uniform Bucket-Level Access (UBLA) for centralized control
- Enable Object Versioning: Prevent accidental overwrites or deletions
- Consider Customer-Managed Encryption Keys (CMEK) for compliance-sensitive data
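Most of these settings can be applied from the storage client; a sketch, assuming the bucket already exists and the caller has admin rights on it (the bucket name and service account are placeholders).

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("login-logs")  # placeholder bucket name

# Uniform bucket-level access: IAM only, no per-object ACLs.
bucket.iam_configuration.uniform_bucket_level_access_enabled = True

# Object versioning: keep prior generations of overwritten/deleted objects.
bucket.versioning_enabled = True
bucket.patch()

# Fine-grained IAM: grant the log writer only objectCreator on this bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectCreator",
        "members": {"serviceAccount:log-writer@my-project.iam.gserviceaccount.com"},
    }
)
bucket.set_iam_policy(policy)
```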
8. Lifecycle Management
Reduce storage cost by configuring GCS lifecycle policies. Example JSON configuration:
{
"rule": [
{
"action": { "type": "Delete" },
"condition": { "age": 180 }
}
]
}
This rule deletes objects older than 180 days (6 months). You can also configure transitions to Nearline, Coldline, or Archive storage classes based on data access patterns.
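The same rules can also be set programmatically; a sketch with the Python client that adds the 180-day delete rule plus a hypothetical 90-day transition to Coldline.

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("login-logs")  # placeholder bucket name

# Transition to Coldline after 90 days, delete after 180 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=180)
bucket.patch()
```

If you keep the policy as a JSON file instead, it can be applied with `gsutil lifecycle set lifecycle.json gs://login-logs`.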
9. Monitoring & Alerting
- Use Cloud Monitoring to observe storage usage and object creation
- Set budget alerts with Cloud Billing to detect cost spikes
- Enable Object Change Notifications via Pub/Sub for pipeline triggers
- Integrate with Cloud Logging for end-to-end observability
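As an example of the notification setup, the sketch below registers a Pub/Sub topic (placeholder name, which must already exist and be publishable by the GCS service agent) for new-object events, so a pipeline can react to each upload.

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("login-logs")  # placeholder bucket name

# Publish to the pre-created topic whenever a new object is finalized.
notification = bucket.notification(
    topic_name="login-log-uploads",     # placeholder Pub/Sub topic
    event_types=["OBJECT_FINALIZE"],
    payload_format="JSON_API_V1",
)
notification.create()
```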
10. Summary
Requirement | Recommended Format / Method |
---|---|
Real-time log ingestion | JSON (NDJSON preferred, gzip compressed) |
Efficient analytics | Parquet |
Debugging / readability | JSON |
BigQuery integration | Parquet (preferred), or NDJSON |
Storage cost optimization | Parquet + Lifecycle rules |
Compliance / encryption | Use CMEK + IAM |
By following these best practices, you can build a robust logging system on GCS that is optimized for cost, performance, security, and future analytical needs.