BigQuery on GCP is easy to switch on and surprisingly easy to misconfigure. The first problem rarely comes from SQL. It shows up when analysts can see a dataset but cannot run jobs, when notebooks fail because the wrong identity is in play, or when someone exports a service account key and sends it around just to keep work moving. At that point, a task that looked like basic setup has already created delivery friction, security debt, and poor auditability.
This step-by-step BigQuery on GCP guide covers the full setup path: project and billing decisions, dataset creation, BigQuery IAM design, and secure credentials for both people and workloads, including the JSON key path when it is truly required. The goal is not simply to make the first query work. It is to create an access model that survives real use, including notebooks, pipelines, offboarding, cost review, and audit.
BigQuery setup decisions happen before the first dataset
The visible task is usually "create a dataset." The real setup starts earlier. In production, cost control, compliance, and access hygiene are shaped by decisions that happen before any table exists.
Start by choosing a dedicated GCP project, or confirm that the existing one already has clear ownership. BigQuery placed inside a shared demo or experimentation project becomes difficult to govern very quickly. A dedicated project gives you cleaner IAM boundaries, clearer labels, and a cost surface that finance and platform teams can actually interpret.
Attach billing if the environment is meant for team use. BigQuery Sandbox is useful for isolated experimentation, but it is rarely the right base for shared datasets, automation, or sustained analysis. If the data team is going to depend on this environment, it is usually better to attach billing once and avoid rebuilding the setup later.
Enable the APIs you actually need. The BigQuery API is mandatory. The BigQuery Storage API is optional, but it is worth planning for if analysts will read data through notebooks, client libraries, or connectors that benefit from higher-throughput reads.
gcloud services enable bigquery.googleapis.com
# Optional when clients or connectors need the Storage API
gcloud services enable bigquerystorage.googleapis.com
Then decide the location before creating the first dataset. Region and multi-region choices affect latency, cross-service architecture, compliance posture, and which datasets can work together cleanly. The hidden cost appears later, when teams discover that data landed in inconsistent locations and every copy or join now has operational consequences.
BigQuery is not operationally ready when the console opens. It is operationally ready when project ownership, billing, and data location are no longer ambiguous.
Create the first dataset as an access and lifecycle boundary
Once the project is ready, create the first dataset from the BigQuery console. This is where many teams move too fast. A dataset is not just a container for tables. In practice, it becomes a boundary for access, retention, and often cost ownership. If that boundary is vague on day one, it usually stays vague when the environment grows.
Use a dataset ID that describes a durable domain or purpose. Names like tmp, test, or dataset1 are harmless for a day and noisy for years. They end up in IAM entries, lineage tooling, notebooks, pipelines, and incident reviews. Names such as analytics_core, sales_reporting, or product_events survive much better because they communicate intent.
Keep the dataset in the location you selected during project planning. Location is not a cosmetic attribute. It affects where data can live, how it interacts with other systems, and what constraints you inherit later. Teams rarely regret being deliberate here, but they often regret improvising.
If the dataset will hold staging data, temporary tables, or exploratory outputs, set default table or partition expiration from the start. This is one of the simplest cost-control mechanisms in BigQuery. Without it, transient data tends to become permanent storage spend because nobody owns cleanup.
Apply labels that match the rest of your GCP operating model. Owner, environment, and cost-attribution labels help far beyond reporting. They make automation, governance, and platform review materially easier once the environment has multiple teams and datasets.
Finally, confirm the encryption posture before access opens up. For many teams, Google's default encryption is entirely sufficient. In regulated environments or under stricter internal controls, the question is whether customer-managed encryption keys are required. That is a policy decision, not something to improvise after the data team is already active.
A simple test is useful here: if this dataset had twenty tables, five users, and two pipelines writing into it three months from now, would the current name, location, retention policy, and ownership still look reasonable? If not, fix it before data starts to accumulate.
BigQuery IAM should separate job execution from data access from day one
In production, BigQuery access usually fails at the identity layer, not at the query layer. Query jobs run at project level and are billed there. Table and view access is usually controlled at dataset level. That is why a user with roles/bigquery.dataViewer on a dataset can still fail to query it if roles/bigquery.jobUser is missing on the project. When teams blur those concerns, the fastest workaround becomes BigQuery Admin, which solves the immediate symptom by widening the blast radius.
A stable access model separates three things: the right to run jobs in a project, the right to read or write a specific dataset, and any additional permissions required for special access patterns such as the Storage API or service account impersonation.
For human users, this baseline works well in many teams:
| Profile | Project-level role | Dataset-level role | Why it matters |
|---|---|---|---|
| Read-only analyst | roles/bigquery.jobUser | roles/bigquery.dataViewer | Can submit query jobs and read tables or views in the target dataset |
| Analytics engineer or data engineer with write access | roles/bigquery.jobUser | roles/bigquery.dataEditor | Can run jobs and create or modify tables in the dataset |
| User or tool reading through the Storage API | roles/bigquery.readSessionUser | Keep the dataset role above | Required for some higher-throughput client read patterns |
The Storage API role is additive. It does not replace dataset access.
The better operating pattern is to grant these roles to groups, not to individuals. A group such as data-team@yourcompany.com becomes the control point. Onboarding and offboarding then happen in Google Workspace or Cloud Identity instead of being scattered across IAM screens and dataset permission panels.
A practical rollout usually looks like this:
- Create a group for the data team.
- Add the relevant users to that group.
- Grant
BigQuery Job Userto the group at project level in IAM & Admin > IAM. - Grant
Data ViewerorData Editoron the required dataset in BigQuery, depending on whether the team only reads data or also writes it. - Add
BigQuery Read Session Useronly if the tools in use actually rely on the Storage API.
The important point is not memorizing role names. It is preserving the separation between project-level execution and dataset-level data access. If one dataset contains mixed sensitivity, fix the boundary by splitting datasets or adding row or column-level controls. Do not solve a weak data model by broadening roles.
Human access should use corporate identity, not shared JSON keys
When teams ask for "credentials for BigQuery," they are usually combining two different questions: how people should authenticate and how software should authenticate. Those answers should not be the same.
For people, the default should be corporate identity backed by group-based IAM. A shared service account key is not a shortcut. It removes attribution, complicates offboarding, increases secret sprawl, and makes incident review unnecessarily hard.
For local work with client libraries, dbt, or notebooks, user-based Application Default Credentials are usually the cleanest pattern:
gcloud auth application-default login
This keeps each person on their own Google identity. Audit logs remain meaningful, permissions map cleanly to IAM grants, and there is no portable secret to pass around just because an analyst needs to run a notebook or script. For Cloud SDK workflows such as bq, the same principle applies: use the individual's account, not a shared service account.
Some organizations prefer not to let local tools use direct user identity. In that case, service account impersonation is usually the next best option, not a long-lived downloaded key. The user or group needs permission to impersonate the target service account, commonly through roles/iam.serviceAccountTokenCreator on that service account.
A common pattern looks like this:
gcloud auth application-default login \
--impersonate-service-account=bq-data-runner@PROJECT_ID.iam.gserviceaccount.com
That approach keeps the runtime identity controlled while avoiding static credential files on laptops. It also preserves a cleaner audit trail than a shared JSON key, because the platform can still distinguish who invoked the impersonation.
Workload identities need dedicated service accounts and keys only as a last resort
Pipelines, schedulers, and external connectors are different from humans. They do need workload identities, but those identities should be purpose-built. One service account per tool or per pipeline may feel slower at the start, yet it is much cheaper to govern once the platform grows. It limits blast radius, makes permissions review readable, and lets you retire access cleanly when a workload is removed.
A sound pattern is straightforward:
- Create a dedicated service account for the exact use case, such as
bq-data-runner. - Grant only the roles it actually needs:
roles/bigquery.jobUserat project levelroles/bigquery.dataViewerorroles/bigquery.dataEditoron the datasets it must accessroles/bigquery.readSessionUseronly if the read path requires the Storage API
- If the workload runs on GCP, attach that identity directly to Cloud Run, GKE, Composer, Compute Engine, or the relevant managed service.
- If the workload runs outside GCP, evaluate Workload Identity Federation before falling back to a static key.
This is the critical distinction: a service account is normal, a downloaded key should be an explicit exception. Keys are portable secrets. Once created, governance shifts from IAM design to secret hygiene and rotation discipline.
If a key is genuinely required, the mechanics are simple:
gcloud iam service-accounts create bq-data-runner \
--display-name="BigQuery Data Runner"
gcloud iam service-accounts keys create ./bq-data-runner.json \
--iam-account=bq-data-runner@PROJECT_ID.iam.gserviceaccount.com
The first command is routine. The second should be justified. If a key exists, store it in a secret manager or vault, define a rotation schedule, and revoke it as soon as the integration is retired. A JSON file copied into shared folders, repositories, or chat threads is not a workable credential strategy.
How to generate a BigQuery JSON key without turning it into permanent debt
Some teams still need a service account JSON key because of a legacy connector, a constrained external tool, or a tightly scoped automation script. If you are looking for how to generate a BigQuery JSON key, the important part is not just the command. It is making sure the key remains narrow in scope, well stored, and easy to revoke.
How to generate the .json key from the GCP console
If you are doing this through the Google Cloud UI, the usual path is:
- Go to IAM & Admin > Service Accounts.
- Create a new service account or open the one you already plan to use, such as
bq-data-runner. - Open the Keys tab.
- Click Add key > Create new key.
- Select JSON and confirm.
- Download the file and move it immediately into secure storage.
That file is only downloadable at creation time. If it is lost, the right response is not to ask for the same file again, but to create a new key and revoke the old one.
How to generate it with the CLI
Create a dedicated service account for that exact integration and grant only the minimum BigQuery permissions it needs. Then generate the key:
gcloud iam service-accounts keys create ./bq-data-runner.json \
--iam-account=bq-data-runner@PROJECT_ID.iam.gserviceaccount.com
Store the file as a secret, not as a convenience artifact. Many client libraries will pick it up through GOOGLE_APPLICATION_CREDENTIALS:
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/bq-data-runner.json"
Then validate the behavior with the same interface the integration will actually use. For example:
bq query --use_legacy_sql=false 'SELECT 1 AS ok'
from google.cloud import bigquery
client = bigquery.Client()
query_job = client.query("SELECT 1 AS ok")
print(list(query_job.result()))
The important test is not just whether the key works. It is whether the key remains contained. It should be rotated intentionally, monitored, and removed when the integration disappears. It should never become the default access model for a human team. If the whole data team depends on one .json file to do everyday work, the BigQuery setup is incomplete.
It is also worth documenting three things alongside the key: which integration uses it, where it is stored, and who owns its rotation or retirement. Without that context, the secret often outlives the workload that justified it.
Before the team starts using BigQuery, this validation gate should already be green
Before announcing that BigQuery is ready, validate the controls below. If this table cannot be passed cleanly, the environment is not ready for team use.
| Control | What must be true | Failure signal |
|---|---|---|
| Project | BigQuery lives in a project with clear ownership, billing, and labeling | Shared project with mixed use and no accountable owner |
| Location | Region or multi-region is consistent with workload, latency, and compliance needs | Datasets spread across locations without a rule |
| Dataset | Naming, retention, and labels follow an operating convention | Generic dataset with no lifecycle controls |
| Human IAM | Access is granted through groups with least privilege | Broad roles assigned manually to individual users |
| Human credentials | People authenticate with corporate identity or controlled impersonation | Analysts sharing one service account key |
| Service accounts | Each workload has a dedicated identity with minimal scope | One service account used across unrelated tools or pipelines |
| Secrets | Any keys that exist are stored securely, rotated, and revocable | JSON files living in chat, drives, or repositories |
If these controls are not clean, the setup is not finished. It is only temporarily unblocked.
FAQ: the objections that show up in week one
Is BigQuery Sandbox enough?
For one person exploring BigQuery, sometimes yes. For a real data team, usually not for long. As soon as you need shared datasets, automation, or governed access, the constraints become more expensive than simply attaching billing and treating the environment as production-capable from the start.
Should I just give the team BigQuery Admin so nobody gets blocked?
It can be a short-lived emergency measure, but it is a poor steady-state model. The operational cost appears later in audit scope, accidental privilege creep, and weak separation between users who need to run jobs and users who need to administer the platform. Separate execution, data access, and special permissions by actual use case.
Can I share one service account across the whole data team?
You can, but it is the wrong pattern for humans. You lose traceability, make offboarding messy, and normalize static secrets where they are not needed. For people, prefer corporate identity and groups. For workloads, use dedicated service accounts tied to specific systems or pipelines.
If one person can run a query today but nobody can explain who has access, how keys are rotated, or which dataset boundary controls retention, the setup is still incomplete.
Primary sources and official documentation
- BigQuery quickstart using the Google Cloud console
- Introduction to BigQuery access control
- BigQuery IAM roles and permissions
- Authenticate to BigQuery
- Create and delete service account keys
- Best practices for using service account keys
Related reading that sharpens the decision
- Data platform for analytics: operational trust and decision speed
- AWS vs GCP vs Azure for AI/ML in 2026: how to choose without operational debt
- MLOps in production: from prototype to a governable system
- Airflow 3 in production: how to migrate without breaking your pipelines
- Data engineering consulting: platforms and pipelines








