feat(bigframes): implement ai.similarity #16771

Merged
sycai merged 3 commits into main from sycai_ai_similarity on Apr 27, 2026

@sycai commented Apr 22, 2026

Fixes b/497837587

@sycai requested review from a team as code owners April 22, 2026 23:55
@sycai requested review from mpovoa and removed request for a team April 22, 2026 23:55
@sycai marked this pull request as draft April 22, 2026 23:55

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the similarity function to the BigQuery AI module, enabling cosine similarity calculations between strings or series using Vertex AI models. The implementation includes the necessary operation definitions, compiler registrations for Ibis and SQLGlot, and comprehensive system and unit tests. Review feedback identified a consistent typo in the model name 'embedding-gemma-300m' within the documentation and test cases.
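
For readers unfamiliar with the metric named above, here is a minimal sketch of cosine similarity in plain Python. It is independent of bigframes and is not the code added by this pull request; the function name `cosine_similarity` and the three-element vectors are made up for illustration, standing in for the embedding vectors a model would produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real ones come from a model.
apple = [1.0, 2.0, 0.0]
fruit = [2.0, 4.0, 0.0]  # points in the same direction as `apple`
car = [0.0, 0.0, 3.0]    # orthogonal to `apple`

print(cosine_similarity(apple, fruit))  # close to 1: same direction
print(cosine_similarity(apple, car))    # 0: no shared direction
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and the score ignores vector length.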

Comment thread packages/bigframes/bigframes/bigquery/_operations/ai.py
Comment thread packages/bigframes/tests/system/small/bigquery/test_ai.py
@sycai marked this pull request as ready for review April 23, 2026 00:10
@sycai requested review from TrevorBergeron and chelsea-lin and removed request for mpovoa April 23, 2026 00:10
@sycai added the kokoro:force-run label (forces Kokoro to re-run the tests) Apr 27, 2026
@yoshi-kokoro removed the kokoro:force-run label Apr 27, 2026
@sycai merged commit d4afa2c into main Apr 27, 2026
39 of 40 checks passed
@sycai deleted the sycai_ai_similarity branch April 27, 2026 20:23
shuoweil added a commit that referenced this pull request May 13, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v0.13.0
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/python-librarian-generator@sha256:234b9d1f2ddb057ed7ac6a38db0bf8163d839c65c6cf88ade52530cddebce59e
<details><summary>bigframes: v2.40.0</summary>

## [v2.40.0](bigframes-v2.39.0...bigframes-v2.40.0) (2026-05-13)

### Features

* Add `bigframes.execution_history` API to track BigQuery jobs (#16588)
([fa20a74](fa20a740))
  ```python
  import bigframes.pandas as bpd
  bpd.options.compute.enable_execution_history = True
  df = bpd.read_gbq("my_table")
  # ... perform operations ...
  history = bpd.execution_history
  print(history.jobs) # Access BigQuery job details for executed queries
  ```

* Implement `ai.similarity` and `ai.embed` for text embeddings and
semantic similarity (#16771, #16759)
([d4afa2c](d4afa2c8),
[fcb4579](fcb4579b))
  ```python
  import bigframes.pandas as bpd
  # Generate embeddings
  df["embeddings"] = bpd.bigquery.ai.embed(df["text_col"])
  # Compute similarity
  df["similarity"] = bpd.bigquery.ai.similarity(
      df["embeddings_a"], df["embeddings_b"]
  )
  ```

* Support `hparam_range` and `hparam_candidates` parameters for
hyperparameter tuning in model creation (#16640)
([ca47835](ca47835c))
* Update `ai.score`, `ai.classify` and `ai.if_` parameters to match
their SQL equivalents (#16919, #16990, #16857)
([9f42fe1](9f42fe14),
[e9c52b1](e9c52b12),
[f3cb4ad](f3cb4ad0))
* Support unstable sorting in `sort_values` and `sort_index` (#16665)
([bbdeb70](bbdeb70f))
* Support loading Avro and ORC data formats (#16555)
([6d46cba](6d46cba3))
* Add NumPy ufunc support directly on column expressions (#16554)
([2f792ab](2f792abd))

### Bug Fixes

* Fix bugs compiling ambiguous ids and in subqueries (#16617)
([479e44d](479e44dd))

* BigFrames respects the BigQuery default region (#16933)
([ef9945a](ef9945a5))

* Avoid views when querying BigLake tables from SQL cells (#16562)
([fdd3e0d](fdd3e0de))

* Avoid `copy` argument warning in `to_pandas` (#16917)
([fe5245b](fe5245b8))

### Performance Improvements

* Improve Write API upload throughput (#16641)
([ef856b0](ef856b04))

### Documentation

* Add docs to the `to_csv` methods of DataFrame and Series (#16570)
([a8fccef](a8fccefd))

</details>

3 participants