Skip to content

Wikibase support#4

Open
philippesaade-wmde wants to merge 9 commits into
mainfrom
wb_support
Open

Wikibase support#4
philippesaade-wmde wants to merge 9 commits into
mainfrom
wb_support

Conversation

@philippesaade-wmde
Copy link
Copy Markdown
Contributor

Including support for textifying Wikibase entities:

  • Added an API parameter for specifying the Wikibase Action API URL.
  • Removed TTL retrieval and normalization because it is not standardized across Wikibase instances.
  • Updated the label cache database to include the Wikibase Action API URL as a primary key, and added a migration function.

Copy link
Copy Markdown

@exowanderer exowanderer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a lot of questions and suggestions. Nothing looks problematic so far, but I have not completed my review -- only 4 out of 13 files

Comment thread src/Textifier/WikidataTextifier.py Outdated
"text": self.text,
"lang": self.lang
}
return {"text": self.text, "lang": self.lang}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reviewability and maintainability standards prefer the previous version:

return {
    "text": self.text,
    "lang": self.lang
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed by Ruff Lint's formatting

Comment thread src/Textifier/WikidataTextifier.py Outdated
self.latitude is not None
and self.longitude is not None
)
return self.latitude is not None and self.longitude is not None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could go either way, but I prefer the maintainabilitiy of the previous version

return (
    self.latitude is not None 
    and self.longitude is not None
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed by Ruff Lint's formatting

Comment thread src/Textifier/WikidataTextifier.py Outdated
and self.label is not None
and str(self.label) != ""
)
return bool(self.id) and self.label is not None and str(self.label) != ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a temp variable here for clarity

    id_and_label_check = bool(self.id) and self.label is not None and str(self.label)

    return id_and_label_check != ""

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread src/Textifier/WikidataTextifier.py Outdated
bool(self.property)
and any(bool(v) for v in self.values)
)
return bool(self.property) and any(bool(v) for v in self.values)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For maintainability and readability, I prefer the previous version

return (
    bool(self.property) and any(bool(v)
    for v in self.values)
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed by Ruff Lint's formatting

Comment thread src/Textifier/WikidataTextifier.py Outdated
@@ -302,8 +289,8 @@ class WikidataClaimValue:
value: Optional[
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little difficult to read. Maybe this could be better:

Optional[
        Union[
            WikidataEntity, 
            WikidataQuantity,
            WikidataTime,
            WikidataCoordinates,
            WikidataText,
            WikidataMonolingualText
        ]
    ] = None

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also changed by Ruff Lint's formatting

Comment thread src/WikidataLabel.py Outdated
@@ -36,10 +37,11 @@


class WikidataLabel(Base):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be generalised to WikibaseLabel(Base)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change and renamed everything with "Wikidata" in it to "Wikibase".
I additionally added aliases to keep support for Wikidata classes and files, otherwise external repositories importing Wikidata would break (such as the vector database pipeline)

Comment thread src/WikidataLabel.py Outdated
"""Create tables if they do not already exist."""
try:
Base.metadata.create_all(engine)
WikidataLabel._migrate_labels_table_for_wikibase()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to generalise the class name, then we would need to change this to WikibaseLabel as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change and renamed everything with "Wikidata" in it to "Wikibase".
I additionally added aliases to keep support for Wikidata classes and files, otherwise external repositories importing Wikidata would break (such as the vector database pipeline)

Comment thread src/WikidataLabel.py Outdated

@staticmethod
def add_bulk_labels(data):
def _migrate_labels_table_for_wikibase():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this completely new?

If so, what problem does it solve or feature does it enable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The database structure has been changed to include the wikibase action api as a new column and primary key. This function runs at start and migrates the old database to the new one by filling this column with Wikidata's action API.

Comment thread src/WikidataLabel.py Outdated
"""
SELECT COUNT(*)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = :schema_name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an F-String?

f"""
SELECT COUNT(*)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = {DB_NAME} # <---- HERE
AND TABLE_NAME = 'labels'
AND COLUMN_NAME = 'wikibase_url'
"""

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bound parameters are safer than f-string because they handle SQL injection and syntax breakage.

Comment thread src/WikidataLabel.py Outdated
SELECT COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = :schema_name
AND TABLE_NAME = 'labels'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f"""
SELECT COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = {DB_NAME} <--- HERE
AND TABLE_NAME = 'labels'
AND CONSTRAINT_NAME = 'PRIMARY'
ORDER BY ORDINAL_POSITION
"""

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bound parameters are safer than f-string because they handle SQL injection and syntax breakage.

Copy link
Copy Markdown

@exowanderer exowanderer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed review of another 5 files -- the smaller ones

Comment thread Dockerfile Outdated
"-k", "uvicorn.workers.UvicornWorker", \
"-w", "1", \
"-b", "0.0.0.0:5000"] No newline at end of file
CMD ["uv", "run", "gunicorn", "main:app", "-k", "uvicorn.workers.UvicornWorker", "-w", "2", "--bind", "0.0.0.0:5000", "--timeout", "30", "--graceful-timeout", "15", "--max-requests", "1000", "--max-requests-jitter", "100", "--keep-alive", "5", "--access-logfile", "-", "--error-logfile", "-"] No newline at end of file
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a trailing space after the las line

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment thread README.md

## Future Plan

- Replace Action API with GraphQL once Wikibase GraphQL is available for Wikibases:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may never "replace" the ActionAPI, but we will supplement some functionality with GraphQL in addition.

Comment thread src/WikidataLabel.py Outdated
text(
"""
SELECT COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the DB schema stored in this repo as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DB schema can be found under the WikidataLabel class

Comment thread src/WikidataLabel.py Outdated
]

if primary_key_cols == ["id"]:
connection.execute(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this only happens when a new DB is first instantiated.

I think we should log that on the server and possibly flag it during deployment

Comment thread src/WikidataLabel.py Outdated
missing_ids = set(ids) - set(labels.keys())
if missing_ids:
missing_labels = WikidataLabel._get_labels_wdapi(missing_ids)
missing_labels = WikidataLabel._get_labels_wdapi(missing_ids, wb_url=normalized_wb_url)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
missing_labels = WikidataLabel._get_labels_wdapi(missing_ids, wb_url=normalized_wb_url)
missing_labels = WikidataLabel._get_labels_wdapi(
missing_ids,
wb_url=normalized_wb_url
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line

Comment thread src/WikidataLabel.py Outdated
# Cache labels
WikidataLabel.add_bulk_labels(
[{"id": entity_id, "labels": entity_labels} for entity_id, entity_labels in missing_labels.items()]
[{"id": entity_id, "labels": entity_labels} for entity_id, entity_labels in missing_labels.items()],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[{"id": entity_id, "labels": entity_labels} for entity_id, entity_labels in missing_labels.items()],
[
{"id": entity_id, "labels": entity_labels}
for entity_id, entity_labels in missing_labels.items()
],

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line

Comment thread src/WikidataLabel.py Outdated
dict[str, dict]: Mapping of each ID to compressed labels.
"""
entities_data = get_wikidata_json_by_ids(ids, props="labels")
entities_data = get_wikidata_json_by_ids(ids, action_api_url=wb_url, props="labels")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
entities_data = get_wikidata_json_by_ids(ids, action_api_url=wb_url, props="labels")
entities_data = get_wikidata_json_by_ids(
ids,
action_api_url=wb_url,
props="labels"
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line

Comment thread tests/unit/test_wikidatalabel.py Outdated
calls.append((list(ids), wb_url))
return {"Q42": {"en": "Douglas Adams"}}

monkeypatch.setattr(WikidataLabel, "get_bulk_labels", staticmethod(fake_get_bulk_labels))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
monkeypatch.setattr(WikidataLabel, "get_bulk_labels", staticmethod(fake_get_bulk_labels))
monkeypatch.setattr(
WikidataLabel,
"get_bulk_labels",
staticmethod(fake_get_bulk_labels)
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line


monkeypatch.setattr(WikidataLabel, "get_bulk_labels", staticmethod(fake_get_bulk_labels))

factory = LazyLabelFactory(lang="en", fallback_lang="en", wb_url="https://example.wikibase.local/")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
factory = LazyLabelFactory(lang="en", fallback_lang="en", wb_url="https://example.wikibase.local/")
factory = LazyLabelFactory(
lang="en",
fallback_lang="en",
wb_url="https://example.wikibase.local/"
)

Comment thread main.py

qids = [q.strip() for q in id.split(",")]
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang)
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang, wb_url=action_api_url)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang, wb_url=action_api_url)
label_factory = LazyLabelFactory(
lang=lang,
fallback_lang=fallback_lang,
wb_url=action_api_url
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line

Comment thread main.py Outdated
entity_id=qids[0],
ttl_text=entity_data,
try:
entity_data = utils.get_wikidata_json_by_ids(qids, action_api_url=action_api_url)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
entity_data = utils.get_wikidata_json_by_ids(qids, action_api_url=action_api_url)
entity_data = utils.get_wikidata_json_by_ids(
qids,
action_api_url=action_api_url
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruff Lint's formatter will revert this back to one line

Comment thread main.py
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of changes that should have been single commits.

Could you please comment on each deletion section and addition section, "what changed and why"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Comment thread main.py
from fastapi.middleware.cors import CORSMiddleware

from src import utils
from src.Normalizer import JSONNormalizer, TTLNormalizer
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TTLNormalizer has been removed to support Wikibases.
TTLNormalizer is only capable of parsing Wikidata entities while JSONNormalizer supports all Wikibases with an Action API endpoint.

Comment thread main.py

entity_data = TTLNormalizer(
entity_id=qids[0],
ttl_text=entity_data,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous implementation, TTLNormalizer was used to parse requents of one Wikidata entity. This has been removed and replaced by the JSONNormalizer to support Wikibase entities as well.

Comment thread main.py
if entity
else None
for qid, entity in entity_data.items()
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the removed lines here are also based on the decision to remove TTLNormalizer and only use the JSONNormalizer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants