Wikidata talk:Lexicographical data

Lexicographical data

Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.

On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2026/01.

Multiple representations

Hi,

In the data model of Lexemes, it is possible to add multiple representations to one form. But (AFAIK) it has never been said when and how to use this feature.

Right now, there are ~15 million forms and only ~154 000 with more than one representation (so ~1%), and the feature is mostly used in a few languages (in descending order: Japanese (Q5287), Hebrew (Q9288), Punjabi (Q58635), Sumerian (Q36790), Hindustani (Q11051), together making up 2/3 of the total). The highest count is ਅੜਾਉਣੀ/اڑاوݨی (L744679) with 8 representations on one form, followed by òmo/oˊmo/oʼmo/ohmo (L1417242).

Also, on a more technical side, is there an identifier (or anything) that allows pointing to a specific representation? (like LX-FY for the form Y of lexeme X, or even the weird ID for statements like Lexeme:L1#L1$e30d2889-462b-19fa-1f77-1dd57ea722cb). The only way I know is to get all representations of a form and then filter by language (but it's more complicated and it requires knowing the language in advance).

Cheers, VIGNERON (talk) 08:21, 21 September 2025 (UTC)
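The workaround described in this comment (fetch all representations of a form, then filter by language) can be sketched as follows. This is only a minimal illustration, assuming the lexeme has already been loaded as a dictionary shaped like Wikibase lexeme JSON (a `forms` list whose entries carry an `id` and a `representations` map keyed by language code). The sample data, the lexeme ID `L0` and the helper name `get_representation` are hypothetical.

```python
from typing import Optional

# Hypothetical sample shaped like Wikibase lexeme JSON; IDs and values are illustrative only.
SAMPLE_LEXEME = {
    "id": "L0",
    "forms": [
        {
            "id": "L0-F1",
            "representations": {
                "en": {"language": "en", "value": "center"},
                "en-gb": {"language": "en-gb", "value": "centre"},
            },
        },
    ],
}


def get_representation(lexeme: dict, form_id: str, language: str) -> Optional[str]:
    """Return the representation of `form_id` in `language`, or None if absent."""
    for form in lexeme.get("forms", []):
        if form.get("id") == form_id:
            # Representations are keyed by language code, so the code must be known.
            rep = form.get("representations", {}).get(language)
            return rep["value"] if rep else None
    return None


print(get_representation(SAMPLE_LEXEME, "L0-F1", "en-gb"))
```

In this sketch the language code is the only key available for a representation, which is why the caller has to know it in advance.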

3 months later, no answer... @عُثمان: what do you think? Cheers, VIGNERON (talk) 08:49, 13 December 2025 (UTC)

@VIGNERON I intend to go back and clean up the Punjabi representations as they don’t need so many. For the Karina lexemes, that language is used in multiple countries each with a different spelling system. Ideally we would have codes for Guyana, Brazil, Venezuela, Suriname, and Trinidad & Tobago. عُثمان (talk) 11:25, 13 December 2025 (UTC)

...and French Guiana عُثمان (talk) 11:25, 13 December 2025 (UTC)

@عُثمان: my question is not so much "should we have these representations" but more "should these representations be on the same form or on different forms?" (I've seen both done, and there are pros and cons for both models; but ideally we need to choose, as it's bad to have 2 different ways to do the same thing).

About codes, there is always the possibility of using private codes, so technically we already have them. Or do you mean something else?

Regards, VIGNERON (talk) 11:54, 13 December 2025 (UTC)

Oh, in cases where the different spelling represents the same pronunciation (within the phonological framework of the language) and grammatical features, they should definitely be on the same form. Especially since languages with a lot of spelling variation also tend to be spoken more than written. عُثمان (talk) 11:56, 13 December 2025 (UTC)

@عُثمان: that makes sense, but is it documented anywhere? It also causes problems. The main one I see right now is that it's not possible to easily point to a specific representation in a form. If I say "The center" and "The centre", how do I state that it's representation one or two of L:L1473#F1? (is it even possible?) Meanwhile, I can easily point to L:L2127#F1, L:L2127#F2, etc. (which technically could all be in the same form). Plus, how do we deal with data concerning only one representation of the form? Also, a form can have a lot of representations (especially in spoken and/or less documented languages, which is most languages, but also in written/well documented languages; I have thousands of examples in French without even having to search). And finally, there is the code problem you mentioned... Cheers, VIGNERON (talk) 15:51, 13 December 2025 (UTC)

@VIGNERON Do we not have shorthand like en@en-gb? I agree these cases should be better documented. عُثمان (talk) 11:25, 14 December 2025 (UTC)

@عُثمان: yes, we have codes like en-gb, but we only have a few (166 according to https://w.wiki/GdTg) where we need millions of them (with private codes alone, there are already 1336 in use: https://w.wiki/GdUJ). And even if we magically had all the codes we need, it wouldn't solve all the other problems. We need to document, but we also need a clearer view of the best practices and model, I guess. Cheers, VIGNERON (talk) 12:04, 14 December 2025 (UTC)
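To make the distinction between regular codes such as en-gb and private codes concrete, here is a minimal sketch that splits a code into its base tag and its private part. It assumes the codes follow the BCP 47 convention in which an `x` singleton introduces private-use subtags; the example code `mis-x-Q36790` and the function name `split_private_use` are assumptions for illustration, not values taken from the queries linked above.

```python
from typing import Optional, Tuple


def split_private_use(code: str) -> Tuple[str, Optional[str]]:
    """Split a language code into (base tag, private-use part or None)."""
    lowered = code.lower()
    # A tag may consist solely of private-use subtags, e.g. "x-something".
    if lowered.startswith("x-"):
        return "", code[2:]
    index = lowered.find("-x-")
    if index == -1:
        # No private-use section: the whole code is the base tag.
        return code, None
    return code[:index], code[index + 3:]


print(split_private_use("en-gb"))
print(split_private_use("mis-x-Q36790"))
```

A split like this only tells the two kinds of code apart; it does not address the other modelling problems raised in this thread.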

How to express usage examples on smaller or dead languages?

I'm trying to put a usage example (P5831) on the verb ikó (L1520208) from the Tupi (Q56944) language. The problem is that the property uses monolingual text, which is limited to languages offered by the MediaWiki software. Is there an alternative for this so I could register some examples in this language? Luk3 (talk) 04:19, 15 October 2025 (UTC)

How to deal with forms whose written representation is unknown but not their pronunciation?

Hi. I am working on Lorrain (Q671198) and I'm importing Q106167910. I stumbled on an issue: different dialects use different words, and this dictionary usually accounts for this by stating all the known pronunciations on a "main" entry and having all the variants redirect to that entry. However, it sometimes "forgets" to provide the word and only specifies the pronunciation.

As an example, see Āchpac (L1508548) and the corresponding excerpt from the dictionary s:fr:Page:Zéliqzon - Dictionnaire des patois romans de la Moselle, œuvre complète, 1924.djvu/44:

Āchpac [ǟs̆pak.. S, ās̆pǫk V], n. p. — Aspach, vill. de l’arr. de Sarrebourg.

ǟs̆pak is the pronunciation for 'Āchpac'. ās̆pǫk is the pronunciation for '?'. Nowhere in the dictionary is the written form of ās̆pǫk given.

Having worked on this dictionary for years already, I can assume the latter is going to look like 'Āchpoc' when written, but I feel uncomfortable "guessing" without having a way to properly state that this form was "recreated" on a "best guess basis". Or if it is even a good practice to "recreate" these missing forms in the first place.

As such, on Āchpac (L1508548) and any lexemes where this occurred, I went ahead and created a '?' form with all the proper data. @Lepticed7 suggested instead to add a 'pronunciation' statement at the Lexeme level to state that no known form is associated to this pronunciation but it still exists nonetheless.

Is there a better way? Poslovitch (talk) 09:54, 19 October 2025 (UTC)

Would putting L1508548-F2 has characteristic (P1552) ‘written form deduced from pronunciation’ (Qxxx) be a good way to model this? Poslovitch (talk) 09:45, 14 December 2025 (UTC)

Unicode to use Wikidata as a lexicon datasource?

See this article on the Unicode blog: https://blog.unicode.org/2025/11/introducing-unicode-inflection-library.html for Unicode Inflexion Library (Q136796507).

It's a library published by the Unicode Consortium, meant to compute inflections of words, much as Wikifunctions is intended to do at a low level for language generation. They cite Wikidata as their source for the lexicon and the inflections. I don't know if it's well known here, hence this message. I'll add a word in the next Wikidata Weekly (if it has not been done in an earlier edition…) author TomT0m / talk page 12:31, 15 November 2025 (UTC)

Turns out it's been done: Wikidata:Status_updates/2025_11_10. I also added a mention of this on the project page of this talk page. author TomT0m / talk page 12:39, 15 November 2025 (UTC)

Very interesting, and I missed it too in the Status update, so thanks @TomT0m:. I'm not sure I fully understand the code on GitHub, but I see they mention Wikidata *a lot* and even reused some of our code (see https://github.com/unicode-org/inflection/blob/main/data/tools/wikidata_upload.py @Mahir256:). It's great to see that a group like Unicode acknowledges and cares about us. Cheers, VIGNERON (talk) 08:38, 13 December 2025 (UTC)

Ordia Games

Hi,

I don't think it has been announced here: Ordia now has games which reuse lexemes! Thanks a lot, Fnielsen!

It's great and fun, you should give it a try. It's good both ways: for people outside lexemes to discover and use them, but also for people working on lexemes to find things to improve or correct.

Cheers, VIGNERON (talk) 15:59, 13 December 2025 (UTC)

yay! a bit sad though that it doesn't support mis languages. Would have loved to play some games with Lorrain (Q671198)

Poslovitch (talk) 09:31, 14 December 2025 (UTC)

Tried the "gender" game with French and got diacre (L1373847) wrong: the expected answer was "feminine", I entered "masculine". The lexeme has both. Not sure how best to handle that case, both in the lexeme and in the game? author TomT0m / talk page 11:46, 14 December 2025 (UTC)

@TomT0m I have made a fix so double-gendered words no longer appear in the game. Let me hear if you still experience problems. — Finn Årup Nielsen (fnielsen) (talk) 22:29, 16 December 2025 (UTC)

Yes, I have noted the problem here: https://github.com/fnielsen/ordia/issues/291 It may take a while before I fix it. — Finn Årup Nielsen (fnielsen) (talk) 22:29, 16 December 2025 (UTC)
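One possible way to keep double-gendered words out of a guessing game is to retain only lexemes with exactly one grammatical gender. The sketch below is not Ordia's actual code; it assumes each lexeme has been reduced to a hypothetical record holding its ID, lemma and a list of gender labels, and the second sample entry is invented for illustration.

```python
# Hypothetical records; only the diacre entry reflects the case reported in this thread.
LEXEMES = [
    {"id": "L1373847", "lemma": "diacre", "genders": ["masculine", "feminine"]},
    {"id": "L0", "lemma": "maison", "genders": ["feminine"]},
]


def single_gender_only(lexemes):
    """Keep only lexemes that have exactly one distinct grammatical gender."""
    return [lex for lex in lexemes if len(set(lex["genders"])) == 1]


playable = single_gender_only(LEXEMES)
print([lex["lemma"] for lex in playable])
```

Filtering is the simplest option; a game could instead accept any of the recorded genders as a correct answer.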

@Poslovitch I managed to implement it. Lorrain is available on https://ordia.toolforge.org/guess-the-gender/ - at the bottom. Finn Årup Nielsen (fnielsen) (talk) 22:54, 16 December 2025 (UTC)

@Fnielsen I can't express how grateful I am. Thank you so so much! Poslovitch (talk) 23:57, 16 December 2025 (UTC)

Attach Lexicographical data chat to WikiProject Languages?

It seems the topic is very close to WikiProject Languages: languages and data about their words.

Should we attach the discussion on this talk page to this project?

I'm trying to work on Wikidata:WikiProject Languages/Writing systems and writing characters right now. I noticed there are not very many participants in this project, so we might share the discussions to benefit from each other? author TomT0m / talk page 12:40, 17 December 2025 (UTC)

It's close (lexicography obviously relies on languages) but different (for instance, the project may see different languages where lexeme editors decided to group them together). It depends on what you mean by "attach". I guess the link to WD:LD at the bottom is enough, no?

Notified participants of WikiProject Languages to get more points of view.

Cheers, VIGNERON (talk) 13:12, 31 December 2025 (UTC)

Request for comment on Notability policy reform

Hi,

For the record, there is currently a request for comment: Wikidata:Requests for comment/Notability policy reform. One question is about explicitly adding Wikidata:Lexicographical data/Notability to Wikidata:Notability.

Cheers, VIGNERON (talk) 13:22, 31 December 2025 (UTC)