Querying a NifVector graph - Dutch#

Introduction#

import os, sys, logging

# log progress messages to stdout
logging.basicConfig(stream=sys.stdout,
                    format='%(asctime)s %(message)s',
                    level=logging.INFO)

Querying the NifVector graph based on DBpedia#

These are results of a NifVector graph created from 100,000 Dutch DBpedia pages. We defined the context of a phrase in its simplest form: the tuple of the preceding multiword and the following multiword (no preprocessing, no changes to the text, i.e. no deletion of stopwords or punctuation). The maximum phrase length is five words, and the maximum left and right context length is also five words.
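
A rough, hypothetical sketch (not the actual nifigator implementation; the function and the toy sentence are our own) of how such phrase and context combinations could be enumerated from a tokenized sentence:

from itertools import product

def enumerate_contexts(tokens, max_phrase_len=5, max_context_len=5):
    """Yield (phrase, (left context, right context)) combinations."""
    for start in range(len(tokens)):
        for phrase_len in range(1, max_phrase_len + 1):
            end = start + phrase_len
            if end > len(tokens):
                break
            phrase = " ".join(tokens[start:end])
            for left_len, right_len in product(range(1, max_context_len + 1), repeat=2):
                # skip combinations that run over the sentence boundaries
                if start - left_len < 0 or end + right_len > len(tokens):
                    continue
                left = " ".join(tokens[start - left_len:start])
                right = " ".join(tokens[end:end + right_len])
                yield phrase, (left, right)

tokens = "SENTSTART Het beeld is gemaakt door een kunstenaar SENTEND".split()
print([c for p, c in enumerate_contexts(tokens) if p == "is gemaakt"][:3])
# [('beeld', 'door'), ('beeld', 'door een'), ('beeld', 'door een kunstenaar')]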

from rdflib import URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

from nifigator import NifVectorGraph

database_url = 'http://localhost:3030/dbpedia_nl'
identifier = URIRef("https://mangosaurus.eu/dbpedia")

# Connect to triplestore
store = SPARQLUpdateStore(
    query_endpoint = database_url+'/sparql',
    update_endpoint = database_url+'/update'
)
# Create NifVectorGraph with this store
g = NifVectorGraph(
    store=store, 
    identifier=identifier
)
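
As a quick sanity check (an optional sketch, assuming the Fuseki endpoint above is running and that the graph can be queried like a regular rdflib Graph), a plain SPARQL query can be run against the store:

# optional sanity check: fetch a few triples from the store
q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3"
for row in g.query(q):
    print(row)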

Most frequent contexts of a phrase#

The ten most frequent contexts in which the phrase ‘is gemaakt’ occurs, with their number of occurrences, are the following:

# most frequent contexts of the phrase "is gemaakt"
g.phrase_contexts("is gemaakt", topn=10)

This results in

Counter({('gebruik', 'van'): 15,
         ('Het', 'door'): 12,
         ('SENTSTART Het', 'door'): 11,
         ('beeld', 'door'): 7,
         ('tekst', 'door'): 6,
         ('De tekst', 'door'): 5,
         ('SENTSTART De tekst', 'door'): 5,
         ('ei', 'van'): 5,
         ('muziek', 'door'): 5,
         ('en', 'door'): 3})

SENTSTART and SENTEND are tokens that indicate the start and the end of a sentence.

Phrase and context frequencies#

The contexts in which a phrase occurs represent to some extent its properties and its meaning. If you derive the phrases that share the eight most frequent contexts of the phrase ‘is gemaakt’, you get the following table (the columns contain the contexts, the rows the phrases that have the most contexts in common):

import pandas as pd

pd.DataFrame.from_dict(
    g.dict_phrases_contexts("is gemaakt", topcontexts=8), orient='tight'
)

This results in:

                 gebruik  Het   SENTSTART Het  beeld  tekst  De tekst  SENTSTART De tekst  ei
                 van      door  door           door   door   door      door                van
is gemaakt       15       12    11             7      6      5         5                   5
is geschreven    0        79    78             0      22     12        11                  0
werd geschreven  0        19    19             0      14     10        10                  0
is               0        47    47             2      2      0         0                   0
werd gemaakt     53       2     2              2      0      0         0                   0
was              2        10    9              0      0      0         0                   0
werd             0        83    82             2      0      0         0                   0

Phrase similarities#

Based on the approach above, we can derive the phrases that are most similar to a given phrase.

# top phrase similarities of the phrase "is gemaakt"
g.most_similar("is gemaakt", topn=10, topcontexts=15)

This results in

{'is gemaakt': (15, 15),
 'is geschreven': (9, 15),
 'werd': (9, 15),
 'is': (8, 15),
 'werd geschreven': (7, 15),
 'wordt': (7, 15),
 'was': (6, 15),
 'werd gemaakt': (6, 15),
 'wordt gemaakt': (5, 15),
 'gemaakt': (4, 15)}
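
The values returned are tuples of (number of shared contexts, number of contexts considered). If a single score is more convenient, they can be normalized; a minimal sketch:

# normalize the (shared, total) tuples into scores between 0 and 1
similar = g.most_similar("is gemaakt", topn=10, topcontexts=15)
scores = {phrase: shared / total for phrase, (shared, total) in similar.items()}
print(scores)
# e.g. {'is gemaakt': 1.0, 'is geschreven': 0.6, ...}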

Now take a look at words similar to ‘groter’ (larger).

# top phrase similarities of the word "groter" (larger)
g.most_similar("groter", topn=10, topcontexts=15)

Resulting in:

{'groter': (15, 15),
 'kleiner': (14, 15),
 'breder': (11, 15),
 'hoger': (11, 15),
 'lager': (11, 15),
 'beter': (10, 15),
 'langer': (10, 15),
 'meer': (10, 15),
 'sneller': (10, 15),
 'minder': (9, 15)}

Similarly, take a look at the phrases most similar to ‘koning’ (king):

# top phrase similarities of the word "koning" (king)
g.most_similar("koning", topn=10, topcontexts=25)

This results in

{'koning': (25, 25),
 'hertog': (22, 25),
 'keizer': (22, 25),
 'vorst': (21, 25),
 'prins': (20, 25),
 'graaf': (19, 25),
 'koningin': (18, 25),
 'Koning': (17, 25),
 'bisschop': (17, 25),
 'groothertog': (17, 25)}

Instead of single words, we can also find the similarities of multiword phrases:

# top phrase similarities of Willem Alexander (King of the Netherlands)
g.most_similar("Willem Alexander", topn=10, topcontexts=15)
{'Willem Alexander': (15, 15),
 'Filip': (8, 15),
 'Boudewijn': (7, 15),
 'Willem I': (7, 15),
 'Willem III': (7, 15),
 'Albert II': (6, 15),
 'George III': (6, 15),
 'Maximiliaan I Jozef van Beieren': (6, 15),
 'Willem II': (6, 15),
 'Christiaan IX van Denemarken': (5, 15)}

Most frequent phrases of a context#

Here are some examples of the most frequent phrases of a context.

context = ("koning", "van Engeland")
for r in g.context_phrases(context, topn=10).items():
    print(r)

This results in:

('Eduard III', 21)
('Karel II', 21)
('Karel I', 17)
('Eduard I', 15)
('Hendrik VIII', 13)
('Jacobus II', 13)
('Eduard IV', 11)
('Hendrik VII', 9)
('Jacobus I', 9)
('Hendrik II', 8)
context = ("de", "stad")
for r in g.context_phrases(context, topn=10).items():
    print(r)
('grootste', 493)
('oude', 228)
('gelijknamige', 178)
('tweede', 120)
('Duitse', 116)
('Nederlandse', 116)
('belangrijkste', 105)
('huidige', 99)
('hele', 97)
('nieuwe', 91)
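
Note that phrase_contexts and context_phrases look at the same co-occurrence data from opposite sides. As a small cross-check (a sketch using the calls shown above), the top phrase found for the context ("de", "stad") should list that context among its own contexts:

# does the phrase "grootste" have ("de", "stad") among its contexts?
print(("de", "stad") in g.phrase_contexts("grootste", topn=None))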

Phrase similarities given a specific context#

Some phrases have multiple meanings. Take a look at the contexts of the word ‘middel’:

g.phrase_contexts("middel", topn=10)

This results in:

Counter({('door', 'van'): 4956,
         ('door', 'van een'): 1516,
         ('door', 'van de'): 406,
         ('SENTSTART Door', 'van'): 398,
         ('Door', 'van'): 390,
         ('door', 'van het'): 264,
         ('worden door', 'van'): 142,
         ('een', 'om'): 135,
         ('die door', 'van'): 117,
         ('Door', 'van een'): 102})

It is possible to take a specific context into account when using the most_similar function:

g.most_similar(phrase="middel", context=("door", "van"), topcontexts=50, topphrases=15, topn=10)

The result is:

{'middel': (50, 50),
 'toedoen': (21, 50),
 'gebruik te maken': (19, 50),
 'toepassing': (15, 50),
 'gebruik': (11, 50),
 'leden': (10, 50),
 'toevoeging': (7, 50),
 'die': (6, 50),
 'doelpunten': (4, 50),
 'samenvoeging': (3, 50)}
g.most_similar(phrase="middel", context=("een", "dat"), topcontexts=50, topphrases=15, topn=10)

In this case the result is:

{'systeem': (5, 5),
 'apparaat': (4, 5),
 'programma': (4, 5),
 'proces': (3, 5),
 'teken': (3, 5),
 'bedrijf': (2, 5),
 'gebied': (2, 5),
 'tijd': (2, 5),
 'woord': (2, 5),
 'boek': (1, 5)}
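
The two context-dependent results above can also be retrieved side by side with a small convenience function (a sketch; the helper below is our own and not part of nifigator):

def compare_senses(graph, phrase, context_a, context_b, topn=10):
    """Return the most similar phrases of a phrase under two different contexts."""
    a = graph.most_similar(phrase=phrase, context=context_a,
                           topcontexts=50, topphrases=15, topn=topn)
    b = graph.most_similar(phrase=phrase, context=context_b,
                           topcontexts=50, topphrases=15, topn=topn)
    return a, b

sense_1, sense_2 = compare_senses(g, "middel", ("door", "van"), ("een", "dat"))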

Phrase similarities given a set of contexts#

If you want to find the phrases that fit a set of contexts, this is also possible. Below we take the 15 most frequent contexts that the phrases ‘dacht’ and ‘vond’ have in common:

c1 = [
        c[0] for c in (
            g.phrase_contexts("dacht", topn=None) &
            g.phrase_contexts("vond", topn=None)
         ).most_common(15)
]
c1

This results in:

[('hij', 'dat'),
 ('Hij', 'dat'),
 ('SENTSTART Hij', 'dat'),
 ('omdat hij', 'dat'),
 ('en', 'dat'),
 ('men', 'dat'),
 ('hij', 'dat de'),
 ('die', 'dat'),
 ('hij', 'dat hij'),
 ('dat hij', 'dat'),
 ('omdat men', 'dat'),
 ('hij', 'dat het'),
 ('omdat hij', 'dat de'),
 ('Hij', 'dat het'),
 ('SENTSTART Hij', 'dat het')]

These shared contexts can then be passed to most_similar:

g.most_similar(contexts=c1, topn=10)

Resulting in:

{'dacht': (15, 15),
 'meende': (15, 15),
 'vond': (15, 15),
 'stelde': (12, 15),
 'zei': (12, 15),
 'beweerde': (11, 15),
 'denkt': (11, 15),
 'geloofde': (11, 15),
 'vindt': (11, 15),
 'wist': (11, 15)}
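
The same intersection-of-contexts approach can be wrapped in a small helper that works for any number of phrases (a sketch; the function name and defaults are our own, not part of nifigator):

def similar_to_all(graph, phrases, topcontexts=15, topn=10):
    """Find phrases that fit the most frequent contexts shared by all given phrases."""
    shared = None
    for phrase in phrases:
        contexts = graph.phrase_contexts(phrase, topn=None)
        shared = contexts if shared is None else shared & contexts
    top_shared = [c for c, _ in shared.most_common(topcontexts)]
    return graph.most_similar(contexts=top_shared, topn=topn)

similar_to_all(g, ["dacht", "vond"])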