Querying a NifVector graph - English#

Introduction#

A NifVector graph is a network graph of multiword expressions (phrases) and the contexts in which they occur, derived from a document set (here in NLP Interchange Format, NIF). It can be used as if it were a language model.

The main difference between a NifVector graph and traditional word vector embeddings is that no dimensionality reduction is applied and no model with real-valued vector embeddings is created. Instead, the NifVectors are derived directly from the document set itself, without any transformation. Furthermore, the graph structure makes it possible to differentiate phrases depending on the context in which they occur. Because no dimensionality reduction is applied, the NifVector graph produces explainable results without any randomness. It also enables explicit links between the phrase and context vector representations and lexical and linguistic annotations.

This notebook shows how to query the NifVector graph and extract phrase and context similarities.

import os, sys, logging
logging.basicConfig(stream=sys.stdout, 
                    format='%(asctime)s %(message)s',
                    level=logging.INFO)

Simple NifVector graph example to introduce the idea#

Let’s set up a NifVector graph from a context with two sentences.

# The NifContext contains a context which uses a URI scheme
from nifigator import NifGraph, NifContext, OffsetBasedString, NifContextCollection

# Make a context by passing uri, uri scheme and string
context = NifContext(
  uri="https://mangosaurus.eu/rdf-data/doc_1",
  URIScheme=OffsetBasedString,
  isString="We went to the small park to walk.\n Yesterday, we went to the city to shop."
)
context.extract_sentences()

# Make a Nif context collection
collection = NifContextCollection(uri="https://mangosaurus.eu/rdf-data")
collection.add_context(context)
nif_graph = NifGraph(collection=collection)

Then we create a NifVectorGraph from this data.

from nifigator import NifVectorGraph

# set up the params of the NifVector graph
params = {
    "min_phrase_count": 1, 
    "min_context_count": 1,
    "min_phrasecontext_count": 1,
    "max_phrase_length": 3,
    "max_context_length": 3,
}

# the NifVector graph can be created from the NifGraph made above
g = NifVectorGraph(
    nif_graph=nif_graph, 
    params=params
)

The contexts of the phrase ‘small park’ are found in this way.

phrase = "small park"
g.phrase_contexts(phrase)

Resulting in the following contexts:

Counter({('to the', 'to'): 1,
         ('the', 'to walk'): 1,
         ('to the', 'to walk'): 1,
         ('went to the', 'to'): 1,
         ('the', 'to'): 1})

So the phrase ‘small park’ occurs once in the context (‘to the’, ‘to’).

You see that, contrary to Word2Vec models, a NifVector graph explicitly stores the contexts in which phrases occur.
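To make this concrete, here is a minimal sketch of how the (left, right) context tuples of a phrase can be enumerated from a tokenized sentence. This is our own illustration, not the nifigator implementation, which also handles sentence boundaries and punctuation:

def enumerate_contexts(tokens, phrase_tokens, max_context_length=3):
    # sketch: collect all (left, right) context tuples of a phrase
    contexts = []
    n, m = len(tokens), len(phrase_tokens)
    for i in range(n - m + 1):
        if tokens[i:i + m] == phrase_tokens:
            for left in range(1, max_context_length + 1):
                for right in range(1, max_context_length + 1):
                    if i - left >= 0 and i + m + right <= n:
                        contexts.append((
                            " ".join(tokens[i - left:i]),
                            " ".join(tokens[i + m:i + m + right]),
                        ))
    return contexts

tokens = "We went to the small park to walk".split()
enumerate_contexts(tokens, ["small", "park"])
# includes ('to the', 'to'), ('the', 'to walk'), ('went to the', 'to'), ...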

The phrases that occur within this context can be found in the following way:

g.context_phrases(('to the', 'to'))

Now we can find the phrases most similar to the word ‘city’.

phrase = "city"
g.most_similar(phrase)

This results in:

{'city': (5, 5), 'small park': (3, 5)}

The phrase ‘small park’ shares three of the five contexts of ‘city’.

Similarly, the most similar phrases to ‘walk’ are:

phrase = "walk"
g.most_similar(phrase)

This results in:

{'walk': (2, 2), 'shop': (1, 2)}

The most similar phrases are derived from the contexts in which the phrases occur. You see that, given the two sentences above, the phrases most similar to ‘city’ are nouns (‘city’ and ‘small park’), and likewise the phrases most similar to ‘walk’ are verbs (‘walk’ and ‘shop’).
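The similarity measure behind these results can be sketched as follows; this is an assumption based on the outputs above, not the actual nifigator source. Take the top-n contexts of the query phrase and count how many of them also occur as contexts of the candidate phrase:

def similarity(contexts_a, contexts_b, topn=5):
    # how many of the topn most frequent contexts of phrase a
    # also occur as contexts of phrase b
    top = [c for c, _ in contexts_a.most_common(topn)]
    shared = sum(1 for c in top if c in contexts_b)
    return (shared, len(top))

similarity(
    g.phrase_contexts("city", topn=None),
    g.phrase_contexts("small park", topn=None),
)
# should be comparable to the (3, 5) reported above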

Querying the NifVector graph based on DBpedia#

These are results of a NifVector graph created from 10,000 DBpedia pages. We defined the context of a word in its simplest form: the tuple of the preceding multiword and the following multiword (no preprocessing, no changes to the text, i.e. no deletion of stop words or punctuation). The maximum phrase length is five words; the maximum left and right context length is also five words.
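In terms of the params dict used earlier, this description corresponds to settings like the following (an illustrative reconstruction; the exact parameters used to build the graph are not shown here):

# parameters implied by the description above (illustrative reconstruction)
params = {
    "max_phrase_length": 5,
    "max_context_length": 5,
}

The queries below connect to a SPARQL endpoint that already contains this graph.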

from rdflib import URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore
from nifigator import NifVectorGraph

database_url = 'http://localhost:3030/dbpedia_en'
identifier = URIRef("https://mangosaurus.eu/dbpedia")

# Connect to the triplestore
store = SPARQLUpdateStore(
    query_endpoint=database_url+'/sparql',
    update_endpoint=database_url+'/update'
)
# Create a NifVectorGraph backed by this store
g = NifVectorGraph(
    store=store,
    identifier=identifier
)

Most frequent contexts of a phrase#

The ten most frequent contexts in which the word ‘has’ occurs, with their number of occurrences, are the following:

# most frequent contexts of the word "has"
g.phrase_contexts("has", topn=10)

This results in:

Counter({('It', 'been'): 1031,
         ('SENTSTART It', 'been'): 951,
         ('it', 'been'): 836,
         ('and', 'been'): 642,
         ('which', 'been'): 521,
         ('also', 'a'): 436,
         ('there', 'been'): 420,
         ('and', 'a'): 418,
         ('that', 'been'): 339,
         ('it', 'a'): 270})

This means that the corpus contains 1031 occurrences of ‘It has been’, i.e. occurrences where the word ‘has’ occurred in the context (‘It’, ‘been’).

SENTSTART and SENTEND are tokens that indicate the start and end of a sentence. The context (‘SENTSTART It’, ‘been’), for example, matches occurrences of ‘It has been’ at the start of a sentence.

Phrase and context frequencies#

The contexts in which a word occurs represent, to some extent, the properties and meaning of that word. If you derive the phrases that share the most frequent contexts of the word ‘has’, you get the following table (the columns contain the contexts, the rows the phrases that have the most contexts in common):

import pandas as pd

pd.DataFrame.from_dict(
    g.dict_phrases_contexts("has", topcontexts=8), orient='tight'
)

This results in:

               It     it     SENTSTART It   and    which   also   there   and
               been   been   been           been   been    a      been    a

has            1031   1021   954            642    521     436    420     418
had            71     402    53             169    886     266    171     336
would have     14     75     10             9      53      2      15      2
may have       26     82     26             37     61      0      33      2
could have     2      24     2              2      7       0      4       0
has also       149    60     180            12     5       0      11      0
has always     2      3      2              2      3       0      2       0

The number of contexts that a phrase has in common with the contexts of another phrase can be used as a measure of similarity. The word ‘had’ (second row) occurs in all eight contexts of ‘has’, so this word is very similar. The phrase ‘may have’ (fourth row) occurs in seven of the eight contexts, so it is also similar, but less similar than ‘had’. We used a small number of contexts to show the idea; normally a larger number of contexts is used to compare the similarity of phrases.
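As a quick check, these counts of shared contexts can be read off the table directly by counting the nonzero cells per row:

df = pd.DataFrame.from_dict(
    g.dict_phrases_contexts("has", topcontexts=8), orient='tight'
)
# number of the eight contexts each phrase has in common with 'has'
(df > 0).sum(axis=1)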

The word similarities found can in this case be explained as follows. The similar phrases are forms of the verb ‘have’. This is because the verb is often used in the construction of perfect tenses, where ‘have’ is combined with the past participle of another verb, in this case the frequently occurring ‘been’. Note that the table also contains combinations with adverbs, like ‘has also’ and ‘has always’.

Phrase similarities#

Based on the approach above we can derive top phrase similarities.

# top phrase similarities of the word "has"
g.most_similar("has", topn=10, topcontexts=15)

This results in:

{'had': (15, 15),
 'has': (15, 15),
 'would have': (12, 15),
 'have': (10, 15),
 'may have': (10, 15),
 'could have': (9, 15),
 'has never': (8, 15),
 'has not': (8, 15),
 'also has': (7, 15),
 'had long': (7, 15)}

Now take a look at the words similar to ‘larger’.

# top phrase similarities of the word "larger"
g.most_similar("larger", topn=10, topcontexts=15)

Resulting in:

{'larger': (15, 15),
 'smaller': (14, 15),
 'greater': (13, 15),
 'higher': (12, 15),
 'longer': (11, 15),
 'better': (10, 15),
 'faster': (10, 15),
 'less': (10, 15),
 'lower': (10, 15),
 'shorter': (10, 15)}

Like the word ‘larger’, these are all comparative adjectives. These words are similar because they share their most frequent contexts, in this case contexts like (‘is’, ‘than’) and (‘much’, ‘than’).
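These shared contexts can be inspected directly by intersecting the context Counters of two phrases (the same pattern is used later in this notebook):

# contexts that 'larger' and 'smaller' have in common, with minimum counts
(
    g.phrase_contexts("larger", topn=None) &
    g.phrase_contexts("smaller", topn=None)
).most_common(5)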

# top phrase similarities of the word "might"
g.most_similar("might", topn=10, topcontexts=25)
{'could': (25, 25),
 'may': (25, 25),
 'might': (25, 25),
 'should': (25, 25),
 'would': (25, 25),
 'must': (24, 25),
 'would not': (23, 25),
 'could not': (22, 25),
 'will': (21, 25),
 'can': (20, 25)}

The most frequent coinciding contexts are in this case (‘it’, ‘be’), (‘he’, ‘have’) and (‘that’, ‘be’).

Contexts can also be used to find ‘semantic’ similarities.

# top phrase similarities of the word "King"
g.most_similar("king", topn=10, topcontexts=25)

This results in:

{'King': (15, 15),
 'President': (8, 15),
 'Queen': (8, 15),
 'king': (8, 15),
 'Emperor': (7, 15),
 'Kingdom': (7, 15),
 'Prince': (7, 15),
 'enemies': (7, 15),
 'kings': (7, 15),
 'president': (7, 15)}

Instead of single words, we can also find the similarities of multiword phrases:

# top phrase similarities of Barack Obama
g.most_similar("Barack Obama", topn=10, topcontexts=15)
{'Barack Obama': (15, 15),
 'Bill Clinton': (5, 15),
 'Ronald Reagan': (5, 15),
 'Franklin D Roosevelt': (4, 15),
 'George W Bush': (4, 15),
 'Richard Nixon': (4, 15),
 'Bush': (3, 15),
 'Dwight D Eisenhower': (3, 15),
 'George H W Bush': (3, 15)}

Most frequent phrases of a context#

Here are some examples of the most frequent phrases of a context.

context = ("King", "of England")
for r in g.context_phrases(context, topn=10).items():
    print(r)
('Henry VIII', 15)
('Charles II', 12)
('John', 12)
('Henry III', 8)
('James I', 8)
('Edward I', 7)
('Edward III', 6)
('Charles I', 5)
('Henry VII', 5)
('Henry II', 4)
context = ("the", "city")
for r in g.context_phrases(context, topn=10).items():
    print(r)
('capital', 141)
('largest', 140)
('old', 55)
('inner', 52)
('first', 48)
('second largest', 44)
('ancient', 43)
('most populous', 39)
('Greek', 37)
('port', 31)
context = ("he", "that")
for r in g.context_phrases(context, topn=10).items():
    print(r)

Phrase similarities given a specific context#

Some phrases have multiple meanings. Take a look at the contexts of the word ‘deal’:

g.phrase_contexts("deal", topn=10)

This results in:

Counter({('to', 'with'): 487,
         ('great', 'of'): 348,
         ('a great', 'of'): 326,
         ('to', 'with the'): 165,
         ('a', 'with'): 84,
         ('a good', 'of'): 36,
         ('had to', 'with'): 35,
         ('good', 'of'): 28,
         ('SENTSTART The', 'was'): 25,
         ('The', 'was'): 25})

In some of these contexts ‘deal’ is a verb meaning ‘to do business’ and in other contexts ‘deal’ is a noun meaning a ‘contract’ or an ‘agreement’. The specific meaning can be derived from the context in which the phrase is used.
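One way to think about how such a context-aware search can work is the following sketch (our own assumption for illustration, not the actual nifigator implementation): restrict the candidates to phrases that occur in the given context, then rank them by the number of top contexts they share with the query phrase.

def most_similar_in_context(g, phrase, context, topcontexts=50, topphrases=15):
    # hypothetical helper: candidate phrases are restricted to those
    # occurring in the given context
    candidates = g.context_phrases(context, topn=topphrases)
    # the most frequent contexts of the query phrase
    top = set(g.phrase_contexts(phrase, topn=topcontexts))
    # count how many of these contexts each candidate also occurs in
    return {
        p: (len(top & set(g.phrase_contexts(p, topn=None))), len(top))
        for p in candidates
    }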

It is possible to take into account a specific context when using the most_similar function in the following way:

g.most_similar(phrase="deal", context=("to", "with"), topcontexts=50, topphrases=15, topn=10)

The result is:

{'deal': (50, 50),
 'work': (20, 50),
 'comply': (9, 50),
 'compete': (7, 50),
 'cope': (7, 50),
 'interact': (7, 50),
 'coincide': (6, 50),
 'communicate': (6, 50),
 'do': (6, 50),
 'help': (6, 50)}

So these are all verbs, similar to the verb ‘deal’.

g.most_similar(phrase="deal", context=("a", "with"), topcontexts=100, topphrases=15, topn=10)

In this case the result is:

{'deal': (50, 50),
 'contract': (12, 50),
 'treaty': (12, 50),
 'meeting': (11, 50),
 'relationship': (11, 50),
 'man': (10, 50),
 'dispute': (8, 50),
 'partnership': (8, 50),
 'person': (8, 50),
 'coalition': (7, 50)}

So now the results are nouns, similar to the noun ‘deal’.

Phrase similarities given a set of contexts#

If you want to find the phrases that fit a set of contexts, this is also possible.

# derive the 15 contexts that 'considered' and 'believed' have most in common
c1 = [
    c[0] for c in (
        g.phrase_contexts("considered", topn=None) &
        g.phrase_contexts("believed", topn=None)
    ).most_common(15)
]

This results in:

[('is', 'to'),
 ('is', 'to be'),
 ('are', 'to'),
 ('was', 'to'),
 ('are', 'to be'),
 ('is', 'to have'),
 ('is', 'to be the'),
 ('was', 'to be'),
 ('were', 'to'),
 ('generally', 'to'),
 ('are', 'to have'),
 ('is', 'to be a'),
 ('is', 'by'),
 ('widely', 'to'),
 ('he', 'to')]

These contexts can then be passed to most_similar:

g.most_similar(contexts=c1, topn=10)

Resulting in:

{'believed': (15, 15),
 'considered': (15, 15),
 'thought': (14, 15),
 'expected': (13, 15),
 'known': (13, 15),
 'reported': (13, 15),
 'said': (13, 15),
 'assumed': (12, 15),
 'claimed': (12, 15),
 'held': (12, 15)}