Useful concepts from IRE
- Term frequency is a function of a term and a document, TF(t, d). Inverse document frequency is a function of a term and the whole dataset, IDF(t, D).
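To make the two signatures concrete, here is a minimal sketch. It assumes one common variant (raw counts for TF and log(N / df) for IDF) and represents each document as a list of tokens. The function names and the choice of variant are illustrative assumptions, not something fixed by these notes.

```python
import math
from collections import Counter


def tf(term, document):
    """TF(t, d): raw count of the term in one tokenized document."""
    return Counter(document)[term]


def idf(term, dataset):
    """IDF(t, D): log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for document in dataset if term in document)
    if df == 0:
        return 0.0  # assumption: a term that never occurs gets weight 0
    return math.log(len(dataset) / df)


def tf_idf(term, document, dataset):
    """Combine the per-document and per-dataset parts."""
    return tf(term, document) * idf(term, dataset)


dataset = [["delhi", "is", "a", "city"], ["india", "is", "a", "country"]]
print(tf_idf("delhi", dataset[0], dataset))
```

Note that TF only needs the single document, while IDF needs the whole dataset, which is exactly the difference in arguments noted above.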
Stanford Core NLP
When I ran the StanfordCoreNLP server on 10.4.17.63 and sent data to it from a different server, 10.4.17.64, through nltk.parse.corenlp, I got the error “requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: http://10.4.17.63:9000/…”.
Fix: I set the no_proxy variable in the ~/.bashrc file on 10.4.17.64 so that requests to the CoreNLP server bypass the proxy. The line reads: export no_proxy=10.4.17.63
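A quick way to check whether a given no_proxy value would cover the CoreNLP host is sketched below. It uses the standard library's `urllib.request.proxy_bypass_environment` purely as an illustration of the matching rule; the `requests` library also reads `no_proxy` from the environment, but its exact matching logic may differ in details. The helper name `bypasses_proxy` is hypothetical.

```python
from urllib.request import proxy_bypass_environment


def bypasses_proxy(host, no_proxy):
    """Return True if `host` (optionally host:port) matches the no_proxy list."""
    # Passing the proxies dict explicitly avoids touching the real environment.
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))


print(bypasses_proxy("10.4.17.63:9000", "10.4.17.63"))
print(bypasses_proxy("10.4.17.64:9000", "10.4.17.63"))
```

The first call covers the CoreNLP server from the notes, so it should bypass the proxy; the second host is not listed, so it should not.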
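The following sample code gets the parse tree for a sample text. Note that CoreNLPParser returns a constituency parse; for a dependency tree, use corenlp.CoreNLPDependencyParser instead.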
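    import nltk
    import nltk.parse.corenlp as corenlp

    # Connect to a CoreNLP server that is already running on port 9000.
    parser = corenlp.CoreNLPParser('http://localhost:9000')
    text = 'Let me try this.'
    # parse_text yields one tree per sentence.
    sample_op = list(parser.parse_text(text))
    sample_op[0].draw()  # opens a window showing the first tree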
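If dependency relations are what is needed, one option is to ask the CoreNLP server for them directly over HTTP. The sketch below is only an outline under assumptions: it assumes the server accepts a `properties` URL parameter with the `depparse` annotator and JSON output, and that each sentence in the reply carries a `basicDependencies` list with `dep`, `governorGloss` and `dependentGloss` keys. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_corenlp_url(base_url, annotators="tokenize,ssplit,pos,depparse"):
    """Build the server URL with the annotator list passed as JSON properties."""
    props = {"annotators": annotators, "outputFormat": "json"}
    return base_url.rstrip("/") + "/?" + urlencode({"properties": json.dumps(props)})


def dependency_triples(response):
    """Flatten the (assumed) JSON reply into (governor, relation, dependent) triples."""
    triples = []
    for sentence in response.get("sentences", []):
        for edge in sentence.get("basicDependencies", []):
            triples.append((edge["governorGloss"], edge["dep"], edge["dependentGloss"]))
    return triples


def parse_dependencies(text, base_url="http://localhost:9000"):
    """POST the raw text to a running server and return dependency triples."""
    request = Request(build_corenlp_url(base_url), data=text.encode("utf-8"))
    with urlopen(request) as reply:
        return dependency_triples(json.load(reply))
```

With a server running, `parse_dependencies('Let me try this.')` would return one triple per dependency edge.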
DBPedia Spotlight
I have set up a DBPedia Spotlight server locally and used it to ___annotate and disambiguate___ entities or phrases in my text. The following instructions describe the setup. Go to the GitHub page for DBPedia Spotlight and read the instructions (especially the "Run your own server" section). Start the server in a screen session using the command:
java -Xmx10g -jar dbpedia-spotlight-1.0.0.jar * en_2+2 http://localhost:2222/rest
Now you can call the server for three purposes: spot, disambiguate, or annotate (spot + disambiguate). More help is available on the project wiki.
A general use case: pass some text and get the DBPedia concepts annotated. The following command does that:
    curl http://localhost:2222/rest/annotate \
      --data-urlencode "text=Narendra Modi is the prime minister of India." \
      --data "confidence=0.2" \
      --data "support=20"
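The same annotate call can be made from Python. The following is a rough sketch using only the standard library. It assumes the local server from above is listening on port 2222, and that with `Accept: application/json` the reply contains a `Resources` list whose items carry `@surfaceForm` and `@URI` keys. That response shape and the helper names are assumptions, so check them against an actual reply.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_annotate_request(text, confidence=0.2, support=20,
                           endpoint="http://localhost:2222/rest/annotate"):
    """Build a form-encoded POST request equivalent to the curl command."""
    body = urlencode({"text": text, "confidence": confidence, "support": support})
    return Request(endpoint, data=body.encode("utf-8"),
                   headers={"Accept": "application/json"})


def extract_uris(response):
    """Pull (surface form, DBPedia URI) pairs out of the (assumed) JSON reply."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]


def annotate(text, **kwargs):
    """Send the request to a running Spotlight server and return the linked concepts."""
    with urlopen(build_annotate_request(text, **kwargs)) as reply:
        return extract_uris(json.load(reply))
```

Because `urlencode` handles the escaping, quotes and other special characters in the text need no manual treatment here.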
My specific use case: I already had the concepts marked in my text and just wanted to link them to DBPedia. In that case, I used the following command:
    curl http://localhost:2222/rest/disambiguate \
      -H "Accept: text/xml" \
      --data 'text=<annotation text="New Delhi is the capital of India"><surfaceForm name="New Delhi" offset="0"></surfaceForm></annotation>'
- Important note: for the above command to work, any single quote (') inside the text must be escaped, because the --data argument is itself wrapped in single quotes.
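One way to sidestep the shell-quoting issue is to build the annotation XML in Python and let the libraries do the escaping. The sketch below is illustrative only: it assumes that `offset` is the character position of the first occurrence of each surface form (consistent with `offset="0"` in the command above), and `build_annotation_xml` and `build_disambiguate_body` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_annotation_xml(text, surface_forms):
    """Wrap the text and its pre-marked surface forms in the annotation XML."""
    root = ET.Element("annotation", {"text": text})
    for name in surface_forms:
        offset = text.find(name)  # assumption: first occurrence, character offset
        if offset < 0:
            raise ValueError(f"surface form not found in text: {name!r}")
        ET.SubElement(root, "surfaceForm", {"name": name, "offset": str(offset)})
    return ET.tostring(root, encoding="unicode")


def build_disambiguate_body(text, surface_forms):
    """Form-encode the XML so it can be POSTed to /rest/disambiguate."""
    xml = build_annotation_xml(text, surface_forms)
    return urlencode({"text": xml}).encode("utf-8")


print(build_annotation_xml("New Delhi is India's capital", ["New Delhi", "India"]))
```

The resulting bytes can be sent with `urllib.request.Request("http://localhost:2222/rest/disambiguate", data=..., headers={"Accept": "text/xml"})`, which mirrors the curl command without any manual escaping of single quotes.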
NLTK
NLTK can be used for many NLP-related tasks. I have used the Stanford CoreNLP server to obtain parse trees as NLTK Tree objects. The following code renders such a parse tree and saves it to a PostScript file (tree.ps):
    from nltk.tree import Tree
    from nltk.draw.tree import TreeView
    from nltk.draw import TreeWidget
    from nltk.draw.util import CanvasFrame

    t = Tree.fromstring('(S (NP this tree) (VP (V is) (AdjP pretty)))')
    cf = CanvasFrame()
    tc = TreeWidget(cf.canvas(), t)
    tc['node_font'] = 'arial 14 bold'
    tc['leaf_font'] = 'arial 14'
    tc['node_color'] = '#005990'
    tc['leaf_color'] = '#3F8F57'
    tc['line_color'] = '#175252'
    cf.add_widget(tc, 10, 10)  # (10,10) offsets
    cf.print_to_file('tree.ps')
    cf.destroy()