Programming A Driverless Car Chapter 7: Natural Language Processing

What is Natural Language Processing?
-The process of computer analysis of input provided in a human language (natural language), and conversion of this input into a useful form of representation.
-The field of NLP is primarily concerned with getting computers to perform useful and interesting tasks with human languages.
-The field of NLP is secondarily concerned with helping us come to a better understanding of human language.
Language can refer both to the languages we speak and to artificial languages such as Python and Java.
Unlike humans, computers have little built-in understanding of languages and their grammar. A human can tell at once that “I am going to school” is grammatically correct and “I goes to school” is not, and which one makes more sense. Computers are not good at this by default. Getting computers to handle human language enables many real-world applications:
-Automatic Text Summarization
-Sentiment analysis
-Topic Extraction
-Named Entity Recognition
-Stemming
-Relation extraction
-Tagging parts of speech
NLP is generally used for text mining, machine translation and automated question answering. It is becoming more popular as we move towards AI, and machine translation in particular is in high demand.
Forms of Natural Language:

  • The input/output of an NLP system can be: Written Text, Speech
  • To process written text, we need: Lexical, syntactic, semantic knowledge about the language and discourse information, real world knowledge
  • To process spoken language, we need everything required to process written text, plus the challenges of speech recognition and speech analysis.

We prefer communicating in text primarily because we read, think and listen in the form of words. So, to process data on the web we need lexical knowledge (the meaning of each word), syntactic knowledge (the role each word plays in a sentence) and semantic knowledge (the meaning derived from the whole sentence). The algorithm that we build needs to understand the content, not only the words. For example, the sentiment of a statement must be understood.
Components of Natural Language Processing:
Why is NLP hard?
-Natural language is extremely rich in form and structure. One word can have several meanings and one meaning can be expressed by several words, which makes it difficult to understand.
-One input can mean many different things. Ambiguity can be at different levels.
-Many inputs can mean the same thing.
-Interaction among components of the input is not clear. For example, compare “Book me a flight from NYC” with “I was reading a book in the flight from NYC”. The meanings are different, but the words used are largely the same.
Knowledge of Language–

  1. Phonology: concerns how words are related to the sounds that realize them.
  2. Morphology: concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language.
  3. Syntax: concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of other phrases.
  4. Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meaning. The study of context-independent meaning.
  5. Pragmatics: concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
  6. Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence, for example when interpreting pronouns and the temporal aspects of the information.
  7. World Knowledge: includes general knowledge about the context of the sentence.

Ambiguity in Natural Language:
One of the reasons natural language processing is so hard is ambiguity. It raises questions such as:
-How many different interpretations does one sentence have?
-What are the reasons for the ambiguity?
-How can each ambiguous piece be resolved?
-Does speech input make the sentence even more ambiguous?
The categories of knowledge of language can be thought of as ambiguity-resolving components.
Ambiguity refers to how many different interpretations a sentence can have.
For example: I made her duck!!
Some interpretations for the same are:

  1. I cooked duck for her.
  2. I cooked duck belonging to her.
  3. I created a toy duck which she owns
  4. I caused her to quickly lower her head or body
  5. I used magic and turned her into a duck.

Duck: morphologically and syntactically ambiguous (noun or verb)
Her: syntactically ambiguous (dative or possessive)
Make: semantically ambiguous (cook or create)
Make: also syntactically ambiguous (it can take one object or two)
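To get a feel for how quickly the choices multiply, here is a minimal sketch that counts the candidate readings of the example when each word is allowed several categories. The small lexicon and its tag names are hypothetical and exist only for this illustration.

```python
from itertools import product

# Hypothetical mini-lexicon: every category each word could take.
LEXICON = {
    "I": ["PRONOUN"],
    "made": ["VERB:cook", "VERB:create", "VERB:cause"],
    "her": ["PRONOUN:dative", "PRONOUN:possessive"],
    "duck": ["NOUN", "VERB"],
}

def candidate_readings(sentence):
    """Return every combination of word categories for the sentence."""
    words = sentence.split()
    options = [LEXICON[w] for w in words]
    return [list(zip(words, combo)) for combo in product(*options)]

readings = candidate_readings("I made her duck")
print(len(readings))  # 1 * 3 * 2 * 2 combinations before any filtering
for reading in readings[:3]:
    print(reading)
```

Not every combination is a sensible reading. Pruning the impossible ones is exactly the job of the knowledge sources listed earlier (syntax, semantics, discourse and world knowledge).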
Components of NLP:
There are two components of NLP: Natural Language Understanding and Natural Language Generation.
Natural Language Understanding means:
-Mapping the given natural language input into a useful representation: what the sentence means and why it was used.
-Different levels of analysis are required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis.
Natural Language Generation means:
-Producing output in the natural language from some internal representation.
-Different levels of synthesis are required: deep planning and syntactic generation.
Natural Language Understanding:
Words
  -> Morphological Analysis
  -> Syntactic Analysis (produces the syntactic structure)
  -> Semantic Analysis (produces the context-independent meaning representation)
  -> Discourse Processing (produces the final meaning representation)
If the input is speech, it first goes through phonological analysis to obtain the words. The text is then fed to morphological analysis, which analyzes the form of each word. For example, in “I went to school” the analysis finds that the root is “go” and that “went” is its past tense. Syntactic analysis then finds the role of each word, for example whether it is a verb, adjective or pronoun. Semantic analysis works out the meaning of the words and of the sentence, and discourse processing derives the final meaning of the sentence in its context.
-In morphological analysis we try to break words into their linguistic components.
-Morphemes are the smallest meaningful units of language. Ex.
Cars: car+PLU
Giving: give+PROG
Geliyordum (Turkish, “I was coming”): gel+PROG+PAST+1SG
-Ambiguity: more than one alternative
Flies: fly(verb)+3SG (third person singular)
       fly(noun)+PLURAL
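The idea can be sketched as a toy suffix-stripping analyzer. This is only an illustration under stated assumptions: the rule table, the stem lexicon and the tag names (PLU, PROG, 3SG for third person singular) are hypothetical, and a real analyzer needs far richer rules.

```python
# Hypothetical suffix rules: (suffix, replacement, category, tag).
RULES = [
    ("ies", "y", "noun", "PLU"),
    ("ies", "y", "verb", "3SG"),
    ("s", "", "noun", "PLU"),
    ("ing", "e", "verb", "PROG"),
    ("ing", "", "verb", "PROG"),
]

# Hypothetical stem lexicon used to reject impossible analyses.
STEMS = {("car", "noun"), ("give", "verb"), ("fly", "noun"), ("fly", "verb")}

def analyze(word):
    """Return all analyses of the form 'stem(category)+TAG'."""
    word = word.lower()
    analyses = []
    for suffix, replacement, category, tag in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if (stem, category) in STEMS:
                analyses.append(f"{stem}({category})+{tag}")
    return analyses

for w in ["Cars", "Giving", "Flies"]:
    print(w, analyze(w))
```

A word such as “Flies” comes back with two analyses, which is the ambiguity that later stages have to resolve.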
Part-of-Speech (POS) Tagging-

  • Each word has a part-of-speech tag to describe its category
  • The part-of-speech tag of a word is one of the major word groups
    Open Classes: nouns, verbs, adjectives, adverbs
    Closed Classes: prepositions, determiners, conjunctions, pronouns, particles
  • POS taggers try to find POS tags for the words
  • Is duck a verb or a noun? (A morphological analyzer cannot make that decision)
  • A POS tagger may make that decision by looking at the surrounding words.
    Duck! (verb)
    Duck is a delicious dinner (noun)
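As a rough sketch of how surrounding words can settle the tag, the following toy tagger applies three hand-written context rules. The lexicon, the tag names and the rules themselves are assumptions made for this example; practical taggers learn such decisions from annotated text.

```python
# Hypothetical lexicon: the possible tags of each known word.
POSSIBLE_TAGS = {
    "duck": {"NOUN", "VERB"},
    "is": {"VERB"},
    "a": {"DET"},
    "the": {"DET"},
    "delicious": {"ADJ"},
    "dinner": {"NOUN"},
}

def tag(words):
    """Tag each word, using its neighbours to settle ambiguous cases."""
    tagged = []
    for i, word in enumerate(words):
        if not word.isalnum():
            tagged.append((word, "PUNCT"))
            continue
        # Unknown words default to NOUN in this sketch.
        options = POSSIBLE_TAGS.get(word.lower(), {"NOUN"})
        if len(options) == 1:
            choice = next(iter(options))
        else:
            prev_tag = tagged[-1][1] if tagged else None
            next_word = words[i + 1].lower() if i + 1 < len(words) else None
            if prev_tag in {"DET", "ADJ"}:
                # Rule 1: after a determiner or adjective, prefer a noun.
                choice = "NOUN"
            elif POSSIBLE_TAGS.get(next_word) == {"VERB"}:
                # Rule 2: directly before an unambiguous verb, read it as the subject noun.
                choice = "NOUN"
            else:
                # Rule 3: otherwise (e.g. a bare command) read it as a verb.
                choice = "VERB"
        tagged.append((word, choice))
    return tagged

print(tag(["Duck", "!"]))
print(tag("Duck is a delicious dinner".split()))
```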

Lexical processing-

  • The purpose of lexical processing is to determine meanings of individual words.
  • The basic method is to look words up in a database of meanings, the lexicon
  • We should also identify non-words such as punctuation marks
  • Word-level ambiguity: words may have several meanings, and the correct one cannot be chosen based solely on the word itself.
  • Some of this ambiguity can be resolved on the spot by POS tagging.
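A minimal sketch of a lexicon lookup, assuming a tiny hand-made lexicon (the entries and the status labels are hypothetical). It separates punctuation from words and flags words that have more than one recorded meaning.

```python
import string

# Hypothetical lexicon mapping words to their possible meanings.
LEXICON = {
    "bank": ["financial institution", "side of a river"],
    "book": ["written work", "to reserve"],
    "the": ["definite article"],
}

def lexical_lookup(text):
    """Split text into tokens and look each one up in the lexicon."""
    # Pad punctuation with spaces so that "bank." becomes "bank" and ".".
    for ch in string.punctuation:
        text = text.replace(ch, f" {ch} ")
    results = []
    for token in text.split():
        if token in string.punctuation:
            results.append((token, "PUNCTUATION", []))
            continue
        senses = LEXICON.get(token.lower(), [])
        if len(senses) > 1:
            status = "AMBIGUOUS"
        elif senses:
            status = "KNOWN"
        else:
            status = "UNKNOWN"
        results.append((token, status, senses))
    return results

for entry in lexical_lookup("The bank."):
    print(entry)
```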

Syntactic processing-
Parsing- Converting a flat input sentence into a hierarchical structure that corresponds to the units of meaning  in the sentence.
There are different parsing formalisms and algorithms.
Most formalisms have two main components-

  • Grammar: a declarative representation describing the syntactic structure of the sentences in the language, for example how the subject and object are related to each other.
  • Parser: an algorithm that analyzes the input and outputs its structural representation consistent with the grammar specification.
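The split between grammar and parser can be sketched in a few lines. The grammar and word lists below are hypothetical, and the parser is a simple top-down one chosen for brevity. It is one of many possible parsing algorithms and would loop forever on a left-recursive grammar.

```python
# Hypothetical grammar: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}

# Hypothetical word lists for the pre-terminal symbols.
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "ball"},
    "V": {"chased", "slept"},
    "Name": {"mary"},
}

def parse(symbol, words, start):
    """Yield (tree, next_position) for every way symbol can match from start."""
    if symbol in LEXICON:
        if start < len(words) and words[start].lower() in LEXICON[symbol]:
            yield (symbol, words[start]), start + 1
        return
    for production in GRAMMAR[symbol]:
        partials = [([], start)]  # (children built so far, position reached)
        for child in production:
            extended = []
            for children, pos in partials:
                for subtree, next_pos in parse(child, words, pos):
                    extended.append((children + [subtree], next_pos))
            partials = extended
        for children, pos in partials:
            yield (symbol, *children), pos

def parse_sentence(sentence):
    """Return every tree that covers the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]

print(parse_sentence("the dog chased a ball"))
```

The flat word sequence goes in and a nested tuple comes out, which is the hierarchical structure the later stages work on. An ungrammatical input such as "dog the chased" yields an empty list.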

 
Semantic Analysis-

  • Assigning meanings to the structures created by syntactic analysis.
  • Mapping words and structures to particular domain objects in a way consistent with our knowledge of the world.
  • Semantics can play an important role in selecting among competing syntactic analyses and discarding illogical analyses.
    For example: “I robbed the bank”. Which bank is meant, the bank of a river or the financial institution?
  • We have to decide the formalisms which will be used in the meaning representation.
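One simple way to pick between the two senses of “bank” is to count how many context words support each sense. The sketch below assumes a hand-written sense inventory with hypothetical clue words. It is a simplified overlap heuristic, not a full semantic analyzer.

```python
# Hypothetical sense inventory: each sense comes with a set of clue words.
SENSES = {
    "bank": {
        "financial institution": {"money", "robbed", "loan", "account", "cash"},
        "side of a river": {"river", "water", "fishing", "shore", "boat"},
    }
}

def disambiguate(word, sentence):
    """Pick the sense whose clue words overlap most with the sentence."""
    context = {w.strip(".,!?").lower() for w in sentence.split()}
    best_sense, best_overlap = None, 0
    for sense, clues in SENSES[word].items():
        overlap = len(context & clues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when the context gives no evidence

print(disambiguate("bank", "I robbed the bank"))
print(disambiguate("bank", "We went fishing on the bank of the river"))
```

Here the clue word "robbed" stands in for world knowledge: people rob financial institutions, not riversides.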

Discourse Analysis-
A discourse is a collection of coherent sentences.
Discourses also have hierarchical structures.
Anaphora resolution: resolving referring expressions.
Ex. Mary bought a book for Kelly. She didn’t like it.
Here “She” can be Kelly or Mary, and “it” refers to the book.
Discourse structure may depend on the application:
-Monologue
-Dialogue
-Human-Computer Interaction
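For the Mary and Kelly example, the first step of anaphora resolution can be sketched as filtering the entities mentioned so far by what the pronoun allows. The entity properties and pronoun constraints below are hypothetical and hand-written. Choosing between the remaining candidates needs further discourse and world knowledge.

```python
# Hypothetical properties of the entities mentioned in the first sentence.
ENTITIES = [
    {"name": "Mary", "gender": "female", "animate": True},
    {"name": "book", "gender": "neuter", "animate": False},
    {"name": "Kelly", "gender": "female", "animate": True},
]

# Constraints each pronoun places on its antecedent.
PRONOUNS = {
    "she": {"gender": "female", "animate": True},
    "he": {"gender": "male", "animate": True},
    "it": {"gender": "neuter", "animate": False},
}

def candidate_antecedents(pronoun, entities):
    """Return the names of all entities compatible with the pronoun."""
    constraints = PRONOUNS[pronoun.lower()]
    return [e["name"] for e in entities
            if all(e[key] == value for key, value in constraints.items())]

print("She ->", candidate_antecedents("She", ENTITIES))
print("it ->", candidate_antecedents("it", ENTITIES))
```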
Knowledge Representation for NLP:

  • Which knowledge representation will be used depends on the application — Machine translation, Database Query System.
  • Requires the choice of representational framework, as well as the specific meaning vocabulary (What are concepts and relationships between these concepts –Ontology)
  • Must be computationally effective
  • Common representational formalisms:
    First Order predicate logic
    Conceptual dependency graphs
    Semantic networks
    Frame-based representations
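As an illustration of one of these formalisms, a semantic network can be sketched as a list of (subject, relation, object) triples with inheritance along is_a links. The concepts and relation names here are hypothetical examples, not a standard ontology.

```python
# Hypothetical semantic network stored as (subject, relation, object) triples.
TRIPLES = [
    ("duck", "is_a", "bird"),
    ("bird", "is_a", "animal"),
    ("bird", "can", "fly"),
    ("animal", "can", "move"),
]

def ancestors(concept):
    """Follow is_a links upward from a concept."""
    found = []
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in TRIPLES:
            if subject == current and relation == "is_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found

def abilities(concept):
    """Collect 'can' facts, including those inherited through is_a links."""
    result = []
    for c in [concept] + ancestors(concept):
        for subject, relation, obj in TRIPLES:
            if subject == c and relation == "can" and obj not in result:
                result.append(obj)
    return result

print(ancestors("duck"))
print(abilities("duck"))
```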

Natural language generation-

  • NLG is the process of constructing natural language outputs from non-linguistic inputs.
  • NLG can be viewed as the reverse process of NL understanding.
  • NLG has two main parts-
    Discourse Planner: what will be generated. Which Sentences.
    Surface Realizer: Realizes a sentence from its internal representation
  • Lexical Selection: Selecting the correct words describing the concepts.

Stages of generation, in order:
Meaning representation
  -> Utterance Planning
  -> Sentence Planning and lexical choice
  -> Sentence Generation
  -> Morphological Generation
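The two main parts of NLG can be sketched with a tiny template-based generator. The fact format, the verb table and the sentence template are all assumptions made for this illustration. The planner here is deliberately trivial.

```python
# Hypothetical internal (non-linguistic) representation of two facts.
FACTS = [
    {"event": "buy", "agent": "Mary", "object": "book", "tense": "past"},
    {"event": "like", "agent": "Kelly", "object": "book", "tense": "past", "negated": True},
]

# Hypothetical verb table used for lexical selection and morphological generation.
VERB_FORMS = {
    "buy": {"base": "buy", "past": "bought"},
    "like": {"base": "like", "past": "liked"},
}

def plan(facts):
    """Discourse planner: decide which facts to express and in what order."""
    return list(facts)  # this sketch simply keeps the given order

def realize(fact):
    """Surface realizer: turn one fact into a sentence."""
    forms = VERB_FORMS[fact["event"]]
    if fact.get("negated"):
        verb = "did not " + forms["base"]
    else:
        verb = forms[fact["tense"]]
    return f"{fact['agent']} {verb} the {fact['object']}."

def generate(facts):
    return " ".join(realize(fact) for fact in plan(facts))

print(generate(FACTS))
```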
Machine translation-
Converting a text in language A into the corresponding text in language B (or speech)
Different Machine translation architectures:
–Interlingua based systems
–Transfer based systems
How do we acquire the required knowledge resources, such as mapping rules and a bilingual dictionary? Either by hand, or automatically from corpora.
Example-based machine translation acquires the required knowledge from corpora.
Models to represent Linguistic Knowledge

  • We will use certain formalisms to represent the required linguistic knowledge
  • State machines — FSAs, FSTs, HMMs, ATNs, RTNs
  • Formal Rule System– Context Free Grammars, Unification grammars, Probabilistic CFGs.
  • Logic-Based Formalisms– First order predicate logic, some higher order logic.
  • Models of uncertainty– Bayesian probability theory
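The simplest of these models, a finite-state automaton (FSA), can be sketched as a transition table. The toy language accepted below ("b", at least two "a"s, then "!") is chosen only for illustration.

```python
# Transition table of a deterministic FSA that accepts "baa!", "baaa!", ...
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(text):
    """Return True if the FSA ends in an accepting state after reading text."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, accepts(s))
```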

Applications of Natural Language processing:

  1. Machine Translation
  2. Database Access
  3. Information Retrieval- Selection from a set of documents the ones that are relevant to a query
  4. Text Categorization- Sorting text into fixed topic categories
  5. Extracting data from text- Converting unstructured text into structured data
  6. Spoken language control systems
  7. Spelling and grammar checkers

Lab Work: Using NLTK for extracting data

import random  # to shuffle the names
import nltk
from nltk.corpus import names

# Run nltk.download('names') once if the names corpus is not installed.

def genderfeature(word):
    # the only feature is the last letter of the name
    return {'last_letter': word[-1]}

males = [(name, 'male') for name in names.words('male.txt')]
females = [(name, 'female') for name in names.words('female.txt')]

labellednames = males + females

random.shuffle(labellednames)
# to find features
featureset = [(genderfeature(n), gender) for (n, gender) in labellednames]
train, test = featureset[500:], featureset[:500]

# design algorithm
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(genderfeature('adam')))
print(classifier.classify(genderfeature('trinity')))

# to check accuracy
print(nltk.classify.accuracy(classifier, test))

Output (the accuracy changes slightly from run to run because the names are shuffled)–
              male
              female
              0.76

Sentiment Analysis Example:
In this example, we will work on sentiment analysis. We will look at a few sentences and measure whether each one is positive, negative or neutral, and how strongly.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Run nltk.download('vader_lexicon') once if the VADER lexicon is not installed.
sentences = ["Manoj is good, smart and funny",
             "MANOJ IS GOOD, smart & funny",
             "The book was very good",
             "The book was somewhat good",
             "A really horrible book",
             "The food in the hotel was not good",
             "The movie was too good",
             "The movie was neither funny nor bad",
             "The script was not so fascinating but the acting was good",
             "He is an excellent man.",
             ".",
             ":) AND :P"]
sentiment = SentimentIntensityAnalyzer()
for i in sentences:
    print(i)
    ss = sentiment.polarity_scores(i)
    # print the scores of this sentence on one line
    for k in sorted(ss):
        print("{0}: {1}, ".format(k, ss[k]), end="")
    print()

The output lists, for each sentence, its negative, neutral, positive and compound scores. Although the first and second sentences use the same words, the positive score is higher for the second one because the good words have been capitalized, i.e. emphasis has been placed on them, increasing the positivity of the sentence.
Similarly, when a sentence is negative, the negative value increases.
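To turn the scores into a single label, a common convention is to threshold the compound value. The sketch below assumes a dictionary shaped like the one returned by polarity_scores (keys neg, neu, pos, compound). The cut-off of 0.05 is a widely used default that you may want to tune, and the example dictionaries are made-up values, not real analyzer output.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a dictionary of sentiment scores to a single label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up score dictionaries, used only to show the three outcomes.
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
for scores in examples:
    print(scores["compound"], label_from_scores(scores))
```

In the loop above you could call label_from_scores(ss) after computing ss to print one label per sentence.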

Internet Data Analysis:
Here we will analyze data from the internet, using Twitter. We will create an app, use it to fetch the latest tweets from Twitter, and then apply sentiment analysis to them using natural language processing libraries such as nltk.
First, install the Python package tweepy. Go to your command prompt and run:
pip install tweepy
Then start Python and check that the imports work:

>>> import tweepy
>>> import nltk
>>> import textblob
If no error is shown, move ahead and start coding.

Go to Twitter at this link: http://apps.twitter.com and create an app.

Fill out the form and create your application. Once you are done, get the access token URL.

Then you can go to your account settings and find the consumer key and consumer secret.

If you don’t have an access token, click on CREATE ACCESS TOKEN at the bottom of the page to generate one.
Start with your code:
import tweepy
from textblob import TextBlob
from tweepy import OAuthHandler

consumer_key = "Copy the consumer key here"
consumer_secret_key = "Copy your consumer secret key here"
access_token = "Copy your access token here"
access_secret_token = "Copy your access secret token here"

# create a handler of type OAuthHandler with the consumer key and consumer secret,
# then attach the access token and access secret token
auth = OAuthHandler(consumer_key, consumer_secret_key)
auth.set_access_token(access_token, access_secret_token)
api = tweepy.API(auth)

# using the api we can fetch the latest tweets from twitter
# (newer tweepy releases name this method search_tweets instead of search)
fetched_tweets = api.search(q="Barack Obama", count=1)
print(fetched_tweets)
Output:
The output shows the most recent tweets that match the search query.

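To clean up the text of each tweet (removing mentions, special characters and links):
import re

fetched_tweets = api.search(q="Barack Obama", count=1)
for tweet in fetched_tweets:
    # replace mentions, special characters and links with spaces, then collapse the spaces
    b = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet.text).split())
    print(b)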
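The cleaning step can be tried without a Twitter account by wrapping the regular expression in a small function. The function name clean_tweet and the sample text are hypothetical. The pattern removes @mentions, any character that is not a letter, digit, space or tab, and links of the form scheme://...

```python
import re

# Three alternatives: @mentions, special characters, links.
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def clean_tweet(text):
    """Replace unwanted pieces with spaces, then collapse repeated spaces."""
    return " ".join(re.sub(PATTERN, " ", text).split())

sample = "@user Loving the new phone!!! http://t.co/abc #happy"
print(clean_tweet(sample))
```

Returning a single string (rather than a list of words) matters because TextBlob expects a string as input.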
For sentiment analysis:
import re

tweets = []

def sentimentanalysis(tweet):
    # TextBlob polarity ranges from -1 (negative) to +1 (positive)
    data = TextBlob(tweet)
    if data.sentiment.polarity > 0:
        return 'positive'
    elif data.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

for tweet in fetched_tweets:
    # strip mentions, special characters and links, then rejoin into one string
    b = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet.text).split())
    dictionary = {}
    dictionary['text'] = tweet.text
    dictionary['sentiment'] = sentimentanalysis(b)

    # to avoid storing the same retweet more than once
    if tweet.retweet_count > 0:
        if dictionary not in tweets:
            tweets.append(dictionary)
    else:
        tweets.append(dictionary)
    print(b)

# store all positive tweets in ptweets
ptweets = [tweet for tweet in tweets if tweet['sentiment'] == 'positive']
# store all negative tweets in ntweets
ntweets = [tweet for tweet in tweets if tweet['sentiment'] == 'negative']
neutral = [tweet for tweet in tweets if tweet['sentiment'] == 'neutral']
print('\n\npositive tweets')
for tweet in ptweets:
    print(tweet)
    print('\n')
print('\n\nnegative tweets')
for tweet in ntweets:
    print(tweet)
    print('\n')
print('\n\nneutral tweets')
for tweet in neutral:
    print(tweet)
    print('\n')
# to check the percentage of negative and positive tweets
if tweets:  # avoid dividing by zero when nothing was fetched
    print('positive tweets percentage {}%'.format(100 * len(ptweets) / len(tweets)))
    print('negative tweets percentage {}%'.format(100 * len(ntweets) / len(tweets)))
Lab: Extracting data using TextBlob
First install TextBlob and download its corpora:
pip install textblob
python -m textblob.download_corpora
 
Recommended Reading: Programming A Driverless Car Chapter 6: Clustering
Next: Chapter 8: Support Vector Machine