How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand-labeled the positive matches about Apple Inc.; they are marked as 1 below.

Here are a couple of lines:

1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay

Here is the total data set: http://pastebin.com/eJuEb4eB

I need to build a model that classifies "Apple" (Inc.) from the rest.

I'm not looking for a general overview of machine learning; rather, I'm looking for an actual model in code (Python preferred).


What you are looking for is called Named Entity Recognition (NER). It is a statistical technique that most commonly uses Conditional Random Fields, trained on labeled text, to find named entities.

Essentially, it looks at the content and context of each word (a few words back and forward) to estimate the probability that the word is a named entity.
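To make the "content and context" idea concrete, here is a minimal sketch of the kind of per-word features such a model might be fed. The function name `context_features`, the window size of 2, and the `<pad>` marker are my own choices for illustration, not part of any particular NER library.

```python
def context_features(tokens, i, window=2):
    """Build a feature dict for tokens[i] from the word itself and its neighbours."""
    features = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Use a padding marker when the window runs past the sentence boundary.
        neighbour = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        features[f"word[{offset:+d}]"] = neighbour
    return features


tokens = "I was eating an apple over at Apple headquarters".split()
fruit = context_features(tokens, 4)    # the lowercase "apple"
company = context_features(tokens, 7)  # the capitalized "Apple"
```

Both occurrences share the same lowercased word, but their neighbours differ ("an" before the fruit, "headquarters" after the company), and that difference is what a trained sequence model can pick up on.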

Good software can also look at other features of words, such as their length or shape (for example, "Vcv" if the word starts with vowel-consonant-vowel).

A very good library (GPL) is Stanford's NER.
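As a rough illustration of a shape feature, the sketch below maps each letter to "v" or "c" (uppercase if the letter is uppercase) and each digit to "d". This is only one reading of the "Vcv" idea; `word_shape` is a hypothetical helper, and real NER toolkits may define word shape differently.

```python
VOWELS = set("aeiou")


def word_shape(word):
    """Collapse a word into a vowel/consonant pattern that keeps its casing."""
    shape = []
    for ch in word:
        if ch.isalpha():
            symbol = "v" if ch.lower() in VOWELS else "c"
            shape.append(symbol.upper() if ch.isupper() else symbol)
        elif ch.isdigit():
            shape.append("d")
        else:
            # Keep punctuation such as '#' or '@' as-is.
            shape.append(ch)
    return "".join(shape)


shapes = {w: (len(w), word_shape(w)) for w in ["Apple", "apple", "iOS7"]}
```

Here "Apple" and "apple" get different shapes only because of the leading capital, which is often a useful hint for proper nouns (although tweets are not consistent about capitalization).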

Here's the demo: http://nlp.stanford.edu:8080/ner/

Some sample text to try:

I was eating an apple over at Apple headquarters and I thought about Apple Martin, the daughter of the Coldplay guy

(the 3class and 4class classifiers get it right)


I would do it as follows:

  1. Split each tweet into words, normalise them, and build a dictionary.
  2. For each word, store how many times it occurred in tweets about the company and how many times it occurred in tweets about the fruit. These tweets must be labeled by a human.
  3. When a new tweet comes in, look up every word of the tweet in the dictionary and calculate a weighted score. Words that are used frequently in relation to the company get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, contribute little to the score.
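
The three steps above could be sketched roughly as follows. The weighting is my own assumption (a log ratio of add-one-smoothed counts, which is close in spirit to Naive Bayes), the function names are made up for the example, and the training data is just shortened versions of the five sample tweets from the question, so treat it as a starting point rather than a finished classifier.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the tweet and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#@_']+", text.lower())


def train(labeled_tweets):
    """Step 2: count how often each word appears under each human-assigned label."""
    company, fruit = Counter(), Counter()
    for label, text in labeled_tweets:
        (company if label == 1 else fruit).update(tokenize(text))
    return company, fruit


def score(text, company, fruit):
    """Step 3: sum per-word log ratios; positive means 'company', negative 'fruit'."""
    total = 0.0
    for word in tokenize(text):
        # Add-one smoothing: unseen or evenly split words contribute about zero.
        total += math.log((company[word] + 1) / (fruit[word] + 1))
    return total


labeled = [
    (1, "Apple targets big business with new iOS 7 features"),
    (0, "When did green skittles change from lime to green apple?"),
    (1, "apple inc is searching for people to help and tryout all their upcoming tablet"),
    (0, "have you tried apple pie shine?"),
    (1, "Microsoft to bring Xbox and PC games to Apple, Android phones"),
]
company_counts, fruit_counts = train(labeled)

company_score = score("New iOS features from Apple", company_counts, fruit_counts)
fruit_score = score("green apple pie", company_counts, fruit_counts)
```

With this toy data the first tweet scores above zero and the second below zero. With uneven class sizes you would probably want to normalise the counts by the total number of words per class before taking the ratio.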