Practical NLP Applications

Advanced · 8 hours · NLP · Transformers · BERT

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human language, particularly how to program computers to process and analyze large amounts of natural language data.

The Evolution of NLP

1. Traditional NLP Approaches

Early NLP systems relied on rule-based methods and statistical models:

  • Rule-based systems: Hand-crafted linguistic rules
  • Statistical methods: N-grams, Hidden Markov Models
  • Machine Learning approaches: SVMs, Random Forests with feature engineering
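
To make the classical approach concrete, here is a minimal sketch of a feature-engineering pipeline: TF-IDF n-gram features feeding a linear SVM, built with scikit-learn (assumed to be installed; it is not used elsewhere in this tutorial). The tiny inline dataset is purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical examples, for illustration only)
texts = [
    "Win a free prize now", "Limited offer, claim your reward",
    "Meeting rescheduled to Monday", "Please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF bag-of-words features feed a linear SVM classifier
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(texts, labels)

print(classifier.predict(["Claim your free reward today"]))  # likely ['spam'] given the toy data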

2. Word Embeddings

Word embeddings revolutionized NLP by representing words as dense vectors in a continuous vector space:

  • Word2Vec: Learns word associations from text
  • GloVe: Global Vectors for Word Representation
  • FastText: Extension of Word2Vec that handles subword information
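
To get a feel for what these dense vectors capture, the sketch below loads pre-trained GloVe vectors through gensim's downloader (gensim is assumed to be installed; the model name comes from the gensim-data catalogue, and the vectors are downloaded on first use) and queries the embedding space:

import gensim.downloader as api

# Load pre-trained GloVe vectors as gensim KeyedVectors
word_vectors = api.load("glove-wiki-gigaword-100")

# Nearest neighbours in the embedding space
print(word_vectors.most_similar("king", topn=3))

# The classic analogy: king - man + woman ≈ queen
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))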

3. The Transformer Revolution

Transformers, introduced in the paper "Attention is All You Need" (2017), changed the landscape of NLP:

  • Self-attention mechanism: Allows models to weigh the importance of different words
  • Parallelization: Enables efficient training on large datasets
  • Transfer learning: Pre-train on large corpora, fine-tune on specific tasks
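
At the heart of the Transformer is scaled dot-product attention. The minimal single-head sketch below (plain PyTorch, no masking or learned projections, unlike a real implementation) shows the core computation:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Minimal single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len) similarity matrix
    weights = F.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ value, weights

# Toy example: a "sentence" of 4 tokens, each an 8-dimensional vector
x = torch.randn(4, 8)
output, attention_weights = scaled_dot_product_attention(x, x, x)
print(attention_weights)  # how much each token attends to every other token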

4. Modern NLP Models

Today's state-of-the-art NLP is dominated by large pre-trained models:

  • BERT: Bidirectional Encoder Representations from Transformers
  • GPT (1-4): Generative Pre-trained Transformers
  • T5: Text-to-Text Transfer Transformer
  • RoBERTa: Robustly Optimized BERT Pretraining Approach
  • XLNet: Generalized Autoregressive Pretraining

Building Practical NLP Applications

1. Text Classification

Text classification involves assigning categories to text documents. Applications include:

  • Sentiment analysis
  • Spam detection
  • Topic categorization
  • Intent recognition

Example using Hugging Face's Transformers library:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Analyze sentiment
text = "I really enjoyed this movie. The plot was engaging and the characters were well-developed."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
positive_score = predictions[0][1].item()
print(f"Sentiment score (positive): {positive_score:.4f}")

2. Named Entity Recognition (NER)

NER locates named entities in text and classifies them into predefined categories such as person names, organizations, and locations.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Create NER pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Extract entities
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in Cupertino, California."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} - {entity['entity_group']} ({entity['score']:.4f})")

3. Question Answering

Extractive question answering systems locate the span of a context passage that answers a given question:

from transformers import pipeline

# Create question answering pipeline
qa_pipeline = pipeline("question-answering")

# Context and question
context = """
The Transformer architecture was introduced in the paper "Attention is All You Need" 
by Ashish Vaswani et al. in 2017. It has become the foundation for many state-of-the-art 
NLP models including BERT and GPT.
"""
question = "When was the Transformer architecture introduced?"

# Get answer
result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")

4. Text Generation

Text generation creates new text based on a prompt:

from transformers import pipeline

# Create text generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Generate text
prompt = "Artificial intelligence has the potential to"
generated_text = generator(prompt, max_length=100, num_return_sequences=1)
print(generated_text[0]['generated_text'])

5. Text Summarization

Summarization condenses text while preserving key information:

from transformers import pipeline

# Create summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Text to summarize
article = """
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to natural intelligence 
displayed by animals including humans. AI research has been defined as the field of study of intelligent 
agents, which refers to any system that perceives its environment and takes actions that maximize its 
chance of achieving its goals. The term "artificial intelligence" had previously been used to describe 
machines that mimic and display "human" cognitive skills that are associated with the human mind, such 
as "learning" and "problem-solving". This definition has since been rejected by major AI researchers who 
now describe AI in terms of rationality and acting rationally, which does not limit how intelligence can 
be articulated.
"""

# Generate summary
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

Building an End-to-End NLP Application

Let's walk through creating a simple sentiment analysis API using FastAPI and Hugging Face:

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline
import uvicorn

# Initialize FastAPI app
app = FastAPI(title="Sentiment Analysis API")

# Load sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Define request model
class TextRequest(BaseModel):
    text: str

# Define endpoint
@app.post("/analyze-sentiment")
async def analyze_sentiment(request: TextRequest):
    try:
        # Analyze sentiment
        result = sentiment_analyzer(request.text)[0]
        return {
            "text": request.text,
            "sentiment": result["label"],
            "score": result["score"]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run the API
if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
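
With the server running (python app.py), the endpoint can be exercised from any HTTP client. A quick check using the requests library (assumed to be installed) might look like this:

# client.py — call the running API (assumes the server is up on localhost:8000)
import requests

response = requests.post(
    "http://localhost:8000/analyze-sentiment",
    json={"text": "The new release is a big improvement."},
)
print(response.json())
# e.g. {"text": "...", "sentiment": "POSITIVE", "score": 0.99...}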

Advanced NLP Techniques

1. Fine-tuning Pre-trained Models

Adapting pre-trained models to specific tasks or domains:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Fine-tune the model
trainer.train()
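
Once training finishes, the same Trainer can report held-out performance, and the fine-tuned weights can be saved for later use. A short follow-up sketch (the output directory name is arbitrary; by default evaluate() reports loss only, so pass a compute_metrics function to Trainer if you want accuracy or F1):

# Evaluate on the held-out split and persist the fine-tuned model
metrics = trainer.evaluate()
print(metrics)

trainer.save_model("./fine-tuned-distilbert-imdb")      # hypothetical output directory
tokenizer.save_pretrained("./fine-tuned-distilbert-imdb")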

2. Few-shot Learning

Adapting a model to a new task from just a handful of examples. Here this is done purely through in-context prompting, with no parameter updates:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model_name = "gpt2-large"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Few-shot prompt for sentiment analysis
prompt = """
Review: This movie was fantastic! The acting was superb.
Sentiment: Positive

Review: I was disappointed with the service at the restaurant.
Sentiment: Negative

Review: The hotel room was clean and comfortable.
Sentiment: Positive

Review: I found the book to be boring and predictable.
Sentiment: Negative

Review: The concert exceeded all my expectations.
Sentiment: """

# Generate the completion (greedy decoding; temperature only takes effect when do_sample=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(
    input_ids,
    max_new_tokens=5,                   # enough room for "Positive" or "Negative"
    do_sample=False,                    # deterministic, greedy completion
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Future Directions in NLP

The field of NLP continues to evolve rapidly. Some exciting future directions include:

  • Multimodal learning: Combining text with images, audio, and video
  • More efficient models: Reducing computational requirements while maintaining performance
  • Multilingual models: Better support for low-resource languages
  • Ethical AI: Addressing bias, fairness, and transparency in NLP systems
  • Domain-specific applications: Specialized models for healthcare, legal, financial domains

Additional Resources