A Detailed Guide to the Two Major NLP Tasks with Hands-On Code: NLU and NLG (Basic Concepts)

Posted by 数栈君 on 2024-01-12 10:50

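Word Vectors

Word vectors, also called word embeddings, are a key concept in natural language processing. By mapping words into a continuous vector space, they let machines capture similarity and semantic relationships between words. The sections below introduce several of the main word vector models.

Word2Vec
Word2Vec is a popular word embedding method that learns word vectors from large amounts of text through unsupervised learning. It comes in two architectures: Skip-Gram, which predicts the context from the current word, and CBOW (Continuous Bag of Words), which predicts the current word from its context.

Skip-Gram
The Skip-Gram model uses the current word to predict the surrounding context words. Below is a simplified PyTorch implementation of a Skip-Gram model: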

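import torch
import torch.nn as nn


class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(SkipGram, self).__init__()
        # Separate embedding tables for center (input) and context (output) words
        self.in_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.out_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, target, context):
        # target and context are 1-D tensors of word indices
        in_embeds = self.in_embeddings(target)     # (batch, embed_dim)
        out_embeds = self.out_embeddings(context)  # (batch, embed_dim)
        # scores[i, j] is the dot product of target i with context j: (batch, batch)
        scores = torch.matmul(in_embeds, out_embeds.t())
        return scores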
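To make the idea of predicting context words from the current word concrete, the sketch below generates (target, context) training pairs from a tokenized sentence. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions, not part of the original code. Real Word2Vec implementations also add steps such as subsampling and negative sampling, which are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["natural", "language", "processing", "is", "fun"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

For training, the words in each pair would be mapped to integer indices and passed to the model as the `target` and `context` tensors.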
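GloVe
GloVe (Global Vectors for Word Representation) is another popular word embedding method. It builds a global word co-occurrence matrix from the corpus and learns word vectors whose dot products, plus bias terms, approximate the logarithm of the co-occurrence counts, which amounts to a weighted factorization of that matrix.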
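For reference, the objective from the GloVe paper can be written as the weighted least-squares loss below. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are their biases, $V$ is the vocabulary size, and $f$ is a weighting function that damps very frequent pairs. The notation follows the paper rather than this post; the paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```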

Below is a simplified implementation of a GloVe model:

import torch.nn as nn


class GloVe(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(GloVe, self).__init__()
        # Word and context vectors, each with one scalar bias per word
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.bias = nn.Embedding(vocab_size, 1)
        self.context_bias = nn.Embedding(vocab_size, 1)

    def forward(self, target, context):
        target_embeds = self.embeddings(target)            # (batch, embed_dim)
        context_embeds = self.context_embeddings(context)  # (batch, embed_dim)
        target_bias = self.bias(target)                    # (batch, 1)
        context_bias = self.context_bias(context)          # (batch, 1)

        # Predicted log co-occurrence for each (target, context) pair: (batch,)
        dot_product = (target_embeds * context_embeds).sum(1)
        # squeeze(1) drops only the bias dimension, so a batch of size 1 keeps its shape
        logits = dot_product + target_bias.squeeze(1) + context_bias.squeeze(1)
        return logits
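The model above only produces predictions; training it also needs co-occurrence counts and a weighted squared-error loss. The plain-Python sketch below counts co-occurrences within a symmetric window and evaluates a GloVe-style weighting function and per-pair loss. The helper names (`cooccurrence_counts`, `glove_weight`, `glove_pair_loss`) are hypothetical, the defaults `x_max=100` and `alpha=0.75` follow the values reported in the GloVe paper, and the distance-based weighting of counts used in the paper is left out for brevity.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max) ** alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(prediction, count, x_max=100.0, alpha=0.75):
    """Weighted squared error between a model prediction and log(count)."""
    return glove_weight(count, x_max, alpha) * (prediction - math.log(count)) ** 2


corpus = [["natural", "language", "processing"],
          ["language", "model", "essential"]]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("language", "natural")])           # number of times "natural" is adjacent to "language"
print(glove_pair_loss(prediction=0.0, count=1))  # loss when the prediction equals log(1)
```

In a full training loop, `prediction` would be the value returned by the model's `forward` method for that pair, and the per-pair losses would be summed over all pairs with a nonzero count.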

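FastText
FastText is a word vector and text classification model developed by the Facebook AI Research (FAIR) team. Compared with models such as Word2Vec, its main distinguishing feature is that it takes subword information inside each word into account. This helps it handle rare and out-of-vocabulary words, and it performs well across many languages and tasks.

1. Subword Representation
FastText captures word-internal structure by breaking each word into character n-grams. For example, once the boundary markers "<" and ">" are added, the 3-grams of the word "apple" are "<ap", "app", "ppl", "ple", and "le>". This subword representation helps capture morphological information, especially in morphologically rich languages.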
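A minimal sketch of this decomposition follows. It assumes the convention from the fastText paper of wrapping the word in `<` and `>` boundary markers before slicing, so the output also contains the boundary n-grams `<ap` and `le>`. The function name `char_ngrams` is hypothetical, and the defaults `min_n=3` and `max_n=6` mirror fastText's default n-gram lengths.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


print(char_ngrams("apple", min_n=3, max_n=3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# Related word forms share many n-grams, which is what lets them share information.
shared = set(char_ngrams("apple", 3, 3)) & set(char_ngrams("apples", 3, 3))
print(sorted(shared))
# ['<ap', 'app', 'ple', 'ppl']
```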

2. Training Word Vectors
The code below trains a FastText model with the Gensim library and shows how to use the trained model.

from gensim.models import FastText

# Example sentences
sentences = [["natural", "language", "processing"],
             ["language", "model", "essential"],
             ["fasttext", "is", "amazing"]]

# Train the model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for the word "language"
vector_language = model.wv["language"]

# Find the most similar words
similar_words = model.wv.most_similar("language")
print(similar_words)

# Example output (exact values vary between runs on such a tiny corpus):
# [('natural', 0.18541546112537384), ('model', 0.15876708467006683), ...]
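`most_similar` ranks words by the cosine similarity of their vectors. As a rough illustration of that measure, the plain-Python sketch below computes it for two small hand-written vectors. The function name `cosine_similarity` and the vectors are made up for this example and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```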

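3. Text Classification
FastText also provides an efficient text classification method. Unlike many deep learning models, it trains very quickly on text classification tasks, because the classifier is a shallow linear model over averaged word and n-gram embeddings.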
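The speed comes largely from the model's simplicity: as described in the fastText classification paper, the embeddings of a text's words (and n-grams) are averaged and passed to a linear classifier. The forward pass can be sketched in plain Python as below. The function names, the toy vocabulary, and all weights are hypothetical values chosen for illustration, not trained parameters.

```python
import math


def average_embedding(tokens, embeddings):
    """Average the embedding vectors of the tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights, biases):
    """Linear classifier applied to the averaged text embedding."""
    hidden = average_embedding(tokens, embeddings)
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy 2-dimensional embeddings and a 2-class linear layer (made-up numbers).
embeddings = {"great": [1.0, 0.0], "awful": [0.0, 1.0], "movie": [0.5, 0.5]}
weights = [[2.0, -2.0],   # class 0: "positive"
           [-2.0, 2.0]]   # class 1: "negative"
biases = [0.0, 0.0]

probs = classify(["great", "movie"], embeddings, weights, biases)
print(probs)  # class probabilities; class 0 should be the larger one here
```

Because there are no recurrent or convolutional layers, each training step is little more than an embedding lookup, an average, and one matrix product.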

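4. Pre-trained Models
As with Word2Vec, many pre-trained FastText models are available for specific languages and domains. These models can be used in a wide range of natural language processing tasks.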

————————————————
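Copyright notice: This article is an original work by CSDN blogger 「星川皆无恙」 and is licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/Myx74270512/article/details/135271501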