抱歉,您的浏览器无法访问本站

本页面需要浏览器支持(启用)JavaScript


了解详情 >
scikit-learn
scikit-learn

有关特征的提取,scikit-learn给出了很多方法,具体分成了图片特征提取和文本特征提取。

文本特征提取的接口是sklearn.feature_extraction.text,那么接下来学习里面封装的函数。

CountVectorizer

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)

corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
feature_name = vectorizer.get_feature_names()

print feature_name
print X.toarray()

程序的结果为

1
2
3
4
5
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
[[0 1 1 1 0 0 1 0 1]
[0 1 0 1 0 2 1 0 1]
[1 0 0 0 1 0 1 1 0]
[0 1 1 1 0 0 1 0 1]]

Reference

Keras