Bag of words meets bag of popcorn

Part 2 / Word Vector

The word2vec model

Papers

  • Efficient Estimation of Word Representations in Vector Space (2013, Mikolov)
    • the initial version
    • introduces CBOW and Skip-gram
  • Distributed Representations of Words and Phrases and their Compositionality (2013, Mikolov)
    • adds tuning techniques (negative sampling, subsampling of frequent words)

Notes on word2vec

  1. Representing each word by one-hot encoding or bag of words produces very large, very sparse vectors, and neural nets perform poorly on them.
  2. Builds on the idea that words with similar surroundings tend to have similar meanings.
  3. Embedding is the process of converting words into (dense) vectors.
  4. Word2Vec uses distributed representations of text to capture similarity between concepts. For example, it understands that Paris relates to France the way Berlin relates to Germany (capital and country).


  5. Uses the CBOW and Skip-gram techniques (a short sketch follows this list).

  • CBOW
    • CBOW (continuous bag-of-words)
    • Predicts a single word from the surrounding text.
    • Better suited to smaller datasets.
  • Skip-gram
    • The inverse: infers the surrounding context words from the target word.
    • Better suited to large datasets.
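
As a rough, hypothetical sketch of points 4 and 5 (a made-up toy corpus, not the competition data; gensim 3.x API as used in this post, where sg selects the architecture):

from gensim.models import word2vec

# Toy corpus, repeated so every word clears min_count.
toy_sentences = [
    ['paris', 'is', 'the', 'capital', 'of', 'france'],
    ['berlin', 'is', 'the', 'capital', 'of', 'germany'],
] * 200

# sg=0 trains CBOW (predict a word from its context),
# sg=1 trains Skip-gram (predict the context from a word).
cbow = word2vec.Word2Vec(toy_sentences, sg=0, size=50, min_count=1, window=3)
skipgram = word2vec.Word2Vec(toy_sentences, sg=1, size=50, min_count=1, window=3)

# The analogy from point 4 as vector arithmetic:
# paris - france + germany should land near berlin
# (unreliable on a corpus this tiny; shown for the mechanics only).
print(skipgram.wv.most_similar(positive=['paris', 'germany'],
                               negative=['france'], topn=1))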

# Reference: https://gist.github.com/yong27/7869662
# http://www.racketracer.com/2016/07/06/pandas-in-parallel/
from multiprocessing import Pool
import numpy as np
import pandas as pd

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # Pop the workers parameter off the keyword arguments
    workers = kwargs.pop('workers')
    # Create a process pool with that many workers
    pool = Pool(processes=workers)
    # Split the dataframe into one chunk per worker and apply func to each
    result = pool.map(_apply_df, [(d, func, kwargs)
            for d in np.array_split(df, workers)])
    pool.close()
    # Concatenate the partial results and return them
    return pd.concat(list(result))
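A quick, hypothetical usage sketch (not in the original notebook): the helper is called like apply, plus a workers keyword that it pops off before forwarding the remaining keyword arguments to apply. On Windows the call must sit under an if __name__ == '__main__': guard.

# Hypothetical example: square a toy Series with 4 worker processes.
if __name__ == '__main__':
    toy = pd.Series(range(8))
    squared = apply_by_multiprocessing(toy, lambda v: v ** 2, workers=4)
    print(squared.tolist())  # [0, 1, 4, 9, 16, 25, 36, 49]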
from kaggleBagofWord import kaggleBagofWord
# the preprocessing routines from the previous part, wrapped up as a class
import pandas as pd

train = pd.read_csv('data/labeledTrainData.tsv',
                    header=0, delimiter='\t', quoting=3)
test = pd.read_csv('data/testData.tsv',
                   header=0, delimiter='\t', quoting=3)
unlabeled_train = pd.read_csv('data/unlabeledTrainData.tsv',
                              header=0, delimiter='\t', quoting=3)

print(train.shape)
print(test.shape)
print(unlabeled_train.shape)

print(train['review'].size)
print(test['review'].size)
print(unlabeled_train['review'].size)
(25000, 3)
(25000, 2)
(50000, 2)
25000
25000
50000
kaggleBagofWord.review_to_wordlist(train['review'][0])[:10]
['with', 'all', 'this', 'stuff', 'go', 'down', 'at', 'the', 'moment', 'with']
sentences = []
for review in train["review"]:
    sentences += kaggleBagofWord.review_to_sentences(
        review, remove_stopwords=False)
C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:219: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
(... further BeautifulSoup "looks like a filename" / "looks like a URL" warnings elided ...)
for review in unlabeled_train["review"]:
    sentences += kaggleBagofWord.review_to_sentences(
        review, remove_stopwords=False)
(... the same BeautifulSoup warnings repeat while parsing the unlabeled reviews ...)
len(sentences)

795538
sentences[0][:10]

['with', 'all', 'this', 'stuff', 'go', 'down', 'at', 'the', 'moment', 'with']
sentences[1][:10]

['mayb', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into']

Word2Vec model parameters

  • Architecture: skip-gram or CBOW. Skip-gram is slightly slower but tends to produce better results. (Note that gensim's default is CBOW, sg=0, which is what the training log below reports.)

  • Training algorithm: hierarchical softmax or negative sampling. The default works well here. (The log below shows hs=0 negative=5, i.e., gensim defaults to negative sampling.)

  • Downsampling of frequent words: the Google documentation recommends values between .00001 and .001. Here, a value close to 0.001 appeared to improve the accuracy of the final model.

  • Word vector dimensionality: more features are not always better, but they generally give a somewhat better model. Reasonable values range from the tens to the hundreds; we use 300 here.

  • Context / window size: how many context words should the training algorithm take into account? A larger number suits hierarchical softmax, but around 10 works well.

  • Worker threads: the number of parallel processes to run. This varies by machine, but 4 to 6 works on most systems.

  • Minimum word count: helps limit the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. Reasonable values fall between 10 and 100. Since each movie in this competition has about 30 reviews, we set the minimum to 40 to avoid giving individual movie titles too much weight. That leaves a vocabulary of roughly 15,000 words (the log below retains 11,986 unique words after stemming). Higher values also help keep the run time in check.

import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)
# Set the hyperparameter values
num_features = 300    # word vector dimensionality
min_word_count = 40   # minimum word count
num_workers = 4       # number of worker threads to run in parallel
context = 10          # context window size
downsampling = 1e-3   # downsample rate for frequent words

# Initialize and train the model
from gensim.models import word2vec

# Train the model
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
model
C:\ProgramData\Anaconda3\lib\site-packages\gensim-3.4.0-py3.6-win-amd64.egg\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-04-30 12:06:49,740 : INFO : 'pattern' package not found; tag filters are not available for English
2018-04-30 12:06:49,755 : INFO : collecting all words and their counts
2018-04-30 12:06:49,756 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
(... PROGRESS lines logged every 10,000 sentences elided ...)
2018-04-30 12:06:57,045 : INFO : collected 86996 word types from a corpus of 17798270 raw words and 795538 sentences
2018-04-30 12:06:57,045 : INFO : Loading a fresh vocabulary
2018-04-30 12:06:57,131 : INFO : min_count=40 retains 11986 unique words (13% of original 86996, drops 75010)
2018-04-30 12:06:57,133 : INFO : min_count=40 leaves 17434033 word corpus (97% of original 17798270, drops 364237)
2018-04-30 12:06:57,209 : INFO : deleting the raw counts dictionary of 86996 items
2018-04-30 12:06:57,216 : INFO : sample=0.001 downsamples 50 most-common words
2018-04-30 12:06:57,217 : INFO : downsampling leaves estimated 12872363 word corpus (73.8% of prior 17434033)
2018-04-30 12:06:57,281 : INFO : estimated required memory for 11986 words and 300 dimensions: 34759400 bytes
2018-04-30 12:06:57,282 : INFO : resetting layer weights
2018-04-30 12:06:57,559 : INFO : training model with 4 workers on 11986 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
(... per-second "EPOCH n - PROGRESS" lines and worker-thread finish messages elided; the per-epoch summaries follow ...)
2018-04-30 12:07:40,000 : INFO : EPOCH - 1 : training on 17798270 raw words (12871193 effective words) took 42.4s, 303498 effective words/s
2018-04-30 12:08:22,579 : INFO : EPOCH - 2 : training on 17798270 raw words (12872113 effective words) took 42.6s, 302461 effective words/s
2018-04-30 12:08:57,825 : INFO : EPOCH - 3 : training on 17798270 raw words (12874360 effective words) took 35.2s, 365412 effective words/s
2018-04-30 12:09:46,930 : INFO : EPOCH - 4 : training on 17798270 raw words (12873324 effective words) took 49.1s, 262219 effective words/s
2018-04-30 12:10:19,553 : INFO : EPOCH - 5 : training on 17798270 raw words (12871972 effective words) took 32.6s, 394752 effective words/s
2018-04-30 12:10:19,555 : INFO : training on a 88991350 raw words (64362962 effective words) took 202.0s, 318637 effective words/s

<gensim.models.word2vec.Word2Vec at 0x23f36bbac18>
# Once training is complete, unload the memory we no longer need:
# init_sims(replace=True) precomputes the L2-normalized vectors in place.
model.init_sims(replace=True)

model_name = '300features_40minwords_10text'
# model_name = '300features_50minwords_20text'
model.save(model_name)
2018-04-30 12:10:19,865 : INFO : precomputing L2-norms of word weight vectors
2018-04-30 12:10:20,082 : INFO : saving Word2Vec object under 300features_40minwords_10text, separately None
2018-04-30 12:10:20,099 : INFO : not storing attribute vectors_norm
2018-04-30 12:10:20,118 : INFO : not storing attribute cum_table
2018-04-30 12:10:21,103 : INFO : saved 300features_40minwords_10text
# Pick out the word that doesn't match the others
model.wv.doesnt_match('man woman child kitchen'.split())
'kitchen'
model.wv.doesnt_match("france england germany berlin".split())

2018-04-30 12:10:21,386 : WARNING : vectors for words {'germany', 'france'} are not present in the model, ignoring these words

'england'

(The warning is a side effect of stemming: 'germany' and 'france' were reduced to 'germani' and 'franc', so the unstemmed forms are not in the vocabulary.)
# Find the most similar words
model.wv.most_similar("man")
[('woman', 0.6355543732643127),
 ('businessman', 0.5106414556503296),
 ('lad', 0.49627137184143066),
 ('millionair', 0.4852792024612427),
 ('ladi', 0.48219048976898193),
 ('policeman', 0.47352561354637146),
 ('widow', 0.4686756134033203),
 ('farmer', 0.4667765200138092),
 ('men', 0.4604969620704651),
 ('boxer', 0.4499785602092743)]
model.wv.most_similar("queen")
[('princess', 0.6181148886680603),
 ('madam', 0.5621399283409119),
 ('latifah', 0.5599690675735474),
 ('countess', 0.557962954044342),
 ('dame', 0.5570350885391235),
 ('stepmoth', 0.554591178894043),
 ('victoria', 0.5522404909133911),
 ('maid', 0.5426138639450073),
 ('maria', 0.533758282661438),
 ('eva', 0.5325278639793396)]
# model.wv.most_similar("happy")
model.wv.most_similar("happi")  # the reviews were stemmed, so query the stemmed form
[('unhappi', 0.45206087827682495),
 ('sad', 0.43361160159111023),
 ('bitter', 0.40061312913894653),
 ('satisfi', 0.3947785198688507),
 ('lucki', 0.3823172450065613),
 ('joy', 0.37378984689712524),
 ('happier', 0.36996692419052124),
 ('glad', 0.3682306408882141),
 ('sappi', 0.36718422174453735),
 ('afraid', 0.3600339889526367)]
# Reference: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
import gensim
import gensim.models as g

# Work around minus signs rendering as broken glyphs in plots
mpl.rcParams['axes.unicode_minus'] = False

model_name = '300features_40minwords_10text'
model = g.Doc2Vec.load(model_name)

vocab = list(model.wv.vocab)
X = model[vocab]

print(len(X))
print(X[0][:10])
tsne = TSNE(n_components=2)

# Visualize only the first 100 words
X_tsne = tsne.fit_transform(X[:100,:])
# X_tsne = tsne.fit_transform(X)
2018-04-30 12:10:26,511 : INFO : loading Doc2Vec object from 300features_40minwords_10text
2018-04-30 12:10:26,982 : INFO : loading wv recursively from 300features_40minwords_10text.wv.* with mmap=None
2018-04-30 12:10:26,983 : INFO : setting ignored attribute vectors_norm to None
2018-04-30 12:10:26,984 : INFO : loading vocabulary recursively from 300features_40minwords_10text.vocabulary.* with mmap=None
2018-04-30 12:10:26,985 : INFO : loading trainables recursively from 300features_40minwords_10text.trainables.* with mmap=None
2018-04-30 12:10:26,986 : INFO : setting ignored attribute cum_table to None
2018-04-30 12:10:26,987 : INFO : loaded 300features_40minwords_10text
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  from ipykernel import kernelapp as app


11986
[-0.04888876  0.02344336 -0.11825124 -0.0108145  -0.02040573  0.09992806
 -0.01389777 -0.02648932 -0.07141103 -0.03799058]
df = pd.DataFrame(X_tsne, index=vocab[:100], columns=['x', 'y'])
df.shape
(100, 2)
df.head(10)
                 x         y
with    -12.056787  4.115122
all      -2.497947  5.160569
this     -3.791062  4.612969
stuff    -2.408651 -1.128544
go       -9.874672  0.933419
down    -11.559048  7.447039
at      -10.873740  5.436837
the      -2.758157  6.790027
moment    0.124072  2.817021
mj        3.170858 -2.268975
fig = plt.figure()
fig.set_size_inches(40, 20)
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos, fontsize=30)
plt.show()

[Figure: t-SNE scatter plot of the first 100 word vectors, each point annotated with its word]

import numpy as np

def makeFeatureVec(words, model, num_features):
    """
    Average the word vectors of the words in a given review.
    """
    # Pre-initialize a zero-filled array, for speed.
    featureVec = np.zeros((num_features,), dtype="float32")

    nwords = 0.
    # index2word is a list of the words in the model's vocabulary.
    # Convert it to a set, for fast lookups.
    index2word_set = set(model.wv.index2word)
    # Loop over the words; for each one in the model's vocabulary,
    # add its vector to the running total.
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            # (model[word] is the old-style lookup that triggers the
            #  DeprecationWarning seen below; newer gensim uses model.wv[word].)
            featureVec = np.add(featureVec, model[word])
    # Divide by the number of words to get the average.
    featureVec = np.divide(featureVec, nwords)
    return featureVec
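As a quick sanity check (a hypothetical call; the tokens are made up, while model and num_features come from above), any tokenized review collapses into one fixed-length vector:

# Hypothetical: a four-token review becomes a single 300-dim vector.
vec = makeFeatureVec(['great', 'movi', 'terribl', 'act'], model, num_features)
print(vec.shape)  # (300,)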
def getAvgFeatureVecs(reviews, model, num_features):
    # Compute the average feature vector for each review's word list
    # and return a 2D numpy array.

    # Initialize a counter.
    counter = 0.
    # Preallocate a 2D numpy array, for speed.
    reviewFeatureVecs = np.zeros(
        (len(reviews), num_features), dtype="float32")

    for review in reviews:
        # Print a status message every 1000 reviews
        if counter % 1000. == 0.:
            print("Review %d of %d" % (counter, len(reviews)))
        # Call the function defined above to build the average feature vector.
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, \
            num_features)
        # Increment the counter.
        counter = counter + 1.
    return reviewFeatureVecs
# Clean the reviews in parallel, using 4 worker processes.
def getCleanReviews(reviews):
    clean_reviews = kaggleBagofWord.apply_by_multiprocessing(\
        reviews["review"], kaggleBagofWord.review_to_wordlist,\
        workers=4)
    return clean_reviews
%time trainDataVecs = getAvgFeatureVecs(\
    getCleanReviews(train), model, num_features )
Review 0 of 25000


C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:18: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).


Review 1000 of 25000
(... Review 2000 of 25000 through Review 24000 of 25000 elided ...)
Wall time: 2min 43s
%time testDataVecs = getAvgFeatureVecs(\
        getCleanReviews(test), model, num_features )
Review 0 of 25000


C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:18: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).


Review 1000 of 25000
(... Review 2000 of 25000 through Review 24000 of 25000 elided ...)
Wall time: 2min 28s
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators = 100, n_jobs = -1, random_state=2018)
%time forest = forest.fit( trainDataVecs, train["sentiment"] )
Wall time: 21.1 s
from sklearn.model_selection import cross_val_score
%time score = np.mean(cross_val_score(\
    forest, trainDataVecs, \
    train['sentiment'], cv=10, scoring='roc_auc'))
Wall time: 3min 18s
score
0.904642208
result = forest.predict( testDataVecs )
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv('data/Word2Vec_AverageVectors_{0:.5f}.csv'.format(score),
              index=False, quoting=3 )
output_sentiment = output['sentiment'].value_counts()
print(output_sentiment[0] - output_sentiment[1])
output_sentiment
66

0    12533
1    12467
Name: sentiment, dtype: int64
import seaborn as sns
%matplotlib inline

fig, axes = plt.subplots(ncols=2)
fig.set_size_inches(12,5)
sns.countplot(train['sentiment'], ax=axes[0])
sns.countplot(output['sentiment'], ax=axes[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1681f5faa90>

[Figure: side-by-side countplots of the sentiment distribution in the training data (left) and in the predictions (right)]

Score 81.34%
