{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Text mining" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We show how to use scikit-network for text mining. We here consider the novel [Les Misérables](https://en.wikipedia.org/wiki/Les_Misérables) by Victor Hugo (Project Gutenberg, translation by Isabel F. Hapgood). By considering the graph between words and paragraphs, we can embed both words and paragraphs in the same vector space and compute cosine similarity between them.\n", "\n", "Each word is considered as in the original text; more advanced [tokenizers](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) can be used instead.\n", "\n", "Other graphs can be considered, like the graph of co-occurrence of words within a window of 5 words, or the graph of chapters and words. These graphs can be combined to get richer information and better embeddings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from re import sub" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.data import from_adjacency_list\n", "from sknetwork.embedding import Spectral" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "filename = 'miserables-en.txt'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "with open(filename, 'r') as f:\n", " text = f.read()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "3254333" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Project Gutenberg EBook of Les Misérables, by Victor Hugo\n", "\n", "This eBook is for the use of anyone anywhere at no cost and with almost\n", "no restrictions whatsoever. You may copy it, give it away or re-use\n", "it under the terms of the Project Gutenberg License included with this\n", "eBook or online at www.gutenberg.org\n", "\n", "\n", "Title: Les Misérables\n", " Complete in Five Volumes\n", "\n", "Author: Victor Hugo\n", "\n", "Translator: Isabel F. Hapgood\n", "\n", "Release Date: June 22, 2008 [EBook #135]\n", "Last Updated: January 18, 2016\n", "\n", "\n" ] } ], "source": [ "print(text[:494])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Pre-processing" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# extract main text\n", "main = text.split('LES MISÉRABLES')[-2].lower()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "3215017" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(main)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# remove ponctuation\n", "main = sub(r\"[,.;:()@#?!&$'_*]\", \" \", main)\n", "main = sub(r'[\"-]', ' ', main)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# extract paragraphs\n", "sep = '|||'\n", "main = sub(r'\\n\\n+', sep, main)\n", "main = sub('\\n', ' ', main)\n", "paragraphs = main.split(sep)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "13499" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(paragraphs)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "'after leaving the asses there was a fresh delight they crossed the seine in a boat and proceeding from passy on foot they reached the barrier of l étoile they had been up since five o clock that morning as the reader will remember but bah there is no such thing as fatigue on sunday said favourite on sunday fatigue does not work '" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paragraphs[1000]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Build graph" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "graph = from_adjacency_list(paragraph_words, bipartite=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency = graph.biadjacency\n", "words = graph.names_col" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "<13499x23093 sparse matrix of type ''\n", "\twith 416331 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "biadjacency" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "23093" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(words)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Statistics" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "n_row, n_col = biadjacency.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "paragraph_lengths = biadjacency.dot(np.ones(n_col))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "array([ 6., 23., 127., 379.])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.quantile(paragraph_lengths, [0.1, 0.5, 0.9, 0.99])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "word_counts = biadjacency.T.dot(np.ones(n_row))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "array([ 1. , 2. , 23. , 282.08])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Embedding" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "dimension = 50\n", "spectral = Spectral(dimension, regularization=100)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spectral.fit(biadjacency)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "embedding_paragraph = spectral.embedding_row_\n", "embedding_word = spectral.embedding_col_" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# some word\n", "i = int(np.argwhere(words == 'love'))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "array(['love', 'kiss', 'ye', 'celestial', 'hearts', 'loved', 'tender',\n", " 'roses', 'joys', 'sweet', 'wedded', 'charming', 'angelic', 'adore',\n", " 'aurora', 'pearl', 'voluptuousness', 'chaste', 'innumerable',\n", " 'heart'], dtype='