# Text-classification-and-summarization

**Repository Path**: fyq9719/Text-classification-and-summarization

## Basic Information

- **Project Name**: Text-classification-and-summarization
- **Description**: Built logistic regression, SVM, Naive Bayes, random forest, and K-NN models for text classification on scraped news data. Built TextRank, LDA, and K-means clustering for text summarization.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-03-15
- **Last Updated**: 2023-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Text-classification-and-summarization

### Text classification:

Classifying the news articles into four categories, namely Health, Business, Entertainment, and Technology, using the following ML models:

1. Logistic regression
2. Support vector machine
3. Naive Bayes
4. Random forest
5. K-NN

### Text summarization:

Summarize the news articles using extractive text summarization (selecting the top sentences from the article).

#### Models for extractive summarization:

1. TextRank algorithm (a variation of PageRank)
2. K-means clustering
3. Latent semantic analysis

### Data:

Scraped news articles from URLs provided by the UCI Machine Learning Repository [link](http://archive.ics.uci.edu/ml/datasets/News+Aggregator).

For scraping the news articles, the ```Newspaper3k``` [library](https://newspaper.readthedocs.io/en/latest/) for Python was used. The library provides an ```nlp()``` method with which the *keywords* and *summary* of a news article can be extracted. Each article's content and summary were scraped to create the data for the project.
[Code](https://github.com/saiharshithreddy/Text-classification-and-summarization/blob/master/Data%20collection/data%20scrapper.ipynb)

### Installation

The following Python libraries have to be installed: ```pandas```, ```sklearn```, ```nltk```, ```newspaper3k```.

Run the following command to install them:

```
pip install -r requirements.txt
```

### Data preprocessing

Raw text has unwanted characters (\n, \t, $, etc.) and contains stop words (a, an, the) which have to be removed before generating the vector representation. The following text preprocessing techniques have been used:

1. Converting to lower case
2. Removal of stop words
3. Tokenization
4. Expanding contractions (doesn't -> does not)
5. Stemming/lemmatization

### Results

#### Text classification

| S.no | Model | Accuracy in % (BoW) | Accuracy in % (Tf-idf) |
|------|-------|---------------------|------------------------|
| 1. | Logistic regression | 95.2 | 94.7 |
| 2. | SVM | 94.8 | 95.2 |
| 3. | Naive Bayes | 94.69 | 94.54 |
| 4. | Random forest | 92.2 | 92.05 |
| 5. | K-NN | 94.3 | 94.59 |

#### Text summarization

| S.no | Model | Rouge-1 |
|------|-------|---------|
| 1. | Text rank | 59.2 |
| 2. | K-means clustering | 54.7 |
| 3. | Latent semantic analysis | 52.1 |
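As an illustration of the classification setup above, here is a minimal Tf-idf + logistic regression pipeline sketch using `sklearn`. The example texts and labels below are invented stand-ins for the scraped dataset, not actual project data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the scraped articles (hypothetical examples)
texts = [
    "new vaccine trial shows promising results",
    "stock markets rally after earnings report",
    "blockbuster film tops the weekend box office",
    "chipmaker unveils faster mobile processor",
]
labels = ["Health", "Business", "Entertainment", "Technology"]

clf = Pipeline([
    # Lowercasing and stop-word removal mirror the preprocessing steps above
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["global markets rally after strong earnings"]))
```

Swapping the `lr` step for `SVC`, `MultinomialNB`, `RandomForestClassifier`, or `KNeighborsClassifier` reproduces the other rows of the table; replacing `TfidfVectorizer` with `CountVectorizer` gives the BoW column.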