A short description of the post.
Question: For my second blog post I will be observing the content of the sub-reddit I am scraping, extracting information, and performing initial natural language processing and text analysis on it.
Navigate and describe the characteristics of the data source of your interest, if you’ve specified one (or plan B). The characteristics include (1) its ‘content’ and/or (2) how it can be scraped. The characteristics include
and/or NLP tools in relation to your research project.
I Summon up your knowledge of some useful packages we’ve reviewed and/or NLP tools in relation to your research project. Sorting out adjectives? Extracting major verbs or named entities? …
Explanation: My code below illustrates how I initially got my information from reddit by scraping. In this case I used RedditExtractoR. The author of this packages describes it as a minimalist r wrapper it scrapes a limited number of posts from reddit. The api on reddit itself only allows 60 requests per minute so I will have to extract the next set of data later and pick a period before my first post occured.
#top_guns_urls <- find_thread_urls(subreddit="guns", sort_by="top")
load("/Users/noahmilstein/Desktop/Spring 2022/Textasdata/text_as_data_work/df_guns.RData")
str(top_guns_urls)
top_guns_urls_df=top_guns_urls[,c("title", "date_utc", "comments")]
#guns_contents <- get_thread_content(top_guns_urls_df$url[1:1000])
#str(guns_contents$threads)
Explanation: Below I have processed my inital dataframe from reddit into a corpus.
Explanation: Next I used the factorial_preprocessing() command to both use n-grams processing and use an infrequent term threshold.
preprocessed_documents <- factorial_preprocessing(
top_guns_corpus,
use_ngrams = TRUE,
infrequent_term_threshold = 0.2,
verbose = FALSE)
Preprocessing 998 documents 128 different ways...
names(preprocessed_documents)
[1] "choices" "dfm_list" "labels"
head(preprocessed_documents$choices)
removePunctuation removeNumbers lowercase stem
P-N-L-S-W-I-3 TRUE TRUE TRUE TRUE
N-L-S-W-I-3 FALSE TRUE TRUE TRUE
P-L-S-W-I-3 TRUE FALSE TRUE TRUE
L-S-W-I-3 FALSE FALSE TRUE TRUE
P-N-S-W-I-3 TRUE TRUE FALSE TRUE
N-S-W-I-3 FALSE TRUE FALSE TRUE
removeStopwords infrequent_terms use_ngrams
P-N-L-S-W-I-3 TRUE TRUE TRUE
N-L-S-W-I-3 TRUE TRUE TRUE
P-L-S-W-I-3 TRUE TRUE TRUE
L-S-W-I-3 TRUE TRUE TRUE
P-N-S-W-I-3 TRUE TRUE TRUE
N-S-W-I-3 TRUE TRUE TRUE
Explanation: Next I used preText() to pre-process the documents that I have so far to acquire pre-text scores that can give me a sense of what techniques may be necessary for natural language processing as the project develops.
#preText_results <- preText(
# preprocessed_documents,
# dataset_name = "Gun Pretext Results",
# distance_method = "cosine",
# num_comparisons = 50,
# verbose = TRUE)
Explanation Continued: Next we look at the pre-text scores with 50 comparisons, which was acquired from the code above. Below these are graphed with intercept.
load("/Users/noahmilstein/Desktop/Spring 2022/Textasdata/text_as_data_work/preText_results_gun_50_comps.RData")
preText_score_plot(preText_results)
The lowest score with intercepts according to the graph is N-3 which removes numbers, and uses n-grams. This plot represents the potential risk of using more complex pre-processing at the pre-text score goes up but may remove more information. The highest score belongs to L-S which is lowercased and stemmed which is quite risky to do.
Explanation Continued: Looking at the regression coefficients we see negative scores as usual results and positive coefficients as unusual ones. In this case removing puncuation, stopwords, and n-grams would not lead to a great deal of abnormalilty.
regression_coefficient_plot(preText_results,
remove_intercept = TRUE)
top_guns_tokens <- tokens(top_guns_corpus)
print(top_guns_tokens)
Tokens consisting of 998 documents.
text1 :
[1] "Smith" "and" "Wesson" "Saturday" "anyone" "?"
text2 :
[1] "My" "two" "favorite" "9mm" "s"
text3 :
[1] "My" "Arex" "Zero" "1" "Tactical" "w"
[7] "/" "a" "Trijicon" "RMR" "," "Viridian"
[ ... and 8 more ]
text4 :
[1] "Opinion" "of" "the" "fnx9" "?" "Already"
[7] "have" "an" "xd" "and" "a" "p320"
[ ... and 10 more ]
text5 :
[1] "Howa" "1500" "in" "." "308"
text6 :
[1] "A" "little" "before" "and" "after"
[6] "comparison" "."
[ reached max_ndoc ... 992 more documents ]
top_guns_tokens_no_punct <- tokens(top_guns_corpus,
remove_punct = T)
print(top_guns_tokens_no_punct)
Tokens consisting of 998 documents.
text1 :
[1] "Smith" "and" "Wesson" "Saturday" "anyone"
text2 :
[1] "My" "two" "favorite" "9mm" "s"
text3 :
[1] "My" "Arex" "Zero" "1" "Tactical" "w"
[7] "a" "Trijicon" "RMR" "Viridian" "XL5" "Gen"
[ ... and 5 more ]
text4 :
[1] "Opinion" "of" "the" "fnx9" "Already" "have"
[7] "an" "xd" "and" "a" "p320" "m18"
[ ... and 6 more ]
text5 :
[1] "Howa" "1500" "in" "308"
text6 :
[1] "A" "little" "before" "and" "after"
[6] "comparison"
[ reached max_ndoc ... 992 more documents ]
top_guns_tokens_no_punct_no_upper <- tokens_tolower(top_guns_tokens_no_punct)
print(top_guns_tokens_no_punct_no_upper)
Tokens consisting of 998 documents.
text1 :
[1] "smith" "and" "wesson" "saturday" "anyone"
text2 :
[1] "my" "two" "favorite" "9mm" "s"
text3 :
[1] "my" "arex" "zero" "1" "tactical" "w"
[7] "a" "trijicon" "rmr" "viridian" "xl5" "gen"
[ ... and 5 more ]
text4 :
[1] "opinion" "of" "the" "fnx9" "already" "have"
[7] "an" "xd" "and" "a" "p320" "m18"
[ ... and 6 more ]
text5 :
[1] "howa" "1500" "in" "308"
text6 :
[1] "a" "little" "before" "and" "after"
[6] "comparison"
[ reached max_ndoc ... 992 more documents ]
top_guns_tokens_no_punct_no_upper_no_stop <- tokens_select(top_guns_tokens_no_punct_no_upper, pattern = stopwords("en"), selection = "remove")
length(top_guns_tokens_no_punct_no_upper_no_stop)
[1] 998
print(top_guns_tokens_no_punct_no_upper_no_stop)
Tokens consisting of 998 documents.
text1 :
[1] "smith" "wesson" "saturday" "anyone"
text2 :
[1] "two" "favorite" "9mm" "s"
text3 :
[1] "arex" "zero" "1" "tactical" "w" "trijicon"
[7] "rmr" "viridian" "xl5" "gen" "3" "hogue"
[ ... and 2 more ]
text4 :
[1] "opinion" "fnx9" "already" "xd"
[5] "p320" "m18" "sa" "da"
[9] "interesting" "sure"
text5 :
[1] "howa" "1500" "308"
text6 :
[1] "little" "comparison"
[ reached max_ndoc ... 992 more documents ]
top_guns_corpus_tokens <- tokens(top_guns_corpus)
print(top_guns_corpus_tokens)
Tokens consisting of 998 documents.
text1 :
[1] "Smith" "and" "Wesson" "Saturday" "anyone" "?"
text2 :
[1] "My" "two" "favorite" "9mm" "s"
text3 :
[1] "My" "Arex" "Zero" "1" "Tactical" "w"
[7] "/" "a" "Trijicon" "RMR" "," "Viridian"
[ ... and 8 more ]
text4 :
[1] "Opinion" "of" "the" "fnx9" "?" "Already"
[7] "have" "an" "xd" "and" "a" "p320"
[ ... and 10 more ]
text5 :
[1] "Howa" "1500" "in" "." "308"
text6 :
[1] "A" "little" "before" "and" "after"
[6] "comparison" "."
[ reached max_ndoc ... 992 more documents ]
head(annotated.guns_corpus$token)
# A tibble: 6 × 11
doc_id sid tid token token_with_ws lemma upos xpos feats
<int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 1 Smith "Smith " Smith PROPN NNP Numb…
2 1 1 2 and "and " and CCONJ CC <NA>
3 1 1 3 Wesson "Wesson " Wesson PROPN NNP Numb…
4 1 1 4 Saturday "Saturday " Saturday PROPN NNP Numb…
5 1 1 5 anyone "anyone" anyone PRON NN Numb…
6 1 1 6 ? "?" ? PUNCT . <NA>
# … with 2 more variables: tid_source <chr>, relation <chr>
head(annotated.guns_corpus$document)
doc_id
1 1
2 2
3 3
4 4
5 5
6 6
doc_id_guns<-annotated.guns_corpus$document
doc_id_guns$date<-top_guns_urls_df$date_utc
annoData <- left_join(doc_id_guns, annotated.guns_corpus$token, by = "doc_id")
annoData$date<-as.Date(annoData$date)
annoData %>%
group_by(date) %>%
summarize(Sentences = max(sid)) %>%
ggplot(aes(date, Sentences)) +
geom_line() +
geom_smooth() +
theme_bw()
readability <- textstat_readability(top_guns_corpus,
measure = c("Flesch.Kincaid", "FOG", "Coleman.Liau.grade"))
# add in a chapter number
readability$post <- c(1:nrow(readability))
# plot results
ggplot(readability, aes(x = post)) +
geom_line(aes(y = Flesch.Kincaid), color = "black") +
geom_line(aes(y = FOG), color = "red") +
geom_line(aes(y = Coleman.Liau.grade), color = "blue") +
theme_bw()
annoData$date<-as.Date(annoData$date)
readability$added_dates<-as.Date(top_guns_urls_df$date_utc)
ggplot(readability, aes(x = added_dates)) +
geom_smooth(aes(y = Flesch.Kincaid), color = "black") +
geom_smooth(aes(y = FOG), color = "red") +
geom_smooth(aes(y = Coleman.Liau.grade), color = "blue") +
theme_minimal()
cor(readability$Flesch.Kincaid, readability$FOG, use = "complete.obs")
[1] 0.8007756
cor(readability$Flesch.Kincaid, readability$Coleman.Liau.grade, use = "complete.obs")
[1] 0.7308223
cor(readability$FOG, readability$Coleman.Liau.grade, use = "complete.obs")
[1] 0.5551364
#sentimetnsdf<-get_sentiments("nrc")
#write.csv(sentimetnsdf, file = "sentimetnsdf.csv")
#save(sentimetnsdf, file="sentimetnsdf_2")
top_guns_urls_df_2<-top_guns_urls_df
top_guns_urls_df_2$text<- seq(1, 998, by=1)
nrc_joy <- sentimetnsdf %>%
filter(sentiment == "joy")
tidy_posts_for_guns <- top_guns_urls_df_2 %>%
unnest_tokens(word, title)
tidy_posts_for_guns %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE) %>% head() %>% kable()
word | n |
---|---|
love | 24 |
finally | 22 |
happy | 18 |
fun | 14 |
favorite | 12 |
found | 10 |
tidy_posts_for_guns_sentiment <- tidy_posts_for_guns %>%
inner_join(sentimetnsdf) %>%
count(text, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
tidy_posts_for_guns_sentiment
# A tibble: 580 × 12
text trust joy positive anger anticipation disgust fear
<dbl> <int> <int> <int> <int> <int> <int> <int>
1 1 1 0 0 0 0 0 0
2 2 1 1 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 9 1 0 2 0 0 0 0
5 11 0 0 1 2 1 1 1
6 12 0 0 1 0 0 0 0
7 14 2 2 3 0 3 0 0
8 15 1 1 3 0 2 0 0
9 16 2 1 1 0 1 0 0
10 22 0 0 0 1 0 0 1
# … with 570 more rows, and 4 more variables: negative <int>,
# sadness <int>, surprise <int>, sentiment <int>
sentimetnsdf
# A tibble: 13,875 × 3
...1 word sentiment
<dbl> <chr> <chr>
1 1 abacus trust
2 2 abandon fear
3 3 abandon negative
4 4 abandon sadness
5 5 abandoned anger
6 6 abandoned fear
7 7 abandoned negative
8 8 abandoned sadness
9 9 abandonment anger
10 10 abandonment fear
# … with 13,865 more rows
nrc_guns_word_counts <- tidy_posts_for_guns %>%
inner_join(sentimetnsdf) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
nrc_guns_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Explanation: Using nrc appears to have had some unintended effects that may require an analysis of the specific words used to describe sentiment. One difficult part of the data being used is that firearms, and the words used to describe them, are percieved
Bing_sentiments<-get_sentiments("bing")
tidy_posts_for_guns <- top_guns_urls_df_2 %>%
unnest_tokens(word, title)
tidy_posts_for_guns %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE) %>% head() %>% kable()
word | n |
---|---|
love | 24 |
finally | 22 |
happy | 18 |
fun | 14 |
favorite | 12 |
found | 10 |
tidy_posts_for_guns_sentiment <- tidy_posts_for_guns %>%
inner_join(Bing_sentiments) %>%
count(text, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
bing_word_counts <- tidy_posts_for_guns %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Document-feature matrix of: 998 documents, 2,661 features (99.65% sparse) and 0 docvars.
features
docs smith and wesson saturday anyone ? my two favorite 9mm
text1 1 1 1 1 1 1 0 0 0 0
text2 0 0 0 0 0 0 1 1 1 1
text3 0 1 0 0 0 0 1 0 0 0
text4 0 1 0 0 0 1 0 0 0 0
text5 0 0 0 0 0 0 0 0 0 0
text6 0 1 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 992 more documents, reached max_nfeat ... 2,651 more features ]
A LDA_VEM topic model with 2 topics.
gun_dfm_lda_topics <- tidy(gun_dfm_lda, matrix = "beta")
gun_dfm_lda_topics
# A tibble: 5,322 × 3
topic term beta
<int> <chr> <dbl>
1 1 smith 0.000540
2 2 smith 0.000679
3 1 and 0.0254
4 2 and 0.00320
5 1 wesson 0.000669
6 2 wesson 0.000956
7 1 saturday 0.000655
8 2 saturday 0.000564
9 1 anyone 0.00138
10 2 anyone 0.00167
# … with 5,312 more rows
gun_top_terms <- gun_dfm_lda_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
gun_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
beta_wide <- gun_dfm_lda_topics %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_wide
# A tibble: 192 × 4
term topic1 topic2 log_ratio
<chr> <dbl> <dbl> <dbl>
1 and 0.0254 0.00320 -2.99
2 anyone 0.00138 0.00167 0.275
3 ? 0.0118 0.0109 -0.112
4 my 0.0380 0.0117 -1.70
5 two 0.00163 0.00203 0.311
6 favorite 0.00134 0.00110 -0.291
7 s 0.00659 0.00356 -0.888
8 1 0.000486 0.00114 1.23
9 tactical 0.00137 0.0000479 -4.84
10 w 0.00193 0.000101 -4.26
# … with 182 more rows
Response: As can be seen above topic modeling may benefit from some data reduction, removing punctuation and stop words would likely be beneficial as can be seen above where a number of the differences between topics are modeled as punctuation and stop words.
gun_tokens_stopwords_and_punct_removed <- tokens_remove(tokens(top_guns_corpus, remove_punct = TRUE), c(stopwords("english")))
gun_corpus_stopwords_and_punct_removed <- corpus(sapply(gun_tokens_stopwords_and_punct_removed, paste, collapse=" ")
)
gun_corpus_stopwords_and_punct_removed
Corpus consisting of 998 documents.
text1 :
"Smith Wesson Saturday anyone"
text2 :
"two favorite 9mm s"
text3 :
"Arex Zero 1 Tactical w Trijicon RMR Viridian XL5 Gen 3 Hogue..."
text4 :
"Opinion fnx9 Already xd p320 m18 SA DA interesting sure"
text5 :
"Howa 1500 308"
text6 :
"little comparison"
[ reached max_ndoc ... 992 more documents ]
library(methods)
too_gun_dfm_no_punct_stopwords<- quanteda::dfm(gun_corpus_stopwords_and_punct_removed, verbose = FALSE)
too_gun_dfm_no_punct_stopwords
Document-feature matrix of: 998 documents, 2,518 features (99.78% sparse) and 0 docvars.
features
docs smith wesson saturday anyone two favorite 9mm s arex zero
text1 1 1 1 1 0 0 0 0 0 0
text2 0 0 0 0 1 1 1 1 0 0
text3 0 0 0 0 0 0 0 0 1 1
text4 0 0 0 0 0 0 0 0 0 0
text5 0 0 0 0 0 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 992 more documents, reached max_nfeat ... 2,508 more features ]
library(topicmodels)
gun_dfm_lda_nopunct_stop <- LDA(too_gun_dfm_no_punct_stopwords, k = 2, control = list(seed = 777))
gun_dfm_lda_nopunct_stop
A LDA_VEM topic model with 2 topics.
gun_dfm_lda_topics_nopunct_stop <- tidy(gun_dfm_lda_nopunct_stop, matrix = "beta")
gun_dfm_lda_topics_nopunct_stop
# A tibble: 5,036 × 3
topic term beta
<int> <chr> <dbl>
1 1 smith 0.00135
2 2 smith 0.000782
3 1 wesson 0.00234
4 2 wesson 0.000503
5 1 saturday 0.00154
6 2 saturday 0.000596
7 1 anyone 0.00339
8 2 anyone 0.00194
9 1 two 0.00320
10 2 two 0.00320
# … with 5,026 more rows
gun_top_terms_no_punct_or_stop<- gun_dfm_lda_topics_nopunct_stop %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
gun_top_terms_no_punct_or_stop %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
Response: As can be seen from the results above, removing stopwords and punctuation removes a good deal of the unwanted language from the corpus and does a slightly more comprehensible job in displaying the information. However, any kind of stemming or reduction will be difficult with posts about firearms for a number of reasons. Firstly the language surrounding firearms involves numbers for model numbers, ammunition calibers and the capacity of magazines and other devices that hold bullets. This results in difficulty removing both punctuation and numbers from the data as they give a sense of what sort of each of the aforementioned items people are interesting in talking about. As a results removing the punctuation is difficult because it allows for more comprehensible data by reducing the usage of unneeded punctuation like exclamaintion points and questions marks that are common on a forum of this nature but not useful in analyzing the common topics and language.
For attribution, please cite this work as
Milstein (2022, March 3). Noah_Milstein_Blog: Text as Data Blog Post 2. Retrieved from https://nmilsteinuma.github.io/posts/2022-03-03-text-as-data-blog-post-2/
BibTeX citation
@misc{milstein2022text, author = {Milstein, Noah}, title = {Noah_Milstein_Blog: Text as Data Blog Post 2}, url = {https://nmilsteinuma.github.io/posts/2022-03-03-text-as-data-blog-post-2/}, year = {2022} }