Noah_Milstein_Blog: Text as Data Blog Post 2

Noah Milstein

Blog Post 2:

Question: For my second blog post I will be observing the content of the sub-reddit I am scraping, extracting information, and performing initial natural language processing and text analysis on it.

Navigate and describe the characteristics of the data source of your interest, if you’ve specified one (or plan B). The characteristics include (1) its ‘content’ and/or (2) how it can be scraped. The characteristics include

its ‘content’ and/or (2) how it can be scraped. I Summon up your knowledge of some useful packages we’ve reviewed

and/or NLP tools in relation to your research project.

I Summon up your knowledge of some useful packages we’ve reviewed and/or NLP tools in relation to your research project. Sorting out adjectives? Extracting major verbs or named entities? …

Inital Loading and Processing

Explanation: My code below illustrates how I initially got my information from reddit by scraping. In this case I used RedditExtractoR. The author of this packages describes it as a minimalist r wrapper it scrapes a limited number of posts from reddit. The api on reddit itself only allows 60 requests per minute so I will have to extract the next set of data later and pick a period before my first post occured.

#top_guns_urls <- find_thread_urls(subreddit="guns", sort_by="top")

load("/Users/noahmilstein/Desktop/Spring 2022/Textasdata/text_as_data_work/df_guns.RData")

str(top_guns_urls)

top_guns_urls_df=top_guns_urls[,c("title", "date_utc", "comments")]

#guns_contents <- get_thread_content(top_guns_urls_df$url[1:1000])
#str(guns_contents$threads)

Conversion From Data Frame to Corpus

Explanation: Below I have processed my inital dataframe from reddit into a corpus.

top_guns_urls_df=top_guns_urls[,c("title", "date_utc", "comments")]

top_guns_corpus<-corpus(top_guns_urls_df$title)

top_guns_documents<-top_guns_corpus[1:10,]

top_guns_corpus_summary <- summary(top_guns_corpus)

Pre-processing

Explanation: Next I used the factorial_preprocessing() command to both use n-grams processing and use an infrequent term threshold.

preprocessed_documents <- factorial_preprocessing(
    top_guns_corpus,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.2,
    verbose = FALSE)

Preprocessing 998 documents 128 different ways...

names(preprocessed_documents)

[1] "choices"  "dfm_list" "labels"

head(preprocessed_documents$choices)

              removePunctuation removeNumbers lowercase stem
P-N-L-S-W-I-3              TRUE          TRUE      TRUE TRUE
N-L-S-W-I-3               FALSE          TRUE      TRUE TRUE
P-L-S-W-I-3                TRUE         FALSE      TRUE TRUE
L-S-W-I-3                 FALSE         FALSE      TRUE TRUE
P-N-S-W-I-3                TRUE          TRUE     FALSE TRUE
N-S-W-I-3                 FALSE          TRUE     FALSE TRUE
              removeStopwords infrequent_terms use_ngrams
P-N-L-S-W-I-3            TRUE             TRUE       TRUE
N-L-S-W-I-3              TRUE             TRUE       TRUE
P-L-S-W-I-3              TRUE             TRUE       TRUE
L-S-W-I-3                TRUE             TRUE       TRUE
P-N-S-W-I-3              TRUE             TRUE       TRUE
N-S-W-I-3                TRUE             TRUE       TRUE

preText Results

Explanation: Next I used preText() to pre-process the documents that I have so far to acquire pre-text scores that can give me a sense of what techniques may be necessary for natural language processing as the project develops.

#preText_results <- preText(
#    preprocessed_documents,
#    dataset_name = "Gun Pretext Results",
#    distance_method = "cosine",
#    num_comparisons = 50,
#    verbose = TRUE)

Explanation Continued: Next we look at the pre-text scores with 50 comparisons, which was acquired from the code above. Below these are graphed with intercept.

load("/Users/noahmilstein/Desktop/Spring 2022/Textasdata/text_as_data_work/preText_results_gun_50_comps.RData")

preText_score_plot(preText_results)

The lowest score with intercepts according to the graph is N-3 which removes numbers, and uses n-grams. This plot represents the potential risk of using more complex pre-processing at the pre-text score goes up but may remove more information. The highest score belongs to L-S which is lowercased and stemmed which is quite risky to do.

Explanation Continued: Looking at the regression coefficients we see negative scores as usual results and positive coefficients as unusual ones. In this case removing puncuation, stopwords, and n-grams would not lead to a great deal of abnormalilty.

regression_coefficient_plot(preText_results,
                            remove_intercept = TRUE)

Tokens and Corpus Work

top_guns_tokens <- tokens(top_guns_corpus)

print(top_guns_tokens)

Tokens consisting of 998 documents.
text1 :
[1] "Smith"    "and"      "Wesson"   "Saturday" "anyone"   "?"       

text2 :
[1] "My"       "two"      "favorite" "9mm"      "s"       

text3 :
 [1] "My"       "Arex"     "Zero"     "1"        "Tactical" "w"       
 [7] "/"        "a"        "Trijicon" "RMR"      ","        "Viridian"
[ ... and 8 more ]

text4 :
 [1] "Opinion" "of"      "the"     "fnx9"    "?"       "Already"
 [7] "have"    "an"      "xd"      "and"     "a"       "p320"   
[ ... and 10 more ]

text5 :
[1] "Howa" "1500" "in"   "."    "308" 

text6 :
[1] "A"          "little"     "before"     "and"        "after"     
[6] "comparison" "."         

[ reached max_ndoc ... 992 more documents ]

top_guns_tokens_no_punct <- tokens(top_guns_corpus, 
    remove_punct = T)

print(top_guns_tokens_no_punct)

Tokens consisting of 998 documents.
text1 :
[1] "Smith"    "and"      "Wesson"   "Saturday" "anyone"  

text2 :
[1] "My"       "two"      "favorite" "9mm"      "s"       

text3 :
 [1] "My"       "Arex"     "Zero"     "1"        "Tactical" "w"       
 [7] "a"        "Trijicon" "RMR"      "Viridian" "XL5"      "Gen"     
[ ... and 5 more ]

text4 :
 [1] "Opinion" "of"      "the"     "fnx9"    "Already" "have"   
 [7] "an"      "xd"      "and"     "a"       "p320"    "m18"    
[ ... and 6 more ]

text5 :
[1] "Howa" "1500" "in"   "308" 

text6 :
[1] "A"          "little"     "before"     "and"        "after"     
[6] "comparison"

[ reached max_ndoc ... 992 more documents ]

top_guns_tokens_no_punct_no_upper <- tokens_tolower(top_guns_tokens_no_punct)

print(top_guns_tokens_no_punct_no_upper)

Tokens consisting of 998 documents.
text1 :
[1] "smith"    "and"      "wesson"   "saturday" "anyone"  

text2 :
[1] "my"       "two"      "favorite" "9mm"      "s"       

text3 :
 [1] "my"       "arex"     "zero"     "1"        "tactical" "w"       
 [7] "a"        "trijicon" "rmr"      "viridian" "xl5"      "gen"     
[ ... and 5 more ]

text4 :
 [1] "opinion" "of"      "the"     "fnx9"    "already" "have"   
 [7] "an"      "xd"      "and"     "a"       "p320"    "m18"    
[ ... and 6 more ]

text5 :
[1] "howa" "1500" "in"   "308" 

text6 :
[1] "a"          "little"     "before"     "and"        "after"     
[6] "comparison"

[ reached max_ndoc ... 992 more documents ]

top_guns_tokens_no_punct_no_upper_no_stop <-    tokens_select(top_guns_tokens_no_punct_no_upper, pattern = stopwords("en"), selection = "remove")

length(top_guns_tokens_no_punct_no_upper_no_stop)

[1] 998

print(top_guns_tokens_no_punct_no_upper_no_stop)

Tokens consisting of 998 documents.
text1 :
[1] "smith"    "wesson"   "saturday" "anyone"  

text2 :
[1] "two"      "favorite" "9mm"      "s"       

text3 :
 [1] "arex"     "zero"     "1"        "tactical" "w"        "trijicon"
 [7] "rmr"      "viridian" "xl5"      "gen"      "3"        "hogue"   
[ ... and 2 more ]

text4 :
 [1] "opinion"     "fnx9"        "already"     "xd"         
 [5] "p320"        "m18"         "sa"          "da"         
 [9] "interesting" "sure"       

text5 :
[1] "howa" "1500" "308" 

text6 :
[1] "little"     "comparison"

[ reached max_ndoc ... 992 more documents ]

top_guns_corpus_tokens <- tokens(top_guns_corpus)

print(top_guns_corpus_tokens)

Tokens consisting of 998 documents.
text1 :
[1] "Smith"    "and"      "Wesson"   "Saturday" "anyone"   "?"       

text2 :
[1] "My"       "two"      "favorite" "9mm"      "s"       

text3 :
 [1] "My"       "Arex"     "Zero"     "1"        "Tactical" "w"       
 [7] "/"        "a"        "Trijicon" "RMR"      ","        "Viridian"
[ ... and 8 more ]

text4 :
 [1] "Opinion" "of"      "the"     "fnx9"    "?"       "Already"
 [7] "have"    "an"      "xd"      "and"     "a"       "p320"   
[ ... and 10 more ]

text5 :
[1] "Howa" "1500" "in"   "."    "308" 

text6 :
[1] "A"          "little"     "before"     "and"        "after"     
[6] "comparison" "."         

[ reached max_ndoc ... 992 more documents ]

head(annotated.guns_corpus$token)

# A tibble: 6 × 11
  doc_id   sid tid   token    token_with_ws lemma    upos  xpos  feats
   <int> <int> <chr> <chr>    <chr>         <chr>    <chr> <chr> <chr>
1      1     1 1     Smith    "Smith "      Smith    PROPN NNP   Numb…
2      1     1 2     and      "and "        and      CCONJ CC    <NA> 
3      1     1 3     Wesson   "Wesson "     Wesson   PROPN NNP   Numb…
4      1     1 4     Saturday "Saturday "   Saturday PROPN NNP   Numb…
5      1     1 5     anyone   "anyone"      anyone   PRON  NN    Numb…
6      1     1 6     ?        "?"           ?        PUNCT .     <NA> 
# … with 2 more variables: tid_source <chr>, relation <chr>

head(annotated.guns_corpus$document)

doc_id_guns<-annotated.guns_corpus$document

doc_id_guns$date<-top_guns_urls_df$date_utc

annoData <- left_join(doc_id_guns, annotated.guns_corpus$token, by = "doc_id")

annoData$date<-as.Date(annoData$date)

annoData %>% 
  group_by(date) %>% 
  summarize(Sentences = max(sid)) %>%
  ggplot(aes(date, Sentences)) +
    geom_line() +
    geom_smooth() +
    theme_bw()

# calculate readability
readability <- textstat_readability(top_guns_corpus, 
                                    measure = c("Flesch.Kincaid")) 

# add in a chapter number
readability$chapter <- c(1:nrow(readability))

# plot results
ggplot(readability, aes(x = chapter, y = Flesch.Kincaid)) +
  geom_line() + 
  geom_smooth() + 
  theme_bw()

readability <- textstat_readability(top_guns_corpus, 
                                    measure = c("Flesch.Kincaid", "FOG", "Coleman.Liau.grade")) 

# add in a chapter number

readability$post <- c(1:nrow(readability))

# plot results
ggplot(readability, aes(x = post)) +
  geom_line(aes(y = Flesch.Kincaid), color = "black") + 
  geom_line(aes(y = FOG), color = "red") + 
  geom_line(aes(y = Coleman.Liau.grade), color = "blue") + 
  theme_bw()

annoData$date<-as.Date(annoData$date)

readability$added_dates<-as.Date(top_guns_urls_df$date_utc)

ggplot(readability, aes(x = added_dates)) +
  geom_smooth(aes(y = Flesch.Kincaid), color = "black") + 
  geom_smooth(aes(y = FOG), color = "red") + 
  geom_smooth(aes(y = Coleman.Liau.grade), color = "blue") + 
  theme_minimal()

cor(readability$Flesch.Kincaid, readability$FOG, use = "complete.obs")

[1] 0.8007756

cor(readability$Flesch.Kincaid, readability$Coleman.Liau.grade, use = "complete.obs")

[1] 0.7308223

cor(readability$FOG, readability$Coleman.Liau.grade, use = "complete.obs")

[1] 0.5551364

#sentimetnsdf<-get_sentiments("nrc")

#write.csv(sentimetnsdf, file = "sentimetnsdf.csv")

#save(sentimetnsdf, file="sentimetnsdf_2")

library(readr)
sentimetnsdf <- read_csv("/Users/noahmilstein/Desktop/Spring 2022/Textasdata/text_as_data_work/sentimetnsdf.csv")

Sentiment Results Using NRC

top_guns_urls_df_2<-top_guns_urls_df
top_guns_urls_df_2$text<- seq(1, 998, by=1)


nrc_joy <- sentimetnsdf %>% 
  filter(sentiment == "joy")

tidy_posts_for_guns <- top_guns_urls_df_2 %>%
  unnest_tokens(word, title) 

tidy_posts_for_guns %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>% head() %>% kable()

word	n
love	24
finally	22
happy	18
fun	14
favorite	12
found	10

tidy_posts_for_guns_sentiment <- tidy_posts_for_guns %>%
  inner_join(sentimetnsdf) %>%
  count(text, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

tidy_posts_for_guns_sentiment

# A tibble: 580 × 12
    text trust   joy positive anger anticipation disgust  fear
   <dbl> <int> <int>    <int> <int>        <int>   <int> <int>
 1     1     1     0        0     0            0       0     0
 2     2     1     1        1     0            0       0     0
 3     4     0     0        1     0            0       0     0
 4     9     1     0        2     0            0       0     0
 5    11     0     0        1     2            1       1     1
 6    12     0     0        1     0            0       0     0
 7    14     2     2        3     0            3       0     0
 8    15     1     1        3     0            2       0     0
 9    16     2     1        1     0            1       0     0
10    22     0     0        0     1            0       0     1
# … with 570 more rows, and 4 more variables: negative <int>,
#   sadness <int>, surprise <int>, sentiment <int>

sentimetnsdf

# A tibble: 13,875 × 3
    ...1 word        sentiment
   <dbl> <chr>       <chr>    
 1     1 abacus      trust    
 2     2 abandon     fear     
 3     3 abandon     negative 
 4     4 abandon     sadness  
 5     5 abandoned   anger    
 6     6 abandoned   fear     
 7     7 abandoned   negative 
 8     8 abandoned   sadness  
 9     9 abandonment anger    
10    10 abandonment fear     
# … with 13,865 more rows

nrc_guns_word_counts <- tidy_posts_for_guns %>%
  inner_join(sentimetnsdf) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

nrc_guns_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Explanation: Using nrc appears to have had some unintended effects that may require an analysis of the specific words used to describe sentiment. One difficult part of the data being used is that firearms, and the words used to describe them, are percieved

Sentiment Results Using BING

Bing_sentiments<-get_sentiments("bing")

tidy_posts_for_guns <- top_guns_urls_df_2 %>%
  unnest_tokens(word, title) 

tidy_posts_for_guns %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>% head() %>% kable()

word	n
love	24
finally	22
happy	18
fun	14
favorite	12
found	10

tidy_posts_for_guns_sentiment <- tidy_posts_for_guns %>%
  inner_join(Bing_sentiments) %>%
  count(text, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bing_word_counts <- tidy_posts_for_guns %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

library(methods)

too_gun_dfm<- quanteda::dfm(top_guns_corpus, verbose = FALSE)

too_gun_dfm

Document-feature matrix of: 998 documents, 2,661 features (99.65% sparse) and 0 docvars.
       features
docs    smith and wesson saturday anyone ? my two favorite 9mm
  text1     1   1      1        1      1 1  0   0        0   0
  text2     0   0      0        0      0 0  1   1        1   1
  text3     0   1      0        0      0 0  1   0        0   0
  text4     0   1      0        0      0 1  0   0        0   0
  text5     0   0      0        0      0 0  0   0        0   0
  text6     0   1      0        0      0 0  0   0        0   0
[ reached max_ndoc ... 992 more documents, reached max_nfeat ... 2,651 more features ]

library(topicmodels)

gun_dfm_lda <- LDA(too_gun_dfm, k = 2, control = list(seed = 777))
gun_dfm_lda

A LDA_VEM topic model with 2 topics.

gun_dfm_lda_topics <- tidy(gun_dfm_lda, matrix = "beta")
gun_dfm_lda_topics

# A tibble: 5,322 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 smith    0.000540
 2     2 smith    0.000679
 3     1 and      0.0254  
 4     2 and      0.00320 
 5     1 wesson   0.000669
 6     2 wesson   0.000956
 7     1 saturday 0.000655
 8     2 saturday 0.000564
 9     1 anyone   0.00138 
10     2 anyone   0.00167 
# … with 5,312 more rows

gun_top_terms <- gun_dfm_lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

gun_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

beta_wide <- gun_dfm_lda_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>% 
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_wide

# A tibble: 192 × 4
   term       topic1    topic2 log_ratio
   <chr>       <dbl>     <dbl>     <dbl>
 1 and      0.0254   0.00320      -2.99 
 2 anyone   0.00138  0.00167       0.275
 3 ?        0.0118   0.0109       -0.112
 4 my       0.0380   0.0117       -1.70 
 5 two      0.00163  0.00203       0.311
 6 favorite 0.00134  0.00110      -0.291
 7 s        0.00659  0.00356      -0.888
 8 1        0.000486 0.00114       1.23 
 9 tactical 0.00137  0.0000479    -4.84 
10 w        0.00193  0.000101     -4.26 
# … with 182 more rows

beta_wide %>%
  group_by(direction = log_ratio > 0) %>%
  slice_max(abs(log_ratio), n = 10) %>% 
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(log_ratio, term)) +
  geom_col() +
  labs(x = "Log2 ratio of beta in topic 2 / topic 1", y = NULL)

Topic Modeling analysis

Response: As can be seen above topic modeling may benefit from some data reduction, removing punctuation and stop words would likely be beneficial as can be seen above where a number of the differences between topics are modeled as punctuation and stop words.

gun_tokens_stopwords_and_punct_removed <- tokens_remove(tokens(top_guns_corpus, remove_punct = TRUE), c(stopwords("english")))

gun_corpus_stopwords_and_punct_removed <- corpus(sapply(gun_tokens_stopwords_and_punct_removed, paste, collapse=" ")
)

gun_corpus_stopwords_and_punct_removed

Corpus consisting of 998 documents.
text1 :
"Smith Wesson Saturday anyone"

text2 :
"two favorite 9mm s"

text3 :
"Arex Zero 1 Tactical w Trijicon RMR Viridian XL5 Gen 3 Hogue..."

text4 :
"Opinion fnx9 Already xd p320 m18 SA DA interesting sure"

text5 :
"Howa 1500 308"

text6 :
"little comparison"

[ reached max_ndoc ... 992 more documents ]

library(methods)

too_gun_dfm_no_punct_stopwords<- quanteda::dfm(gun_corpus_stopwords_and_punct_removed, verbose = FALSE)

too_gun_dfm_no_punct_stopwords

Document-feature matrix of: 998 documents, 2,518 features (99.78% sparse) and 0 docvars.
       features
docs    smith wesson saturday anyone two favorite 9mm s arex zero
  text1     1      1        1      1   0        0   0 0    0    0
  text2     0      0        0      0   1        1   1 1    0    0
  text3     0      0        0      0   0        0   0 0    1    1
  text4     0      0        0      0   0        0   0 0    0    0
  text5     0      0        0      0   0        0   0 0    0    0
  text6     0      0        0      0   0        0   0 0    0    0
[ reached max_ndoc ... 992 more documents, reached max_nfeat ... 2,508 more features ]

library(topicmodels)

gun_dfm_lda_nopunct_stop <- LDA(too_gun_dfm_no_punct_stopwords, k = 2, control = list(seed = 777))
gun_dfm_lda_nopunct_stop

A LDA_VEM topic model with 2 topics.

gun_dfm_lda_topics_nopunct_stop <- tidy(gun_dfm_lda_nopunct_stop, matrix = "beta")

gun_dfm_lda_topics_nopunct_stop

# A tibble: 5,036 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 smith    0.00135 
 2     2 smith    0.000782
 3     1 wesson   0.00234 
 4     2 wesson   0.000503
 5     1 saturday 0.00154 
 6     2 saturday 0.000596
 7     1 anyone   0.00339 
 8     2 anyone   0.00194 
 9     1 two      0.00320 
10     2 two      0.00320 
# … with 5,026 more rows

gun_top_terms_no_punct_or_stop<- gun_dfm_lda_topics_nopunct_stop %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

gun_top_terms_no_punct_or_stop %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

beta_wide_no_punct_stop <- gun_dfm_lda_topics_nopunct_stop %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>% 
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_wide_no_punct_stop %>%
  group_by(direction = log_ratio > 0) %>%
  slice_max(abs(log_ratio), n = 10) %>% 
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(log_ratio, term)) +
  geom_col() +
  labs(x = "Log2 ratio of beta in topic 2 / topic 1", y = NULL)

Topic Modeling analysis with stopwords and punctuation removed

Response: As can be seen from the results above, removing stopwords and punctuation removes a good deal of the unwanted language from the corpus and does a slightly more comprehensible job in displaying the information. However, any kind of stemming or reduction will be difficult with posts about firearms for a number of reasons. Firstly the language surrounding firearms involves numbers for model numbers, ammunition calibers and the capacity of magazines and other devices that hold bullets. This results in difficulty removing both punctuation and numbers from the data as they give a sense of what sort of each of the aforementioned items people are interesting in talking about. As a results removing the punctuation is difficult because it allows for more comprehensible data by reducing the usage of unneeded punctuation like exclamaintion points and questions marks that are common on a forum of this nature but not useful in analyzing the common topics and language.

Text as Data Blog Post 2