R 语言文本分析｜词语间的相关性：n-grams 模型与相关性——中文案例

Original RStata RStata 2024-04-07

欢迎各位培训班会员参加明晚 8 点的直播课：「词语间的相关性：n-grams 模型与相关性——中文案例」

该课程是系列课程「R 语言文本分析的最新课时」，之前的课时有：

R 语言和 RStudio 的安装、R Profile 的配置及初识 R 语言文本分析
R 语言预备知识
字符串处理、正则表达式与整洁文本数据
词频统计、中文文本分词与词云图的绘制
词频、逆文档频率指数与 TF-IDF 分析
R语言情感分析与情感词云图绘制
词语间的相关性：n-grams 模型与相关性

上次课程中我们讲解了 n-grams 模型与相关性分析，不过是基于英语文本讲解的。今天我们再一起来看下如何对中文文本进行 n-grams 模型分析。

在本课程中我们将以 2001～2022 年的平安银行（深发展）年报为例进行讲解。附件中的 pdf 文件夹里面存放了这些年报文件。

pdf 文本提取

使用下面的代码即可提取多年的年报文本：

library(tidyverse) 
# pdf 文本提取
library(pdftools)

# 多个 pdf 文档处理
fs::dir_ls("pdf") %>% 
  as.character() %>% 
  as_tibble() %>% 
  mutate(text = map_chr(value, function(x){
    pdftools::pdf_text(x) %>% 
      paste0(collapse = "") %>% 
      str_remove_all("[\\s\\n\\t\\d[a-z].]")
  })) -> textdf 

textdf

#> # A tibble: 22 × 2
#>    value        text 
#>    <chr>        <chr> 
#>  1 pdf/2001.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示及公司简介…
#>  2 pdf/2002.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  3 pdf/2003.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  4 pdf/2004.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  5 pdf/2005.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  6 pdf/2006.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  7 pdf/2007.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  8 pdf/2008.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#>  9 pdf/2009.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#> 10 pdf/2010.PDF "深圳发展银行股份有限公司年年度报告目录第一节重要提示第二节公司…
#> # ℹ 12 more rows

token = "ngrams" 可以直接处理中文文本，不过没有办法设置停用词、用户词典之类的：

library(tidytext)
textdf %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) 
#> # A tibble: 1,494,748 × 2
#>    value        bigram       
#>    <chr>        <chr>        
#>  1 pdf/2001.PDF 深圳 发展    
#>  2 pdf/2001.PDF 发展 银行    
#>  3 pdf/2001.PDF 银行 股份    
#>  4 pdf/2001.PDF 股份 有限公司
#>  5 pdf/2001.PDF 有限公司 年  
#>  6 pdf/2001.PDF 年 年度      
#>  7 pdf/2001.PDF 年度 报告    
#>  8 pdf/2001.PDF 报告 目录    
#>  9 pdf/2001.PDF 目录 第一节  
#> 10 pdf/2001.PDF 第一节 重要  
#> # ℹ 1,494,738 more rows

另外上面的用法等价于：

textdf %>% 
  unnest_ngrams(bigram, text, n = 2) 
#> # A tibble: 1,494,748 × 2
#>    value        bigram       
#>    <chr>        <chr>        
#>  1 pdf/2001.PDF 深圳 发展    
#>  2 pdf/2001.PDF 发展 银行    
#>  3 pdf/2001.PDF 银行 股份    
#>  4 pdf/2001.PDF 股份 有限公司
#>  5 pdf/2001.PDF 有限公司 年  
#>  6 pdf/2001.PDF 年 年度      
#>  7 pdf/2001.PDF 年度 报告    
#>  8 pdf/2001.PDF 报告 目录    
#>  9 pdf/2001.PDF 目录 第一节  
#> 10 pdf/2001.PDF 第一节 重要  
#> # ℹ 1,494,738 more rows

最开始我以为 ngrams 拆分是自动根据空格实现的，所以我想是不是可以先把中文文本处理成这个样子：

library(jiebaR)
# 分词引擎：需要停用词字典和用户字典
engine_s <- worker(stop_word = "stopwords.txt", user = "dictionary.txt")

# 我们可以考虑先对中文文本进行分词，然后再进行 ngrams 分析
segment("深圳发展银行股份有限公司年年度报告", jiebar = engine_s) %>% 
  paste0(collapse = " ")

#> [1] "深圳发展银行股份有限公司 年 年度报告"

# token 辅助函数
segmentlist <- function(x, ...) {
  lapply(x, function(x, ...){
    paste0(segment(x, jiebar = engine_s, ...), collapse = " ")
  })
}

textdf %>% 
  unnest_tokens(output = wordseg, input = text, 
                token = segmentlist) -> textdf2 

textdf2
#> # A tibble: 22 × 2
#>    value        wordseg 
#>    <chr>        <chr> 
#>  1 pdf/2001.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 公司简介 …
#>  2 pdf/2002.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  3 pdf/2003.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  4 pdf/2004.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  5 pdf/2005.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  6 pdf/2006.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  7 pdf/2007.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  8 pdf/2008.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#>  9 pdf/2009.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#> 10 pdf/2010.PDF 深圳发展银行股份有限公司 年 年度报告 目录 第一节 提示 第二节 公…
#> # ℹ 12 more rows

然后再使用 ngrams 分词，会发现结果依旧不是我们想要的：

textdf2 %>% 
  unnest_tokens(bigram, wordseg, token = "ngrams", n = 2) 
#> # A tibble: 1,186,932 × 2
#>    value        bigram       
#>    <chr>        <chr>        
#>  1 pdf/2001.PDF 深圳 发展    
#>  2 pdf/2001.PDF 发展 银行    
#>  3 pdf/2001.PDF 银行 股份    
#>  4 pdf/2001.PDF 股份 有限公司
#>  5 pdf/2001.PDF 有限公司 年  
#>  6 pdf/2001.PDF 年 年度      
#>  7 pdf/2001.PDF 年度 报告    
#>  8 pdf/2001.PDF 报告 目录    
#>  9 pdf/2001.PDF 目录 第一节  
#> 10 pdf/2001.PDF 第一节 提示  
#> # ℹ 1,186,922 more rows

也就是 ngrams 并不是单纯使用空格作为分割符进行处理。深入分析代码可以发现 ngrams 调用的是 tokenizers::tokenize_ngrams() 函数，这个函数的源代码我已经放在附件中了（ngram-tokenizers.R文件）。

ngrams 的结果主要是下面的代码生成的：

# 这段代码不需要运行
words <- tokenize_words(x, lowercase = lowercase)
out <-
  generate_ngrams_batch(
    words,
    ngram_min = n_min,
    ngram_max = n,
    stopwords = stopwords,
    ngram_delim = ngram_delim
  )
if (!is.null(named))
  names(out) <- named
simplify_list(out, simplify)

因此 tokenize_ngrams() 调用的实际上是 tokenize_words() 的结果，因此前面的中文分词什么的并不会有什么用。

# 测试下：
tokenizers::tokenize_words("深圳发展银行股份有限公司年年度报告") %>% 
  tokenizers:::generate_ngrams_batch(ngram_max = 2, ngram_min = 2) %>% 
  unlist()
#> [1] "深圳 发展"     "发展 银行"     "银行 股份"     "股份 有限公司"
#> [5] "有限公司 年"   "年 年度"       "年度 报告"

中文文本的 n-grams

因此对于中文的 ngrams 应该这样：

cn_ngrams <- function(x, n = 2, n_min = n, ...) {
  lapply(x, function(x){
    lapply(x, function(x, ...){
      segment(x, jiebar = engine_s, ...)
    }) %>% 
      tokenizers:::generate_ngrams_batch(ngram_max = n, ngram_min = n_min) %>% 
      unlist()
  })
}

textdf %>% 
  unnest_tokens(bigram, text, token = cn_ngrams, n = 2) -> textdf2 

textdf2 
#> # A tibble: 911,145 × 2
#>    value        bigram                     
#>    <chr>        <chr>                      
#>  1 pdf/2001.PDF 深圳发展银行股份有限公司 年
#>  2 pdf/2001.PDF 年 年度报告                
#>  3 pdf/2001.PDF 年度报告 目录              
#>  4 pdf/2001.PDF 目录 第一节                
#>  5 pdf/2001.PDF 第一节 提示                
#>  6 pdf/2001.PDF 提示 公司简介              
#>  7 pdf/2001.PDF 公司简介 第二节            
#>  8 pdf/2001.PDF 第二节 会计                
#>  9 pdf/2001.PDF 会计 数据                  
#> 10 pdf/2001.PDF 数据 业务                  
#> # ℹ 911,135 more rows

拆分单词：

textdf2 %>% 
  separate(bigram, c("word1", "word2"), sep = " ") -> bigrams_separated

计数：

bigrams_separated %>% 
  count(value, word1, word2, sort = T) -> bigram_counts 
bigram_counts 

#> # A tibble: 449,238 × 4
#>    value        word1  word2      n
#>    <chr>        <chr>  <chr>  <int>
#>  1 pdf/2012.PDF 公允   价值     408
#>  2 pdf/2011.PDF 公允   价值     380
#>  3 pdf/2012.PDF 人民币 千元     369
#>  4 pdf/2009.PDF 公允   价值     356
#>  5 pdf/2008.PDF 公允   价值     355
#>  6 pdf/2011.PDF 人民币 千元     338
#>  7 pdf/2010.PDF 公允   价值     331
#>  8 pdf/2020.PDF 年     月       313
#>  9 pdf/2022.PDF 人民币 百万元   311
#> 10 pdf/2020.PDF 人民币 百万元   309
#> # ℹ 449,228 more rows

可以看到特有名词仍旧是最常出现的组合，例如“公允价值”、“特别注明”、“价值计量”等，所以可以考虑把这些词再加入到用户词典中再重复上述的分析。

使用 n = 3 就可以进行三个词语的共现分析了：

textdf %>%
  unnest_tokens(trigram, text, token = cn_ngrams, n = 3) %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
  count(value, word1, word2, word3, sort = TRUE) -> textdf3 

textdf3 

#> # A tibble: 613,507 × 5
#>    value        word1                word2  word3                        n
#>    <chr>        <chr>                <chr>  <chr>                    <int>
#>  1 pdf/2022.PDF 单位                 人民币 百万元                     217
#>  2 pdf/2021.PDF 单位                 人民币 百万元                     210
#>  3 pdf/2011.PDF 单位                 人民币 千元                       209
#>  4 pdf/2012.PDF 平安银行股份有限公司 原名   深圳发展银行股份有限公司   206
#>  5 pdf/2012.PDF 单位                 人民币 千元                       202
#>  6 pdf/2020.PDF 单位                 人民币 百万元                     202
#>  7 pdf/2011.PDF 外                   金额   单位                       197
#>  8 pdf/2011.PDF 注明                 外     金额                       197
#>  9 pdf/2011.PDF 特别                 注明   外                         197
#> 10 pdf/2011.PDF 金额                 单位   人民币                     197
#> # ℹ 613,497 more rows

双词组的 TF_IDF 分析

# 例如分析和 互联网 同时出现的词汇：
bigrams_separated %>% 
  filter(word1 == "互联网") %>% 
  count(value, word2, sort = T)

#> # A tibble: 147 × 3
#>    value        word2     n
#>    <chr>        <chr> <int>
#>  1 pdf/2014.PDF 金融     15
#>  2 pdf/2013.PDF 金融     11
#>  3 pdf/2015.PDF 金融     10
#>  4 pdf/2016.PDF 金融      7
#>  5 pdf/2014.PDF 化        4
#>  6 pdf/2018.PDF 支付      4
#>  7 pdf/2020.PDF 平台      4
#>  8 pdf/2015.PDF 平台      3
#>  9 pdf/2017.PDF 技术      3
#> 10 pdf/2019.PDF 支付      3
#> # ℹ 137 more rows

# 双词组也可以作为一个术语，下面计算 tf-idf：
textdf2 %>% 
  count(value, bigram, sort = T) %>% 
  bind_tf_idf(bigram, value, n) %>% 
  arrange(desc(tf_idf)) -> bigram_tf_idf 

bigram_tf_idf %>% 
  mutate(value = str_extract(value, "\\d{4}"),
         value = as.numeric(value)) %>% 
  rename(year = value) %>% 
  filter(str_detect(bigram, "互联网"))
#> # A tibble: 341 × 6
#>     year bigram              n        tf   idf   tf_idf
#>    <dbl> <chr>           <int>     <dbl> <dbl>    <dbl>
#>  1  2014 互联网 金融        15 0.000355   1.48 0.000526
#>  2  2013 互联网 金融        11 0.000279   1.48 0.000414
#>  3  2015 互联网 金融        10 0.000248   1.48 0.000367
#>  4  2016 互联网 金融         7 0.000182   1.48 0.000269
#>  5  2014 互联网 化           4 0.0000946  1.99 0.000189
#>  6  2014 中国 互联网         3 0.0000710  2.40 0.000170
#>  7  2016 互联网 证券         2 0.0000519  3.09 0.000160
#>  8  2016 互联网 账户         2 0.0000519  3.09 0.000160
#>  9  2016 协会 互联网         2 0.0000519  3.09 0.000160
#> 10  2016 行业协会 互联网     2 0.0000519  3.09 0.000160
#> # ℹ 331 more rows

使用 ggraph 绘制 bigram 网络图

这部分的操作也和上次课类似：

bigram_counts %>% 
  mutate(value = str_extract(value, "\\d{4}"),
         value = as.numeric(value)) %>% 
  rename(year = value) %>% 
  filter(year == 2022) %>% 
  filter(str_length(word1) >= 2 & str_length(word2) >= 2) %>% 
  filter(!(str_detect(word1, "元|年|月|人民币|附注|单位") | 
             str_detect(word2, "元|年|月|人民币|附注|单位"))) %>% 
  select(-year) %>% 
  filter(n >= 10) -> bigram_counts_sub 

bigram_counts_sub
#> # A tibble: 438 × 3
#>    word1                word2        n
#>    <chr>                <chr>    <int>
#>  1 公允                 价值       283
#>  2 特别                 注明       171
#>  3 综合                 收益       160
#>  4 平安银行股份有限公司 财务报表   155
#>  5 价值                 计量       133
#>  6 发放贷款             垫款       126
#>  7 集团                 本行       121
#>  8 债权                 投资       113
#>  9 金融                 负债       105
#> 10 信用                 损失       101
#> # ℹ 428 more rows


library(tidygraph)
as_tbl_graph(bigram_counts_sub) %>% 
  mutate(size = centrality_degree(mode = 'out')) -> bigram_graph

bigram_graph

#> # A tbl_graph: 352 nodes and 438 edges
#> #
#> # A directed multigraph with 32 components
#> #
#> # A tibble: 352 × 2
#>   name                  size
#>   <chr>                <dbl>
#> 1 公允                     1
#> 2 特别                     1
#> 3 综合                     2
#> 4 平安银行股份有限公司     5
#> 5 价值                     3
#> 6 发放贷款                 1
#> # ℹ 346 more rows
#> #
#> # A tibble: 438 × 3
#>    from    to     n
#>   <int> <int> <int>
#> 1     1     5   283
#> 2     2   250   171
#> 3     3    81   160
#> # ℹ 435 more rows

library(ggraph) 

bigram_graph %>% arrange(desc(size)) %>% slice(1:10) %>% pull(name)
#>  [1] "资产"                 "持续"                 "本行"                
#>  [4] "风险"                 "负债"                 "变动"                
#>  [7] "金融资产"             "投资"                 "现金"                
#> [10] "平安银行股份有限公司"

# 设置字体
library(showtext)
showtext_auto(enable = TRUE)
font_add("myfont", regular = "Source_Han_Serif_CN_VF_Regular.ttf")

ggraph(bigram_graph, layout = "kk") + 
  geom_edge_link(aes(edge_colour = bigram_counts_sub$word1,
                     edge_width = n), show.legend = F,
                 arrow = grid::arrow(type = "closed", length = unit(0.06, "inches"))) + 
  geom_node_point(aes(color = name,
                      size = size), show.legend = F) + 
  geom_node_text(aes(label = name, size = size), vjust = 1, 
                 hjust = 1, check_overlap = T, 
                 family = "myfont", show.legend = F, color = "white") + 
  scale_edge_width(range = c(0.2, 4)) + 
  scale_size_continuous(range = c(0.5, 8)) + 
  scale_color_manual(values = c("资产" = "#f6cf71", "持续" = "#019868", 
                                "本行" = "#ec0b88", "风险" = "#651eac", 
                                "负债" = "#e18a1e", "变动" = "#9dd292", 
                                "金融资产" = "#2b7de5", "投资" = "#c6c6c6", 
                                rep("gray", 344))) + 
  scale_edge_color_manual(values = c("资产" = "#f6cf71", "持续" = "#019868", 
                                     "本行" = "#ec0b88", "风险" = "#651eac", 
                                     "负债" = "#e18a1e", "变动" = "#9dd292", 
                                     "金融资产" = "#2b7de5", "投资" = "#c6c6c6", 
                                     rep("gray", 430))) + 
  theme_graph(background = 'grey20',
              base_family = "myfont") -> p

直播信息

欢迎各位培训班会员参加明晚 8 点的直播课：「词语间的相关性：n-grams 模型与相关性——中文案例」

直播地址：腾讯会议(需要报名 RStata 培训班参加)
讲义材料：需要购买 RStata 名师讲堂会员，详情可阅读：一起来学习 R 语言和 Stata 啦！学习过程中遇到的问题也可以随时提问！

更多关于 RStata 会员的更多信息可添加微信号 r_stata 咨询：

附件下载（点击文末的阅读原文即可跳转）：
https://rstata.duanshu.com/#/brief/course/bf37cf50eef04d38b43541cc52114c96

继续滑动看下一个

RStata

向上滑动看下一个

一把短刀，怎么就让他连捅18人？！

黄晓菁，这位杭州泰隆银行女员工自爆视频火了，带给我们那些思考？

这一刻，快乐被具象化了

海南省拟任干部人选公

女高管与男下属上班约会开房，男方妻子闹到单位！被开除后她辩称：一直保持0.46-1.22米“个人距离”

R 语言文本分析｜词语间的相关性：n-grams 模型与相关性——中文案例

pdf 文本提取

中文文本的 n-grams

双词组的 TF_IDF 分析

使用 ggraph 绘制 bigram 网络图

直播信息

您可能也对以下帖子感兴趣

一把短刀，怎么就让他连捅18人？！

黄晓菁，这位杭州泰隆银行女员工自爆视频火了，带给我们那些思考？

这一刻，快乐被具象化了

海南省拟任干部人选公

女高管与男下属上班约会开房，男方妻子闹到单位！被开除后她辩称：一直保持0.46-1.22米“个人距离”

生成图片，分享到微信朋友圈

R 语言文本分析｜词语间的相关性：n-grams 模型与相关性——中文案例

pdf 文本提取

中文文本的 n-grams

双词组的 TF_IDF 分析

使用 ggraph 绘制 bigram 网络图

直播信息

您可能也对以下帖子感兴趣