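Word Frequency Changes in Forty Years of Government Work Reports: Drawing Charts with Stata

Stata can draw polished charts too. In my view, with any plotting tool, once you have mastered the three basic elements (points, lines, and areas), you can draw any chart you want. Today we use a case study to show how to use Stata for text processing and for drawing complex charts.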

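What Do the Packages on CRAN Do?

At the end of an earlier post, "Installing R and RStudio", I wrote some code that scraped the name, publication date, and title of every R package on CRAN, but I only used the first two variables for plotting and never mentioned the titles. So what can the titles be used for? A title describes the main function of its package, so with a simple word-frequency count we can draw a word cloud and see which keywords dominate the R packages on CRAN. First, scrape that table from the Tsinghua mirror again: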

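Visualizing Cost of Living Index Data by Country from Expatistan

This week's mini project is to "scrape the cost of living data by country from the Expatistan website and draw a world map to display it".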

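  1. Data source: Expatistan

  2. Base map data for the world map: the tmap package
    ships a World dataset, which can be loaded like this:

R
data("World", package = "tmap")

  3. R package for scraping the data: rvest works.

Tips: functions you may need: read_html, html_nodes, html_table.

  4. R packages for drawing the map: ggplot2 + sf (there is a tutorial this week); tmap also works.

  5. Extension: draw some other charts to show how the countries rank by cost of living.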


Reference Solution

Scraping the Data

Tabular data like this is very easy to scrape with the rvest package:

R
library(tidyverse)
library(hrbrthemes)
library(rvest)
# Store the URL in a variable named url:
url <- "https://www.expatistan.com/cost-of-living/country/ranking"
# Read and parse the page with read_html() and store the result in a variable named html:
html <- read_html(url)

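The parsed html is an xml_document, which is a structured data type. We can use the
html_nodes() function to locate a node in it. There are usually two ways to do this, CSS and
XPath, and either works. First, with XPath:

R
html %>%
  html_nodes(xpath = '//*[@id="content"]/div/div[1]/div[1]/table')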

## {xml_nodeset (1)}
## [1] <table class="country-ranking centered">\n<thead><tr>\n<th>Ranking</th>\n ...

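The table tag corresponds to the table we want to scrape. So where does this xpath
come from? It can be copied from the browser's developer tools (in Chrome: right-click the element in the Elements panel, then Copy > Copy XPath).

Alternatively, we can select with CSS:

R
html %>%
  html_nodes(css = '#content > div > div.block.first.comparison > div.prices > table')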

## {xml_nodeset (1)}
## [1] <table class="country-ranking centered">\n<thead><tr>\n<th>Ranking</th>\n ...

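The CSS selector can be copied from the browser's developer tools (in Chrome: right-click the element in the Elements panel, then Copy > Copy selector).

Both approaches give the same result, so which one to use is a matter of preference.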
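
To compare the two selector styles without hitting the network, here is a minimal sketch on a small inline page. The HTML snippet and its two rows are made up for illustration (only the id and class names mimic the output shown above), and the selectors are shortened versions written for this snippet rather than the ones copied from the real site.

```r
library(rvest)

# A tiny stand-in page (hypothetical content, not the real Expatistan markup)
page <- read_html('
<div id="content">
  <div class="prices">
    <table class="country-ranking centered">
      <thead><tr><th>Ranking</th><th>Country</th><th>Price Index</th></tr></thead>
      <tbody>
        <tr><td>1st</td><td>Aland</td><td>120</td></tr>
        <tr><td>2nd</td><td>Borduria</td><td>95</td></tr>
      </tbody>
    </table>
  </div>
</div>')

# XPath: any table nested under the element whose id is "content"
by_xpath <- html_nodes(page, xpath = '//*[@id="content"]//table')

# CSS: the same table, addressed by id and class
by_css <- html_nodes(page, css = "#content table.country-ranking")

# Each selector returns a node set; html_table() turns it into a list of data frames
tab_xpath <- html_table(by_xpath)[[1]]
tab_css <- html_table(by_css)[[1]]

print(tab_xpath)
```

Selectors copied from developer tools tend to spell out every level of nesting, so they can break when the page layout changes; shorter selectors based on an id or class are often easier to maintain.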

Once we have the node that contains the table, we can parse it with the html_table()
function, then convert the result to a tibble and assign it to the variable df:

R
html %>%
  html_nodes(xpath = '//*[@id="content"]/div/div[1]/div[1]/table') %>%
  html_table() %>%
  .[[1]] %>%
  as_tibble() -> df

The full table looks like this:

R
df %>%
  DT::datatable()

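Let's tidy the data a little more. For example, the Ranking variable can be converted to a numeric variable,
and the Price Index * column can be renamed:

R
library(stringr)
df <- df %>%
  `colnames<-`(c("ranking", "country", "price_index")) %>%
  # Strip the ordinal suffix from values such as "1st" and "22nd"
  mutate(ranking = str_remove_all(ranking, "(st|nd|rd|th)$")) %>%
  # Re-guess column types so that ranking becomes numeric
  type_convert()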

df

## # A tibble: 95 x 3
## ranking country price_index
## <dbl> <chr> <int>
## 1 1 Cayman Islands 279
## 2 2 Hong Kong 230
## 3 3 Switzerland 226
## 4 4 Iceland 223
## 5 5 Bahamas 213
## 6 6 Norway 208
## 7 7 Singapore 197
## 8 8 Ireland 196
## 9 9 Denmark 189
## 10 10 Qatar 189
## # … with 85 more rows
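
When stripping ordinal suffixes such as "st" or "th", keep in mind that square brackets in a regular expression define a set of single characters, not a list of suffixes. The sketch below, on made-up values, contrasts a character class with an anchored alternation; the variable names are illustrative.

```r
library(stringr)

x <- c("1st", "2nd", "3rd", "4th", "21st")

# Character class: deletes every s, t, n, d, r, h and space, one character at a time
class_based <- str_remove_all(x, "[st nd rd th]")

# Anchored alternation: deletes one whole suffix at the end of the string
suffix_based <- str_remove(x, "(st|nd|rd|th)$")

# On pure "digits + suffix" input the two agree
print(class_based)
print(suffix_based)

# They differ as soon as other letters are present
class_other <- str_remove_all("north 1st", "[st nd rd th]")   # "o1"
suffix_other <- str_remove("north 1st", "(st|nd|rd|th)$")     # "north 1"
```

For a column that holds nothing but ranks, either pattern yields the same numbers; the anchored form is simply safer if the column ever contains other text.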

Let's start with a simple bar chart!

R
# For font and theme settings, see: https://czxa.top/tf/get-started-with-r-and-rstudio.html
enfont <- "CascadiaCode-Regular"
library(forcats)
df %>%
  slice(1:10) %>%
  mutate(
    # Order the bars by price index instead of alphabetically
    country = fct_reorder(country, price_index)
  ) %>%
  ggplot() +
  geom_col(aes(x = country,
               y = price_index,
               fill = country)) +
  awtools::a_dark_theme(enfont) +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  labs(y = "Price Index",
       x = "",
       title = "Cost of Living Ranking: Top 10",
       subtitle = "Czech Republic = 100",
       caption = "Data Source: Expatistan\nhttps://www.expatistan.com/cost-of-living/country/ranking") +
  theme(plot.margin = grid::unit(c(1, 0.5, 0.5, 0.2), "cm"))

This is country-level data covering the whole world, so the best visualization is of course a world map!

We use ggplot2 + sf to draw the world map, with World from the tmap package as the base map. If
installing tmap fails for you, you can download the "World.rds" data from the TidyFriday
Knowledge Planet (知识星球) group:

R
library(ggplot2)
library(sf)
data("World", package = "tmap")
# Attach the price index to the map data by country name
wdf <- World %>%
  mutate(name = as.character(name)) %>%
  left_join(df, by = c("name" = "country")) %>%
  rename(`Price Index` = `price_index`)
ggplot(wdf) +
  geom_sf(aes(geometry = geometry,
              fill = `Price Index`),
          color = "white", size = 0.05) +
  theme_modern_rc(base_family = enfont,
                  plot_title_family = enfont,
                  subtitle_family = enfont,
                  caption_family = enfont) +
  scale_fill_viridis_c() +
  theme(plot.margin = grid::unit(c(1, 0.2, 0.3, 0.2), "cm")) +
  labs(y = "",
       x = "",
       title = "Cost of Living Ranking by Country",
       subtitle = "Czech Republic = 100",
       caption = "Data Source: Expatistan\nhttps://www.expatistan.com/cost-of-living/country/ranking")

Discrete Variables

The price index is a continuous variable, but we can cut it into a grouped discrete variable:

R
# Cut by quantiles; for example, suppose we want 6 groups
nclass <- 6

# Compute the quantiles
quantiles <- wdf %>%
  pull(`Price Index`) %>%
  quantile(probs = seq(0, 1,
                       length.out = nclass + 1),
           na.rm = TRUE) %>%
  as.vector()

# Build one label per break point: "lower – upper"
labels <- imap_chr(quantiles, function(q, idx) {
  paste0(q, " – ", quantiles[idx + 1])
})
# Drop the last label, otherwise we would see a label like "62 - NA":
labels <- labels[-length(labels)]
labels

## [1] "64 – 77" "77 – 89"
## [3] "89 – 104.5" "104.5 – 129.333333333333"
## [5] "129.333333333333 – 166" "166 – 226"

wdf <-
  wdf %>%
  mutate(
    `Price Index` = cut(`Price Index`,
                        breaks = quantiles,
                        labels = labels,
                        include.lowest = TRUE)
  )

unique(wdf$`Price Index`)

## [1] <NA> 77 – 89 129.333333333333 – 166
## [4] 64 – 77 166 – 226 89 – 104.5
## [7] 104.5 – 129.333333333333
## 6 Levels: 64 – 77 77 – 89 89 – 104.5 ... 166 – 226
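
The quantile-based binning above can be tried on a plain numeric vector. This is a base R sketch with made-up values; the names `x`, `breaks`, and `bins` are illustrative. Here the labels are built by pairing each break with the next one, so no trailing label has to be dropped afterwards.

```r
# Made-up index values, with one missing entry
x <- c(64, 70, 77, 80, 85, 89, 95, 100, 120, 150, 166, 200, 226, NA)
nclass <- 4

# nclass + 1 break points at equally spaced probabilities
breaks <- quantile(x, probs = seq(0, 1, length.out = nclass + 1),
                   na.rm = TRUE, names = FALSE)

# Pair each break with its successor: n + 1 breaks give n labels
labels <- paste0(head(breaks, -1), " – ", tail(breaks, -1))

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

# Count per bin; the missing value stays NA
print(table(bins, useNA = "ifany"))
```

Without `include.lowest = TRUE` the smallest observation would fall outside the first interval and become NA, which is easy to miss on a map where NA already marks countries with no data.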

Draw the map:

R
ggplot(wdf) +
  geom_sf(aes(geometry = geometry,
              fill = `Price Index`),
          color = "white", size = 0.05) +
  theme_modern_rc(base_family = enfont,
                  plot_title_family = enfont,
                  subtitle_family = enfont,
                  caption_family = enfont) +
  # A manual palette, because the fill variable is now discrete
  scale_fill_manual(values = awtools::a_palette) +
  theme(plot.margin = grid::unit(c(1, 0.2, 0.3, 0.2), "cm")) +
  labs(y = "",
       x = "",
       title = "Cost of Living Ranking by Country",
       subtitle = "Czech Republic = 100",
       caption = "Data Source: Expatistan\nhttps://www.expatistan.com/cost-of-living/country/ranking")

Drawing the Map with the tmap Package

R
wdf2 <- World %>%
  mutate(name = as.character(name)) %>%
  left_join(df, by = c("name" = "country")) %>%
  rename(`Price Index` = `price_index`)
tmap::tmap_style("classic")
tmap::tm_shape(wdf2) +
  tmap::tm_polygons("Price Index")

Drawing a World Map with the highcharter Package

The merge seems to have some problems, apparently because country and region names differ between the two sources, even though I used the
fuzzyjoin package for a fuzzy join:

R
library(highcharter)
world <- download_map_data("custom/world-robinson-highres")
worlddf <- get_data_from_map(world)
# Fuzzy-match the map's country names against the scraped names
worlddf <- worlddf %>%
  fuzzyjoin::stringdist_left_join(df, by = c("name" = "country")) %>%
  select(code = `hc-a2`, price_index)

hcmap("custom/world-robinson-highres",
      data = worlddf, value = "price_index",
      joinBy = c("hc-a2", "code"),
      name = "Price Index",
      dataLabels = list(
        enabled = TRUE,
        format = '{point.name}'
      ),
      borderColor = "#FAFAFA",
      borderWidth = 0.1,
      tooltip = list(
        valueDecimals = 2
      )) %>%
  hc_title(text = "Cost of Living Ranking by Country") %>%
  hc_subtitle(text = 'Data Source: <a href="https://www.expatistan.com/cost-of-living/country/ranking">Expatistan</a>', useHTML = TRUE) %>%
  hc_add_theme(hc_theme_chalk())
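
One way to diagnose this kind of merge problem is to list the names that fail to match before joining. The sketch below uses dplyr's anti_join() on two small made-up tables (the names and most values are illustrative, not taken from the real map or the scraped ranking), then patches the mismatch with a hand-written lookup. It is an alternative to fuzzy matching, not the approach used in this post.

```r
library(dplyr)

# Names as they might appear in a map's attribute table (made-up sample)
map_names <- tibble(name = c("United States of America", "Czech Republic", "Norway"))

# Names as they might appear in a scraped ranking (made-up sample)
ranking <- tibble(country = c("United States", "Czech Republic", "Norway"),
                  price_index = c(150, 100, 208))

# Rows of the ranking that have no exact partner in the map data
unmatched <- anti_join(ranking, map_names, by = c("country" = "name"))
print(unmatched)

# Hand-written lookup from ranking names to map names (hypothetical)
fixes <- c("United States" = "United States of America")

# Replace a name when the lookup has an entry, otherwise keep it unchanged
ranking_fixed <- ranking %>%
  mutate(country = coalesce(unname(fixes[country]), country))

joined <- left_join(map_names, ranking_fixed, by = c("name" = "country"))
print(joined)
```

An explicit lookup takes more typing than a fuzzy join, but every correction is visible, and a fuzzy match cannot silently pair two different countries with similar names.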

I will try drawing this chart with Stata as well...

Scraping the Cost of Living Index Ranking by City

The table is here: https://www.expatistan.com/cost-of-living/index

The scraping approach is similar:

R
"https://www.expatistan.com/cost-of-living/index" %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="ranking"]/div[1]/table') %>%
  html_table() %>%
  .[[1]] %>%
  `colnames<-`(c("Ranking", "City", "Price Index")) %>%
  as_tibble() %>%
  DT::datatable()

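Writing a Shiny Document

A Shiny document differs from an R Markdown document in that it runs live. I made a Shiny document: https://czxa.top/shiny/cost/ . Running live means the code in it is rerun every time you open it, so the tables and charts in the document always match the original website.

Drawing China Maps with hchinamap

hchinamap is an R package I wrote, and it has been published on CRAN. It makes drawing interactive maps of China very convenient.

I wrote this R package because I found that the China maps in htmlwidgets-based R packages developed by foreign authors all lacked the nine-dash line and Taiwan. So I wanted to develop an R package that can draw the complete map of China.

Filling Discrete Variables on a China Map

Continuing yesterday's topic: yesterday we showed how to use ggplot2 + sf to draw maps of China's administrative divisions at every level, and then filled the maps with randomly generated data. Note that the random data was a continuous variable, so we used scale_fill_viridis_c() for the color mapping. In practice we may also meet discrete variables, or need to cut a continuous variable into a discrete one before plotting. This post covers both cases.

Drawing China Maps with ggplot2 + sf

Drawing maps with ggplot2 + sf requires shp data. Click here to download the data file I prepared: china-shp.zip. Unzipping it yields two folders. One is chinamap, which contains a lot of data, including:

Which Emoji Do You Use Most?

This post shows how to add GIF images to a ggplot chart.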

Cost of Living Ranking by Country & City

Expatistan provides two kinds of data: Cost of Living Ranking by Country and Cost of Living Ranking by City. Both are easy to get. This article shows how to scrape these data and visualize them on a map.

Regional Population Distribution of China, Just a Graph.

County-level population data for China is very hard to get, so I only have this data for 2004.

The shp data (chinamap.zip) and theme.R can be found in this article: Create Complete China Maps Using GGPLOT2 and SF. Population data set: 全国分县市人口统计资料2004.xlsx

A New Shiny Application: Show My Other Packages Which Have Conflicts with `bs4Dash`

While developing the ‘package’ application, I found that the ‘bs4Dash’ package conflicts with sankeywheel and hpackedbubble, so these two packages can’t be used within the bs4Dash framework. So I wrote another Shiny application named ‘otherpackages’ to show demos of these packages.

SHIBOR versus The Repo Rate: the Choice of Benchmark Interest Rates in China's Monetary Market

This article has been published on Zenodo. If you want to cite it, you can cite it as:

Cheng, Zhenxing. (2018, June 27). SHIBOR versus Repo Rate: the Choice of Benchmark Interest Rates in China’s Monetary Market. Zenodo. http://doi.org/10.5281/zenodo.3517843

Exploring the Wechat Friends Data Again!

About a week ago, I added two new functions to the hchinamap package: chinamappoint() and provincepoint(). These two functions can plot points on a map of China and on its province maps. This post introduces a method to display the geographical distribution of your WeChat friends on a map of China, along with some other ways to plot that distribution.

Find Semi-max Density Values in Normal Distribution

This afternoon a friend asked me a question. I can’t describe it accurately in English, but I will try my best. She has 30 sequences, each with 10 thousand numeric elements, and they seemingly follow a normal distribution. For a normal distribution, the maximum density value corresponds to the mean. She wants to know at which values the density equals half of the maximum density.
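
For a normal density with mean mu and standard deviation sigma, the density relative to its peak is exp(-(x - mu)^2 / (2 * sigma^2)). Setting this to 1/2 gives x = mu ± sigma * sqrt(2 * log(2)), roughly mu ± 1.177 * sigma. Below is a minimal R sketch of that formula, assuming the sample mean and standard deviation are used as estimates; the simulated sequence and the helper name `half_max_points` are illustrative, not taken from the original post.

```r
# Points where a normal density falls to half of its peak value:
# x = mu ± sigma * sqrt(2 * log(2))
half_max_points <- function(mu, sigma) {
  mu + c(-1, 1) * sigma * sqrt(2 * log(2))
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)   # one simulated sequence

mu_hat <- mean(x)
sigma_hat <- sd(x)
pts <- half_max_points(mu_hat, sigma_hat)

# Check: density at both points divided by the density at the mean
ratio <- dnorm(pts, mu_hat, sigma_hat) / dnorm(mu_hat, mu_hat, sigma_hat)
print(pts)
print(ratio)   # both ratios equal 0.5 up to floating-point error
```

The distance between the two points, 2 * sqrt(2 * log(2)) * sigma, is the full width at half maximum of the fitted curve. If the sequences are only approximately normal, the result describes the fitted normal curve rather than the empirical density.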

Geographical Distribution of Students' Hometown in Ji'nan University

This post demonstrates a simple application of the ‘hchinamap’ package. The data was collected from Ji’nan University’s library website and sports website. Since this data set contains private information, I can’t make it public. If you want a copy, you can add me on WeChat and ask for it.

Welcome to My Shiny Server!

Inspired by daattali/shiny-server, I decided to build my own Shiny server. Finally, I made it! You can now access it at https://czxa.top/shiny/. I have not yet registered this server (in China, all websites must be registered with MIIT, the Ministry of Industry and Information Technology). Once it is registered, you will be able to access it by domain name at https://czxa.top, but for now this domain name points to my blog deployed on GitHub.

Stata:Finance Topics

These are my notes from studying two or three articles that introduce some technical analysis methods in Stata.

Create Complete China Maps Using GGPLOT2 and SF

I have always been frustrated when trying to plot a beautiful China map. Yesterday, I finally drew one, at least in my opinion.

Create a Shiny Application to Collect Subscribers

Recently I have been very interested in Shiny. Over the past two days I tried to build a Shiny application to collect subscribers. It is actually a very simple application, but I spent a lot of time on the UI design and the submit action. Unfortunately, I failed to add the submit action. As I originally imagined it, after you enter a name and email address and click , the application should show a message confirming that your submission succeeded.

Using RMySQL Package to Connect MySQL on Ubuntu Server

Last week, I bought an Ubuntu server to deploy my blog and some Shiny applications. This post demonstrates how to connect to a MySQL database from R with the ‘RMySQL’ package.

Update for My 'monitoring' Project!

Since my R packages are simple and useless, they are being downloaded less and less… As of today, these three packages had only 88 downloads last week, so the downloads curve on my ‘monitoring’ application looks abrupt.
