---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# translateVTT
<!-- badges: start -->
[Travis build status](https://travis-ci.org/vzhomeexperiments/translateVTT)
<!-- badges: end -->
The goal of translateVTT is to translate VTT subtitles automatically. For example, it can translate multiple closed-caption files from English into many other languages.
This package complements the Udemy course on automated translation: [course referral link](https://www.udemy.com/course/automated-translation-google-translate-api/?referralCode=5C6A6465A4ADFC5CC326)
## Installation
Once available, you can install the released version of translateVTT from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("translateVTT")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("vzhomeexperiments/translateVTT")
```
# Basic description of automatic translation with R and the Google Translate API
This document describes an easier way to read and translate captions in `*.vtt` format.
## Why do you need it?
The accompanying course is about **automating translation jobs with the Google Translate API**. We access the API from R, the software environment for statistics and graphics, so the same approach can be extended to more advanced text processing or even sentiment analysis.
The Google Translate web page is a free service, but the **Google Translate API is not free**. The good news is that **a free trial period is available**, so you can start right away without paying.
## Google Translator API key
* Subscribe to Google Cloud Platform
* Generate your API Key
* Copy your key
## Encrypt your key in R!
See the dedicated course that teaches in detail how to use public key cryptography in R. **Cryptography is more fun with R!** [course referral coupon](https://www.udemy.com/course/keep-your-secrets-under-control/?referralCode=5B78D58E7C06AFFD80AE)
## Using Public Key Cryptography to securely store your API Key
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE, paged.print=TRUE}
# Generate your private key and write it to a folder.
# We assume you will save it to C:/Users/UserName/.ssh/ (Mac users can adapt the path).
# if necessary, install the packages
# install.packages("openssl"); install.packages("tidyverse")
# load the openssl and tidyverse libraries
library(openssl)
library(tidyverse)
# generate private key
rsa_keygen(bits = 5555) %>% write_pem(path = "C:/Users/UserName/.ssh/id_api")
# extract and write your public key
read_key(file = "C:/Users/UserName/.ssh/id_api", password = "") %>% `[[`("pubkey") %>% write_pem("C:/Users/UserName/.ssh/id_api.pub")
```
Now you have your personal public key, which we will use to encrypt the credentials.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# encrypt your API key (afterwards, always clear the R history and delete the plain-text key from this script)
## Encrypt with the PUBLIC key (e.g. run this code yourself); only the private key can decrypt the result
"=== xxx Place 4 API Key xxx yyy zzz ===" %>%
# serialize the object
serialize(connection = NULL) %>%
# encrypt the object
encrypt_envelope("C:/Users/UserName/.ssh/id_api.pub") %>%
# write encrypted data to File
write_rds("api_key.enc.rds")
```
Now we have our encrypted key inside our project folder!
If you reuse this script later, delete the plain-text API key from it first. After that, feel free to put the project under version control!
In the next lecture we will see how to read the key back!
**NOTE:** if several people need to share the key through version control, check out the R package 'secret'. Remember that you can learn how to use it in my course about cryptography in R!
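
As a quick sanity check of the encrypt-then-decrypt round trip, the following sketch uses a throwaway in-memory key pair instead of the key files from this README. The 2048-bit key size, the `demo_key` name and the placeholder secret are assumptions made for illustration only; `rsa_keygen()`, `encrypt_envelope()` and `decrypt_envelope()` are the `openssl` functions already used in this document.

```r
library(openssl)

# throwaway key pair kept in memory (assumption: 2048 bits is enough for a demo)
demo_key <- rsa_keygen(bits = 2048)

# placeholder value standing in for a real API key
secret <- "not-a-real-api-key"

# encrypt the serialized secret with the PUBLIC key ...
envelope <- encrypt_envelope(serialize(secret, connection = NULL), demo_key$pubkey)

# ... and decrypt it with the matching PRIVATE key
recovered <- unserialize(
  decrypt_envelope(envelope$data, envelope$iv, envelope$session, key = demo_key)
)

# TRUE when the round trip returned the original value
identical(recovered, secret)
```

The envelope is a list with `data`, `iv` and `session` elements, which is why those three pieces are passed separately when decrypting.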
## Translate Hello World!
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
library(openssl)
library(tidyverse)
# to install package in R
#install.packages("translate")
library(translate)
citation("translate")
```
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# help on the package
#?translate
# we need our API key to translate
out <- read_rds("api_key.enc.rds")
# decrypting the password using public data list and private key (optional!)
api_key <- decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>%
unserialize()
# usage: translate(query, source, target, key = get.key())
translate("I like this course", "en", "de", key = decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>%
unserialize())
# you can detect source language of the text
detect.source("Mi piace questo corso", key = api_key)
# list of valid language mappings
languages(key = api_key)
translate("Как насчет выпить", "ru", "en", key = decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>%
unserialize())
translate("How about a drink", "en", "ru", key = decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>%
unserialize())
```
## Solving Translation problem for one VTT file
We start with a single file: read it and inspect the resulting object.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# read the file -> it will be a data frame with one column named after the first line (WEBVTT)
t <- read.delim("C:/Users/fxtrams/Downloads/L0.vtt", stringsAsFactors = FALSE)
t
```
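
To see what `read.delim()` makes of a caption file without needing a real download, here is a small sketch that writes an invented sample to a temporary file and reads it back. The sample cues and the `sample_path` and `sample_vtt` names are assumptions for illustration.

```r
# write a tiny invented caption file to a temporary location
sample_path <- tempfile(fileext = ".vtt")
writeLines(c(
  "WEBVTT",
  "",
  "00:00:00.000 --> 00:00:02.000",
  "Hello and welcome",
  "",
  "00:00:02.000 --> 00:00:04.000",
  "to this course"
), sample_path)

# the first line becomes the column name; blank lines are skipped by default
sample_vtt <- read.delim(sample_path, stringsAsFactors = FALSE)

names(sample_vtt)  # "WEBVTT"
nrow(sample_vtt)   # 4
```

This is why the later steps refer to a column called `WEBVTT`. Note that the blank lines separating the cues are dropped at this stage.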
Build a logical vector marking the rows that contain the arrow `-->`, then split the table into the rows with timestamps and the rows with text.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
library(tidyverse)
# build a logical vector indicating which rows contain timestamps
x <- t %>%
  # detect rows with timestamps (these must not be translated)
  apply(MARGIN = 1, str_detect, pattern = "-->")
# extract only rows containing text (i.e. not containing timestamps)
txt <- subset.data.frame(t, !x)
# extract only time stamps
tst <- subset.data.frame(t, x)
```
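
The same split can be sketched in base R with `grepl()`, which may be easier to follow than `apply()` over rows. The sample data frame and the `cues`, `txt_demo` and `tst_demo` names are invented for illustration; this is an alternative sketch, not the code the package uses.

```r
# invented sample rows, shaped like the data frame read from a .vtt file
cues <- data.frame(
  WEBVTT = c(
    "00:00:00.000 --> 00:00:02.000",
    "Hello and welcome",
    "00:00:02.000 --> 00:00:04.000",
    "to this course"
  ),
  stringsAsFactors = FALSE
)

# TRUE for timestamp rows; fixed = TRUE treats "-->" as plain text, not a regex
is_timestamp <- grepl("-->", cues$WEBVTT, fixed = TRUE)

# drop = FALSE keeps one-column results as data frames with their row names
txt_demo <- cues[!is_timestamp, , drop = FALSE]
tst_demo <- cues[is_timestamp, , drop = FALSE]

row.names(txt_demo)  # "2" "4"
row.names(tst_demo)  # "1" "3"
```

The preserved row names matter: they record each line's original position, which is what allows the text and timestamps to be interleaved again after translation.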
Next, translate the text rows. We first need to read the API key back. After the translation we select the translated column and give it the original column name.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
library(translateR)
library(openssl)
citation("translateR")
# get back our encrypted API key
out <- read_rds("api_key.enc.rds")
api_key <- decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>%
unserialize()
# translate object txt or file in R
# Google, translate column in dataset
google.dataset.out <- translate(dataset = txt,
content.field = 'WEBVTT',
google.api.key = api_key,
source.lang = 'en',
target.lang = 'es')
# extract only new column
trsltd <- google.dataset.out %>% select(translatedContent)
# give original name
colnames(trsltd) <- "WEBVTT"
```
Join the timestamps with the translated text, then restore the original row order.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# bind rows with original timestamps
abc <- rbind(tst, trsltd)
# order this file back again
bcd <- abc[ order(as.numeric(row.names(abc))), ] %>% as.character() %>% as.data.frame()
```
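
The reordering step relies on `rbind()` keeping the original row names of both pieces. A minimal base R sketch with invented cues, using a hard-coded Spanish text as a stand-in for the API result (all names here are assumptions):

```r
# invented sample rows, shaped like the data frame read from a .vtt file
cues <- data.frame(
  WEBVTT = c(
    "00:00:00.000 --> 00:00:02.000",
    "Hello and welcome",
    "00:00:02.000 --> 00:00:04.000",
    "to this course"
  ),
  stringsAsFactors = FALSE
)
is_timestamp <- grepl("-->", cues$WEBVTT, fixed = TRUE)
tst_demo <- cues[is_timestamp, , drop = FALSE]   # row names 1, 3
txt_demo <- cues[!is_timestamp, , drop = FALSE]  # row names 2, 4

# stand-in for the translated column (no API call is made here)
trsltd_demo <- txt_demo
trsltd_demo$WEBVTT <- c("Hola y bienvenidos", "a este curso")

# stack both pieces, then sort by the numeric value of the row names
combined <- rbind(tst_demo, trsltd_demo)
restored <- combined[order(as.numeric(row.names(combined))), , drop = FALSE]

restored$WEBVTT
```

Sorting on `as.numeric(row.names(...))` rather than on the character row names avoids "10" being placed before "2".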
Add an empty row at the top and write the result to a file.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
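# restore the original column name
colnames(bcd) <- "WEBVTT"
# as.tibble() is deprecated; as_tibble() is the current name
bcd <- as_tibble(bcd)
# add one empty row at the top
bcd2 <- add_row(bcd, WEBVTT = "", .before = 1)
# write the file back; the column name becomes the first line of the file
write.table(bcd2, "translated.vtt", quote = FALSE, row.names = FALSE, fileEncoding = "UTF-8")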
```
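
An alternative way to write the lines, sketched here with base R only, is `writeLines()` on a character vector. The content and the temporary output path are invented for illustration; `enc2utf8()` with `useBytes = TRUE` is a common way to get UTF-8 bytes on disk regardless of the session locale.

```r
# invented content: header, blank line, then one cue
vtt_lines <- c(
  "WEBVTT",
  "",
  "00:00:00.000 --> 00:00:02.000",
  "Hola y bienvenidos"
)

# hypothetical output location; a temp file keeps the demo side-effect free
out_path <- tempfile(fileext = ".vtt")

# enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes unchanged
writeLines(enc2utf8(vtt_lines), out_path, useBytes = TRUE)

# read the file back to confirm the round trip
readLines(out_path, encoding = "UTF-8")
```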
## Pack code into Function
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# call our translation function from R script to environment
source("translateVTT.R")
translateVTT(fileName = "C:/Users/fxtrams/Downloads/L0.vtt",
sourceLang = "en",
destLang = "nl",
apikey = api_key)
```
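
The contents of `translateVTT.R` are not shown here. As a rough idea of how the steps can be packed into one function, the following sketch works on a character vector of lines and takes the translator as an argument. `translate_vtt_lines()` and its `translate_fun` parameter are hypothetical names, not part of the package, and `toupper()` only stands in for a real translation call.

```r
# hypothetical helper: translate only the caption text, keep everything else
translate_vtt_lines <- function(lines, translate_fun) {
  # text lines are non-empty, carry no "-->" arrow and are not the WEBVTT header
  is_text <- nzchar(trimws(lines)) &
    !grepl("-->", lines, fixed = TRUE) &
    !grepl("^WEBVTT", lines)
  out <- lines
  if (any(is_text)) {
    out[is_text] <- translate_fun(lines[is_text])
  }
  out
}

# invented sample input
vtt <- c(
  "WEBVTT",
  "",
  "00:00:00.000 --> 00:00:02.000",
  "hello",
  "",
  "00:00:02.000 --> 00:00:04.000",
  "world"
)

# toupper() is a stand-in for the API call
translated <- translate_vtt_lines(vtt, toupper)
translated
```

Passing the translator as a function keeps the file handling testable without an API key. A limitation of this sketch is that optional cue identifiers would be treated as text.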
The `translateR` package can work with data frames directly:
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
library(translateR)
# read the file -> it will be a data frame
t <- read.delim("C:/Users/fxtrams/Downloads/L3.vtt", stringsAsFactors = FALSE)
# build a logical vector indicating which rows contain timestamps
x <- t %>%
  # detect rows with timestamps (these must not be translated)
  apply(MARGIN = 1, str_detect, pattern = "-->")
# extract only rows containing text (i.e. not containing timestamps)
txt <- subset.data.frame(t, !x)
# extract only time stamps
tst <- subset.data.frame(t, x)
# # write to file for translation (manually)
# txt %>% write.table("translate.txt", row.names = F)
# # write lines for translations
# lns <- txt %>% as.matrix() %>% c()
```
## Translate this file using the paid Google Translate API
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# translate object txt or file in R
# Google, translate column in dataset
google.dataset.out <- translate(dataset = txt,
content.field = 'WEBVTT',
google.api.key = "api key do not check me in to Version Control!",
source.lang = 'en',
target.lang = 'es')
# extract only new column
trsltd <- google.dataset.out %>% select(translatedContent)
# give original name
colnames(trsltd) <- "WEBVTT"
#
```
## Place it back...
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
## read this file back
#tsltd <- read.table("translate.txt", encoding = "UTF-8", header = T, stringsAsFactors = F)
# add original row names to the table tsltd
# row.names(tsltd) <- as.numeric(row.names(txt))
# bind rows with original timestamps
abc <- rbind(tst, trsltd)
# order this file back again
bcd <- abc[ order(as.numeric(row.names(abc))), ] %>% as.character() %>% as.data.frame()
# restore the original column name
colnames(bcd) <- "WEBVTT"
# as.tibble() is deprecated; as_tibble() is the current name
bcd <- as_tibble(bcd)
# add one empty row at the top
bcd2 <- add_row(bcd, WEBVTT = "", .before = 1)
# write the file back
write.table(bcd2, "translated.vtt", quote = FALSE, row.names = FALSE)
```
## For loop!
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# translate 10 files in .vtt format
source("translateVTT.R")
library(openssl)
library(tidyverse)
# make a list of files to translate
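# note: `pattern` is a regular expression, not a glob, so match the ".vtt" extension explicitly
filesToTranslate <- list.files("C:/Users/fxtrams/Downloads/", pattern = "\\.vtt$", full.names = TRUE)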
# make a list of languages
languages <- c("fr", "tr", "it", "id", "pt", "es", "ms", "de")
# get api key
out <- read_rds("api_key.enc.rds")
# decrypting the password using public data list and private key
api_key <- decrypt_envelope(out$data, out$iv, out$session, "C:/Users/fxtrams/.ssh/id_api", password = "") %>% unserialize()
# starting time of my job
start_time <- Sys.time()
balance_before <- 256.43
# for loop
for (FILE in filesToTranslate) {
# for loop for languages
for (LANG in languages) {
# translation
translateVTT(fileName = FILE, sourceLang = "en", destLang = LANG, apikey = api_key)
}
}
## How much does it cost?
# 1.26 hours and $18.37 to translate 23 files into 8 different languages...
```
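
To put the figures from the comment above into perspective, here is a back-of-the-envelope calculation per translated file. It only uses the numbers quoted there (23 files, 8 languages, 1.26 hours, 18.37 USD); actual cost depends on the amount of text and on current API pricing, so treat it as a rough estimate.

```r
# figures quoted in the README
n_files     <- 23
n_languages <- 8
hours       <- 1.26
cost_usd    <- 18.37

# one job = one file translated into one language
n_jobs <- n_files * n_languages          # 184

# average cost and duration per job
cost_per_job    <- cost_usd / n_jobs     # about 0.10 USD
seconds_per_job <- hours * 3600 / n_jobs # about 24.7 seconds

round(c(jobs = n_jobs, usd = cost_per_job, seconds = seconds_per_job), 2)
```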
# Troubleshooting
## translateR package does not return translation (Mac)
If you are on a Mac and use the `translateR` package, you may encounter the following problem: the function `translate()` returns only one column with the original text, without the translation.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
google.dataset.out <- translate(dataset = t, content.field = 'WEBVTT', google.api.key = api_key, source.lang = 'en', target.lang = 'es')
```
It was possible to work around this problem by first running code from the `translate` package...
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# executed first
res.translate <- translate::translate(query = "Hello World", source = "en", target = "de", key = apikey)
# ...and after that the translateR call worked
# Google, translate column in dataset
res.translateR2 <- translateR::translate(dataset = txt,
content.field = 'WEBVTT',
google.api.key = apikey,
source.lang = "en",
target.lang = "de")
```
## 'Special' characters are not displayed with the proper encoding
See: https://stackoverflow.com/questions/44095025/r-how-to-convert-utf-8-code-like-u9600u524d-back-to-chinese-characters
For 'special' characters (Chinese, Cyrillic, etc.) there might be encoding problems when writing to the file!
The most likely cause is the operating system `locale`. The function `translateVTT` tries to handle that by switching the OS locale to an appropriate one:
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
## dealing with locale ... only a few key languages are supported at the moment
# note: we must apply a fallback mechanism in case the OS does not support the locale!
if(destLang == "fr") {
res <- Sys.setlocale(locale = "French")
if(res == "") {stop("Your OS does not support this language", call. = FALSE)}
} else if(destLang == "ru") {
res <- Sys.setlocale(locale = "Russian")
if(res == "") {stop("Your OS does not support this language", call. = FALSE)}
} else if(destLang == "it") {
res <- Sys.setlocale(locale = "Italian")
if(res == "") {stop("Your OS does not support this language", call. = FALSE)}
} else if(destLang == "zh-CN") {
res <- Sys.setlocale(locale = "Chinese")
if(res == "") {stop("Your OS does not support this language", call. = FALSE)}
} else if(destLang == "hi") {
res <- Sys.setlocale(locale = "Hindi")
if(res == "") {stop("Your OS does not support this language", call. = FALSE)}
}
# ... your code to translate
# restoring locale
Sys.setlocale()
```
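
The if / else-if chain can also be sketched as a lookup table plus a small helper that always restores the previous locale, even when an error occurs. `locale_for` and `run_with_locale()` are hypothetical names and not part of the package. Unlike the code above, the sketch changes only the `LC_CTYPE` category, which is an assumption made so that the old value can be restored reliably, and the locale names are the Windows-style ones used above.

```r
# lookup table replacing the if / else-if chain (same five languages as above)
locale_for <- c(fr = "French", ru = "Russian", it = "Italian",
                "zh-CN" = "Chinese", hi = "Hindi")

# hypothetical helper: switch LC_CTYPE, evaluate `code`, always restore the old value
run_with_locale <- function(locale, code) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the request cannot be honoured
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(res, "")) {
    stop("Your OS does not support this locale: ", locale, call. = FALSE)
  }
  # `code` is evaluated lazily, so it runs under the new locale
  force(code)
}

# the "C" locale is always available, so it is safe for a demonstration
before <- Sys.getlocale("LC_CTYPE")
result <- run_with_locale("C", toupper("hello"))
after  <- Sys.getlocale("LC_CTYPE")

identical(before, after)
```

In real use the call would look like `run_with_locale(locale_for[[destLang]], { ... })`, with the translation and file writing inside the braces.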
## Want to learn more?
Join this *Udemy Course* with this [coupon](https://www.udemy.com/course/automated-translation-google-translate-api/?referralCode=5C6A6465A4ADFC5CC326) and get Lifetime access with 30 days money back guarantee!