generated from thedatasociety/lab-template
-
Notifications
You must be signed in to change notification settings - Fork 7
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
0471782
commit eda64c0
Showing
2 changed files
with
245 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Criando uma App no Twitter\n", | ||
"\n", | ||
"## Acessar o endereço abaixo e criar uma App: https://apps.twitter.com/\n", | ||
"\n", | ||
"Criar login, senha e logar\n", | ||
"\n", | ||
"Criar uma nova App clicando em Create New App\n", | ||
"\n", | ||
"Definir os detalhes da aplicação: nome, descrição, website, etc\n", | ||
"\n", | ||
"**No menu \"Keys and Tokens\" gerar as chaves da App para usar na configuração do Flume e substituir no arquivo twitterAgent.conf abaixo.**" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Configurando o agent, source, channel e sink\n", | ||
"\n", | ||
"**Agent**: Apenas um agente chamado *TwitterAgent*\n", | ||
"\n", | ||
"**Source**: Twitter\n", | ||
"\n", | ||
"**Channel**: Memória\n", | ||
"\n", | ||
"**Sink**: Registra os dados no HDFS" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"#Nome dos componentes do agente\n", | ||
"TwitterAgent.sources = Twitter\n", | ||
"TwitterAgent.channels = MemChannel\n", | ||
"TwitterAgent.sinks = HDFS\n", | ||
"\n", | ||
"#Configuração do Source\n", | ||
"TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource\n", | ||
"TwitterAgent.sources.Twitter.consumerKey = 6BWSmQX6AUfKhcNsMeav9zhi2\n", | ||
"TwitterAgent.sources.Twitter.consumerSecret = DcYHM3EFR5oJR7VEq8cBVtjTPQxftI9PMrST71P7oXW0BlGiZv\n", | ||
"TwitterAgent.sources.Twitter.accessToken = 1046705580-MTYNfMbLL6XSyQQgeL3Sah9RejwDRK5caBO9GRZ\n", | ||
"TwitterAgent.sources.Twitter.accessTokenSecret = 2FRFyHQEdAIFVCFgTxyxCvl4zqoNtTEMZuwJHCfhXW2jk\n", | ||
"TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata\n", | ||
"\n", | ||
"#Configuração do Sink\n", | ||
"TwitterAgent.sinks.HDFS.type = hdfs\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text\n", | ||
"\n", | ||
"#number of events written to file before it is flushed to HDFS\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 \n", | ||
"\n", | ||
"#File size to trigger roll, in bytes (0: never roll based on file size)\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.rollSize = 0\n", | ||
"\n", | ||
"#Number of events written to file before it rolled (0 = never roll based on number of events)\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.rollCount = 50\n", | ||
"\n", | ||
"#Number of seconds to wait before rolling current file (0 = never roll based on time interval)\n", | ||
"TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0\n", | ||
"\n", | ||
"#Configuração do Channel\n", | ||
"TwitterAgent.channels.MemChannel.type = memory\n", | ||
"#The maximum number of events stored in the channel\n", | ||
"TwitterAgent.channels.MemChannel.capacity = 100\n", | ||
"#The maximum number of events the channel will take from a source or give to a sink per transaction\n", | ||
"TwitterAgent.channels.MemChannel.transactionCapacity = 100\n", | ||
"\n", | ||
"#Conectando Source, Sink, Channel\n", | ||
"TwitterAgent.sources.Twitter.channels = MemChannel\n", | ||
"TwitterAgent.sinks.HDFS.channel = MemChannel\n", | ||
"\n", | ||
"\n", | ||
"#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json\n", | ||
"#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!cat /home/jovyan/labs/lab6-flume/twitterAgent.conf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Nota: \n", | ||
"Usando as configurações descritas no site do Flume - **TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource** - na configuração do source, o arquivo é gerado com caracteres ilegíveis. Assim, iremos utilizar **TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource** e temos que copiar o arquivo flume-sources-1.0-SNAPSHOT.jar para a pasta do Flume para o seu correto funcionamento." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!cp resources/flume-sources-1.0-SNAPSHOT.jar ~/resources/local/flume-${FLUME_VERSION}/lib" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Abrir um terminal e iniciar o FlumeAgent\n", | ||
"\n", | ||
"``` bash\n", | ||
"flume-ng agent --conf conf --conf-file labs/lab6-flume/twitterAgent.conf --name TwitterAgent -Dflume.looger=INFO,console\n", | ||
"``` " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# TrendTopics\n", | ||
"Criando um rank de palavras que contém #" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"No configs found; falling back on auto-configuration\n", | ||
"No configs specified for hadoop runner\n", | ||
"Looking for hadoop binary in /home/jovyan/resources/local/hadoop-2.9.2/bin...\n", | ||
"Found hadoop binary: /home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop\n", | ||
"Using Hadoop version 2.9.2\n", | ||
"Creating temp directory /tmp/mrjob-ex-3.jovyan.20190913.175842.269968\n", | ||
"uploading working dir files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd...\n", | ||
"Copying other local files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/\n", | ||
"Running step 1 of 2...\n", | ||
" packageJobJar: [/tmp/hadoop-unjar4483334449690778934/] [] /tmp/streamjob4300404582507966699.jar tmpDir=null\n", | ||
" Connecting to ResourceManager at /0.0.0.0:8032\n", | ||
" Connecting to ResourceManager at /0.0.0.0:8032\n", | ||
" Total input files to process : 4\n", | ||
" Cleaning up the staging area /tmp/hadoop-yarn/staging/jovyan/.staging/job_1568395552187_0004\n", | ||
" Error Launching job : Not a file: hdfs://localhost:9000/user/matheus/output/output8\n", | ||
" Streaming Command Failed!\n", | ||
"Attempting to fetch counters from logs...\n", | ||
"Can't fetch history log; missing job ID\n", | ||
"No counters found\n", | ||
"Scanning logs for probable cause of failure...\n", | ||
"Can't fetch history log; missing job ID\n", | ||
"Can't fetch task logs; missing application ID\n", | ||
"Step 1 of 2 failed: Command '['/home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop', 'jar', '/home/jovyan/resources/local/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar', '-files', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob-ex-3.py#mrjob-ex-3.py,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///user/matheus/*', '-output', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/step-output/0000', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --reducer']' returned non-zero exit status 1280.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!python resources/mrjob-ex-3.py -r hadoop --hadoop-streaming-jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar hdfs:///user/matheus/*" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
#Nome dos componentes do agente | ||
TwitterAgent.sources = Twitter | ||
TwitterAgent.channels = MemChannel | ||
TwitterAgent.sinks = HDFS | ||
|
||
#Configuração do Source | ||
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource | ||
TwitterAgent.sources.Twitter.consumerKey = <consumerKey> | ||
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret> | ||
TwitterAgent.sources.Twitter.accessToken = <accessToken> | ||
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret> | ||
TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata | ||
|
||
#Configuração do Sink | ||
TwitterAgent.sinks.HDFS.type = hdfs | ||
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus | ||
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream | ||
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text | ||
|
||
#number of events written to file before it is flushed to HDFS | ||
TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 | ||
|
||
#File size to trigger roll, in bytes (0: never roll based on file size) | ||
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 | ||
|
||
#Number of events written to file before it rolled (0 = never roll based on number of events) | ||
TwitterAgent.sinks.HDFS.hdfs.rollCount = 50 | ||
|
||
#Number of seconds to wait before rolling current file (0 = never roll based on time interval) | ||
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0 | ||
|
||
#Configuração do Channel | ||
TwitterAgent.channels.MemChannel.type = memory | ||
#The maximum number of events stored in the channel | ||
TwitterAgent.channels.MemChannel.capacity = 100 | ||
#The maximum number of events the channel will take from a source or give to a sink per transaction | ||
TwitterAgent.channels.MemChannel.transactionCapacity = 100 | ||
|
||
#Conectando Source, Sink, Channel | ||
TwitterAgent.sources.Twitter.channels = MemChannel | ||
TwitterAgent.sinks.HDFS.channel = MemChannel | ||
|
||
|
||
#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json | ||
#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object |