diff --git a/labs/lab6-flume/lab6-flume.demo02.ipynb b/labs/lab6-flume/lab6-flume.demo02.ipynb new file mode 100644 index 0000000..e2bc8e3 --- /dev/null +++ b/labs/lab6-flume/lab6-flume.demo02.ipynb @@ -0,0 +1,200 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Criando uma App no Twitter\n", + "\n", + "## Acessar o endereço abaixo e criar uma App: https://apps.twitter.com/\n", + "\n", + "Criar login, senha e logar\n", + "\n", + "Criar uma nova App clicando em Create New App\n", + "\n", + "Definir os detalhes da aplicação: nome, descrição, website, etc\n", + "\n", + "**No menu \"Keys and Tokens\" gerar as chaves da App para usar na configuração do Flume e substituir no arquivo twitterAgent.conf abaixo.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Configurando o agent, source, channel e sink\n", + "\n", + "**Agent**: Apenas um agente chamado *TwitterAgent*\n", + "\n", + "**Source**: Twitter\n", + "\n", + "**Channel**: Memória\n", + "\n", + "**Sink**: Registra os dados no HDFS" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "#Nome dos componentes do agente\n", + "TwitterAgent.sources = Twitter\n", + "TwitterAgent.channels = MemChannel\n", + "TwitterAgent.sinks = HDFS\n", + "\n", + "#Configuração do Source\n", + "TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource\n", + "TwitterAgent.sources.Twitter.consumerKey = 6BWSmQX6AUfKhcNsMeav9zhi2\n", + "TwitterAgent.sources.Twitter.consumerSecret = DcYHM3EFR5oJR7VEq8cBVtjTPQxftI9PMrST71P7oXW0BlGiZv\n", + "TwitterAgent.sources.Twitter.accessToken = 1046705580-MTYNfMbLL6XSyQQgeL3Sah9RejwDRK5caBO9GRZ\n", + "TwitterAgent.sources.Twitter.accessTokenSecret = 2FRFyHQEdAIFVCFgTxyxCvl4zqoNtTEMZuwJHCfhXW2jk\n", + "TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata\n", + "\n", + "#Configuração do Sink\n", + "TwitterAgent.sinks.HDFS.type = hdfs\n", + "TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus\n", + "TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream\n", + "TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text\n", + "\n", + "#number of events written to file before it is flushed to HDFS\n", + "TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 \n", + "\n", + "#File size to trigger roll, in bytes (0: never roll based on file size)\n", + "TwitterAgent.sinks.HDFS.hdfs.rollSize = 0\n", + "\n", + "#Number of events written to file before it rolled (0 = never roll based on number of events)\n", + "TwitterAgent.sinks.HDFS.hdfs.rollCount = 50\n", + "\n", + "#Number of seconds to wait before rolling current file (0 = never roll based on time interval)\n", + "TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0\n", + "\n", + "#Configuração do Channel\n", + "TwitterAgent.channels.MemChannel.type = memory\n", + "#The maximum number of events stored in the channel\n", + "TwitterAgent.channels.MemChannel.capacity = 100\n", + "#The maximum number of events the channel will take from a source or give to a sink per transaction\n", + "TwitterAgent.channels.MemChannel.transactionCapacity = 100\n", + "\n", + "#Conectando Source, Sink, Channel\n", + "TwitterAgent.sources.Twitter.channels = MemChannel\n", + "TwitterAgent.sinks.HDFS.channel = MemChannel\n", + "\n", + "\n", + "#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json\n", + "#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object" + ] + } + ], + "source": [ + "!cat /home/jovyan/labs/lab6-flume/twitterAgent.conf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Nota: \n", + "Usando as configurações descritas no site do Flume - **TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource** - na configuração do source, o arquivo é gerado com caracteres ilegíveis. Assim, iremos utilizar **TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource** e temos que copiar o arquivo flume-sources-1.0-SNAPSHOT.jar para a pasta do Flume para o seu correto funcionamento." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "!cp resources/flume-sources-1.0-SNAPSHOT.jar ~/resources/local/flume-${FLUME_VERSION}/lib" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Abrir um terminal e iniciar o FlumeAgent\n", + "\n", + "``` bash\n", + "flume-ng agent --conf conf --conf-file labs/lab6-flume/twitterAgent.conf --name TwitterAgent -Dflume.looger=INFO,console\n", + "``` " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# TrendTopics\n", + "Criando um rank de palavras que contém #" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No configs found; falling back on auto-configuration\n", + "No configs specified for hadoop runner\n", + "Looking for hadoop binary in /home/jovyan/resources/local/hadoop-2.9.2/bin...\n", + "Found hadoop binary: /home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop\n", + "Using Hadoop version 2.9.2\n", + "Creating temp directory /tmp/mrjob-ex-3.jovyan.20190913.175842.269968\n", + "uploading working dir files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd...\n", + "Copying other local files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/\n", + "Running step 1 of 2...\n", + " packageJobJar: [/tmp/hadoop-unjar4483334449690778934/] [] /tmp/streamjob4300404582507966699.jar tmpDir=null\n", + " Connecting to ResourceManager at /0.0.0.0:8032\n", + " Connecting to ResourceManager at /0.0.0.0:8032\n", + " Total input files to process : 4\n", + " Cleaning up the staging area /tmp/hadoop-yarn/staging/jovyan/.staging/job_1568395552187_0004\n", + " Error Launching job : Not a file: hdfs://localhost:9000/user/matheus/output/output8\n", + " Streaming Command Failed!\n", + "Attempting to fetch counters from logs...\n", + "Can't fetch history log; missing job ID\n", + "No counters found\n", + "Scanning logs for probable cause of failure...\n", + "Can't fetch history log; missing job ID\n", + "Can't fetch task logs; missing application ID\n", + "Step 1 of 2 failed: Command '['/home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop', 'jar', '/home/jovyan/resources/local/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar', '-files', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob-ex-3.py#mrjob-ex-3.py,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///user/matheus/*', '-output', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/step-output/0000', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --reducer']' returned non-zero exit status 1280.\n" + ] + } + ], + "source": [ + "!python resources/mrjob-ex-3.py -r hadoop --hadoop-streaming-jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar hdfs:///user/matheus/*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/labs/lab6-flume/twitterAgent.conf b/labs/lab6-flume/twitterAgent.conf new file mode 100644 index 0000000..160e201 --- /dev/null +++ b/labs/lab6-flume/twitterAgent.conf @@ -0,0 +1,45 @@ +#Nome dos componentes do agente +TwitterAgent.sources = Twitter +TwitterAgent.channels = MemChannel +TwitterAgent.sinks = HDFS + +#Configuração do Source +TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource +TwitterAgent.sources.Twitter.consumerKey = +TwitterAgent.sources.Twitter.consumerSecret = +TwitterAgent.sources.Twitter.accessToken = +TwitterAgent.sources.Twitter.accessTokenSecret = +TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata + +#Configuração do Sink +TwitterAgent.sinks.HDFS.type = hdfs +TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus +TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream +TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text + +#number of events written to file before it is flushed to HDFS +TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 + +#File size to trigger roll, in bytes (0: never roll based on file size) +TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 + +#Number of events written to file before it rolled (0 = never roll based on number of events) +TwitterAgent.sinks.HDFS.hdfs.rollCount = 50 + +#Number of seconds to wait before rolling current file (0 = never roll based on time interval) +TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0 + +#Configuração do Channel +TwitterAgent.channels.MemChannel.type = memory +#The maximum number of events stored in the channel +TwitterAgent.channels.MemChannel.capacity = 100 +#The maximum number of events the channel will take from a source or give to a sink per transaction +TwitterAgent.channels.MemChannel.transactionCapacity = 100 + +#Conectando Source, Sink, Channel +TwitterAgent.sources.Twitter.channels = MemChannel +TwitterAgent.sinks.HDFS.channel = MemChannel + + +#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json +#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object \ No newline at end of file