Add Tutorial for Text-Mining Chinese #5809

shiltemann · 2025-02-28T11:49:44Z

Tutorial by @Sch-Da, thanks a lot! cool! (ping also @bgruening as fyi)

github-actions · 2025-02-28T12:08:25Z

topics/statistics/tutorials/text_mining_chinese/workflows/main_workflow.ga

@@ -0,0 +1 @@
+{"a_galaxy_workflow": "true", "annotation": "", "comments": [{"child_steps": [11], "color": "red", "data": {"title": "Results sorted from the most frequently appearing characters to those that appear the least"}, "id": 8, "position": [1732.6, 163.4], "size": [220, 362], "type": "frame"}, {"child_steps": [0, 1], "color": "red", "data": {"title": "Text upload"}, "id": 0, "position": [0, 91], "size": [239, 440], "type": "frame"}, {"child_steps": [2, 3], "color": "red", "data": {"title": "Text cleaning and layout unification of the uploaded texts for easier comparison"}, "id": 1, "position": [273, 13], "size": [255, 509], "type": "frame"}, {"child_steps": [4, 5], "color": "red", "data": {"title": "Comparison of the two texts"}, "id": 2, "position": [678, 0], "size": [285, 587], "type": "frame"}, {"child_steps": [6], "color": "red", "data": {"title": "Extraction of lines censored with symbol \u00d7 and their counterparts in the second text"}, "id": 3, "position": [990.6, 79.6], "size": [262, 516], "type": "frame"}, {"child_steps": [7], "color": "red", "data": {"title": "Conversion step for further processing"}, "id": 4, "position": [1273, 161], "size": [226, 367], "type": "frame"}, {"child_steps": [8], "color": "red", "data": {"title": "Select only uncensored characters for visualisation"}, "id": 5, "position": [1278.4, 558.6], "size": [224, 261], "type": "frame"}, {"child_steps": [9], "color": "red", "data": {"title": "Summarising and quantifying the results. 
How often did what symbol appear?"}, "id": 6, "position": [1507, 163], "size": [211, 363], "type": "frame"}, {"child_steps": [10], "color": "red", "data": {"title": "Visualisation of results in a wordcloud"}, "id": 7, "position": [1593.2, 549.2], "size": [280, 433], "type": "frame"}], "creator": [{"class": "Person", "identifier": "0000-0001-9536-5587", "name": "Daniela Schneider"}], "format-version": "0.1", "license": "CC-BY-4.0", "name": "Text Mining Differences in Chinese Newspaper Articles", "report": {"markdown": "\n# Workflow Execution Report\n\n## Workflow Inputs\n```galaxy\ninvocation_inputs()\n```\n\n## Workflow Outputs\n```galaxy\ninvocation_outputs()\n```\n\n## Workflow\n```galaxy\nworkflow_display()\n```\n"}, "steps": {"0": {"annotation": "Upload the censored text containing replacement characters like \u2018\u00d7\u2019.", "content_id": null, "errors": null, "id": 0, "input_connections": {}, "inputs": [{"description": "Upload the censored text containing replacement characters like \u2018\u00d7\u2019.", "name": "Input censored text"}], "label": "Input censored text", "name": "Input dataset", "outputs": [], "position": {"left": 21, "top": 180.6051261035269}, "tool_id": null, "tool_state": "{\"optional\": false, \"tag\": null}", "tool_version": null, "type": "data_input", "uuid": "dac86ce1-b481-46e0-ae1a-9d1943f1a328", "when": null, "workflow_outputs": []}, "1": {"annotation": "Upload the uncensored text without replacement characters.", "content_id": null, "errors": null, "id": 1, "input_connections": {}, "inputs": [{"description": "Upload the uncensored text without replacement characters.", "name": "Input uncensored text "}], "label": "Input uncensored text ", "name": "Input dataset", "outputs": [], "position": {"left": 21.7166748046875, "top": 354.55513831055816}, "tool_id": null, "tool_state": "{\"optional\": false, \"tag\": null}", "tool_version": null, "type": "data_input", "uuid": "7b604b7b-b700-4472-b574-8ec55dc59ba2", "when": null, 
"workflow_outputs": []}, "2": {"annotation": "This step uses Regular Expressions to delete all empty spaces (\\s) and show only one character per line (\\1\\n).\n\nThe result is a cleaned and reformatted text showing only one character per line.", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.3+galaxy1", "errors": null, "id": 2, "input_connections": {"infile": {"id": 0, "output_name": "output"}}, "inputs": [], "label": "Preprocessing of Text one", "name": "Replace Text", "outputs": [{"name": "outfile", "type": "input"}], "position": {"left": 303.51666259765625, "top": 166.8018061034782}, "post_job_actions": {"RenameDatasetActionoutfile": {"action_arguments": {"newname": "Preprocessed Text one"}, "action_type": "RenameDatasetAction", "output_name": "outfile"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.3+galaxy1", "tool_shed_repository": {"changeset_revision": "86755160afbf", "name": "text_processing", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"infile\": {\"__class__\": \"ConnectedValue\"}, \"replacements\": [{\"__index__\": 0, \"find_pattern\": \"\\\\r\", \"replace_pattern\": null}, {\"__index__\": 1, \"find_pattern\": \"\\\\n\", \"replace_pattern\": null}, {\"__index__\": 2, \"find_pattern\": \"\\\\s\", \"replace_pattern\": \"\"}, {\"__index__\": 3, \"find_pattern\": \"(.)\", \"replace_pattern\": \"\\\\1\\\\n\"}], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "9.3+galaxy1", "type": "tool", "uuid": "7a765a1c-be84-45ad-838e-58f9d703512c", "when": null, "workflow_outputs": []}, "3": {"annotation": "This step uses Regular Expressions to delete all empty spaces (\\s) and show only one character per line (\\1\\n).\n\nThe result is a cleaned and reformatted text showing only one character per line.", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.3+galaxy1", "errors": null, "id": 
3, "input_connections": {"infile": {"id": 1, "output_name": "output"}}, "inputs": [], "label": "Preprocessing of Text two", "name": "Replace Text", "outputs": [{"name": "outfile", "type": "input"}], "position": {"left": 302.25, "top": 363.9517999999626}, "post_job_actions": {"RenameDatasetActionoutfile": {"action_arguments": {"newname": "Preprocessed Text two"}, "action_type": "RenameDatasetAction", "output_name": "outfile"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.3+galaxy1", "tool_shed_repository": {"changeset_revision": "86755160afbf", "name": "text_processing", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"infile\": {\"__class__\": \"ConnectedValue\"}, \"replacements\": [{\"__index__\": 0, \"find_pattern\": \"\\\\r\", \"replace_pattern\": null}, {\"__index__\": 1, \"find_pattern\": \"\\\\n\", \"replace_pattern\": null}, {\"__index__\": 2, \"find_pattern\": \"\\\\s\", \"replace_pattern\": \"\"}, {\"__index__\": 3, \"find_pattern\": \"(.)\", \"replace_pattern\": \"\\\\1\\\\n\"}], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "9.3+galaxy1", "type": "tool", "uuid": "85a1f86d-ff89-4cd7-b14d-0b4d113ff761", "when": null, "workflow_outputs": []}, "4": {"annotation": "The diff tool compares the two cleaned texts. This version of the output (raw output) is used for the further steps of the analysis. It is less intuitive for users. 
Therefore, the second diff below includes a more visual version of the output (HTML).", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy0", "errors": null, "id": 4, "input_connections": {"input1": {"id": 2, "output_name": "outfile"}, "input2": {"id": 3, "output_name": "outfile"}}, "inputs": [], "label": "Comparison with diff - computer version", "name": "diff", "outputs": [{"name": "diff_file", "type": "txt"}], "position": {"left": 724.0499877929688, "top": 126.28511298824382}, "post_job_actions": {"ChangeDatatypeActiondiff_file": {"action_arguments": {"newtype": "tabular"}, "action_type": "ChangeDatatypeAction", "output_name": "diff_file"}, "RenameDatasetActiondiff_file": {"action_arguments": {"newname": "Comparison - computer version"}, "action_type": "RenameDatasetAction", "output_name": "diff_file"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy0", "tool_shed_repository": {"changeset_revision": "10ef1bf99074", "name": "diff", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"input1\": {\"__class__\": \"ConnectedValue\"}, \"input2\": {\"__class__\": \"ConnectedValue\"}, \"report_format\": {\"report_format_select\": \"txt_columns\", \"__current_case__\": 1}, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "3.10+galaxy0", "type": "tool", "uuid": "e3a3ad39-15bd-40c3-8226-a6f35a154d92", "when": null, "workflow_outputs": []}, "5": {"annotation": "The diff tool compares the two cleaned texts. 
This version (HTML version) creates an HTML file, which colour codes differences as additions (green) or extractions (red) when comparing the texts.", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy0", "errors": null, "id": 5, "input_connections": {"input1": {"id": 2, "output_name": "outfile"}, "input2": {"id": 3, "output_name": "outfile"}}, "inputs": [], "label": "Comparison with diff - user version", "name": "diff", "outputs": [{"name": "diff_file", "type": "txt"}, {"name": "html_file", "type": "html"}], "position": {"left": 725.2667846679688, "top": 355.64996337890625}, "post_job_actions": {"HideDatasetActiondiff_file": {"action_arguments": {}, "action_type": "HideDatasetAction", "output_name": "diff_file"}, "RenameDatasetActionhtml_file": {"action_arguments": {"newname": "Comparison - HTML version"}, "action_type": "RenameDatasetAction", "output_name": "html_file"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy0", "tool_shed_repository": {"changeset_revision": "10ef1bf99074", "name": "diff", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"input1\": {\"__class__\": \"ConnectedValue\"}, \"input2\": {\"__class__\": \"ConnectedValue\"}, \"report_format\": {\"report_format_select\": \"html\", \"__current_case__\": 2, \"unified\": \"3\", \"output_format\": \"side-by-side\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "3.10+galaxy0", "type": "tool", "uuid": "ef5da48f-601e-4108-8740-4778f83b5426", "when": null, "workflow_outputs": []}, "6": {"annotation": "This step selects all lines from the diff file that contain the censorship symbol \u00d7.\nThe condition \"ord(c1) == 215\" means that lines in column c1, which contain the censored text, are selected if they match \u00d7. The symbol \u00d7 is unspecific, therefore, the Unicode identifier of the character (215) is used for clarity in this condition.\n\nThis step does not show an output. 
If the filtered symbol is empty in the second text, this file lacks columns to compute the following steps. This is invisible for users but would cause a technical error. The compute step covers this and makes sure all necessary columns exist. It shows the output for both steps (Extracting and Compute) correctly.\n\nAdd another Unicode here if you want to select a different character, for example, '\u25a1' or '\u25b3'.\nYou can get the respective code, for example, on this website:\nhttps://www.mauvecloud.net/charsets/CharCodeFinder.html\nCopy the character you want to filter in the \"input\" window and select \"Decimal Character Codes\" as an output. If you do this for symbol \u00d7, you get 215.", "content_id": "Filter1", "errors": null, "id": 6, "input_connections": {"input": {"id": 4, "output_name": "diff_file"}}, "inputs": [], "label": "Extracting only censored passages", "name": "Filter", "outputs": [{"name": "out_file1", "type": "input"}], "position": {"left": 1022.4000244140625, "top": 286.73333740234375}, "post_job_actions": {"HideDatasetActionout_file1": {"action_arguments": {}, "action_type": "HideDatasetAction", "output_name": "out_file1"}}, "tool_id": "Filter1", "tool_state": "{\"__input_ext\": \"tabular\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"cond\": \"ord(c1) == 215\", \"header_lines\": \"0\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "1.1.1", "type": "tool", "uuid": "46d04c11-f426-456d-9d97-7882a49ee20c", "when": null, "workflow_outputs": []}, "7": {"annotation": "This step unifies the formatting and adds potentially missing columns, should lines extracted before coming up empty in the second text. 
This ensures the proper number of columns and allows the smooth running of the next steps.", "content_id": "toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1", "errors": null, "id": 7, "input_connections": {"input": {"id": 6, "output_name": "out_file1"}}, "inputs": [], "label": null, "name": "Compute", "outputs": [{"name": "out_file1", "type": "input"}], "position": {"left": 1280.7999877929688, "top": 288.5184503905876}, "post_job_actions": {"RenameDatasetActionout_file1": {"action_arguments": {"newname": "Censored lines"}, "action_type": "RenameDatasetAction", "output_name": "out_file1"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1", "tool_shed_repository": {"changeset_revision": "aff5135563c6", "name": "column_maker", "owner": "devteam", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"__input_ext\": \"input\", \"avoid_scientific_notation\": false, \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"error_handling\": {\"auto_col_types\": true, \"fail_on_non_existent_columns\": false, \"non_computable\": {\"action\": \"--non-computable-blank\", \"__current_case__\": 3}}, \"input\": {\"__class__\": \"ConnectedValue\"}, \"ops\": {\"header_lines_select\": \"no\", \"__current_case__\": 0, \"expressions\": [{\"__index__\": 0, \"cond\": \"c9\", \"add_column\": {\"mode\": \"R\", \"__current_case__\": 2, \"pos\": \"9\"}}]}, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "2.1", "type": "tool", "uuid": "51d55250-60d0-4bd1-9f8c-0f45815befe0", "when": null, "workflow_outputs": []}, "8": {"annotation": "This step selects only column 9, which contains the uncensored characters from text two. The result is only one column with different rows of Chinese characters. \n\nThis step allows scaling words by frequency the word cloud in the next step. 
meaning characters that appear more often appear bigger, making the results evident at first sight.", "content_id": "Cut1", "errors": null, "id": 8, "input_connections": {"input": {"id": 7, "output_name": "out_file1"}}, "inputs": [], "label": null, "name": "Cut", "outputs": [{"name": "out_file1", "type": "tabular"}], "position": {"left": 1290.2000122070312, "top": 658.2000122070312}, "post_job_actions": {"RenameDatasetActionout_file1": {"action_arguments": {"newname": "Uncensored characters"}, "action_type": "RenameDatasetAction", "output_name": "out_file1"}}, "tool_id": "Cut1", "tool_state": "{\"columnList\": \"c9\", \"delimiter\": \"T\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "1.0.2", "type": "tool", "uuid": "96a0b94b-6bb4-40ff-ae31-8db437d7026d", "when": null, "workflow_outputs": []}, "9": {"annotation": "", "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0", "errors": null, "id": 9, "input_connections": {"in_file": {"id": 7, "output_name": "out_file1"}}, "inputs": [], "label": null, "name": "Datamash", "outputs": [{"name": "out_file", "type": "input"}], "position": {"left": 1511.7929886427692, "top": 288.52178870118314}, "post_job_actions": {"RenameDatasetActionout_file": {"action_arguments": {"newname": "Quantified Results"}, "action_type": "RenameDatasetAction", "output_name": "out_file"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0", "tool_shed_repository": {"changeset_revision": "4c07ddedc198", "name": "datamash_ops", "owner": "iuc", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"grouping\": \"9\", \"header_in\": false, \"header_out\": false, \"ignore_case\": false, \"in_file\": {\"__class__\": \"ConnectedValue\"}, \"narm\": false, \"need_sort\": true, \"operations\": [{\"__index__\": 0, \"op_name\": 
\"count\", \"op_column\": \"9\"}], \"print_full_line\": false, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "1.8+galaxy0", "type": "tool", "uuid": "e9c7b1a1-db93-4991-9b48-e9b3bda75edc", "when": null, "workflow_outputs": []}, "10": {"annotation": "This step shows, which characters were censored in the first text. The bigger the word, the more often it appeared in the text.", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy0", "errors": null, "id": 10, "input_connections": {"text": {"id": 8, "output_name": "out_file1"}}, "inputs": [{"description": "runtime parameter for tool Generate a word cloud", "name": "fontfile"}, {"description": "runtime parameter for tool Generate a word cloud", "name": "mask"}, {"description": "runtime parameter for tool Generate a word cloud", "name": "stopwords"}], "label": null, "name": "Generate a word cloud", "outputs": [{"name": "output", "type": "png"}], "position": {"left": 1635.5499877929688, "top": 690.0499877929688}, "post_job_actions": {"RenameDatasetActionoutput": {"action_arguments": {"newname": "Wordcloud of censored characters"}, "action_type": "RenameDatasetAction", "output_name": "output"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy0", "tool_shed_repository": {"changeset_revision": "54c2d8ebf0cf", "name": "wordcloud", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"background\": \"#000000\", \"color_choice\": {\"color_option\": \"color\", \"__current_case__\": 0, \"color\": \"#00ff00\"}, \"colormap\": null, \"contour_color\": \"#000000\", \"contour_width\": \"0.0\", \"font_step\": \"1\", \"fontfile\": {\"__class__\": \"RuntimeValue\"}, \"height\": \"200\", \"include_numbers\": false, \"margin\": \"2\", \"mask\": {\"__class__\": \"RuntimeValue\"}, \"max_font_size\": null, \"max_words\": \"200\", \"min_font_size\": \"8\", \"min_word_length\": \"0\", \"mode\": null, \"no_collocations\": 
false, \"no_normalize_plurals\": false, \"prefer_horizontal\": \"1.0\", \"random_state\": \"10\", \"relative_scaling\": \"0.9\", \"repeat\": false, \"scale\": \"1.0\", \"stopwords\": {\"__class__\": \"RuntimeValue\"}, \"text\": {\"__class__\": \"ConnectedValue\"}, \"width\": \"400\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "1.9.4+galaxy0", "type": "tool", "uuid": "da09e429-e183-4a05-abe2-ef64ade25fce", "when": null, "workflow_outputs": [{"label": "output_graphic", "output_name": "output", "uuid": "0e468eb9-b68c-4840-b5e7-12cbf29e0c9c"}]}, "11": {"annotation": "Sorts the quantified results from those appearing most to those appearing least ", "content_id": "sort1", "errors": null, "id": 11, "input_connections": {"input": {"id": 9, "output_name": "out_file"}}, "inputs": [], "label": null, "name": "Sort", "outputs": [{"name": "out_file1", "type": "input"}], "position": {"left": 1741.3931107130816, "top": 286.9218039599722}, "post_job_actions": {"RenameDatasetActionout_file1": {"action_arguments": {"newname": "Sorted Results"}, "action_type": "RenameDatasetAction", "output_name": "out_file1"}}, "tool_id": "sort1", "tool_state": "{\"__input_ext\": \"tabular\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"column\": \"2\", \"column_set\": [], \"header_lines\": \"0\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"order\": \"DESC\", \"style\": \"num\", \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "1.2.0", "type": "tool", "uuid": "450d11ba-80d4-4367-ac2a-56a18ffeb729", "when": null, "workflow_outputs": [{"label": "output_csv", "output_name": "out_file1", "uuid": "9dfd8862-e7d7-46ae-bbe2-3e333a017764"}]}}, "tags": ["Humanities", "comparison", "text", "diff", "published"], "uuid": "55cf86fe-d044-4ff1-9891-10d89f29b7dd", "version": 2}
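
For readers of this diff, the two "Replace Text" steps and the `ord(c1) == 215` filter condition can be approximated outside Galaxy. The following is a minimal Python sketch, not the tool's implementation: it works on the whole string at once rather than line by line, and the names `one_char_per_line` and `code_points` are illustrative only.

```python
import re


def one_char_per_line(text: str) -> str:
    """Approximate the workflow's Replace Text steps (hypothetical helper)."""
    # Drop every whitespace character, including \r and \n.
    text = re.sub(r"\s", "", text)
    # Put each remaining character on its own line, like (.) -> \1\n.
    return re.sub(r"(.)", r"\1\n", text)


# Decimal code points that could be used in a condition such as ord(c1) == 215.
code_points = {ch: ord(ch) for ch in "×□△"}
```

Calling `one_char_per_line` on a short string yields one character per line, and `code_points["×"]` gives the value 215 used in the filter step, so the website lookup mentioned in the annotation can also be done with Python's built-in `ord`.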


🚫 [rdjsonl] <GTN:027> _{reported by reviewdog 🐶}
This workflow is missing a test, which is now mandatory. Please see the FAQ on how to add tests to your workflows.
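
Whatever form the workflow test takes, the expected content of the sorted counts can be derived independently for a small input pair. The sketch below is an assumption-laden stand-in: it uses Python's `difflib.SequenceMatcher` instead of the Galaxy diff tool, and the function name `censored_counts` is hypothetical. It pairs each `×` with the character at the same position in the uncensored text, counts those characters, and sorts them by descending frequency.

```python
from collections import Counter
from difflib import SequenceMatcher

CENSOR_MARK = chr(215)  # the symbol × used as replacement character


def censored_counts(censored: str, uncensored: str) -> list[tuple[str, int]]:
    """Count the uncensored characters hidden behind × (illustrative only)."""
    # One character per item, whitespace removed, as in the preprocessing steps.
    a = [ch for ch in censored if not ch.isspace()]
    b = [ch for ch in uncensored if not ch.isspace()]
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only changed stretches can contain censored characters
        for left, right in zip(a[i1:i2], b[j1:j2]):
            if left == CENSOR_MARK:
                counts[right] += 1
    # Most frequent characters first, like the final Sort step.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

For a toy pair such as `"××军，×人"` and `"日本军，日人"`, this returns the characters `日` and `本` with their counts, which could serve as a hand-checked expectation when writing the test.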

github-actions · 2025-02-28T12:08:25Z

topics/statistics/tutorials/text_mining_chinese/tutorial.bib

+# You can remove the examples below
+
+
+@book{Ng2022,


🚫 [rdjsonl] <GTN:012> _{reported by reviewdog 🐶}
Missing both a DOI and a URL. Please add one of the two.

github-actions · 2025-02-28T12:08:25Z

topics/statistics/tutorials/text_mining_chinese/tutorial.bib

+ series = {Law in Context}
+}
+
+@thesis{Schneider2024,


🚫 [rdjsonl] <GTN:012> _{reported by reviewdog 🐶}
Missing both a DOI and a URL. Please add one of the two.

github-actions · 2025-02-28T12:08:25Z

topics/statistics/tutorials/text_mining_chinese/tutorial.bib

+ location = {Freiburg}
+}
+
+@article{Anon..16.10.1938_5598,


🚫 [rdjsonl] <GTN:012> _{reported by reviewdog 🐶}
Missing both a DOI and a URL. Please add one of the two.

github-actions · 2025-02-28T12:08:25Z

topics/statistics/tutorials/text_mining_chinese/tutorial.bib

+ userd = {5598}
+}
+
+@article{TKP.16.10.1938_18864,


🚫 [rdjsonl] <GTN:012> _{reported by reviewdog 🐶}
Missing both a DOI and a URL. Please add one of the two.

shiltemann · 2025-02-28T12:15:23Z

OK, this should appear online in about 15 minutes or so at https://training.galaxyproject.org/topics/statistics/tutorials/text_mining_chinese/tutorial.html

it is in draft mode for now, but not much would be needed to take it out of draft mode:

- add DOIs/URLs for the citations
- review the online version, @Sch-Da, and check that you are happy with how it looks
- add a workflow test (but this can also be added after publication as far as I'm concerned)

Sch-Da and others added 12 commits February 28, 2025 11:19

add tutorial from @Sch-Da

581b8fc

remove weird characters at start of file that was preventing rendering

c4d00ed

move images to image folder

aa3f5e9

fix citations

2949518

formatting

cc63700

fix citation

b14487d

remove unmatching figure caption

ab37e8a

fix broken box

068c4e7

reword autogenerated headings

41544cc

add regular expression faq

d12e566

set as draft

0fc6808

remove data library file, the GTN will create it for us after merging

41c83ce

shiltemann requested a review from a team as a code owner February 28, 2025 11:49

github-actions bot added the statistics label Feb 28, 2025

github-actions bot reviewed Feb 28, 2025


shiltemann merged commit eacc45a into main Feb 28, 2025
2 of 3 checks passed

shiltemann deleted the text-mining-chinese branch February 28, 2025 12:16

meaning characters that appear more often appear bigger, making the results evident at first sight.", "content_id": "Cut1", "errors": null, "id": 8, "input_connections": {"input": {"id": 7, "output_name": "out_file1"}}, "inputs": [], "label": null, "name": "Cut", "outputs": [{"name": "out_file1", "type": "tabular"}], "position": {"left": 1290.2000122070312, "top": 658.2000122070312}, "post_job_actions": {"RenameDatasetActionout_file1": {"action_arguments": {"newname": "Uncensored characters"}, "action_type": "RenameDatasetAction", "output_name": "out_file1"}}, "tool_id": "Cut1", "tool_state": "{\"columnList\": \"c9\", \"delimiter\": \"T\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "1.0.2", "type": "tool", "uuid": "96a0b94b-6bb4-40ff-ae31-8db437d7026d", "when": null, "workflow_outputs": []}, "9": {"annotation": "", "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0", "errors": null, "id": 9, "input_connections": {"in_file": {"id": 7, "output_name": "out_file1"}}, "inputs": [], "label": null, "name": "Datamash", "outputs": [{"name": "out_file", "type": "input"}], "position": {"left": 1511.7929886427692, "top": 288.52178870118314}, "post_job_actions": {"RenameDatasetActionout_file": {"action_arguments": {"newname": "Quantified Results"}, "action_type": "RenameDatasetAction", "output_name": "out_file"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0", "tool_shed_repository": {"changeset_revision": "4c07ddedc198", "name": "datamash_ops", "owner": "iuc", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"grouping\": \"9\", \"header_in\": false, \"header_out\": false, \"ignore_case\": false, \"in_file\": {\"__class__\": \"ConnectedValue\"}, \"narm\": false, \"need_sort\": true, \"operations\": [{\"__index__\": 0, \"op_name\": 
\"count\", \"op_column\": \"9\"}], \"print_full_line\": false, \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "1.8+galaxy0", "type": "tool", "uuid": "e9c7b1a1-db93-4991-9b48-e9b3bda75edc", "when": null, "workflow_outputs": []}, "10": {"annotation": "This step shows, which characters were censored in the first text. The bigger the word, the more often it appeared in the text.", "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy0", "errors": null, "id": 10, "input_connections": {"text": {"id": 8, "output_name": "out_file1"}}, "inputs": [{"description": "runtime parameter for tool Generate a word cloud", "name": "fontfile"}, {"description": "runtime parameter for tool Generate a word cloud", "name": "mask"}, {"description": "runtime parameter for tool Generate a word cloud", "name": "stopwords"}], "label": null, "name": "Generate a word cloud", "outputs": [{"name": "output", "type": "png"}], "position": {"left": 1635.5499877929688, "top": 690.0499877929688}, "post_job_actions": {"RenameDatasetActionoutput": {"action_arguments": {"newname": "Wordcloud of censored characters"}, "action_type": "RenameDatasetAction", "output_name": "output"}}, "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy0", "tool_shed_repository": {"changeset_revision": "54c2d8ebf0cf", "name": "wordcloud", "owner": "bgruening", "tool_shed": "toolshed.g2.bx.psu.edu"}, "tool_state": "{\"background\": \"#000000\", \"color_choice\": {\"color_option\": \"color\", \"__current_case__\": 0, \"color\": \"#00ff00\"}, \"colormap\": null, \"contour_color\": \"#000000\", \"contour_width\": \"0.0\", \"font_step\": \"1\", \"fontfile\": {\"__class__\": \"RuntimeValue\"}, \"height\": \"200\", \"include_numbers\": false, \"margin\": \"2\", \"mask\": {\"__class__\": \"RuntimeValue\"}, \"max_font_size\": null, \"max_words\": \"200\", \"min_font_size\": \"8\", \"min_word_length\": \"0\", \"mode\": null, \"no_collocations\": 
false, \"no_normalize_plurals\": false, \"prefer_horizontal\": \"1.0\", \"random_state\": \"10\", \"relative_scaling\": \"0.9\", \"repeat\": false, \"scale\": \"1.0\", \"stopwords\": {\"__class__\": \"RuntimeValue\"}, \"text\": {\"__class__\": \"ConnectedValue\"}, \"width\": \"400\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}", "tool_version": "1.9.4+galaxy0", "type": "tool", "uuid": "da09e429-e183-4a05-abe2-ef64ade25fce", "when": null, "workflow_outputs": [{"label": "output_graphic", "output_name": "output", "uuid": "0e468eb9-b68c-4840-b5e7-12cbf29e0c9c"}]}, "11": {"annotation": "Sorts the quantified results from those appearing most to those appearing least ", "content_id": "sort1", "errors": null, "id": 11, "input_connections": {"input": {"id": 9, "output_name": "out_file"}}, "inputs": [], "label": null, "name": "Sort", "outputs": [{"name": "out_file1", "type": "input"}], "position": {"left": 1741.3931107130816, "top": 286.9218039599722}, "post_job_actions": {"RenameDatasetActionout_file1": {"action_arguments": {"newname": "Sorted Results"}, "action_type": "RenameDatasetAction", "output_name": "out_file1"}}, "tool_id": "sort1", "tool_state": "{\"__input_ext\": \"tabular\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"column\": \"2\", \"column_set\": [], \"header_lines\": \"0\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"order\": \"DESC\", \"style\": \"num\", \"__page__\": null, \"__rerun_remap_job_id__\": null}", "tool_version": "1.2.0", "type": "tool", "uuid": "450d11ba-80d4-4367-ac2a-56a18ffeb729", "when": null, "workflow_outputs": [{"label": "output_csv", "output_name": "out_file1", "uuid": "9dfd8862-e7d7-46ae-bbe2-3e333a017764"}]}}, "tags": ["Humanities", "comparison", "text", "diff", "published"], "uuid": "55cf86fe-d044-4ff1-9891-10d89f29b7dd", "version": 2}
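
The step 6 annotation explains the filter condition `ord(c1) == 215`, and steps 9 and 11 count the uncensored characters and sort them by frequency. The sketch below illustrates that logic in plain Python; it is not the workflow's implementation. It assumes each row has already been reduced to a pair (censored cell, uncensored cell), whereas the real diff output carries the uncensored text in column 9. The helper names and the sample rows are hypothetical.

```python
from collections import Counter

# Decimal code point of the censorship symbol "×" (U+00D7), as used in step 6.
CENSOR_CODE = 215


def is_censored(cell: str) -> bool:
    """Mimic the condition `ord(c1) == 215` for a single-character cell.

    ord() only accepts one character, so longer or empty cells are rejected first.
    """
    return len(cell) == 1 and ord(cell) == CENSOR_CODE


def count_uncensored(rows):
    """Count the uncensored counterparts of censored cells, most frequent first."""
    counts = Counter(
        uncensored for censored, uncensored in rows if is_censored(censored)
    )
    # Sort by count in descending order, like the numeric DESC sort in step 11.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data: (character in censored text, character in uncensored text).
rows = [("×", "山"), ("×", "水"), ("人", "人"), ("×", "山")]

for character, count in count_uncensored(rows):
    print(character, count)
```

To filter on a different replacement symbol, such as '□' or '△', look up its decimal code with `ord()` and change `CENSOR_CODE` accordingly.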

shiltemann commented Feb 28, 2025

github-actions bot Feb 28, 2025
