update prerender to include a content manifest csv output #2268

Merged
merged 69 commits into from
Aug 9, 2024
2411317
update prerender to include a content manifest csv output
TomWoodward Jul 1, 2024
e272313
add toc node types
TomWoodward Jul 1, 2024
a97648c
Add language and book slug
TomWoodward Jul 1, 2024
820f5ba
:shirt:
TomWoodward Jul 2, 2024
6c9fc52
remove debug
TomWoodward Jul 2, 2024
647bf66
:shirt:
TomWoodward Jul 2, 2024
c2ed25d
Merge branch 'main' into content-manifest
staxly[bot] Jul 2, 2024
0bf7852
:pencil:
TomWoodward Jul 3, 2024
a17c394
Merge branch 'main' into content-manifest
staxly[bot] Jul 8, 2024
49e3edb
Merge branch 'main' into content-manifest
staxly[bot] Jul 10, 2024
4d7bd20
Merge branch 'main' into content-manifest
staxly[bot] Jul 10, 2024
5239c74
Merge branch 'main' into content-manifest
staxly[bot] Jul 10, 2024
ea47213
Merge branch 'main' into content-manifest
staxly[bot] Jul 16, 2024
5b7428a
Merge branch 'main' into content-manifest
staxly[bot] Jul 16, 2024
f358a63
Merge branch 'main' into content-manifest
staxly[bot] Jul 18, 2024
75a765d
Merge branch 'main' into content-manifest
staxly[bot] Jul 18, 2024
1bcb5cf
Merge branch 'main' into content-manifest
staxly[bot] Jul 18, 2024
c501953
Merge branch 'main' into content-manifest
staxly[bot] Jul 25, 2024
0332826
Merge branch 'main' into content-manifest
staxly[bot] Jul 26, 2024
0b2cb94
Merge branch 'main' into content-manifest
staxly[bot] Jul 26, 2024
22b7cc1
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
e91bf7c
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
7a1eee5
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
975fcc6
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
8b71007
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
a1aa4ab
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
4a02768
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
6bbb730
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
24a6423
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
f8c78fe
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
760eed6
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
6c8c6aa
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
9744545
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
d4fd4db
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
92b85c9
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
af38c32
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
7901fd4
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
6211457
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
2d49d31
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
b087990
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
3b6b4e0
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
535ac31
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
619f502
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
c3e26d5
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
693c49c
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
38ed8ff
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
a24dcc2
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
d47e334
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
4d4614d
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
b136ea3
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
e846e73
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
f11f645
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
14c4356
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
707130a
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
1d8d938
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
036c699
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
2c5eef7
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
706cca3
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
61de4b4
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
12c8787
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
be0f5b6
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
a0165e8
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
e5b8aa5
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
575b733
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
ad7ef44
Merge branch 'main' into content-manifest
staxly[bot] Jul 31, 2024
522838b
Merge branch 'main' into content-manifest
staxly[bot] Aug 8, 2024
1bd512d
Merge branch 'main' into content-manifest
staxly[bot] Aug 8, 2024
9214479
Merge branch 'main' into content-manifest
staxly[bot] Aug 8, 2024
2d69aa4
fix import
TomWoodward Aug 8, 2024
60 changes: 60 additions & 0 deletions script/prerender/contentManifest.ts
@@ -0,0 +1,60 @@
import { BookWithOSWebData, ArchiveTreeNode, ArchiveTree } from '../../src/app/content/types';
import { content } from '../../src/app/content/routes';
import { writeAssetFile } from './fileUtils';
import { stripIdVersion } from '../../src/app/content/utils/idUtils';
import { splitTitleParts } from '../../src/app/content/utils/archiveTreeUtils';

const quoteValue = (value?: string) => value ? `"${value.replace(/"/g, '""')}"` : '""';

export const renderAndSaveContentManifest = async(
saveFile: (path: string, contents: string) => Promise<unknown>,
books: BookWithOSWebData[]
) => {

const rows = books.map(book => getContentsRows(book, book.tree))
.reduce((result, item) => ([...result, ...item]), [] as string[][]);

const manifestText = [
['id', 'title', 'text title', 'language', 'slug', 'url', 'toc type', 'toc target type'],
...rows,
].map(row => row.map(quoteValue).join(',')).join('\n');

await saveFile('/rex/content-metadata.csv', manifestText);
};

function getContentsRows(
book: BookWithOSWebData,
node: ArchiveTree | ArchiveTreeNode,
chapterNumber?: string
): string[][] {
const {title, toc_target_type} = node;
const [titleNumber, titleString] = splitTitleParts(node.title);
const textTitle = `${titleNumber || chapterNumber || ''} ${titleString}`.replace(/\s+/g, ' ').trim();
const id = stripIdVersion(node.id);
const tocType = node.toc_type ?? (id === book.id ? 'book' : '');

const urlParams = tocType === 'book'
? [node.slug, '']
: 'contents' in node
? ['', '']
: [node.slug, content.getUrl({book: {slug: book.slug}, page: {slug: node.slug}})];

const contents = 'contents' in node
? node.contents.map(child => getContentsRows(book, child, titleNumber || chapterNumber))
.reduce((result, item) => ([...result, ...item]), [] as string[][])
: [];

return [
[stripIdVersion(id), title, textTitle, book.language, ...urlParams, tocType, toc_target_type ?? ''],
Member:
I'm surprised the real (html) title gets used, but maybe the textTitle is used as the display value in the reports?

Member Author:
Yeah, I doubt that the HTML will be used by anything, but I threw it in there. The text title is intended to be used by reporting; I added the context number in there for the EOC pages.

...contents,
];
}


// simple helper for local
const writeAssetFileAsync = async(filepath: string, contents: string) => {
return writeAssetFile(filepath, contents);
};
export const renderContentManifest = async(books: BookWithOSWebData[]) => {
return renderAndSaveContentManifest(writeAssetFileAsync, books);
};
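
A minimal sketch of the quoting rule the manifest relies on, for readers checking the CSV output. `quoteValue` is copied from the diff above; `toCsv` and the sample rows are hypothetical and exist only for illustration, they are not part of this PR.

```typescript
// Same rule as quoteValue in contentManifest.ts: wrap every cell in double
// quotes and double any embedded double quote.
const quoteValue = (value?: string): string =>
  value ? `"${value.replace(/"/g, '""')}"` : '""';

// Hypothetical helper (not in the PR): join rows of cells into CSV text.
const toCsv = (rows: string[][]): string =>
  rows.map(row => row.map(quoteValue).join(',')).join('\n');

// Made-up sample data; real rows come from getContentsRows.
const sampleCsv = toCsv([
  ['id', 'title'],
  ['abc', 'He said "hi", then left'],
]);

console.log(sampleCsv);
```

Because every cell is quoted, commas inside titles do not need separate handling; only embedded quotes are escaped.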
30 changes: 10 additions & 20 deletions script/prerender/fleet.ts
@@ -43,11 +43,13 @@ import { getBooksConfigSync } from '../../src/gateways/createBookConfigLoader';
import createOSWebLoader from '../../src/gateways/createOSWebLoader';
import { readFile } from '../../src/helpers/fileUtils';
import { globalMinuteCounter, prepareBookPages } from './contentPages';
import { SerializedBookMatch, SerializedPageMatch } from './contentRoutes';
import { SerializedPageMatch } from './contentRoutes';
import createRedirects from './createRedirects';
import './logUnhandledRejectionsAndExit';
import renderManifest from './renderManifest';
import { SitemapPayload } from './sitemap';
import { SitemapPayload, renderAndSaveSitemapIndex } from './sitemap';
import { writeS3ReleaseXmlFile } from './fileUtils';
import { renderAndSaveContentManifest } from './contentManifest';

const {
ARCHIVE_URL,
@@ -86,7 +88,6 @@ const sqsClient = new SQSClient({ region: WORK_REGION });

type PageTask = { payload: SerializedPageMatch, type: 'page' };
type SitemapTask = { payload: SitemapPayload, type: 'sitemap' };
type SitemapIndexTask = { payload: SerializedBookMatch[], type: 'sitemapIndex' };

const booksConfig = getBooksConfigSync();
const archiveLoader = createArchiveLoader({
@@ -288,8 +289,7 @@ async function getQueueUrls(workersStackName: string) {
class Stats {
public pages = 0;
public sitemaps = 0;
public sitemapIndexes = 0;
get total() { return this.pages + this.sitemaps + this.sitemapIndexes; }
get total() { return this.pages + this.sitemaps; }
}

function makePrepareAndQueueBook(workQueueUrl: string, stats: Stats) {
@@ -347,11 +347,7 @@ function makePrepareAndQueueBook(workQueueUrl: string, stats: Stats) {

console.log(`[${book.title}] Sitemap queued`);

// Used in the sitemap index
return {
params: { book: { slug: book.slug } },
state: { bookUid: book.id, bookVersion: book.version },
};
return book;
};
}

@@ -371,14 +367,8 @@ async function queueWork(workQueueUrl: string) {
`All ${stats.pages} page prerendering jobs and all ${stats.sitemaps} sitemap jobs queued`
);

await sendWithRetries(sqsClient, new SendMessageCommand({
MessageBody: JSON.stringify({ payload: books, type: 'sitemapIndex' } as SitemapIndexTask),
QueueUrl: workQueueUrl,
}));

stats.sitemapIndexes = 1;

console.log('1 sitemap index job queued');
await renderAndSaveSitemapIndex(writeS3ReleaseXmlFile, books);
await renderAndSaveContentManifest(writeS3ReleaseXmlFile, books);

return stats;
}
@@ -463,8 +453,8 @@ async function finishRendering(stats: Stats) {
const elapsedMinutes = globalMinuteCounter();

console.log(
`Prerender complete in ${elapsedMinutes} minutes. Rendered ${stats.pages} pages, ${
stats.sitemaps} sitemaps and ${stats.sitemapIndexes} sitemap index. ${
`Prerender complete in ${elapsedMinutes} minutes. Rendered ${stats.pages} pages, and ${
stats.sitemaps} sitemaps. ${
stats.total / elapsedMinutes}ppm`
);
}
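
Both render functions in this diff take a `saveFile` callback, so the same code can write to S3 in fleet.ts and to disk in local.ts. The sketch below shows that pattern with an in-memory writer, assuming the caller wants both writes to finish before continuing. `makeMemorySaver`, `saveIndex`, `saveManifest` and `writeReleaseFiles` are hypothetical stand-ins, not functions from the repository.

```typescript
type SaveFile = (path: string, contents: string) => Promise<unknown>;

// Hypothetical in-memory writer, useful for exercising the callback pattern.
const makeMemorySaver = () => {
  const files = new Map<string, string>();
  const saveFile: SaveFile = async (path, contents) => {
    files.set(path, contents);
  };
  return { files, saveFile };
};

// Hypothetical stand-ins for renderAndSaveSitemapIndex and
// renderAndSaveContentManifest; the file contents are placeholders.
const saveIndex = async (saveFile: SaveFile) =>
  saveFile('/rex/sitemaps/index.xml', '<sitemapindex/>');
const saveManifest = async (saveFile: SaveFile) =>
  saveFile('/rex/content-metadata.csv', '"id"');

// Wait for both writes so that a failure rejects the returned promise.
const writeReleaseFiles = async (saveFile: SaveFile): Promise<void> => {
  await Promise.all([saveIndex(saveFile), saveManifest(saveFile)]);
};
```

Passing the writer in keeps the rendering logic independent of where the files end up.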
4 changes: 3 additions & 1 deletion script/prerender/local.ts
@@ -25,6 +25,7 @@ import { createDiskCache } from './fileUtils';
import renderManifest from './renderManifest';
import { renderSitemap, renderSitemapIndex } from './sitemap';
import userLoader from './stubbedUserLoader';
import { renderContentManifest } from './contentManifest';

const {
REACT_APP_HIGHLIGHTS_URL,
@@ -81,7 +82,8 @@ async function render() {
await renderSitemap(book.slug, sitemap);
}

await renderSitemapIndex();
await renderSitemapIndex(books);
await renderContentManifest(books);
await renderManifest();
await createRedirects(archiveLoader, osWebLoader);

35 changes: 9 additions & 26 deletions script/prerender/sitemap.ts
@@ -1,12 +1,8 @@
import filter from 'lodash/fp/filter';
import flow from 'lodash/fp/flow';
import get from 'lodash/fp/get';
import identity from 'lodash/fp/identity';
import map from 'lodash/fp/map';
import max from 'lodash/fp/max';
import sitemap, { SitemapItemOptions } from 'sitemap';
import { SerializedPageMatch } from './contentRoutes';
import { writeAssetFile } from './fileUtils';
import { BookWithOSWebData } from '../../src/app/content/types';
import { getSitemapItemOptions } from './contentPages';

export const sitemapPath = (pathName: string) => `/rex/sitemaps/${pathName}.xml`;

@@ -28,40 +24,27 @@ export const renderAndSaveSitemap = async(

export const renderAndSaveSitemapIndex = async(
saveFile: (path: string, contents: string) => Promise<unknown>,
urls: SitemapItemOptions[]
books: BookWithOSWebData[]
) => {
const sitemapIndex = sitemap.buildSitemapIndex({ urls });
const sitemapIndex = sitemap.buildSitemapIndex({urls: books.map(book =>
getSitemapItemOptions(book, `https://openstax.org${sitemapPath(book.slug)}`)
)});

const filePath = sitemapPath('index');

await saveFile(filePath, sitemapIndex.toString());

return filePath;
};

// renderSitemap() and renderSitemapIndex() are used only by single-instance prerender code

// Multi-instance code cannot store an array of sitemaps in memory and then use it across instances
const sitemaps: SitemapItemOptions[] = [];

const writeAssetFileAsync = async(filepath: string, contents: string) => {
return writeAssetFile(filepath, contents);
};

export const renderSitemap = async(filename: string, urls: SitemapItemOptions[]) => {
const lastmod = flow(
map<SitemapItemOptions, (string | undefined)>(get('lastmod')),
filter<string | undefined>(identity),
max
)(urls);

const filePath = await renderAndSaveSitemap(writeAssetFileAsync, filename, urls);

const url = `https://openstax.org${filePath}`;

sitemaps.push({url, lastmod});
await renderAndSaveSitemap(writeAssetFileAsync, filename, urls);
};

export const renderSitemapIndex = async() => {
return renderAndSaveSitemapIndex(writeAssetFileAsync, sitemaps);
export const renderSitemapIndex = async(books: BookWithOSWebData[]) => {
return renderAndSaveSitemapIndex(writeAssetFileAsync, books);
};
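
With this change the sitemap index is derived directly from the list of books instead of an accumulated `sitemaps` array. A rough sketch of the URL derivation only, assuming nothing beyond what the diff shows: `sitemapPath` is copied from sitemap.ts, while `BookLike` and `sitemapUrlsForBooks` are hypothetical and simplify away `getSitemapItemOptions`.

```typescript
// Copied from the diff: maps a path name to its location under /rex/sitemaps.
const sitemapPath = (pathName: string) => `/rex/sitemaps/${pathName}.xml`;

// Hypothetical minimal book shape; the real code uses BookWithOSWebData.
interface BookLike {
  slug: string;
}

// Hypothetical helper: one absolute sitemap URL per book, in input order.
const sitemapUrlsForBooks = (books: BookLike[]): string[] =>
  books.map(book => `https://openstax.org${sitemapPath(book.slug)}`);

// Made-up slugs for illustration.
console.log(sitemapUrlsForBooks([{ slug: 'physics' }, { slug: 'biology-2e' }]));
```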
23 changes: 1 addition & 22 deletions script/prerender/thread.ts
@@ -24,17 +24,13 @@ import createImageCDNUtils from '../../src/gateways/createImageCDNUtils';
import { getSitemapItemOptions, renderAndSavePage } from './contentPages';
import {
deserializePageMatch,
getArchiveBook,
getArchivePage,
SerializedBookMatch,
SerializedPageMatch,
} from './contentRoutes';
import { writeS3ReleaseHtmlFile, writeS3ReleaseXmlFile } from './fileUtils';
import './logUnhandledRejectionsAndExit';
import {
renderAndSaveSitemap,
renderAndSaveSitemapIndex,
sitemapPath,
SitemapPayload,
} from './sitemap';
import userLoader from './stubbedUserLoader';
@@ -90,24 +86,8 @@ function makeSitemapTask(services: AppOptions['services']) {
};
}

function makeSitemapIndexTask(services: AppOptions['services']) {
return async(payload: SerializedBookMatch[]) => {
const books = payload.map(
(book: SerializedBookMatch, index: number) => assertObject(
book, `Sitemap Index task payload[${index}] is not an object: ${payload}`
)
);
const items = await asyncPool(MAX_CONCURRENT_CONNECTIONS, books, async(book) => {
const archiveBook = await getArchiveBook(services, book);
return getSitemapItemOptions(archiveBook, `https://openstax.org${sitemapPath(book.params.book.slug)}`);
});
return renderAndSaveSitemapIndex(writeS3ReleaseXmlFile, items);
};
}

type AnyTaskFunction = ((payload: SerializedPageMatch) => void) |
((payload: SitemapPayload) => void) |
((payload: SerializedBookMatch[]) => void);
((payload: SitemapPayload) => void);

type TaskFunctionsMap = { [key: string]: AnyTaskFunction | undefined };

@@ -141,7 +121,6 @@ async function makeTaskFunctionsMap() {
return {
page: makePageTask(services),
sitemap: makeSitemapTask(services),
sitemapIndex: makeSitemapIndexTask(services),
} as TaskFunctionsMap;
}

2 changes: 2 additions & 0 deletions src/app/content/types.ts
@@ -112,6 +112,8 @@ export interface ArchiveTreeNode {
id: string;
title: string;
slug: string;
toc_type?: string;
toc_target_type?: string;
}

export type ArchiveTreeSectionType = 'book' | 'unit' | 'chapter' | 'page' | 'eoc-dropdown' | 'eob-dropdown';
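
The new optional `toc_type` field is what the manifest uses to label each row, falling back to `'book'` for the root node. A simplified sketch of that fallback combined with the recursive flattening, assuming a reduced node shape: `TreeNode` and `flattenTocTypes` are hypothetical and omit the slug, URL and title handling done by `getContentsRows`.

```typescript
// Hypothetical reduced node shape; the real types are ArchiveTree and
// ArchiveTreeNode.
interface TreeNode {
  id: string;
  title: string;
  toc_type?: string;
  contents?: TreeNode[];
}

// Depth-first flattening: a node's row comes first, then its descendants.
// The toc type falls back to 'book' for the root and '' for everything else.
const flattenTocTypes = (bookId: string, node: TreeNode): Array<[string, string]> => {
  const tocType = node.toc_type ?? (node.id === bookId ? 'book' : '');
  const children = (node.contents ?? [])
    .map(child => flattenTocTypes(bookId, child))
    .reduce((result, item) => [...result, ...item], [] as Array<[string, string]>);
  return [[node.id, tocType], ...children];
};

// Made-up tree for illustration.
const sampleTree: TreeNode = {
  id: 'b',
  title: 'Book',
  contents: [
    { id: 'c', title: 'Chapter', toc_type: 'chapter', contents: [{ id: 'p', title: 'Page' }] },
  ],
};

console.log(flattenTocTypes('b', sampleTree));
```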